A regular expression generator based on CSS selectors for efficient extraction from HTML pages
Abstract
Cascading Style Sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are short and easy to write. To extract data from a web page with these patterns, an HTML parser must first construct a document object model (DOM) tree for the page. Building this tree and then extracting from it incur time and memory costs that grow with the number of HTML elements and the depth of their hierarchy. Regular expressions can reduce these costs, but preparing regular expression patterns by hand is laborious. This study introduces a heuristic approach, Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors, and analyzes the resulting performance gains on a web crawler. The analysis shows that, compared with extraction from a DOM tree, the generated regular expression patterns significantly reduce the average extraction time from 743.31 ms to 1.03 ms. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to existing frameworks and tools for this task. © TÜBİTAK
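The idea of deriving a regular expression from a CSS selector can be illustrated with a minimal sketch. This is not the REGEXN algorithm, whose heuristics the abstract does not describe. The function name `selector_to_regex` and the sample markup are hypothetical, and only the simple selector forms `tag`, `tag.class`, and `tag#id` are handled, on the assumption that the target element does not contain a nested element of the same tag.

```python
import re

QUOTE = "[\"']"        # an opening or closing attribute quote
NOT_QUOTE = "[^\"']*"  # any run of characters inside a quoted attribute value


def selector_to_regex(selector: str) -> re.Pattern:
    """Translate `tag`, `tag.class`, or `tag#id` into a compiled regex.

    Illustrative only: the pattern captures the text between the opening
    tag and the first matching closing tag, so nested elements of the
    same tag are not handled.
    """
    m = re.fullmatch(r"([a-zA-Z][a-zA-Z0-9]*)(?:([.#])([\w-]+))?", selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, value = m.groups()

    if kind is None:
        attr = ""
    elif kind == "#":
        # The id attribute must equal the value exactly.
        attr = rf"[^>]*(?<![\w-])id\s*=\s*{QUOTE}{re.escape(value)}{QUOTE}"
    else:
        # The class attribute must contain the value as a whole token.
        attr = (
            rf"[^>]*(?<![\w-])class\s*=\s*{QUOTE}{NOT_QUOTE}"
            rf"(?<![\w-]){re.escape(value)}(?![\w-]){NOT_QUOTE}{QUOTE}"
        )

    pattern = rf"<{tag}\b{attr}[^>]*>(.*?)</{tag}>"
    return re.compile(pattern, re.DOTALL | re.IGNORECASE)


# Hypothetical sample page.
sample = (
    '<div id="x"><h1 class="title main">Hello</h1>'
    '<p>one</p><p class="note">two</p></div>'
)

print(selector_to_regex("h1.title").findall(sample))  # ['Hello']
print(selector_to_regex("p").findall(sample))         # ['one', 'two']
print(selector_to_regex("p.note").findall(sample))    # ['two']
```

The pattern is applied directly to the raw HTML string, so no DOM tree is built. That is the source of the savings the abstract reports, at the price of the robustness a real HTML parser provides.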
Volume
28
Issue
6
Related Items
Related items shown by title, author, curator, and subject.
An efficient regular expression inference approach for relevant image extraction
Agün, H.V.; Uzun, Erdinç (Elsevier Ltd, 2023) Traditional approaches for extracting relevant images automatically from web pages are error-prone and time-consuming. To improve this task, operations such as preparing a larger dataset and finding new features are used ...
An effective and efficient web content extractor for optimizing the crawling process
Uzun, Erdinç; Güner, Edip Serdar; Kılıçaslan, Yılmaz; Yerlikaya, Tarık; Agun, Hayri Volkan (John Wiley and Sons Ltd, 2014) Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information ...
A Novel Web Scraping Approach Using the Additional Information Obtained from Web Pages
Uzun, Erdinç (Institute of Electrical and Electronics Engineers Inc., 2020) Web scraping is a process of extracting valuable and interesting text information from web pages. Most of the current studies targeting this task are mostly about automated web data extraction. In the extraction process, ...