A regular expression generator based on CSS selectors for efficient extraction from HTML pages

dc.authorscopusid54783608800
dc.contributor.authorUzun, Erdinç
dc.date.accessioned2022-05-11T14:03:00Z
dc.date.available2022-05-11T14:03:00Z
dc.date.issued2020
dc.departmentFaculties, Çorlu Faculty of Engineering, Department of Computer Engineering
dc.description.abstractCascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are short and easy to write. To extract data from a web page with these patterns, an HTML parser first constructs a document object model (DOM) tree for the page. Building this tree and extracting from it incur time and memory costs that grow with the number of HTML elements and the depth of their hierarchy. Regular expressions can reduce these costs, but preparing regular expression patterns is laborious. This study introduces a heuristic approach, Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors, and analyzes the performance gains on a web crawler. The analysis shows that, compared with extraction from a DOM tree, the generated regular expression patterns significantly reduce the average extraction time from 743.31 ms to 1.03 ms. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to existing frameworks and tools for this task. © TÜBİTAK
dc.identifier.doi10.3906/ELK-2004-67
dc.identifier.endpage3401
dc.identifier.issn1300-0632
dc.identifier.issue6 (en_US)
dc.identifier.scopus2-s2.0-85102510844
dc.identifier.scopusqualityQ3
dc.identifier.startpage3389
dc.identifier.urihttps://doi.org/10.3906/ELK-2004-67
dc.identifier.urihttps://hdl.handle.net/20.500.11776/4567
dc.identifier.volume28
dc.identifier.wosWOS:000595611700020
dc.identifier.wosqualityQ4
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.institutionauthorUzun, Erdinç
dc.language.isoen
dc.publisherTurkiye Klinikleri
dc.relation.ispartofTurkish Journal of Electrical Engineering and Computer Sciences
dc.relation.publicationcategoryArticle - International Refereed Journal - Institutional Faculty Member (en_US)
dc.rightsinfo:eu-repo/semantics/openAccess
dc.subjectComputational efficiency
dc.subjectHeuristic algorithms
dc.subjectRegular expressions
dc.subjectWeb data extraction
dc.subjectExtraction
dc.subjectHeuristic methods
dc.subjectHTML
dc.subjectPattern matching
dc.subjectTrees (mathematics)
dc.subjectWeb crawler
dc.subjectWebsites
dc.subjectXML
dc.subjectCascading style sheets
dc.subjectConstruction process
dc.subjectDocument object model
dc.subjectExtraction process
dc.subjectHeuristic approach
dc.subjectPerformance Gain
dc.subjectData mining
dc.titleA regular expression generator based on CSS selectors for efficient extraction from HTML pages
dc.typeArticle
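
The idea summarized in the abstract, deriving a regular expression from a CSS selector so that extraction can skip DOM construction, can be illustrated with a minimal sketch. This is not the REGEXN heuristic from the article: the helper names `selector_to_regex` and `extract` are hypothetical, only the selector forms `tag`, `tag.class`, and `tag#id` are handled, attribute values are assumed to be double-quoted, and the matched element is assumed not to contain a nested element with the same tag name.

```python
import re


def selector_to_regex(selector: str) -> re.Pattern:
    """Translate a simple CSS selector into a compiled regular expression.

    Hypothetical helper for illustration only. Supports `tag`, `tag.class`
    and `tag#id`; assumes double-quoted attributes and no nested element
    with the same tag name inside the match.
    """
    m = re.fullmatch(r"([a-zA-Z][a-zA-Z0-9]*)(?:([.#])([\w-]+))?", selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, name = m.groups()

    if kind is None:
        attr = ""  # bare tag selector: no attribute constraint
    elif kind == "#":
        attr = rf'[^>]*\sid="{re.escape(name)}"'
    else:
        # The class name may be one of several space-separated tokens.
        attr = rf'[^>]*\sclass="(?:[^"]*\s)?{re.escape(name)}(?:\s[^"]*)?"'

    # Lazily capture everything up to the first matching closing tag.
    return re.compile(
        rf"<{tag}\b{attr}[^>]*>(.*?)</{tag}>",
        re.DOTALL | re.IGNORECASE,
    )


def extract(selector: str, html: str) -> list[str]:
    """Return the inner HTML of every element matched by the selector."""
    return [inner.strip() for inner in selector_to_regex(selector).findall(html)]
```

Because the pattern stops at the first closing tag, a sketch like this returns truncated content for elements that nest the same tag (for example a `div` inside a `div`), which is one reason a dedicated heuristic is needed in practice.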

Files

Original bundle
Name:
4567.pdf
Size:
449.38 KB
Format:
Adobe Portable Document Format
Description:
Full Text