dc.contributor.author: Uzun, Erdinç
dc.date.accessioned: 2022-05-11T14:03:00Z
dc.date.available: 2022-05-11T14:03:00Z
dc.date.issued: 2020
dc.identifier.issn: 1300-0632
dc.identifier.uri: https://doi.org/10.3906/ELK-2004-67
dc.identifier.uri: https://hdl.handle.net/20.500.11776/4567
dc.description.abstract: Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. To extract data from a web page with these patterns, an HTML parser first constructs a document object model (DOM) tree for the page. Both constructing this tree and extracting from it increase time and memory costs, depending on the number of HTML elements and their hierarchies. Regular expressions can be considered as a way to reduce these costs. However, preparing regular expression patterns is a laborious task. In this study, a heuristic approach, namely Regex Generator (REGEXN), that automatically generates these patterns from CSS selectors is introduced, and the performance gains are analyzed on a web crawler. The analysis shows that regular expression patterns generated by this approach can significantly reduce the average extraction time from 743.31 ms to 1.03 ms when compared with extraction from a DOM tree. Similarly, the average memory usage drops from 1054.01 B to 1.59 B. Moreover, REGEXN can be easily adapted to existing frameworks and tools for this task. © TÜBİTAK [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Turkiye Klinikleri [en_US]
dc.identifier.doi: 10.3906/ELK-2004-67
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Computational efficiency [en_US]
dc.subject: Heuristic algorithms [en_US]
dc.subject: Regular expressions [en_US]
dc.subject: Web data extraction [en_US]
dc.subject: Extraction [en_US]
dc.subject: Heuristic methods [en_US]
dc.subject: HTML [en_US]
dc.subject: Pattern matching [en_US]
dc.subject: Trees (mathematics) [en_US]
dc.subject: Web crawler [en_US]
dc.subject: Websites [en_US]
dc.subject: XML [en_US]
dc.subject: Cascading style sheets [en_US]
dc.subject: Construction process [en_US]
dc.subject: Document object model [en_US]
dc.subject: Extraction process [en_US]
dc.subject: Heuristic approach [en_US]
dc.subject: Performance Gain [en_US]
dc.subject: Data mining [en_US]
dc.title: A regular expression generator based on CSS selectors for efficient extraction from HTML pages [en_US]
dc.type: article [en_US]
dc.relation.ispartof: Turkish Journal of Electrical Engineering and Computer Sciences [en_US]
dc.department: Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering [en_US]
dc.identifier.volume: 28 [en_US]
dc.identifier.issue: 6 [en_US]
dc.identifier.startpage: 3389 [en_US]
dc.identifier.endpage: 3401 [en_US]
dc.institutionauthor: Uzun, Erdinç
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member [en_US]
dc.authorscopusid: 54783608800
dc.identifier.wos: WOS:000595611700020 [en_US]
dc.identifier.scopus: 2-s2.0-85102510844 [en_US]
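
The idea described in the abstract, deriving a regular expression from a CSS selector so that extraction can skip DOM construction, can be illustrated with a minimal Python sketch. This is not the REGEXN algorithm from the article. The function name selector_to_regex, the supported selector forms (tag, tag.class, tag#id), and the sample HTML are assumptions made for illustration only, and the generated pattern does not handle nested elements that share the same tag name.

```python
import re


def selector_to_regex(selector: str) -> re.Pattern:
    """Turn a simple selector (tag, tag.class, or tag#id) into a regex.

    Hypothetical helper for illustration; not the article's REGEXN method.
    The pattern captures the text between the opening tag and the first
    matching closing tag, so same-name nested elements are not supported.
    """
    m = re.fullmatch(r"([A-Za-z][A-Za-z0-9]*)(?:([.#])([\w-]+))?", selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, name = m.groups()

    if kind is None:
        # Bare tag: accept any attributes.
        attrs = r"[^>]*"
    elif kind == "#":
        # id must equal the given name exactly.
        attrs = rf"""[^>]*\bid\s*=\s*["']{re.escape(name)}["'][^>]*"""
    else:
        # class may hold several space-separated names; match one of them.
        attrs = (
            rf"""[^>]*\bclass\s*=\s*["'](?:[^"']*\s)?"""
            rf"""{re.escape(name)}(?:\s[^"']*)?["'][^>]*"""
        )

    return re.compile(
        rf"<{tag}\b{attrs}>(.*?)</{tag}\s*>",
        re.IGNORECASE | re.DOTALL,
    )


# Sample page (assumed for the example).
html = (
    '<html><body><h1 class="title main">Hello</h1>'
    '<p id="intro">First <b>para</b></p><p>Second</p></body></html>'
)

titles = selector_to_regex("h1.title").findall(html)
intro = selector_to_regex("p#intro").findall(html)
paragraphs = selector_to_regex("p").findall(html)
```

Here the extraction runs directly on the page source with findall, with no parser or tree involved. A production approach would also need to deal with nesting, descendant combinators, and malformed markup, which this sketch leaves out.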

