A hybrid approach for extracting informative content from web pages

dc.authorid: 0000-0003-4351-2244
dc.authorid: 0000-0002-4253-8920
dc.authorscopusid: 54783608800
dc.authorscopusid: 55293388500
dc.authorscopusid: 16232085100
dc.authorwosid: Uzun, Erdinç/AAG-5529-2019
dc.authorwosid: Agun, Hayri Volkan/P-5002-2019
dc.contributor.author: Uzun, Erdinç
dc.contributor.author: Agun, Hayri Volkan
dc.contributor.author: Yerlikaya, Tarık
dc.date.accessioned: 2022-05-11T14:15:47Z
dc.date.available: 2022-05-11T14:15:47Z
dc.date.issued: 2013
dc.department: Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering
dc.description.abstract: Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach that contains two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using rules obtained from the first step. However, if the second step does not return an extraction result, the first step gets invoked. In our experiments, the first step achieves high accuracy with 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and it is approximately 240 times faster than the first step. (C) 2013 Elsevier Ltd. All rights reserved.
dc.identifier.doi: 10.1016/j.ipm.2013.02.005
dc.identifier.endpage: 944
dc.identifier.issn: 0306-4573
dc.identifier.issn: 1873-5371
dc.identifier.issue: 4 (en_US)
dc.identifier.scopus: 2-s2.0-84875710694
dc.identifier.scopusquality: Q1
dc.identifier.startpage: 928
dc.identifier.uri: https://doi.org/10.1016/j.ipm.2013.02.005
dc.identifier.uri: https://hdl.handle.net/20.500.11776/6069
dc.identifier.volume: 49
dc.identifier.wos: WOS:000319543800015
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.institutionauthor: Uzun, Erdinç
dc.language.iso: en
dc.publisher: Elsevier Sci Ltd
dc.relation.ispartof: Information Processing & Management
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member (en_US)
dc.rights: info:eu-repo/semantics/closedAccess
dc.subject: Web Content Extraction
dc.subject: Template Detection
dc.subject: Web Cleaning
dc.subject: Web Learning Modeling
dc.subject: Searching Strategies
dc.title: A hybrid approach for extracting informative content from web pages
dc.type: Article
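
The two-step control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `HybridExtractor`, `learn_rule`, and `apply_rule`, the per-site rule cache, and the (start marker, end marker) rule shape are all hypothetical. In particular, `learn_rule` is only a stand-in heuristic (pick the flat `<div>` with the most text) marking the place where the paper's Decision Tree Learning step would run.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical rule shape: (opening tag text, closing tag text).
Rule = Tuple[str, str]


def apply_rule(html: str, rule: Rule) -> Optional[str]:
    """Step 2: fast extraction using plain string search between two markers."""
    start_marker, end_marker = rule
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end].strip()


def learn_rule(html: str) -> Optional[Rule]:
    """Step 1 stand-in (assumption): pick the non-nested <div> with the most text.

    The paper uses Decision Tree Learning here; this heuristic only shows
    where that slower learning step would plug in.
    """
    best_rule: Optional[Rule] = None
    best_length = 0
    for match in re.finditer(r"(<div[^>]*>)(.*?)</div>", html, flags=re.S):
        text = re.sub(r"<[^>]+>", "", match.group(2)).strip()
        if len(text) > best_length:
            best_length = len(text)
            best_rule = (match.group(1), "</div>")
    return best_rule


class HybridExtractor:
    """Try a cached rule first; fall back to learning when it returns nothing."""

    def __init__(self) -> None:
        self.rules: Dict[str, Rule] = {}  # one cached rule per site key
        self.learner_calls = 0            # counts how often step 1 ran

    def extract(self, site: str, html: str) -> Optional[str]:
        rule = self.rules.get(site)
        if rule is not None:
            content = apply_rule(html, rule)
            if content:
                return content
        # No rule yet, or the rule returned no result: invoke step 1.
        self.learner_calls += 1
        rule = learn_rule(html)
        if rule is None:
            return None
        self.rules[site] = rule
        return apply_rule(html, rule)


page1 = '<div id="menu">Home</div><div id="main">A long informative article body.</div>'
page2 = '<div id="menu">Home | About | Contact | Archive</div><div id="main">Short news.</div>'

extractor = HybridExtractor()
first = extractor.extract("example.test", page1)   # learns a rule, then applies it
second = extractor.extract("example.test", page2)  # reuses the cached rule
```

In this sketch the second page is handled by the cached rule alone, so the slower learning step runs only once; a page on which the rule finds nothing would trigger the learning step again, mirroring the fallback described in the abstract.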

Files

Original bundle
Name: 6069.pdf
Size: 973.52 KB
Format: Adobe Portable Document Format
Description: Full Text