A hybrid approach for extracting informative content from web pages

Uzun, Erdinç; Agun, Hayri Volkan; Yerlikaya, Tarık

dc.contributor.author	Uzun, Erdinç
dc.contributor.author	Agun, Hayri Volkan
dc.contributor.author	Yerlikaya, Tarık
dc.date.accessioned	2022-05-11T14:15:47Z
dc.date.available	2022-05-11T14:15:47Z
dc.date.issued	2013
dc.identifier.issn	0306-4573
dc.identifier.issn	1873-5371
dc.identifier.uri	https://doi.org/10.1016/j.ipm.2013.02.005
dc.identifier.uri	https://hdl.handle.net/20.500.11776/6069
dc.description.abstract	Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach that contains two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using rules obtained from the first step. However, if the second step does not return an extraction result, the first step gets invoked. In our experiments, the first step achieves high accuracy with 95.76% in extraction of the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and it is approximately 240 times faster than the first step. © 2013 Elsevier Ltd. All rights reserved.	en_US
dc.language.iso	eng	en_US
dc.publisher	Elsevier Sci Ltd	en_US
dc.identifier.doi	10.1016/j.ipm.2013.02.005
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	Web Content Extraction	en_US
dc.subject	Template Detection	en_US
dc.subject	Web Cleaning	en_US
dc.subject	Web Learning Modeling	en_US
dc.subject	Searching Strategies	en_US
dc.title	A hybrid approach for extracting informative content from web pages	en_US
dc.type	article	en_US
dc.relation.ispartof	Information Processing & Management	en_US
dc.department	Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering	en_US
dc.authorid	0000-0003-4351-2244
dc.authorid	0000-0002-4253-8920
dc.identifier.volume	49	en_US
dc.identifier.issue	4	en_US
dc.identifier.startpage	928	en_US
dc.identifier.endpage	944	en_US
dc.institutionauthor	Uzun, Erdinç
dc.relation.publicationcategory	Article - International Peer-Reviewed Journal - Institutional Faculty Member	en_US
dc.authorscopusid	54783608800
dc.authorscopusid	55293388500
dc.authorscopusid	16232085100
dc.authorwosid	Uzun, Erdinç/AAG-5529-2019
dc.authorwosid	Agun, Hayri Volkan/P-5002-2019
dc.identifier.wos	WOS:000319543800015	en_US
dc.identifier.scopus	2-s2.0-84875710694	en_US
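
The two-step control flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: a rule is modelled as a literal opening-tag string, rules are cached per site, and `toy_learner` is a hypothetical stand-in for the Decision Tree Learning step. The names `extract_with_rule`, `HybridExtractor`, and `learner_calls` are invented for this example.

```python
import re
from typing import Callable, Dict, Optional

# Assumption: a rule is a literal opening tag such as '<div id="content">'.
Rule = str


def extract_with_rule(html: str, rule: Rule) -> Optional[str]:
    """Step 2 (sketch): return the element that starts with `rule`,
    or None when the rule yields no extraction result."""
    start = html.find(rule)
    tag_match = re.match(r"<(\w+)", rule)
    if start == -1 or tag_match is None:
        return None
    # Match opening and closing tags of the same name to find the end.
    token = re.compile(r"<(/?)%s\b" % re.escape(tag_match.group(1)))
    depth = 0
    for m in token.finditer(html, start):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            end = html.find(">", m.end())
            return None if end == -1 else html[start:end + 1]
    return None  # unbalanced markup: treat as "no result"


def toy_learner(html: str) -> Optional[Rule]:
    """Hypothetical placeholder for step 1. A real learner would classify
    page elements; this one only looks for a fixed opening tag."""
    rule = '<div id="content">'
    return rule if rule in html else None


class HybridExtractor:
    def __init__(self, learn_rule: Callable[[str], Optional[Rule]]):
        self.learn_rule = learn_rule        # step 1: slow, learning-based
        self.rules: Dict[str, Rule] = {}    # assumption: one rule per site
        self.learner_calls = 0

    def extract(self, site: str, html: str) -> Optional[str]:
        rule = self.rules.get(site)
        if rule is not None:
            content = extract_with_rule(html, rule)   # step 2: fast path
            if content is not None:
                return content
        # No stored rule, or the rule returned nothing: invoke step 1.
        self.learner_calls += 1
        rule = self.learn_rule(html)
        if rule is None:
            return None
        self.rules[site] = rule
        return extract_with_rule(html, rule)
```

In this sketch the expensive learner runs only when no stored rule produces a result, which is the behaviour the abstract credits for the speed-up of the rule-based step.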

Files in this item:

Name: 6069.pdf
Size: 973.5Kb
Format: PDF
Description: Full Text

This item appears in the following collection(s).

Scopus Indexed Publications Collection [4328]
WoS Indexed Publications Collection [4789]
Çorlu Faculty of Engineering Collection [990]