A hybrid approach for extracting informative content from web pages

Uzun, Erdinç; Agun, Hayri Volkan; Yerlikaya, Tarık

dc.authorid	0000-0003-4351-2244
dc.authorid	0000-0002-4253-8920
dc.authorscopusid	54783608800
dc.authorscopusid	55293388500
dc.authorscopusid	16232085100
dc.authorwosid	Uzun, Erdinç/AAG-5529-2019
dc.authorwosid	Agun, Hayri Volkan/P-5002-2019
dc.contributor.author	Uzun, Erdinç
dc.contributor.author	Agun, Hayri Volkan
dc.contributor.author	Yerlikaya, Tarık
dc.date.accessioned	2022-05-11T14:15:47Z
dc.date.available	2022-05-11T14:15:47Z
dc.date.issued	2013
dc.department	Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering
dc.description.abstract	Eliminating noisy information and extracting informative content have become important issues for web mining, search, and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques rely on various machine learning methods, but implementing these techniques increases the time complexity of the extraction process. Conversely, extraction through hand-crafted rules is efficient because it uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach consisting of two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using the rules obtained from the first step. If the second step does not return an extraction result, the first step is invoked. In our experiments, the first step achieves a high accuracy of 95.76% in extracting the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and rule-based extraction is approximately 240 times faster than the first step. (C) 2013 Elsevier Ltd. All rights reserved.
dc.identifier.doi	10.1016/j.ipm.2013.02.005
dc.identifier.endpage	944
dc.identifier.issn	0306-4573
dc.identifier.issn	1873-5371
dc.identifier.issue	4	en_US
dc.identifier.scopus	2-s2.0-84875710694
dc.identifier.scopusquality	Q1
dc.identifier.startpage	928
dc.identifier.uri	https://doi.org/10.1016/j.ipm.2013.02.005
dc.identifier.uri	https://hdl.handle.net/20.500.11776/6069
dc.identifier.volume	49
dc.identifier.wos	WOS:000319543800015
dc.identifier.wosquality	Q2
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.institutionauthor	Uzun, Erdinç
dc.language.iso	en
dc.publisher	Elsevier Sci Ltd
dc.relation.ispartof	Information Processing & Management
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	Web Content Extraction
dc.subject	Template Detection
dc.subject	Web Cleaning
dc.subject	Web Learning Modeling
dc.subject	Searching Strategies
dc.title	A hybrid approach for extracting informative content from web pages
dc.type	Article

Files

Original bundle

Name: 6069.pdf
Size: 973.52 KB
Format: Adobe Portable Document Format
Description: Full Text

Collections

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
Çorlu Faculty of Engineering Collection