An effective and efficient web content extractor for optimizing the crawling process

Uzun, Erdinç; Güner, Edip Serdar; Kılıçaslan, Yılmaz; Yerlikaya, Tarık; Agun, Hayri Volkan

dc.authorscopusid	54783608800
dc.authorscopusid	24481000500
dc.authorscopusid	56538500900
dc.authorscopusid	16232085100
dc.authorscopusid	55293388500
dc.contributor.author	Uzun, Erdinç
dc.contributor.author	Güner, Edip Serdar
dc.contributor.author	Kılıçaslan, Yılmaz
dc.contributor.author	Yerlikaya, Tarık
dc.contributor.author	Agun, Hayri Volkan
dc.date.accessioned	2022-05-11T14:15:47Z
dc.date.available	2022-05-11T14:15:47Z
dc.date.issued	2014
dc.department	Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering
dc.description.abstract	Classical Web crawlers use only hyperlink information in the crawling process. Focused crawlers, by contrast, are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. Yet Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments of block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, along with the mapping between tags and blocks learned in the first step, provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs by 57.10% when compared with the texts obtained through classical HTML stripping. Copyright © 2013 John Wiley & Sons, Ltd.
dc.identifier.doi	10.1002/spe.2195
dc.identifier.endpage	1199
dc.identifier.issn	0038-0644
dc.identifier.issue	10	en_US
dc.identifier.scopus	2-s2.0-84908473420
dc.identifier.scopusquality	Q2
dc.identifier.startpage	1181
dc.identifier.uri	https://doi.org/10.1002/spe.2195
dc.identifier.uri	https://hdl.handle.net/20.500.11776/6074
dc.identifier.volume	44
dc.indekslendigikaynak	Scopus
dc.institutionauthor	Uzun, Erdinç
dc.language.iso	en
dc.publisher	John Wiley and Sons Ltd
dc.relation.ispartof	Software - Practice and Experience
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	Classification
dc.subject	Intelligent systems
dc.subject	Web content extraction
dc.subject	Web crawling
dc.subject	Classification (of information)
dc.subject	Decision trees
dc.subject	Extraction
dc.subject	HTML
dc.subject	Hypertext systems
dc.subject	Intelligent systems
dc.subject	Random forests
dc.subject	Trees (mathematics)
dc.subject	Websites
dc.subject	Block extraction
dc.subject	Decision tree learning algorithm
dc.subject	Extraction process
dc.subject	Intelligent crawlers
dc.subject	Sources of informations
dc.subject	Storage efficiency
dc.subject	Web content extractions
dc.subject	Web Crawling
dc.subject	Web crawler
dc.title	An effective and efficient web content extractor for optimizing the crawling process
dc.type	Article
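
The abstract describes a second step that extracts block content with string matching functions, guided by a tag-to-block mapping learned in the first step. The sketch below illustrates that general idea only; it is not the authors' implementation. The mapping `TAG_TO_BLOCK`, the helper names `extract_element` and `extract_blocks`, and the sample markup are all hypothetical, and a hand-written dictionary stands in for the output of the decision tree learner.

```python
import re

# Hypothetical stand-in for the mapping learned in the first step:
# opening-tag strings mapped to block labels (illustrative values only).
TAG_TO_BLOCK = {
    '<h1 class="title">': "headline",
    '<div id="content">': "main_text",
    '<ul class="nav">': "menu",
}


def extract_element(html, open_tag):
    """Return the inner HTML of the first element that starts with open_tag.

    Uses plain substring search instead of building a DOM. Returns None if
    the tag is absent or the markup is unbalanced. Simplification: the prefix
    test "<name" would also match longer tag names sharing that prefix.
    """
    start = html.find(open_tag)
    if start == -1:
        return None
    name = open_tag[1:].split()[0].rstrip(">")  # e.g. "div"
    open_pat, close_pat = "<" + name, "</" + name + ">"
    depth, pos = 1, start + len(open_tag)
    while depth > 0:
        next_close = html.find(close_pat, pos)
        if next_close == -1:
            return None
        next_open = html.find(open_pat, pos)
        if next_open != -1 and next_open < next_close:
            depth += 1  # a nested element of the same name opens first
            pos = next_open + len(open_pat)
        else:
            depth -= 1
            pos = next_close + len(close_pat)
    return html[start + len(open_tag):pos - len(close_pat)]


def extract_blocks(html, tag_to_block):
    """Map each block label to the plain text found under its known tag."""
    blocks = {}
    for open_tag, label in tag_to_block.items():
        inner = extract_element(html, open_tag)
        if inner is not None:
            text = re.sub(r"<[^>]+>", " ", inner)  # drop nested tags
            blocks[label] = " ".join(text.split())  # collapse whitespace
    return blocks
```

Because only the tags named in the mapping are searched for, everything outside those blocks is never copied, which is one plausible reading of why block-level extraction stores less text than stripping tags from the whole page.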

Files

Original bundle

Name: 6074.pdf
Size: 710.36 KB
Format: Adobe Portable Document Format
Description: Full Text

Collection

Scopus Indexed Publications Collection
Çorlu Faculty of Engineering Collection