dc.contributor.author: Uzun, Erdinç
dc.contributor.author: Güner, Edip Serdar
dc.contributor.author: Kılıçaslan, Yılmaz
dc.contributor.author: Yerlikaya, Tarık
dc.contributor.author: Agun, Hayri Volkan
dc.date.accessioned: 2022-05-11T14:15:47Z
dc.date.available: 2022-05-11T14:15:47Z
dc.date.issued: 2014
dc.identifier.issn: 0038-0644
dc.identifier.uri: https://doi.org/10.1002/spe.2195
dc.identifier.uri: https://hdl.handle.net/20.500.11776/6074
dc.description.abstract: Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information before downloading the Web page. But Web pages contain additional information that can be useful for the crawling process. We have developed a crawler, iCrawler (intelligent crawler), the backbone of which is a Web content extractor that automatically pulls content out of seven different blocks: menus, links, main texts, headlines, summaries, additional necessaries, and unnecessary texts from Web pages. The extraction process consists of two steps, which invoke each other to obtain information from the blocks. The first step learns which HTML tags refer to which blocks using the decision tree learning algorithm. Being guided by numerous sources of information, the crawler becomes considerably effective. It achieved a relatively high accuracy of 96.37% in our experiments of block extraction. In the second step, the crawler extracts content from the blocks using string matching functions. These functions, along with the mapping between tags and blocks learned in the first step, provide iCrawler with considerable time and storage efficiency. More specifically, iCrawler performs 14 times faster in the second step than in the first step. Furthermore, iCrawler significantly decreases storage costs by 57.10% when compared with the texts obtained through classical HTML stripping. Copyright © 2013 John Wiley & Sons, Ltd. (en_US)
dc.language.iso: eng (en_US)
dc.publisher: John Wiley and Sons Ltd (en_US)
dc.identifier.doi: 10.1002/spe.2195
dc.rights: info:eu-repo/semantics/closedAccess (en_US)
dc.subject: Classification (en_US)
dc.subject: Intelligent systems (en_US)
dc.subject: Web content extraction (en_US)
dc.subject: Web crawling (en_US)
dc.subject: Classification (of information) (en_US)
dc.subject: Decision trees (en_US)
dc.subject: Extraction (en_US)
dc.subject: HTML (en_US)
dc.subject: Hypertext systems (en_US)
dc.subject: Intelligent systems (en_US)
dc.subject: Random forests (en_US)
dc.subject: Trees (mathematics) (en_US)
dc.subject: Websites (en_US)
dc.subject: Block extraction (en_US)
dc.subject: Decision tree learning algorithm (en_US)
dc.subject: Extraction process (en_US)
dc.subject: Intelligent crawlers (en_US)
dc.subject: Sources of informations (en_US)
dc.subject: Storage efficiency (en_US)
dc.subject: Web content extractions (en_US)
dc.subject: Web Crawling (en_US)
dc.subject: Web crawler (en_US)
dc.title: An effective and efficient web content extractor for optimizing the crawling process (en_US)
dc.type: article (en_US)
dc.relation.ispartof: Software - Practice and Experience (en_US)
dc.department: Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering (en_US)
dc.identifier.volume: 44 (en_US)
dc.identifier.issue: 10 (en_US)
dc.identifier.startpage: 1181 (en_US)
dc.identifier.endpage: 1199 (en_US)
dc.institutionauthor: Uzun, Erdinç
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member (en_US)
dc.authorscopusid: 54783608800
dc.authorscopusid: 24481000500
dc.authorscopusid: 56538500900
dc.authorscopusid: 16232085100
dc.authorscopusid: 55293388500
dc.identifier.scopus: 2-s2.0-84908473420 (en_US)
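
The second step described in the abstract (pulling block content out of a page with string matching once a tag-to-block mapping is known) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `extract_block` and `extract_blocks`, the block labels, and the hand-written `tag_to_block` mapping are all assumptions made for illustration. In iCrawler the mapping would come from the decision tree learned in the first step.

```python
from typing import Dict, Optional


def extract_block(html: str, open_tag: str) -> Optional[str]:
    """Return the inner HTML of the first element that starts with open_tag.

    Uses only string searches (no DOM parsing). Nested elements with the
    same tag name are handled by counting open and close tags.
    Caveat of this sketch: the prefix search for "<div" would also match a
    tag such as "<divider"; a production version would need a stricter check.
    """
    start = html.find(open_tag)
    if start == -1:
        return None

    # Derive the tag name, e.g. '<div class="main">' -> 'div'.
    name = open_tag[1:].split()[0].rstrip(">")
    open_prefix = "<" + name
    close_tag = "</" + name + ">"

    pos = start + len(open_tag)
    depth = 1
    while depth > 0:
        next_close = html.find(close_tag, pos)
        if next_close == -1:
            return None  # unbalanced markup: give up on this block
        next_open = html.find(open_prefix, pos)
        if next_open != -1 and next_open < next_close:
            depth += 1  # a nested element of the same name opens first
            pos = next_open + len(open_prefix)
        else:
            depth -= 1  # the nearest close tag ends one nesting level
            pos = next_close + len(close_tag)

    return html[start + len(open_tag): pos - len(close_tag)]


def extract_blocks(html: str, tag_to_block: Dict[str, str]) -> Dict[str, str]:
    """Apply a (hypothetical) learned tag-to-block mapping to one page."""
    blocks = {}
    for open_tag, label in tag_to_block.items():
        content = extract_block(html, open_tag)
        if content is not None:
            blocks[label] = content
    return blocks


# Hypothetical mapping standing in for the output of the learning step.
tag_to_block = {
    '<div id="menu">': "menu",
    "<h1>": "headline",
    '<div class="main">': "main_text",
}

page = (
    '<html><body><div id="menu"><a href="/">Home</a></div>'
    "<h1>Title</h1>"
    '<div class="main"><div>Para one.</div>Para two.</div>'
    "</body></html>"
)

blocks = extract_blocks(page, tag_to_block)
```

Because the mapping is looked up rather than relearned for each page, the per-page work reduces to a handful of substring searches, which is consistent with the abstract's point that the second step is much cheaper than the first.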

