Scraping Relevant Images from Web Pages without Download

Uzun, Erdinç

dc.contributor.author	Uzun, Erdinç
dc.date.accessioned	2024-10-29T17:43:29Z
dc.date.available	2024-10-29T17:43:29Z
dc.date.issued	2023
dc.department	Tekirdağ Namık Kemal Üniversitesi
dc.description.abstract	Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools, because it is based on textual data. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
dc.identifier.doi	10.1145/3616849
dc.identifier.issn	1559-1131
dc.identifier.issue	1	en_US
dc.identifier.scopus	2-s2.0-85174856054
dc.identifier.scopusquality	Q1
dc.identifier.uri	https://doi.org/10.1145/3616849
dc.identifier.uri	https://hdl.handle.net/20.500.11776/12423
dc.identifier.volume	18
dc.indekslendigikaynak	Scopus
dc.language.iso	en
dc.publisher	Association for Computing Machinery
dc.relation.ispartof	ACM Transactions on the Web
dc.relation.publicationcategory	Article - International Peer-Reviewed Journal - Institutional Faculty Member	en_US
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	crawler design
dc.subject	evaluation
dc.subject	relevant images
dc.subject	web data extraction
dc.subject	Web mining
dc.title	Scraping Relevant Images from Web Pages without Download
dc.type	Article
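
The abstract states that the approach builds its learning model from textual data in the HTML elements, so no image has to be downloaded. The sketch below illustrates that general idea only: it collects text-only attributes for each img element using the Python standard library. The class name, the function name, and the chosen features (src, alt, class, parent tag, parent class) are assumptions made for illustration and are not taken from the article.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImgTextFeatures(HTMLParser):
    """Hypothetical collector of text-only features for <img> elements."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, class) pairs of currently open ancestors
        self.images = []  # one feature dict per <img>, in document order

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            parent_tag, parent_class = self.stack[-1] if self.stack else ("", "")
            self.images.append({
                "src": a.get("src") or "",      # URL string only, never fetched
                "alt": a.get("alt") or "",
                "class": a.get("class") or "",
                "parent": parent_tag,
                "parent_class": parent_class,
            })
        elif tag not in VOID_TAGS:
            self.stack.append((tag, a.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_img_features(html):
    """Return a list of text-only feature dicts, one per <img> in html."""
    parser = ImgTextFeatures()
    parser.feed(html)
    parser.close()
    return parser.images


page = ('<div class="article"><img src="/img/story.jpg" alt="Story photo"></div>'
        '<footer class="ads"><img src="/ads/banner.gif"></footer>')
for features in extract_img_features(page):
    print(features)
```

Feature dicts of this kind could be vectorized and passed to a classifier; the abstract reports AdaBoost as the best-performing method, but the feature set used in the article may differ from this illustration.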

Collection

Scopus Indexed Publications Collection
