Automatically Discovering Relevant Images From Web Pages

Uzun, Erdinç; Ozhan, Erkan; Agun, Hayri Volkan; Yerlikaya, Tarık; Buluş, Halil Nusret

dc.contributor.author	Uzun, Erdinç
dc.contributor.author	Ozhan, Erkan
dc.contributor.author	Agun, Hayri Volkan
dc.contributor.author	Yerlikaya, Tarık
dc.contributor.author	Buluş, Halil Nusret
dc.date.accessioned	2022-05-11T14:03:00Z
dc.date.available	2022-05-11T14:03:00Z
dc.date.issued	2020
dc.identifier.issn	2169-3536
dc.identifier.uri	https://doi.org/10.1109/ACCESS.2020.3039044
dc.identifier.uri	https://hdl.handle.net/20.500.11776/4569
dc.description.abstract	Web pages contain irrelevant images alongside relevant ones. Classifying these images is error-prone because web page designs vary widely. Using multiple web pages provides additional features that improve relevant image extraction. Traditional studies use features extracted from a single web page. In this study, however, we improve relevant image extraction by employing features extracted from different web pages consisting of standard news, galleries, video pages, and link pages. The dataset obtained from these web pages contains 100 different web pages from each of 200 online news websites in 58 different countries. The most straightforward approach to discovering relevant images extracts the largest image on the web page; this baseline achieves an F-Measure of 0.451. We then apply several machine learning methods to the features in this dataset to find the most suitable one. The best F-Measure, 0.822, is obtained with the AdaBoost classifier. Some of these features have been utilized in previous web data extraction studies. To the best of our knowledge, 15 new features are proposed for the first time in this study for discovering relevant images. We compare the performance of the AdaBoost classifier on different feature sets. The proposed features improve the F-Measure by 35 percent. Moreover, the cache feature alone, which is the most prominent feature, accounts for 7 percent of this improvement.	en_US
dc.language.iso	eng	en_US
dc.publisher	IEEE-Inst Electrical Electronics Engineers Inc	en_US
dc.identifier.doi	10.1109/ACCESS.2020.3039044
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Web pages	en_US
dc.subject	Feature extraction	en_US
dc.subject	Layout	en_US
dc.subject	Machine learning	en_US
dc.subject	Crawlers	en_US
dc.subject	Predictive models	en_US
dc.subject	Task analysis	en_US
dc.subject	Image classification	en_US
dc.subject	image retrieval	en_US
dc.subject	feature extraction	en_US
dc.subject	web crawlers	en_US
dc.subject	web mining	en_US
dc.title	Automatically Discovering Relevant Images From Web Pages	en_US
dc.type	article	en_US
dc.relation.ispartof	IEEE Access	en_US
dc.department	Faculties, Çorlu Faculty of Engineering, Department of Computer Engineering	en_US
dc.authorid	0000-0002-3971-2676
dc.authorid	0000-0002-9888-0151
dc.authorid	0000-0003-4351-2244
dc.identifier.volume	8	en_US
dc.identifier.startpage	208910	en_US
dc.identifier.endpage	208921	en_US
dc.institutionauthor	Uzun, Erdinç
dc.institutionauthor	Ozhan, Erkan
dc.institutionauthor	Buluş, Halil Nusret
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Academic Staff	en_US
dc.authorwosid	OZHAN, Erkan/N-8743-2016
dc.identifier.wos	WOS:000594426400001	en_US

Files in this item:

Name: 4569.pdf
Size: 1.457Mb
Format: PDF
Description: Full Text


This item appears in the following collection(s):

WoS Indexed Publications Collection [4789]
Çorlu Faculty of Engineering Collection [990]
