An efficient regular expression inference approach for relevant image extraction
Abstract
Traditional approaches for automatically extracting relevant images from web pages are error-prone and time-consuming. To improve results, web data extraction approaches typically resort to preparing larger datasets and finding new features, but these operations are difficult and laborious. In this study, we propose a fully automated approach based on the alignment of regular expressions to extract the relevant images from web pages. Automatically constructed regular expressions are applied to a classification task for the first time. To this end, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of aligning two regular expressions by applying a constraint on a version of the Levenshtein distance algorithm. The classification accuracy of the regular expression approaches is compared with that of the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples from 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved an F-measure of 0.98 with only 5 frequent n-grams, and it outperformed the other classifiers on the same set of features. The classification time of the proposed approach is measured at 0.108 ms, which is very competitive with the other classifiers. © 2023
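The abstract does not say which constraint is placed on the Levenshtein computation. As a rough illustration only, the sketch below assumes one common choice: a diagonal band that limits how far the two positions may drift apart, which cuts the work from O(n·m) to roughly O(n·band). It operates on plain attribute-value strings rather than on regular expressions, and the function name `banded_levenshtein` and the `band` parameter are hypothetical, not taken from the paper.

```python
from typing import Optional


def banded_levenshtein(a: str, b: str, band: int) -> Optional[int]:
    """Edit distance between a and b, restricted to cells with |i - j| <= band.

    Assumption: the constraint is a diagonal band; the paper may use another.
    Returns None when the distance exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None  # the end cell lies outside the band
    inf = n + m + 1  # larger than any possible edit distance
    # Row 0: turning "" into b[:j] costs j insertions (inside the band only).
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i  # turning a[:i] into "" costs i deletions
        for j in range(max(1, lo), hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from a
                cur[j - 1] + 1,      # insertion into a
            )
        prev = cur
    return prev[m] if prev[m] <= band else None


if __name__ == "__main__":
    # Hypothetical class attribute values of two image elements.
    print(banded_levenshtein("product-image-main", "product-image-thumb", 5))
```

A narrow band makes the comparison cheaper but rejects pairs that differ by more than `band` edits, so the band width trades speed against how dissimilar two attribute values may be while still being aligned.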
Volume
135
Related Items
Showing items related by title, author, creator, and subject.
An effective and efficient web content extractor for optimizing the crawling process
Uzun, Erdinç; Güner, Edip Serdar; Kılıçaslan, Yılmaz; Yerlikaya, Tarık; Agun, Hayri Volkan (John Wiley and Sons Ltd, 2014) Classical Web crawlers make use of only hyperlink information in the crawling process. However, focused crawlers are intended to download only Web pages that are relevant to a given topic by utilizing word information ...
A Novel Web Scraping Approach Using the Additional Information Obtained from Web Pages
Uzun, Erdinç (Institute of Electrical and Electronics Engineers Inc., 2020) Web scraping is a process of extracting valuable and interesting text information from web pages. Most current studies targeting this task concern automated web data extraction. In the extraction process, ...
A regular expression generator based on CSS selectors for efficient extraction from HTML pages
Uzun, Erdinç (Turkiye Klinikleri, 2020) Cascading style sheets (CSS) selectors are patterns used to select HTML elements. They are often preferred in web data extraction because they are easy to prepare and have short expressions. In order to be able to extract ...