Computational analysis of virus-host protein-protein interactions using gene ontology and natural language processing

dc.contributor.authorCihan, Pinar
dc.contributor.authorOzger, Zeynep Banu
dc.contributor.authorCakabay, Zeynep
dc.date.accessioned2025-04-06T12:23:56Z
dc.date.available2025-04-06T12:23:56Z
dc.date.issued2025
dc.departmentTekirdağ Namık Kemal Üniversitesi
dc.description.abstractThe role of in-silico computational methods in identifying protein-protein interactions (PPIs) between target and host proteins is crucial for developing effective infection treatments. These methods are essential for quickly determining high-quality and accurate PPIs, predicting protein pairs with the highest likelihood of physical interaction from a large pool, and reducing the need for experimental confirmation or prioritizing pairs for experiments. This study proposes using gene ontology and natural language processing (NLP) approaches to extract and quantify features from protein sequences. In the first step, proteins were represented using gene ontology terms, and a set of features was generated. In the second step, NLP techniques treated gene ontology terms as a word dictionary, creating numerical vectors using the bag of words (BoW), count vector, term frequency-inverse document frequency (TF-IDF), and information content methods. In the third step, different machine learning methods, including Decision Tree, Random Forest, Bagging-RepTree, Bagging-RF, BayesNet, Deep Neural Network (DNN), Logistic Regression, Support Vector Machine (SVM), and VotedPerceptron, were employed to predict protein interactions in the datasets. In the fourth step, the Max-Min Parents and Children (MMPC) feature selection algorithm was applied to improve predictions using fewer features. The performance of the developed method was tested on the SARS-CoV-2 protein interaction dataset. The MMPC algorithm reduced the feature count by over 99%, enhancing protein interaction prediction. After feature selection, the DNN method achieved the highest predictive performance, with an AUC of 0.878 and an F-Measure of 0.793. Sequence-based protein encoding methods AAC, APAAC, CKSAAPP, CTriad, DC, and PAAC were applied to proteins in the SARS-CoV-2 interaction dataset and their performance was compared with GO-NLP. 
The performance of the relevant methods was measured separately and combined. The highest performance was obtained from the combined dataset with an AUC value of 0.888. This study demonstrates that the proposed gene ontology and NLP approach can successfully predict protein-protein interactions for antiviral drug design with significantly fewer features using the MMPC-DNN model.
dc.description.sponsorshipTurkish Scientific and Technical Research Council-TUBITAK [122E114]; Scientific and Technological Research Council of Turkiye (TUBITAK); TUBITAK; Springer Nature
dc.description.sponsorshipThis work was supported by the Turkish Scientific and Technical Research Council-TUBITAK (Grant Number: 122E114). Open access funding provided by the Scientific and Technological Research Council of Turkiye (TUBITAK). The open access fee for this study was provided by the open access agreement between TUBITAK and Springer Nature.
dc.identifier.doi10.1007/s10489-024-06223-1
dc.identifier.issn0924-669X
dc.identifier.issn1573-7497
dc.identifier.issue6
dc.identifier.scopus2-s2.0-85217766326
dc.identifier.scopusqualityQ2
dc.identifier.urihttps://doi.org/10.1007/s10489-024-06223-1
dc.identifier.urihttps://hdl.handle.net/20.500.11776/17265
dc.identifier.volume55
dc.identifier.wosWOS:001410348000002
dc.identifier.wosqualityQ2
dc.indekslendigikaynakWeb of Science
dc.indekslendigikaynakScopus
dc.language.isoen
dc.publisherSpringer
dc.relation.ispartofApplied Intelligence
dc.relation.publicationcategoryArticle - International Refereed Journal - Institutional Faculty Member
dc.rightsinfo:eu-repo/semantics/openAccess
dc.snmzKA_WOS_20250406
dc.subjectGene ontology
dc.subjectNatural language processing
dc.subjectProtein-protein interactions
dc.subjectSARS-CoV-2
dc.subjectFeature selection
dc.titleComputational analysis of virus-host protein-protein interactions using gene ontology and natural language processing
dc.typeArticle
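
The second step described in the abstract, treating gene ontology terms as a word dictionary and turning each protein into a count or TF-IDF vector, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the protein names, the GO term assignments, the function names, and the TF-IDF variant (term frequency times ln(N/df)) are hypothetical and are not taken from the article, which may use a different weighting or toolkit.

```python
import math
from collections import Counter

# Hypothetical GO annotations per protein (illustrative only, not from the article).
protein_go_terms = {
    "viral_protein_A": ["GO:0005515", "GO:0016032", "GO:0005515"],
    "host_protein_B": ["GO:0005515", "GO:0006412"],
    "host_protein_C": ["GO:0006412", "GO:0005737"],
}


def build_vocabulary(docs):
    """Collect every distinct GO term into a sorted 'word dictionary'."""
    return sorted({term for terms in docs.values() for term in terms})


def count_vectors(docs, vocab):
    """Count vector: how often each vocabulary term annotates a protein."""
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        vectors[name] = [counts[term] for term in vocab]
    return vectors


def tfidf_vectors(docs, vocab):
    """TF-IDF with tf = count / len(doc) and idf = ln(N / df).

    This is one common variant, assumed here for illustration.
    """
    n_docs = len(docs)
    doc_freq = {
        term: sum(1 for terms in docs.values() if term in terms)
        for term in vocab
    }
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        total = len(terms) or 1  # guard against proteins with no annotations
        vectors[name] = [
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ]
    return vectors


vocab = build_vocabulary(protein_go_terms)
counts = count_vectors(protein_go_terms, vocab)
tfidf = tfidf_vectors(protein_go_terms, vocab)

for name in protein_go_terms:
    print(name, counts[name], [round(w, 3) for w in tfidf[name]])
```

Vectors like these could then be passed to a feature selection step and a classifier, as the abstract outlines. How the article combines the viral and host vectors into a single feature row per protein pair is not stated in this record, so that step is left out of the sketch.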
