Developing an effective focused crawler to retrieve data of Indian-origin scientists and utilizing text classification for comparative analysis

Shivani Gautam, Rajesh Bhatia, Shaily Jain

Abstract


This article presents a focused web crawler for retrieving data about scientists of Indian origin who work abroad. The study demonstrates that web scraping can gather large amounts of data from publicly available web pages. The objective is to build a dataset of Indian-origin scientists currently employed in national laboratories overseas. Collecting such a large volume of data by manual search is impractical, so this study proposes a detailed design for a focused web crawler that gathers it automatically. We then present a comprehensive evaluation of several classification models on the newly created dataset. The evaluation indicates that the random forest model outperforms the other supervised models. On large datasets, random forest combined with the synthetic minority oversampling technique (SMOTE) and k-fold cross-validation performed better than K-nearest neighbors (KNN), support vector machine (SVM), and logistic regression (LR). On smaller datasets, by contrast, SMOTE with an 80-20 random split performed better. Overall, the random forest classifier achieved the best results, attaining a micro-average area under the curve (AUC) of 90%. These results provide a solid foundation for further work on text classification of data about Indian-origin scientists.
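The abstract does not describe the crawler's internals, so the following is only a minimal sketch of a generic best-first focused crawler, not the authors' design. It assumes a hypothetical topic vocabulary (`TOPIC_TERMS`), a toy in-memory link graph (`TOY_WEB`) in place of real HTTP fetching and HTML parsing, and a crude term-overlap relevance score with an arbitrary threshold. All names and values are illustrative.

```python
import heapq
import re

# Hypothetical topic vocabulary; the paper's actual relevance features are not
# given in the abstract.
TOPIC_TERMS = {"scientist", "laboratory", "research", "physicist", "indian"}

# Toy in-memory "web": URL -> (page text, outgoing links). This stands in for
# fetching and parsing real pages and is purely illustrative.
TOY_WEB = {
    "lab.example/home": ("National laboratory research staff directory",
                         ["lab.example/staff", "lab.example/cafeteria"]),
    "lab.example/staff": ("Staff scientist profiles: physicist, research "
                          "scientist of Indian origin",
                          ["lab.example/profile1"]),
    "lab.example/profile1": ("Senior research scientist, laboratory physicist, "
                             "Indian origin", []),
    "lab.example/cafeteria": ("Lunch menu and opening hours",
                              ["lab.example/menu"]),
    "lab.example/menu": ("Soup, salad, sandwiches", []),
}


def relevance(text):
    """Return the fraction of topic terms that occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def focused_crawl(seed, web, threshold=0.2, max_pages=10):
    """Best-first crawl that stores and expands only on-topic pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # unreachable page in the toy graph
        text, links = web[url]
        score = relevance(text)
        if score < threshold:
            continue  # off-topic: neither store the page nor follow its links
        collected.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits its parent's score as its crawl priority.
                heapq.heappush(frontier, (-score, link))
    return collected


if __name__ == "__main__":
    for url, score in focused_crawl("lab.example/home", TOY_WEB):
        print(f"{score:.1f}  {url}")
```

In this toy graph the cafeteria page scores below the threshold, so its menu link is never visited. That pruning is what distinguishes a focused crawler from an exhaustive breadth-first crawl. A real crawler would also need polite fetching, robots.txt handling, and a stronger relevance model.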


Keywords


Comparative analysis; Focused web crawler; Indian scientists database; Natural language processing; Text classification; Information retrieval; Web scraping


DOI: http://doi.org/10.11591/ijece.v14i5.pp5468-5480

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578

This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).