Arabic Tweeps Dialect Prediction Based on Machine Learning Approach

Khaled Alrifai, Ghaida Rebdawi, Nada Ghneim


In this paper, we present our approach to profiling Arabic authors on Twitter based on their tweets. We treat the dialect of an Arabic author as an important trait to predict. To this end, we implemented a range of indicators, feature vectors, and machine learning classifiers, and compared their results to identify the best dialect prediction model. The best model was obtained with a Random Forest classifier using full word forms and their stems as the feature vector.
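As a rough illustration of the feature representation named above, the following minimal sketch counts full word forms together with their stems for one author's tweets. It is not the authors' implementation: the `toy_stem` function, the affix lists, and the `form=` / `stem=` feature names are hypothetical and chosen only for illustration, and a real system would rely on a proper Arabic stemmer or segmenter. The classifier itself is omitted; the resulting vectors could be passed to any Random Forest implementation.

```python
from collections import Counter

# Hypothetical affix lists, for illustration only. A real system would use a
# proper Arabic stemmer or segmenter instead of this toy rule.
PREFIXES = ("ال", "و", "ب")
SUFFIXES = ("ها", "ات", "ون", "ين")


def toy_stem(token: str) -> str:
    """Strip at most one prefix and one suffix, keeping at least 2 characters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 2:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            token = token[:-len(suffix)]
            break
    return token


def forms_and_stems(tweets):
    """Count full forms and their stems over all tweets of one author."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.split():
            counts["form=" + token] += 1
            counts["stem=" + toy_stem(token)] += 1
    return counts


def vectorize(counts, vocabulary):
    """Turn feature counts into a fixed-order numeric vector."""
    return [counts.get(feature, 0) for feature in vocabulary]
```

Keeping the two feature families under separate name prefixes lets a tree-based classifier split on either the surface form or the stem of the same word.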


Author Profiling; Arabic Dialect Detection; Machine Learning; Social Media Analysis; Text Mining



This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 2088-8708, e-ISSN 2722-2578