Different valuable tools for Arabic sentiment analysis: a comparative evaluation

Received Oct 9, 2019; Revised Aug 11, 2020; Accepted Aug 26, 2020
Arabic natural language processing (ANLP) is a subfield of artificial intelligence (AI) that aims to build applications for the Arabic language, such as Arabic sentiment analysis (ASA), which classifies the feelings and emotions expressed in a text in order to determine the writer's attitude (neutral, negative, or positive). When working on ASA, researchers often use tools without explaining the reasons for their choice, or choose a set of libraries according to their familiarity with a specific programming language. Because of the abundance of Java and Python libraries in the ANLP field, especially for ASA, we rely on these two programming languages in our research work. This paper presents an in-depth comparative evaluation of Python and Java libraries to identify the most useful ones for ASA. Based on a large body of influential work in the ASA domain, we find that NLTK, Gensim, and TextBlob are the most useful libraries for ASA in Python. Among the Java libraries, we conclude that Weka and CoreNLP are the most widely used and achieve strong results in this research domain.


INTRODUCTION
Natural language processing (NLP) is a subfield of computer science, linguistics, artificial intelligence, and information engineering concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process large quantities of natural language data. Arabic natural language processing (ANLP) aims to build software able to process Arabic linguistic data automatically for a specific application. Arabic is recognized as the fourth most used language on the Internet. It is the official language of twenty-two countries and is spoken by more than four hundred million people. It is a Semitic language characterized by its literary richness. Arabic morphology is rich, complex, and highly ambiguous, and for this reason it poses a variety of problems for NLP. ANLP has nevertheless gained significant importance, and a wide variety of applications have been built, such as sentiment analysis [1,2], machine translation, question answering, and named entity recognition. These applications must adapt to the complicated structure of Arabic [3]. The language has its own special features: for example, it has no capitalization, and the Arabic alphabet contains 29 consonants and 11 vowels. Moreover, Arabic is written from right to left, and its letters change shape depending on their position in the word.
[...] variables to be typed dynamically and add attributes to objects on the fly, facilitating rapid development [14]. Many researchers recommend this powerful programming language. For instance, Steven Bird and Edward Loper strongly recommend the use of Python in NLP projects [14], and in their work [15] they conclude that this programming language is the best, providing a large variety of benefits.
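The two language features mentioned above, dynamic typing and attributes added to objects on the fly, can be illustrated with a minimal sketch. The `Token` class and the attribute names are hypothetical and chosen only for illustration; they do not come from any NLP library.

```python
class Token:
    """Hypothetical container for a word; not part of any library."""

    def __init__(self, text):
        self.text = text


# The same name can be bound to values of different types (dynamic typing).
value = "كتاب"        # bound to a str ("book")
value = len(value)    # now bound to an int: the number of code points

# Attributes can be attached to an existing object at run time.
token = Token("كتاب")
token.polarity = 0.0      # attribute added on the fly
token.pos_tag = "NOUN"    # another one; the class itself is unchanged

print(value, vars(token))
```

This flexibility is what makes quick experiments convenient, at the cost of errors that a statically typed language would catch at compile time.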
In his paper [16], Nitin Madnani chose to employ Python because, he argues, it has many benefits over other programming languages, such as an easy-to-use object-oriented paradigm, high readability, strong Unicode support, easy extensibility, and a powerful standard library. It is very efficient and has been applied in complex and difficult NLP projects.
The Theano Development Team encourages the use of Python in [17], describing it as a flexible programming language that provides a straightforward way to interact with data and allows fast prototyping. Moreover, paper [18] offers a critical assessment of the existing Python infrastructure for new NLP tasks. Through their case study, "Automatic Aspectual Classification of Verbs in an Untagged Corpus", the authors found that Python's core libraries offer excellent coverage of essential machine learning algorithms.
Python is particularly appropriate for several reasons: it is free, simple, object-oriented, compatible with many platforms, and supported by a large number of libraries. Nevertheless, its downsides relative to other programming languages should also be kept in mind: limited execution speed and weak support for mobile computing and browsers.
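The Unicode support cited above is directly relevant to the script properties of Arabic noted in the introduction (no capitalization, right-to-left direction, position-dependent letter shapes). The following minimal sketch checks these properties with Python's standard `unicodedata` module; the example word and the choice of the letter BEH are arbitrary.

```python
import unicodedata

word = "كتاب"  # "book"; an arbitrary example word

# Arabic has no letter case, so case mapping leaves the text unchanged.
assert word.upper() == word and word.lower() == word

# Arabic letters carry the bidirectional class "AL" (right-to-left Arabic letter).
bidi_classes = {unicodedata.bidirectional(ch) for ch in word}

# The positional shapes of a letter also exist as separate "presentation form"
# code points; NFKC normalization folds them back to the base letter.
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]  # isolated, final, initial, medial BEH
normalized = {unicodedata.normalize("NFKC", form) for form in forms}

print(bidi_classes)
print([unicodedata.name(form) for form in forms])
print(normalized == {"\u0628"})  # all four shapes map to U+0628 ARABIC LETTER BEH
```

Normalizing presentation forms in this way is a common preprocessing step, since text copied from PDFs or legacy systems may contain them instead of the base letters.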

Python libraries
It is useful first to present the Python libraries that have proven most useful in the domain of ASA: NLTK, TextBlob, and Gensim.
a. NLTK: it is a leading platform for NLP. A set of core modules (libraries and programs) offers basic data types that are used throughout the toolkit. NLTK is a good starting point for researchers and students in NLP because of its numerous benefits, which is why it has been called "a wonderful tool for teaching and working in computational linguistics using Python" and the "mother" of all NLP libraries. A significant advantage of NLTK is that it is entirely self-contained, and it provides convenient functions that can be used as building blocks for common NLP tasks. This group of applications and libraries from the University of Pennsylvania has gained considerable traction in Python-based SA systems since its conception in 2001.
b. TextBlob: it is a Python library for processing textual data. It provides a simple API for basic NLP tasks such as sentiment analysis, part-of-speech tagging, classification, and translation. The sentiment function of TextBlob returns two properties, polarity and subjectivity. Polarity is a float in the range [-1, 1], where 1 denotes a positive statement and -1 a negative one. Subjectivity is a float in the range [0, 1], where 0 is very objective and 1 is very subjective; subjective sentences usually express personal opinion, judgment, or emotion, whereas objective sentences convey factual information.
c. Gensim: it is an open-source library for unsupervised topic modeling and NLP that uses modern statistical machine learning. It is considered a robust vector space modeling tool implemented in Python. In contrast to NLTK, Gensim is well suited to processing massive datasets. It was primarily built for document similarity estimation, which remains the most developed functionality in the package. It supports three main NLP tasks: retrieving semantically similar documents, scalable statistical semantics, and analyzing plain-text documents for semantic structure [18]. Gensim includes streamed, parallelized implementations of algorithms such as fastText, word2vec, and doc2vec, which are widely used in Arabic sentiment analysis. Its highly optimized native implementation of Google's word2vec models makes it a strong contender for inclusion in an SA project, either as a core framework or as a library resource.
In Table 1, we highlight the advantages and disadvantages of the most used Python libraries in Arabic sentiment analysis.
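To make the polarity and subjectivity scores described for TextBlob concrete, the following minimal lexicon-based sketch returns a pair of values in the same ranges. It is not TextBlob's implementation or API: the tiny lexicon, its scores, and the function name `score_sentiment` are assumptions made purely for illustration.

```python
from typing import NamedTuple


class Sentiment(NamedTuple):
    polarity: float      # in [-1, 1]
    subjectivity: float  # in [0, 1]


# Hypothetical toy lexicon: word -> (polarity, subjectivity). Values are invented.
LEXICON = {
    "ممتاز": (1.0, 1.0),    # "excellent"
    "جيد": (0.5, 0.5),      # "good"
    "سيئ": (-0.5, 0.5),     # "bad"
    "فظيع": (-1.0, 1.0),    # "terrible"
}


def score_sentiment(text: str) -> Sentiment:
    """Average the lexicon scores of the known words; unknown words are ignored."""
    hits = [LEXICON[word] for word in text.split() if word in LEXICON]
    if not hits:
        return Sentiment(0.0, 0.0)  # no opinion words: neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return Sentiment(polarity, subjectivity)


print(score_sentiment("الفيلم ممتاز"))               # one strongly positive word
print(score_sentiment("الطعام جيد لكن الطقس فظيع"))  # mixed opinion
```

Exact string matching is a deliberate simplification here: because of Arabic's rich morphology, a realistic system would need stemming or lemmatization before any lexicon lookup.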

Arabic sentiment analysis using Java
Thanks to its features, Java is a powerful programming language for NLP. With just-in-time compilation, a Java application can process large quantities of data quickly. Java's multithreading support is particularly valuable for heavily loaded applications: an NLP task can be divided among several threads, reducing processing time.
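The idea of dividing an NLP task among several threads can be sketched as follows. For consistency with the other examples, the sketch uses Python's standard `concurrent.futures` module rather than Java (where `java.util.concurrent.ExecutorService` plays a similar role). The `count_tokens` worker and the sample documents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_tokens(document: str) -> int:
    """Hypothetical per-document task: count whitespace-separated tokens."""
    return len(document.split())


documents = [
    "هذا كتاب جيد",
    "الخدمة سيئة جدا",
    "فيلم ممتاز",
]

# Each document is handed to a worker thread; map() preserves the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_tokens, documents))

print(counts)
```

Note that in CPython, threads mainly help with I/O-bound work; CPU-bound processing usually needs separate processes to run in parallel, which is one practical difference from Java's threading model.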
NLP relies on a wide variety of stored linguistic data. Java makes it possible to store and update such data without changing any code: the Java Database Connectivity (JDBC) API serves as a bridge between the Java application and the database, so linguistic knowledge is kept in the database and can be updated without changing a single line of Java code. In [19], the authors strongly recommend the Java programming language, and the authors of [20] found Java to be the best and most useful programming language. Java has several benefits: it is simple, secure, interpreted, distributed, object-oriented, platform-independent, and multi-threaded. According to Sun Microsystems, Java has the following essential strengths: security, portability, ease of use, robustness, and distributed processing across the Web. There is, however, room for improvement, as Java still has some disadvantages: it can be significantly slower and more memory-intensive than natively compiled languages, it is essentially a single-paradigm language, and the default look and feel of GUI applications written with the Swing toolkit differs markedly from that of native applications.
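The separation between code and linguistic knowledge described above can be sketched as follows: the lexicon lives in a database, so extending it requires no change to the program. The sketch uses Python's standard `sqlite3` module in place of JDBC; the table name, columns, and entries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, polarity REAL)")
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?)",
    [("جيد", 0.5), ("سيئ", -0.5)],
)


def lookup(word: str) -> float:
    """Return the stored polarity of a word, or 0.0 if it is not in the lexicon."""
    row = conn.execute(
        "SELECT polarity FROM lexicon WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else 0.0


before = lookup("ممتاز")  # unknown word, so the default 0.0 is returned

# The linguistic knowledge is extended in the database; lookup() is unchanged.
conn.execute("INSERT INTO lexicon VALUES (?, ?)", ("ممتاز", 1.0))
after = lookup("ممتاز")

print(before, after)
```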

Java libraries
In this section, we present the most powerful Java libraries for ASA: Weka, CoreNLP, and Gate.
a. Stanford CoreNLP: it offers a set of human language technology tools. It is a Java annotation pipeline framework that provides most of the common core NLP steps, from tokenization through to co-reference resolution [21]. Its purpose is to make it simple to apply a range of linguistic analysis tools to a text, and it is built to be highly flexible and extensible. English is the best-supported language, but other languages, such as Arabic, German, Chinese, Spanish, and French, are also available. Its features, relative ease of implementation, dedicated SA tools, and strong community support make CoreNLP a serious contender for production use, even if its Java-based architecture may entail some extra engineering and overhead in certain circumstances. Stanford CoreNLP can also be used from Python, since several packages and interfaces exist for this purpose (independent of NLTK).
b. Weka: it is open-source software available under the GNU General Public License. It is a suite of machine learning software written in Java and developed at the University of Waikato, New Zealand. The Weka workbench includes a collection of algorithms and visualization tools for predictive modeling and data analysis, with graphical user interfaces for easy access to this functionality. Weka has been used to perform sentiment classification in various fields, and a large number of studies and papers have used it for SA.
c. Gate: it is an open-source, cross-platform Java software toolkit capable of addressing a wide range of text processing problems. It contains an information extraction system, ANNIE (A Nearly-New Information Extraction system), which is a group of modules comprising a named entities transducer, a part-of-speech tagger, a gazetteer, a tokenizer, a co-reference tagger, and a sentence splitter. The library supports various languages: Arabic, English, French, German, Chinese, Italian, Bulgarian, Romanian, Hindi, Cebuano, Danish, and Russian. Some Gate plugins, such as SEAS and SAGA, are very useful for Arabic sentiment analysis.
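The modular design shared by annotation pipelines such as CoreNLP and ANNIE can be illustrated with a minimal sketch in which each stage adds annotations to a shared document. This is not the GATE or CoreNLP API, and the stage order is simplified; the stage functions, the dictionary layout, and the tiny gazetteer are hypothetical.

```python
import re


def sentence_splitter(doc):
    # Split on sentence-final punctuation, including the Arabic question mark.
    parts = re.split(r"[.!?؟]+", doc["text"])
    doc["sentences"] = [part.strip() for part in parts if part.strip()]
    return doc


def tokenizer(doc):
    # Whitespace tokenization of each sentence (a deliberate simplification).
    doc["tokens"] = [sentence.split() for sentence in doc["sentences"]]
    return doc


GAZETTEER = {"القاهرة": "LOCATION", "دمشق": "LOCATION"}  # hypothetical entries


def gazetteer(doc):
    # Tag every token that appears in the lookup list.
    doc["entities"] = [
        (token, GAZETTEER[token])
        for sentence in doc["tokens"]
        for token in sentence
        if token in GAZETTEER
    ]
    return doc


def run_pipeline(text, stages):
    """Pass one document through the stages in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "زرت القاهرة أمس. هل تعرف دمشق؟",
    [sentence_splitter, tokenizer, gazetteer],
)
print(result["sentences"])
print(result["entities"])
```

Because every stage has the same signature, a new module (for example a sentiment tagger) can be appended to the list of stages without touching the existing ones, which is the property that makes such pipelines easy to extend.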

COMPARATIVE STUDY OF ASA LIBRARIES
In our in-depth comparative study, we select the group of libraries that best meets our needs, based on several aspects and levels:

Comparative study of the most potent ASA libraries based on the literature
In Table 2, we highlight characteristics of Arabic sentiment analysis libraries and well-known works from the literature. According to the literature, the NLTK, Weka, Gensim, TextBlob, and Stanford CoreNLP libraries are more beneficial than other well-known libraries in the field of ASA, and we found more articles that adopted NLTK, Weka, TextBlob, and Gensim than Stanford CoreNLP and Gate.

Comparative study of open software libraries based on the community on GitHub results
GitHub, which offers both free accounts and plans for private repositories, is commonly used to host open-source projects. It is the largest host of source code in the world. Because the figures on GitHub change constantly, we report the date on which they were retrieved (10/06/2020). Table 3 shows the GitHub results, from which we deduce that Gensim and NLTK are the most widely used, followed by CoreNLP, TextBlob, Weka, and lastly GATE.

Comparative study of the open software libraries based on multiple criteria
In Table 4 (see appendix), we present various criteria for the NLP tools, such as the documentation, the characteristics, and the supported treatments of each library, i.e., the Python libraries NLTK, Gensim, and TextBlob and the Java libraries CoreNLP, Weka, and GATE. We compare these libraries in order to reach a conclusion on which of them best meet our needs.

RESULTS AND DISCUSSION
In this comparative study, we address two major issues. The first is that many researchers are unsure which programming language to use for modern ANLP tasks, especially in the field of Arabic sentiment analysis. The second is that the large variety of NLP libraries makes it very hard for many researchers to select the set of libraries that best meets the needs of their ASA research projects. As a result, they use ANLP libraries in their ASA research projects without justifying their choice. Both issues are discussed in more detail below:

Selecting the suitable programming language
Among the various programming languages (such as C++, R, Perl, Prolog, and Lisp), we selected Java and Python for our study. This choice is justified by their broad popularity, usefulness, and importance for current ANLP tasks, especially in the Arabic sentiment analysis domain. Moreover, these two programming languages have a large variety of powerful libraries in the ASA field.

Choosing an adequate library for Arabic sentiment analysis project
Given the diversity of available NLP libraries, most researchers use various libraries in their research projects without explaining the reasons for their choice. Based on our review of the literature, we focus on the most potent and useful ASA libraries, namely NLTK, Gensim, TextBlob, CoreNLP, Weka, and Gate. The choice of library depends on the specific sentiment analysis problem at hand, and each library can be used in various scenarios. We give a general summary of them here, in the hope that it helps you make the right choice for your problem:
a. NLTK: if you can program in Python, NLTK is a smart choice, as it includes functionality of the Stanford CoreNLP and Weka tools. In addition, it gives easy access to lexical resources such as WordNet, which are often indispensable in the domain of ASA. Like CoreNLP, NLTK provides wrappers for many programming languages and comes with a variety of resources.
b. Gensim: its highly optimized native implementation of Google's word2vec models makes it a strong candidate for inclusion in a sentiment analysis project, either as a library resource or as a core framework. In contrast to NLTK, Gensim is a good option for processing massive datasets. On the other hand, it does not support as many NLP tasks as NLTK does.
c. TextBlob: it is built on NLTK and Pattern and has an excellent API for all the common NLP tasks. It is a practical library focused on everyday usage and is well suited to initial prototyping in almost every NLP project. Unfortunately, it inherits the low performance of NLTK and is therefore not suitable for large-scale production use. Many researchers consider TextBlob one of the Python libraries of choice for sentiment analysis.
d. Stanford CoreNLP: it is helpful if you need part-of-speech categories, co-reference, or named entities in text, which have been employed as potential features by the sentiment analysis research community. Stanford CoreNLP is one of the most potent NLP libraries because it is easy to understand. Compared to other libraries, CoreNLP is easy to set up and run: users do not need to go through complex installation procedures and need only a little background knowledge of Java to get started.
e. Weka: it is useful if the data already consist of data points that each hold a feature vector, in which case the tool can be used to cluster the data. If gold-standard outputs are also available for the data, classifiers can be built. It is simple to use, accessible through a GUI, and highly configurable.
f. Gate: it is advantageous for building a pipeline. Developers contribute language analysis modules for various languages that can be plugged into your pipeline. If you have a new approach, you can write a customized module in Java and plug it into the pipeline to obtain a complete system.
In conclusion, each library has its advantages for ANLP tasks, and each was built to meet the researcher's purposes. Our findings fall into two main parts:
a. The first concerns the ANLP programming languages Python and Java, both of which are very popular in the ANLP domain. We recommend Python because it is less complicated than Java and has more powerful and valuable ANLP libraries; our comparative study shows that the most valuable, robust, and widely used ASA libraries (NLTK, Gensim, and TextBlob) belong to Python. For this reason, we will adopt Python for our ASA research project.
b. The second concerns the ANLP libraries, all of which are very useful. According to the literature and a large body of significant work in the domain of ASA, we conclude that NLTK, Gensim, and TextBlob are the most used libraries for ASA in Python because they have numerous advantages over other ANLP libraries. As for the Java ASA tools, we find that Weka and CoreNLP are the most used and best known, and they achieve strong results in this field.

CONCLUSION
Because of their popularity and the abundance of their libraries for the ANLP domain, we selected the Java and Python programming languages for our comparative study. In this work, we described a variety of ANLP tools that are considered the most powerful and widely used. However, tools in other programming languages could also be very helpful. We have also evaluated the libraries using several aspects and multiple criteria. It can be concluded that each programming language has its benefits, that each library has its own characteristics for new ANLP tasks, and that each was built to meet the researcher's purposes. It is therefore difficult to select the best Arabic NLP libraries, because no single aspect or criterion suffices. The selection of the most suitable libraries depends on the research project and on which part of the ANLP field is concerned. For this reason, we relied on our own work in the domain of ASA to select its most potent and useful libraries.

APPENDIX
Table 4
Stanford CoreNLP (Java)
- An integrated NLP tool with a wide range of grammar analysis tools;
- A robust annotator for arbitrary texts, widely applied in production;
- A regularly updated package, with the highest quality text analytics support for several major (human) languages;
- Available APIs for most major modern programming languages;
- Ability to run as a simple web service.
Supported treatments: sentiment analysis, information extraction, named entity recognition, part-of-speech tagging, co-reference resolution, parsing, bootstrapped pattern learning.
Weka (Java)
- Portable and simple to apply.
- Suited to developing new machine learning schemes
- Latest trends in artificial intelligence
- Free online courses available
- Extremely resourceful books and publications available
- Highly educated, skilled and committed professors
Supported treatments: sentiment analysis, data preprocessing, clustering, regression, classification, visualization, and feature selection.
Gate (Java)
- SEAS (Gate plugin): a set of processing and linguistic resources, written in Java, developed to run sentiment and emotion analysis over text using the GATE platform. Because of the nature of GATE, the text format should be plain or XML. The sentiment analysis modules are executed embedded inside SEAS.
- SAGA (Sentiment and Emotion Analysis integrated into GATE): a set of processing and linguistic resources, written in Java, developed to run sentiment and emotion analysis over text using the GATE platform. SAGA is distributed as a GATE plugin.