AgroSupportAnalytics: big data recommender system for agricultural farmer complaints in Egypt

The world's agricultural needs are growing in step with its population. Farmers play a vital role in society by helping to meet our basic food needs, so they need support to continue their work, even in difficult times such as the coronavirus disease (COVID-19) outbreak, which brought strict regulations such as lockdowns, curfews, and social distancing. In this article, we propose a recommender system that gives advice, support, and solutions for farmers' agriculture-related complaints (or queries). The proposed system uses latent semantic analysis (LSA) to find the key semantic features of the words used in agricultural complaints and their solutions. It also uses the support vector machine (SVM) algorithm with Hadoop to classify the large agriculture dataset over the Map/Reduce framework. The results show that semantic-based classification and filtering methods can improve the recommender system. Our proposed system outperformed the existing interest recommendation models with an accuracy of 87%.


INTRODUCTION
It is expected that big data applications will in future be widely used by farmers, experts, and others across the agricultural industry [1]. However, despite the huge amount of data already generated by thousands of farms in Egypt each year, the impact of big data is still limited. While the variety, velocity, and volume of data generated in the agricultural process are available, the benefits of aggregating and analyzing that data and distilling value-creating decision support tools from it remain in the early phases [2]. Agriculture plays an essential role in the country's economy. However, during the coronavirus pandemic, social distancing made it hard for farmers to contact agricultural specialists and get suitable solutions to agricultural problems according to crop type [3]. Eating healthy food, which is essential for energy, also helps the body resist the disease. The shortage of support for farmers in achieving good agricultural practices and prevention methods is another factor that hinders food productivity. Farmers need quick advice on plant diseases, seed patterns, and prevention methods to face environmental changes. However, farmers' access to such information is severely limited because the support systems are incompatible, unreliable, and predominantly uncertain, so the advice delivered can be incorrect [1], [4].

Int J Elec & Comp Eng
ISSN: 2088-8708  AgroSupportAnalytics: big data recommender system for agricultural farmer … (Esraa Rslan) 747
Farmers submit their problems, and AgroSupportAnalytics [5] recommends a suitable solution for each complaint. AgroSupportAnalytics aims to solve the problem of supporting farmers in Egypt and providing them with recommendations. Agricultural problems are spread over 4,242 villages in Egypt [6]. The problems are collected as Arabic text and submitted to one of the 198 centers spread over the country that support farmers in their regions. The agriculture complaints and solutions are stored on a public cloud that hosts analytics observations and support toolkits [7]. The proposed AgroSupportAnalytics system applies a support vector machine (SVM) classification method to the agricultural dataset with Hadoop Map (M)/Reduce (R) in a parallel environment. The system classifies the agriculture dataset and uses latent semantic analysis (LSA) for semantic similarity [8], [9]. The Map/Reduce approach provides a fast implementation of the classification steps on large datasets and is a powerful big data analysis tool. The dataset is saved on the cloud and contains 10,000 complaints. In addition, the proposed system uses LSA [10] to measure the semantic similarity between the farmer's problem and the agriculture problems available in the dataset.
With the increase in text data available in different languages, many research papers focus on semantic similarity measures across languages. In work on semantic similarity in Arabic-English texts, the authors of [11], [12] used latent semantic indexing to compute the semantic similarity between an Arabic text and an English one. Alzahrani [13] introduced two semantic similarity methods for cross-language Arabic-English sentences (CLAES). In the first method, a dictionary translator translates the Arabic sentence into English, and the semantic similarity is then calculated by applying translation similarity techniques. In the second method, machine translation is used for the Arabic sentence. Potthast et al. [14] discussed cross-language plagiarism detection for Arabic-English documents. First, the system translates the text by retrieving all the available translations of synonyms for a word from WordNet [15], and then it applies keyphrase extraction. Finally, a combination of monolingual measures (cosine similarity, N-gram, and longest common subsequence (LCS)) is calculated to return similar sentences. These methods achieve good results for languages that are close to each other because of shared root words. However, measuring semantic similarity can be more complicated when the languages differ more. Dai and Huang [16], for example, computed semantic word similarity for applications in a cross-language semantic space. They measured the similarity between two texts, one in Chinese and the other in English. Zou et al. [17] introduced a technique that extracts the main features of mono- and cross-lingual semantic relations across different languages. They proposed a method for storing bilingual embeddings between Chinese and English from a large corpus; machine translation is also used to align words.
Processing large text collections is a challenging task, especially in text analysis. Map/Reduce is a distributed and parallel framework that supports several tasks, such as text processing, dividing data and computation loads across a cluster, text clustering, information extraction, storing and fetching unstructured data [18], natural language processing, text summarization, and sentence similarity [19], [20]. Text similarity is a particularly challenging problem in text analysis. Many techniques have been proposed for handling large text in automatic text summarization. Nagwani [21] introduced a Map/Reduce framework for multi-document text summarization. Much research is concerned with implementing techniques, algorithms, and approaches in parallel environments like Hadoop and Cloudera [22]-[24]. Hadoop is an Apache framework used to analyze massive datasets on clusters containing many machines, using the Map and Reduce approach. Hadoop Map/Reduce allows applications to run in parallel environments. Many papers have proposed running the support vector machine algorithm on parallel machines.
In the proposed system, the farmer first writes the agriculture complaint in Arabic script, and Google's machine translation is used to translate the complaint from Arabic into English. Second, the complaint is analyzed using data analytics techniques to retrieve term frequencies, and it is assigned to a problem class using SVM in the Map/Reduce technique. Third, a recommended answer is returned by searching for similar complaints in the historical agriculture dataset. The recommended response uses LSA [25] to measure the semantic similarity of cross-language Arabic-English sentences. Two weighting methods are used in the LSA algorithm: the first uses term frequency (TF), while the second uses term frequency-inverse document frequency (TF-IDF). LSA can achieve much better results than plain vector space models. Because it only involves decomposing the term-document matrix, it is faster than other dimensionality reduction methods. However, when the data representation is dense, it is hard to index words based on particular keywords. LSA works well on a dataset with diverse topics and can handle synonymy problems based on the dataset. Applying LSA to new data is easier and faster than with other methods, as the matrix in topic space only has to be multiplied by the TF-IDF vector to get the latent vector of a document. The rest of the paper is organized as follows: section 2, research method, explains our proposed system, LSA with SVM classification in Map/Reduce. Section 3 presents the results and discussion. Finally, section 4 presents the conclusion.

RESEARCH METHOD
This section presents the AgroSupportAnalytics system. It has five main steps: machine translation, preprocessing, SVM Map/Reduce classification, feature extraction, and finally LSA. The system is presented in Figure 1.

Machine translation
Farmers write their complaints in Arabic script. These complaints need to be translated into English because the agriculture dataset stored on the cloud is in English. Google's Cloud Translator API is used to translate the Arabic complaints into English. Google Cloud Translate has high accuracy and is also considered more reliable [26].

Preprocessing
Farmers may describe their problems in detail, including the plant age, planting method, watering method, disease description, and soil type. Such problems may be written in an undesired form with unclear meaning, or may contain uninformative words that affect the text processing in later phases of the model. Preprocessing is important for bringing the farmer query and the historical complaint/solution data into a common form so that both can be processed in the same way [27]. Data preprocessing includes four processes: tokenization, stop word removal, normalization, and lemmatization.

Tokenization
Tokenization breaks sentences up into pieces, such as keywords, that are called tokens. Words are separated by delimiters such as white space, semicolons, commas, and quotation marks. During tokenization, some characters such as punctuation marks, special characters, and white space are removed [28]. This step reduces the amount of text to process and improves system performance.
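The tokenization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token pattern (runs of letters, digits, or apostrophes) and the name `tokenize` are assumptions made for the example.

```python
import re

# Assumed token pattern: runs of letters, digits, or apostrophes.
# Everything else (white space, commas, semicolons, quotes) acts as a separator.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Split text into tokens, dropping punctuation and white space."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Appearance of brown spots, on the olive crop; leaves turn yellow."))
```

A regular expression keeps the sketch dependency-free; a library tokenizer could be substituted without changing the later steps.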

Stop words removal
Stop word removal eliminates words that carry little meaning on their own. Stop words must be removed from farmer queries since they have no effect on the meaning of the sentences. Examples of stop words in English are "am", "is", "are", "the", and "a". We used the WordNet database to get the list of all stop words.
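A minimal sketch of stop word filtering is given below. The short `STOP_WORDS` set is a hypothetical list written for this example only; it is not the list the authors obtained.

```python
# Hypothetical, deliberately short stop-word list for illustration only.
STOP_WORDS = {"am", "is", "are", "the", "a", "an", "of", "on", "in", "and"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is a stop word, keeping the order."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


if __name__ == "__main__":
    query = ["Appearance", "of", "brown", "spots", "on", "the", "olive", "crop"]
    print(remove_stop_words(query))
```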

Normalization
This task is important for noisy texts. It removes unwanted data or characters from the query and the historical dataset, such as repeated words and extra spaces, and converts the text to lowercase. Examples of such characters are "÷ × ؛ < > _ ( ) , ] ? .". Misspelled words are also mapped to their standard form; for example, the word "croop" is transformed into "crop".
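The sketch below illustrates one possible normalization routine. It rests on several assumptions that are not stated in the paper: the `VOCABULARY` list is hypothetical, misspellings are repaired with the standard library's `difflib.get_close_matches`, and only immediately repeated words are collapsed.

```python
import difflib
import re

# Hypothetical domain vocabulary used to repair misspellings.
VOCABULARY = ["crop", "cotton", "wheat", "tomato", "mango", "leaf", "yellow"]

# Characters treated as noise, taken from the examples listed in the text.
NOISE_CHARS = '."?],)(_><؛×÷['
NOISE = re.compile("[" + re.escape(NOISE_CHARS) + "]")


def normalize(text, cutoff=0.8):
    """Lowercase, strip noise characters, collapse spaces and immediate
    word repeats, and map near-misses onto the vocabulary."""
    cleaned = NOISE.sub(" ", text.lower())
    result = []
    for word in cleaned.split():
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
        word = match[0] if match else word
        if not result or result[-1] != word:  # drop immediate repeats
            result.append(word)
    return " ".join(result)


if __name__ == "__main__":
    print(normalize("The  CROOP croop has (yellow) yellow leaf?"))
```

The `cutoff` value controls how aggressive the spelling repair is; a lower value corrects more words but risks mapping valid words onto the wrong vocabulary entry.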

Lemmatization
Lemmatization finds the root (base form or lemma) of a word from its inflected forms; for example, "appearance" and "appearing" share the root "appear", and lemmatization maps both "mice" and "mouse" to "mouse". We used an NLTK lemmatizer with POS tags.

Classification
This phase groups similar problems to prepare them for text similarity and to ensure that all similar problems in the dataset take part in the similarity process as a group. The dataset is classified into problem categories such as pest, weed, diseases, and irrigation. Using the Map/Reduce framework in the proposed system improves text processing speed and scalability compared to traditional systems. In the classification process, the SVM Map/Reduce approach is applied to the agriculture dataset and the farmer query. Each problem in the training set drawn from the dataset belongs to one of the main categories. The SVM algorithm builds a model that assigns new farmer problems to one category or another, working as a non-probabilistic binary linear classifier. The SVM model represents the agriculture problems as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New queries are then mapped into the same space, and their category is predicted based on which side of the gap they fall on. This classification process is parallelized across several machines. SVM is one of the most important classification algorithms and works effectively on many high-dimensional tasks. Its advantages are that accuracy is mostly high, results remain reliable when the training classes contain errors, and the learned function is fast to evaluate. However, the SVM algorithm might have a long training time because it is not easy to learn the function weights.
Hadoop Map/Reduce is applied to classify the farmer query according to the problem class it belongs to, in order to find a suitable solution. The innovation of Hadoop is that it needs no expensive tools. Instead, it distributes large amounts of data across several machines with high reliability and scalability for data storage and processing. Map/Reduce is a core concept of big data: a programming method that allows a large set of agriculture problems to be divided among multiple machines in a Hadoop cluster. After this step, each complaint is classified according to its problem category (weeds, irrigation, pest, or diseases). The classification phase helps to improve the performance of the semantic similarity process and the efficiency of the system.
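To make the idea of a binary linear SVM concrete, the following sketch trains one by sub-gradient descent on the regularized hinge loss. It is an illustration under stated assumptions rather than the classifier used in the paper: the four features (counts of the hypothetical words "water", "dry", "insect", "worm"), the toy data, and the hyperparameters are all invented for the example, and handling four categories would need an extra scheme such as one-vs-rest, which the paper does not specify.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w and bias b by sub-gradient descent on the
    regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:  # inside the margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy data: counts of the hypothetical words (water, dry, insect, worm).
# +1 stands for the irrigation category, -1 for the pest category.
SAMPLES = [[2, 1, 0, 0], [1, 2, 0, 0], [1, 1, 0, 0],
           [0, 0, 2, 1], [0, 0, 1, 2], [0, 0, 1, 1]]
LABELS = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    w, b = train_linear_svm(SAMPLES, LABELS)
    print(predict(w, b, [1, 0, 0, 0]), predict(w, b, [0, 0, 0, 1]))
```

The update only fires for samples that violate the margin, which is what pushes the separating gap to be as wide as the data allow.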

SVM in parallel network environment
Each complaint in the dataset is transformed into one vector in the parallel environment in two phases: Map (M) and Reduce (R). The input of the Map phase is one complaint, and its output is the set of vector components corresponding to the sentences of the text. The input of the Reduce phase is the output of the Map phase, that is, the many portions of a vector. In the Reduce phase, those vector components are merged into one vector that corresponds to the sentences in the text.

Hadoop map (M)
The n vectors of one complaint are the input to the Hadoop Map (M) phase. The SVM algorithm is then applied to classify each of the n vectors of a text complaint in the testing dataset. The output is the assignment of each vector to the weed vector set, pest vector set, diseases vector set, or irrigation vector set.

Hadoop reduce (R)
The results of classifying the n vectors into problem category vector groups in the Hadoop Map (M) phase are the input to the Hadoop Reduce (R) phase in the parallel network environment. In the Reduce (R) phase, the category of the complaint in the testing dataset (corresponding to the n vectors) is then determined.
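The Map and Reduce phases described above can be imitated in plain Python as a single-machine sketch. Two points are assumptions made for illustration: `classify_vector` is a hypothetical stand-in for the trained SVM, and the Reduce step combines the per-vector results by majority vote, a rule the paper does not state explicitly. A real deployment would run these functions as Hadoop jobs.

```python
from collections import Counter
from functools import reduce

CATEGORIES = ("weed", "pest", "disease", "irrigation")


def classify_vector(vector):
    """Hypothetical stand-in for the trained SVM: picks the category
    whose position holds the largest score in the vector."""
    return CATEGORIES[max(range(len(vector)), key=vector.__getitem__)]


def map_phase(vectors):
    """Map: emit one (category, 1) pair per sentence vector."""
    return [(classify_vector(vector), 1) for vector in vectors]


def reduce_phase(pairs):
    """Reduce: sum the counts per category and return the majority label."""
    def merge(counts, pair):
        counts[pair[0]] += pair[1]
        return counts

    totals = reduce(merge, pairs, Counter())
    return totals.most_common(1)[0][0]


if __name__ == "__main__":
    complaint_vectors = [[0.1, 0.2, 0.6, 0.1],
                         [0.0, 0.1, 0.8, 0.1],
                         [0.1, 0.5, 0.3, 0.1]]
    print(reduce_phase(map_phase(complaint_vectors)))
```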

Word Extraction
The feature extraction process is widely used in text-similarity applications. It counts the occurrences of important word features in a text to construct the word vector. We use the features extracted from a group of sentences to give a value for each problem in the dataset, which helps to construct a term frequency matrix. In this paper, the algorithms are explained together with their time complexity. Algorithm 1 constructs the sentence vectors from word vectors with TF-IDF weights. The main procedure of this algorithm is the first inner for loop (lines three to eleven), in which the sentence vector is built for every sentence in the farmer problem. Let $N$ be the number of words in the problem, $M$ the number of words in a sentence, and $|P|$ the number of sentences in the whole problem. The algorithm calculates the word vector for each word in all sentences of the problem, so the execution time is $O(M \cdot |P|)$. When TF-IDF is used to weight the word vectors, the TF-IDF score of every word is calculated in the same loop. The algorithm visits each word in a sentence only once, so with $N = M \cdot |P|$ the time complexity is $O(N)$.

Algorithm 1. Build sentence vector
Input: sentences of problem P.
Output: vectors of the sentences Pv.
Begin
1. Step 1: For every sentence si in P do
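Since only the opening of Algorithm 1 is available here, the following sketch shows one way sentence vectors with TF-IDF weights could be built. It is an assumption-laden illustration, not a transcription of Algorithm 1: the function name, the dense vector layout over a sorted vocabulary, and the natural logarithm are choices made for the example. The dense layout is kept for readability and costs more than the single pass over the words discussed in the text.

```python
import math
from collections import Counter


def build_sentence_vectors(sentences):
    """Return (vocabulary, vectors): one TF-IDF weighted vector per
    tokenized, non-empty sentence, ordered by the sorted vocabulary."""
    vocabulary = sorted({word for sentence in sentences for word in sentence})
    # Number of sentences that contain each word.
    doc_freq = Counter(word for sentence in sentences for word in set(sentence))
    total = len(sentences)
    vectors = []
    for sentence in sentences:
        counts = Counter(sentence)
        vector = []
        for word in vocabulary:
            tf = counts[word] / len(sentence)
            idf = math.log(total / doc_freq[word])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


if __name__ == "__main__":
    problem = [["brown", "spot", "olive"], ["yellow", "leaf", "olive"]]
    vocabulary, vectors = build_sentence_vectors(problem)
    print(vocabulary)
    print(vectors)
```

A word that occurs in every sentence (here "olive") receives weight zero, which is the intended effect of the IDF factor.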

Applying LSA
Word vectors are created after preprocessing and classifying the farmer text and the agriculture dataset. We then apply LSA [29], [30] to calculate the semantic similarity between the farmer text and the available agriculture dataset. LSA is a powerful corpus-based technique for calculating semantic similarity. It consists of three steps: input matrix creation, singular value decomposition (SVD), and sentence selection.

Term-sentence matrix
An input matrix is built for the farmer's complaint and the historical dataset. Every row in the matrix represents a word or term in the farmer's complaint and the agriculture dataset, and every column represents a complaint. The cell value is the weight at the intersection of a term and a complaint. Two weighting schemes are used to fill the cell values: TF or TF-IDF. We then choose the sentences with important attributes using the most frequent terms. TF-IDF is one of the methods used to rank the most frequent terms. It is a statistical method that measures how important a term is in a sentence. The first part is TF, which is the same for all term weighting methods and is calculated as shown in (1):

$$TF_{ij} = \frac{n_{ij}}{N_j} \tag{1}$$

where $n_{ij}$ is the number of times the $i$-th word occurs in the $j$-th complaint and $N_j$ is the complaint size (the number of words in the complaint). The second part of the term weighting is calculated once.
In TF-IDF, the inverse document frequency (IDF) is derived from the number of complaints in which a term occurs. The cells are filled with the TF-IDF weight of term $i$ in complaint $j$ according to (2):

$$w_{ij} = TF_{ij} \times IDF_i \tag{2}$$

where $TF_{ij}$ is the frequency of term $i$ in complaint/problem $j$ and $IDF_i = \log\left(|D| / n_i\right)$, where $|D|$ is the number of complaints in the dataset and $n_i$ is the number of complaints containing the $i$-th word.
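As a worked illustration of the two weighting steps, consider a hypothetical term that occurs twice in a complaint of eight words and appears in 100 of the 10,000 complaints in the dataset. The counts 2, 8, and 100 are invented for this example, and a base-10 logarithm is assumed because the text does not state the base; the IDF is written with a single index since it depends only on the term.

```latex
\begin{align*}
TF_{ij} &= \frac{n_{ij}}{N_j} = \frac{2}{8} = 0.25, \\
IDF_i &= \log_{10}\frac{|D|}{n_i} = \log_{10}\frac{10000}{100} = 2, \\
w_{ij} &= TF_{ij} \times IDF_i = 0.25 \times 2 = 0.5 .
\end{align*}
```

A term found in every complaint would have $IDF_i = \log_{10} 1 = 0$, so its cell weight would vanish regardless of how often it occurs in a single complaint.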

Singular value decomposition
Singular value decomposition (SVD) is an algebraic matrix factorization that plays an important part in identifying the relationships between words and sentences. It enhances the term-sentence matrix and identifies the relations between terms and complaints [20]. SVD decomposes the term-sentence matrix into three matrices that capture its significant attributes. After the input matrix $X$ is created, it is expressed as the product of three matrices: two of them hold the eigenvectors associated with the rows and the columns, and the third is a diagonal matrix. The matrix is calculated from the TF-IDF word weights, since TF-IDF is the primary metric for extracting the most descriptive terms in a sentence and can be used to compute the similarity between two sentences. The SVD can be written as (3):

$$X = S \, \Sigma \, U^{T} \tag{3}$$

where the columns of $S$ are the eigenvectors of $XX^{T}$, $\Sigma$ is the diagonal matrix holding the square roots of the eigenvalues of $X^{T}X$, and the columns of $U$ (the rows of $U^{T}$) are the eigenvectors of $X^{T}X$. SVD reduces the number of columns while retaining the number of rows, keeping the similarity structure between the words. Every word is represented by the vector of values in its row, and the cosine similarity between these vectors is measured in the next phases.
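The dimensionality reduction step can be written out as follows, using the standard LSA formulation. The number of retained dimensions $k$, the matrix sizes $m$ (terms) and $n$ (complaints), and the query vector $q$ are symbols introduced here for illustration; the paper does not report the value of $k$ it used.

```latex
\begin{align*}
X &= S \, \Sigma \, U^{T}, & X &\in \mathbb{R}^{m \times n}, \\
X_k &= S_k \, \Sigma_k \, U_k^{T}, & S_k &\in \mathbb{R}^{m \times k},\;
      \Sigma_k \in \mathbb{R}^{k \times k},\; U_k \in \mathbb{R}^{n \times k}, \\
\hat{q} &= \Sigma_k^{-1} \, S_k^{T} \, q, & q &\in \mathbb{R}^{m}.
\end{align*}
```

Here $X_k$ keeps only the $k$ largest singular values, and the last line projects a new TF-IDF weighted query $q$ into the same $k$-dimensional latent space as the complaints, so that it can be compared with the rows of $U_k$ without recomputing the decomposition.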

Semantic selection and ranking
After applying the SVD, the cosine similarity is calculated between the user complaint and each agriculture problem to return the correct solution. The cosine similarity [31] is calculated as in (4):

$$Sim(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert} \tag{4}$$

where $Sim(V_1, V_2)$ is the similarity between the farmer query and a complaint in the agriculture complaints dataset, $V_1$ is the vector of term weights for the farmer query, and $V_2$ is the vector of term weights for the complaint in the dataset. The complaints are then ranked by their semantic similarity score. If the highest score exceeds a specific threshold (75%), the system returns the response (answer) of the best-matching complaint as the recommended solution for the farmer's query.
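The selection and ranking step can be sketched as below. The function names and the toy vectors are assumptions made for the example; only the cosine measure and the 75% threshold come from the text.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def recommend(query_vector, complaints, threshold=0.75):
    """complaints is a list of (vector, solution) pairs. Return the
    (solution, score) of the best match, or None if no score exceeds
    the threshold."""
    if not complaints:
        return None
    scored = [(cosine_similarity(query_vector, vector), solution)
              for vector, solution in complaints]
    best_score, best_solution = max(scored, key=lambda item: item[0])
    if best_score > threshold:
        return best_solution, best_score
    return None


if __name__ == "__main__":
    history = [([1.0, 0.0, 0.0], "solution A"), ([0.0, 1.0, 0.0], "solution B")]
    print(recommend([2.0, 0.0, 0.0], history))
    print(recommend([1.0, 1.0, 1.0], history))
```

Returning `None` when no complaint passes the threshold makes the "no confident match" case explicit for the caller.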
Algorithm 2 constructs the similarity matrix between each pair of sentence vectors built from the farmer problem and the historical agriculture dataset, based on the problem category. Its complexity depends on the execution time of the internal loop (lines two to five), which calculates the similarity between each sentence vector and the other vectors. The overall time complexity is estimated as $O(|P|^2)$, where $|P|$ is the number of sentence vectors.

RESULTS AND DISCUSSION
This section presents an evaluation of the proposed recommender system using TF-IDF and TF. We measured the system performance in terms of precision, recall, F-measure, and accuracy. Our AgroSupportAnalytics system was implemented in Python. The dataset was divided into 80% for training and 20% for testing, and ten experiments were run. Finally, we tested the system with the test dataset and saved the results of the SVM classification. The experiments were executed on dual-core systems with a Pentium CPU at 6.00 GHz, a Tesla GPU with 16 GB, and 32 GB of RAM. The systems (up to 4 nodes) were connected over a 100 Mbps LAN and ran Windows XP (using the MS-DOS prompt).
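The 80%/20% split repeated over ten experiments can be sketched as follows. How the ten experiments differ is not described in the text, so using a different random seed per run is an assumption, as are the function name and the integer stand-ins for complaints.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    records = list(range(10000))  # stand-ins for the 10,000 complaints
    # Assumed protocol: one split per experiment, each with its own seed.
    runs = [split_dataset(records, seed=seed) for seed in range(10)]
    train, test = runs[0]
    print(len(runs), len(train), len(test))
```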

Dataset
The dataset was acquired from Egypt's agriculture research center (ARC) and the virtual extension and research communication network (VERCON) [6]. It contains historical complaints and the solutions provided by experts, saved as unstructured data. The agricultural data was stored on a public cloud. This dataset is important because it contains real-world problems collected over a long time by Egypt's agriculture centers. It covers different crop types such as wheat, tomato, cotton, and mango, and problem categories such as irrigation, pest, weed, and diseases. Table 1 shows statistics about the VERCON agriculture dataset and lists the crops that are planted in Egypt. The dataset is available in text form.

Experiments and results
The experiments and results are presented to evaluate the system's performance with different measures. We ran experiments in two settings, with and without the classification technique. Recommendations are produced by recommender techniques that combine classification and semantic similarity; the result of SVM classification on the agriculture dataset is incorporated into the recommendation process. We tested two weighting methods for semantic analysis: TF and TF-IDF.
Consider the farmer query example: "ظهور بقع بنية على محصول الزيتون". The proposed system is applied to return the most relevant complaint and its solution for the farmer query, as shown in Table 2. First, the system translates the Arabic farmer query into English: "Appearance of brown spots on the olive crop". Second, we apply preprocessing to the farmer query, such as tokenization, stop word removal, and lemmatization. Third, we apply classification with the Map/Reduce SVM algorithm using Hadoop to classify the farmer's query by problem category ("weed class"). Fourth, we create a term frequency matrix. Fifth, we compute the semantic similarity score from the LSA matrix using TF-IDF or TF to return the most recommended solution.
Several metrics are used to evaluate the classification in our system. Accuracy is measured to assess the classification results before semantic analysis. SVM is applied to predict which category the farmer query belongs to before the recommendation methods are used. The results show that the classification accuracy is approximately 88-89%, as shown in Table 3.
We ran experiments in two settings, with and without the classification technique, to evaluate how the SVM classifier-based model enhances the system performance. We used different measures, namely precision, recall, F1-score, and accuracy, to calculate the results of our system. Table 4 shows the results when TF semantic similarity is used: the F1-score is 83.82% with SVM classification and 69.94% without it, and the accuracy is 84.30% with SVM classification and 70.32% without it. Table 5 shows the results when TF-IDF semantic similarity is used: the F1-score is 86.64% with SVM classification and 73.42% without it, and the accuracy is 86.98% with SVM classification and 70.32% without it.
Further measures are used to evaluate the performance of the proposed recommender system: the root-mean-square error (RMSE), mean absolute error (MAE), and normalized MAE (NMAE). RMSE, MAE, and NMAE are well-known metrics used as a baseline to evaluate recommender systems. Table 6 shows the comparative results obtained from the recommender using these metrics with semantic analysis. They were calculated based on SVM classification with the four main problem categories. Recommendations are based on recommender system methods with classification and semantic similarity.
Table 6 shows that the RMSE, MAE, and NMAE yielded by the system that merges SVM classification with semantic similarity are better than the error rates obtained by the methods without SVM classification.
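For reference, the evaluation measures named above can be computed as in the sketch below. The function names and toy inputs are assumptions, precision and recall are shown for a single positive class, and NMAE is normalized by the range of the score scale, which is a common definition but is not spelled out in the text.

```python
import math


def classification_metrics(true_labels, predicted_labels, positive):
    """Precision, recall, and F1 for one positive class, plus accuracy."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": correct / len(pairs)}


def error_metrics(actual, predicted, low, high):
    """RMSE, MAE, and NMAE (MAE divided by the assumed range high - low)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"rmse": rmse, "mae": mae, "nmae": mae / (high - low)}


if __name__ == "__main__":
    truth = ["pest", "pest", "weed", "weed"]
    guess = ["pest", "weed", "weed", "weed"]
    print(classification_metrics(truth, guess, positive="pest"))
    print(error_metrics([1, 2, 3], [1, 2, 5], low=0, high=4))
```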
Tables 4 and 5 summarize the results of using LSA with the two weighting methods in the system. Figure 2 illustrates the comparison of the semantic similarity-based methods (TF, TF-IDF) with SVM classification in Figure 2(a) and without SVM classification in Figure 2(b). With SVM classification, the system achieved an average accuracy of 84.30% using TF, while TF-IDF scored a better accuracy of 86.98%.

CONCLUSION
The main idea of this work is to find solutions to farmers' problems by identifying the main causes of complaints and the features behind them. We developed a recommender system that uses LSA based on TF-IDF to calculate the semantic similarity between the user query and the problems in the agriculture dataset. In addition, the farmer complaint is classified by problem category using SVM in a Map/Reduce environment. This paper built a semantic model of the agricultural data to help farmers. As a result, we hope that the effects of many important challenges and problems facing the agricultural sector can be reduced. The AgroSupportAnalytics system performs better than existing techniques in terms of precision, recall, F1-score, and accuracy. Its best accuracy, 87%, is achieved by LSA using TF-IDF with the SVM classifier.