Deep learning based Arabic short answer grading in serious games

ABSTRACT


INTRODUCTION
Assessment and evaluation of learning are important steps in learning and knowledge transmission processes. Instructional design processes that focus on designing and developing learning systems [1] always include a phase called "develop assessment instruments" [2]. Teachers and instructional designers create assessment tools such as exams, assignments, or quizzes. They usually create different types of questions, whose answers are true/false, multiple choice, matching, numerical values, essays, or short answers. Short answer questions require students to answer in free text by composing a few sentences, typically one or two. This type of question has the advantage of requiring students to construct an answer by themselves, rather than selecting answers from predetermined lists.
Serious games currently refer to video games designed to train people or to transmit learning. Serious games can complement classroom teaching or help with distance learning. The development of serious games involves pedagogy, didactics, learning design, and game design [3]. Assessment of learning within games is an important feature that helps make learning effective in serious games [4].
Answers written in free text, such as short answers and essays, have traditionally been absent from computerized tests and serious games because they were considered difficult to evaluate and grade automatically [5], [6]. Because of this challenge, automatic short answer grading (ASAG) has become a research problem. Burrows et al. [5] classified the different approaches tried on ASAG problems into five eras, the fourth one being machine learning. Recent advances in natural language processing (NLP) as well as in machine learning applied to NLP are providing promising results on ASAG problems [7]-[9]. Different ASAG applications have started emerging and active research in this field has developed [5].
The objective of this research is to extend automated grading based on machine learning to questions and answers written in Arabic. Recent research has tested deep learning approaches to ASAG mainly on answers written in English. The approaches of these researchers seemed generic enough to adapt to Arabic. We also wanted to extend and test the same approaches on answers collected initially in Arabic and not originating from datasets translated from English.
This research targeted questions and answers aimed at schoolchildren from fifth and sixth grade, aged 11 and 12 years old. We have collected a dataset for this research. We have used a standard NLP pipeline. We have leveraged an existing machine-learning algorithm to project Arabic words on numerical vectors that deep-learning algorithms can work with. To grade our short answers initially written in Arabic, we have tested three deep learning approaches, namely long short-term memory (LSTM), transformers, and bidirectional encoder representations from transformers (BERT). We have deployed our automatic grading in an operational environment and tested this grading in the context of continuous learning evaluation and serious games. We have also made the dataset available to other research projects.
The organization of the paper is as follows. We review the state of the art in section 2. We present our research method in section 3. We discuss the results of this research in section 4. We summarize this work and describe possible enhancements and future work in the last section.

STATE OF THE ART
ASAG grades answers written in free text and leverages approaches from NLP. The NLP approaches currently used for ASAG explore deep learning models with recurrent neural networks. Kumar et al. [10], Prabhudesai and Duong [11], and Xia et al. [12] have explored LSTM-based models for ASAG. Alikaniotis et al. [13] have introduced a model based on LSTM for text scoring that can identify which specific words impact the score. Roy et al. [14] have proposed a technique to overcome the need to have labeled training data and graded student answers for every assessment. In the first stage, they used a classifier of student answers coupled with a classifier of similarity with respect to model answers. In the second stage, they used a canonical correlation analysis based on transfer learning to build the classifier ensemble for questions having no labeled data.
Riordan et al. [7] have carried out a series of experiments across several short answer scoring datasets. They took as a reference the architecture of the neural network used by Taghipour and Ng [15]. This neural network provided good performance on automated essay scoring. The network leveraged a convolutional neural network (CNN) architecture with regression and a simple LSTM. Zhang et al. [16] addressed the grading of open-ended questions. These questions do not usually have a limited number of reference answers. Students can express opinions or personal thoughts on these questions. They have used a deep learning model that integrates both domain-general and domain-specific information. The proposed model used an LSTM to classify word sequence information. The dataset had about 16,000 sample answers related to seven reading comprehension questions.
Other researchers have used transformers and transfer learning in their systems to train models. Camus and Filighera [17] have fine-tuned existing, already trained transformer-based architectures. They have explored transfer learning from one dataset to another and its impact on generalization and performance. Condor [18] has used BERT as a tool to assist instructors with ASAG. Condor targeted situations where final human judgment is considered necessary.

ASAG datasets
There is a variety of datasets already listed in the literature to train and test ASAG models. As we know, well-structured datasets lead to good results. The Hewlett Foundation [19] has released a dataset called ASAP to train and benchmark ASAG systems. ASAP is currently available on Kaggle. This dataset contains about 10,686 samples belonging to 8 different sets of essays. The average response length ranges from 150 to 550 words. Each essay is followed by one or more scores given by human graders. The objective is to match the "expert human graders for each essay."
Some researchers have used datasets collected during university courses. The dataset used by Mohler and Mihalcea [20] at the University of Texas consists of 80 questions collected from a computer science course named Data Structures. The questions were used in multiple assignments and two examinations. They have collected answers through an online learning system. The dataset comprises 80 questions and 2,273 responses provided by 31 students, with 2 human expert graders. Menini et al. [21] released a dataset named Statistics to test short answer grading. They have built the dataset from statistics exams.

ASAG in Arabic
Gomaa and Fahmy [22] have pioneered ASAG for Arabic. They have collected a dataset in Arabic that has 61 questions. Each question had about ten student answers, and all answers were labeled with grades. They have built the dataset from the Environmental Science course of the Egyptian curriculum. For grading, they have used a text similarity-based approach that measures the similarity between student answers and reference answers. They have not used any automatic grading via a machine learning approach.
Nael et al. [23] researched a deep learning-based system to score short answers in Arabic and achieved good performance. However, they have not used a dataset built out of questions and answers written initially in Arabic. They have used an Arabic translation of an English test dataset called ASAP short answer scoring.
As we wanted to explore automated grading for schoolchildren in Arabic, we wanted to know what such data would look like in reality. Our objective motivated us to collect and build our own dataset following the best practices already mentioned in the literature. Because data is key to machine learning algorithms, we targeted collecting real and genuine data from schoolchildren in Arabic to create good models that can handle our problems well.

OUR RESEARCH METHOD
All the datasets cited in the literature were collected either manually, through forms or dedicated web applications [24], [25], or with an automated mechanism such as web scraping. We have built our dataset manually from answers provided by schoolchildren aged 11 to 12 years old and graded by a teacher. Figure 1 illustrates an excerpt of the dataset. We have used Google Forms to collect the answers. All participants were studying in the sixth grade of primary education in Morocco. The schoolchildren answered 18 questions related to the Islamic education course. We have collected 1,276 answers. A teacher has evaluated and graded all answers. The grades were between 0 and 2: 0 for completely incorrect, 1 for partially correct, and 2 for correct.
Figure 1. Islamic Education short answer dataset (answers 1 to 4 read Gabriel, Gabriel, Gabriel peace be upon him, our master Gabriel peace be upon him, respectively)
Schoolchildren have answered these questions at home on a computer or on a mobile device. We have noticed that 75% of the answers have 8 words or fewer. Figure 2 shows the number of answers for a given number of words. Table 1 provides statistical indicators related to the number of characters and the number of words in the dataset. Compared to other datasets collected in universities or from adult responses, the answers that we have collected have fewer words. We also looked at the data from the score perspective to ensure that all scores were present. We found that it is important to ensure that all scores are present in the dataset to help machine learning algorithms work correctly. Table 2 provides the number of answers per score.

NLP pipeline for ASAG
For Arabic NLP, researchers are currently using pipelines and architectures similar to those used for English NLP [26]. Likewise, the NLP pipelines used for ASAG are similar to the generic pipelines used for other NLP applications. For this research, we have used a pipeline similar to the ones used for English ASAG when leveraging NLP and machine learning. The adopted NLP pipeline was composed of three main stages, as illustrated in Figure 3. The classification or grading happens in the third stage. We have used two stages upfront to preprocess the text and transform words into numerical vectors before classification can be applied.
The first stage is text processing and includes tasks like segmentation, tokenization, stop word removal, and stemming or lemmatization. The output of this first stage is the list of important words composing the initial answer, reduced to their root words. Stemming and lemmatization are both used in NLP to normalize words by reducing each word to its root or dictionary form. Stemming algorithms chop off suffixes and are fast, but they may reduce words to wrong roots or non-existing words. Lemmatization algorithms apply a contextual analysis to words and link them on average to more appropriate root words. However, if the text is long, then lemmatization takes considerably more time.
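The first stage can be sketched in plain Python as follows. This is only a minimal illustration, not the pipeline used in this research: the stop-word list is a hypothetical, deliberately tiny set, the normalization rules (diacritic stripping, alef unification) are common Arabic preprocessing conventions assumed here, and stemming or lemmatization is omitted.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"في", "من", "على", "هو", "هي", "و"}

# Arabic diacritics (U+064B..U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Alef with hamza above, hamza below, and madda, unified to bare alef (U+0627).
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")


def normalize(text):
    """Strip diacritics and tatweel, then unify alef variants."""
    text = DIACRITICS.sub("", text)
    return ALEF_VARIANTS.sub("\u0627", text)


def tokenize(text):
    """Return runs of Arabic letters; punctuation, digits and spaces split tokens."""
    return re.findall(r"[\u0621-\u064A]+", text)


def preprocess(answer):
    """Normalize, tokenize, and drop stop words."""
    tokens = tokenize(normalize(answer))
    return [t for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess` on an answer such as "جبريل عليه السلام" (Gabriel peace be upon him) returns its three tokens, since none of them is in the toy stop-word list.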
The second stage, called feature extraction or embedding, is where we map each word to a numerical vector belonging to a relatively low-dimensional continuous space, called the embedding space. An important requirement for this mapping is that words sharing similar meanings or semantics should translate in the embedding space to numerical vectors that are close to each other [26]. Each dimension of the embedding space is usually linked to some semantic features of our vocabulary. Table 3 provides an example of six words projected on an embedding space of three dimensions where each dimension is associated with a pure semantic feature, namely {Person; Location; Duration}. Words used in similar contexts usually have similar meanings or semantics, thus these words must be close to each other along some dimensions of the embedding space. Different techniques are used for feature extraction. In this research, we have used word2vec [27] to generate the embedding vectors. The word2vec algorithm leverages machine-learning techniques and is widely used in NLP. The dimension of the embedding space was 300. This means that all the words of the Arabic corpus that we have used were projected on vectors of dimension 300.
The third stage performs the classification task. This stage leverages deep learning algorithms and implements our machine learning models. We have first trained these models on our data. We then tested them on unseen answers. For both training and testing, we have fed these models with data that went through the first two stages of our pipeline.
For both LSTM and transformer models, we have used the Gensim toolkit during lemmatization, tokenization, and word embedding [28]. Gensim addresses many common NLP tasks and provides an implementation of the word2vec algorithm. We usually train the word2vec algorithm on a corpus and associated texts to generate a word vector encoding how to map each word from the corpus on a numerical low-dimension vector. In our research, we have used word2vec with "Wiki.ar.vec" as the pre-trained word vector [26]. "Wiki.ar.vec" was trained on ar.wikipedia. For the BERT model, we have performed the tokenization task using a pre-trained model called "Bashar-talafha/multi-dialect-bert-base-arabic" [29].
Our approach to Arabic ASAG was to test and adapt the models used for English ASAG. The first unknown was the quality of the word vectors. A second one was how the models would behave on answers written in simple words by schoolchildren. We have tested an LSTM architecture [30], a transformer-based architecture [31], and transfer learning by fine-tuning a pre-trained BERT model [32]. The following subsections present these architectures.
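The closeness requirement can be made concrete with cosine similarity. The sketch below uses hypothetical three-dimensional vectors on the {Person; Location; Duration} axes; the words and values are invented for illustration and are not the entries of Table 3.

```python
import math

# Hypothetical embeddings on the axes (Person, Location, Duration).
EMBEDDINGS = {
    "teacher": (0.9, 0.1, 0.0),
    "student": (0.8, 0.2, 0.1),
    "school": (0.2, 0.9, 0.1),
    "hour": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Rank the other words by decreasing cosine similarity to `word`."""
    target = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(others, key=lambda w: cosine(target, EMBEDDINGS[w]), reverse=True)
```

With these toy values, `most_similar("teacher")` ranks "student" first, because both vectors point mostly along the Person axis.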
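To illustrate how pre-trained vectors turn a tokenized answer into model input, the following sketch parses the plain-text word2vec format (a header line with vocabulary size and dimension, then one word and its values per line) and pads each answer to a fixed length. It is a stdlib-only stand-in for what a toolkit such as Gensim provides; the function names, the zero vector for unknown words, and the zero padding are assumptions made for this example.

```python
import io


def load_vec(handle):
    """Parse word2vec text format: 'vocab_size dim' then 'word v1 ... v_dim'."""
    _vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def encode(tokens, vectors, dim, max_len=54):
    """Map tokens to vectors, truncating or zero-padding to max_len rows.

    Out-of-vocabulary tokens are mapped to the zero vector (an assumption).
    """
    rows = [list(vectors.get(t, [0.0] * dim)) for t in tokens[:max_len]]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


# Tiny in-memory file standing in for a real 300-dimensional .vec file.
sample = io.StringIO("2 3\njibril 0.1 0.2 0.3\nsalam 0.4 0.5 0.6\n")
vectors, dim = load_vec(sample)
matrix = encode(["jibril", "unknown"], vectors, dim, max_len=4)
```

With real data, `dim` would be 300 and `max_len` 54, so each answer becomes a 54 x 300 matrix.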

LSTM model
The architecture based on the LSTM model is composed of 7 layers and is described in Figure 4. The input layer has 54 nodes because the longest response in our system can have 54 words. LSTM is a recurrent network and will iterate on 54 words. The embedding layer has 300 nodes, 300 being the length of the vector after word2vec encoding. The LSTM layer has 64 units, followed by two dropout layers with 64 and 32 nodes, followed by one flatten layer with 3,456 nodes. As we have three possible final grades {0, 1, 2}, the output layer has 3 nodes to provide the result of our classification task. In total, the LSTM model has 204,163 trainable parameters. For the ASAG problem, we found that the hyper-parameters used with LSTM have an impact on learning and test results. We have tested different hyper-parameters. Table 4 lists the best hyper-parameters found to train the LSTM model for Arabic ASAG. We present the performance achieved with these parameters in section 4.
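The reported parameter total can be reproduced with a short calculation under three assumptions that the text does not state explicitly: the pre-trained embedding layer is frozen, the LSTM returns its output at every one of the 54 time steps (which gives the 54 x 64 = 3,456 flattened nodes), and a 32-unit dense layer sits between the flatten layer and the output layer. The sketch is one arrangement consistent with the numbers, not a confirmed description of the model.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim * units + units * units + units)


def dense_params(inputs, outputs):
    """Fully connected layer: one weight per input/output pair plus biases."""
    return inputs * outputs + outputs


SEQ_LEN, EMBED_DIM, LSTM_UNITS, HIDDEN, CLASSES = 54, 300, 64, 32, 3

lstm = lstm_params(EMBED_DIM, LSTM_UNITS)   # 93,440
flat = SEQ_LEN * LSTM_UNITS                 # 3,456 flattened nodes
hidden = dense_params(flat, HIDDEN)         # 110,624 (assumed dense layer)
output = dense_params(HIDDEN, CLASSES)      # 99
total = lstm + hidden + output              # 204,163

print(f"trainable parameters: {total:,}")
```

Dropout and flatten layers contribute no parameters, which is why only three layers appear in the sum.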

Transformer model
The architecture based on the transformer model is composed of 6 layers, as shown in Figure 5. The input layer has 54 nodes, 54 being the length of the longest response of our dataset. A token and position embedding layer follows, with 300 nodes, where 300 is the size of the computed embedding vectors. The transformer layer also has 300 nodes, followed by a max-pooling layer for dimensionality reduction, then one dropout layer with 300 nodes, and finally a 3-node output layer to perform the classification task. The transformer model has about 768,311 trainable parameters.
For the transformer, we have also tested several hyper-parameters and compared results. Table 5 presents the hyper-parameters that provided the best performance. We discuss the performance achieved in section 4.
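Two of the layers described above have simple arithmetic that can be shown in isolation. The sketch below assumes that the token and position embedding layer adds a position vector to each token vector, and that the max-pooling layer takes the maximum over the sequence positions for each embedding dimension; both are common choices but are assumptions here, and the numbers are toy values with 2 dimensions instead of 300.

```python
def add_position(token_vecs, position_vecs):
    """Element-wise sum of each token vector and its position vector."""
    return [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(token_vecs, position_vecs)
    ]


def global_max_pool(sequence):
    """Reduce a (seq_len x dim) matrix to one vector of length dim."""
    return [max(column) for column in zip(*sequence)]


# Toy example: 2 positions, 2 dimensions.
tokens = [[1.0, 2.0], [3.0, 0.0]]
positions = [[0.5, 0.5], [-1.0, 1.0]]

embedded = add_position(tokens, positions)   # [[1.5, 2.5], [2.0, 1.0]]
pooled = global_max_pool(embedded)           # [2.0, 2.5]
```

Pooling over positions turns a 54 x 300 matrix into a single 300-value vector, which matches the 300 nodes of the dropout layer that follows.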

BERT model
The last tested architecture is based on the BERT model. We have used an architecture composed of 6 layers, see Figure 6. The input layer has 309 nodes. The BERT layer has 110,617,344 parameters, followed by two dense layers with 64 and 32 nodes and two dropout layers with 64 and 32 nodes. The output layer has 3 nodes to perform the classification task. Overall, the BERT model has 51,395 trainable parameters. As for LSTM and transformer, the performance of a BERT model on ASAG varies with the parameters used. We have tested and experimented with different parameters before finding a set of parameters that provided good performance on our ASAG problem. We have listed these parameters in Table 6. We present and discuss the performance achieved with these parameters in section 4.
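The small number of trainable parameters is consistent with a frozen BERT encoder whose 768-dimensional output feeds the dense classification head. The calculation below rests on that reading, which is an inference from the figures rather than a statement from the text; 768 is the hidden size of BERT-base models.

```python
def dense_params(inputs, outputs):
    """Fully connected layer: weights plus one bias per output."""
    return inputs * outputs + outputs


BERT_HIDDEN = 768            # hidden size of a BERT-base encoder
BERT_PARAMS = 110_617_344    # reported size of the BERT layer (assumed frozen)

head = (
    dense_params(BERT_HIDDEN, 64)   # 49,216
    + dense_params(64, 32)          # 2,080
    + dense_params(32, 3)           # 99
)                                   # 51,395 in total

trainable_share = head / (BERT_PARAMS + head)
print(f"trainable: {head:,} ({trainable_share:.4%} of all parameters)")
```

Under this reading, training only adjusts the classification head while the pre-trained language representation stays fixed.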

Deployment to an operating environment
We wanted to test the deployment and the behavior of the grading service in an operating environment. In operations, short answers need to go through text processing and feature extraction before classification. We have embedded the 3 stages of our NLP pipeline in our deployment server, as shown in Figure 7.
To execute the machine learning models in our operating environment, we have installed TensorFlow [33] on our server and used it as an inference engine to execute our classification models. After training and tuning on Colab and Kaggle, we have deployed our models to TensorFlow in an 'H5' container [34]. We wanted to make the automatic grading service available to multiple front ends. We made this service available through a service-oriented architecture (REST API). We consumed this service through a web and a mobile application. The service-oriented architecture has proved flexible to deploy and operate the trained models.
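A grading endpoint of this kind can be sketched with the Python standard library alone. Everything specific in the example is hypothetical: the `/grade` path, the JSON field names, and `stub_classifier`, which stands in for the three pipeline stages and the trained model. The actual service may be organized quite differently.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def stub_classifier(answer):
    """Hypothetical stand-in for the trained model; returns a grade in {0, 1, 2}."""
    return 2 if answer.strip() else 0


def grade_payload(body, classifier=stub_classifier):
    """Validate a JSON request body and return (HTTP status, response dict)."""
    try:
        answer = json.loads(body.decode("utf-8"))["answer"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with an 'answer' field"}
    if not isinstance(answer, str):
        return 400, {"error": "'answer' must be a string"}
    return 200, {"grade": classifier(answer)}


class GradeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/grade":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = grade_payload(self.rfile.read(length))
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer((host, port), GradeHandler).serve_forever()
```

Keeping the validation and grading logic in `grade_payload`, separate from the HTTP handler, lets web and mobile front ends share one contract and makes the logic testable without starting a server.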

RESULTS AND DISCUSSION
To evaluate the performance of each model, we have used four evaluation metrics: accuracy, precision, recall, and Cohen kappa. The values of these metrics for the LSTM model are listed in Table 7, while Figure 8 plots the metrics against the number of epochs. For the transformer model, Table 8 provides the metrics, and Figure 9 plots the metrics against epochs. For the BERT model, Table 9 and Figure 10 provide the values of the metrics and the graph plotting the metrics against epochs.
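The four metrics can be computed directly from the true and predicted grades, as in the sketch below. The text does not say how precision and recall are averaged over the three grades; macro-averaging is assumed here, and the sample labels are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of answers whose predicted grade equals the teacher's grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-grade precision and recall (assumed averaging)."""
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented example: six answers, one grade-1 answer predicted as grade 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
precision, recall = macro_precision_recall(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

Cohen kappa is useful alongside accuracy because it discounts the agreement a classifier would reach by chance when some grades are much more frequent than others.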

Discussion
From the results of the model evaluation section, we notice that the transformer model has the best accuracy with 95.67%, followed by the BERT model with an accuracy of 88.68%, and then LSTM with an accuracy of 83.95%. The same remark applies to the other metrics like precision, recall, and Cohen kappa. When we look at the metrics graphs of each proposed model, we notice that the transformer model overfits faster compared to both LSTM and BERT. The difference between the training curve and the test curve becomes larger as the epochs increase. This is due to the complexity of the model architecture and the size of the dataset used. To deal with this problem, we have used two techniques commonly applied to this kind of problem. The first one consisted of using dropout layers to reduce the complexity of the model. The second one consisted of using an early stopping technique.
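Early stopping can be expressed in a few lines: training halts once the validation loss has not improved for a given number of epochs, and the weights from the best epoch are kept. The sketch below is a generic version of the technique; the patience value and the loss values are hypothetical and are not taken from our experiments.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed
    to improve by more than min_delta for `patience` consecutive epochs.
    """
    best, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improvement stops after epoch 2.
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

Here training would stop at epoch 5 and the model from epoch 2 would be retained, which limits the growing gap between training and test curves.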
We have compared our models with other models from the literature, as illustrated in Table 10. Regarding accuracy on ASAG, we can conclude that we have followed a sound approach and achieved results comparable to other research. We can improve our results by varying some hyper-parameters like architecture or regularization techniques, by adding more data to the dataset, or by using dedicated lemmatization and word embedding.
[35]: 76%
ASAG based LSTM, Saha et al. [6]: 79.26%
ASAG based LSTM, Liu et al. [36]: 88.9%
Proposed ASAG based LSTM: 83.95%
ASAG based transformers, Wang et al. [37]: 80.17%
ASAG based transformers, Camus and Filighera [17]: 79.7%

CONCLUSION
We have shown in this research that we can set up an ASAG system for schoolchildren and for Arabic. ASAG systems in Arabic can leverage and adapt natural language processing pipelines and deep learning architectures used for English ASAG. We have used a lemmatization adapted to Arabic to transform words into their dictionary forms. As deep learning algorithms require us to map or embed Arabic words into low-dimension numerical vectors, we have used the open-source word2vec algorithm trained on Arabic Wikipedia to compute these numerical embedding vectors. Our ASAG system targeted schoolchildren from fifth and sixth grades, aged 11 and 12 years old. The answers given by these schoolchildren turned out to be short and composed of 5.6 words on average. The collected dataset proved large enough to train our model and to provide good results. We have also made this dataset public and available for future research projects. Moreover, service-oriented architecture proved beneficial to deploying our models to production in an environment providing ASAG services. We were able to consume the ASAG service via different front ends.
We have used a word-embedding algorithm trained on Arabic Wikipedia. As the style and expressions used by schoolchildren are different from what we can find in Wikipedia, one can explore training a word-embedding algorithm on a corpus made out of school textbooks in Arabic. We can also improve our ASAG system by adding correction capabilities. This means that the system will correct wrong or ambiguous answers and propose to schoolchildren how to improve their responses. Finally, our ASAG system was trained on one chapter of the curriculum of the fifth grade. One can extend this to cover all chapters of the fifth grade or of primary education. Such an extension will make continuous evaluation of learning in primary schools easy and will help teachers detect schoolchildren with learning problems early. Once these schoolchildren are detected, teachers can help them overcome their difficulties. With the recent emergence of large language models, developing ASAG systems to cover full primary curricula and for continuous evaluation of learning seems a promising direction.

Table 1.
Statistical indicators on the length of answers
Figure 2. Number of answers per number of words

Table 2.
Number of answers per score

Table 3.
Example of embedding vectors associated with six different words

Table 4.
Hyper-parameter of LSTM architecture for ASAG

Table 5.
Hyper-parameter of transformer architecture for ASAG

Table 6.
Hyper-parameter of BERT architecture for ASAG

Table 8.
Transformer model for ASAG metrics results

Table 9.
BERT model for ASAG metrics results
Figure 10. BERT model for ASAG: metrics' graphs

Table 10.
Accuracy results on ASAG