Classification of heterogeneous Malayalam documents based on structural features using deep learning models

The proposed work presents a comparative study of the performance of various pretrained deep learning models for classifying Malayalam documents such as agreement documents, notebook images, and palm leaves. The documents are classified based on their visual and structural features. The dataset was manually collected from different sources. The method proceeds through preprocessing, feature extraction, and classification. The proposed work deals with three fine-tuned deep learning models, namely visual geometry group-16 (VGG-16), convolutional neural network (CNN), and AlexNet. The models attained accuracies of 99.7%, 96%, and 95%, respectively. Among the three models, the fine-tuned VGG-16 model performed best on the dataset. As future work, methods to classify the documents based on content as well as spectral features can be developed.


INTRODUCTION
Ancient documents reveal the history of a people and a nation as well as their traditions. The preservation and segregation of these documents is a tedious process. The documents degrade over time due to various factors such as aging, environmental factors, and accidental errors [1]. Digitization is an effective method for preserving as well as classifying documents. As the need for digitization increases, so does the requirement for classifying the digitized documents. Document classification is the process of grouping documents into categories, here based on their structural features. Document segregation and storage is an important step in information management and retrieval [2]. As the documents belong to different categories, classifying them manually would consume considerable time and effort. A deep learning-based method is proposed to determine which classifier obtains the highest accuracy for the classification of ancient Malayalam documents based on structural features. The documents are classified into three categories, namely palm leaves, agreement copies, and notebook images.
Mushtaq et al. [1] investigated a deep learning-based convolutional neural network (CNN) model for spectral image classification on three datasets consisting of 10, 10, and 50 classes, and reported accuracies of 99.04%, 99.49%, and 97.57%, respectively. In [2], [3]-[14], documents were classified based on visual features, and an accuracy of 77% was obtained. Kanchi et al. [15] discussed a deep multi-model-based approach to classify documents on datasets containing 16 and 10 classes respectively. The proposed approach obtained an accuracy of 90.3%. A deep active learning-based approach was investigated by Hemmer et al. [16] for the classification of images; it attained an accuracy of 90% on the Modified National Institute of Standards and Technology (MNIST) and CIFAR-10 datasets. Indraswari et al. [17] proposed a MobileNet-based classification of melanoma images, which achieved an accuracy of 85% over four different datasets containing images belonging to two classes. Ahmed et al. [18] investigated a deep neural network model with an attention mechanism for Bangla document classification on a manually collected dataset. The proposed model obtained an accuracy of 86.56% over 13 document classes. Pan et al. [19] experimented with an ontology-driven approach to classify scientific literature that achieved a score of 95% on the DBLP dataset. Jiang et al. [20] proposed three deep models for the classification of technical documents, which yielded an accuracy of 77.9% over 50 distinct classes. A deep learning-based adaptive multiscale segmentation method was proposed by Zhao et al. [21] on the Indian Pines, Salinas Scene, and University of Pavia datasets containing 16, 9, and 15 classes respectively, on which accuracies of 94.312%, 99.217%, and 92.693% were obtained. A hybrid model combining deep learning and machine learning was developed by Swetanisha et al. [22] for classifying multispectral images on the Landsat-8 dataset containing 7 classes of satellite images, and it attained decent accuracy scores. In [23], [24], a combined approach of deep learning and machine learning models was used to perform multiclass classification of documents, which yielded exceptionally high accuracy values. Deep neural network-based models were proposed in [25]-[27] for the classification of hyperspectral images on the Indian Pines, University of Houston, and Salinas Scene datasets containing over 16 image classes. The proposed models attained accuracies varying between 90% and 99%. Jayakumari and Nair [28] proposed a deep learning-based ResNet model to perform binarization of ancient horoscopic palm leaf images, which attained an accuracy of 95.38% on a manually collected dataset. Deep learning is the preferred choice when a problem requires processing large volumes of unstructured data [29]. The remaining sections of the paper present the method, the results and discussion, and the conclusion.

METHOD
The proposed method classifies ancient documents such as agreement copies, palm leaf manuscripts, and notebook images based on their structural features. The methodology comprises three approaches, each using a different deep learning model (a CNN, a fine-tuned VGG-16, and a modified AlexNet) along with its own enhancement method for classifying the documents. Each model is evaluated on the basis of the accuracy obtained over the dataset. The model that performs best on the dataset is identified from the evaluation results.

Data collection
The proposed work aims to classify Malayalam documents belonging to three categories: agreement copies, palm leaves, and notebook images. The datasets used for the proposed work were manually collected from people as well as from online repositories. The datasets contain degraded documents, which are difficult to classify because the same pattern of Malayalam characters is present in all the documents. Table 1 displays the details of the datasets, their sources, and the number of samples collected. The agreement copies were manually collected from various Taluk offices as well as from the internet. The palm leaf dataset was collected from various people in Kerala as well as from Varikkasheri Mana, Palakkad, Kerala, and the notebook images [30] were collected from various schools and colleges of Kottayam, Kerala. For the research, 1,500 samples of each category of documents were collected.

Preprocessing
In the initial approach using CNN, the input image is resized to 224×224 and is converted into grayscale format. The grayscale image then undergoes Otsu thresholding for image enhancement. Thresholding is done to binarize the input image based on its pixel intensity values. The Otsu enhancement process uses (1).
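As an illustration of the thresholding step, a minimal sketch of Otsu's method in its standard within-class-variance form is given below: the threshold x is chosen to minimize ωbg(x)·σ²bg(x) + ωfg(x)·σ²fg(x). The function names, the use of NumPy, and the convention that pixels below x form the background are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold x that minimizes the within-class variance.

    gray: 2-D uint8 array (a grayscale image).
    Assumption: pixels with intensity < x are background, the rest foreground.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()              # probability of each grey level
    levels = np.arange(256)
    best_x, best_var = 0, np.inf
    for x in range(1, 256):
        w_bg = prob[:x].sum()             # class weight of the background
        w_fg = prob[x:].sum()             # class weight of the foreground
        if w_bg == 0 or w_fg == 0:
            continue                      # one class is empty at this x
        mu_bg = (levels[:x] * prob[:x]).sum() / w_bg
        mu_fg = (levels[x:] * prob[x:]).sum() / w_fg
        var_bg = (((levels[:x] - mu_bg) ** 2) * prob[:x]).sum() / w_bg
        var_fg = (((levels[x:] - mu_fg) ** 2) * prob[x:]).sum() / w_fg
        within = w_bg * var_bg + w_fg * var_fg
        if within < best_var:
            best_x, best_var = x, within
    return best_x

def binarize(gray):
    """Map pixels at or above the Otsu threshold to 255 and the rest to 0."""
    x = otsu_threshold(gray)
    return (gray >= x).astype(np.uint8) * 255
```

Library routines such as those in OpenCV or scikit-image compute the same criterion more efficiently; the explicit loop is kept here only so that each term of the criterion is visible.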
In (1), ωbg(x) and ωfg(x) are the probabilities of the pixels of each class (background and foreground) at a threshold value of x, and σ² represents the color value variance. In the second approach using VGG-16, the enhancement process is done by normalizing the RGB values of each pixel of the input document image. Here, the mean pixel value is subtracted from each pixel. The image is normalized and resized to 224×224. The normalization enhancement process uses (2).
In (2), s is the input data that ranges from s1 to sn, and yi is the i-th normalized data. In the third approach using AlexNet, the input image is initially normalized by rescaling. The image is then resized to 227×227, as this is the standard input size for the AlexNet architecture. The third approach thus also uses normalization as its pre-processing method.
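The two normalization steps described above can be sketched as follows. The text does not state whether the mean is computed per image or over the whole dataset, nor the rescaling factor used for AlexNet; the per-image, per-channel mean and the factor 1/255 below are therefore assumptions, and resizing is omitted.

```python
import numpy as np

def mean_subtract(rgb):
    """Subtract the per-channel mean from every pixel (second approach).

    Assumption: the mean is taken over the image itself, per RGB channel.
    """
    rgb = rgb.astype(np.float32)
    channel_mean = rgb.mean(axis=(0, 1), keepdims=True)
    return rgb - channel_mean

def rescale(rgb):
    """Map 8-bit pixel values to the range [0, 1] (third approach).

    Assumption: rescaling divides by 255, the maximum 8-bit value.
    """
    return rgb.astype(np.float32) / 255.0
```

After mean subtraction each channel is centered on zero, whereas rescaling only compresses the value range and leaves the relative brightness of the pixels unchanged.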

Classification using CNN
As shown in Figure 1, the pre-processed image is taken as the input for the CNN model. The model consists of three stacks of convolutional and MaxPooling layers, a flattening layer, and three dense layers. The input image of size 224×224 is passed through a convolutional layer having 32 filters of size 3×3 with the ReLU activation function, from which it is passed to a MaxPooling layer of filter size 2×2. The image is then fed into the next convolutional layer, which has 64 filters of size 3×3, followed by a MaxPooling layer of filter size 2×2. The image is forwarded to the next convolutional layer, which also has 64 filters of size 3×3, followed by a MaxPooling layer of filter size 3×3 and a dropout layer. The output is flattened and fed into the hidden dense layer with 128 units and the ReLU activation function. The layer that follows is the output layer trained for three classes.
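To make the layer sizes concrete, the sketch below traces the spatial size of the feature maps through the three convolution and pooling stacks described above. The text does not state the padding or stride; stride-1 convolutions without padding and non-overlapping pooling windows are assumed here, so the resulting sizes are illustrative only.

```python
def conv_valid(size, kernel):
    """Spatial size after a stride-1 convolution without padding (assumed)."""
    return size - kernel + 1

def pool(size, window):
    """Spatial size after non-overlapping max pooling (stride = window)."""
    return size // window

def trace_cnn(size=224):
    """Return (stack name, height/width, channels) after each conv+pool stack."""
    stages = []
    # (name, number of 3x3 filters, pooling window) as described in the text
    for name, filters, window in [("stack1", 32, 2), ("stack2", 64, 2), ("stack3", 64, 3)]:
        size = pool(conv_valid(size, 3), window)
        stages.append((name, size, filters))
    return stages

stages = trace_cnn()
flat = stages[-1][1] ** 2 * stages[-1][2]   # inputs to the 128-unit dense layer
for name, size, channels in stages:
    print(f"{name}: {size}x{size}x{channels}")
print("flattened features:", flat)
# stack1: 111x111x32
# stack2: 54x54x64
# stack3: 17x17x64
# flattened features: 18496
```

Under these assumptions the 3×3 pooling window in the last stack reduces the feature map more aggressively than the earlier 2×2 windows, which keeps the input to the hidden dense layer comparatively small.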

Classification using fine-tuned VGG-16
As shown in Figure 2, the enhanced image is passed through the VGG-16 network. The first stack consists of two convolutional layers having 64 filters of size 3×3, followed by a max pooling layer of filter size 2×2. The image size is reduced to 112×112×64 after max pooling. The image is then passed to the next stack of two convolutional layers and a max pooling layer, where the same process is repeated and an output of size 56×56×128 is obtained. This is followed by the next stack containing three convolutional layers with 256 filters each, which makes the output size 28×28×256. The next two consecutive stacks again contain three convolutional layers each, with 512 filters per layer. After the image passes through these two stacks, the output is of size 7×7×512. The obtained output is flattened and passed to a stack of three fully connected layers. The model is fine-tuned by replacing the final fully connected dense layer, which serves as the output layer.
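The output sizes quoted above can be checked with a short calculation. The sketch below assumes the standard VGG-16 configuration (3×3 convolutions with padding that preserves the spatial size, each stack ending in a 2×2 max pooling with stride 2, and a three-channel RGB input); the names `VGG16_STACKS` and `trace_vgg16` are introduced here for illustration.

```python
# (number of convolutional layers, filters per layer) for the five stacks
VGG16_STACKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

def trace_vgg16(size=224, in_channels=3):
    """Return the output shape of each stack and the total conv parameters."""
    shapes, params = [], 0
    for n_convs, filters in VGG16_STACKS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per filter
            params += 3 * 3 * in_channels * filters + filters
            in_channels = filters
        size //= 2                      # 2x2 max pooling halves height/width
        shapes.append((size, size, filters))
    return shapes, params

shapes, conv_params = trace_vgg16()
for shape in shapes:
    print("x".join(str(d) for d in shape))
print("flattened features:", shapes[-1][0] * shapes[-1][1] * shapes[-1][2])
print("convolutional parameters:", conv_params)
# 112x112x64
# 56x56x128
# 28x28x256
# 14x14x512
# 7x7x512
# flattened features: 25088
# convolutional parameters: 14714688
```

The trace reproduces the 112×112×64, 56×56×128, 28×28×256, and 7×7×512 sizes given in the text, and shows that the flattened vector passed to the fully connected layers has 25,088 elements.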

Classification using modified AlexNet
The AlexNet model in Figure 4 consists of five stacks of convolutional layers with ReLU activation. Each convolutional layer is followed by a batch normalization layer. The input image of size 227×227 is passed to the input convolutional layer, which has 32 filters of size 11×11 with ReLU activation, and is then forwarded to the batch normalization layer. The output from the first stack of layers is sent to a MaxPooling layer and forwarded to the next stack of layers. Here, the number of filters of the convolutional layer is increased to 64 and the filter size becomes 5×5. After MaxPooling, the output is sent to the following stack of three convolutional layers with 128, 128, and 256 filters respectively, each of size 3×3. The output is then flattened and forwarded to two dense layers, each with ReLU activation. It is then forwarded to the output layer, which has been trained to classify the three classes. The output layer uses SoftMax activation.
Figure 4. Classification using modified AlexNet
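The role of the SoftMax output layer can be illustrated with a small sketch that turns three raw output scores into class probabilities and a predicted label. The class order in `CLASSES` and the example scores are hypothetical and are not taken from the trained model.

```python
import math

# Hypothetical class order; the order used by the authors is not stated.
CLASSES = ["agreement", "notebook", "palm_leaf"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the most probable class and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]

label, confidence = predict([0.1, 2.0, -1.0])   # made-up scores
print(label, round(confidence, 3))
```

Because the probabilities sum to one, two structurally similar classes that receive close scores both obtain sizeable probabilities, and a small change in the scores can change the predicted label.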

RESULTS AND DISCUSSION
The experiment was carried out on a dataset of 4,500 images belonging to three classes. Of these, 3,600 images were used for training, 705 images for testing, and 195 images for predictions. The models were tested on the dataset, and the performance of each model was assessed based on the performance evaluation metrics. Tables 2 to 4 show the performance evaluation metrics of each model.
Table 2 displays the accuracy, precision, recall, and F1-score values obtained by the proposed CNN method. The accuracy increases gradually with the number of epochs. The maximum accuracy is obtained at the 6th epoch, after which the accuracy tends to decrease gradually. The precision, recall, and F1-score values varied inconsistently from epoch to epoch.
From Table 3, it is observed that the VGG-16 model accuracy tends to decrease in the initial epochs. However, from the 4th epoch, the accuracy increases gradually. The highest accuracy is obtained at the 7th epoch, after which a decline in the accuracy values can be seen. The model successfully classified the input documents into the respective classes and achieved an accuracy of 99.7% on the dataset. The precision, recall, and F1-score values were high and remained balanced as the epochs varied.
Table 4 shows the performance of the modified AlexNet on the test dataset. The model accuracy improves as the number of epochs increases and reaches a maximum of 95.5% at the 7th epoch. However, a decline in accuracy was observed after the 7th epoch. The precision, recall, and F1-score values were found to decrease after the sixth epoch.
Figure 5 depicts the accuracy and loss values of the CNN, fine-tuned VGG-16, and modified AlexNet models during training. From the accuracy and loss graph of the CNN, it can be inferred that the accuracy of the model increases with the epochs and reaches a value near 90%. Meanwhile, the loss reduces after the initial epoch and finally reaches a very low value. For the VGG-16 model, as the number of epochs increases, the accuracy increases gradually and reaches a value in the range of 80 to 95%, whereas the loss reduces to a value in the range 0.4% to 0.6%. Finally, the performance of the modified AlexNet model is depicted in the third graph. It is observed that the model loss declined sharply as the epochs progressed. Meanwhile, the accuracy increases with the number of epochs and reaches a value in the range 90 to 100%.
Table 5 shows the misclassifications by AlexNet. Compared to the CNN and VGG-16 models, it was observed that AlexNet often confused notebook images with agreement images. A few palm leaf images were also misclassified as notebook images.
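For reference, the evaluation metrics reported in Tables 2 to 4 can be derived from a confusion matrix such as the one summarized in Table 5. The sketch below shows one way to do this; the small matrix used in the example is made up purely for illustration and does not reproduce any result of this work.

```python
def per_class_metrics(cm):
    """Compute accuracy and per-class (precision, recall, F1).

    cm[i][j] is the number of samples of class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))   # column sum
        actual = sum(cm[k])                            # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return accuracy, metrics

# Made-up counts (rows: true agreement, notebook, palm leaf), not from Table 5.
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
accuracy, metrics = per_class_metrics(toy_cm)
print(f"accuracy = {accuracy:.3f}")
for name, (p, r, f1) in zip(["agreement", "notebook", "palm leaf"], metrics):
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

In such a matrix, confusion between two classes appears as off-diagonal counts that lower the recall of one class and the precision of the other, which is the pattern described for notebook and agreement images.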
The graph in Figure 6 depicts the extent of homogeneity among the datasets of the three classes. It clearly shows the structure-wise similarity among notebook images, agreement copies, and palm leaves. From the graph, we can conclude that notebook and agreement images are the most similar, which is why agreement copies are more often misclassified as notebook images and notebook images as agreement copies.
Figure 6. The homogeneity values between the three classes

CONCLUSION
The fine-tuned VGG-16 model outperformed the other two models with an accuracy of 99.7%. The CNN model achieved an accuracy of 96%, whereas the modified AlexNet achieved an accuracy of 95%. AlexNet produced the most misclassifications. The proposed approach can be used to perform document classification based on structural and visual features. This method of automatic classification can replace manual classification of documents for purposes such as document digitization and cataloguing. Future work is to increase the number of classes by including several other categories of documents and to perform intra-class classification based on textual as well as spectral contents.