Convolutional auto-encoded extreme learning machine for incremental learning of heterogeneous images

ABSTRACT


INTRODUCTION
Information systems differ widely in their form and application, but all of them convert data into meaningful information. Real-world applications generate data in huge volumes, where the process of acquiring knowledge becomes complex. Irrespective of whether the data is homogeneous (same feature set across chunks) or heterogeneous (different feature sets across chunks) [1], the models generated from these systems must learn continually to predict or classify. Incremental learning (or constructive learning) [2], a machine learning technique introduced for continual learning, updates the existing model as data streams in continually. Figure 1 shows an incremental learning model, which grows the network as data belonging to new classes arrives. Incremental learning applies to classification and clustering applications, addressing the challenges of data availability and resource scarcity. An incremental learning algorithm must meet the following criteria [3]: i) accommodate new classes, ii) incur minimal overhead when training new classes, iii) never retrain previously trained data, and iv) preserve previously acquired knowledge. One key challenge faced by incremental learning is concept drift, the change observed in the data distribution over time [4], which falls into two categories: i) virtual (covariate) concept drift, where changes are seen only in the input data distribution, and ii) real concept drift, where the relationship between the inputs and the target classes changes.
Neural network architectures achieve incremental learning through three families of methods: replay-based, regularization-based, and parameter isolation methods.
Replay-based methods: replay-based methods replay training samples from previous tasks alongside the new batch of training data to retain the acquired knowledge. The replay can be done in two ways: a) creating exemplars and b) generating synthetic data. In the deep incremental CNN model [24], exemplars of the old classes update the loss function when the model is trained with the new class samples. This architecture is used for large-scale incremental learning, where exemplars help correct the bias of the model that represents both the old and new classes. ELM++ [25], an incremental learning algorithm, creates a unique model for each incoming batch; the model is picked using the intensity mean of the test sample and of the trained classes. The authors included a data augmentation step that generates samples to accommodate old and new classes in the incremental model, using brightness adjustment, contrast normalization, random cropping, and mirroring. Deep generative replay performs continual learning of image classification tasks and comprises a deep generative model ("generator") and a task-solving model ("solver"). Brain-inspired replay (BI-R) [26] uses a variational auto-encoder to regenerate previously learned samples, serving the purpose of data augmentation; it also uses a Gaussian mixture model (GMM) to generate specific desired classes while regenerating trained samples.
Regularization-based methods: regularization-based methods use regularization terms to prevent the current task parameters from deviating too far from the previous task parameters. A deep incremental CNN architecture can use a strong distilling and classification loss in the last fully connected layer to effectively overcome the catastrophic forgetting problem. He et al. [27] proposed a two-step incremental learning framework with a modified cross-distillation loss function addressing the challenges of the online learning scenario.
Parameter isolation methods: parameter isolation methods eradicate catastrophic forgetting by dedicating a subset of the model's parameters from previous tasks to each current incremental task. Guo et al. [28] implemented incremental ELM (I-ELM), which trains online application data one by one or chunk by chunk using three alternatives: minimal-norm incremental ELM (MN-IELM), least-square incremental ELM (LS-IELM), and kernel-based incremental ELM (KB-IELM). Among the three methods, KB-IELM provides the best accuracy. Online sequential ELM (OS-ELM) [29] trains data sequentially, one by one or chunk by chunk, based on the recursive least-squares algorithm: in the sequential learning phase, for each new observation, OS-ELM calculates the current hidden and output layer weights from the previous hidden and output weights; a minimal sketch of this update appears below. Error-minimized ELM (EM-ELM) [30] adds random hidden nodes to the ELM network one by one or chunk by chunk and reduces the computational complexity by updating the output weights incrementally. Learn++ [31], an incremental training algorithm, trains an ensemble of weak classifiers on different distributions of samples; a majority voting scheme generates the final classification rule, which eliminates over-fitting and fine-tuning. ADAIN [32], a general adaptive incremental learning framework, maps the distribution function estimated on the initial dataset to the new chunk of data, and a pseudo-error value assigns misclassified instances a higher weight, yielding an efficient decision boundary. iLANN-SVD, a noniterative, incremental, hyperparameter-free learning method, updates the network weights for a new set of data by applying singular value decomposition to the previously obtained network weights.
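As an illustration of the recursive least-squares update that OS-ELM performs per chunk, here is a minimal numpy sketch; the sigmoid activation, the ridge term, and all variable names are our assumptions, so this is an outline of the published algorithm rather than a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OSELM:
    """Minimal OS-ELM sketch: initial batch solve, then per-chunk RLS updates."""
    def __init__(self, n_inputs, n_hidden, n_outputs, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W = rng.standard_normal((n_inputs, n_hidden))  # random input weights, never updated
        self.b = rng.standard_normal(n_hidden)              # random hidden biases
        self.beta = np.zeros((n_hidden, n_outputs))         # learned output weights
        self.P = None                                       # inverse correlation matrix

    def _hidden(self, X):
        return sigmoid(X @ self.W + self.b)

    def fit_initial(self, X0, T0):
        """Solve the first batch in one step (call this before partial_fit)."""
        H0 = self._hidden(X0)
        self.P = np.linalg.inv(H0.T @ H0 + 1e-6 * np.eye(H0.shape[1]))  # small ridge for stability
        self.beta = self.P @ H0.T @ T0

    def partial_fit(self, X, T):
        """Fold a new chunk into the model without revisiting old data."""
        H = self._hidden(X)
        K = self.P @ H.T @ np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
        self.P = self.P - K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```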
Our proposed work falls broadly under the first category (replay-based methods), as the correctly classified test samples of the current batch serve as augmented samples: the next set of images generates the incremental model together with these augmented samples. Table 1 (see Appendix) summarizes the strategies adopted by various neural network architectures to achieve incremental learning.
The rest of the article is organized as follows: the contributions listed in section 1 discuss the strategies used to achieve incremental learning. The proposed work, CAE-ELM, is detailed in section 2. The experimental results and the implementation of sample scenarios are found in section 3, which also discusses the pros and cons of CAE-ELM. Finally, section 4 concludes the paper with possible future work.

METHOD
CNN-ELM [33] is a recently developed deep neural network that replaces the multi-layer perceptron classifier part of a CNN with the extremely fast ELM neural network. The flattened features of the input image, retrieved by applying convolutional filters, are fed to the ELM to speed up the classification process. CNN is combined with ELM to obtain the following benefits: i) the convolutional filters in CNN capture an image's spatial dependencies, classifying it with significant precision; ii) ELM randomly assigns the hidden node parameters initially and never updates them; and iii) ELM learns the output weights in a single step. Thus, it gives the best generalization performance at a breakneck learning speed.
Exploiting the advantages of CNN-ELM, we propose the novel CAE-ELM, an incremental learning algorithm for classifying supervised images. CAE-ELM is designed to solve two issues: i) classifying images with varying dimensions and ii) performing incremental learning. CAE-ELM classifies images arriving in different batches with varying resolutions. CAE-ELM comprises the following: i) a CNN extracts features from the input images, ii) an SAE performs the dimensionality reduction, and iii) an ELM incrementally trains every batch. Figure 1 outlines the overview of the proposed work CAE-ELM. (Note: the CAE-ELM framework embeds only one CNN, SAE, and ELM architecture for the entire incremental image classification process.)
As CAE-ELM focuses on handling varied resolution images arriving in different batches, the CNN initially outputs varied size features from the flattened layer for every batch. Before the training process, we zero-pad the extracted varied size features from the CNN to a fixed length (the maximum length among the extracted features). The SAE dimensionally reduces the zero-padded features to ease the training process. Finally, the ELM neural network trains on and classifies the dimensionally reduced features extracted from the varied resolution input images.
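The zero-padding step itself is simple; here is a minimal numpy sketch under the assumption that features arrive as per-batch 2-D arrays (the helper name and shapes are ours):

```python
import numpy as np

def zero_pad_features(batches):
    """Zero-pad each batch's flattened CNN features to the maximum length seen.

    `batches` is a list of 2-D arrays, one per batch, of shape
    (n_samples, n_features_i), where n_features_i varies with image resolution.
    """
    max_len = max(b.shape[1] for b in batches)
    return [np.pad(b, ((0, 0), (0, max_len - b.shape[1]))) for b in batches]

# Example with the feature sizes reported in the paper (3920, 800, 720):
batches = [np.ones((4, 3920)), np.ones((4, 800)), np.ones((4, 720))]
padded = zero_pad_features(batches)
print([b.shape for b in padded])  # [(4, 3920), (4, 3920), (4, 3920)]
```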

CAE-ELM for image classification with varied resolution
In CAE-ELM, the CNN framework has two convolutional layers and two pooling layers arranged alternately: conv1 with ten 3×3 filters, pool1 with ten 2×2 filters, conv2 with twenty 3×3 filters, and pool2 with twenty 2×2 filters. In our proposed work, the images arrive in batches with varying resolutions each day, i.e., batch 1 holds images with 64×64, batch 2 carries images with 39×39, and batch 3 holds images with 32×32 resolution. The CNN with the designed framework produces flattened features of a different size for each batch of images fed into it; for example, batch 1, batch 2, and batch 3 yield 3920, 800, and 720 features, respectively. The output dimension of each convolutional layer in the CNN is ((w − f + 2p)/s + 1) × ((h − f + 2p)/s + 1) × n, where w is the image width, h is the height, p is the image padding, s is the stride, f is the convolutional filter size, and n is the number of convolutional filters applied. We zero-pad the extracted features from the varied batches to fixed-length features; for example, when the batch images end up with varied size features of 3920, 800, and 720, we zero-pad all features to the maximum length, 3920.
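For concreteness, the layer-size arithmetic can be checked with a short Python sketch, assuming no padding and stride 1 for the convolutions and stride 2 for the 2×2 pooling (these assumptions reproduce the 3920 and 720 figures quoted above):

```python
def conv_out(size, f, p=0, s=1):
    """Output spatial size of a convolution/pooling layer: (size - f + 2p)/s + 1."""
    return (size - f + 2 * p) // s + 1

def flattened_features(w, h):
    """Feature count after conv1(3x3) -> pool1(2x2) -> conv2(3x3) -> pool2(2x2)."""
    w = conv_out(conv_out(w, 3), 2, s=2)   # conv1 then pool1
    h = conv_out(conv_out(h, 3), 2, s=2)
    w = conv_out(conv_out(w, 3), 2, s=2)   # conv2 then pool2
    h = conv_out(conv_out(h, 3), 2, s=2)
    return w * h * 20                      # twenty filters in conv2

print(flattened_features(64, 64))  # 3920
print(flattened_features(32, 32))  # 720
```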
SAE dimensionally reduces the zero-padded features to ease the training process. The autoencoder generates a compact code representation, which successfully reconstructs the input image. The dimensionally reduced features from the stacked autoencoder are fed into the ELM, a single hidden layer neural network, which generates a model defined by its parameters (the random input weights, the hidden biases, and the learned output weights). Algorithm 1 explains how the CNN and SAE convert the varied resolution images to same-size features and how the ELM trains for image classification.
Algorithm 1. CAE-ELM training (outline)
For each arriving batch of images:
A. Extract features with the CNN:
   1. Apply the convolutional filters (conv1, conv2) to the input image.
   2. Apply the 2×2 pooling filters, which helps reduce the computation time, overfitting, and the need for substantial memory resources.
   3. Flatten the pooled output into a single 1-D feature array.
B. Reduce the zero-padded 1-D feature array to a constant-length feature vector using the stacked autoencoder, such that the reduced length is smaller than every extracted feature length.
C. Train the ELM neural network on the compact feature set and generate a classifier model. The hidden layer output is H = g(W·x + b), where g is the activation applied over the N′ hidden neurons whose weights W and biases b are randomly assigned, and the output weight matrix is β = H†T, where H† denotes the Moore-Penrose generalized inverse of H and T is the target matrix over the output neurons.
Endfor
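A minimal numpy sketch of the ELM training step (step C) follows; the tanh activation, the one-hot targets, and the function names are our choices, while the random hidden layer and the single pseudoinverse solve are the standard ELM recipe:

```python
import numpy as np

rng = np.random.default_rng(42)

def elm_train(X, T, n_hidden):
    """Single-step ELM training: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    W = rng.standard_normal((n_features, n_hidden))  # random input weights, never updated
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                     # Moore-Penrose solve: beta = H†T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: 100 samples, 30 SAE-reduced features, 5 one-hot classes
X = rng.standard_normal((100, 30))
T = np.eye(5)[rng.integers(0, 5, size=100)]
W, b, beta = elm_train(X, T, n_hidden=64)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```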

CAE-ELM for incremental learning
Incremental learning methodology learns the arriving data batches continuously, updates the current model's information while addressing the stability-plasticity dilemma, and overcomes the problem of catastrophic forgetting. CAE-ELM uses two types of testing to check the performance of the generated incremental model: i) individual testing, where the model is tested with class samples as in the individual training batches, and ii) incremental testing, where samples belonging to all the classes are tested using the generated model. CAE-ELM supports incremental learning of images with varied resolutions through the individual test samples of each arriving batch: individual test samples serve to synthesize new training samples for the incremental model, a form of data augmentation. Each batch of images is divided into training and testing sets. The CNN extracts the image features, and the SAE reduces them to a fixed number of features common to all the incoming data batches. The ELM creates a model by training on the samples in the training set, and the generated model parameters (random weights, biases, and output weights) classify the test samples of each batch. The proposed work feeds one half of the correctly classified test samples (50%) along with the following training batch to support incremental learning; in this way, the ELM learns the newly arrived classes while retaining the previously acquired knowledge about the old classes in the model parameters. The other half of the correctly classified samples, together with the misclassified samples, tests the updated model along with the subsequent batch's test set (this loop is sketched below). Thus, CAE-ELM offers the following advantages: i) it learns newly arriving classes, ii) it remembers the previously learned information on old classes, iii) it does not retrain any of the input samples, saving training time, and iv) it discards old training samples and models when it generates the updated model, saving memory space.
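The batch-wise flow can be outlined in Python as follows; this sketch reuses the hypothetical elm_train/elm_predict helpers from the previous listing, stands in for the CNN, zero-padding, and SAE stages with a placeholder extract_and_reduce function, and is an outline of the procedure rather than the authors' implementation:

```python
import numpy as np

def incremental_cae_elm(batches, extract_and_reduce, n_hidden=64):
    """Outline of CAE-ELM's incremental loop over arriving batches.

    `batches` yields (train_X, train_T, test_X, test_T) per batch;
    `extract_and_reduce` stands in for the CNN + zero-pad + SAE stages.
    """
    carry_X, carry_T = None, None           # augmented samples carried forward
    model = None
    for train_X, train_T, test_X, test_T in batches:
        X = extract_and_reduce(train_X)
        T = train_T
        if carry_X is not None:             # add half of the previous batch's
            X = np.vstack([X, carry_X])     # correctly classified test samples
            T = np.vstack([T, carry_T])
        model = elm_train(X, T, n_hidden)   # old model and raw data are discarded

        # Classify this batch's test set and keep half of the correct predictions
        Xt = extract_and_reduce(test_X)
        pred = elm_predict(Xt, *model).argmax(axis=1)
        idx = np.flatnonzero(pred == test_T.argmax(axis=1))
        half = len(idx) // 2
        carry_X, carry_T = Xt[idx[:half]], test_T[idx[:half]]
    return model
```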

RESULTS AND DISCUSSION
Experiments test the incremental performance of CAE-ELM using standard datasets such as MNIST, JAFFE, ORL, and FERET. Tables 2 to 5 demonstrate the outstanding incremental performance of CAE-ELM against the existing incremental methods. Section 3.3 discusses the pros and cons of using CNN, SAE, and ELM in CAE-ELM for feature extraction, feature size conversion (dimensionality reduction), and training for image classification.

Incremental learning results
We compared CAE-ELM against different incremental methods on the MNIST and CIFAR100 datasets as follows: a) MNIST dataset: the MNIST dataset comprises 10 classes, with a total of 20,000 (28×28) images.
Without adding new classes: for the MNIST dataset with a smaller number of classes, we compared CAE-ELM with the existing incremental learning algorithms Learn++ and ELM++, which use neural networks for image classification. We split the MNIST dataset into six sets (S1−S6), where each set holds samples from classes 0−9. Table 2 (case 1) and Figures 3(a) and 3(b) show the accuracy comparison between Learn++, ELM++, and CAE-ELM on the MNIST dataset when adding new images to already existing classes. The entire dataset was divided into six train sets (S1−S6) and one test set. Each train set consists of 10 classes (0−9) with 200 images per class; set S2 holds the same classes as S1 but different images, set S3 holds images present in neither S1 nor S2, and so on. The test set contains images from all ten classes.
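A disjoint split of this kind can be produced as follows (a minimal sketch; labels are assumed to be integer class ids, and 200 images per class per set follows the description above):

```python
import numpy as np

def disjoint_splits(labels, n_sets=6, per_class=200, n_classes=10, seed=0):
    """Split a dataset into `n_sets` train sets with disjoint images,
    each holding `per_class` samples of every class. Returns index arrays S1..Sn."""
    rng = np.random.default_rng(seed)
    splits = [[] for _ in range(n_sets)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        for s in range(n_sets):
            splits[s].append(idx[s * per_class:(s + 1) * per_class])
    return [np.concatenate(chunks) for chunks in splits]
```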
For the CIFAR100 dataset with a larger number of classes, we compared CAE-ELM against the brain-inspired replay, end-to-end, and large-scale incremental learning methods. We split the dataset into five train sets (S1−S5), where every set trains 20 new classes in addition to the old classes. Table 3 and Figure 4 present the results obtained.

Implementation of application scenario
The proposed CAE-ELM incremental neural network is deployed in various labs accessed by different groups of students. The students' images are trained beforehand by CAE-ELM, with the training images captured by cameras of varying resolutions fixed in the different labs. For example, consider five batches (B1, B2, B3, B4, and B5) and four labs (Python, R, MATLAB, and Java). Every batch holds 10 students, where each student represents a unique class. Every student learns two different subjects and has access to the corresponding labs. The Python lab captures images with 64×64 resolution, the R lab with 39×39, the MATLAB lab with 32×32, and the Java lab with 28×28. The students' images are captured by these lab cameras at varying resolutions; for example, batch B1 students have MATLAB and Java classes, so their images are captured in two different resolutions, 32×32 and 28×28. CAE-ELM handles and trains on all the students' images captured at varying resolutions. The CAE-ELM model tests each captured image and marks the student's presence in the lab. Table 4 shows the different labs attended by different batches of students.

Discussions
Pros and cons of CAE-ELM
The advantages of CAE-ELM lie mainly in addressing the varied resolution images arriving in different batches and creating a single updated model for them without forgetting the learned knowledge. The data augmentation method helps preserve the learned information of every batch and propagates it to the upcoming batches. The use of CNN and ELM enhances the efficiency of the proposed work by extracting the best features and providing a faster training time. CAE-ELM stays efficient in i) memory, by discarding all the old models and data, and ii) time, by avoiding the retraining of old samples.
ELM stands superior to other conventional methods like CNN because of its fast training time. The advantage of using ELM in CAE-ELM lies in replacing the backpropagation part of CNN for training; without doubt, ELM is a better choice than the conventional backpropagation process. However, ELM lacks efficiency in the feature extraction process, specifically for red, green, and blue (RGB) images; in such cases, CNN is definitely the better choice, which is why we use CNN for feature extraction. The combined use of CNN and ELM in CAE-ELM thus shows a significant improvement in classification accuracy.
However, when does the performance of CAE-ELM become a concern? CAE-ELM becomes time-consuming when the varied resolution images are very high dimensional, i.e., 2048×1536 or 7680×6876. Running the CNN for feature extraction followed by the stacked autoencoder to convert the features to the same size may be time-consuming; even though the target accuracy might be achieved, the time consumed in running CNN and SAE together washes away the advantage of avoiding retraining samples. In such cases, other efficient techniques must be used to handle the high-dimensional varied resolution images. Even so, with very high-dimensional images, the use of ELM in place of backpropagation for image classification compensates to some extent for the time spent on the SAE.

Pros of ELM training over the use of back-propagation
The CNN extracts the spatial features of images with the use of convolutional filters. After feature extraction and reduction, the ELM classifier replaces backpropagation for training on the datasets. ELM does not undergo any iterations to find the trained weights, thus requiring less time to train a model, even with a large number of hidden neurons. ELM is also resistant to overfitting, except on tiny datasets; as the number of training samples increases, ELM provides good performance accuracy.
Using the advantages of CNN and ELM, the proposed CAE-ELM overcomes the memory requirement of a large dataset by training each batch of arriving data independently to create individual models, which avoids retraining previous batch data. CAE-ELM works better than the existing algorithms Learn++ and ELM++ for the following reasons: i) CAE-ELM uses the best feature extractor, CNN, whereas ELM++ extracts only minute features of the image; and ii) the training time decreases in CAE-ELM, as it uses a feedforward ELM network that involves no iterations, whereas Learn++ uses multiple neural network classifiers that iterate to backpropagate errors.

Challenges addressed
From the illustration shown in section 2.3, it is clear that CAE-ELM addresses the challenges of incremental learning. Irrespective of the classes added or removed in batches arriving at different times, and irrespective of the varying image features, CAE-ELM adapts the model dynamically, learns all the classes, and never forgets previously learned classes, addressing both the stability-plasticity dilemma and catastrophic forgetting.

CONCLUSION
In this paper, our proposed incremental algorithm CAE-ELM learns varied resolution images arriving in different batches efficiently, both in memory and in time, using the ELM neural network. Except for very high-dimensional images, the use of the stacked autoencoder helps the incremental model accommodate information about varied resolution images. Adding new classes to, or removing old classes from, forthcoming batches never affects the learning performance of CAE-ELM. Only the most recently updated model is retained with all previously learned information, discarding all other trained models and data and saving memory space. No input sample is retrained, which saves training time; instead, CAE-ELM uses the correctly classified test samples as an augmented input source to achieve an efficient incremental model. The current incremental version of CAE-ELM works well for varying image datasets. Our future work is to design an incremental algorithm for time series data with concept drift.

Suresh Jaganathan
Associate Professor in the Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, with more than 26 years of teaching experience. He received his PhD in Computer Science from Jawaharlal Nehru Technological University, Hyderabad, his M.E. in Software Engineering from Anna University, and his B.E. in Computer Science & Engineering from Madurai Kamaraj University, Madurai. He has more than 30 publications in refereed international journals and conferences. He also holds two patents in the area of image processing and has written a book, "Cloud Computing: A Practical Approach for Learning and Implementation", published by Pearson. He is an active reviewer for reputed journals (Elsevier: Journal of Network and Computer Applications, Computers in Biology and Medicine) and a co-investigator for the SSN-NVIDIA GPU Education Center. His areas of interest are distributed computing, deep learning, data analytics, machine learning, and blockchain technology. He can be contacted at email: sureshj@ssn.edu.in.

Dattuluri Venkatavara Prasad
Professor in the Department of Computer Science and Engineering with more than 20 years of teaching and research experience. He received his B.E. degree from the University of Mysore, his M.E. from Andhra University, Visakhapatnam, and his PhD from Jawaharlal Nehru Technological University, Anantapur. His PhD work is on "Chip area minimization using interconnect length optimization". His areas of research are computer architecture, GPU computing, and machine learning. He is a member of IEEE and ACM, and a life member of CSI and ISTE. He has published a number of research papers in international and national journals and conferences. He has published two textbooks: Computer Architecture and Microprocessors/Microcontrollers. He is a principal investigator for the SSN-NVIDIA GPU Education/Research Center. He can be contacted at email: dvvprasad@ssn.edu.in.