Monitoring Indonesian online news for COVID-19 event detection using deep learning

Even though coronavirus disease 2019 (COVID-19) vaccination has been carried out, preparedness for a possible next outbreak wave is still needed as new mutations and virus variants emerge. A near real-time surveillance system is required to enable stakeholders, especially the public, to respond in a timely manner. Because of its hierarchical structure, epidemic reporting is usually slow, particularly when it crosses jurisdictional borders. This can delay public awareness of new and emerging infectious disease events. Online news is a potential source for COVID-19 monitoring because it reports almost every infectious disease incident globally. However, the news reports not only COVID-19 events but also other COVID-19-related topics such as economic impact and health tips. We developed a framework for online news monitoring and applied sentence classification to news titles using deep learning to distinguish COVID-19 event news from non-event news. The classification results showed that a bidirectional encoder representations from transformers (BERT) model pre-trained on Bahasa Indonesia and fine-tuned on our dataset achieved the highest performance (accuracy: 95.16%, precision: 94.71%, recall: 94.32%, F1-score: 94.51%). Notably, our framework identified news reporting the new COVID-19 strain from the United Kingdom (UK) as event news 13 days before Indonesian officials closed the border to foreigners.


INTRODUCTION
The rapid global movement of people and goods increases the latent danger and potential spread of infectious diseases beyond international borders [1]. The coronavirus disease 2019 (COVID-19) pandemic has taught us that preparedness for emerging infectious diseases (EIDs) is urgently needed. Such preparedness could take the form of an early-warning monitoring system that allows stakeholders to conduct appropriate assessments, rapid responses, and collaborations at regional, national, and global scales [1]. In particular, the Association of Southeast Asian Nations (ASEAN) region is at risk from infectious disease threats due to its high population density, mobility, and socio-economic development with inadequate public health services [2]. Rapid outbreak detection ensures timely reactions that minimize morbidity and mortality and mitigate social and economic disruption.
The conventional system for monitoring public health depends on a hierarchical structure in which health care providers, laboratories, and health institutions report infectious diseases on the list of monitored categorization problem. Another approach is creating a hybrid of CNN, LSTM, and multi-layer perceptron (MLP) for text classification, such as in sentiment analysis [20]. Compared to conventional NLP, deep learning models hide much of the complexity of the classification procedure: the classifiers automatically extract the determinant features and predict results based on the data fed in during training. Bangyal et al. [21] obtained similar results: deep learning models, especially bidirectional LSTM (BiLSTM) and CNN, gave the best performance compared to non-deep-learning classification algorithms in the text classification task. The introduction of transformer models in NLP, such as the generative pre-trained transformer (GPT) [22] and bidirectional encoder representations from transformers (BERT) [23], improved performance on NLP tasks because the transformer adopts an encoder-decoder architecture with an attention mechanism. Furthermore, training parallelization in the transformer architecture makes it possible to build a pre-trained model from a large dataset. The model can then be retrained and fine-tuned for specific tasks. Qasim et al. [24] showed that fine-tuned BERT-based transfer learning can improve text classification. Thus, the transformer architecture has become the state of the art in natural language processing tasks. As one implementation of this architecture, BERT has been adopted and retrained on datasets in various languages. For example, a BERT model newly pre-trained on a large Bahasa Indonesia dataset is called IndoBERT [16].
However, the model has yet to be fine-tuned for a specific task, such as text classification with an additional labeled dataset. This paper presents a framework for monitoring an infectious disease using online news. We use COVID-19 as the case because it is the latest infectious disease and attracts public attention. We explored sentence classification to classify news titles as COVID-19 events or non-events using machine learning methods, from classical to deep learning models. This paper extends the previously proposed classifiers [25] to the COVID-19 dataset. In addition, it implements BERT as the state-of-the-art language model for classification, using a version trained on Bahasa Indonesia corpora (IndoBERT), which is the latest benchmark for Indonesian NLP [16].
This paper makes three contributions. First, it shows that a framework for monitoring COVID-19 infection events from Indonesian online news portals can be used to filter COVID-19 event news. To the best of our knowledge, our framework may be one of the first in which Indonesian online news articles are used for text classification and event monitoring, especially for infectious diseases. The framework could easily be scaled up to other EIDs, such as dengue fever, as shown in [25]. Further, the framework could also be cloned for other local languages given appropriate data for model training. Second, this paper offers insight into Indonesian text classification using conventional machine learning, deep learning, and the latest language model, BERT: to achieve higher performance with IndoBERT, fine-tuning on the respective dataset is needed. Third, it shows how the framework can provide a geo-information visualization of COVID-19-related event news coverage from Indonesian online news portals.
The remainder of this paper is organized as follows. In section 2, we present the proposed framework for monitoring COVID-19 events using online news and deep learning. In section 3, we describe our experimental results and discuss the significance of our study. Finally, we conclude the paper in section 4.

METHOD
We built a framework for monitoring Indonesian online news, as shown in Figure 1. First, online news is collected by crawling Indonesian online news portals and stored in a database. Vote-based data labeling is used in the data preparation. After the data preparation, we transform the input text into a feature matrix that is fed to the classifier models for training. The trained classifier model is then used to label news articles as event or non-event news. The labeled data are stored in a database, and a REST API is provided for applications to retrieve the data. The following subsections explain each procedure in detail.

Data acquisition
The news titles were acquired from seven Indonesian news portals: Tirto, Tempo, Republika, Merdeka, Kompas, Detik, and Antara. These news portals were selected because of their reliability and speed of news delivery, and because they are verified by the Indonesian press council. We obtained 20,431 news titles by crawling the articles using several keywords: corona, covid, and COVID-19. The articles were collected from January 26th until May 24th, 2020. In the crawling process, we gathered the timestamp, title, contents, and metadata.
Rather than using the entire article, we use only the news title for classification. This approach is chosen because a title commonly summarizes and represents the article's content. The smaller text size also reduces the computing cost of model training. Before data preparation, the title is extracted and deduplication is conducted. Deduplication eliminates duplicate titles that occur when an online news article is divided into several parts or pages.
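The deduplication step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the normalization rule (lowercasing and collapsing whitespace) and the function names are hypothetical; the paper only states that duplicate titles from multi-page articles are removed.

```python
def normalize_title(title: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization rule)."""
    return " ".join(title.lower().split())


def deduplicate_titles(titles):
    """Keep the first occurrence of each title, preserving crawl order."""
    seen = set()
    unique = []
    for title in titles:
        key = normalize_title(title)
        if key not in seen:
            seen.add(key)
            unique.append(title)
    return unique
```

Keeping the first occurrence preserves the earliest crawled copy of a multi-page article, which is usually the one carrying the original timestamp.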

Data preparation
Online news covers a wide range of information, from COVID-19 infection status to economic impacts, and from health tips to celebrities' activities during the pandemic. For example, two news titles from Merdeka, one of the Indonesian news portals, are 'Konser amal satu Indonesia kumpulkan donasi miliaran rupiah untuk lawan COVID-19' (English: One Indonesia charity concert collects billions of rupiah donations to fight COVID-19) and 'Camat Tambora gelar tes swab massal setelah 30 warga terpapar COVID-19' (English: Tambora Sub-district head holds mass swab test after 30 residents were exposed to COVID-19). The first title can be considered a non-COVID-19 event, while the second is related to a COVID-19 event. Hence, in data preparation, we separate online news titles that report COVID-19 events from those that do not.
The labeling process used a vote-based approach. Three respondents labeled each news title according to the labeling criteria, and the title's label was then determined by the majority of votes. We divided the data into two classes that represent news relevant to COVID-19 events and non-event news:
− 0 (negative): non-event class. News surrounding COVID-19 information that reports non-infection incidents, for example, health recommendations to avoid COVID-19, donations for COVID-19, and others.
− +1 (positive): event class. News that reports COVID-19 infection events, such as positive/negative case(s) occurring or increasing/decreasing, color changes for infected area zones, area lockdowns, and others.
Table 1 shows examples of news titles for each data class, and Figure 2 shows the distribution of online news classes extracted from each portal. The numbers of negative and positive titles are 12,292 and 4,549, respectively.
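The vote-based labeling can be illustrated with a short sketch. The two titles and their final classes come from the examples given earlier in this section; the individual votes and the function name `majority_label` are hypothetical and shown only for illustration. With three respondents and two classes, a tie cannot occur.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most respondents (0 = non-event, 1 = event)."""
    label, _ = Counter(votes).most_common(1)[0]
    return label


# Hypothetical votes from three respondents for two titles.
votes_per_title = {
    "Konser amal satu Indonesia kumpulkan donasi miliaran rupiah untuk lawan COVID-19": [0, 0, 1],
    "Camat Tambora gelar tes swab massal setelah 30 warga terpapar COVID-19": [1, 1, 1],
}

labels = {title: majority_label(votes) for title, votes in votes_per_title.items()}
```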

Event detection classifiers
For the classifiers, we used three deep learning models (MLP, CNN, and LSTM) and, for comparison, conventional machine learning models (naive Bayes, logistic regression, decision tree, support vector machine (SVM), AdaBoost, and neural network). Further, we also investigated the latest Indonesian benchmark for NLP, IndoBERT [16], which is a BERT model [23] pre-trained with Indonesian corpora.
Textual data features are created by transforming text into a vector of term frequency-inverse document frequency (TF/IDF) values [26]. Another feature is word embedding [27], [28], which transforms text into a fixed-size dense vector representation. We use various deep learning classifiers to classify textual data in those vector representations: MLP, CNN, LSTM, and BERT.
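As an illustration of the first feature type, the sketch below computes a TF/IDF matrix in plain Python. It assumes whitespace tokenization and one common weighting variant (relative term frequency multiplied by log(N/df)); the exact tokenizer and weighting used in the experiments are not specified in this paper, and library implementations such as Scikit-learn apply additional smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one TF/IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty titles
        row = [
            (counts[tok] / length) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix
```

A term that appears in every title (such as "covid" in a COVID-19 corpus) receives a weight of zero under this variant, which is why TF/IDF emphasizes the words that distinguish one title from another.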
MLP is a neural network model whose architecture consists of multiple layers: an input layer, one or more hidden layers, and an output layer. The nodes in the hidden and output layers apply nonlinear activation functions. These layers form a feed-forward structure, that is, the outputs of one layer become the inputs of the next layer. MLP is trained using the backpropagation technique [29], [30]. Although the architecture is simple, it performs surprisingly well. The model benefits from its hierarchical network structure and its ability to learn a representation of the given data that takes every feature into account in classification problems. The architecture is shown in Figure 3.
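The feed-forward structure can be sketched as a forward pass through one hidden layer. This is a toy illustration, not the network in Figure 3: the layer sizes and weights are hypothetical, and the choice of ReLU in the hidden layer with a sigmoid output is an assumption based on the activation functions mentioned for the experiments. Training by backpropagation is not shown.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Outputs of the hidden layer become inputs of the output layer."""
    hidden = dense(features, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)[0]


# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
probability = mlp_forward(
    [1.0, 0.0],
    hidden_w=[[1.0, -1.0], [-1.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[2.0, -2.0]],
    out_b=[0.0],
)
```

The returned value can be read as the estimated probability of the event class, with 0.5 as a natural decision threshold.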
CNN extends MLP by adding a convolution layer, which acts as a regularizer to avoid overfitting [31]. In text analysis, a series of window filters creates hierarchical, simpler patterns from a complex input pattern. At the end of the process, classification is performed by adding a fully connected layer and an output layer, as in MLP. The architecture is shown in Figure 4.
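To make the window filter concrete, the following sketch slides a single filter over consecutive word vectors, applies ReLU, and keeps the maximum response (max-over-time pooling). The embedding values, the filter, and the use of max pooling are illustrative assumptions; the actual layer configuration is the one in Figure 4.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Apply one text filter over word windows, then take the maximum activation.

    embeddings: list of word vectors (one per token in the title)
    kernel: list of weight vectors, one per position in the window
    """
    window = len(kernel)
    if len(embeddings) < window:
        raise ValueError("title is shorter than the filter window")
    activations = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            word_vector = embeddings[start + offset]
            total += sum(k * e for k, e in zip(kernel[offset], word_vector))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)


# Hypothetical 2-dimensional embeddings for a three-word title and a window of 2.
title_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max_pool(title_vectors, bigram_filter)
```

In a full model, many such filters run in parallel and their pooled outputs form the input of the fully connected layer.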
LSTM [32] is a specialized recurrent neural network (RNN) model capable of learning dependencies among items in a sequence. LSTM's main component is a cell that regulates the information flow through an input gate, an output gate, and a forget gate that discards unnecessary information. For inter-related inputs, the process is repeated step by step, which implies a temporal sequence. This process is beneficial for tasks such as data-related prediction and classification. The architecture is shown in Figure 5.
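The gating behavior can be sketched with the standard LSTM update equations for a single-unit cell with scalar input and state. The parameter layout (one weight for the input, one for the previous hidden state, and a bias per gate) and all names here are hypothetical simplifications; the model in Figure 5 uses vector-valued states.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each gate name to (input_weight, recurrent_weight, bias).
    """
    def gate(name, squash):
        w, u, b = params[name]
        return squash(w * x + u * h_prev + b)

    forget = gate("forget", sigmoid)          # how much old cell state to keep
    write = gate("input", sigmoid)            # how much new information to store
    read = gate("output", sigmoid)            # how much of the cell to expose
    candidate = gate("candidate", math.tanh)  # proposed new content

    c = forget * c_prev + write * candidate
    h = read * math.tanh(c)
    return h, c
```

Processing a title means calling `lstm_step` once per token, feeding the returned `h` and `c` into the next call.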
BERT is a transformer-based machine learning technique used for various natural language tasks. In this experiment, we use a BERT model pre-trained with Bahasa Indonesia and fine-tune it on our dataset for classification. The architecture is shown in Figure 6. In our experiments, we used term frequency/inverse document frequency (TF/IDF) as input features, a combination of sigmoid and rectified linear unit (ReLU) activation functions, and Adam [33] as the optimizer. For evaluation purposes, we also used the following conventional machine learning classifiers.

− Naive Bayes is a supervised learning algorithm that applies Bayes' theorem with the "naive" assumption of conditional independence between each pair of features given the value of the class variable. Maximum a posteriori (MAP) estimation is used to estimate the relative probability of a given class from the training set [34].
− Logistic regression is a learning algorithm based on a statistical model that uses the logistic function. The logistic function is derived from defined constraints to evaluate the probability that data belong to the given class [35].
− Decision tree is a learning model in the non-parametric supervised method category. The method learns simple decision rules inferred from the data features to create a model that predicts the value of a target variable [36].
− Support vector machine is an algorithm that performs classification by constructing a hyperplane or set of hyperplanes in a high-dimensional space. The separation of the data by the hyperplane is used to classify the data into the given classes [37].
− AdaBoost is a classifier based on a meta-estimator. The method first fits a classifier to the original dataset and then adds weight-adjusted copies of the same classifier. The procedure is repeated to give better classification results [38].
− Neural network is inspired by the biological brain. Input is represented by many features, where each feature is involved in all possible inputs. Backpropagation is used to adjust the weights of the network. The weight adjustment is repeated to decrease the difference between the actual output and the desired output [29].
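As one example of these baselines, the sketch below implements a small multinomial naive Bayes classifier over word counts. It is a toy illustration rather than the configuration used in the experiments: add-one (Laplace) smoothing, whitespace tokenization, and the four training titles are assumptions made here for demonstration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(titles, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for title, label in zip(titles, labels):
        word_counts[label].update(title.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, title):
    """Pick the class with the highest log posterior (MAP decision)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in title.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set (1 = event, 0 = non-event).
toy_titles = [
    "pasien positif bertambah",
    "kasus positif meningkat",
    "tips sehat di rumah",
    "donasi untuk warga",
]
toy_labels = [1, 1, 0, 0]
toy_model = train_naive_bayes(toy_titles, toy_labels)
```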

Mobile application implementation
Because current EID event information is scattered across the Internet, users have to search for the information themselves. In line with the current technology trend in which data should come to the user, our framework includes an API that enables developers of other applications and systems to use the data and information generated by the framework. A mobile application named COVID-19 SIAPP is shown in Figures 7 to 10. SIAPP is an abbreviation for sistem informasi aplikasi pemantauan dan peringatan dini (English: information system of monitoring and early warning application). The mobile application was developed to show a prospective application of our approach.
Figure 9. Event and non-event filter menu
Figure 10. Home screen after event filter selected: non-event news is not displayed
Figure 7 shows the home screen of the COVID-19 SIAPP application. The primary function of SIAPP is to display news collections from various portals. Currently, the news portals that can be accessed are Antara, CNN Indonesia, Detik, Kompas, Kumparan, Merdeka, Republika, Tempo, Tirto, and Viva. COVID-19 SIAPP accesses online news articles regarding COVID-19 through the REST API.
As seen in Figure 8, the main display shows a list of news in which event news (orange-bordered news: "Pasien COVID-19 di NTT bertambah 34 menjadi 1.246 orang", English: "COVID-19 patients in NTT increased by 34 to 1,246 people") is still mixed with non-event news (blue-bordered news: "Bisnis hotel bangkit dari keterpurukan dampak pandemi COVID-19", English: "The hotel business rises from the downturn due to the COVID-19 pandemic"). Users can use the filter menu in Figure 9. If the event filter is selected, non-event (informational) news is no longer displayed when the user returns to the home display. Only event news appears, as shown in Figure 10.
The application specification covers the mobile front end and the backend. For the mobile front end, the specification is IDE: Android Studio, programming language: Dart, and mobile framework: Flutter. For the backend, the specification is database: MongoDB, Elastic; communication schema: REST API; programming language: Python; and framework: Flask.
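The event filter served by the backend can be sketched as a plain Python function that selects labeled articles and serializes them as JSON. This is a hypothetical illustration: the field names (`title`, `portal`, `label`), the response shape, and the function names are assumptions, and the Flask routing and database access of the actual backend are not shown. The two titles and their labels are the examples described for Figure 8.

```python
import json

# Example records; field names are assumed for illustration.
SAMPLE_ARTICLES = [
    {"title": "Pasien COVID-19 di NTT bertambah 34 menjadi 1.246 orang", "label": 1},
    {"title": "Bisnis hotel bangkit dari keterpurukan dampak pandemi COVID-19", "label": 0},
]


def filter_news(articles, only_events=False):
    """Return all articles, or only those labeled as event news (label == 1)."""
    if only_events:
        return [article for article in articles if article["label"] == 1]
    return list(articles)


def to_json_response(articles):
    """Serialize the selected articles as a JSON payload for the client."""
    payload = {"count": len(articles), "items": articles}
    return json.dumps(payload, ensure_ascii=False)


event_response = to_json_response(filter_news(SAMPLE_ARTICLES, only_events=True))
```

Filtering on the server keeps the mobile client simple: the filter menu only needs to toggle one request parameter.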

Assessment
Performance
In this study, we assessed deep learning models and compared them with conventional machine learning models. First, we used the TF/IDF feature for both groups of models. Second, we assessed the performance of deep learning models using word embedding, because word embedding could increase deep learning performance. The evaluation was done by k-fold cross-validation. To avoid having too few test data in each fold, we used k=5 instead of 10. Additionally, we used early stopping based on the loss value to avoid overfitting. Further, we also investigated the BERT model for our problem, both original and fine-tuned. The experiments were carried out using Python 3 with Scikit-learn as the machine learning library, and Keras, TensorFlow, and Transformers as the deep learning libraries.
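The 5-fold split can be sketched as follows. This is a minimal illustration in plain Python; shuffling with a fixed seed and the function name `k_fold_indices` are assumptions, since the paper does not state whether the folds were shuffled or stratified. The sample count of 16,841 is the sum of the negative and positive titles reported in the data preparation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(16841, k=5), start=1):
    # A model would be trained on train_idx and evaluated on test_idx here.
    print(fold_number, len(train_idx), len(test_idx))
```

With k=5, each test fold holds roughly 3,368 titles, about twice as many as with k=10.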

News coverage
In addition to classifier performance, we also looked at the news coverage area in our dataset. We used named entity recognition [39], [40], provided by Prosa.ai, to identify geo-information in the news titles. Outputs labeled as geo-political entity (GPC), facilities (FAC), location (LOC), and product (PRO) were then transformed into a map to visualize the news coverage. The visualization is useful for identifying the areas where events are reported in the news.
Deep learning models (MLP, CNN, and LSTM) could not outperform logistic regression when the same feature matrix (TF/IDF) was used as the input. However, when word embedding was used in the deep learning models, their performance could compete with that of the best classical machine learning model. The improvement arises because word embedding is more suitable for neural network models: it keeps the order and interaction of the words within sentences and the probability functions of each word sequence. All deep learning models reached an accuracy of about 92%. CNN excelled compared to the other deep learning models in all metrics (accuracy: 92.87%, precision: 91.30%, recall: 89.95%, F1-score: 90.57%). Additionally, CNN with word embedding outperformed logistic regression in three of the four metrics; the exception was precision, where it trailed by 1%. However, the dataset is not balanced, and F1-score is frequently used as the indicator in that case. In F1-score, CNN outperformed logistic regression by 1.24%.
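Because the comparison relies on F1-score for an imbalanced dataset, it may help to show how the four metrics follow from the prediction counts. The sketch below computes them for the positive (event) class; whether the reported values were computed per class or averaged across classes is not stated in this paper, so this choice is an assumption, and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 event titles and 5 non-event titles.
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

In this toy case the accuracy (0.75) is higher than the F1-score (about 0.67), which illustrates why accuracy alone can look favorable when the negative class dominates.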
Using the pre-trained BERT model without fine-tuning, the performance dropped significantly. We believe that this low performance occurred because the model had not been trained with online news corpora. However, fine-tuning increased the performance significantly in all four metrics compared to the deep learning models with word embedding. The differences between fine-tuned BERT and CNN with word embedding are as follows: accuracy: 2.29%, precision: 2.87%, recall: 4.37%, F1-score: 3.94%.
Based on the results, we find that fine-tuned BERT for news title classification achieved the best performance in every aspect of the evaluation. As for the news area coverage, we acquired 574 unique geo-information items. The geo-information consists of countries (e.g., China, Germany, and the United States), regions (e.g., Europe, Latin America, and Southeast Asia), global cities (e.g., New York, Tokyo, and Wuhan), provinces (i.e., Indonesian provinces, such as Aceh, Bali, and DKI Jakarta), local cities (e.g., Bojonegoro, Depok, and Makassar), and facilities (e.g., hospital, market, and boarding facility). The number of items in each geo-information category is shown in Table 3. Spelling variants that indicate the exact same location were eliminated. The number of unique geo-information items is shown in the "After Deduplication" column. We then mapped global cities into countries, counted the number of event-news articles found for each country, and plotted the counts as color codes on a global map, as shown in Figure 11. The colors in Figure 11 are graded from light yellow to red, indicating low to high numbers. For comparison, we used the COVID-19 pandemic timeline from the website of the Coronavirus Resource Center of Johns Hopkins University as of May 24th, 2020, as shown in Figure 12. The colors in Figure 12 are graded from light orange to dark orange, indicating low to high numbers. We can observe that our mapping in Figure 11 closely resembles the global COVID-19 timeline shown in Figure 12. In Figure 11, some regions such as South America, Africa, and Europe seem to have no news coverage, even though a small number of event-news articles cover those regions. This is because we did not map event-news articles whose geo-information belongs to the region category (such as Africa, Latin America, and East Europe) into countries.
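The mapping and counting step can be sketched as below. The lookup table, its entries, and the function name are hypothetical and cover only a few of the locations mentioned as examples; region-level entries such as "Europe" are deliberately left unmapped, mirroring the choice described above.

```python
from collections import Counter

# Hypothetical lookup from normalized location names to countries.
LOCATION_TO_COUNTRY = {
    "wuhan": "China",
    "china": "China",
    "tokyo": "Japan",
    "new york": "United States",
}


def count_events_by_country(locations):
    """Count event-news locations per country; unmapped names are skipped."""
    counts = Counter()
    for location in locations:
        key = " ".join(location.lower().split())  # merge simple spelling variants
        country = LOCATION_TO_COUNTRY.get(key)
        if country is not None:
            counts[country] += 1
    return dict(counts)


event_locations = ["Wuhan", "China", "New  York", "Europe"]
counts_by_country = count_events_by_country(event_locations)
```

The resulting per-country counts are what a choropleth map would translate into color grades.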
Furthermore, the results show that even though we use Indonesian online news, the news reports almost all important global COVID-19 events. As an additional note, Indonesian areas appear in red, which indicates a significantly higher number of events in the online news than for China. This is because the dataset consists of Indonesian online news. Hence, it is expected that news reporting COVID-19 events focuses mainly on Indonesia.
With a similar method, we mapped local cities into provinces, counted the number of event-news articles found for each province, and plotted the counts as color codes on a map of Indonesia, as shown in Figure 13. For comparison, we acquired timeline data from the COVID-19 Indonesia Dataset on Kaggle and plotted the data, as shown in Figure 14. The colors in Figure 13 and Figure 14 are graded from light yellow to red, indicating low to high numbers. The two figures show a similar pattern. Java Island is the hot spot (red and orange areas) in Indonesia for COVID-19 events. Additionally, we observed that in some provinces a relatively high number of COVID-19 events in the online news corresponds to the number of cases in those provinces: North and South Sumatra on Sumatra Island; North and South Sulawesi on Sulawesi; Bali and West Nusa Tenggara; and Papua on Papua Island. However, there are exceptions in Riau and Yogyakarta provinces, where the number of news articles was significantly higher compared to the actual case number. Based on this result, there appears to be some correlation between the number of event-related articles in online news and the actual number of COVID-19 cases. However, further study is needed before the online news count is used for further analyses, such as early warning. As an additional remark, we implemented our framework and monitored COVID-19 events until December 31, 2020. In this prototype, our framework identified the new variant from the United Kingdom (UK). For example, the news titled 'Varian baru virus corona teridentifikasi di Inggris' (English: New coronavirus variant identified in England), published by the Merdeka portal on December 15, 2020, was labeled as event news by our classifier. As the pandemic lasts longer than earlier predicted and the new normal begins, the public may be unaware of such information. An online news-based monitoring application may help the public to get the information at hand.

CONCLUSION
This paper presented a framework for online news monitoring and classification. It is used for COVID-19 event detection from collected news titles using deep learning. In this study, we investigated various deep learning models and compared them to conventional machine learning. The common deep learning models (MLP, CNN, and LSTM) excel compared to conventional machine learning when word embedding is used, and CNN achieved the highest performance among them (accuracy: 92.87%, precision: 91.30%, recall: 89.95%, F1-score: 90.57%). However, the fine-tuned Indonesian pre-trained BERT model gives significantly better performance in all evaluation metrics than the other classification models, with accuracy: 95.16%, precision: 94.71%, recall: 94.32%, and F1-score: 94.51%.
From the application viewpoint, our event detection framework is able to detect COVID-19 events from online news not only from Indonesia but also from 51 countries. One of our highlights is that our implementation identified the UK strain event 13 days before the local authority announced the international entry restriction (December 28th, 2020). With this kind of insight in the hands of the public, the framework could raise public awareness prior to a formal announcement and drive risk reduction ahead of the next wave of the pandemic.