Machine learning model for clinical named entity recognition

Received Feb 22, 2020; Revised Aug 1, 2020; Accepted Sep 25, 2020

Named entity recognition (NER) is the most widely used NLP task for extracting important concepts (named entities) from clinical notes. The literature shows that several researchers have used machine learning models extensively for clinical NER. Medical named entity recognition and normalization are the most fundamental of the medical data mining tasks. Medical NER differs from general NER in several ways. A huge number of alternate spellings and synonyms causes the vocabulary size to explode, which reduces the efficiency of medical dictionaries. Entities often consist of long sequences of tokens, which makes it harder to detect their boundaries exactly. Notes written by clinicians are loosely structured and use minimal grammar and cryptic shorthand, which poses challenges for named entity recognition. NER systems are generally either rule based or pattern based. The rules and patterns do not generalize because of the diverse writing styles of clinicians. Systems that use a machine learning based approach to resolve these issues focus on choosing effective features for building the classifier. In this work, a machine learning based approach has been used to extract clinical data in the required form.


INTRODUCTION
Patient data, ranging from diagnoses, treatments, problems and medications to imaging and clinical notes such as discharge summaries, are available in electronic health records (EHR). Structured data are important for quality, billing and outcome measurement. Narrative text, on the other hand, is more engaging and more expressive, and captures the patient's data more accurately. Clinical notes also convey the level of concern and uncertainty to others who review the note. Hence, to obtain a clear perspective on the condition of the patient, the narrative text needs to be analyzed. However, manual analysis of a huge volume of narrative text is time consuming and prone to errors.
To resolve this issue, machine learning based systems can be used. The literature shows that various machine learners have been applied; support vector machines (SVMs) [1] and hidden Markov models (HMMs) [2] are examples. Natural language processing (NLP), which focuses on developing models for understanding natural language [3], is used for this purpose. An NLP framework includes modules for syntactic processing such as tokenization, part-of-speech tagging and sentence detection. NLP systems also include modules for named entity recognition tagging, relation extraction and concept identification. An NLP system that has semantic processing models for extracting pre-defined information is an information extraction system. In the medical field, researchers use NLP systems to identify biomedical concepts and clinical syndromes from radiology reports [4] and discharge summaries [5].
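The syntactic modules mentioned above (sentence detection and tokenization) can be illustrated with a minimal sketch. It is only an assumption about how such a front end might look, not the pipeline used in this work: the function names `split_sentences` and `tokenize`, the regular expressions and the sample note are all hypothetical, and real clinical text (abbreviations such as "Dr." or "q.d.") would need more careful rules.

```python
import re

def split_sentences(text):
    """Naive sentence detection: split at whitespace that follows '.', '!' or '?'.

    Assumption: abbreviations with periods are not handled.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical clinical note used only for illustration.
note = "Patient denies chest pain. CT scan ordered; start aspirin 81 mg daily."
sentences = split_sentences(note)
tokens = [tokenize(s) for s in sentences]

for sent_tokens in tokens:
    print(sent_tokens)
```

The token lists produced this way are the kind of input that later modules, such as part-of-speech tagging and named entity tagging, would consume.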
Clinical research and other medical operations rely on important information extracted through detailed analysis of clinical notes, which provide rich and detailed medical information. In the present work, we have built a machine learning model to extract medical named entities, namely disease, test and treatment. The analysis is performed on the text of doctors' notes and on records generated during interaction with the patient.

RELATED WORK
Sekine et al. [6] built a decision tree based NER model for Japanese that used features such as part-of-speech tags extracted by a morphological analyzer, a specialized dictionary and character-based information. Bikel et al. [7] used a hidden Markov model (HMM) to identify named entities, with features such as bigrams and orthographic features like word case and word shape. In his Ph.D. thesis, Borthwick [8] used the maximum entropy (MaxEnt) algorithm. McCallum et al. [9] performed NER with an algorithm based on conditional random fields. Sarawagi et al. [10] proposed a semi-Markov conditional random field algorithm for named entity extraction; the researchers extended the semi-Markov model with a dictionary and a notion of similarity function. An overall survey of NER research was provided by Naidu and Sekine [11].
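Orthographic features such as word case and word shape, together with a bigram feature, can be sketched as follows. This is a minimal illustration under assumptions: the feature names, the shape alphabet (`X`, `x`, `d`) and the helper names `word_shape` and `token_features` are hypothetical and are not taken from the cited systems.

```python
import re

def word_shape(token):
    """Map uppercase letters to 'X', lowercase letters to 'x' and digits to 'd'.

    The order matters: digits are replaced last so the inserted 'd'
    is not itself rewritten as a lowercase letter.
    """
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return shape

def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<s>"  # sentence-start marker
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "bigram": prev_tok.lower() + " " + tok.lower(),
    }

# Hypothetical token sequence used only for illustration.
tokens = ["Elevated", "HbA1c", "noted"]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[1])
```

Feature dictionaries of this kind are what a feature-based classifier such as a decision tree, MaxEnt model or CRF would typically receive for each token.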
Luu [12] proposed a framework based on different text mining and machine learning algorithms to address the challenges of clinical named entity recognition. The proposed framework has multiple levels and builds complex NER tasks. Two data sets, the CLEF 2016 challenge and BioNLP/NLPBA 2004, were used to evaluate the proposed method, and the results validated the framework.
Mao et al. [13] note that important clinical information related to diagnosis is available in electronic medical records, and that medical named entities can be recognized by mining these records. In their work, the authors took ophthalmic electronic medical records as the research object. First, a training corpus was annotated under the guidance of a specialist. A trained HMM was then applied to the test set for entity recognition. Finally, an experiment compared the proposed algorithm with an algorithm based on a word segmentation model. The results indicate that the algorithm achieves good results in named entity recognition on electronic medical records. Li et al. [14] proposed a deep neural model, BiLSTM-Att-CRF, that combines a bidirectional long short-term memory network with an attention mechanism. This improved the performance of NER on Chinese electronic medical records (EMRs), and the proposed model achieved better results than other widely used models.
Qiu et al. [15] write that the goal of clinical named entity recognition (CNER) is to identify and classify clinical terms such as symptoms, exams, treatments and diseases. This is a crucial and fundamental task for clinical and translational research. In recent years, deep learning models have been successful in CNER tasks. These models depend on recurrent neural networks, which maintain a vector of hidden activations that propagates through time, and this makes model training slow. To address this, the researchers proposed a residual dilated convolutional neural network with a conditional random field (RD-CNN-CRF). In this method, dictionary features and Chinese characters are first projected into dense vector representations, which are then fed into the residual dilated convolutional neural network to capture contextual features.
Li et al. [16] proposed a model that combines a language model, the conditional random field (CRF) algorithm and bidirectional long short-term memory networks (BiLSTM) for automatic recognition and entity extraction in unstructured medical texts. The researchers crawled 804 drug specifications for asthma treatment from the Internet. The words in the normalized fields of the drug specifications were then represented as vectors and used as input to the neural network. Experiments indicated that recall, system accuracy and F1 value improved by 5.2%, 6.18% and 4.87%, respectively, compared to a traditional machine learning model. The proposed model can be applied to extract named entity information from drug specifications.
To summarize, the electronic medical record is a description of a patient's physical condition [17], and named entity recognition is the method used to extract clinical data from it. Early NER combined dictionaries and rules [18]. NLP has become a recent trend in clinical decision making [19]. Researchers have evaluated various machine learning algorithms with various features [20]. UMLS, cTAKES and Medline were introduced as features, and an accuracy of 85.23% was achieved using a semi-Markov model [21]. Wang et al. [22] constructed a tagged symptom corpus of 11,613 chief complaints. Wang et al. [23] completed manual annotation for 12 data of liver cancer in 115 medical records. Yan et al. [24] put forward a joint model of word segmentation and named entity recognition based on dual decomposition. Jianbo et al. [25] selected 800 medical records and established a named entity tagged corpus in which word segmentation and part-of-speech tagging use tools developed by Stanford University.
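The dictionary-and-rules style of NER mentioned above can be illustrated with a small longest-match tagger that emits BIO labels. This is a sketch under assumptions, not the method of any cited work: the lexicon entries, the label names (`problem`, `test`, `treatment`) and the function name `dictionary_tag` are hypothetical. It also shows why such systems struggle with clinical text, since any spelling variant or synonym missing from the lexicon is simply not found.

```python
def dictionary_tag(tokens, lexicon):
    """Tag tokens with BIO labels using greedy longest-match dictionary lookup.

    lexicon maps a tuple of lowercase tokens to an entity type.
    """
    max_len = max((len(key) for key in lexicon), default=0)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so "chest pain" beats "pain".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lexicon.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for k in range(i + 1, i + n):
                    tags[k] = "I-" + label
                matched = n
                break
        i += matched if matched else 1
    return tags

# Hypothetical lexicon and sentence used only for illustration.
lexicon = {
    ("chest", "pain"): "problem",
    ("pain",): "problem",
    ("ct", "scan"): "test",
    ("aspirin",): "treatment",
}
tokens = ["Chest", "pain", "persisted", ",", "CT", "scan", "ordered", ",", "aspirin", "given"]
tags = dictionary_tag(tokens, lexicon)
print(list(zip(tokens, tags)))
```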

THE PROPOSED MODEL
The proposed model classifies clinical data and provides the data to the concerned expert using a machine learning framework and NLP techniques. In the manual system, physicians and nurses have to go through the medical data and direct it to the concerned experts, which is a time consuming, expensive and challenging task. Patient records include medical history, family history and so on. The significant difference between classifying medical records and general text classification is the word distribution. The proposed model uses a machine learning framework to recognize and extract concepts from clinical data. The framework includes an approach known as bidirectional long short-term memory with conditional random field (LSTM-CRF), initialized with general-purpose, off-the-shelf word embeddings. Figure 1 depicts the data flow used in the proposed model.
The CRF maps an entire input sequence $x$ paired with an entire state sequence $o$ to a $d$-dimensional feature vector, $\Phi(x, o) \in \mathbb{R}^d$. The probability is modeled as a log-linear model with parameter vector $w \in \mathbb{R}^d$:

$$P(o \mid x; w) = \frac{\exp\left(w \cdot \Phi(x, o)\right)}{\sum_{o'} \exp\left(w \cdot \Phi(x, o')\right)}$$

where $o'$ ranges over all possible output sequences. The expression $w \cdot \Phi(x, o) = \mathrm{score}(x, o)$ indicates how well the state sequence fits the given input sequence. Hence the score can be defined as

$$\mathrm{score}(x, o) = \sum_{j} \left( W_{o_{j-1}, o_j} \cdot x_j + b_{o_{j-1}, o_j} \right)$$

where $x_j$ is the representation of the $j$-th input token, and $W_{o_{j-1}, o_j}$ and $b_{o_{j-1}, o_j}$ are the weight vector and the bias corresponding to the transition from $o_{j-1}$ to $o_j$, respectively.

The algorithm used for the overall process is given in Figure 2. Medical records containing the tests conducted, the patient's health status, the response to treatments and the diseases are given as input. In the next stage, concepts such as medical tests, diagnoses and treatments mentioned in the clinical records are classified into categories. The records are then divided into training data and testing data: 70% of the data is used as training data and fed to the model, and the remaining 30%, which consists of patients' information, is used as testing data. Once the model is tuned for accuracy, it is ready to receive real data. The real data, that is, actual clinical records, are then fed to the pre-developed model. The output is a list of words that indicate the test conducted, the problem diagnosed or the treatment given. From the list of diseases and tests conducted, the specializations are classified and displayed. The benefit is that experts in a specific area need not read every clinical record in full; they can read the summary directly, which saves a lot of time. Diagnosis and test names are extracted with the LSTM-based machine learning method, using NLP. The screenshot is shown in Figure 3.

Step 3
Model building using training data: The records are divided into training data and testing data. 70% of the data is used as training data and is fed to the model.

Step 4
Testing the model accuracy: Testing data (30% of the data), consisting of patients' information, are fed to the model.

Step 5
Input medical records: The real data (clinical records) are fed to the pre-developed model.

Step 6
Obtain output: The output is a list of words that indicate the test conducted, the problem diagnosed or the treatment given.

Step 7
Classify: From the list of diseases and tests conducted, the specializations are classified and displayed.

Step 8
End

Once NER with NLP has been applied to extract entities and their relationships, further processing is done. The disease names, tests and diagnostic tests are fed as input to the machine learning framework. The output of the model is classified data labeled with specialization, as shown in Figure 4. Figure 5 and Figure 6 show execution screenshots during classification.
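The log-linear sequence model described in this section scores a whole tag sequence by summing, over positions, a transition-specific weight vector applied to the token representation plus a transition-specific bias, and then normalizes over all possible output sequences. The following sketch works through that computation on a toy example. Everything concrete in it is an assumption made for illustration: the two tags `O` and `E`, the start marker `<s>`, the two-dimensional token vectors and the hand-picked weights are hypothetical and are not the parameters of the proposed model. Normalization is done by brute-force enumeration, which is only feasible for tiny inputs; practical CRF implementations use dynamic programming instead.

```python
import itertools
import math

START = "<s>"          # hypothetical start-of-sequence state
TAGS = ["O", "E"]      # hypothetical tag set: outside / entity

# Hand-picked toy parameters, indexed by the transition (previous tag, current tag).
W = {}
b = {}
for prev in [START] + TAGS:
    for cur in TAGS:
        W[(prev, cur)] = [1.0, 0.0] if cur == "E" else [0.0, 1.0]
        b[(prev, cur)] = 0.5 if prev == cur else 0.0

def sequence_score(x, o):
    """score(x, o) = sum_j ( W[o_{j-1}, o_j] . x_j + b[o_{j-1}, o_j] )."""
    total = 0.0
    prev = START
    for x_j, o_j in zip(x, o):
        w = W[(prev, o_j)]
        total += sum(w_k * x_k for w_k, x_k in zip(w, x_j)) + b[(prev, o_j)]
        prev = o_j
    return total

def sequence_probability(x, o):
    """P(o | x) = exp(score(x, o)) / sum over all o' of exp(score(x, o'))."""
    scores = [sequence_score(x, cand) for cand in itertools.product(TAGS, repeat=len(x))]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return math.exp(sequence_score(x, o) - log_z)

# Toy input: the first token "looks like" O, the next two "look like" E.
x = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
candidates = list(itertools.product(TAGS, repeat=len(x)))
best = max(candidates, key=lambda o: sequence_score(x, o))
total_probability = sum(sequence_probability(x, o) for o in candidates)

print(best, sequence_score(x, best), sequence_probability(x, best))
```

In an LSTM-CRF, the vectors in `x` would be the outputs of the recurrent layer for each token rather than fixed toy values.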

RESULTS AND DISCUSSIONS
The machine learning algorithms used in the proposed model are support vector machine (SVM), naïve Bayes, logistic regression, decision tree, random forest and LightGBM. The screenshot showing the accuracy of these algorithms is given in Figure 7, and the accuracy is presented in graphical form in Figure 8. The proposed model can be used to extract medical data using NER and NLP techniques. Built into medical automation systems, the machine learning model can be a good resource for medical experts, as it saves much of the time spent reading clinical records in detail. Administrative tasks also become easier, as the model separates the diseases and treatments into specializations.
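The classification stage, in which extracted entities are mapped to a specialization and accuracy is measured on a 70/30 split, can be sketched as follows. This is an illustrative stand-in only: it uses a small hand-written multinomial naïve Bayes classifier rather than the library implementations evaluated in this work, and the records, entity strings and specialization labels (`cardiology`, `pulmonology`) are invented for the example.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical records: (entities extracted by NER, specialization label).
RECORDS = [
    (["chest pain", "ecg", "aspirin"], "cardiology"),
    (["hypertension", "echocardiogram", "beta blocker"], "cardiology"),
    (["chest pain", "troponin", "aspirin"], "cardiology"),
    (["arrhythmia", "ecg", "beta blocker"], "cardiology"),
    (["hypertension", "ecg", "aspirin"], "cardiology"),
    (["asthma", "spirometry", "inhaler"], "pulmonology"),
    (["pneumonia", "chest x-ray", "antibiotics"], "pulmonology"),
    (["asthma", "chest x-ray", "inhaler"], "pulmonology"),
    (["copd", "spirometry", "inhaler"], "pulmonology"),
    (["pneumonia", "spirometry", "antibiotics"], "pulmonology"),
]

def split_70_30(records, seed=0):
    """Shuffle with a fixed seed, then use 70% for training and 30% for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]

def train_naive_bayes(train):
    """Count records per class and entity occurrences per class."""
    class_counts = Counter()
    entity_counts = defaultdict(Counter)
    vocab = set()
    for entities, label in train:
        class_counts[label] += 1
        entity_counts[label].update(entities)
        vocab.update(entities)
    return class_counts, entity_counts, vocab

def predict(model, entities):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, entity_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_counts):
        logp = math.log(class_counts[label] / total)
        denom = sum(entity_counts[label].values()) + len(vocab)
        for entity in entities:
            if entity in vocab:  # entities unseen in training are ignored
                logp += math.log((entity_counts[label][entity] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

def accuracy(model, test):
    """Fraction of test records whose predicted label matches the true label."""
    return sum(predict(model, ents) == label for ents, label in test) / len(test)

train, test = split_70_30(RECORDS)
model = train_naive_bayes(train)
print("test accuracy:", accuracy(model, test))
```

Comparing several algorithms, as done here, amounts to repeating the training and accuracy steps with a different classifier on the same split.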
Existing NLP systems for NER on clinical data consist of syntactic processing modules such as sentence detection, tokenization and part-of-speech tagging. The semantic modules include concept identification, entity recognition, relation extraction and anaphora resolution. The literature so far describes systems that extract named entities such as disease and treatment, which allow doctors to read summary information without reading complete clinical records. The proposed model goes one step further by classifying the named entities by specialization. It can be embedded in a health automation system for efficient delivery of services, saving a lot of time. Hence the proposed system can be a good candidate for research on NER in the medical field.

CONCLUSION
Because of the diverse writing styles of clinicians, rules and patterns do not generalize. These issues can be addressed with technologies such as machine learning. Named entity recognition approaches fall into three groups: machine learning based approaches, rule-based approaches and dictionary-based approaches. Systems that use a machine learning based approach focus on choosing effective features for building the classifier. Several researchers have used machine learning models extensively for clinical NER. Databases such as PubMed, which contain medical publications, have generated a lot of interest among researchers in applying information extraction techniques to medical literature. In an attempt to contribute to research in this area, this work proposed a machine learning model for clinical NER. The proposed model performed better than some of the existing methods.