Leveraging graph-based semantic annotation for the identification of cause-effect relations

ABSTRACT


INTRODUCTION
Research on natural language processing (NLP) is growing, driven by the ever-increasing availability of information. One example is NLP in the health domain, where information comes not only from patients' medical records but also from community-written content shared over the internet [1]. Many practitioners, communities, and even health service units look for medical information on the internet and use it for specific purposes.
A characteristic of information seeking in the health domain is that the search ends once cause-and-effect information explaining the root causes of a medical condition has been found.
Meanwhile, the large volume of medical information from various sources makes it hard for practitioners and the public to find what they need quickly. Acquiring new knowledge from medical articles automatically is a problem in its own right, because such articles are multi-sentence texts [2]: a long article usually has several paragraphs, opening with background, then presenting the problem and its solutions in the core part, and closing with conclusions or messages considered important [3,4]. The two halves of a causal relation may therefore sit in different places: the cause sentence may appear at the beginning of a paragraph and the effect sentence at the end, or vice versa, and the two may even fall in different paragraphs.
The purpose of this research is to optimize the use of annotations for summarizing an online medical article, so that the summary exposes the meaning of cause and the meaning of effect, two meanings that explain each other. Overall, this study makes the following contributions: (a) phrase annotations for identifying implicit meanings in a sentence; (b) paragraph annotations for classifying the core of the discussion in each paragraph of a medical article; (c) medical element annotations for transforming natural language into medical annotations; (d) graph-based semantic annotations for identifying cause-effect relationships through a template-based approach.

RESEARCH METHODOLOGY
The method used as a reference is that of Atkinson [5]. The current research differs in the text summary stage, the transformation into medical elements, the semantic annotations, and the paragraph annotations, shown in Figure 1 (blocks with a dark background). The current research uses a dataset of online medical articles for public health surveillance. The aim is to display information on causes and effects, as well as the mitigation carried out in response to these health cases.
The challenge in the current research is to determine the medical elements that match the characteristics of the text in Indonesian medical articles.

Summary of extractive text
In the current research, text summarization completes the feature selection stages required by Atkinson [5]. Summarization techniques are also used to narrow the problem search space and to determine the characteristics used in the text summary phase [6].

Transforming natural languages into medical element annotations
Annotation is a technique for attaching notes to each sentence. It is used here to find the important sentences in a medical text. In practice, annotations are often connected to other annotations, and these connections form new meanings [7]. The annotations proposed by other researchers compare as follows. Atkinson [5] proposed substance, effect, symptom, disease, and body part. Mihǎilǎ [7] proposed drug, physical stimulation, symptom, inhibition, and diagnosis. Byrd [8] used information from Twitter for public health surveillance, with sentiment analysis and geocoding as annotations. Liang Wu [9] used Twitter to find information about reactions to consumed drugs, with the annotations drug name, brand name, prescribe for, and effect. Yanging Ji [10] used health articles with the annotations drug name, ICD code, and symptom. Yepes [11] proposed disease, symptoms, pharmacological, and location as annotations for surveillance problems.
The limitation of these proposed annotations is that they cannot display a dependency relationship that explains the meaning of cause and effect. The current study therefore proposes semantic annotation to relate the meaning of each annotation to the other annotations it is connected with, according to medical information needs. Table 1 is the proposal for transforming natural language into the LPpAJSFPePnGOD medical elements. It takes into account the needs of a dataset drawn from online articles, namely statements from resource persons, facilities, and recovery actions already carried out, which serve as indicators of mitigation.

DISEASE POPULATION
Used to extract the issues and problems of the disease. The issues are obtained with a biological named entity recognition (biological NER) approach. The rule is Disease_population(x, y) ← biological probability named entity recognition (biological NER).

AKIBAT (A) In English EFFECT
Used to extract the impact experienced by the subject, a specific disease population. The rule is Effect(x, y) ← disease_population(x), location(y).

SUM OF EVENTS
Used to extract statistical counts, given as numerals, for the population of a certain disease. The rule is Sum_of_events(x, y) ← effect(x, y).

SEBAB (S) In English CAUSE
Used to extract the main causes. A cause is obtained by computing word occurrence statistics over the online articles and by classifying certain disease populations. The rules are Cause_1(x, y) ← disease_population(x, y); Cause_2(x, y) ← sum_of_events(x, y); Cause_3(x, y) ← sum_of_events(x), location(y).

FASILITAS (F) In English FACILITY
Used to extract the names of institutions that provide treatment for the effects on certain disease populations. The rules are Facility_1(x, y) ← disease_population(x), sum_of_events(y); Facility_2(x, y) ← effect(x, y); Facility_3(x, y) ← cause_1(x, y).

PEMULIHAN (Pe) In English RECOVERY
Used to extract the solutions prepared and carried out by facilities in certain locations. The rules are Restitution_1(x, y) ← facility_1(x, y); Restitution_2(x, y) ← facility(x), location(y).

RESPONSIBLE PERSON
Used to extract the subject who is responsible for the problems of a certain disease population and for the recovery carried out by facilities in certain locations. The rules are Responsible_person_1(x, y) ← facility_1(x, y); Responsible_person_2(x, y) ← cause_3(x, y); Responsible_person(x, y) ← effect(x), facility(y).

GEJALA (G) In English SYMPTOMS
Used to extract the part of the body that is diseased or in pain. The rule is Symptoms(x, y) ← disease_population(x, y).

OBAT (O) In English DRUGS
Used to extract information about a drug or vaccine as one method of recovery. The rule is Drugs(x, y) ← cause_1(x), recovery(y).

DIAGNOSA (D) In English DIAGNOSE
Used to extract information related to medical statements about the disease. The rule is Diagnosis(x, y) ← disease_population(x), symptoms(y).
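
The rules listed for the medical elements can be read as small derivation rules. The sketch below is only an illustration under stated assumptions: each rule is read as "the head annotation holds for (x, y) when the body annotations hold for x and y", only rules whose body atoms take a single argument are covered, and the rule table `RULES`, the function `apply_rules`, and the toy facts are hypothetical names and values that do not come from the original system.

```python
from itertools import product

# Hypothetical rule table: head annotation -> list of (x annotation, y annotation)
# bodies. Only rules whose body atoms take one argument each are covered here.
RULES = {
    "effect": [("disease_population", "location")],
    "diagnosis": [("disease_population", "symptoms")],
}


def apply_rules(facts, rules=RULES):
    """Derive (x, y) pairs for each head from sets of already extracted values."""
    derived = {}
    for head, bodies in rules.items():
        pairs = set()
        for x_name, y_name in bodies:
            xs = facts.get(x_name, set())
            ys = facts.get(y_name, set())
            # Every combination of an x value and a y value satisfies the body.
            pairs |= set(product(xs, ys))
        derived[head] = pairs
    return derived


# Toy facts, invented for illustration only.
facts = {"disease_population": {"diphtheria"}, "location": {"East Java"}}
derived = apply_rules(facts)
```

With these toy facts the effect rule fires, while the diagnosis rule derives nothing because no symptoms were extracted. Keeping the rules in a table rather than in code makes it easy to add the remaining medical elements.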

Paragraph annotation
Paragraph annotations are needed because the important messages in online medical articles appear in unpredictable paragraphs, and each paragraph may discuss a different core story. The main task of paragraph annotation is therefore to extract the essence of the medical content and assign it to one of three categories: core paragraphs, supporting paragraphs, and conclusion paragraphs. The paragraph position annotations are listed in Table 2. Once the core story of each paragraph has been obtained, the graph pairs between paragraph classes can be determined from the pattern rules in Figure 2.

Feature selection
Features are selected to suit the characteristics of the text in complex medical articles and the meaning of medical sentences. For medical articles such as those in the current research, the analysis indicates that it is better not to apply the pre-processing stage than to apply it, because the results become ambiguous [12]. Figure 3 shows the pre-processing stage in question. The first stage is sentence splitting, which separates sentences at the delimiter '.'. Not every delimiter marks the end of a sentence, however. For example, the '.' in 'dr.' does not end a sentence; it belongs to a professional title. The second stage is tokenization, which separates a sentence into words. A typical problem is <rumah sakit> (in English, hospital): it is a single term, not <home> and <sick> or <care>. Likewise, <Ministry of Health> is a single term, not <ministry> and <health>.
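
The two pitfalls described above, abbreviations such as 'dr.' and multi-word terms such as <rumah sakit>, can be handled along the following lines. This is a minimal sketch, not the pre-processing used in the study: the abbreviation list, the multi-word term list, and the function names `split_sentences` and `tokenize` are assumptions made for illustration.

```python
import re

# Hypothetical resources; a real system would need much larger lists.
ABBREVIATIONS = {"dr.", "prof."}
MULTIWORD_TERMS = ["rumah sakit", "kementerian kesehatan"]  # hospital, Ministry of Health


def split_sentences(text):
    """Split at '.' unless the token is a known abbreviation such as 'dr.'."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Lowercase the sentence and keep each multi-word term as a single token."""
    lowered = sentence.lower()
    for term in MULTIWORD_TERMS:
        lowered = lowered.replace(term, term.replace(" ", "_"))
    return re.findall(r"[\w-]+", lowered)


text = "Pasien dirawat dr. Budi di rumah sakit. Kondisinya membaik."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
```

Joining a multi-word term with an underscore is one simple way to keep it as a single token through the later n-gram and annotation steps.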
The last stage is the morphological analyzer, which reduces a word to its base word. An example is <immunized>: in Indonesian the word carries the affixes (di-) and (-sasi), and its base word is <immune>. Immunized means having been given a vaccine for the immune system, whereas <immune> refers to the immune system itself.
The other feature selection analysis is the n-gram, given by the following equation:
$$P(k_1, k_2, \ldots, k_n) = P(k_1)\,P(k_2 \mid k_1) \cdots P(k_n \mid k_1, \ldots, k_{n-1})$$
This research uses trigrams, for which the equation is as follows:
$$P(k_n \mid k_1, \ldots, k_{n-1}) \approx P(k_n \mid k_{n-2}, k_{n-1})$$
The previous equation can be written as follows:
The next step adopts the semantic analysis of Byrd [8] in the form of phrase annotations that identify positive or negative meanings. Opinion analysis uses positive sentences (+) to represent the meaning of causes and negative sentences (-) to represent the meaning of consequences. So, the syntactic results obtained are as follows: The results of the exploration of the sentence above show that <taking casualties> / VB + (NEG) and <death> / VB + (NEG) have negative meanings.
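
The trigram probability introduced earlier in this subsection can be estimated from counts. The sketch below uses plain maximum-likelihood estimation without smoothing, which is an assumption; the source does not state how the probabilities are estimated. The function names and the padding symbols `<s>` and `</s>` are likewise illustrative.

```python
from collections import Counter


def train_trigram(tokens):
    """Count trigrams and their two-word contexts, with start and end padding."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    bigrams = Counter(zip(padded, padded[1:]))
    return trigrams, bigrams


def trigram_prob(trigrams, bigrams, k_prev2, k_prev1, k):
    """Maximum-likelihood estimate of P(k | k_prev2, k_prev1); 0.0 if unseen."""
    context = bigrams[(k_prev2, k_prev1)]
    return trigrams[(k_prev2, k_prev1, k)] / context if context else 0.0


# Toy token sequence, invented for illustration only.
tokens = "high fever causes seizure high fever is dangerous".split()
trigrams, bigrams = train_trigram(tokens)
p = trigram_prob(trigrams, bigrams, "high", "fever", "causes")
```

Here the context (high, fever) occurs twice and is followed by "causes" once, so the estimate is 0.5. An unsmoothed model assigns zero probability to unseen trigrams, so a practical system would normally add smoothing.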

Semantic annotation
Semantic annotation expresses how the words in an article are connected, producing new information whose meaning can be understood. This step adopts the semantic annotation analysis used by several researchers [13][14][15][16]. The current research uses a hybrid method, combining pattern-based and machine-learning-based techniques, to build semantic relationships and obtain interpretations appropriate to medical needs. Each node is defined as having an association with other annotation nodes. Although the nodes look separate in the figure, each annotation node searches for other annotation nodes that carry the same information.
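
The idea that each annotation node looks for other nodes carrying the same information can be sketched as a small undirected graph. The representation below is an assumption made for illustration: each node is an annotation label with the set of terms extracted for it, and two nodes are linked when their term sets overlap. The source does not specify the actual matching criterion.

```python
from collections import defaultdict
from itertools import combinations


def build_annotation_graph(nodes):
    """Link every pair of annotation nodes that share at least one term.

    nodes: dict mapping an annotation label to the set of terms found for it.
    Returns an adjacency dict containing only nodes that have a neighbour.
    """
    graph = defaultdict(set)
    for a, b in combinations(sorted(nodes), 2):
        if nodes[a] & nodes[b]:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Toy annotation nodes, invented for illustration only.
nodes = {
    "cause": {"virus", "diphtheria"},
    "effect": {"diphtheria", "death"},
    "facility": {"hospital"},
}
graph = build_annotation_graph(nodes)
```

In this toy case the cause and effect nodes are linked through the shared term "diphtheria", while the facility node stays unconnected.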

Analysis of usage annotation
The analyses needed to design the tests for the current research, as implemented in the health surveillance system, are:
- Understand the use of online medical articles to support public health surveillance.
- Determine the feature selection most suitable for the complex texts of online medical articles.
- Transform natural language into annotations of medical elements.
- Propose the transformation of medical texts into paragraph annotations: core paragraphs, supporting paragraphs, and conclusion paragraphs.
- Test sentence dependencies using medical element annotations (LPpAJSFPePnGOD) and graph-based semantic annotation.
- Evaluate system performance and compare it with other studies.

TESTING
The tests are divided into three parts: tagging (verb (VB) + noun (NN)) with phrase annotation, multi-feature selection, and dependency relation.

Parsing (VB + NN) + phrase annotation + opinion analysis
The first test is the opinion analysis, which uses positive (+) sentences to represent the meaning <cause> and negative (-) sentences to represent the meaning <effect>. Following [17], three variants are tested: (a) a lexical approach; (b) the naïve Bayes method; and (c) the SVM method. Table 3 compares the explicit and implicit data modeling analysis for identifying patterns of causal relationships with the results of other researchers.

Multi-feature selection
This test uses a summary technique to narrow the problem search space and then combines it with the semantic pattern modeling method and the proposed annotations to find important sentences that have a meaningful cause-effect relationship. The pseudo code of the summary system covers feature selection, weighting, and n-grams. The next test combines summary + annotation of medical elements + phrase tagging + opinion analysis. The aim is to obtain a summary that better matches the information search needs of the medical domain and contains a pattern of causal relationships. Because the classification is subjective, the learning model must be adjusted to the individual preferences of the labeler or expert, and the resulting labels may vary. The research therefore explores features by means of classification. The experimental dataset consists of medical articles taken from online media: 500 articles, which yield 10,176 sentences when extracted. The method used as a comparison for the classification of summaries + annotations of medical elements + phrase tagging + opinion analysis is naïve Bayes.
The current research addresses a problem left open by previous studies: when the output contains more than one entity, the meaning becomes ambiguous, as noted by Khoo [20][21][22] and Park [23]. In this study, when more than one entity is obtained for a medical element, the candidates are ranked by word occurrence frequency so that the medical element resolves to a single entity. Table 4 shows the results of using the summary + annotation of medical elements + phrase tagging + opinion analysis approach to find important information with causal meaning in medical text articles.
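
For readers unfamiliar with the comparison method, the following is a minimal multinomial naïve Bayes classifier over tokenized sentences. It is a sketch under stated assumptions, not the configuration used in the study: add-one (Laplace) smoothing, the class name `NaiveBayes`, the labels, and the toy training sentences are all illustrative choices.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Smoothed denominator: words seen in this class plus vocabulary size.
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration only.
docs = [["virus", "spreads"], ["bacteria", "infects"],
        ["patients", "died"], ["many", "casualties"]]
labels = ["cause", "cause", "effect", "effect"]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict(["virus", "infects"])
```

Scores are summed in log space to avoid numerical underflow on longer sentences.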
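
The frequency-based ranking that reduces several candidate entities to one can be sketched as follows. The function name `rank_entities`, the tie-breaking by candidate order, and the toy data are assumptions; the source only states that the ranking uses word occurrence frequency.

```python
from collections import Counter


def rank_entities(candidates, tokens):
    """Order candidate entities by how often each occurs in the article tokens.

    Ties keep the original candidate order because sorted() is stable.
    """
    counts = Counter(tokens)
    return sorted(candidates, key=lambda c: counts[c], reverse=True)


# Toy data, invented for illustration only.
tokens = ["diphtheria", "outbreak", "measles", "diphtheria", "vaccine", "diphtheria"]
candidates = ["measles", "diphtheria"]
ranking = rank_entities(candidates, tokens)
selected = ranking[0]
```

Returning the full ranking rather than only the top entity keeps the runner-up candidates available for inspection by an expert.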

Dependency relation
This stage adds the paragraph annotation and semantic annotation approaches to display new knowledge, so that readers can easily understand a summary of a medical text article that presents important information with a meaningful cause-effect relationship [25]. Dependency relations are obtained with a combination of rule-based and statistical techniques. Rule-based techniques are used when the annotation resulting from classifying the sentences of one paragraph must be paired with the classification of the sentences of another paragraph, as shown in Figure 2 (interrelation of meaning based on paragraph position annotations) and Figure 4; this step is described by a rule-based pseudo code. Machine-learning-based techniques are used for classifying medical element annotations and paragraph annotations; this step is described by a statistical pseudo code for word-based statistical connectivity. The current research builds new knowledge by combining rule-based and statistical dependency relations, displaying summaries of important information supported by the new knowledge patterns shown in Table 5. The test scenarios for the dependency relation stage focus on extrinsic evaluation, which separates the summary results into two categories, appropriate and inappropriate outputs, as reported in Table 6. Remarks for Table 6: (a) summary by an expert; (b) summary by the system; (c) appropriate output; (d) inappropriate output. In point (a), an expert determines the output that the system should generate. Point (b) is the output that the system actually generates using the combination of summary features + annotations of medical elements + phrase tagging + opinion analysis.
Point (c) records the cases where the output determined by the expert matches the output produced by the system. Point (d) records the cases where the outputs of points (a) and (b) do not match. On average, the system produces output in accordance with the expert's decision at a rate of 0.857. Figure 5 compares the output produced by the system with the decisions previously made by an expert. The system's results are considered good when the value approaches 1, meaning that the system output is highly similar to the expert's decisions.
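
One simple way to compute an agreement score of this kind is the fraction of system outputs that match the expert's outputs, averaged over articles. The sketch below assumes exactly that definition, which the source does not spell out; the function names and the toy labels are hypothetical and are not the study's data.

```python
def agreement(expert, system):
    """Fraction of positions where the system output equals the expert output."""
    if len(expert) != len(system):
        raise ValueError("expert and system outputs must have the same length")
    if not expert:
        return 0.0
    matches = sum(e == s for e, s in zip(expert, system))
    return matches / len(expert)


def mean_agreement(pairs):
    """Average the per-article agreement over (expert, system) output pairs."""
    scores = [agreement(expert, system) for expert, system in pairs]
    return sum(scores) / len(scores)


# Toy outputs for two articles, invented for illustration only.
pairs = [
    (["cause", "effect", "facility", "recovery"],
     ["cause", "effect", "facility", "drugs"]),
    (["cause", "effect"], ["cause", "effect"]),
]
first = agreement(*pairs[0])
overall = mean_agreement(pairs)
```

A value of 1 means the system reproduces every expert decision, which is the sense in which results approaching 1 are considered good.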

CONCLUSION
The semantic template method has been successfully implemented in the health domain, in a public health surveillance system. The implementation produced several proposals: the transformation of natural language into medical element annotations (LPpAJSFPePnGOD) to identify causal relationship patterns, a paragraph annotation pattern to classify the position of sentences within the paragraphs of a medical article, and a semantic relationship pattern for semantic annotation.