Query expansion using novel use case scenario relationship for finding feature location

ABSTRACT


INTRODUCTION
Feature location is a technique for tracing a feature of software to the specific region of source code that implements it. Feature location is useful when a developer wants to fix, change, or improve a method in the code. Changing code appears easy when the amount of code is small or when the programmer making the fix is the original author, who already understands the variables, parameters, and logic of the code. However, changing code becomes difficult when the source code of the project is large or when the programmer making the changes is a different person who has never worked on the code. For this reason, programmers need to perform program comprehension first, and studies show that programmers spend approximately 21.5 hours a week on it (58% of a 37.5-hour work week) [1].
There is a large gap between the tokens used in software requirements and the tokens of source code, and it is the greatest challenge in feature location research. Tokens at the software requirements level are abstract words such as billing, enter personal health records, and view prescriptions (health record domain). In contrast, tokens in source code are technical and specific to how the logic works (e.g., AddPHAAction, personnelDAO, and setPassword). Based on this fact, we need techniques to map tokens in the requirements to tokens in the source code. A feature is described in a use case scenario; therefore, a specific area of a feature could be queried using the use case scenario. However, research that uses the use case scenario as a query is quite rare, because few datasets include use case scenarios in their projects. Our previous research [15] applied NLP (noun tagging and verb tagging) to use case scenarios and used them as queries for information retrieval. It could predict the feature location with an average precision of 11% and a recall of 4%. Another of our previous studies used clustered use case scenarios as query expansion to find feature locations [16]. The results were quite good in recall rate (56%). The story in one use case scenario is sometimes continued in a sentence of another use case scenario, or it needs more explanation that can be found in another use case scenario. Based on these facts in the iTrust data [17], the use of use case scenario relationships may be advantageous in finding feature locations.
This study contributes a novel method for feature location that expands queries by finding relationships between use case scenarios. The inner association, outer association, and intratoken association are original ideas to capture additional tokens from other use case scenarios. As a result, the query contains more tokens, which could increase the precision and recall rates of feature location. The expanded use case scenarios were used as queries for information retrieval based on topic modeling of the source code.

BACKGROUND
In this section, we provide a brief introduction to information retrieval, topic modeling, and query expansion. Additionally, we elaborate on our use case scenario relationship model. Finally, we elaborate on how to implement it in a feature location case study.

Information retrieval
Information retrieval is a prevalent technique in feature location. It is composed of a number of processes, including preprocessing and NLP, and groups tokens into a corpus [18]. Users can then issue queries. The tokens from a query are compared against the corpus tokens with specific similarity methods, such as cosine and Jaccard, to obtain a high-ranked list of results. To validate the results of information retrieval methods, we employed precision and recall, as is common among many researchers [8], [19].
Precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. Here, the relevant items are the files attached to an individual feature by the trace links, whereas the retrieved items are the files recommended in this research.
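As an illustration, the computation can be sketched as follows (the file names are hypothetical, not taken from the iTrust trace links):

```python
def precision_recall(relevant, retrieved):
    """Precision and recall from sets of file names."""
    hits = relevant & retrieved              # items both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical trace links for one feature vs. files recommended by the system
relevant = {"AddPHAAction.java", "PersonnelDAO.java", "Auth.java", "Visit.java"}
retrieved = {"AddPHAAction.java", "PersonnelDAO.java", "Billing.java"}

p, r = precision_recall(relevant, retrieved)
print(round(p, 2), round(r, 2))  # → 0.67 0.5
```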

LDA
LDA [20] is an unsupervised probabilistic procedure to determine the topic distribution of a corpus. The corpus is extracted from documents (e.g., source code), which consist of tokens. Each document is given a probabilistic distribution that determines its topic proportions.
The inputs of LDA are the documents, the number of topics, and a set of hyperparameters: K, the number of topics that must be generated from the data; α, which influences the topic distribution per document (a lower value results in fewer topics per document); and β, which affects the distribution of terms per topic (a lower value results in fewer terms per topic and, as a result, requires an increase in the number of topics).
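For reference, the roles of α and β can be seen in the standard LDA generative process (written here in the usual LDA notation, which may differ from the symbols in the paper's original figures):

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic proportions of document } d\\
\varphi_k &\sim \mathrm{Dirichlet}(\beta) && \text{term distribution of topic } k\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d) && \text{topic of token } n \text{ in document } d\\
w_{d,n} &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) && \text{observed token}
\end{align*}
```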

Query expansion
Query expansion is the method of augmenting the original query with additional words, which helps to capture the actual user intent [21]. Query expansion has been applied in many applications, such as question answering, multimedia information retrieval, information filtering, and cross-language information retrieval. Query expansion is the crucial part of this research: the use case relationship supplies additional words that help the system understand the actual user intent.

Use case scenario relationship model
Use case scenarios are sets of sentences that describe the steps, data inputs, logic, and sometimes data outputs of a feature in software. They are used by developers as a key to develop a specific feature of the software. The terms used are usually high-level language. A term in a use case scenario is sometimes repeated and has the same meaning from the software engineer's perspective.
The use case relationship model is our original and novel approach to capture the indirect query from the user. The model is intended to enrich the query with additional words from other, associated use case scenarios. Based on observations of the iTrust [17] project, a use case scenario has relationships with other use case scenarios. This section elaborates on the concepts and characteristics of the use case relationships that might exist.

Inner association
The first concept is the inner association. An inner association describes a use case scenario associated with another use case scenario with the same actor, where one is used as an alternative or exception of the other. This concept was found through our deep inspection of the use case scenarios, in which some use case scenarios were alternatives of others. Examples of inner associations are shown in the accompanying figures.

Outer association
The second concept is the outer association. An outer association describes a use case scenario associated with other use case scenarios with different actors, where one is used as a reference by the other. This concept was found after we inspected use case scenarios that carry specific tags such as "(UC26)". The iTrust software analyst might have wanted to mark that the use case references others. Based on the data, as shown in Figures 4 and 5, use case scenario 11 mentions use case 26, which is marked with "(UC26)".

Intratoken association
An intratoken association is the kind of association in which a token of a use case scenario has a semantic similarity relationship with a token of another use case scenario. For example, the keyword "Patient" might have semantic similarity with the keyword "blood pressure" (score 0.3750). Based on the dataset, UC10S1, as shown in Figure 1, has a relationship with UC9S2, as shown in Figure 6.

RESEARCH METHOD
This section explains how our research framework finds feature locations based on query expansion using the use case relationship. Our framework has five segments: dataset definition, modeling the topics of the source code, use case scenario relationship modeling, query expansion using use case relationships, and the evaluation process. The whole structure is shown in Figure 7. The details are explained in this section.

Dataset definition
The dataset we used was iTrust [17]. iTrust is a Java-based electronic health record system that was developed at North Carolina State University (NCSU) as a primary case study in a software engineering class. We used version 19 (https://github.com/ncsu-csc326/iTrust/tree/v19/iTrust). The dataset contains approximately forty use cases mapped into 478 trace links of health record features such as personal health records, patients, diseases, safe drugs, visits, and lab procedures. The iTrust project is equipped with complete data, including use cases, use case scenarios, and code. It also has traceability links, which function as the ground truth, and contains many source code files (354 files). It has also been used by many researchers [22], [23].
The dataset was filtered into 20 use case scenarios that could be categorized into seven kinds of use case scenarios (Table 1). The selection of seven features was based on the assumption that those features are the most common in electronic health records. After the selection, the inner association, outer association, and intratoken association of the use case scenarios were defined manually by the researchers, as described in detail in section 3.3. The chosen trace links were reduced to the 102 trace links related to those 20 scenarios. The source code was also reduced to only 68 files, since many trace links used the same files.

Modelling the topics of source codes
The source code from the selected dataset was preprocessed using several subprocesses: tokenizing, stop word elimination, stemming, and topic modeling. The first subprocess was tokenizing, the process of splitting the source code into tokens. The methods include punctuation removal (.,'-_) using regex and splitting method/variable names in camelCase format into tokens (e.g., "updateAllergies" is split into "update allergies").
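A minimal sketch of this tokenizing step (regex-based punctuation removal plus camelCase splitting; the exact rules used in the study may differ):

```python
import re

def tokenize(source_line):
    """Split a line of source code into lowercase tokens:
    drop punctuation, then break camelCase identifiers apart."""
    # treat every non-letter character (.,'-_ parentheses, digits, ...) as a separator
    words = re.split(r"[^A-Za-z]+", source_line)
    tokens = []
    for word in words:
        # split at each lowercase-to-uppercase boundary: updateAllergies -> update Allergies
        for part in re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split():
            tokens.append(part.lower())
    return tokens

print(tokenize("updateAllergies(personnelDAO.setPassword);"))
# → ['update', 'allergies', 'personnel', 'dao', 'set', 'password']
```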

The next subprocess was stop word elimination and stemming [24]. To eliminate stop words from the source code, we removed tokens that are overly common English words (i.e., "is", "the", "and"). Stemming is the subprocess that eliminates suffixes or prefixes in order to determine the root form of a word. Stemming was done using the Porter algorithm.
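The idea can be sketched as below; note that the study uses the full Porter stemmer, while `simple_stem` here is a deliberately reduced suffix stripper with a small hand-picked stop word list:

```python
STOP_WORDS = {"is", "the", "and", "a", "an", "of", "to", "in"}  # illustrative list only

def simple_stem(token):
    """Reduced suffix stripping for illustration only;
    the study uses the full Porter algorithm."""
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            if suffix == "ies":
                return token[:-3] + "y"   # allergies -> allergy
            return token[:-len(suffix)]
    return token

tokens = ["the", "update", "allergies", "is", "viewed", "and", "entered"]
cleaned = [simple_stem(t) for t in tokens if t not in STOP_WORDS]
print(cleaned)  # → ['update', 'allergy', 'view', 'enter']
```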
The last subprocess was modeling the topics of the preprocessed source code. This research used Mallet [25], which implements LDA, to build the topic models. The LDA topic parameter was set to 5 topics, correlated with the 7 kinds of use case descriptions. The iteration number parameter was set to 4,000. The run produced several files: a model of the topics, an inference file, and the topic proportions of the 68 files. It also produced keywords per topic, which were used further as translator tokens. The topic proportion files were used for the ranking of recommendations in the query comparison process. As a result, Mallet created five topics with the top 15 keywords each, as shown in Table 2.
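The Mallet run described above corresponds roughly to the following commands (the file and directory names are placeholders, not the study's actual paths):

```shell
# Import the preprocessed source files into Mallet's binary format
bin/mallet import-dir --input preprocessed-src/ --output corpus.mallet --keep-sequence

# Train 5 topics for 4,000 iterations; emit topic keys, per-file topic
# proportions, and an inferencer for scoring new queries
bin/mallet train-topics --input corpus.mallet \
    --num-topics 5 --num-iterations 4000 --num-top-words 15 \
    --output-topic-keys topic-keys.txt \
    --output-doc-topics doc-topics.txt \
    --inferencer-filename inferencer.mallet
```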

Use case scenario relationship modelling
The use case scenario relationship modeling process was the most crucial part of this research. A use case scenario has relationships with others through our concepts of inner association, outer association, and intratoken association. To implement the concepts, we created a relational database to record the use case scenario associations (Figure 8). The entities were use cases and tokens. The use cases entity was used to save the use case scenario data, while the tokens entity was used to save the key terms from the use case scenarios for the intratoken association. The use cases have both inner and outer relations.
The inner relation and outer relation associations were defined by inspecting the use case scenarios one by one manually. The inspection included finding specific tags in the use case scenarios: the tag [E1] represents an inner relation, while the tag (UCxx) represents an outer relation, as shown in Figures 9 and 10.
To define the intratoken association, we performed several subprocesses. The first was the extraction of meaningful tokens from the use case scenarios. The meaningful tokens were extracted using a POS tagger [26] (https://parts-of-speech.info), as illustrated in Figure 11. The POS tagger [27] needs a complete sentence to determine the tags of words; therefore, we used unpreprocessed use case scenarios to extract correct tags. Only nouns and verbs were used as token association candidates, since this helps reduce the number of words and increase the success rate [15], [16], [28]. The resulting meaningful tokens were saved in tables to be processed further (Figure 12).
The second subprocess of the intratoken association was creating a matrix of word-based semantic similarity [29] to facilitate the intratoken association. All the words were compared one by one, and their semantic similarity was calculated. The words and their similarity degrees were saved in a table named the matrix of semantic similarity tokens. All similar words are used for expansion and are saved in a field named intratoken association.
The results of the use case scenario relationship modeling were the data of the inner associations, outer associations, and intratoken associations of the use case scenarios. All associations were saved in the database to make the experiments easier to perform. Each relationship was tested one by one for its precision and recall performance.
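As a sketch of how such a join-based expansion works (the schema below is a simplification; the actual table and column names in Figure 8 may differ):

```python
import sqlite3

# In-memory sketch of the association tables; table/column names and the
# token strings are hypothetical, not the study's actual schema or data.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE use_cases (id TEXT PRIMARY KEY, tokens TEXT);
CREATE TABLE inner_assoc (scenario TEXT, related TEXT);
""")
db.executemany("INSERT INTO use_cases VALUES (?, ?)", [
    ("UC10S1", "hcp choose patient record"),
    ("UC10E1", "invalid data error message"),
    ("UC10E2", "missing field warning"),
])
db.executemany("INSERT INTO inner_assoc VALUES (?, ?)", [
    ("UC10S1", "UC10E1"),
    ("UC10S1", "UC10E2"),
])

# Join query: collect the tokens of every scenario inner-associated with UC10S1
rows = db.execute("""
    SELECT u.tokens FROM inner_assoc a
    JOIN use_cases u ON u.id = a.related
    WHERE a.scenario = ?
    ORDER BY u.id
""", ("UC10S1",)).fetchall()

expansion = " ".join(r[0] for r in rows)
print(expansion)  # → invalid data error message missing field warning
```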

Query expansion sets
The words in the use case scenario were used as the initial query to our information retrieval approach. All words were preprocessed, including tokenization, stop word elimination, and stemming. These processes ensure that the remaining tokens are meaningful and in the root form of the words.
The first subprocess was query expansion using the first step of our novel use case scenario relationship, the inner relation association. It was done by finding the pairs of use case scenarios that comply with the rule given in section 2.4.1. For example, given the query use case scenario 10 (UC10S1.txt): based on the inner relation association pairs in Figure 9, UC10S1 has pairs with both UC10E1 and UC10E2, so their tokens were included as query expansions of UC10S1. As illustrated in Figure 13, the tokens were obtained by applying a join query that produces the token pairs of the inner relation associations.
The second subprocess was query expansion using the second step of our novel use case scenario relationship, the outer relation association. It was done by finding the pairs of use case scenarios that comply with the rule given in section 2.4.2. For example, given the query use case scenario 11 (UC11S1.txt): based on the outer relation association pairs in Figure 10, UC11S1 has pairs with UC26S1, UC15S1, and UC33S1, so their tokens were included as query expansions of UC11S1. The tokens were likewise obtained by applying a join query that produces the token pairs of the outer relation associations.
The third subprocess was query expansion using the third step of our novel use case scenario relationship, the intratoken association. It was done by finding the pairs of semantically similar tokens of use case scenarios that comply with the rule given in section 2.4.3. The tokens are extracted and put into the matrix of word-based semantic similarity. For example, given the query use case scenario 10 (UC10S1.txt), each token is compared against all tokens in the similarity matrix, and similar tokens with a degree > 0.5 are used for query expansion.
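The intratoken step can be sketched as follows; the similarity scores below are made-up values except for the "Patient"/"blood pressure" pair quoted earlier, and the real matrix is computed with a word-based semantic similarity measure [29]:

```python
# Hypothetical similarity-matrix entries: (word pair) -> semantic similarity
SIMILARITY = {
    ("patient", "blood pressure"): 0.3750,
    ("patient", "person"): 0.82,
    ("record", "chart"): 0.61,
    ("record", "graph"): 0.44,
}

def intratoken_expansion(query_tokens, threshold=0.5):
    """Expand a query with every word whose similarity to some
    query token exceeds the threshold (0.5 in the study)."""
    expansion = []
    for (a, b), score in SIMILARITY.items():
        if score > threshold:
            if a in query_tokens and b not in query_tokens:
                expansion.append(b)
            elif b in query_tokens and a not in query_tokens:
                expansion.append(a)
    return query_tokens + expansion

print(intratoken_expansion(["patient", "record"]))
# → ['patient', 'record', 'person', 'chart']
```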
The tokens were obtained by applying a join query that produces the token pairs of the intratoken associations, as illustrated in Figure 15.

Evaluation process
The final process was the result evaluation and analysis. The recommendations of source code files were generated by comparing the topic proportions of the expanded query against the topic proportions of all source code files. The topic proportion vector of the query was compared with the topic proportion vector of each file using cosine similarity. The recommendations were ranked by sorting the cosine similarity from nearest to furthest. The similarity threshold was set to 0.3. Precision and recall were employed to determine the success rate of our methods.
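A minimal sketch of this ranking step, with hypothetical 5-topic proportion vectors (the file names and numbers are illustrative only):

```python
import math

def cosine(a, b):
    """Cosine similarity of two topic-proportion vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical topic proportions for the expanded query and three files
query = [0.6, 0.1, 0.1, 0.1, 0.1]
files = {
    "AddPHAAction.java": [0.7, 0.1, 0.1, 0.05, 0.05],
    "Billing.java":      [0.05, 0.8, 0.05, 0.05, 0.05],
    "VisitDAO.java":     [0.5, 0.2, 0.1, 0.1, 0.1],
}

# Rank files by similarity, keeping only those above the 0.3 threshold
ranked = sorted(
    ((name, cosine(query, topics)) for name, topics in files.items()),
    key=lambda item: item[1], reverse=True,
)
recommendations = [name for name, score in ranked if score > 0.3]
print(recommendations)  # Billing.java falls below the threshold
```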

RESULTS AND DISCUSSION
The experiments were performed in several sets. The first experiment queried using a use case scenario without expansion, as the baseline experiment. The second experiment used query expansion with the inner association relationship, which means that the use case scenario was concatenated with the inner association tokens. The third experiment used query expansion with the outer association relationship, which means that the use case scenario was concatenated with the outer association tokens. The fourth experiment used query expansion with the intratoken association relationship, which means that the use case scenario was concatenated with the intratoken association tokens. The final experiment used a query compounding all elements: the use case scenario, inner association tokens, outer association tokens, and intratoken association tokens. This section discusses the details of the experiments.

Experiment without query expansion
The first experiment used all words from a use case scenario as the query, without query expansion. It served as the baseline of the testing. The words were preprocessed using tokenization, stop word elimination, and stemming with the Porter algorithm. The remaining words were then used as the query. The result is depicted in Table 3. The total number of items retrieved was 541 documents, and the total number of items both relevant and retrieved was 62 of the 102 relevant documents. The average recall was 60.8%, which means that the baseline approach could provide about 60 of every 100 correct documents. The average precision was 15%, which means that about 15 of every 100 recommended documents were correct.
The best recall was 100%, which came from several use case scenarios (UC9S1, UC9S2, UC11S1, UC23S1, UC26S1, UC26S4). The reason was that those scenarios used many quite technical words, e.g., "immunization, diagnoses, and office visit", which could also be found in the source code, implemented in persistence-related files such as database/table code.

The worst recall was 0, which appeared for UC10S2 and UC10E2. The reason was that they were implemented in few source code files; as a result, the number of relevant items was limited (1 and 2 documents only). The words of use case scenario UC10S2 were "HCP choose height weight graph. presented chart chosen measurements patient spanning 3 calendar years data, averaged quarters (January-March, April-June, July September, October-December)". They are quite specific and directed and might not be shared among use case scenarios.
The best precision was 50%, which came from UC16. It was also the most ideal case, since the recall was quite superior (75%). The reason was that the sentences of UC16 contain a balance of words on both abstract and detailed (technical) topics. Another reason was that UC16 was implemented in several files (8); as a result, the number of relevant items was 8 documents. The words of UC16 were "Personal Health Records LHCP chooses chronic disease patient. data database analyzed risk factors disease determine exhibits risk factor. Risk factors for chronic diseases included diabetes type 1 and type 2 heart disease. chosen patient satisfies preconditions chosen chronic disease, the LHCP warning message patient exhibits risk factors. message display risk factors patients exhibit". It contains many words (e.g., personal health records, disease, patient, diabetes, type 1, type 2, and heart disease) shared among use case scenarios, which is why it could obtain the best results.

Experiment using query expansion based on the inner use case relationship
The second experiment queried using the tokens of a use case scenario with expansion from the inner use case relationship. The words were preprocessed using tokenization, stop word elimination, and stemming with the Porter algorithm. The preprocessed tokens originating from the use case scenario were concatenated with additional tokens from the inner-associated use case scenarios to build a query for information retrieval. The result is depicted in Table 4. The number of items retrieved was reduced to 458 documents, and the number of items both relevant and retrieved also decreased, to 58 of the 102 relevant documents. The average recall was 59.9%, which means that this approach could provide about 60 of every 100 correct documents. The average precision increased to 16.7%, which means that about 17 of every 100 recommended documents were correct.
The best recall was 100%, which came from several use case scenarios (UC9S2, UC10E1, UC23S1, UC26S1, UC26S4). The reason was much the same: those scenarios used many quite technical words, e.g., "immunization, diagnoses, and office visit", which could also be found in the source code, implemented in persistence-related files such as database/table code.
The worst recall was again 0, which appeared for UC10S2 and UC10E2. The reason was much the same: they were implemented in few source code files, so the number of relevant items was limited (1 and 2 documents only). The words of use case scenario UC10S2 were "HCP choose height weight graph. presented chart chosen measurements patient spanning 3 calendar years data, averaged quarters (January-March, April-June, July September, October-December)". They are quite specific and directed and might not be shared among use case scenarios.
The best precision was 50%, which came from UC16 and UC1S1. The reason for UC16 is still the same: it contains a balance of words on both abstract and detailed (technical) topics. UC1S1 had an inner expansion from UC1E1, which helped capture additional tokens. The words of UC1S1 and UC1E1 were merged into "health care profession enter patient user iTrust medic record email provide assign mid secret key initial password person reset edit accord data format value default null appropriate number edit enter view secure question prompt enter editor correct format requires data field input match specific patient". The words are also in stemmed form. The query contains many words (e.g., patient, medical record, mid, key person, and password) shared among use case scenarios, which is why it could obtain the best results.

Experiment using query expansion based on the outer use case relationship
The third experiment queried using the tokens of a use case scenario with expansion from the outer use case relationship. The words were preprocessed using tokenization, stop word elimination, and stemming with the Porter algorithm. The preprocessed tokens originating from the use case scenario were concatenated with additional tokens from the outer-associated use case scenarios to build a query for information retrieval. The result is depicted in Table 5.
The number of items retrieved was reduced to 479 of the baseline's 541 documents, and the number of items both relevant and retrieved also decreased, to 55 of the 102 relevant documents. The average recall was 60.7%, which means that this approach could provide about 60 of every 100 correct documents. The average precision was approximately the same, at 15.4%, which means that about 15 of every 100 recommended documents were correct.
The best recall was 100%, which came from several use case scenarios (UC9S2, UC10E1, UC23S1, UC26S1, UC26S4). The reason was much the same: those scenarios used many quite technical words, e.g., "immunization, diagnoses, and office visit", which could also be found in the source code, implemented in persistence-related files such as database/table code. The advantage of the outer association had an impact on UC23S3, UC11S1, and UC11S2, as shown in Figure 16. The precision of UC11S1 and UC11S2 increased to 19% and 27.8%, respectively, from baseline precisions of only 10% for UC11S1 and 23.8% for UC11S2. Meanwhile, the precision of UC23S3 also increased, to 8% from 4%.
The worst recall was again 0, which appeared for UC10S2 and UC10E2. The reason was much the same: they were implemented in few source code files, so the number of relevant items was limited (1 and 2 documents only). The words of use case scenario UC10S2 were "HCP choose height weight graph. presented chart chosen measurements patient spanning 3 calendar years data, averaged quarters (January-March, April-June, July September, October-December)". They are quite specific and directed and might not be shared among use case scenarios.

Experiment using query expansion based on the intratoken use case relationship
The fourth experiment queried using the tokens of a use case scenario with expansion from the intratoken use case relationship. The words were preprocessed using tokenization, stop word elimination, and stemming with the Porter algorithm. The preprocessed tokens originating from the use case scenario were concatenated with additional tokens from the intratoken associations to build a query for information retrieval. The result is depicted in Table 6.
The number of items retrieved was reduced to 475 of the baseline's 541 documents, and the number of items both relevant and retrieved also decreased, to 53 of the 102 relevant documents. The average recall was 59.3%, which means that this approach could provide about 59 of every 100 correct documents. The average precision was approximately the same, at 15.5%, which means that about 15 of every 100 recommended documents were correct.
The best recall was 100%, which came from several use case scenarios (UC9S2, UC10E1, UC23S1, UC26S1). The reason was much the same: those scenarios used many quite technical words, e.g., "immunization, diagnoses, and office visit", which could also be found in the source code, implemented in persistence-related files such as database/table code.
The advantage of the intratoken association had an impact on UC10S2 and UC10E1. The precision of UC10S2 and UC10E1 increased to 4.8% and 6.7%, respectively, from a baseline precision of 0% for both. The precision of UC23S1 also increased, to 8% from the previous precision (4.7%).
The worst recall was again 0, which appeared for UC10S2 and UC10E2. The reason was much the same: they were implemented in few source code files, so the number of relevant items was limited (1 and 2 documents only). The words of use case scenario UC10S2 were "HCP choose height weight graph. presented chart chosen measurements patient spanning 3 calendar years data, averaged quarters (January-March, April-June, July September, October-December)". They are quite specific and directed and might not be shared among use case scenarios.
The best precision was 50%, which came from UC16 and UC1S1. The reason for UC16 is still the same: it contains a balance of words on both abstract and detailed (technical) topics. These results are about the same as in the previous phases (no query expansion and the inner use case relationship). The outer use case precision rate mostly underperformed against the inner relationship, except for UC11S2, which was 4% better. The reason was that its related tokens could extend the search results.

Experiment using query expansion-based compound of all elements
The fifth experiment queried using the tokens of a use case scenario with expansion from the compound of all elements: the inner, outer, and intratoken use case relationships. The words were preprocessed using tokenization, stop word elimination, and stemming with the Porter algorithm. The preprocessed tokens originating from the use case scenario were concatenated with the additional tokens from all association types to build a query for information retrieval. The result is depicted in Table 7. The number of items retrieved increased to 604 from the baseline's 541 documents, and the number of items both relevant and retrieved was 62 of the 102 relevant documents. The average recall increased to 68.3%, which means that this approach could provide about 68 of every 100 correct documents. Unfortunately, the average precision dropped to 10%, which means that only about 10 of every 100 recommended documents were correct.
The best recall was 100%, which came from several use case scenarios (UC9S2, UC10S2, UC10E1, UC11S1, UC11S2, UC23S1, UC26S1). The reason was the increase in words due to the impact of all association tokens on the query. The worst recall was again 0, which appeared for UC10E2. The reason was much the same: it was implemented in few source code files, so the number of relevant items was limited.
The best precision was 33%, which came from UC1S1. The reason is that UC1S1 contains a balance of words on both abstract and detailed (technical) topics. UC16 no longer had the best precision, since it gained many additional words from query expansion. The advantage of compounding all elements was the increase in recall: most relevant documents were returned, and only UC10E2 returned no correct recommendations (recall 0). The precision, however, was reduced, since the approach returned many documents.

Result analysis and comparison
The sets of experiments were performed and produced several results. The results were compared to each other to measure how effective our methods are at finding feature locations. The chart is shown in Figure 17. Based on the chart, the best recall came from the experiment using the compound of all relationships (inner, outer, and intratoken). The main reason was that the number of tokens was large, since all use case relationships are included. The compound produced lower precision because it takes all tokens, which increases the number of retrieved documents, the dividing factor in the precision measurement.

The best precision came from the experiment using the inner relation. The worst recall came from the experiment using intratokens, and the worst precision from the compound of all relationships. The inner relation performed with the best precision because many use cases have inner relations among them, and the total number of related tokens was higher than for the outer and intratoken relations.

CONCLUSION
This research introduced the novel concept of the use case relationship, which includes the inner association, outer association, and intratoken association. The use case relationship was implemented and tested against source code topic-modeled with the LDA algorithm. Query expansion based on the inner, outer, and intratoken associations was tested to find feature locations. Precision and recall rates were used to measure the success of the approach.
The best precision rate was 50%, found for UC16, which contained tokens balanced between the abstract side and the technical side. The best recall was 100%, found for several use case scenarios implemented in a few files. The best average precision rate was 16.7%, found in the inner association experiments. The inner association could attract more tokens than the other associations (outer, intratoken), which made its average precision better than the baseline (use case without expansion). The best average recall rate was 68.3%, in the compound-of-all-elements experiment, since it contains all expansion tokens.
In the future, we plan to extend the methodology on the source code processing side by applying structural exploration. The associations among identifiers, methods, and comments could also be arranged in a data model to gain better precision and recall in feature location. Additional weighting of source code elements might also be beneficial as an alternative methodology.