Automated server-side model for recognition of security vulnerabilities in scripting languages

ABSTRACT


INTRODUCTION
Web applications are notorious for security vulnerabilities that can be exploited by malicious users. According to Positive Technologies (PT) [1], one of the top ten worldwide vendors of vulnerability assessment systems, between 60% and 75% of the analyzed sites (depending on the analysis method) contained critical vulnerabilities. A large share of the detected vulnerabilities belongs to cross-site scripting (XSS) and SQL injection. These kinds of vulnerabilities are caused by faulty code: cross-site script insertion results from a lack of sanitization of user-supplied data, while code injection vulnerabilities result from the mixing of code and data. Another notable point in these statistics is that the largest share of web application vulnerabilities belongs to the general class of taint-style vulnerabilities [2], i.e., vulnerabilities that result directly from missing or inadequate sanitization or validation of the data processed by the application.

This paper presents a new static code analysis model targeted at spotting security vulnerabilities in scripting languages. The model is implemented in a prototype called SCAT, which scans applications and detects cross-site scripting [2], SQL injection [3], remote code execution, remote command execution, and XPath injection vulnerabilities [4]. This paper is organized as follows: the next section covers the background and related work, Section 3 presents a detailed description of the model implementation, and Section 4 describes the assessment methodology. Section 5 presents the empirical results, while Section 6 presents the conclusions.
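As an illustration of the taint-style pattern described above, the following minimal Python sketch (not taken from the paper; the query builders and the naive sanitizer are invented for illustration) shows untrusted input flowing into a sensitive sink, with and without a sanitization step:

```python
def build_query_unsafe(user_input: str) -> str:
    # Tainted data reaches the sensitive sink (the query string) unmodified.
    return "SELECT * FROM users WHERE name = '" + user_input + "'"

def build_query_sanitized(user_input: str) -> str:
    # A naive sanitizer: escape single quotes before the sink.
    return "SELECT * FROM users WHERE name = '" + user_input.replace("'", "''") + "'"

payload = "' OR '1'='1"
unsafe_query = build_query_unsafe(payload)   # injected condition survives
safe_query = build_query_sanitized(payload)  # quotes are escaped
```

A taint-style analysis flags the first function because the path from the untrusted source (`user_input`) to the sink crosses no sanitization routine.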

BACKGROUND AND RELATED WORKS
Static code analysis is a well-known approach for detecting security problems in a program without executing it [5]. Static code analyzers are usually used early in development, which reduces the cost of fixing any error found in the code. However, static analysis tools are known to produce many false positives, i.e., cases in which a tool inappropriately marks a problem-free section of code as vulnerable [6]. This means that the output of a security tool usually requires human review.
A considerable number of security assessment models exist for scripting languages. Pixy [7] is one good example: it is an open-source static code analyzer that performs automatic scans of PHP 4 source code. Pixy takes a PHP program as input and outputs possibly vulnerable points. Yu et al. [8] also used static analysis to detect vulnerabilities in PHP 4 scripts and create string signatures for these vulnerabilities. They implemented this process in Stranger (STRing AutomatoN GEneratoR), a string analysis tool for PHP web applications [8]. However, the tool does not support recent versions of PHP.
Saner [9] is another security analyzer; it uses a static analysis component to identify the flows of input values from sources to sensitive sinks. Nevertheless, the tool does not support any object-oriented features of PHP. The author of RIPS [10] built a static source code analyzer written in PHP using the built-in tokenizer functions. The latest version of RIPS, released in 2014, detects a wide range of known vulnerabilities [10].

PROPOSED MODEL IMPLEMENTATION
The proposed model is designed to detect security issues in scripting languages like PHP. Figure 1 shows the underlying architecture of the proposed model as applied in SCAT. The model first transforms the input program into a parse tree [11]. In the prototype, the lexical analyzer is generated with JFlex, the well-known lexical analyzer generator for Java [12], while the parser is built using a modified version of the Constructor of Useful Parsers (CUP) v0.10 tool [13]. Some modifications had to be made to the source files of CUP so that each production's symbol name, symbol index, and length can be accessed by the rule actions.
Finally, in data flow analysis, the constructed parse tree is transformed into a control flow graph (CFG) for each encountered function [14]. The proposed model enforces a list of requirements that the data flow analysis must satisfy. First, the output CFG must maintain the flow of types at each program point during execution; this requirement is necessary due to the dynamically typed nature of PHP [15]. Second, information must be collected about the complete program, taking all function calls into consideration; this is the main role of the inter-procedural data flow analysis phase [16]. Finally, the data flow analysis collects all associated information for each node [17].
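The fixed-point computation that such a data flow analysis performs can be sketched as a worklist algorithm over the CFG. The Python sketch below is illustrative only; the node names, GEN/KILL sets, and the `dataflow` function are invented for this example and are not SCAT's actual implementation:

```python
from collections import deque

def dataflow(cfg, transfer):
    """Forward may-analysis over cfg ({node: [successors]}), iterated to a fixed point."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    out = {n: set() for n in cfg}
    work = deque(cfg)
    while work:
        n = work.popleft()
        in_set = set()
        for p in preds[n]:          # merge facts flowing in from predecessors
            in_set |= out[p]
        new_out = transfer(n, in_set)
        if new_out != out[n]:       # fact set changed: revisit successors
            out[n] = new_out
            work.extend(cfg[n])
    return out

# Toy program points: "read" taints x, "sanitize" cleans it, "echo" uses it.
GEN = {"read": {"x"}}
KILL = {"sanitize": {"x"}}

def transfer(node, in_set):
    return (in_set - KILL.get(node, set())) | GEN.get(node, set())

cfg = {"read": ["sanitize"], "sanitize": ["echo"], "echo": []}
facts = dataflow(cfg, transfer)
```

Here the taint fact for `x` is generated at the read, killed at the sanitization point, and therefore never reaches the echo sink.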

The information produced by the data flow analysis step then feeds the taint analysis step [19]. Taint analysis simply determines, for each program point, whether it may hold a tainted value. To improve the capability of this analysis phase, alias analysis is performed first; alias analysis collects the alias relationships of all variables in the input program [18].
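One common way to represent alias relationships is a union-find structure in which tainting any member of an alias group taints the whole group. The sketch below is a hedged illustration; the `AliasGroups` class and variable names are invented and do not reflect SCAT's internal code:

```python
class AliasGroups:
    """Union-find over variable names; aliased variables share one root."""
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def alias(self, a, b):
        # Record an alias relationship, e.g. PHP's "$b =& $a".
        self.parent[self.find(a)] = self.find(b)

groups = AliasGroups()
groups.alias("b", "a")          # $b is a reference to $a

tainted_roots = set()

def taint(var):
    tainted_roots.add(groups.find(var))

def is_tainted(var):
    return groups.find(var) in tainted_roots

taint("a")                      # tainting $a must also taint its alias $b
```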
The parse tree is first exported to the DOT language, and the DOT file is then transformed into a visualized tree using Graphviz [19]. We use Graphviz class libraries to create graphical representations of the parse tree and of the dependence graphs for program points that may receive tainted data at execution time. Figure 2 shows how these functionalities are implemented within the proposed model structure.
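Exporting a tree to DOT text is straightforward; the following Python sketch (an invented helper, not part of SCAT) shows the general shape of such an exporter for a nested-tuple parse tree:

```python
def to_dot(tree, name="parse_tree"):
    """tree: (label, [children]) nested tuples -> DOT source text for Graphviz."""
    lines = [f"digraph {name} {{"]
    counter = [0]

    def walk(node):
        label, children = node
        nid = f"n{counter[0]}"          # fresh node id per tree node
        counter[0] += 1
        lines.append(f'  {nid} [label="{label}"];')
        for child in children:
            lines.append(f"  {nid} -> {walk(child)};")
        return nid

    walk(tree)
    lines.append("}")
    return "\n".join(lines)

# A tiny parse tree for the statement: echo $_GET['name'];
dot_source = to_dot(("echo", [("$_GET", [("'name'", [])])]))
```

The resulting text can be rendered with the standard `dot` command-line tool or the Graphviz libraries.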
The last phase is result processing and report generation. For each sensitive sink that can receive tainted data during execution, the model generates a vulnerability record. The record shows the name of the file that contains the sensitive sink, the line number, and the type of the detected vulnerability. The proposed model also creates dependence graphs for the tainted variable [9]. It keeps track of each point in the program that may change a variable's type, such as assignment statements, function calls, and set/unset operations [20]. In each of these cases, type analysis investigates the corresponding CFG and updates the types of the related variables.

Dynamic object references pose a further challenge: many existing approaches lack any understanding of OOP features in scripting languages. For example, Pixy marks any custom object as a tainted program point and, similarly, marks all return values of user-defined methods as tainted. The proposed model applies an algorithm to user-defined classes and objects whose main function is to simulate stack and heap data structures for custom classes, object references, user-defined methods and variables, namespaces, and interfaces. Thus, the algorithm can maintain the relations between all defined objects, their custom classes, methods, and variables. During the analysis phase, each custom object is resolved against its class definition, which helps to detect vulnerabilities in user-defined objects and methods.
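The vulnerability record described above can be sketched as a simple data structure; the fields follow the description in the text, while the class name, file name, and line number below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class VulnRecord:
    file: str       # file containing the sensitive sink
    line: int       # line number of the sink (value below is invented)
    vuln_type: str  # e.g. "XSS", "SQL injection"

def render_report(records):
    """Format one report line per detected vulnerability."""
    return "\n".join(f"{r.file}:{r.line}: {r.vuln_type}" for r in records)

report = render_report([VulnRecord("searchform.php", 12, "XSS")])
```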

RESEARCH METHOD

Evaluation procedure
Evaluating the proposed model is mainly a matter of finding out how well the prototype SCAT conforms to the requirements of static code analysis tools, such as accuracy, robustness, usability, and responsiveness [21, 22]. For this purpose, two different sets of benchmark tests were performed. The evaluation process presented here adopts the same structure used by Poel [23], extended by computing the evaluation metrics for each tool. The evaluation metrics computed for each tool are precision, recall, specificity, and F-measure; the F-measure provides an aggregate measure of precision and recall.
Two other commonly used F-measures are the F2-measure, which weights recall higher than precision, and the F0.5-measure, which puts more emphasis on precision than recall [23]. The formula for the Fβ-measure is:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

F-measures range between 0 and 1 for a given tool, and the three measures can be used to rank the performance of several tools. The methodology of the evaluation process is shown in Figure 4. The process starts with choosing a group of related static analysis tools; each tool in the group is then used to analyze both sets of benchmarks, the intra-benchmark tests and the inter-benchmark tests; finally, the results obtained by each tool are manually analyzed in order to compute the evaluation metrics. The implementation code is available at https://sourceforge.net/p/scat-static-analysis/code/ci/master/tree/. Before the empirical results are reviewed, both benchmark test sets and the group of related tools involved in the evaluation are explained in the next three subsections.
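Using the standard definitions, these metrics can be sketched in a few lines of Python; the counts below are illustrative and are not taken from the paper's results:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def f_beta(p, r, beta=1.0):
    # Fbeta = (1 + b^2) * p * r / (b^2 * p + r)
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Illustrative counts: 9 true positives, 1 false positive, 3 false negatives.
p = precision(9, 1)          # 0.9
r = recall(9, 3)             # 0.75
f1 = f_beta(p, r)            # balanced F-measure
f2 = f_beta(p, r, beta=2)    # weights recall higher
f05 = f_beta(p, r, beta=0.5) # weights precision higher
```

With precision above recall, as in this example, the F0.5-measure exceeds F1, which in turn exceeds F2, matching the weighting described above.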

Benchmark tests

Intra-benchmark tests
The intra-benchmark tests consist of real-world web applications written in PHP. These applications were chosen to vary in size, supported PHP version, coding style, and code complexity. The complete list of these applications is shown in Table 1. For each application, the table shows its name, the version used in the experiments, the application type, and the code size measured in lines of code (LOC); the code size was calculated using the PHPLoc PEAR package.
Some of the tested applications are deliberately vulnerable web-applications that are provided as a target for web-security scanners. These applications are Exploit.co.il, Mutillidae and Damn Vulnerable Web App (DVWA). The rest of the tested applications are real-world applications written in PHP like PBL Guestbook 1.32, MyBloggie 2.1.6, WordPress 1.5.1.3, and MyEasyMarket 4.1.
The intra-benchmark tests consist of running each static code analysis tool on each of these applications; the results obtained by each tool are then manually analyzed to gather basic information, such as the total analysis time, the total number of spotted vulnerabilities (TP), and the number of false positives (FP) [6]. The experiments focus on a set of taint-style vulnerabilities, namely XSS, SQL injection, command injection, and code injection, as these are the vulnerabilities most frequently detected by the selected set of static code analysis tools [25-27].

Inter-benchmark tests
The inter-benchmark consists of 110 small PHP files forming 55 test cases, divided into three categories: language support, vulnerability detection, and sanitization routine support. Nico L. De Poel [24] used these test cases to evaluate a collection of commercial and open-source static code analyzers. Each test case consists of a vulnerable program that includes a security problem and a resolved program that fixes it. The evaluation process focuses on both true positive and false positive situations, so for each test case, a given tool is said to pass the test if it succeeds in detecting the vulnerability in the vulnerable file and does not raise an alarm on the resolved file.
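The pass criterion just described can be sketched as follows; `toy_scan` is a deliberately naive stand-in for a real analyzer, invented for illustration:

```python
def passes_test_case(scan, vulnerable_source, resolved_source):
    # Pass = flag the vulnerable version AND stay silent on the fix.
    return scan(vulnerable_source) and not scan(resolved_source)

# Stand-in "analyzer": flags code that feeds $_GET into a query without
# an escaping call. Real tools do far more than this string check.
def toy_scan(source_code):
    return "$_GET" in source_code and "escape(" not in source_code

vulnerable = 'query("SELECT * FROM t WHERE id=" . $_GET["id"]);'
resolved = 'query("SELECT * FROM t WHERE id=" . escape($_GET["id"]));'
result = passes_test_case(toy_scan, vulnerable, resolved)
```

This structure makes both failure modes visible: missing the vulnerability in the vulnerable file, or raising a false alarm on the resolved file.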

Selected tools
A wide range of related tools was investigated in order to choose the tools eligible to take part in the evaluation process. These tools must allow comparison of their performance, usability, and range of covered vulnerabilities. This was the main reason for choosing open-source tools: they offer full access to the source code, which helps in understanding the evaluation results. However, some tools were excluded, such as Ardilla [28] and IPAAS [29], since they do not provide source code, and TAP [30], a recent tool that detects vulnerabilities using deep learning.
The set of selected tools includes Pixy, RIPS, and Yet Another Source Code Analyzer (YASCA). Pixy was the first popular open-source static code analysis tool targeted at PHP [7]. The second tool in the set is RIPS, a static code analyzer developed by Johannes Dahse; it is written in PHP and detects a wide range of taint-style vulnerabilities. The third tool is YASCA, initially created by Michael V. Scovetta [31]; it can scan source code written in PHP and other languages.

Analysis Time-Based Comparison
The analysis time was computed for each intra-benchmark test, while the analysis time for the inter-benchmark tests was ignored, as it was negligibly small. Table 2 shows the analysis time for the intra-benchmark tests. SCAT took a noticeably long time on some applications; a significant part of this time is spent in the file inclusion resolution phase. However, this delay should be acceptable considering the notable number of vulnerabilities that SCAT detected in these applications.

Table 3 shows the number of vulnerabilities detected by each tool in the intra-benchmark tests. The results show that for most applications, SCAT achieved better results than the other tools. For example, in XSS detection, SCAT succeeded in detecting XSS vulnerabilities that the other tools failed to detect. Also, in the WordPress application, SCAT detected the XSS vulnerability in the "searchform.php" file through which WordPress allows remote attackers to inject arbitrary web script or HTML via the PHP_SELF portion of a Uniform Resource Identifier (URI) to "index.php". RIPS, on the other hand, kept raising false alarms in files such as "archive.php" and "index.php" in which "searchform.php" is included, while Pixy failed to parse WordPress, among other applications that use advanced PHP 5 features.

Table 3. Vulnerability detection in intra-benchmark tests
In order to standardize the results, the precision of the detected vulnerabilities was calculated for each tool. Table 4 shows these values, categorized by vulnerability type; the value for each tool is the average of the precision values it achieved across the tested applications. The table clearly indicates that SCAT achieved the highest precision for XSS vulnerabilities.
The precision values for SQL injection vulnerabilities are shown in the second row of the table; the results attest that SCAT again achieved the highest percentage among the compared tools. The precision values for command execution and code injection vulnerabilities are shown in the third and fourth rows. Although SCAT achieved the highest values, there was a considerable drop in the overall percentages, due to the absence of these types of vulnerabilities in most of the chosen benchmarks.

Vulnerability detection in inter-benchmark tests
The results of the inter-benchmark tests are grouped in Table 5. The table is divided into three grouped sets of rows. The first column shows the category name, and the second shows the subject name; each subject includes a set of test cases. The number of test cases in each subject is shown in the third column, and the remaining columns show the results of the true positive tests (TP tests), the false positive tests (FP tests), and the tool's success percentage for each subject [30]. The success percentage is calculated by dividing the total number of passed tests by the total number of tests in a given category.

In the vulnerability detection category, the results show that SCAT detected all vulnerability types except argument injection, which is not supported by the prototype. RIPS, on the other hand, failed to spot XSS and argument injection vulnerabilities and also failed one SQL injection test and one XPath test, while YASCA detected only command execution and XPath injection vulnerabilities. In the language support category, SCAT managed to detect the vulnerabilities in the object model files, passing 7 of the 8 tests in this subject. This result indicates that the effort spent on supporting object-oriented features in the prototype paid off. In the sanitization support group, the results show that SCAT recognized good sanitization routines in 86% of the TP tests, while YASCA failed to pass any of the test cases. The only test in which SCAT failed is the SQL injection sanitization test, in which an htmlspecialchars sanitization routine is used; SCAT considers this a strong sanitization method and therefore skips the vulnerability.
In the false positive tests of the vulnerability detection category, Pixy and RIPS remained silent on all the tests. However, false positive tests cannot be considered alone as an evaluation of tool performance; for example, Pixy passes all these tests simply because it is incapable of detecting these vulnerabilities. This is the main flaw of false positive tests: they cannot differentiate between a tool that can scan the code and actually decides to skip the resolved vulnerability and a tool that does not detect the vulnerability in the first place. SCAT came in second place with an 83% passing percentage.

Table 6 shows the calculated evaluation metrics for each tool in the three categories of the inter-benchmark tests. In the vulnerability detection category, Pixy achieved better values in the metrics that weight false positives higher than true positives, precisely because it does not cover these types of vulnerabilities. However, SCAT scored the highest F-measure (4). In the language support category, SCAT scored the highest precision (1), recall (2), and F-measure (4) values. In specificity (3), RIPS and Pixy achieved better performance because they passed more false positive tests; however, the F-measure values indicate that SCAT performs better overall. The precision (1), recall (2), and specificity (3) metrics in the sanitization support category show that SCAT has the highest F-measure value, although RIPS achieved higher precision and specificity.
The results of the inter-benchmark tests clearly show that SCAT scores the highest percentage in the true positive tests (recall) across the three categories, with an 88% detection rate. It also scored a 94% detection rate in the vulnerability detection category in particular, the highest rate among the compared tools. Pixy passed the three categories with a 63% detection rate, while RIPS scored only 28% and YASCA came last with a 7% detection rate. Table 7 summarizes the results of running the four tools against the inter-benchmark tests. The table presents the calculated recall (1), precision (2), specificity (3), and F-measure (4, 5) evaluation metrics. The results obtained for both types of F-measure show that SCAT achieved the best values.

CONCLUSION
Web applications play a major role in almost all principal services of daily life. However, vulnerabilities that threaten the personal data of users are discovered frequently. Therefore, this paper proposed an automated server-side model for the recognition and justification of a wide range of taint-style attacks. The proposed model is able to overcome most of the challenges in securing scripting languages like PHP. The model was implemented in a prototype called SCAT, which performs several types of analysis to detect security vulnerabilities in the input program.

The proposed model performs a flow-sensitive, inter-procedural, and context-sensitive data flow analysis to collect information about the program execution. The model then uses the information collected in the data flow analysis phase to detect security vulnerabilities such as XSS and SQL injection. Finally, it generates a detailed report explaining each sensitive sink that represents a security vulnerability in the program.
To evaluate the proposed system, an empirical evaluation procedure was conducted in which the prototype SCAT analyzed several real-world applications and categorized sets of benchmark tests. The results demonstrate that the proposed system detected 94% (recall) of the security vulnerabilities found in the testing benchmarks, the highest detection rate compared to the other systems, which clearly indicates the accuracy and robustness of SCAT. The evaluation process also assessed the compatibility of SCAT with PHP features; the prototype achieved the highest score with 83%, higher than Pixy, which came in second place with only 64%. As a result, SCAT provides an effective solution for complicated web systems, offering users the benefit of secured private data and providers the benefit of stable web applications.