The Improved Hybrid Algorithm for the Atheer and Berry-Ravindran Algorithms

ABSTRACT


INTRODUCTION
String matching is the process of identifying all occurrences of a pattern in a text by comparing two finite-length strings [1]. String matching is among the most important problems in computer science and is applied in many areas, such as web search engines [2], operating systems, compilers, command interpreters [3], intrusion detection systems [4], information retrieval and artificial intelligence [5]. The matching process compares a pattern against a text to identify the identical characters between them. Its cost is measured by the number of character comparisons and the number of attempts; these factors vary depending on the type of algorithm used [6], [7].
Persistent challenges continue to require the most efficient string matching algorithms, even as memory size and computer speed increase [8], [9]. Thus, string matching algorithms have recently been proposed to minimize these problems [10]. A hybrid string matching algorithm is an appropriate solution for mitigating the disadvantages of the original string matching algorithms [11]. The algorithm proposed in this paper combines the advantages of two exact string matching algorithms, Atheer and Berry-Ravindran, while reducing their disadvantages. All database types in the benchmark standard are used in this research to identify which databases are suited and unsuited to the proposed algorithm. The objective of this research is to overcome the weaknesses and improve the performance of exact string matching algorithms.
The original algorithms: Two original algorithms are used as references in this research, namely the Atheer and Berry-Ravindran algorithms. The Atheer algorithm is a hybrid of three algorithms: Raita, Smith, and Karp-Rabin [3]. The preprocessing phase of the Atheer algorithm has three functions: the BM bad character (bmBc) function, the quick search bad character (qsBc) function and the hashing function. In the searching phase of the Atheer algorithm, all the comparison steps depend on the hash process derived from the Karp-Rabin algorithm. The comparison starts between the hash value of three characters (rightmost, leftmost, and middle) in the pattern and the hash value of the corresponding three characters in the text window. If these hash values match, the three characters of the text window are compared one by one with the three characters of the pattern. If they also match, the remaining characters are compared, without comparing the middle character again. Whether a match or a mismatch occurs, the shift of the pattern is the maximum of the bmBc value for the character at position m and the qsBc value for the character at position m+1.
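The Atheer-style filter and shift described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the hash is assumed to be a plain sum of the three character codes, the names `fingerprint` and `atheer_style_search` are hypothetical, and the window is compared in full after the hash filter instead of skipping the middle character.

```python
def fingerprint(s):
    """Toy hash of the leftmost, middle and rightmost characters.

    Assumption: a plain sum of code points; the hash function used by the
    Atheer algorithm is not specified in this text.
    """
    return ord(s[0]) + ord(s[len(s) // 2]) + ord(s[-1])


def atheer_style_search(pattern, text):
    """Return all start positions of pattern in text (illustrative sketch)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # bmBc: distance from the rightmost occurrence in pattern[0..m-2] to the end.
    bmbc = {pattern[i]: m - 1 - i for i in range(m - 1)}
    # qsBc: distance from the rightmost occurrence in pattern[0..m-1] to one past the end.
    qsbc = {pattern[i]: m - i for i in range(m)}
    fp = fingerprint(pattern)

    hits, j = [], 0
    while j <= n - m:
        window = text[j:j + m]
        # Cheap hash filter first, character comparison only if it passes.
        if fingerprint(window) == fp and window == pattern:
            hits.append(j)
        bm = bmbc.get(text[j + m - 1], m)                       # character at position m
        qs = qsbc.get(text[j + m], m + 1) if j + m < n else m + 1  # character at position m+1
        j += max(bm, qs)
    return hits


print(atheer_style_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG"))
```

Both shifts are individually safe (neither skips an occurrence), so taking their maximum is safe as well.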
The Berry-Ravindran algorithm is a hybrid approach characterized by left-to-right character comparisons [12]. This algorithm is a hybrid of the Zhu-Takaoka and Quick-search algorithms and has two phases, namely, preprocessing and searching. The preprocessing phase depends on the Berry-Ravindran bad character (brBc) function, which builds the brBc table. In the searching phase, the comparison starts from the leftmost character in the pattern and the leftmost character in the text window. If a match is found, the comparison continues with the next character until all characters are matched. Whether a match or a mismatch occurs, the shift is determined by the two characters immediately to the right of the text window (positions m+1 and m+2), and its value is read from the brBc table built in the preprocessing phase.
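A minimal sketch of the Berry-Ravindran scheme described above, assuming the standard brBc definition (shift 1 if the first following character equals the last pattern character, a pair-based shift if the two following characters occur as adjacent characters in the pattern, m+1 if the second following character equals the first pattern character, and m+2 otherwise). The function names are illustrative, and the table is stored as a dictionary of pairs instead of a σ × σ array.

```python
def build_brbc(pattern):
    """Return a function shift(a, b) giving the brBc value for the pair (a, b)."""
    m = len(pattern)
    # Rightmost occurrence of each adjacent pair wins (smallest shift).
    pair_shift = {(pattern[i], pattern[i + 1]): m - i for i in range(m - 1)}
    first, last = pattern[0], pattern[-1]

    def shift(a, b):
        if a == last:
            return 1
        if (a, b) in pair_shift:
            return pair_shift[(a, b)]
        if b == first:
            return m + 1
        return m + 2

    return shift


def br_search(pattern, text):
    """Return all start positions of pattern in text (illustrative sketch)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    shift = build_brbc(pattern)
    hits, j = [], 0
    while j <= n - m:
        if text[j:j + m] == pattern:   # left-to-right comparison of the window
            hits.append(j)
        # Characters at positions m+1 and m+2; None stands in past the end of the text.
        a = text[j + m] if j + m < n else None
        b = text[j + m + 1] if j + m + 1 < n else None
        j += shift(a, b)
    return hits


print(br_search("abra", "abracadabra"))  # [0, 7]
```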

PROPOSED E-ATHEER ALGORITHM
The proposed E-Atheer algorithm consists of two phases, namely, the preprocessing phase and searching phase.

Preprocessing Phase
The preprocessing phase contains the techniques selected from the Atheer and Berry-Ravindran algorithms. These techniques are organized into functions to obtain the exact string matching of the E-Atheer algorithm. These functions are presented as follows:
a) Boyer-Moore Bad Character (bmBc) Function
The technique selected from this function is similar to that in the preprocessing phase of the Atheer algorithm. The main purpose of using the bmBc table in this function is to determine and choose the best shift for each character in the matching operation as shown in Lines 21-26, Figure 1. The form of the bmBc function is defined by the equation below.
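As an illustration of the bmBc table, the following sketch assumes the standard Boyer-Moore bad-character definition: for each character c, bmBc[c] is the distance from the rightmost occurrence of c in x[0..m-2] to the last position of the pattern, or m if c does not occur there. The names `build_bmbc` and `bmbc_shift` are hypothetical, and a dictionary with a default stands in for a table of size σ.

```python
def build_bmbc(pattern):
    """Build the bad-character table for all characters in pattern[0..m-2]."""
    m = len(pattern)
    table = {}
    for i in range(m - 1):
        # Later (more rightward) occurrences overwrite earlier ones.
        table[pattern[i]] = m - 1 - i
    return table


def bmbc_shift(table, m, ch):
    """Look up the shift for ch; characters absent from the table shift by m."""
    return table.get(ch, m)


pattern = "GCAGAGAG"
table = build_bmbc(pattern)
print({c: bmbc_shift(table, len(pattern), c) for c in "ACGT"})
# {'A': 1, 'C': 6, 'G': 2, 'T': 8}
```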

Searching Phase
The searching phase of the E-Atheer algorithm depends on the searching phase techniques of the Atheer and Berry-Ravindran algorithms and on some modifications made to the matching operation. In the first step, the hash value of the three characters in the pattern (Fh) is compared with the hash value of the three characters in the text window (Fhw). If the hash values match, the three characters in the text window are compared with the three characters in the pattern. If these characters match, the second step is conducted as shown in Line 11; Figure 2. However, if a mismatch is obtained in the hash comparison or in the character comparison, the new shift is the maximum of the bmBc value for m and the brBc value for (m+1 and m+2) as shown in Line 24; Figure 2. Here m refers to the last character in the text window, m+1 is the first character after the text window, and m+2 is the second character after the text window. The rehash function is then used to calculate the hash of the three characters of the new text window after the shift as shown in Line 26; Figure 2.
In the second step, when a match is obtained in the first step, the hash of the characters from the second to the middle−1 position in the text window (denoted by Shw) is calculated and compared with the hash of the corresponding characters in the pattern (Sh). If the hash values match, the characters themselves are compared as shown in Lines 12 and 14; Figure 2. If the characters also match, the third step commences. If a mismatch is obtained, the shift is determined by the same technique as in the previous step, as shown in Line 24; Figure 2, and the rehash function is used.
In the third step, when a match is obtained in the second step, the hash of the characters from the middle+1 to the last−1 position in the text window (denoted by Thw) is calculated and compared with the hash of the corresponding characters in the pattern (Th). If the hash values match, the characters themselves are compared as shown in Lines 15 and 17; Figure 2. Whether a match or a mismatch is obtained between the characters, the shift is determined by the same technique as in the previous steps, as shown in Line 24; Figure 2, and the last step (i.e., the rehash function) commences.
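The three-step search with the combined shift can be sketched as follows. This is an illustration under stated assumptions, not the pseudocode of Figure 2: the hash is assumed to be a plain sum of character codes, the window hash is recomputed at each attempt instead of being updated by a rehash function, and the names `e_atheer_search`, `h` and `brbc` are hypothetical.

```python
def e_atheer_search(pattern, text):
    """Return all start positions of pattern in text (illustrative sketch)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    mid = m // 2

    # Preprocessing: bmBc table and brBc pair rules.
    bmbc = {pattern[i]: m - 1 - i for i in range(m - 1)}
    pair = {(pattern[i], pattern[i + 1]): m - i for i in range(m - 1)}
    first, last = pattern[0], pattern[-1]

    def brbc(a, b):
        if a == last:
            return 1
        if (a, b) in pair:
            return pair[(a, b)]
        return m + 1 if b == first else m + 2

    def h(s):
        # Assumed toy hash: sum of code points.
        return sum(map(ord, s))

    fh = h(pattern[0] + pattern[mid] + pattern[-1])   # Fh
    sh = h(pattern[1:mid])                            # Sh: second .. middle-1
    th = h(pattern[mid + 1:m - 1])                    # Th: middle+1 .. last-1

    hits, j = [], 0
    while j <= n - m:
        w = text[j:j + m]
        # Step 1: hash of first, middle and last characters, then the characters.
        if (h(w[0] + w[mid] + w[-1]) == fh and w[-1] == last
                and w[0] == first and w[mid] == pattern[mid]):
            # Step 2: hash (Shw) and characters of the left segment.
            if h(w[1:mid]) == sh and w[1:mid] == pattern[1:mid]:
                # Step 3: hash (Thw) and characters of the right segment.
                if h(w[mid + 1:m - 1]) == th and w[mid + 1:m - 1] == pattern[mid + 1:m - 1]:
                    hits.append(j)
        # Shift: maximum of bmBc for position m and brBc for positions m+1, m+2.
        a = text[j + m] if j + m < n else None
        b = text[j + m + 1] if j + m + 1 < n else None
        j += max(bmbc.get(text[j + m - 1], m), brbc(a, b))
    return hits


print(e_atheer_search("abra", "abracadabra"))  # [0, 7]
```

Because the bmBc shift and the brBc shift are each safe on their own, their maximum never skips an occurrence.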

PROPOSED ALGORITHM ANALYSIS
The preprocessing phase of the proposed algorithm has three functions: brBc, bmBc, and the hash function. The time complexity of the brBc function is O(m + σ^2), that of the bmBc function is O(m + σ), and that of the hash function is O(m). The space and time complexity of the preprocessing phase of the proposed algorithm is therefore O(m + σ^2). The time complexity of the searching phase is explained below.
Lemma 3.1. The time complexity of the searching phase is O(n/(m+2)) in the best case.
Proof. In each attempt, if a character does not occur in the pattern during the matching process, the shift is the maximum of the bmBc value for m and the brBc value for (m+1 and m+2), both calculated in the preprocessing phase. The best case occurs when all characters in the pattern are different from those in the text window; the shift is then m+2 and the time complexity is O(n/(m+2)).
In the worst case, the time complexity of the searching phase is O(n × m).
Proof. In the matching process, no character of the text is compared more than m times, so the total number of character comparisons for the n characters of the text cannot be greater than n × m. The worst case occurs when all the characters in the pattern are the same as those in the text window in every attempt. The shift is then one and the time complexity is O(n × m). For example:
Text: yyyyyyyyyyyyyyyy
Pattern: yyyy
The average time complexity of this algorithm cannot be determined accurately because it depends on the alphabet size and on the probability of occurrence of each individual character in the text.
Figure 2 (fragment of the searching-phase pseudocode):
    If (fhx == fhy && lastCh == c && firstCh == y[j] && middleCh == y[j + m/2]) Then
12. shfy ← gethy(j + 1, j + m/2, y) // calculate the hash of (Shw)
13.
    If (shfx == shfy && match(x + 1, m/2 − 1, y, j + 1, &temp) == 1) Then
15. shly ← gethy(j + m/2 + 1, j + m − 1, y) // calculate the hash of (Thw)
16.
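As a worked check of the best-case and worst-case bounds in this analysis, the sketch below counts the attempts made on the example text (n = 16) and pattern (m = 4) when every attempt shifts by one (worst case) or by m+2 (best case). The helper `attempts` is hypothetical and models only the shifting, not the character comparisons.

```python
def attempts(n, m, shift):
    """Number of window positions visited when every attempt shifts by `shift`."""
    count, j = 0, 0
    while j <= n - m:
        count += 1
        j += shift
    return count


n, m = 16, 4
worst = attempts(n, m, 1)       # shift of one at every attempt
best = attempts(n, m, m + 2)    # shift of m + 2 at every attempt

# Worst case: 13 attempts of at most m comparisons each, 52 <= n * m = 64.
# Best case: 3 attempts, in line with n / (m + 2) = 16 / 6.
print(worst, worst * m, best)   # 13 52 3
```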

EXPERIMENTAL DESIGN AND EVALUATION OF STUDY OUTCOMES
The design of the proposed algorithm is based on selecting the good features of the original algorithms, namely the hash and bmBc functions from the Atheer algorithm and the brBc function from the Berry-Ravindran algorithm. The proposed algorithm was tested on all types of benchmark standard databases, and the results of E-Atheer were compared with those of the original, recent, and standard algorithms.

Databases
This study investigates the differences in the performance and properties of several exact string matching algorithms when different types of databases are used (200 MB data size). The benchmark standard of databases covers common types of data, such as DNA, Protein, XML, Pitch, English text, and Source code. These datasets were downloaded from the Pizza & Chili Corpus Web site (http://pizzachili.dcc.uchile.cl/). Two pattern lengths were used in this study: the short pattern length, which ranged from 4 characters to 28 characters, and the long pattern length (powers of two), which ranged from 2^5 characters to 2^10 characters [13], [14]. The DNA data sequence is composed of four nucleotides, namely, Adenine, Guanine, Cytosine, and Thymine, which are encoded as A, G, C, and T respectively.
The Gutenberg project is included in this database [15], [16]. Proteins are necessary to the structure and function of the cells of an organism. The Protein data sequence is composed of 20 amino acids arranged in a linear series and encoded using uppercase letters [17], [18]. The Protein data were obtained from the Swiss-Prot database. The XML structure database contains bibliographic information on computer science. The Pitch (MIDI pitch values) database contains tuning data used in digital music [19], [20]. The English text database uses all the characters in the English alphabet. The Gutenberg project has established this database [15]. The Source program code database is composed of all the characters used in the C and Java languages [21].

Implementation and Environment
This experiment was conducted using the Biruni cluster in the School of Computer Sciences at USM (biruni.cs.usm.my). The operating system used was Ubuntu Linux 10.04 and the compiler used was GCC v4.4.3. In this study, the hybrid algorithm is described as "best in performance" when its results are better than those of the original algorithms. The evaluation tables for each hybrid algorithm are arranged starting with the best result, followed by the other algorithms. The algorithms are then ranked as "first," "second," or "third." In the evaluation of the performance of the hybrid algorithm on various types of databases, the results are regarded as "best" when the hybrid algorithm performed better on a specific database than on the others, whereas "worst" refers to the poorest performance of the hybrid algorithm for that database. When the hybrid or other algorithms obtained the best performance in all types of databases, "all databases" is used, whereas when the hybrid or other algorithms obtained the best performance in almost all databases, "most databases" is used. To make the figures readable, the number-of-attempts figures comparing the proposed hybrid algorithm with the original algorithms use display units of 10000 only. The number-of-attempts figures comparing it with the recent and standard algorithms use a logarithmic scale with base 10, display units of 10000, and a minimum of 100000. The number-of-character-comparisons figures use a logarithmic scale with base 10 and display units of 10000 for the comparisons with the original, standard, and recent algorithms.

RESULTS AND DISCUSSION
The results of the E-Atheer algorithm are compared first with those of the original algorithms and then with those of the recent and standard algorithms. The number of attempts and the number of character comparisons are the main factors considered in this study. The databases used in this study are DNA, Protein, XML, Pitch, English, and Source. The size of the data is 200 MB. Two pattern lengths were used: short (4 to 28) and long, which uses powers of two (2^5 to 2^10).

Evaluation and Results Analysis of the E-Atheer Algorithm and the Original Algorithms
In terms of the number of attempts, the E-Atheer algorithm obtains the best results compared with the Berry-Ravindran and Atheer algorithms in both short and long patterns. The Pitch database shows the best results in number of attempts when using short and long patterns, whereas the DNA database shows the worst results as shown in Figures 3 and 4. In terms of the number of character comparisons, the E-Atheer algorithm also obtains the best results compared with the Atheer and Berry-Ravindran algorithms in both short and long patterns. The Source database shows the best results in number of character comparisons when using short and long patterns, whereas the DNA database shows the worst results as shown in Figures 5 and 6.

Evaluation and Results Analysis of the E-Atheer Algorithm and Recent and Standard Algorithms
In terms of the number of attempts, the E-Atheer algorithm is the best algorithm in all types of databases when using short patterns, except for DNA, where it is the second best. The Maximum shift algorithm is the best algorithm in all databases when using long patterns, followed by the E-Atheer algorithm, except for the Pitch database, where Maximum shift and E-Atheer are jointly the best. The Two-way algorithm is the worst algorithm in both short and long pattern lengths as shown in Figures 7 and 8.
In terms of the number of character comparisons, the E-Atheer algorithm is the best algorithm in all databases when using short patterns. When using long patterns, AKRAM is the best algorithm and E-Atheer is the second best in all databases except XML, where E-Atheer and AKRAM are jointly the best. The Two-way algorithm is the worst algorithm in both short and long pattern lengths as shown in Figures 9 and 10.

Evaluation of the E-Atheer Algorithm Compared with the Original Algorithms
The performance results of the E-Atheer algorithm and the original algorithms are compared in terms of the number of attempts and the number of character comparisons in both short and long patterns with different data types and sizes. Table 1 shows the comparison of the results of the E-Atheer and original algorithms. The E-Atheer algorithm obtains the best results in terms of the number of attempts because its shift is the maximum of the brBc and bmBc values. The E-Atheer algorithm obtains the lowest number of character comparisons because it relies on the hash function, which simplifies the character comparison between patterns and texts [22]. The shifting functions also help reduce the number of character comparisons as shown in Table 1. Table 2 shows the performance of the E-Atheer algorithm in different database types. The E-Atheer algorithm obtains the lowest number of attempts in the Pitch database because this algorithm depends on two good functions of the Atheer algorithm, namely, hash and bmBc [3]; these functions are considered efficient when employed in the Pitch database [23]. Pitch data contain a high percentage of numbers because the data are encoded as MIDI pitch numbers in computer applications [24], [25]. The hash function also operates on integer values, and the bmBc function is considered a good shifting function that helps reduce the number of attempts.
The E-Atheer algorithm obtains the lowest number of character comparisons in the Source code database because it relies on the efficiency of the Atheer technique, from which the Source database also benefits. Both the E-Atheer and Atheer algorithms use the hash function, which produces large hash values in databases with large alphabet sizes and thus reduces the probability of character comparison. The E-Atheer and Atheer algorithms show the highest number of attempts and character comparisons in the DNA database as shown in Table 2.

Evaluation of the E-Atheer Algorithm Compared With Recent and Standard Algorithms
The performance results of the E-Atheer algorithm and the recent and standard algorithms were compared in terms of the number of attempts and character comparisons using short and long patterns, with different data types and sizes. The standard and recent algorithms employed in this study are Horspool, Quick-search, Two-way, Fast search, SSABS, TVSBS, AKRAM, and Maximum shift. Table 3 shows the comparison of the results between the E-Atheer algorithm and the recent and standard algorithms.
The E-Atheer algorithm obtains the lowest number of attempts when using short patterns because this algorithm depends on the efficient shifting functions bmBc and brBc, taken from the Atheer and Berry-Ravindran algorithms respectively. The Maximum shift algorithm shows the lowest number of attempts in long patterns because this algorithm relies on the efficient functions (ztBc) and (qsBc) [26]. The E-Atheer algorithm obtains the lowest number of character comparisons when using short patterns because this algorithm depends on the useful technique of the Atheer algorithm in comparing characters. If a mismatch is obtained in the second step, the loss involves only three characters because the first step depends on three characters only [3]. The AKRAM algorithm obtains the lowest number of character comparisons in long patterns because the high hash value in long patterns reduces the mismatch probability [27]. The Two-way algorithm shows the worst results in terms of the number of attempts and character comparisons because this algorithm depends on the factorization technique [8] as shown in Table 3. Table 4 shows the ranking of the E-Atheer algorithm in different data types. In terms of the number of attempts, the E-Atheer algorithm ranks first in most data types and sizes when short patterns are used and ranks second in most databases when long patterns are used. For the number of character comparisons, the E-Atheer algorithm ranks first in all databases with different sizes when short patterns are used and ranks second in most databases when long patterns are used as shown in Table 4.

CONCLUSION
The E-Atheer algorithm obtains the best results in terms of the number of attempts compared with the original algorithms when short and long pattern lengths are used. The algorithm ranks first in short patterns compared with the recent and standard algorithms and ranks second in some data types when long patterns are used. For the number of character comparisons, the proposed algorithm shows the best results compared with the original algorithms in short and long pattern lengths. The improved algorithm also performs better than the recent and standard algorithms in short pattern lengths, while it ranks second in long patterns. The Pitch database shows the best performance in the number of attempts with the proposed algorithm, whereas the DNA database performs the worst. The best and worst databases in the number of character comparisons with the E-Atheer algorithm are the Source and DNA databases, respectively.