Solving multiple sequence alignment problems by using a swarm intelligent optimization based approach

In this article, the alignment of multiple sequences is examined through an improved particle swarm optimization (PSO) based on swarm intelligence. PSO has emerged as a stochastic heuristic technique for solving discrete optimization problems and for realistic estimation. It is a nature-inspired technique based on the intelligence and movement of swarms. Each solution is encoded as a particle, analogous to the “chromosomes” of the genetic algorithm (GA). The fitness function is designed, based on the optimization of the objective function, to maximize the matching components of the sequences and reduce the mismatching components. The publicly available benchmark data set BAliBASE is used to assess the performance of the proposed system and the potential of PSO to adapt for better performance. The proposed system is compared with a few existing approaches: DIALIGN, PILEUP8, hidden Markov model training (HMMT), rubber band technique-genetic algorithm (RBT-GA), and ML-PIMA. In many cases, the experimental results of the proposed system are better than those of the existing approaches.


INTRODUCTION
Multiple sequence alignment (MSA) is one of the most challenging and powerful tasks among computational problems. Evaluating biological sequences helps in the MSA process. An MSA is defined over sets of amino acid or nucleotide sequences. Primary and secondary structures are useful in understanding the structure of a given sequence. MSA helps in finding the phylogenetic distance between sequences. In computational biology, a main task is to predict the structure of a given molecule [1], [2], and MSA makes it easier to find that structure. MSA is also useful for solving a variety of other tasks in computational biology, which is the main reason why researchers focus on solving MSA problems effectively. The primary purpose of MSA is to differentiate sequences with similar inactive properties. Where possible, it should not only distinguish rows with inactive characters but also compare them with other characters. In bioinformatics, the main problem in MSA is knowing the relationship between phylogenetics and genetics. So far, various methods have been used to overcome this problem. The work in [4] introduced a two-level strategy for solving the MSA problem. A swarm intelligence-based optimization approach is used to solve the MSA problem in [5].
A dynamic programming (DP) approach is useful for solving MSA problems effectively. In DP, a large-domain scoring function is used to update the results for MSA problems. The DP solution to the sequence alignment problem was proposed by Needleman and Wunsch in 1970 [6]. As the number and length of the sequences increase, however, the DP approach breaks down easily, because the complexity of dynamic programming grows with both the sequence length and the number of sequences. The MSA problem is therefore NP-hard. The major aim is thus to tackle MSA in a systematic manner, i.e., to find the similarities among the sequences with less complexity. Another way to deal with MSA problems is a progressive alignment approach, which is useful because it is less complex [6], [7]. In this method, the most similar sequences are aligned first, and the remaining sequences are then added to that alignment step by step. CLUSTAL W [8] is a classic progressive method.
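The DP recurrence behind pairwise global alignment can be sketched as follows. This is a minimal illustration and not the implementation used in this article; the function name `needleman_wunsch_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    The scoring values are illustrative assumptions, not taken from the article.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = dp[i - 1][j] + gap    # gap inserted in b
            left = dp[i][j - 1] + gap  # gap inserted in a
            dp[i][j] = max(diagonal, up, left)
    return dp[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

For two sequences of lengths m and n the table has (m+1)(n+1) cells. Extending the same table to N sequences requires one dimension per sequence, which is why exact DP becomes impractical as the number of sequences grows.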
Another approach to MSA problems is the iterative procedure. It starts with an initial alignment and refines the solution repeatedly until no further improvement is possible. Here, the result does not depend on the initial alignment. The goal of the iteration process is to enhance the quality of the alignment. Simulated annealing is one type of iterative, stochastic method. A combination of simulated annealing and particle swarm optimization (PSO) is proposed for the MSA problem in [9]: PSO provides the global search, and simulated annealing is incorporated to avoid trapping in local optima.
Evolutionary algorithms (EAs) are population-based approaches that mainly focus on global search. In MSA, an EA first generates randomized alignments, and the subsequent EA steps are applied to the sequences to improve their similarity. Sequence alignment by genetic algorithm (SAGA) [10], rubber band technique-genetic algorithm (RBT-GA) [11], and several other genetic algorithm (GA) variants have been applied to MSA. Using reference sets 2 and 3, Taheri and Zomaya [11] solved 34 problems from the BAliBASE database. A major disadvantage of EAs is that they tend to converge to local optimal solutions with low variance. In the proposed method, DP is used to address the local optimum problem. PSO has an inbuilt elitism operator. The elitism and the modified initial operator of the proposed method help to solve the MSA problem effectively.
Another swarm intelligence-based approach is inspired by the artificial fish algorithm. Dabba et al. [12] proposed a multi-objective artificial fish swarm algorithm called MOAFS for solving the MSA problem, using two fitness functions to maintain diversity between exploration and exploitation of the search space. Chaabane et al. [13] proposed a hybrid of PSO and Tabu search, called PSO-TS, for solving MSA: PSO is used to explore the search space effectively, and Tabu search is used to produce high-quality global best solutions. A bird swarm align algorithm [14] has also been proposed for solving the MSA problem. A brainstorm optimization-based algorithm with a dynamic population size is proposed for the MSA problem in [15], [16].
A hybrid approach combining PSO and artificial bee colony (ABC) optimization, called the hybrid algorithm of artificial bee colony and particle swarm optimization (HABC-PSO), is proposed in [17]. It integrates Tent chaos search with opposition-based learning and incorporates a recombination operator to obtain the best solutions. A multi-objective ABC is proposed in [18] for the MSA problem; it uses two objective functions, the sum of pairs and the entropy. Another ABC-based three-level multi-objective approach addresses the ribonucleic acid (RNA) multiple structural alignment problem [19]. A new approach based on the flower pollination algorithm (FPA) [20] is proposed for solving the MSA problem; this metaheuristic finds the alignment by maximizing the sum-of-pairs score and the column score.
This article is organized as follows. The underlying PSO method is discussed in section 2. The proposed PSO method is described step by step in section 3, along with an illustration of the encoding and decoding. Section 4 discusses the experimental studies of the proposed method and its comparison with other methods. Finally, section 5 concludes the paper.

METHODS
Many evolutionary computation techniques have been used successfully for solving optimization problems. In this article, we adopt one of the swarm intelligence-based techniques, PSO, which is very useful for finding the right solution in a large search space. PSO was introduced in 1995 by Kennedy and Eberhart [21], who were inspired by schools of fish and flocks of birds. A binary version of PSO was introduced in [22] for the MSA problem, and we use the same kind of approach in our experimental study. Equations (1) and (2) describe the velocity and position updates.
Here, X is the current position, r1 and r2 are random vectors in [0,1], V is the particle velocity, c1 and c2 are the acceleration constants, t is the iteration, and w is the inertia weight.
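Equations (1) and (2) are not reproduced in this text. For reference, the standard PSO update written with the symbols defined above takes the form below. The symbols P_i (personal best position of particle i) and G (global best position of the swarm) are assumptions introduced here, since they are not named in the surrounding text, and the exact form used in the binary variant of [22] may differ.

```latex
\begin{align}
V_i^{t+1} &= w\,V_i^{t} + c_1\, r_1 \odot \bigl(P_i^{t} - X_i^{t}\bigr) + c_2\, r_2 \odot \bigl(G^{t} - X_i^{t}\bigr) \tag{1} \\
X_i^{t+1} &= X_i^{t} + V_i^{t+1} \tag{2}
\end{align}
```

The first term carries the previous velocity forward, scaled by the inertia weight; the second and third terms pull the particle towards its own best position and the best position of the swarm, respectively.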

PROPOSED APPROACH
This section describes the proposed PSO method step by step. The steps are: initializing the positions of the particles, representing a solution as a particle, and updating the particle position through the velocity using (1) and (2). A complete description of the proposed fitness function and of the calculation of the alignment score is also given. These steps are explained below.

Initial generation
The main goal of this step is to develop a set of initial solutions. Initial solutions are developed by inserting gaps between the residues. The following example shows the generation of the initial population. Figure 1(a) shows the initial MSA. For each pair of sequences, we find the pairwise alignment using the Needleman-Wunsch algorithm. Figure 1(b) shows the initial pair (1, 2), and Figure 1(c) shows the pairwise alignment found for this pair. For a problem with 4 sequences, 4*(4-1)/2=6 pairs are considered, and the pairwise alignment is found in the same way for each pair. Next, a random permutation of 1 to N is generated. For example, with 4 sequences, any permutation of 1 to 4 can be generated. If the permutation (3, 4, 1, 2) is generated, Figure 1(d) shows the alignment process. K solutions are generated from K random permutations of 1 to N. At this stage, the initial population is complete.
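The bookkeeping of this step, enumerating the sequence pairs and drawing K random orders, can be sketched as follows. The function names `sequence_pairs` and `random_orders` are hypothetical, and the sketch deliberately stops before the merging of pairwise alignments, which is illustrated only in Figure 1(d).

```python
import random
from itertools import combinations


def sequence_pairs(n):
    """All unordered pairs (i, j) of sequence indices 1..n; there are n*(n-1)/2 of them."""
    return list(combinations(range(1, n + 1), 2))


def random_orders(n, k, seed=None):
    """Draw k random permutations of 1..n, one per initial solution."""
    rng = random.Random(seed)
    orders = []
    for _ in range(k):
        order = list(range(1, n + 1))
        rng.shuffle(order)  # in-place uniform shuffle
        orders.append(order)
    return orders


print(len(sequence_pairs(4)))  # 6 pairs for 4 sequences
print(random_orders(4, 3, seed=0))
```

Each permutation fixes the order in which the sequences enter the alignment, so K permutations yield K candidate solutions for the initial population.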

Solution representation as a particle
Every solution is represented in PSO as a particle.
Here, d is the dimension size of a particle and N is the number of particles. Out of the 200 solutions, we take one solution and explain how the MSA problem is encoded into a particle. Figure 2 shows the initial solution. In the binary encoding scheme, 1 is placed at each gap position and 0 at each residue position of the protein sequence. Figure 2 shows how the initial solution is encoded as a particle. Each column, read from bottom to top as a binary number, gives one decimal value. So, X1=(1, 0, 0, 8, 4, 2) is the particle representation of this solution. Therefore, the number of dimensions of the particle equals the number of columns in the MSA. With this encoding scheme, new particles are developed only after the initial solution generation.
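The binary encoding can be sketched in a few lines. Figure 2 is not reproduced in this text, so the four-row alignment below is a hypothetical example constructed to yield the particle (1, 0, 0, 8, 4, 2). The bit order (top row as the least significant bit) is an assumption, as are the function names `encode_alignment` and `gap_masks`.

```python
def encode_alignment(rows, gap="-"):
    """Encode an alignment as one integer per column (1 bit per row, 1 = gap)."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same length")
    particle = []
    for col in range(width):
        value = 0
        for bit, row in enumerate(rows):  # assumption: top row is the least significant bit
            if row[col] == gap:
                value |= 1 << bit
        particle.append(value)
    return particle


def gap_masks(particle, n_rows):
    """Decode a particle back into one 0/1 gap mask per row."""
    return [[(value >> bit) & 1 for value in particle] for bit in range(n_rows)]


# Hypothetical alignment, not the one shown in Figure 2.
example = ["-ACGTA", "GACGT-", "GACG-A", "GAC-TA"]
print(encode_alignment(example))
```

The particle has one dimension per alignment column, and with n rows each dimension takes a value between 0 and 2^n - 1.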

Fitness function
The weighted sum of pair method (WSPM) function is used to calculate the fitness of an MSA. In this method, the score of each column is the sum, over every pair of symbols in that column, of the product of the pair weight and the pair score.
The overall alignment score is computed from the column scores through (3), where L denotes the number of columns in the alignment, S the overall MSA score, N the number of sequences, Sl the score of the l-th column, and Wij the weight of the pair of rows i and j. The weight expresses the degree of divergence between the two sequences.
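Equation (3) is not reproduced in this text. A common way of writing the weighted sum-of-pairs score that is consistent with the quantities described above is given below. It is a reconstruction under assumptions, not necessarily the authors' exact formula: L is used here for the number of columns, l for the column index, and A_i(l) for the symbol of aligned sequence A_i in column l.

```latex
\begin{equation}
S = \sum_{l=1}^{L} S_l, \qquad
S_l = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{ij}\,\mathrm{Cost}\bigl(A_i(l), A_j(l)\bigr) \tag{3}
\end{equation}
```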

Wij=(Miss)/(TL)
Here, Miss is the count of mismatched characters in the alignment and TL is the total length of the alignment. Cost(Ai, Aj) is a function used to calculate the alignment score of two aligned sequences Ai and Aj. Here, there are four possible cases.
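The weighted sum-of-pairs fitness can be sketched as follows. The four cases are not listed in this text, so the sketch assumes they are match, mismatch, residue against gap, and gap against gap. The score values, the rule that any differing pair of characters counts towards Miss, and all function names are hypothetical.

```python
from itertools import combinations


def pair_cost(x, y, gap="-", match=2, mismatch=-1, gap_residue=-2, gap_gap=0):
    """Score one aligned pair of symbols; the four cases and values are assumptions."""
    if x == gap and y == gap:
        return gap_gap
    if x == gap or y == gap:
        return gap_residue
    return match if x == y else mismatch


def pair_weight(row_i, row_j):
    """W_ij = Miss / TL, counting every column where the two characters differ."""
    miss = sum(1 for x, y in zip(row_i, row_j) if x != y)
    return miss / len(row_i)


def weighted_sum_of_pairs(rows):
    """Sum over all row pairs of W_ij times the summed column costs of that pair."""
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        weight = pair_weight(rows[i], rows[j])
        total += weight * sum(pair_cost(x, y) for x, y in zip(rows[i], rows[j]))
    return total


print(weighted_sum_of_pairs(["AC", "AG"]))
```

With the weight defined as Miss/TL, two identical rows receive a weight of 0 and contribute nothing to the score, so more divergent pairs carry more influence in the fitness value.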

RESULTS AND DISCUSSION
In this section, the proposed approach is implemented, and the results are examined and compared with a few existing methods. The PSO parameters are set as N=200, Vmax=2 Number of sequences, Vmin=0, w=0.758, c1=2.15, c2=2.22, and r1 and r2 are random values in [0,1]. The performance of the proposed approach is compared with a few existing approaches on different reference datasets. The comparison results are presented in Tables 1 and 2.
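To make the role of these parameters concrete, a single particle update with the reported values of w, c1 and c2 can be sketched as below. This is the standard real-valued PSO step, not the authors' binary variant: the velocity bounds Vmin and Vmax and any rounding to valid column values are deliberately left out, and the function name `pso_step` and its arguments are hypothetical.

```python
import random

W, C1, C2 = 0.758, 2.15, 2.22  # inertia weight and acceleration constants from the text


def pso_step(position, velocity, personal_best, global_best, rng=None):
    """Return (new_position, new_velocity) after one standard PSO update."""
    rng = rng or random.Random()
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # fresh random factors in [0, 1) per dimension
        v_next = W * v + C1 * r1 * (p - x) + C2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


pos, vel = pso_step([1.0, 0.0, 8.0], [0.0, 0.0, 0.0], [1.0, 2.0, 8.0], [1.0, 4.0, 4.0],
                    rng=random.Random(0))
print(pos, vel)
```

A dimension that already agrees with both the personal best and the global best and has zero velocity stays where it is; only dimensions that differ from one of the two attractors are moved.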

Comparative performance of proposed method
In our experiments, we took benchmark datasets from the publicly available Benchmark Alignment dataBASE (BAliBASE) [23], [24]. Sixteen datasets were picked at random from the database: 12 datasets from reference set 2 and the remaining 4 datasets from reference set 3. The results of the existing methods are taken from the RBT-GA literature [11]. The experimental results are also compared with those in [25].

Experimental results in BAliBASE reference 2 datasets
From this reference set, 12 datasets were extracted. These sets differ in the number and length of their sequences. To evaluate its performance, the proposed method is compared with various existing approaches: ML-PIMA, DIALIGN, rubber band technique-genetic algorithm (RBT-GA), hidden Markov model training (HMMT), and PILEUP-8. Table 1 presents the results on the reference 2 datasets. It shows that the proposed method outperformed the existing methods in 7 of the 12 test cases. In the remaining 5 test cases, the proposed approach did not achieve the best solution, but its score was close to the best. Finally, Table 2 shows that the overall performance of the proposed system is better than that of the other existing methods on the reference 3 datasets. The proposed approach scored 0.708, whereas the other approaches averaged less than 0.3. Only one approach, RBT-GA, with a reported score of 0.83, obtained better results than the proposed approach.

Experimental results in BAliBASE reference 3 datasets
Within reference set 3, we considered 4 datasets, 1idy, 1uky, 1r69, and 1ubi, out of 12 datasets. Table 2 presents the experimental results on reference set 3: the proposed approach performed better in 3 test cases, and in 1 test case it did not achieve the best result but came close to it. Overall, Table 2 shows that the proposed approach achieved a better average solution than the existing methods on the reference 3 datasets. In terms of average score, the proposed approach reached 0.514, while RBT-GA reported 0.395 and the next highest, ML-PIMA, reported 0.263; all other approaches reported much lower scores than the proposed approach.

CONCLUSION
This article discussed an improved PSO for solving the multiple sequence alignment problem. Experiments were conducted multiple times to evaluate the performance of the proposed system. In the PSO approach, an initial operator is used for evaluation, and this initial operator was improved for greater efficacy and dominance. The weighted sum-of-pairs function was used in the proposed system for the fitness calculation. The benchmark datasets were taken from the public database BAliBASE, version 2.0, which was also used to demonstrate the efficacy of the RBT-GA algorithm. The better BAliBASE scores were then compared with those of various existing methods. In most of the tests, the proposed system performed best; in a few cases, its performance fell slightly short but remained close to the best solution. The major reason for utilizing the improved initial operator was to improve the efficacy of the proposed system. Therefore, we claim that the proposed system shows better results than the existing approaches.