Adapted branch-and-bound algorithm using SVM with model selection

ABSTRACT


INTRODUCTION
Mixed-Integer Linear Programming (MILP) has countless real-life applications in fields such as logistics, finance, and transportation. A very common solution technique for MILP is Branch-and-Bound, which continues to prove its relevance today. Branch-and-Bound is an iterative algorithm that eventually produces a feasible or optimal solution of the initial problem. Concretely, the algorithm gradually builds a tree of nodes, where each node represents a modified version of the original problem. The construction of child nodes is driven by a variable branching strategy. Another fundamental element of Branch-and-Bound is the node selection strategy, which aims to choose, from a queue of open nodes, the one that will speed up the search.
Recently, several works have tried to identify analytic approaches that decide on the strategies described above from a set of problem features, most often using machine learning techniques. A key observation is that few authors deal with the node selection strategy, and those who do rarely use a machine learning framework.
Our contribution is oriented towards learning efficient branch-and-bound strategies. It is the result of a consistent methodology, beginning with the collection of the data set and ending with the test of the final hypothesis. More explicitly, we:
- define the features;
- collect the data set;
- pick the optimal learning model;
- learn the final hypothesis with the chosen model;
- implement and test the final hypothesis.
Next, we review research papers related to our work. To do so, we divide the contributions into five subsections, according to the strategy and the learning technique used. Firstly, concerning the variable branching strategy, the authors of [1] learned a function that achieves approximately the same performance as strong branching in terms of precision, while yielding a gain in processing time.
In this same category, we cite [2], wherein the authors infer useful information by applying a clause-detection algorithm. A clause is a combination of binary values assigned to a set of indexed variables that, if it occurs, makes the whole problem infeasible. In addition, that algorithm generates minimal clauses and restarts with the active clauses ("those that can be used in fathoming child nodes"). In parallel, this information is used to choose the branching variable with the best effect. References [3] and [4] also use backtracking to improve branching decisions.
Besides, there are classic variable branching rules, such as strong branching, pseudocost branching, hybrid branching, reliability branching, and inference-based branching [5]. Note that reliability branching is known to be the best branching rule, with the reliability parameter set to 4 or 8 [6]. For our experiments, the reliability parameter is fixed so that the calculation of pseudocost values stops after reaching a certain level of the branch-and-bound tree, because pseudocost values remain approximately constant once they have been calculated several times for a given variable.
Secondly, from the node selection strategy literature, in addition to classic node selection strategies such as the depth-first rule, the breadth-first rule, and the best-estimate rule [5], the authors of [1] extracted information from MILP benchmark libraries using a specific algorithm called an oracle. Thirdly, concerning learning algorithms, they are used in various engineering fields, for purposes such as classification, regression, and clustering [7], [8]. Some algorithms tend to perform better in practice than others [9], [10]. In the same context of applying learning to branch-and-bound, ExtraTrees is applied in [11].
Fourthly, regarding model selection and the performance of algorithms, several techniques have been used to tune parameters, such as a fuzzy logic controller for the epsilon parameter of the Ant Colony System (ACS) [12]. Also, [13] and [14] used the Hidden Markov Model (HMM) algorithm to tune the population size and acceleration factors of Particle Swarm Optimization, while the authors of [15] used an HMM to tune its inertia weight parameter. Moreover, [16] used a fuzzy controller to control the Simulated Annealing cooling law, [17] and [18] used HMMs to tune the ACS evaporation parameter and local pheromone decay parameter respectively, and [19] and [20] used HMMs to adapt the Simulated Annealing cooling law. Furthermore, [14] used the SVM algorithm to predict the performance of optimization problems, and the authors of [20] used the Expectation-Maximization algorithm to learn the HMM parameters. Finally, this paper is a continuation of our previous papers on learning branch-and-bound strategies, namely the variable branching strategy and the node selection strategy [21], [22], where the learning algorithm used was the Support Vector Machine (SVM).
The rest of this paper is organized as follows: Section 2 recalls some basics on the branch-and-bound algorithm and the SVM algorithm with parameter tuning. In Section 3, we present our methodology for inferring efficient branch-and-bound strategies, along with the experimental configuration. Section 4 is dedicated to the results. Finally, we conclude and propose some future work.

BRANCH-AND-BOUND AND SVM
In this section, we first give an overview and a formal description of branch-and-bound strategies and present the features used in the algorithm. Secondly, we review the most important advantages of SVM, with a reminder of learning theory.

Branch-and-bound algorithm
Branch-and-bound algorithm is outlined in this section. We first define useful notation and then proceed with the explanation of the algorithm steps. Let us define a general MILP problem $P$ as follows:

$$P:\quad \min \; c^T x \quad \text{s.t.} \quad Ax \le b,\; x \ge 0,\; x_j \in \mathbb{Z} \ \forall j \in I,$$

where $A$ is an $m \times n$ matrix, $c$ is the cost vector, and $I$ is the index set of the integer variables. We also define:
- $P_{LP}$: the relaxed version of $P$, which is $P$ without the integrality constraints;
- $P^k$: the problem at the $k$-th iteration, which corresponds to a node of the branch-and-bound tree;
- $P^k_{LP}$: the relaxed version of $P^k$;
- $z^k$: the objective value of the $k$-th node;
- $x^*(k)$: the incumbent point at iteration $k$, i.e., the vector that yields the best objective value so far.
- $z^*$: the objective function value at $x^*(k)$.
Briefly, the branch-and-bound algorithm, in the case of minimization, proceeds as shown in Algorithm 1. It is an iterative algorithm, and each iteration $k$ comprises at least three steps. First, the node selection step retrieves from the node list a node that maximizes some criterion; this criterion is specific to the node selection strategy. Second, once a node $P^k$ has been picked, we solve its relaxation $P^k_{LP}$ with an algorithm from the linear programming framework, such as the simplex or an interior-point method. Depending on the result, we distinguish three cases. In the first, $P^k_{LP}$ is infeasible or its objective value is greater than $z^*$; the current iteration then terminates. In the second, the solution is integer and its value is lower than $z^*$; we then update the incumbent point and its objective value $z^*$ and move to the next iteration. In the third case, when none of the previous conditions holds, we perform variable branching: we select a variable from the set of non-integer variables according to some defined criterion, and this criterion is defined by the variable branching strategy.
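To make these steps concrete, here is a minimal R sketch of the loop, under the following assumptions: node relaxations are solved with the lpSolve package (which could also solve the MILP directly through its int.vec argument; we expand the tree by hand purely for illustration), node selection is plain FIFO, and branching picks the most fractional variable. These placeholder rules are not the strategies studied in this paper.

    # Minimal branch-and-bound sketch for min c'x s.t. Ax <= b, x >= 0,
    # x_j integer for j in int_idx. Relaxations are solved with lpSolve;
    # FIFO node selection and most-fractional branching are placeholders.
    library(lpSolve)

    branch_and_bound <- function(c_vec, A, b, int_idx, tol = 1e-6) {
      n <- length(c_vec)
      z_star <- Inf; x_star <- NULL
      queue <- list(list(A = A, b = b, dir = rep("<=", nrow(A))))
      while (length(queue) > 0) {
        node <- queue[[1]]; queue <- queue[-1]               # step 1: node selection
        relax <- lp("min", c_vec, node$A, node$dir, node$b)  # step 2: solve relaxation
        if (relax$status != 0 || relax$objval >= z_star) next  # fathom: infeasible/bounded
        x <- relax$solution
        frac <- abs(x[int_idx] - round(x[int_idx]))
        if (all(frac < tol)) {                               # integer feasible: incumbent
          x_star <- x; z_star <- relax$objval; next
        }
        j <- int_idx[which.max(frac)]                        # step 3: variable branching
        row <- rep(0, n); row[j] <- 1
        # child 1: x_j <= floor(x_j); child 2: x_j >= ceiling(x_j)
        queue <- c(queue, list(
          list(A = rbind(node$A, row), b = c(node$b, floor(x[j])),   dir = c(node$dir, "<=")),
          list(A = rbind(node$A, row), b = c(node$b, ceiling(x[j])), dir = c(node$dir, ">="))))
      }
      list(x = x_star, objval = z_star)
    }

    # Example: max 5x1 + 4x2 written as a minimization; optimum is x = (4, 0)
    branch_and_bound(c(-5, -4), rbind(c(6, 4), c(1, 2)), c(24, 6), int_idx = 1:2)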

Support Vector Machine
SVM is among the top ten machine learning algorithms [9]; it is used for both classification and regression. It aims to find the hyperplane with the best margin, the margin being the distance between the hyperplane and the nearest data points, called support vectors; the best margin is proven to be the largest one.

Case of a linear hypothesis set for SVM
In the case of regression, and especially the SVM variant called $\varepsilon$-SVR, we present next the case of a linear hypothesis set. The input is the training data $(x_i, y_i)$, $1 \le i \le N$. The output of the algorithm is a linear function $f(x) = \langle w, x \rangle + b$, with $w$ the coefficient vector, $x$ the input vector, and $b$ a constant.
The distance between a hyperplane of equation $\langle w, x \rangle + b = 0$ and the support vectors is $1/\lVert w \rVert$. Consequently, maximizing the margin is equivalent to the following optimization problem:

$$\min \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad |y_i - (\langle w, x_i \rangle + b)| \le \varepsilon, \; 1 \le i \le N,$$

with $\varepsilon$ being the error tolerance between $y_i$ and $f(x_i)$. This problem may be infeasible; to give it more chance of being feasible, we add slack variables in the following way:

$$\min \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \;\; \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \;\; \xi_i, \xi_i^* \ge 0,$$

where $\xi_i, \xi_i^*$ are the slack variables and $C$ is the cost parameter used to penalize data points outside the margin $\varepsilon$.
By using the Lagrangian function and quadratic optimization, or other resolution methods, one can show that the solution has the form:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b.$$

Case of non linear hypothesis set
In situations where we cannot find a hyperplane fitting all training instances, one may transform the space of the training data into another one, in such a way that the data can be fitted by a hyperplane in the new space. To perform this transformation, one can use the well-known kernel methods. There are different kernels used for SVM in the literature, such as the RBF and polynomial kernels. In the rest of this paper, we present the distance calculation method for the RBF kernel.
Instead of using the standard $L_2$ norm $\lVert \cdot \rVert$, we use the similarity measure associated with the RBF kernel, described as follows:

$$k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2),$$

with $\gamma$ the gamma parameter. Its geometrical interpretation is that, for larger values of gamma, the hypersurface associated with the solution bends more in order to contain, as far as possible, all training data. The resulting target function has the form:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b.$$

In this paper, we use the $\varepsilon$-SVR regression algorithm with the RBF kernel twice, for learning the node selection strategy and the variable branching strategy respectively.
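As a concrete illustration, the short R sketch below fits an $\varepsilon$-SVR with the RBF kernel using the e1071 package adopted later in our experiments; the synthetic data and the cost, gamma, and epsilon values are illustrative only.

    # Sketch: eps-SVR with an RBF kernel via e1071 (synthetic data; the
    # cost, gamma and epsilon values here are illustrative, not tuned).
    library(e1071)
    set.seed(1)
    x <- matrix(runif(200), ncol = 2)                 # 100 points, 2 features
    y <- sin(2 * pi * x[, 1]) + 0.1 * rnorm(100)      # noisy nonlinear target
    fit <- svm(x, y, type = "eps-regression", kernel = "radial",
               cost = 10, gamma = 0.5, epsilon = 0.1)
    head(predict(fit, x))   # f(x) = sum_i (a_i - a_i*) k(x_i, x) + b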

Learning of variable branching strategy and node selection strategy
Concerning the variable branching strategy, we aim in this paper to imitate the behavior of the reliability branching rule. This rule is based on strong branching, which is time consuming: broadly speaking, reliability branching applies strong branching only to variables whose pseudocost values are still unreliable. For this reason, reliability depends on numerous problem features, which can be classified into node-based features and variable-based features.

Node-based features
We use in this category the features below:
- Reduced objective value gain, where $c_j^k$ is the $j$-th component of the cost vector at iteration $k$.
These features aim to capture, for both minimization and maximization problems, how close we are getting to the optimal solution. The other specificity of our work, beyond the changes to the features of [11], is that we add the value of the learned node selection function to the feature set. This last point is justified in the following sub-section. For learning the node selection strategy, we imitate the best-estimate strategy, which is the default one used in the SCIP solver.

Interaction of node selection strategy and variable selection strategy
Intuitively, the choice of a node by the node selection strategy influences the choice of the next branching variable. For this reason, we describe the variable branching strategy function VB formally as a function of a combination of NS (the node selection strategy function) and the other features described above, the combination coefficients being real numbers. Note that the NS features thus enter twice, which adds more precision.

Overfitting and parameter tuning
In this sub-section, we define overfitting, a very common problem in learning techniques that affects the final performance, and explain how parameter tuning mitigates it.

Overfitting
A learning model is, by definition, a pair of a learning algorithm and a hypothesis set. A learning algorithm is an iterative algorithm that searches for the best hypothesis fitting the training data, within the hypothesis set chosen initially. A very common problem encountered in learning is overfitting. This phenomenon occurs when the learned hypothesis does not generalize well beyond the training data. Its causes are the number of data points, noise, and target complexity [7].
The choice of the learning algorithm affects the generalization error through bias and variance. In the case of SVM, a careful choice of the SVM parameters is required to prevent overfitting. The RBF-kernel SVR algorithm used in this work has two parameters, cost and gamma: cost defines how heavily misclassified examples are penalized, and gamma defines how far the influence of a single training example reaches. It is known that a small cost and a small gamma give higher bias and lower variance, while a large cost and a large gamma give lower bias and higher variance. Consequently, we should tune the cost and gamma parameters until we find trade-off values that minimize the generalization error. One way to tune gamma and cost is to use cross-validation.
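This trade-off can be observed empirically. In the following illustrative R sketch (synthetic data, values not tuned), increasing gamma drives the training error toward zero while the 10-fold cross-validation error reported by e1071 (described in the next sub-section) eventually degrades.

    # Illustration of overfitting: as gamma grows, the training MSE shrinks
    # while the 10-fold cross-validation MSE (tot.MSE) eventually increases.
    library(e1071)
    set.seed(1)
    x <- matrix(runif(200), ncol = 2)
    y <- sin(2 * pi * x[, 1]) + 0.1 * rnorm(100)
    for (g in 2^c(-8, -2, 4, 8)) {
      fit <- svm(x, y, type = "eps-regression", kernel = "radial",
                 cost = 10, gamma = g, cross = 10)
      cat(sprintf("gamma = %-8g train MSE = %.4f  CV MSE = %.4f\n",
                  g, mean((predict(fit, x) - y)^2), fit$tot.MSE))
    }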

Cross validation with model selection
Before defining cross-validation, let us first describe validation. To do so, we define some useful notation: $\mathcal{D}$ is the data set, $\mathcal{D}_{train}$ the training set, and $\mathcal{D}_{val}$ the validation set. The goal of validation is to give an estimate of the generalization error. First, the data set of $N$ points is divided into $\mathcal{D}_{train}$ of size $N - K$ and $\mathcal{D}_{val}$ of size $K$; then the target function is learned on $\mathcal{D}_{train}$. Finally, we compute the error of the target function on $\mathcal{D}_{val}$; this error is proven to be an estimate of the generalization error. Figure 1 represents the process described above. The validation error is a reasonable estimate of the generalization error, but it is not very precise. To improve the precision, other techniques build on validation, such as cross-validation. Without loss of generality, we present next the 10-fold cross-validation process.
Let us partition $\mathcal{D}$ into $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_{10}$. We run the validation process ten times with $\mathcal{D}_{val} = \mathcal{D}_i$, where $1 \le i \le 10$, and $\mathcal{D}_{train} = \mathcal{D} \setminus \mathcal{D}_i$. As output, we obtain ten errors $E_1$ to $E_{10}$, and we compute the cross-validation error $E_{cv}$ as the mean of the validation errors. The cross-validation error is more precise than the validation error; we summarize this process in Figure 2. Now that we have presented cross-validation, let us turn to model selection, which is used in this paper to tune the gamma and cost parameters. For $m \times n$ different combinations of cost and gamma, we denote a couple $(C_i, \gamma_j)$ with $1 \le i \le m$ and $1 \le j \le n$. As shown in Figure 3, cross-validation is executed multiple times with different parameter configurations. As a result, we obtain errors $E_{1,1}$ to $E_{m,n}$, and in the end we keep the configuration with the lowest error.
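This grid-search procedure maps directly onto e1071's tune.svm. The sketch below applies it with the (cost, gamma) ranges used in our experiments; here x and y stand for the collected feature matrix and target scores.

    # Sketch: 10-fold cross-validation model selection over the
    # (cost, gamma) grid; x and y denote features and target scores.
    library(e1071)
    tuned <- tune.svm(x, y, type = "eps-regression", kernel = "radial",
                      cost = 10^(-4:5), gamma = 2^(-8:1),
                      tunecontrol = tune.control(cross = 10))
    tuned$best.parameters       # the (gamma, cost) pair with the lowest CV error
    model <- tuned$best.model   # eps-SVR refitted with the best configuration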

RESEARCH METHOD
In this section, we outline, step by step, the methodology for learning the node selection strategy NS and the variable branching strategy VB using parameter tuning. Then, we present the experimental configuration.

Collecting Datasets
We use the MIPLIB2010 library instances, to which we apply the branch-and-bound algorithm equipped with the reliability branching rule and the best-estimate selection rule; we then extract the values of the features described before. Note that the best-estimate selection rule is the default one in various optimization tools, such as SCIP. The pseudo-code of the data collection step is shown in Algorithm 2, and an informal sketch of the collection loop is given below.
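For illustration only, the collection loop has the following shape in R; the helper solve_with_scip and the record layout are hypothetical stand-ins for our actual SCIP event handlers, which are written in C.

    # Hypothetical sketch of the data-collection loop of Algorithm 2; the
    # helper solve_with_scip and the record layout are illustrative, not a
    # real API. Every node processed under reliability branching and the
    # best-estimate rule contributes one row of features plus the scores
    # assigned by the classic rules we imitate.
    dataset <- do.call(rbind, lapply(instances, function(inst) {
      run <- solve_with_scip(inst, branching = "reliability",
                             nodeselection = "bestestimate",
                             node_limit = 500, time_limit = 600)
      run$node_records   # one row per node: features, NS score, VB score
    }))
    write.csv(dataset, "bnb_training_data.csv", row.names = FALSE)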

Learning NS and VB
First, we divide the collected data into two sets, one used for training and validation, the other for testing. By applying the cross-validation-based model selection described above twice, we first learn the score function NS of every node in the node queue, so that the node having the best score NS(n) is chosen during branch-and-bound. Second, we learn the score function VB of variable branching selection, as sketched below.
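A condensed sketch of this two-stage procedure follows, assuming the collected records have been split into node-level data (ns_x, ns_y) and branching-level data (vb_x, vb_y); all variable names are illustrative.

    # Two-stage learning sketch (variable names illustrative): learn NS
    # first, then feed its prediction in as an extra feature for VB.
    library(e1071)
    grid <- list(cost = 10^(-4:5), gamma = 2^(-8:1))
    ns_fit <- tune.svm(ns_x, ns_y, type = "eps-regression", kernel = "radial",
                       cost = grid$cost, gamma = grid$gamma)$best.model
    # vb_node_x holds, for each branching record, the node features of the
    # node where the branching occurred (hypothetical layout)
    vb_x_aug <- cbind(vb_x, ns_score = predict(ns_fit, vb_node_x))
    vb_fit <- tune.svm(vb_x_aug, vb_y, type = "eps-regression", kernel = "radial",
                       cost = grid$cost, gamma = grid$gamma)$best.model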

Experiments
In this sub-section, we present the test environment used to experiment with our methodology, and we give the pseudo-codes used for the tests.

Experimentation configuration
We used SCIP 3.2.1, since it is regarded as the best among open-source and free tools [23]. Moreover, for the SVM algorithm and model selection, we used the package e1071 [23] of the language R 3.2.5, known to be among the most performant implementations of SVM algorithms [13].
The cost range used is $\{10^{-4}, 10^{-3}, \ldots, 10^{5}\}$ and the gamma range is $\{2^{-8}, 2^{-7}, \ldots, 2^{1}\}$. These ranges cover both very small and very large values of cost and gamma. The OS used is Debian 7, 32 bits, with 8 GB of RAM and a 2.40 GHz Intel processor. For MILP instances, we use the benchmark set of MIPLIB2010 [28] to collect data, validate it, and test the resulting models. The instances used for training and validation are listed in Figure 4; they have up to tens of thousands of rows and columns, and their full description is available in [28]. The validation set contains approximately one fifth of the instances mentioned above [7]. Finally, the node limit is fixed to five hundred nodes and the running time limit to six hundred seconds.

Pseudo-codes
During the solving process, the algorithm as implemented in SCIP executes several pieces of event code tied to specific events. Next, we describe these events and give the pseudo-code relative to each one. The main modified events are, respectively:

Node selection event
This event occurs when the algorithm is in the phase of selecting the next node to solve. The selection criterion is determined by the implemented strategy. Note that this event code is also executed for the root node selection. In this event, we implement the learned node selection score function NS: the handler calculates the score of each node in the node list and returns the node with the maximum score. The pseudo-code is given in Algorithm 3.

For each leaf node n in the node queue: calculate NS(n). Return the node n with the maximum score NS(n).
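A compact R rendering of Algorithm 3 is given below, assuming each open node carries its feature vector and ns_model is the learned $\varepsilon$-SVR; the actual implementation is a SCIP event handler in C.

    # Sketch of the node-selection event: score each open node with the
    # learned NS model and return the one with the maximum score (this
    # only mirrors Algorithm 3; the real code is a SCIP event handler).
    select_node <- function(open_nodes, ns_model) {
      scores <- vapply(open_nodes,
                       function(node) predict(ns_model, rbind(node$features)),
                       numeric(1))
      open_nodes[[which.max(scores)]]
    }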
In this test, we used ten instances from MIPLIB2010, of three different types. The first is mixed integer programs (MIP), which combine integer and continuous variables. The second is mixed binary programs (MBP), which include both continuous and binary variables. The final one is binary programs (BP), which contain exclusively binary variables.
These results show that our approaches give an equivalent, if not better, dual bound compared to the standard branch-and-bound in 80% of cases, the exceptions being the opm2-z7-s2 and ran16x16 instances. Another important result is that our last approach gives an equivalent or better running time than our previous approach in 80% of cases. Also, when compared to the standard branch-and-bound algorithm ruled by reliability branching and the best-estimate rule, our approach gives better or equivalent results in about half of the instances. We noticed an empirical relation between the dual bound performance and the number of constraints of the problem on the one hand, and between the running-time performance and the number of variables on the other hand. To concretize these points, we plot them in Figure 5. The left-hand figure shows that instances with fewer than 5000 constraints gave a better dual bound for our approaches compared to standard branch-and-bound; as a matter of fact, the opm2-z7-s2 instance, represented by the isolated point in the bottom-right corner, has approximately 31000 variables. The right-hand figure shows that instances with more than 2500 variables saw an improved running time when solved by our approaches, especially the ns1208400, ns1688347, and rail507 instances.

CONCLUSION
In this paper, we added parameter tuning to infer a better configuration of SVM. More precisely, we used the $\varepsilon$-SVR regression learning algorithm, known for its high accuracy, to learn the node selection and variable branching strategies of the branch-and-bound algorithm. These choices led to better results compared to the reliability branching rule and the best-estimate selection rule, which are known to be among the best in the literature. As perspectives, we will work on eliminating noise in the data and on comparing our approach with other learning algorithms available in the literature.