Identification of individualization techniques for criminal records in sanction lists

ABSTRACT


INTRODUCTION
Currently, relevant information on any subject of study exists in a large number of media formats, which can be parameterized and indexed for specialized use in public or private databases [1]. It may be expressed in natural language, captured in images, or stored in any medium that facilitates its use and disclosure. Access to up-to-date information went from being in the hands of a few companies and media outlets to being available to anyone with an electronic device connected to the Internet, causing sources of information to proliferate, both reliable ones and those of dubious origin.
Given this vast amount of information of all kinds, companies have found it necessary to classify and individualize it [2], whether to comply with the regulations that govern them or to improve internal processes such as the selection of personnel and associated companies, among other objectives. One of the main needs of companies at the legal level is the search for criminal records of the natural and legal persons with whom they have some employment relationship, in order to minimize the risk of being used in money laundering and terrorist financing operations (LAFT, by its acronym in Spanish). Under this premise, they must be able to validate information relevant to a particular individual in specialized databases, as well as in constantly updated and less standardized sources such as press documents, articles, national and international bulletins, and other online sources.
Systems used for this purpose usually apply text recognition algorithms, comparing their output against proprietary databases that contain dictionaries of terms similar to those used in news about criminal activities [3][4]. The premise that arises is to complement these character recognition algorithms with artificial intelligence techniques that not only identify these patterns, but also help assess the most reliable sources over time and deliver results according to what is required. At the same time, visual information about individuals is compared using facial recognition techniques, expanding or narrowing the range of results in cases where the information in written media is incorrect.

CONCEPTUAL FRAMEWORK
Definitions
Text-search algorithms: Text-search algorithms are techniques used to find the occurrences of a character pattern in a given text, where both pattern and text are combinations of elements of a defined alphabet [5].
Artificial intelligence algorithms for search personalization: Personalized search algorithms use the interests of users to produce fast and relevant search results. Input parameters for these algorithms include the user profile, analysis of hyperlinks, analysis of page content, and ratings from collaborative searches [6]. The objective of these algorithms is to assign a relevance weight to the results of the user's query. [7] uses classification and weighting algorithms based on the content of the pages selected in the user's query, while [8] performs the same task using automatic learning techniques, referred to in many sources by the anglicism "machine learning", which can be classified as artificial intelligence work aimed at the autonomous analysis of data flows.
Facial recognition: Facial recognition is the use of algorithms that take images or models of a face as input parameters, process them, and generate as a result a matching identity from a database of individuals. As detailed by [9], depending on what is to be analyzed, images or models, algorithms with different characteristics, each taking various biometric aspects into account, can be used, and each offers different benefits in terms of speed and effectiveness.

Analysis
Below, a comparative study between algorithms of the categories presented in the definitions section of this article is given, taking into account indicators that measure the effectiveness of each one with respect to the particularities of its category.

a. Text-search algorithms
Brute force search: The aim of brute force search is to make a character-by-character comparison between the pattern P[0...m−1] and the text T[s...s+m−1] for every shift s ∈ {0, ..., n−m}. The algorithm returns all valid matches. However, as [10] points out, the problem with this approach is effectiveness, since the complexity of the algorithm is the worst possible, being of order O(m × n).
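As an illustration, the brute force comparison described above can be sketched in Python (the function name and example strings are illustrative, not taken from the referenced works):

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text.

    Every shift s is tried and compared character by character, which
    gives the worst-case O(m * n) behaviour noted above.
    """
    n, m = len(text), len(pattern)
    matches = []
    for s in range(n - m + 1):        # try every alignment of the pattern
        i = 0
        while i < m and text[s + i] == pattern[i]:   # char-by-char check
            i += 1
        if i == m:                    # full match at shift s
            matches.append(s)
    return matches
```

For example, searching for "abra" in "abracadabra" yields the shifts 0 and 7.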
Knuth-Morris-Pratt algorithm: The KMP algorithm is composed of two phases: a preprocessing of the pattern, in which a failure table based on the partial matches of a brute force search is built, and the search itself. Using this table, the algorithm advances through the text, not character by character as in the brute force search, but by the amounts described in the table. The complexity of the algorithm is of order O(n + m), where O(m) corresponds to the preprocessing of the pattern and O(n) to the subsequent search.
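A minimal Python sketch of the two KMP phases follows (names are illustrative):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: O(m) pattern preprocessing plus an O(n) scan."""
    m, n = len(pattern), len(text)
    if m == 0:
        return list(range(n + 1))
    # Preprocessing: failure[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    failure = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # Search: on a mismatch, fall back through the failure table instead
    # of restarting the comparison at the next character of the text.
    matches, k = [], 0
    for i in range(n):
        while k > 0 and text[i] != pattern[k]:
            k = failure[k - 1]
        if text[i] == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = failure[k - 1]
    return matches
```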
Boyer-Moore algorithm: As described by [10], the idea behind the Boyer-Moore algorithm is to perform a process analogous to the KMP algorithm, but carrying out the comparison from right to left, which allows for larger jumps through the main text: if the text character aligned with the last position of the pattern does not occur in the pattern, the next m characters can be discarded, where m is the length of the pattern. The complexity of this algorithm is sub-linear, that is, O(n/m).
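The full Boyer-Moore algorithm uses two shift heuristics; the sketch below implements only the bad-character rule, i.e. the simplified Horspool variant, which is enough to show the right-to-left scan and the large jumps described above (names and examples are illustrative):

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: the bad-character rule of Boyer-Moore.

    The window is compared right to left; on a mismatch, the shift for
    the last character of the window allows skipping up to m characters.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table: distance from each character's last occurrence in
    # pattern[:-1] to the end of the pattern; unseen characters shift m.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    matches, s = [], 0
    while s <= n - m:
        i = m - 1
        while i >= 0 and text[s + i] == pattern[i]:   # right-to-left scan
            i -= 1
        if i < 0:
            matches.append(s)
            s += 1
        else:
            s += shift.get(text[s + m - 1], m)
    return matches
```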

b. Comparison between algorithms
Based on the complexity orders of the algorithms analyzed, it is evident that the greatest effectiveness corresponds to Boyer-Moore, as shown in Table 1. In tests conducted by [10] using an alphanumeric alphabet and several randomly generated strings, the following execution speeds in milliseconds were observed, corroborating the expected efficiency of each algorithm, as shown in Table 2.

Artificial intelligence algorithms for search personalization
Self-organizing maps (SOM). According to [11], the strength of this neural model is its ability to form feature maps in a way similar to what happens in the brain. The algorithm uses reinforced competitive learning, distinguishing a training stage and an exploitation stage.
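The training and exploitation stages of this competitive-learning model can be sketched minimally; the one-dimensional map over scalar inputs, the parameters, and the function names below are illustrative assumptions, not the configuration used in [11]:

```python
import random

def train_som(data, n_units=4, epochs=200, lr=0.5, seed=0):
    """Training stage of a minimal 1-D self-organizing map.

    For each sample, the best matching unit (BMU) is the unit whose
    weight is closest to the input (competitive step); the BMU and its
    immediate neighbours are then pulled toward the input.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)          # decaying learning rate
        for x in data:
            bmu = min(range(n_units), key=lambda j: abs(weights[j] - x))
            for j in (bmu - 1, bmu, bmu + 1):     # neighbourhood radius 1
                if 0 <= j < n_units:
                    weights[j] += rate * (x - weights[j])
    return weights

def som_map(weights, x):
    """Exploitation stage: map an input to its best matching unit."""
    return min(range(len(weights)), key=lambda j: abs(weights[j] - x))
```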
Naive Bayes (NB). Based on Bayes' conditional probability theorem (1763), this algorithm assumes independence between the predictor attributes. It calculates the conditional probability of each combination of attribute values with respect to the objective; this probability yields the likelihood of each target class once the value of each input variable for the instance is given.
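A minimal sketch of this idea for text classification, using a multinomial model with Laplace smoothing (the function names, labels, and example documents are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count statistics for multinomial Naive Bayes.

    docs: list of (token_list, label).  Conditional independence of the
    tokens given the label is assumed, as described above.
    """
    word_counts = defaultdict(Counter)   # label -> token frequencies
    class_counts = Counter()             # label -> number of documents
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify_nb(model, tokens):
    """Return the label maximizing log P(label) + sum log P(token|label)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            # Laplace smoothing: +1 avoids zero probabilities
            score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```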
Decision trees (C4.5). Using the inductive learning methodology, the decision tree algorithm classifies from a set of training data. In each execution of the algorithm, each node is evaluated and the best one is chosen as a decision parameter.

K-nearest neighbors (KNN). Known as a lazy learning method [12], classification by neighborhood searches a set of prototypes for the k prototypes closest to the one to be classified. A metric is specified to measure proximity; the Euclidean distance is normally used for computational reasons.

Support vector machines (SVM). As [13] explains, a support vector machine learns the surface separating two different classes of input points. As a one-class classifier, the description given by the support vector data can form a decision boundary around the domain of the learning data with very little or no knowledge of the data outside this boundary. The data are mapped by means of a Gaussian kernel, or another type of kernel, to a feature space of higher dimension, where the maximum separation between classes is sought. This boundary function, when brought back into input space, can separate the data into their different classes, each forming a grouping.
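Of the classifiers described above, KNN admits the most compact sketch; a minimal version using the Euclidean distance mentioned above (prototypes, labels, and names are illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance, the usual proximity metric for KNN."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(prototypes, query, k=3):
    """Classify `query` by majority vote among the k nearest prototypes.

    prototypes: list of (feature_vector, label) pairs.  Nothing is
    learned in advance, which is why the method is called lazy.
    """
    nearest = sorted(prototypes, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```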
Comparison between algorithms. To compare the different algorithms, we take the results obtained in previous work [11], in which a set of articles from news websites from 6 different sources in English was evaluated. The articles were preprocessed to eliminate recurring terms of the language, uppercase and lowercase were normalized, the words within each document vector were given a weight or importance, and the KEEL simulation tool was used to obtain the accuracy percentage of each algorithm on the news collection, as shown in Table 3.

Because individualization will be made taking into account only the images contained in sanction-list databases and in web press articles, only methods of image analysis will be considered, and those based on 3D models are discarded. The following two facial recognition techniques are proposed:

Principal component analysis (PCA). This facial recognition technique applies an initial processing to a face image to convert the matrix of pixels into a set of vectors, which are then projected into a space of smaller dimension. These values are compared with those stored in a database of facial information, taking a tolerance value into account.
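The PCA pipeline just described can be sketched with NumPy on flattened face vectors (the eigenface approach); the tiny 4-pixel "faces", gallery, and function names below are illustrative assumptions, not data from the referenced works:

```python
import numpy as np

def pca_fit(faces, n_components=2):
    """Compute the mean face and principal components of the gallery.

    faces: (n_samples, n_pixels) array of flattened face images.  The
    components are obtained via SVD of the mean-centered data.
    """
    mean = faces.mean(axis=0)
    _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_project(mean, components, face):
    """Project a face into the reduced space."""
    return components @ (face - mean)

def pca_identify(mean, components, gallery, labels, probe, tolerance=None):
    """Match a probe face to the closest gallery projection, optionally
    rejecting matches beyond a tolerance value, as described above."""
    probe_vec = pca_project(mean, components, probe)
    dists = [np.linalg.norm(pca_project(mean, components, g) - probe_vec)
             for g in gallery]
    best = int(np.argmin(dists))
    if tolerance is not None and dists[best] > tolerance:
        return None          # no identity within the tolerance
    return labels[best]
```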
Locality preserving projections (LPP). This algorithm, known as LPP, performs the same reduction of the initial data that PCA performs, but it applies an additional process whose result is that almost identical values are obtained in the small projected space when dealing with the face of the same person in consecutive images taken from the same video source. This additional processing may lower computation speed, but the accuracy of the results is greater.

e. Comparison between algorithms
To verify the success rate and speed with which both methods positively identify individuals when the algorithms are fed with face images, we take the results obtained by [9], who developed the software tests taking into account the return values of instantaneous results and accumulated results, that is, those that required more processing time before generating a positive result, as shown in Table 4.

f. Background
As can be observed in the theoretical framework and in the articles referenced, the algorithms intended for a more efficient individualization process have already been widely developed by various experts in computer science, so this work is based on the compilation of results from previous works rather than on new development. On the other hand, despite the age of some of the algorithms discussed in this article, they are still widely used today because they have proven their effectiveness over time; for example, the KMP algorithm for searching text patterns is still used in current browsers when the user wants to search for text in a web document. The optimization of processes described in this article would

RESULTS
From the analysis of the complexities of the text-search algorithms, as well as the tests proposed and developed by [10], the result is that the Boyer-Moore algorithm is the most recommended for carrying out searches, being superior to the others in terms of execution speed. In the analysis of the results of the personalization algorithms, we appreciate that SVM presents a consistency superior to the other algorithms, always exhibiting behavior above the average regardless of the number of terms in the sample, and showing no negative fluctuations in the cases with a greater number of terms.
When comparing the facial recognition algorithms proposed and evaluated by [9], it is concluded that the method generating the most positive results was locality preserving projections, reaching more than 85% of identifications, so it would be the most recommended one when implementing a system. With the analysis of the results and the selection of the best algorithmic methods, the incorporation of different methodologies is considered as future work in order to optimize the individualization process, leading to ideal results with a lower error coefficient and fewer false positives. Artificial intelligence algorithms make it possible not only to improve search results in databases and in documents from news websites, but also to classify the different sources in order to give priority to those with greater relevance in the individualization of subjects.

CONCLUSION
In this paper, three categories of algorithms to be used in the process of implementing an individualization system for criminal record searches were examined. The algorithmic categories examined here were text search, artificial intelligence for search personalization, and facial recognition. They were compared using the metrics proposed in previous works, such as those of Hernández, Gou, and Betancour, in order to obtain the best techniques from each category. Finally, it was found that the most recommendable algorithms for use in an individualization system are Boyer-Moore for text search, support vector machines for artificial intelligence, and locality preserving projections for facial recognition.