Matching data detection for the integration system

The purpose of data integration is to combine the multiple sources of heterogeneous data available on the internet, such as text, images, and video. After integration, the data volume becomes large, so the data must be analyzed to support efficient query execution. However, entity resolution remains a problem, and different techniques are needed to analyze and verify data quality in order to achieve good data management. When entity resolution is applied within a single database, it is called deduplication. To address these problems, this article proposes a method to calculate the similarity between potential duplicate records. The solution is based on graph technology, which narrows the search space for similar records. A composite mechanism is then used to locate the most similar records in the database, improving data quality so that good decisions can be made from heterogeneous sources.


INTRODUCTION
Big data resembles ordinary data but comes in far larger amounts and with a higher level of complexity, which makes it very difficult to manage with conventional database management tools [1]. Big data is characterized by a set of properties, including volume, veracity, variety, and velocity. Volume represents the size of the data, which can reach terabytes or more. Velocity denotes how fast the data arrives. Variety means that data can be in structured, semi-structured, or unstructured formats. Today, data has become an asset for companies and management departments and contributes to their development. Decisions based on low-quality data can be very costly, hurting businesses, partners, and customers. Furthermore, management departments and companies need to improve their relationships through data governance. Hence, good data quality is very important for companies, especially when they interact with other organizations or make major decisions.
Existing methods concentrate on the structure of the data to be cleaned or integrated in order to define metrics and procedures for solving data quality issues. Thus, to obtain useful data, it must be analyzed in the context of its intended use [2]. Integration projects may also require support to improve data quality, because few companies carry out data quality management procedures on the databases or data warehouses they have created.
Currently, entity resolution is an active research topic in the field of data quality [3]-[5]. Online mining of relationships and entities has produced extensive public knowledge bases, but companies, governments, and researchers can realize the true value of this data only when multiple data sources are integrated. Entity resolution refers to the task of identifying records that refer to the same entity in one or more data sources [6]-[9]. When only one database is involved, this task is called deduplication [10]. Comparing every pair of records that might match in an entity collection is a quadratic problem, so exhaustive comparison is impractical for big data. Therefore, similar records are generally grouped into blocks, and only records that appear in the same block are compared, which improves the efficiency of the entity resolution algorithm. However, data from multiple sources usually follows different standards, so pattern matching must be performed between the data sources before the entities are analyzed. Traditional pattern matching is no longer effective for large amounts of highly heterogeneous and noisy data on the network. Several techniques are currently used to estimate the likelihood that records match. These include sorted neighborhood blocking (SN) [11], the duplicate count strategy (DCS) [12], and q-gram based indexing (Q-gram). These are entity resolution techniques. They process all records in the dataset, group each record into one or more blocks, and finally create the pairs for comparison. As a result, these methods take a long time to identify possible pairs. The probability of error is also high, because the search space grows with the amount of data. Consequently, to minimize these problems, a technique is needed that reduces search time while still achieving good results.
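The blocking idea described above can be illustrated with a minimal sketch of sorted neighborhood pairing. This is not the implementation from [11]; the records, the `blocking_key` function, and the window size are hypothetical choices made only for illustration.

```python
from itertools import combinations

# Hypothetical records (id, company name); not taken from the paper's dataset.
records = [
    (1, "1 MOBILE LIMITED"),
    (2, "1 MOBILE Ltd"),
    (3, "ACME TRADING"),
    (4, "ACME TRADNG"),
    (5, "ZENITH FOODS"),
]


def blocking_key(name):
    """Assumed sorting key: the lowercased name with whitespace removed."""
    return "".join(name.lower().split())


def sorted_neighborhood_pairs(recs, window=2):
    """Sort records by key, then pair each record with the next window-1 records."""
    ordered = sorted(recs, key=lambda r: blocking_key(r[1]))
    pairs = set()
    for i, (rid, _) in enumerate(ordered):
        for other_id, _ in ordered[i + 1 : i + window]:
            pairs.add((min(rid, other_id), max(rid, other_id)))
    return pairs


# Exhaustive comparison grows quadratically with the number of records.
all_pairs = set(combinations([r[0] for r in records], 2))
# The sliding window only compares records that are neighbors after sorting.
candidates = sorted_neighborhood_pairs(records, window=2)

print(len(all_pairs), "exhaustive pairs vs", len(candidates), "candidate pairs")
```

With a small window the number of comparisons grows roughly linearly with the number of records, but every record still has to be sorted and scanned, and true duplicates whose keys sort far apart are missed. This is the trade-off that motivates looking for a more selective search space.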
The main focus of this study is on exploiting graph mechanisms to address the variety of big data. A technique is discussed for identifying matching data that results from integrating different types of sources, such as databases, comma-separated values (CSV) files, spreadsheets, web services, and other formats. Duplicate data causes many problems for decision making. For example, several records may represent the same entity with minor differences, such as names written in a different order, errors in names or other information, or different abbreviations in each source database, so that each duplicate is treated as a new record. The goal is therefore to convert the data into a graph in which the search space for duplicate records is reduced and records that share data are identified, and then to apply algorithms that compute a match ratio between the records that share information.
This paper is organized in five parts. First, existing approaches to data integration based on deduplication are reviewed. Second, the problems of data integration are described. Third, the proposed method is presented. Fourth, the proposed method is evaluated. Finally, conclusions and ideas for future extensions are discussed.

RELATED WORK
Many studies have addressed the challenges of data integration that must be overcome. Yeganeh et al. [13] discussed the problem of taking user preferences for data quality into account in several settings in order to improve user satisfaction with query results. Improving query results helps the users' decision-making process and should lead to higher user satisfaction, which makes the study of data quality interdisciplinary. Different synergies have been proposed to provide comprehensive data quality solutions [14]. Several researchers have worked on grouping the dimensions of data quality into conceptual views of data, data values, and data formats. Similarly, Bovee et al. [15] and Jarke et al. [16] recommended classifying the data quality dimensions according to the user's role in the data warehouse environment. They also proposed different dimensions and sub-dimensions based on the concept of data quality as applicability. The problem of describing the quality of data sources is at the core of data integration and exchange [17], [18]. Talburt [19] discussed entity resolution and information quality, covering the Fellegi-Sunter theory of record linkage, the Stanford entity resolution framework, and the algebraic model for entity resolution, which are the main theoretical models that support entity resolution. Ways of eliminating redundant data and supporting master data management programs have also been discussed through the concept of entity resolution, using the Oyster open-source system [20].
To address the entity resolution problem, a method for distributing the workload among different computing nodes has been proposed [21]. The approach is based on MapReduce with standard blocking. Benny et al. [22] proposed similar work for entity resolution on Hadoop, which handles semi-structured data. It also covers the preprocessing tasks and the results of comparison, indexing, and classification. The use of different classification methods and their results is described in order to improve future comparisons. Papadakis et al. [23] proposed two techniques. The first aims to remove superfluous comparisons from any redundancy-based blocking method, and the second reduces the space requirements, which are mapped to the Cartesian space.
Yan et al. [24] developed a methodology called multi-singer that supports structured and unstructured data types, as well as tasks such as data preprocessing and comparison reduction. The objective of the work in [1] is to apply deduplication techniques to different merged data sources in order to compare formats or find duplicate elements of the same type. Various techniques for preprocessing and testing large amounts of data are proposed in [25]. That system matches entities assigned to the same block and uses dynamic blocking to achieve high performance, reducing the search space while covering the same entities in the blocking steps.

PROBLEM FORMULATION
Data inconsistency can be caused by heterogeneous data sources, which means that more tools and techniques for handling unstructured data are needed. Structured data, in contrast, allows queries to be run to filter, analyze, and use the data to make business decisions and build organizational capabilities [26]. Organizations face a further challenge when they need to expand to accommodate a wide range of data and create new domains, so a solution must be found. This involves creating high-performance computing environments with advanced data storage systems. Such environments also reduce latency while improving reliability and speeding up access to data. Big data entities pose many challenges when large data sets from different sources accumulate. Ways must then be found to combine these data sets and to execute queries and algorithms across them.
The wave of big data will soon spread to all areas of life, which will not only provide unprecedented opportunities but also bring major challenges. It is therefore necessary to deal effectively with the variety and heterogeneity of data in big data integration [27]. The quality of information has become a key technical requirement for users. To evaluate and enhance data quality, it is essential to implement continuous improvement strategies. By combining data, duplicates can be identified and acted upon, for example by merging two similar or identical records into one. This also helps identify non-duplicates, which is equally important, since two similar things are not necessarily the same. Overall, deduplication is a process for removing duplicate data from one or more databases that contain information of poor quality, and it is a very important technique for achieving good data quality. It can also be a difficult problem, because the same "entity" can be referred to by different names, and these names can contain typos, so matching ordinary strings is not enough.
For example, given a dataset consisting of many records, finding the records that represent the same entity is difficult to do in a simple way. Because each record may use a different lexical representation, direct string matching will not find the duplicates. For instance, the records (1 MOBILE LIMITED, 30 CITY RD) and (1 MOBILE Ltd, 30 CITY ROAD) represent the same entity but use different wording. Therefore, this paper proposes a processing technique that uses Spark's rich API.
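The example above can be checked with a short sketch that compares the two records first by exact equality and then by a token-overlap score. Jaccard similarity is used here only as a simple stand-in to show that a graded measure detects the overlap; it is not the similarity measure used by the proposed method, and the helper names are hypothetical.

```python
def token_set(text):
    """Lowercase and split on whitespace; no abbreviation expansion is attempted."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of distinct tokens that two strings have in common."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# The two records from the example in the text: (company name, address).
rec_a = ("1 MOBILE LIMITED", "30 CITY RD")
rec_b = ("1 MOBILE Ltd", "30 CITY ROAD")

# Direct string matching treats the records as different entities.
exact_match = rec_a == rec_b

# A token-based score shows that most of the content is shared.
name_sim = jaccard(rec_a[0], rec_b[0])
addr_sim = jaccard(rec_a[1], rec_b[1])

print(exact_match, name_sim, addr_sim)
```

Exact comparison returns a yes or no answer and misses the duplicate, whereas a graded score leaves room for a threshold to decide which pairs are close enough to be treated as the same entity.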

PROPOSED METHOD
The focus of this article is on exploiting graphs to address the variety of big data. In particular, an approach is described that represents the data from multiple sources as a graph (entity, edge), as shown in Figure 1. Consider a database R = {r1, r2, r3, ..., rn}. Each record is composed of attributes A = {R.A1, R.A2, ..., R.Am}. Also consider the function G(ri), which maps the record ri ∈ R to the graph, and the function S(ri, rj), which measures similarity. If S(ra, rb) is close to 1, record a is similar to record b.
The mapping G(ri) is a graph-based method that converts relational data into a graph. In simple terms, a graph represents data with vertices and edges. Here, the relational data is rearranged in the graph so that entities share common nodes. As with most technologies, there are several alternative approaches to forming the essential components of a graph database. One of them is the property graph model, in which data is divided into nodes, relations, and attributes (data stored in the nodes or relations). Nodes represent the instances or entities of the graph and can hold any information as attributes. An edge represents the link between two entities, and it always has a direction, a start node, an end node, and a type. The architecture of the technique is shown in Figure 2 and is divided into four steps.
a) Select the file for deduplication: The first phase consists of loading the source files to be deduplicated.
b) Create graph: In the entity graph creation stage, the first step is blocking, which takes a set of records as input and creates two types of entities. The first type belongs to the master data, and the second type belongs to the other data that is to be matched.
c) Detect the potential duplicate entities: The next step, perhaps the most important and also the most expensive, is to find possible matching records. The goal is to find records that resemble each other without requiring them to be identical in every field. The connection conditions are very specific: GraphFrames motifs are used to reduce the search space and find candidate duplicates. The conditions are broad enough to capture variation between records, yet selective enough to avoid Cartesian products, which should be avoided at all costs. Possible pairs of duplicates are found through motif finding queries.
d) Compute the similarity between potential duplicates: This step measures the similarity between potential duplicates after computing the vector representation of the graph entities. In this approach, term frequency-inverse document frequency (TF-IDF) is used. This weight is a statistical measure that evaluates the importance of a word to a document in a collection or corpus. The resulting vectors are then used to calculate the similarity between two documents by finding the cosine of the angle between them.
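The four steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the Spark and GraphFrames implementation: the records are hypothetical, tokens are assumed to be the shared nodes of the graph, the candidate search only mimics the shape of a motif query (two records pointing to the same node), and the smoothed IDF formula is one common variant chosen for the example.

```python
import math
from collections import Counter, defaultdict

# Step a) Hypothetical master (a) and transaction (b) records; ids and text are illustrative.
master = {"a1": "1 mobile limited 30 city rd", "a2": "acme trading 5 high street"}
transactions = {"b1": "1 mobile ltd 30 city road", "b2": "zenith foods 9 park lane"}
docs = {**master, **transactions}


def tokens(text):
    return text.lower().split()


# Step b) Graph creation: every token is a shared node, and each record id
# is linked to the token nodes it contains (assumed graph layout).
token_to_records = defaultdict(set)
for rid, text in docs.items():
    for tok in tokens(text):
        token_to_records[tok].add(rid)

# Step c) Motif-like search: a master record and a transaction record that
# point to the same token node form a candidate pair. No Cartesian product is built.
candidates = set()
for rids in token_to_records.values():
    for a in (r for r in rids if r in master):
        for b in (r for r in rids if r in transactions):
            candidates.add((a, b))

# Step d) TF-IDF vectors and cosine similarity for the candidate pairs only.
n_docs = len(docs)
doc_freq = Counter(tok for text in docs.values() for tok in set(tokens(text)))


def tfidf(text):
    """Term frequency times a smoothed IDF (the smoothing is an assumption)."""
    tf = Counter(tokens(text))
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


scores = {(a, b): cosine(tfidf(master[a]), tfidf(transactions[b])) for a, b in candidates}
print(scores)
```

In this toy input only one of the four possible master and transaction pairs shares a node, so only that pair is scored. The same principle is what keeps the expensive similarity computation confined to a small part of the search space.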

EXPERIMENTS AND RESULTS
The method proposed in the previous section is used mainly to compute similarity in order to find duplicate records of the same entity. In this section, it is evaluated on sample input data and the results are shown. Table 1 shows a partial view of the data set used, which holds the addresses of some companies in a CSV file.
Table 2 details three datasets that contain the same address in different formats. This type of duplicate is not resolved by the basic techniques (such as sorted neighborhood blocking, the duplicate count strategy, or q-gram based indexing). All datasets contain the same address, but the company name is represented in a different format in each tuple for the same company. In the other columns, some information is missing or in a different format. As shown in Figure 3, the data set is divided into two sets. The first, called the master data, contains the correct data, and each record is assigned a unique identifier named aid. The second, called the transaction data, holds the records that may refer to the same companies, identified by bid. In this example, a deterministic similarity is computed from feature vectors calculated by the term frequency-inverse document frequency (TF-IDF) method (afeatures and bfeatures, shown in Figure 4). The result shown in Figure 3 is then used to draw the chart in Figure 4, which plots the similarity between aid and bid in order to detect the most similar tuples in the input dataset. Figure 4 presents a comparison between the transaction data and the master data when the similarity is greater than or equal to 0.4, which allows the seven most compatible records to be matched. In Figure 4, each color identifies a record in the master data, and the x-axis identifies a record of the transaction data. The advantages of this approach are its ease of implementation and its low computational complexity: because the data is structured as a graph, duplicates can be detected across any of the source databases.
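The thresholding step can be sketched as follows. The similarity scores below are made-up values used only to show the mechanics and are not results from the experiment; the 0.4 threshold is the one stated in the text, while keeping only the best master record per transaction record is an assumption of this sketch.

```python
# Made-up similarity scores keyed by (aid, bid); NOT results from the paper.
scored = {
    ("a1", "b1"): 0.55,
    ("a1", "b2"): 0.12,
    ("a2", "b2"): 0.40,
    ("a3", "b3"): 0.39,
    ("a2", "b1"): 0.45,
}

THRESHOLD = 0.4  # threshold stated in the text


def best_matches(scores, threshold):
    """Keep pairs at or above the threshold, then the best aid for each bid."""
    best = {}
    for (aid, bid), sim in scores.items():
        if sim >= threshold and (bid not in best or sim > best[bid][1]):
            best[bid] = (aid, sim)
    return best


matches = best_matches(scored, THRESHOLD)
for bid, (aid, sim) in sorted(matches.items()):
    print(f"{bid} matches {aid} with similarity {sim:.2f}")
```

A pair exactly at the threshold is kept because the comparison is inclusive, and a transaction record with no score at or above the threshold is left unmatched rather than forced onto a master record.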

CONCLUSION AND FURTHER RESEARCH
In this paper, a method for solving the deduplication problem is presented using the Apache Spark framework and the Scala language. Existing data integration was analyzed: when data in different formats is combined into one format, duplicates are likely to arise, as discussed. The proposed method computes similarity in order to detect potential duplicate data. The work can be extended with further improvements that reduce the number of comparisons between records and the computation time, such as parallel computation. In future research, a method of big data integration based on Karma modeling will be explored.