A performance of comparative study for semi-structured web data extraction model

ABSTRACT


INTRODUCTION
Knowledge Discovery (KD) is the extraction of information from large databases. Data mining, meanwhile, is the process of extracting useful and relevant information from a database. It allows users to analyze data from different views and categorize it before drawing conclusions about the relationships within the data. The extraction and analysis of web pages is an active research area in data mining and web mining. The Internet has made the World Wide Web the main pool for collecting and distributing information to users.
Internet World Stats reports that more than 4 billion people around the world now use the internet (Stats, 2018). The latest data show that Asia is the region with the largest number of users, contributing more than 2 billion. Figure 1 shows the world internet usage and population statistics. The number of internet users in the early part of 2018 was 4,156,932,140 (Stats, 2018). This contributes directly to the tremendous growth of data on the World Wide Web.
Web data extraction is a technology that has developed over the past decade and continues to face new challenges. It has been discussed from different perspectives, and it draws on scientific methods from various disciplines [1]. Laender, Ribeiro-Neto [2] proposed a taxonomy of data extraction approaches for wrapper generation. Figure 2 shows the classification proposed by these researchers. This classification is helpful for understanding the existing approaches to web data extraction. Languages for wrapper development: TSIMMIS [3][4][5], WebOQL [6], Lorel [7] and Minerva [8] are some of the techniques that employ a language for wrapper development. These approaches address the problem of constructing wrappers, and various languages were designed for that purpose. General-purpose languages that programmers use to develop wrappers include Perl and Python.
HTML-aware: these approaches depend on the structural features of web pages to perform both wrapper generation and data extraction. The process is performed automatically, without manual labour. XWRAP, W4F and RoadRunner [9] are examples of this technique. Natural Language Processing (NLP): these techniques are used to mine facts from free text, where a fact is expressed by entities and the relationships between entities [10]. RAPIER [11], WHISK [12] and SRV [13] are some of the NLP-based approaches. Wrapper induction: these approaches rely on features such as formatting to define the structure of the data found, and the extraction rules are generated from a training set. SoftMealy [14] and IEPAD [15] are examples of this category.
Modelling-based: these approaches try to find sections of web pages that conform to pre-defined structures. NoDoSE [16] and DEByE [2] are techniques in this category. Ontology-based: this category differs from the previous techniques because it relies directly on the data. Notable techniques for this approach are ODE [17] and DIADEM [18].

WEIDJ MODEL
In certain kinds of web pages, the content is commonly built dynamically according to a specific template. A typical example is image information, where the image details always have the same structure and differ only in the content, which is usually loaded from a database. Such data are rendered in a similar way. Former works [19][20][21][22][23] have discussed how to extract data from web pages. Figure 3 shows the basic concepts of the data extraction process. In the preliminary step, users need to know what type of data they want to extract: text, image, video or others. Then, they must decide which data need to be extracted. This selection must be made early because each type of data has its own source and relevant methods. Once the type of data has been selected, the next step is to abstract the selected data and transform it into a tabular format using a specific approach or method. Users first need to understand the proof of concept for web data extraction before developing a wrapper.
Figure 3. Basic data extraction process
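The basic process (select the type of data, extract it, transform it into a tabular format) can be illustrated with a minimal sketch. It assumes that the selected data type is images and that the page is already available as an HTML string. The names `ImageCollector`, `extract_images_as_table` and `SAMPLE_PAGE` are hypothetical and are not part of WEIDJ.

```python
import csv
import io
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src and declared dimensions of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        self.rows.append({
            "src": attr.get("src", ""),
            "width": attr.get("width", ""),
            "height": attr.get("height", ""),
        })


def extract_images_as_table(html):
    """Extract image details and return them as rows and as CSV text."""
    collector = ImageCollector()
    collector.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["src", "width", "height"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(collector.rows)
    return collector.rows, buffer.getvalue()


# Hypothetical template-generated page: same structure, different content.
SAMPLE_PAGE = """
<html><body>
  <div class="item"><img src="/img/panda.jpg" width="640" height="480"></div>
  <div class="item"><img src="/img/tiger.jpg" width="800" height="600"></div>
</body></html>
"""

rows, table = extract_images_as_table(SAMPLE_PAGE)
print(table)
```

Because both items follow the same template, one rule (here, "every img tag") is enough to retrieve all of them; only the content of each row differs.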

Problem formulation
As part of the input for the extraction, we suppose that the user has a number of structured web sources, each of which represents a set of web pages that describe image objects. These objects can be seen as relational tuples formed by three atomic type values: link, image and size. We assume a set of entity (atomic) types, where each such type represents an atomic piece of information. We continue by defining the typing formalism, by which any user can specify what data should be targeted and extracted from web pages. We then describe the extraction problem.

Types and object description
WEIDJ allows users to describe the atomic types of objects. As building blocks for describing data, we assume a set of entity (atomic) types, where each such type represents an atomic piece of information, expressed as images.
An instance of an entity type is any image that exists. A type is defined straightforwardly in a top-down manner and can be viewed as a tree structure whose internal nodes denote the use of a complex type constructor. For example, image objects could be specified as a tuple type composed of four entity types: path of image, size of image, date of extraction and time of extraction. The first two entity types would be associated with predefined recognizers (for image paths and image sizes), since this kind of information has easily recognizable representation patterns, while the last ones would have instance recognizers.
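A tuple type built from entity types and their recognizers can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the field names and the helper `is_image_object` are hypothetical, and for simplicity every entity type, including date and time, is given a pattern-based recognizer.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class EntityType:
    """A leaf of the type tree: a name plus a recognizer for its instances."""
    name: str
    recognizer: Callable[[str], bool]


# Hypothetical representation patterns for each entity type.
IMAGE_PATH = re.compile(r"^\S+\.(?:jpe?g|png|gif|bmp|webp)$", re.IGNORECASE)
IMAGE_SIZE = re.compile(r"^\d+\s*[xX]\s*\d+$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TIME = re.compile(r"^\d{2}:\d{2}:\d{2}$")

# The tuple type is the internal node; the entity types are its leaves.
IMAGE_OBJECT: Tuple[EntityType, ...] = (
    EntityType("path", lambda v: bool(IMAGE_PATH.match(v))),
    EntityType("size", lambda v: bool(IMAGE_SIZE.match(v))),
    EntityType("date", lambda v: bool(DATE.match(v))),
    EntityType("time", lambda v: bool(TIME.match(v))),
)


def is_image_object(candidate: Dict[str, str]) -> bool:
    """True when every entity type of the tuple recognizes its value."""
    return all(
        entity.name in candidate and entity.recognizer(candidate[entity.name])
        for entity in IMAGE_OBJECT
    )


good = {"path": "/img/panda.jpg", "size": "640x480",
        "date": "2018-03-01", "time": "10:15:00"}
bad = dict(good, path="notes.txt")
print(is_image_object(good), is_image_object(bad))
```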

The extraction problem
For a given WEIDJ description and a given source, a template with respect to that description and that source describes how instances of the description can be extracted from the pages of the source. For each set type appearing in the description, the template defines a separator string; it denotes that consecutive instances of the element type are separated by this string. For each tuple type $t = \{t_1, \dots, t_k\}$, the template defines a total order over the collection of types and a sequence of $k + 1$ separator strings; this denotes that the instances of the types forming $t$, in the specified order, are delimited by these separators.
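The role of separator strings in a template can be sketched as follows, assuming that a set type is split on one separator and that a tuple of k fields, read in a fixed order, is delimited by k + 1 separators. The functions `split_set` and `parse_tuple`, the sample markup and the chosen separators are hypothetical illustrations, not the WEIDJ implementation.

```python
def split_set(text, separator):
    """Split a set-type region into its consecutive instances."""
    return [chunk for chunk in text.split(separator) if chunk.strip()]


def parse_tuple(text, field_order, separators):
    """Read one tuple instance: k ordered fields delimited by k + 1 separators."""
    if len(separators) != len(field_order) + 1:
        raise ValueError("a tuple of k fields needs k + 1 separators")
    position = text.find(separators[0])
    if position < 0:
        return None
    position += len(separators[0])
    record = {}
    for name, closing in zip(field_order, separators[1:]):
        end = text.find(closing, position)
        if end < 0:
            return None  # the instance does not match the template
        record[name] = text[position:end]
        position = end + len(closing)
    return record


# Hypothetical page fragment generated from one template.
FRAGMENT = (
    '<li><a href="/img/panda.jpg">640x480</a></li>'
    '<li><a href="/img/tiger.jpg">800x600</a></li>'
)

records = [
    parse_tuple(chunk, ["path", "size"], ['<a href="', '">', "</a>"])
    for chunk in split_set(FRAGMENT, "</li>")
]
print(records)
```

Here the set separator is the closing list-item tag, and the two fields (path, size) are delimited by three separators, matching the k + 1 rule.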
The extraction problem can be described as follows. For a given input consisting of a WEIDJ description and a set of sources:
1. set up type recognizers for all the entity types in the description,
2. for each source,
a. find and annotate entity type instances or images in the pages,
b. infer a template for the source based on the sample,
c. use the template to extract all the images from the source,
d. select the images to be stored in a single multimedia database.
A web page $w$ is represented as a triple $w = (b, S, R)$. The finite set of blocks is represented as $b = \{b_1, b_2, \dots, b_n\}$. These blocks do not overlap. Each block can be recursively viewed as a sub web page associated with a sub-structure derived from the whole page structure. The finite set of separators, horizontal and vertical, is represented as $S = \{s_1, s_2, \dots, s_m\}$. Each separator has a weight indicating its visibility, and every separator in the same $S$ has the same weight.
$R$ is the relationship between every two blocks in $b$. It can be expressed as $R = b \times b \to S \cup \{\text{NULL}\}$. For example, suppose $b_i$ and $b_j$ are two blocks in $b$; then $R(b_i, b_j) \neq \text{NULL}$ indicates that $b_i$ and $b_j$ are adjacent and separated by the separator $R(b_i, b_j)$. Figure 4 shows the layout structure from the visual segmentation of the WWF web page.
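The page representation described above (blocks, separators and the relationship between blocks) can be sketched as a small data structure. This is a minimal sketch under assumptions: the relationship is stored as a mapping from pairs of block names to the separator between them, and it is assumed to yield `None` when two blocks are not adjacent. The class names and the sample layout are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Separator:
    orientation: str  # "horizontal" or "vertical"
    weight: int       # visibility of the separator


@dataclass
class Block:
    name: str
    sub_page: Optional["Page"] = None  # a block can be viewed as a sub web page


@dataclass
class Page:
    blocks: List[Block]
    separators: List[Separator]
    # Assumed encoding of the relationship: (block, block) -> separator.
    relation: Dict[Tuple[str, str], Separator] = field(default_factory=dict)

    def separator_between(self, first: str, second: str) -> Optional[Separator]:
        """Return the separator between two blocks, or None if not adjacent."""
        if (first, second) in self.relation:
            return self.relation[(first, second)]
        return self.relation.get((second, first))

    def has_uniform_weight(self) -> bool:
        """Check that every separator of this page has the same weight."""
        return len({s.weight for s in self.separators}) <= 1


# Hypothetical three-block layout: header, content and footer.
top = Separator("horizontal", 5)
bottom = Separator("horizontal", 5)
page = Page(
    blocks=[Block("header"), Block("content"), Block("footer")],
    separators=[top, bottom],
    relation={("header", "content"): top, ("content", "footer"): bottom},
)
print(page.separator_between("content", "header"))
print(page.separator_between("header", "footer"))
```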

RESULTS AND DISCUSSIONS
A key element is the path of the required HTML element, which allows the information to be located and extracted. This section discusses the data extraction experiments for multiple uniform resource locators (URLs). This experiment differs from data extraction on the surface web because, when working with multiple URLs, the user needs to input several web URLs that contain various information to be extracted from web pages with different structures. Figure 5 shows the interface for multiple URLs.
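The idea of locating an image through the path of its HTML element, for several input URLs, can be sketched as follows. The sketch assumes that the pages have already been downloaded, so the URLs and markup in `PAGES` are hypothetical in-memory samples, and the path is simply the chain of ancestor tag names. `ImagePathFinder` and `image_paths_for_pages` are illustrative names.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ImagePathFinder(HTMLParser):
    """Records the ancestor path of every <img> element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open ancestor tags
        self.paths = []   # (element path, src) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            self.paths.append(("/".join(self.stack + ["img"]), src))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def image_paths_for_pages(pages):
    """Map each input URL to the element paths of the images it contains."""
    result = {}
    for url, html in pages.items():
        finder = ImagePathFinder()
        finder.feed(html)
        result[url] = finder.paths
    return result


# Hypothetical input: two URLs whose pages have different structures.
PAGES = {
    "https://example.org/a":
        "<html><body><div><p><img src='a1.jpg'></p></div></body></html>",
    "https://example.org/b":
        "<html><body><table><tr><td><img src='b1.png'></td></tr></table></body></html>",
}

paths = image_paths_for_pages(PAGES)
for url, found in paths.items():
    print(url, found)
```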
The experimental work described here uses the same approach as for the surface web. However, it involves more processing and is more time consuming because the extraction process recursively traverses all of an element's parents for each web page. Table 1 and Table 2 describe the benchmark web addresses used to test image extraction from multiple web pages. Table 1 consists of three groups of URLs: group A, B and C. Each group has several web pages, as shown in Table 2.
Figure 6 shows the results of web data extraction from multiple sources of web pages using DOM, WHDJ and our proposed model, WEIDJ. The experiment was run on the benchmark for the three groups of URLs listed in Table 2. Extraction time refers to the loading time of the extraction process from the first image to the last image, and it is measured in seconds. The retrieval time of WEIDJ is significantly lower than that of the other methods on every data set.
Table 3 (a, b) summarizes the results of extracting images with four wrapper models: DOM, WHDJ, WEIDJ and WEIDJ no-rules. As can be seen from Table 3 (a, b), the number of images found is similar on average, but the numbers of images retrieved and filtered by each model differ considerably. The DOM approach retrieves more images than the other models because its filtering rules are minimal. In addition, the DOM approach does not consider redundancy: it retrieves the same filename again for each image that exists in more than one of the input web addresses. With page load time, also known as extraction time, used as the primary benchmark to compare the performance of each model, the findings show that WHDJ is better than DOM because it uses a JSON approach to transform data, and JSON has been shown to reduce the time required. Although WHDJ is better than DOM, data still need to be extracted at higher speed.
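The redundancy issue and the JSON transformation discussed above can be illustrated with a short sketch. It assumes that the image sources of each input URL have already been collected, that duplicates are identified by filename, and that the time is measured only around the in-memory processing rather than the page load. The function names and the sample URLs are hypothetical.

```python
import json
import posixpath
import time
from urllib.parse import urljoin, urlparse


def merge_without_duplicates(images_by_url):
    """Keep one record per image filename across all input URLs."""
    seen = set()
    records = []
    for page_url, sources in images_by_url.items():
        for src in sources:
            absolute = urljoin(page_url, src)
            filename = posixpath.basename(urlparse(absolute).path)
            if filename in seen:
                continue  # same filename already retrieved from another page
            seen.add(filename)
            records.append(
                {"page": page_url, "link": absolute, "filename": filename}
            )
    return records


def timed_extraction(images_by_url):
    """Return the JSON payload and the elapsed time in seconds."""
    start = time.perf_counter()
    records = merge_without_duplicates(images_by_url)
    payload = json.dumps(records, indent=2)
    elapsed = time.perf_counter() - start
    return payload, elapsed


# Hypothetical input: logo.png appears on both pages.
IMAGES_BY_URL = {
    "https://example.org/a/": ["img/logo.png", "img/panda.jpg"],
    "https://example.org/b/": ["/static/logo.png", "tiger.jpg"],
}

payload, elapsed = timed_extraction(IMAGES_BY_URL)
print(payload)
print(f"extraction time: {elapsed:.6f} s")
```

A plain DOM traversal without the `seen` check would return four records here, including the logo twice.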
We therefore propose AJAX technology to ensure that the entire web page does not have to be reloaded each time the user requests a change during web data extraction.
Table 4 shows the summary of extraction time in percentage. The percentages show large differences between the methods, especially for the DOM approach. JSON is well known as a fast method [24], which is the main reason why JSON extracts information faster than the DOM method. Although JSON performs well in extracting data, the extracted information is less useful because this method retrieves the same image file more than once. For that reason, WEIDJ is proposed in this experiment. WEIDJ is good at extracting beneficial information, and its extraction time is reduced. WEIDJ and WEIDJ no-rules are the same method, which applies a combination of DOM and JSON; the difference is that WEIDJ no-rules does not filter noisy images. It retrieves any image, just like DOM and JSON.
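The difference between extraction with and without rules, and one possible way of expressing extraction time as a percentage, can be sketched as follows. Both parts are assumptions for illustration only: the text does not specify which rule identifies a noisy image (a minimum size is assumed here) or how the percentages in Table 4 are computed (a saving relative to a baseline is assumed here). All values are made-up sample data, not experimental results.

```python
def is_noisy(image, min_width=100, min_height=100):
    """Hypothetical rule: very small images (icons, spacers) count as noise."""
    return image["width"] < min_width or image["height"] < min_height


def extract(images, apply_rules=True):
    """With rules, noisy images are filtered out; without rules, all are kept."""
    if not apply_rules:
        return list(images)
    return [image for image in images if not is_noisy(image)]


def time_saving_percent(baseline_seconds, model_seconds):
    """Assumed metric: time saved relative to a baseline, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_seconds - model_seconds) / baseline_seconds * 100.0


# Made-up candidates; the 16x16 icon is the noisy image.
CANDIDATES = [
    {"filename": "panda.jpg", "width": 640, "height": 480},
    {"filename": "icon.gif", "width": 16, "height": 16},
    {"filename": "tiger.jpg", "width": 800, "height": 600},
]

with_rules = extract(CANDIDATES)
no_rules = extract(CANDIDATES, apply_rules=False)
print(len(with_rules), len(no_rules))
print(time_saving_percent(8.0, 2.0))
```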

CONCLUSION
We have proposed a semi-structured web data extraction model, WEIDJ, which extracts data according to a simple predefined data extraction rule. It consists of several parameters: the link of the image, which locates the data retrieved from the web page, the image, the size of the image and the extraction time for each image. WEIDJ works not only with a single URL but also with multiple sources of web pages. The current study limits its scope to extracting images from the surface web for multiple URLs. Future research may consider image extraction from the deep web for multiple URLs.