Transformation of WSDL files using ETL in the E-orientation domain

ABSTRACT


INTRODUCTION
Choosing the right career can be difficult for students because they must weigh many factors to make the choices that will shape their academic and professional lives. E-orientation systems [1] address this need, which also motivates the standardization of a generic platform that meets students' different expectations. Our ultimate goal is to build an instantiable meta-model underlying existing E-orientation platforms [2], using the model driven architecture (MDA) [3]. To reach this goal, a comparative study of these platforms is indispensable because of their number, their variety and the different approaches used to implement them.
In this work we propose a comparative and descriptive study of the existing E-orientation platforms. First, we generate a descriptive file for each feature of these platforms using the web services description language (WSDL). We then process these files using ETL technology, which is based on three steps. The first step targets the data and applies the necessary filters. The second step performs several tasks that guarantee the reliability, consolidation and correction of the data, which eliminates ambiguity in redundant data and completes missing values. The last step, which is indispensable, is responsible for feeding and loading the data. In the end, we obtain generic files that encompass all the attributes.

Description of existing E-orientation platforms
E-orientation platforms allow high school students who wish to continue into higher education to find information, register, submit their course preferences, etc. They aim to help as many candidates as possible make the right choice by giving them access to the training that interests them. We consider the following examples:
-Orientation.com [4]: a directory of diplomas, studies and occupations, organized by level of study. The site offers sections dedicated to work-study programs and to career prospects, and provides numerous practical fact sheets.
-9rayti.com [5]: the leading platform for orientation and education in Morocco, followed by more than 3 million users per year. It is published by Education Media Company.
-Onisep.com [6]: a public institution under the Ministry of National Education, Higher Education and Research. As a public publisher, Onisep produces and publishes information on training and occupations. It also offers services to students, parents and educational teams.
-Orientation-chabab.com [7]: a guide for high school graduates, high school students and Moroccan students on access to universities and to private and public higher schools in Morocco.
Having described the platforms, we now compare their general characteristics, their strengths and their weaknesses.

Comparison of existing E-orientation platforms
Table 1 illustrates the common features of the selected E-orientation platforms (Orientation, 9rayti, Onisep, Orientation-chabab) and shows the advantages and disadvantages of each of them [8]. The next step is to generate a descriptive file for each feature.

Generation of descriptive files (in WSDL) for each feature
We chose the Axis platform [9] (integrated in Eclipse [10]). It provides a Java plugin called Java2WSDL, which generates a WSDL [11] file for a Web service from a Java class. We therefore implemented all the features of the four aforementioned E-orientation platforms as multiple Java classes containing attributes and functions, and with this tool the WSDL files for each feature are generated automatically [12]. As an illustration, Figure 1 shows an example of a WSDL file generated for the Establishment feature of the 9rayti platform. In this way we obtained all the WSDL files of the selected E-orientation platforms. In the next section these files are processed using ETL technology.
Figure 1. Extract from the WSDL file of the Establishment functionality

PROCESSING WSDL FILES USING ETL
This section deals with the processing of the descriptive files of the E-orientation platforms, i.e. instead of having several WSDL files for the same functionality, we will have a single standard WSDL file enriched with all possible attributes. This processing will later be applied to all the functionalities of the existing platforms.

ETL tools
The main goal of ETL (Extract, Transform, and Load) processes is to facilitate the movement and transformation of data. ETL is a collection of programs which adds significant value to data [13]. The ETL process is composed of the following:
-Reformatting: put the different resources (files, database records, etc.) into a standard format
-Conciliation: during extraction, ETL should detect duplicate data and perform a merge to eliminate inconsistencies
-Cleaning: at the analysis step, the ETL process keeps only relevant data and deletes inconsistent data
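As a rough illustration of the three operations listed above, the following Python sketch applies them to a handful of in-memory records. The record fields (`name`, `city`), the sample values and the function names are assumptions made for this example only; they do not come from the platforms studied here, and the sketch is not the tool used later in this paper.

```python
# Hypothetical records from two sources; field names and values are illustrative only.
source_a = [{"Name": "ENSA", "City": "Tangier "}, {"Name": "FST", "City": ""}]
source_b = [{"name": "ensa", "city": "tangier"}, {"name": "EST", "city": "Fez"}]


def reformat(record):
    """Reformatting: lower-case the keys, trim and lower-case the values."""
    return {key.lower(): value.strip().lower() for key, value in record.items()}


def reconcile(records):
    """Conciliation: merge records that share a name, keeping the first non-empty value."""
    merged = {}
    for record in records:
        target = merged.setdefault(record["name"], {})
        for key, value in record.items():
            if value and not target.get(key):
                target[key] = value
    return list(merged.values())


def clean(records, required=("name", "city")):
    """Cleaning: keep only the records in which every required field is filled."""
    return [r for r in records if all(r.get(field) for field in required)]


def run_pipeline(*sources):
    """Chain the three operations over any number of sources."""
    records = [reformat(r) for source in sources for r in source]
    return clean(reconcile(records))


result = run_pipeline(source_a, source_b)
```

In this sketch the two spellings of the same establishment are merged into one record, and the record with no city is discarded by the cleaning step.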

Design of the extraction programs
There are two main methods for extracting data in ETL. The first consists of creating a workspace that is an image of the original data source, which allows data analysts to query the data with the possibility of recovery; its disadvantage is the space needed to store the backup data. The second method consists of dividing the data source into subsets and working directly on the original data. This method has the advantage of giving all data analysts the same view of the data, but concurrent access to the data must be well managed.

Design of the transformation programs
The main part of the ETL process is the transformation, in which several issues must be addressed, such as:
-Erroneous identifiers
-Erroneous data
-Multiple data sources
-Irrelevant data
-Synonyms and homonyms
-Embedded process logic
The main steps used in transformation are:
-Apply a standard to rename data in the same category
-Merge duplicate data into a single relevant entry
-Translate data into the same language for later comparison

Design of the load programs
After the two previous steps, the data is ready to be loaded, which is the final step of the ETL process. At this stage, data can be added to a database table via a script, for example in SQL [14], or through a DBMS utility; in either case, special attention should be paid to integrity and indexing [15]. The ETL process was traditionally developed by programmers, which meant many hours of effort, a risk of major issues, and no guarantee that the code could be reused. With the emergence of ETL tools, many of these tasks have been eliminated: the tools streamline every step of the ETL process and are available to suit different technical and commercial requirements [16]. Table 2 illustrates a comparative study of the features of the extraction tools, in particular Apatar OS Data Integration, CloverETL Engine, KETL, Pentaho, Scriptella and Talend.
Comparing these ETL tools, we conclude that Pentaho [18] and Talend [19] compare favorably with the other tools and offer a wide variety of products. Case studies from MySQL and many other companies show that Pentaho can handle small- to large-scale systems [17]. In the next part we present in more detail the product we chose, by describing its features. As mentioned before, we wanted an open source tool, and after studying the best known tools on the market we opted for Pentaho, which we found much easier to handle.

Processing WSDL files using Pentaho

Pentaho
Pentaho Data Integration (PDI), formerly known as Kettle, is an open source ETL tool that makes it possible to design and run data manipulation and transformation operations [20]. Pentaho Data Integration can also be used for other purposes:
-Transferring data between applications or databases
-Exporting data from databases to flat files
-Loading data massively into databases [21]
Putting Pentaho Data Integration to work is straightforward. Each process is built with a graphical tool in which users specify what to do without writing code to describe how to do it; for this reason, Pentaho Data Integration can be described as metadata oriented. Pentaho Data Integration can be used as a standalone application or as part of the Pentaho Suite. As an Extract, Transform and Load tool, it is the most widespread open source application available. It supports a vast array of input and output formats, including text files, data sheets, and a range of database engines. Furthermore, the transformation capabilities of Pentaho Data Integration allow data to be manipulated with very few limitations.
This component is one of the best and most highly regarded ETL solutions on the market [22]. Its long history, strength and robustness make it a highly recommended tool. It supports transformations and works in a very simple and intuitive way. Similarly, data integration projects are very easy to manage. Figure 2 shows the Pentaho Data Integration interface and workflow.

Design analysis
Our process is divided into two steps:
-The first step is to standardize the WSDL file for each feature of the existing E-orientation platforms so that it respects a generic template in terms of structure and nomenclature. This step is repeated on each platform for the same functionality in order to generate "standard" WSDL files.
-In the second step, the WSDL files of the same functionality across platforms are used as input and are processed via XSLT [24], which contains the transformation rules mentioned below. At the end of the process we obtain a single standard WSDL file for this feature, and we repeat the same operation for the rest of the features.
Following this analysis, we need to establish a dictionary that unifies the names of the functionalities and attributes in order to avoid synonyms. Table 3 describes the management rules used throughout our processing.
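To make the role of the dictionary concrete, the following Python sketch renames schema elements in a WSDL document according to a synonym dictionary. It is only an illustration under stated assumptions: the dictionary entries, the sample WSDL fragment and the decision to lower-case names are hypothetical, and Python's standard `xml.etree.ElementTree` module stands in for the XSLT rules actually used in this work.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# Hypothetical synonym dictionary: variant name -> standard name.
DICTIONARY = {"school": "establishment", "institution": "establishment", "town": "city"}


def standardize(wsdl_text, dictionary=DICTIONARY):
    """Rename every named schema element to its standard name."""
    root = ET.fromstring(wsdl_text)
    for element in root.iter(f"{{{XSD}}}element"):
        name = element.get("name")
        if name is not None:
            key = name.lower()  # assumption: names are compared case-insensitively
            element.set("name", dictionary.get(key, key))
    return root


# Hypothetical WSDL fragment, reduced to its schema elements.
sample = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types><xsd:schema>
    <xsd:element name="School" type="xsd:string"/>
    <xsd:element name="Town" type="xsd:string"/>
    <xsd:element name="phone" type="xsd:string"/>
  </xsd:schema></types>
</definitions>"""

names = [e.get("name") for e in standardize(sample).iter(f"{{{XSD}}}element")]
```

Names that are absent from the dictionary are kept (in lower case), so attributes specific to a single platform survive the standardization.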

Case study
In this section, we show through a case study the transformation of the WSDL files of E-orientation platforms via ETL. The extraction, transformation and loading of the data are carried out as follows. We create a transformation with Pentaho Data Integration (Spoon) in three steps: the first step extracts data from the WSDL (XML-based) files of the same functionality; the second step applies an XSL transformation from an appropriate file containing the transformation rules mentioned above; the last step saves the result in a new WSDL document created on the basis of the contents of the XML file [25].
The phases of the ETL process are as follows:
-Extraction phase: As explained above, the first phase consists of acquiring scattered data. Pentaho can connect to the WSDL files of the same functionality and read their data selectively (this is the point of the step), filtering the data as it is read so as to extract only the relevant information.
-Transformation phase: Data transformation is the main phase of our process. To process the WSDL documents, an XSL transformation is applied. Our XSL file checks that the data is consistent with the existing data in the WSDL files, assigns the same value to elements that have the same meaning, and adds elements that are not common (those that belong to at least one platform).
-Integration phase (data loading): Once the processing ends, an integration phase is carried out. It sends the resulting data, via the Pentaho Kettle data integration engine, to a WSDL document in the same format as the input data.
Figure 3 illustrates the transformation of WSDL files retrieved from an existing E-orientation platform in order to standardize them. Figure 4 shows an excerpt from the XSL file that uses the dictionary above to standardize a WSDL file of the same functionality. The transformation of WSDL files of the same functionality can be done using PDI. Figure 5 gives an overview of the transformation design in PDI-Kettle. Once the transformation is done, the WSDL file for this functionality is generated automatically.
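The three phases above can be sketched in a few lines of Python. This is a simplified illustration, not the PDI transformation or the XSL file used in this work: the helper names, the sample element names and the rule that the first declaration of a name wins are all assumptions made for the example, and the output keeps only the schema elements of a WSDL document.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"
WSDL = "http://schemas.xmlsoap.org/wsdl/"


def extract(wsdl_text):
    """Extraction: read only the (name, type) pairs of the named schema elements."""
    root = ET.fromstring(wsdl_text)
    return [(e.get("name"), e.get("type"))
            for e in root.iter(f"{{{XSD}}}element") if e.get("name")]


def transform(extracted_lists):
    """Transformation: union of all elements; the first declaration of a name wins."""
    merged = {}
    for pairs in extracted_lists:
        for name, type_ in pairs:
            merged.setdefault(name, type_)
    return merged


def load(merged):
    """Integration: write the merged elements into a new WSDL-shaped document."""
    ET.register_namespace("wsdl", WSDL)
    ET.register_namespace("xsd", XSD)
    root = ET.Element(f"{{{WSDL}}}definitions")
    types = ET.SubElement(root, f"{{{WSDL}}}types")
    schema = ET.SubElement(types, f"{{{XSD}}}schema")
    for name, type_ in merged.items():
        attributes = {"name": name}
        if type_:
            attributes["type"] = type_
        ET.SubElement(schema, f"{{{XSD}}}element", attributes)
    return ET.tostring(root, encoding="unicode")


def make_wsdl(*element_names):
    """Build a hypothetical, minimal WSDL text for the example."""
    body = "".join(f'<xsd:element name="{n}" type="xsd:string"/>' for n in element_names)
    return (f'<definitions xmlns="{WSDL}" xmlns:xsd="{XSD}">'
            f"<types><xsd:schema>{body}</xsd:schema></types></definitions>")


platform_a = make_wsdl("name", "city")   # same feature, first platform
platform_b = make_wsdl("name", "phone")  # same feature, second platform
merged = transform([extract(platform_a), extract(platform_b)])
standard_wsdl = load(merged)
```

Here the common element `name` appears once in the result, while `city` and `phone`, each present on only one platform, are both carried over into the single output document.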

CONCLUSION AND FUTURE WORK
In this article we have described and compared the existing E-orientation platforms in the Moroccan and French communities. We then generated a descriptive WSDL file for each feature using the Axis platform (integrated in Eclipse). We presented several ETL solutions, namely Apatar OS Data Integration, CloverETL Engine, KETL, Pentaho Data Integration, Scriptella and Talend Open Studio, and compared their features. To process our WSDL files we chose Pentaho Data Integration. This processing is done in two steps: the first step makes the WSDL files of each feature "standard", and the second step processes these files via XSLT. At the end of this work, we obtained generic files that will serve as a basis for the expected meta-model. In future work we will model these WSDL files in UML and propose a general and enriched meta-model that brings together several functionalities of the E-orientation platforms using the model driven architecture (MDA).