Framework for efficient transformation for complex medical data for improving analytical capability

ABSTRACT


INTRODUCTION
With the increasing adoption of mobile networks and ubiquitous computing, there has been a tremendous increase in applications and services that give users extensive access to data [1,2]. Cloud computing makes this process easier through its different types of standard services [3]. The benefit of these technologies is a greater degree of synchronicity among the devices that capture information, the protocols used to process it, and the network devices that route the data to defined servers [4,5]. However, handling such massive and evolving data is extremely challenging. The proposed study considers a case study of the healthcare sector, in which data is gathered autonomously from different analog, digital, and hybrid devices and fetched to a hub, where it is either stored in a physical unit or forwarded to a data center. A typical smart healthcare unit handles multiple forms of data using different types of sensing devices that collect diversified information about patient health [6].
However, the challenges in this smart technological advancement are: i) each sensor captures continuous data whose size is ever increasing, so it cannot be stored in physical storage on the premises of healthcare units; ii) the forms and formats of the data are highly diversified, and further data of this kind may be added by the same healthcare units, leading to the formation of complex medical data [7]; iii) when aggregated in one place, these data are difficult to store because they are highly unstructured (in such a form, conventional knowledge extraction methods cannot be applied); iv) existing approaches rely heavily on distributed software frameworks such as Hadoop and MapReduce; however, these are baseline standards, and their complex architecture needs to be amended whenever the demands of the application change. Besides, many loopholes have been reported for such frameworks, so they cannot be recommended for a critical application such as healthcare, which involves sensitive clinical data to be used in future analysis [8].
Existing mining approaches are applied to complex data, but in highly defined and controlled environments, and their cost effectiveness in the field of medical science has not yet been proven [9]. It is essential to identify the unique data structure of a complex form of medical data so that the data can be structured. Once complex medical data is formulated in an efficient data structure, it opens up many ways to analyze the structured data. However, this is not easy, as such data always arrive as a stream, which brings various associated problems [10]. Another challenge with complex medical data is storage optimization over the cloud. Not all the data should be stored in the cloud, since it grows exponentially in size. Hence, better storage optimization can be achieved if only the mined data are stored in the distributed cloud with a superior indexing mechanism. Therefore, the proposed system presents a solution in which a unique transformation system converts unstructured data into highly structured data without using any system that has been implemented to date or any system with high resource dependencies. The focus is on cost-effective normalization of complex medical data.
The paper discusses this solution in terms of methodology, algorithm, and benchmarked outcomes. The organization of this paper is as follows: Section 1 discusses the existing literature on data transformation and processing schemes in the healthcare sector, followed by a discussion of the research problems and the proposed solution. Section 2 discusses the algorithm implementation, followed by the result analysis in Section 3. Finally, concluding remarks are provided in Section 4.
The Background. This section discusses the existing data transformation and processing schemes in the healthcare sector. A study of EHR-based data was carried out by Muslim et al. [11], focusing on reducing the discrepancies in the existing standards of medical data. The adoption of ETL (Extract/Transform/Load) for processing heterogeneous data is discussed by Diouf et al. [12], who claim that existing transformation schemes still encounter major challenges in the face of the present state of complex data. Wang et al. [13] developed a data sharing scheme using the conventional ETL process over varied datasets to show that their work is capable of initiating a better transformation process. Various works have been carried out towards making healthcare systems smart for precise diagnosis [14][15][16][17]. Magarino et al. [18] presented a prototype-based model for analyzing sleep state activity. Existing work has also leveraged machine learning for diagnosis systems while focusing on data privacy, as seen in the work of Zhong et al. [19] and Chen et al. [20].
A smart intelligent system using a big data approach was developed by Zhang et al. [21] to assist effective diagnosis. EHR data were also studied by Wu et al. [22] and Viceconti et al. [23], where a big data approach is used to monitor health factors leading to an effective diagnosis system. Existing approaches to big data analytics for improving healthcare systems are studied by Shafquat et al. [24] with a view to cost-effective schemes. A layer-based approach was presented by Chen et al. [25], where a specific case study of diabetes is considered with the intention of developing highly customized treatment. Garattini et al. [26] discussed the significant attributes of using a big data approach to manage diagnostic analysis of lethal diseases. HBase is used for medical database management by Chrimes and Zamani [27], where big data analytics is formulated to contribute to varied clinical services. A similar approach was also discussed by Tawalbeh et al. [28].
Sarkar et al. [29] discussed a novel security model for medical datasets. Zhang et al. [30] discussed the significance of machine learning for smart clinical services. Srinivasan and Arunasalam [31] discussed a unique data analytics system with an emphasis on securing big data in the medical domain. A closer look at the existing transformation and analytics approaches for medical data shows that they adopt machine learning or big data approaches, where the idea is to find an elite solution as part of the data transformation system. The approaches are claimed to offer better outcomes in addressing the problems stated in their respective literature. The existing works have both advantages and limitations.
Bhalla and Bagga [32] introduce an opinion mining model for text categorization. The model is developed using the RB-Bayes model, and the RB-Bayes computation is reported to have an accuracy of 83.34 on web-based text. The model is useful for resolving real-time problems for different types of text mining approaches. The research problems extracted from this review, which the proposed study addresses, are outlined next. The problems that are as yet unsolved in leveraging the analytical processing of medical data are:
a. Existing studies are specific to disease-based approaches, which narrows their scope with respect to generalized diagnosis over medical data.
b. Little consideration is given to the fact that EHR data are now generated as a data stream, on top of which it is quite challenging to apply online analytics.
c. The granularity of the data and its internal structure are largely missing from existing systems, which degrades the accuracy of the analytical process.
d. Apart from the frequently used ETL scheme, there is not much novelty in data transformation schemes for facilitating a better knowledge extraction process.
Based on the above unsolved problems, the problem statement can be framed as follows: "Developing a computationally cost effective data transformation scheme for facilitating processing of complex medical data for better analytical operation."
The Proposed Solution. The prime aim of the proposed system is to develop an effective data transformation scheme that makes complex medical data suitable for optimized storage and that leverages the analytical processing of the medical data. The proposed system adopts an analytical research methodology in order to achieve its aim. A pictorial representation of the methodology adopted in the proposed solution is shown in Figure 1.
The proposed system takes unstructured medical data as input and generates suitably structured data in a non-conventional manner in order to maintain the cost effectiveness of medical data analysis. The proposed system considers Electronic Health Record (EHR) data, which consists of both clinical and non-clinical information about the patient. Assuming that a healthcare system hosts its storage services over the cloud and adopts various advanced technologies to capture essential EHR information of the patient autonomously, all these EHR data can be considered to be forwarded to the data center as a data stream, which is then stored and analyzed. However, unlike existing systems, the proposed system considers that raw medical data should not be stored directly in the cloud storage unit; instead, processing is applied on top of it to make the data more suitable for storage and analysis. In this regard, the proposed system constructs an analytical environment that aggregates all the unstructured data in a unique form, which is then subjected to the next part of processing.
The proposed system extracts data chunks (individual patient data as a cell) and indexes their values and all the core fields (also known as headers) through a unique indexing operation. This indexing operation connects the respective values of each core field in a temporary buffer over the cloud, while all the static header information is stored permanently in the cloud storage units. The benefit of this process is that it avoids repeatedly identifying and storing the core fields, which considerably saves processing effort and time. The proposed system also uses a virtual template-based approach to facilitate indexing and data structurization using a matrix-based operation.
The proposed system uses syntactical parsing, where a semantic-based approach is applied with a higher degree of customization over the values of the core fields. The unstructured data is thereby transformed into semi-structured and finally into structured medical data, to which semantics are applied to extract the correct inference from the medical data in the form of knowledge. The next section discusses the implementation of the proposed system with respect to the algorithm and its execution.

SYSTEM IMPLEMENTATION
The proposed system introduces a unique form of smart EHR system that is capable of storing the mined information within the distributed cloud storage system instead of storing the raw medical data in it. Hence, the target is to obtain a higher degree of storage optimization and to assist smooth analytical operation on the EHR data. This section discusses the assumptions and dependencies associated with the design, followed by the implementation strategy and execution flow.

Assumption and dependencies
The primary assumption of the proposed system is that there are a number of healthcare units, each with an autonomous mechanism to upload EHR data that are free from any errors or artifacts. This means that there are no problems within the EHR data at the local source of origination. The secondary assumption is that all these healthcare chains are connected to a cloud-based service which extracts the diversified EHR data from multiple sources and reposits it in a distributed manner in the cloud clusters. The tertiary assumption is that diverse forms of EHR data are collected, and that problems or artifacts are introduced when the data are aggregated during the cloud reposition process. Therefore, the goal of the proposed system is to address this problem by introducing a technique that can rectify this data integration problem in the cloud in order to reduce the adverse effect of unstructured data. The core dependency of the proposed system is that all the EHR data are textual and could be reposited in any format; however, for ease of computing, the proposed system considers the text to be in plain text format.

Implementation strategy
It should be noted that the proposed system is a computational model that aims to optimize the distributed storage system of a healthcare unit in order to facilitate smart analytical operations. The study demands a stream of incoming data from the source of the EHR data; however, such a mechanism is quite complex. Therefore, the proposed system constructs an analytical data aggregation model as the primary implementation strategy so that the near-real-time traffic flow of textual EHR data can be mapped in the proposed system. After the data aggregation process is analytically designed, the next step is to transform the unstructured data into semi-structured data so that analytical operations can actually be carried out on top of it. The next part of the study implements a context-based mechanism which extracts the mined information on the basis of the correlated data obtained from the semantics of the extracted text. For this purpose, the proposed system performs an initial classification of the text data into two types, viz. i) static data and ii) dynamic user data. For faster processing, the proposed system extracts the static data and reposits it in the cloud storage units, while, to maintain non-redundant data over the distributed cloud storage system, it extracts the dynamic data, indexes it, and then reposits it in the temporary buffer. The proposed algorithm is then applied to this temporary buffer to carry out the data transformation operation followed by the mining operation. Over the entire process, the proposed system offers cost-effective analytical operation over the medical data. The execution process is shown in the next sub-section.
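The classification into static and dynamic data can be illustrated with a minimal Python sketch. The paper's implementation is in MATLAB and its record layout is not given, so the sketch assumes a hypothetical plain-text layout of `Header: value` pairs separated by semicolons; the function name `split_static_dynamic` and the sample records are illustrative only.

```python
def split_static_dynamic(record, template):
    """Split one plain-text EHR record into static headers and dynamic values.

    `template` is the list of headers seen so far (the static data); a header
    is appended only the first time it appears, so it is never stored twice.
    Returns a list of (header_index, value) pairs (the indexed dynamic data).
    """
    dynamic = []
    for field in record.split(";"):
        if ":" not in field:
            continue  # skip fragments that carry no header/value pair
        header, value = (part.strip() for part in field.split(":", 1))
        if header not in template:
            template.append(header)
        dynamic.append((template.index(header), value))
    return dynamic


template = []      # stands in for the permanent cloud-template storage
temp_buffer = []   # stands in for the temporary buffer of dynamic data
records = [
    "Name: A. Rao; Age: 54; Diagnosis: type 2 diabetes",  # hypothetical record
    "Name: B. Lee; Age: 61; Diagnosis: hypertension",     # hypothetical record
]
for rec in records:
    temp_buffer.append(split_static_dynamic(rec, template))
```

In this sketch the second record adds nothing to `template`, which mirrors the stated intent of keeping the static data non-redundant while only the indexed values reach the temporary buffer.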

Execution flow
The first step of the execution flow is to process the input stream of text data. However, as discussed earlier, extracting streams of text data is challenging and needs access to the buffer of the cloud storage points, which maintains an adaptive queue system. The proposed system addresses this problem by constructing its own buffer of stream data d_s over a sampled period of time t_s (see Figure 2). Therefore, the first step of the proposed execution is to construct synthetic unstructured data over the proposed buffer system. The algorithm takes as input the data stream d_s over a definite sampled period of time t_s and, after processing, yields tem_matrix (a temporary matrix of unstructured data).

Algorithm for Synthetic Unstructured Data Construction
Input: d_s, t_s
Output: tem_matrix
1. For each individual data d_i in d_s
2. aggregate d_s into d_a
3. …
4. reposit d_a → d_mem
5. For all a
6. extract d_stat with respect to head and p_o
7. …
8. …
9. reposit d_dyn → tem_matrix
10. End
End
End
The steps of the algorithm are discussed as follows. The algorithm considers that there is an individual data d_i for each respective stream of data d_s (Line-1). The system then aggregates all the sampled incoming streams d_s and stores them in an aggregated matrix d_a (Line-2). All these data are cumulatively aggregated, and d_a is then reposited to form a sampled buffer d_mem (Line-4). The buffer d_mem can therefore be considered to possess all the aggregated streams of sampled data d_a, which can now be subjected to further processing. The proposed system considers that each individual data d_i consists of individual patient EHR records a, where the variable a is the number of unique EHR records of an individual patient. The number a in one individual data d_1 may differ from, or be the same as, that in another individual data d_2; it depends on the density of the text data in the traffic.
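The aggregation of a sampled stream into a buffer, as described for Line-1 to Line-4, might be sketched in Python as follows. The representation of the stream as (arrival time, d_i) pairs, the function name `sample_stream`, and the sample values are assumptions made for illustration; the paper does not specify how arrival times are recorded.

```python
def sample_stream(stream, t_start, t_s):
    """Aggregate every individual data d_i that arrives within the sampling window.

    `stream` is an iterable of (arrival_time, d_i) pairs, where d_i is a list
    of patient records. Items outside [t_start, t_start + t_s) are ignored.
    """
    d_a = []  # aggregated matrix: one row (list of records) per d_i
    for arrival, d_i in stream:
        if t_start <= arrival < t_start + t_s:
            d_a.append(list(d_i))
    # The sampled buffer keeps the aggregated rows together with its window.
    d_mem = {"window": (t_start, t_start + t_s), "data": d_a}
    return d_mem


stream = [
    (0.2, ["rec-1", "rec-2"]),           # d_1 holds two patient records
    (0.9, ["rec-3"]),                    # d_2 holds one
    (1.7, ["rec-4", "rec-5", "rec-6"]),  # arrives after the window closes
]
d_mem = sample_stream(stream, t_start=0.0, t_s=1.0)
```

The two rows kept in `d_mem["data"]` have different lengths, which corresponds to the observation that the number of records a may vary from one individual data to another.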
Considering all the values of a (Line-5), the algorithm extracts the static data d_stat with respect to the headers head and pointers p_o (Line-6). Headers are the prime attribute of a field, representing the complete column of text data with respect to its type, while a pointer connects each header with its respective value d_dyn. This means that the proposed system makes use of a template-based data reposition technique, where d_stat represents the static data of the template while d_dyn represents the patient EHR data fed by the personnel of the healthcare sector. In other words, d_dyn is the direct value of the individual headers (h_1, h_2, …), where pointers separate each header from its individual EHR value (Line-7 and Line-8). Finally, all the d_stat data are reposited in the cloud-template storage, while the user EHR data, i.e. d_dyn, is stored in a temporary matrix tem_matrix (Line-9), which holds all the unstructured data. The contribution of this algorithm is that it offers a sandboxing mechanism to construct a memory system where the unstructured data can be stored to facilitate the upcoming processing. Figure 3 pictorially represents the processing of the data stream followed by its reposition. The next part of the implementation is associated with extracting knowledge from the data stored in the temporary matrix. The algorithm takes as input h (headers) and d_dyn (dynamic data) and, after processing, yields d_sem_struc (semi-structured data) and term_know (mined data).
t1add < and > tags to h and ddyn 3.
extract termknowterm 6. End The discussions of steps of the algorithm are as follows: In order to perfoming mining operation, it is necessary for the algorithm for ensure proper confirmation of the user EHR records. As there are also possibilities of redundant data (which may affect the future part of analytical operation), there is a need to perform a proper indexing operation. For this purpose, the algorithm first accesses the temporary matrix which has all the user-fed EHR records of the data; however, there is no static data in the temporary matrix. Therefore, the algorithm will need to check the indexes of the each headers present in the variable a and construct and index. The construction of the index is carried out on the basis of the term used in the individual headers with respect to its individual location.
The proposed system calculates the number of entries for each individual sampled data stream in order to initially confirm the ownership of the respective d_dyn for the respective individual header h. For example, if there are three headers, the index holds h_1(loc_1) d_dyn1, h_2(loc_2) d_dyn2, and h_3(loc_3) d_dyn3 with respect to three different locations loc_1, loc_2, and loc_3. This index of headers and locations is stored in the cloud unit and is indexed with the respective user-fed EHR data. The algorithm initially considers the maximum number max of headers present in the data streams and then adds the start tag < and the end tag /> (Line-2). This operation is carried out for the header h and the user-fed EHR data d_dyn, which turns the unstructured data into semi-structured data stored in the matrix d_sem_struc (Line-3). It should be noted that this operation is carried out without using any existing tools of any distributed framework, and yet it is fast and efficient.
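The tagging step can be pictured with a short Python sketch. The text states only that start and end tags are added to h and d_dyn; the exact tag layout is not given, so the XML-like element with a `loc` attribute used here, the function name `to_semi_structured`, and the sample values are all assumptions.

```python
from xml.sax.saxutils import escape


def to_semi_structured(headers, values):
    """Wrap each header/value pair in tags, keeping the header's location."""
    rows = []
    for loc, (h, d_dyn) in enumerate(zip(headers, values), start=1):
        tag = h.strip().replace(" ", "_")  # tag names must not contain spaces
        # escape() protects characters such as < and > inside the value
        rows.append(f'<{tag} loc="{loc}">{escape(d_dyn)}</{tag}>')
    return rows


d_sem_struc = to_semi_structured(
    ["Name", "Age", "Clinical note"],                  # hypothetical headers
    ["A. Rao", "54", "HbA1c > 8, review in 3 months"],  # hypothetical values
)
```

Only the standard library is used, which is in keeping with the stated goal of avoiding tools from any distributed framework.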
The next part of the proposed system uses a semantic operation to obtain the correct meaning and context of the medical EHR data (Line-4). This operation allows the proposed system to construct a customized number of semantics on the basis of the problems in diagnosis; hence, with minor fine-tuning of the semantics, it can considerably improve the scope of automated diagnosis and knowledge extraction. The proposed system can construct semantics on the basis of the name of a prominent disease or a highly significant review of the disease made by a physician. The constructed semantics are applied to the terms t_1 from Line-2, which connect both the semi-structured headers and the user-fed values of the EHR record of an individual patient with respect to their terms and locations (Line-4). Finally, the extracted term is considered the final outcome of the proposed algorithm in the form of knowledge. Figure 4 presents the complete process of the proposed system.
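One possible reading of the semantic operation is a set of customizable keyword rules applied to each value together with its header and location. The Python sketch below follows that reading; the rule table `SEMANTICS`, the function name `extract_knowledge`, and the sample fields are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical semantics: keyword patterns mapped to the knowledge term they signal.
SEMANTICS = {
    r"\bdiabet\w*": "diabetes",
    r"\bhypertensi\w*": "hypertension",
    r"\bcritical\b": "critical condition",
}


def extract_knowledge(fields, semantics=SEMANTICS):
    """Return (header, location, term) for every semantic rule matching a value."""
    term_know = []
    for header, loc, value in fields:
        for pattern, term in semantics.items():
            if re.search(pattern, value, flags=re.IGNORECASE):
                term_know.append((header, loc, term))
    return term_know


fields = [
    ("Name", 1, "A. Rao"),
    ("Diagnosis", 2, "Type 2 Diabetes"),
    ("Physician review", 3, "Hypertensive, condition critical"),
]
term_know = extract_knowledge(fields)
```

Fine-tuning the semantics then amounts to editing the rule table, for example by adding a disease name or a phrase from a physician's review, without touching the extraction code.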

RESULT ANALYSIS
In order to analyze the proposed system, the challenging step is the primary process of obtaining a precise input of EHR data. This is because a complex form of EHR data is required, where the data is large in dimension and exhibits various forms of heterogeneity. Therefore, some of the publicly available datasets [34][35][36][37] were reviewed in order to visualize the patterns of possible medical data in big data. Hence, the first step is to ensure that the proposed system also considers EHR data similar in pattern to big data. The study considered the EHR data from the existing standard datasets of [36,37], where each file is megabytes in size and is available in formats that are widely supported on any machine. However, such available datasets lack consistency; hence, the proposed system constructs synthetic EHR data consisting of multiple headers, each with different values, in order to assess the effectiveness of the mining approach. The proposed dataset consists of discrete information about the patients as well as various clinical inferences from the physician of the respective patient. The purpose of the analysis is to obtain analytical information about the disease criticality of the respective patient.

Assessment environment
The proposed system was scripted in MATLAB on a normal Windows environment with a Core i3 machine. MATLAB offers various benefits for carrying out the transformation operation using a matrix-based mechanism in order to assess the proposed system. The complete input data is split into smaller parts to generate individual data d_i, whose sizes may differ from each other. The idea is also to assess the overall data processing time to see whether different file sizes have any effect on the throughput of the analytical operation. The analysis of the proposed study is carried out considering the time for data structurization and the memory saturation state.

Results obtained
The assessment of the proposed system covers both the individual outcome and the comparative outcome. The time for data structurization is computed as the difference between the time of the previous record and that of the current data structurization record. The analysis is carried out over 10 experimental trials, where the allocation of the input data stream is increased randomly in order to map to the near-real-world data streaming process over the network.
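The time computation described above can be written as a difference of consecutive clock readings. The Python sketch below assumes that a cumulative reading is taken after each trial; the function name `per_trial_times` and the readings are hypothetical illustration values, not measurements from the study.

```python
def per_trial_times(cumulative):
    """Time of each trial = current cumulative reading minus the previous one."""
    previous = 0.0
    times = []
    for current in cumulative:
        times.append(current - previous)
        previous = current
    return times


# Hypothetical cumulative clock readings (seconds) after each of four trials.
readings = [0.5, 1.25, 2.25, 3.5]
times = per_trial_times(readings)
```

With these illustrative readings the per-trial times grow from one trial to the next, which is the shape of trend the analysis looks for when the stream allocation is increased.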
The outcome in Figure 5 shows that the data structurization time increases with increasing experimental trials, which is natural. This is because each experimental trial has a varying allocation of data stream size, with a 10-20% definitive increment while the remaining 80-90% of the data is allocated in random fashion. There are two interesting observations in this part of the analysis, viz. i) the proposed system offers a gradual increase in time, which is quite deterministic and shows that it offers a stable data processing capability, and ii) irrespective of the fluctuating size of the data streams, the proposed system offers better transformation performance, where an abnormal traffic increase does not have a significant effect on the performance of the analytical operation. Knowledge extraction takes slightly longer, because after the data structurization is carried out, the system always needs to cross-check with the indexes from the cloud resources in order to confirm the position and term obtained from the dynamic data. This could increase the knowledge extraction time to some extent. Figure 6 highlights the same fact, showing that the knowledge extraction time is slightly more than the time for data structurization.
Figure 6. Analysis of overall computational time
Apart from this, the proposed system has been compared with the existing ETL (Extract/Transform/Load) scheme, which is the frequently used data transformation scheme in existing systems (e.g. [11][12][13]). The comparative analysis is carried out with respect to response time and memory consumption, as shown in Figure 7. Figure 7 shows that the proposed system offers significantly better performance than the existing system in terms of response time, as shown in Figure 7(a), and memory consumption, as shown in Figure 7(b).
The prime reason behind this is that the ETL scheme suffers from a parsing problem with the source data, especially if the data is massive and heterogeneous in itself. In contrast, the proposed system offers a very lightweight scheme without any dependencies on third-party applications or plugins, and uses simplified semantics for extracting knowledge. The memory consumption outcome also shows that the proposed system occupies a very low amount of memory. The core reason behind this is that the proposed system uses a sandboxing mechanism in which a temporary buffer is constructed, which can significantly reduce memory consumption. Hence, no record of the underlying process is kept, which makes the analytical operation faster as well as memory efficient. Therefore, the proposed system offers a cost-effective data transformation process for EHR data.

CONCLUSION
The paper presents a unique transformation approach in which a complex unstructured stream of medical data is converted to highly structured data through a distinct mechanism that is free from any dependencies. The contributions of the proposed system are as follows: i) the proposed system reduces the storage complexity and the processing time, ii) the proposed system introduces a novel mechanism to identify static and dynamic data, where the static data is stored in the cloud and the dynamic data is stored in a temporary buffer using a template-based approach, iii) the proposed system does not store any raw data over the cloud but stores only the mined data, which reduces complexity and increases the richness of the data.