Tiny datablock in saving Hadoop distributed file system wasted memory

ABSTRACT


INTRODUCTION
In the big data world, many companies offer their own analysis systems as "platform as a service" products, such as HDInsight from Microsoft and Amazon S3 from Amazon [1], [2]. However, whether the service is cloud-based or built on premises, most of these data platforms were developed and customized on top of an open-source system named Hadoop. Hadoop is a distributed open-source system built by the Apache Software Foundation and is used to host and analyze all kinds and models of big data. A Hadoop cluster consists of one node called the namenode or master node, connected to thousands of nodes called datanodes [3]. Hadoop is a set of subsystems called ecosystems: some are used for keeping and tracing purposes, others for data analysis and extract, transform, load/extract, load, transform (ETL/ELT) data injection, and only one of them, the Hadoop distributed file system (HDFS), is for data hosting [4]. HDFS is the main ecosystem Hadoop uses to host files in a distributed manner. HDFS splits every new datanode into a set of storage units called datablocks [5]. Thus, the datablock is the smallest storage cell in HDFS and is used to store Hadoop files in a <key, value> manner [6], [7]. The default size of any datablock in all datanodes is 64 MB, but in several custom deployments it can be raised to 128 MB or even 256 MB [8].
Figure 1 shows the basic assignment from namenode to datanodes in HDFS, where block B1 in all datanodes belongs to one file. The client could be anything: a human, a sensor, a machine, and so forth. The file storing assignment begins with the client sending a dataset to the namenode. In this regard, the namenode's jobs are to: i) check whether there is free memory for the new dataset, ii) provide free datablock IDs to the new dataset, and iii) split large files (>=64 MB) for assignment to multiple datablocks. After all the operations above, the dataset continues to the assigned datablocks to be stored there, and the required metadata files are sent to the namenode to update it on the new dataset's files and their locations inside HDFS [9], [10].
Figure 1. How HDFS stores a new dataset [11]

Each time a Hadoop user injects HDFS with a set of new files, HDFS goes through the same scenario above to host the new data. One of Hadoop's features is "data high availability": every single file remains available for use even if one copy is corrupted or deleted [12]. Hadoop adopts a technique called the "replication manner," whereby every file is stored in three copies, so that if one copy somehow becomes unavailable, the other copies act as replacements and the processing and insight jobs resume. Figure 2 demonstrates the replication manner [13], [14].
Figure 2. HDFS replication manner [15]

In Figure 2, the 1 GB file requires approximately 16 datablocks to store it. Then, to apply the HDFS high-availability principle, all of these datablocks must be replicated three times in total. Lastly, all of these new datablocks and their replications must send metadata files to the namenode to complete the file storing steps [16]. In HDFS, each datablock can host only one file, irrespective of whether or not the file fills the datablock [17]. This technique works well with big data in big files, where every file in the dataset is greater than the datablock size [18]. However, things are different with big data in small files, where every file is smaller than the default datablock size [19]. The datablock hosts only one file because the namenode can access the desired file via the datablock ID only. In other words, the namenode cannot access anything inside the datablock directly. Thus, a small file occupies the memory it requires and renders the rest of the datablock memory inaccessible [20]. Equation (1) gives the amount of wasted memory for each datablock, where wm is the wasted memory, dbs is the datablock size, and fs is the file size. Figure 3 shows the standard file hosting in HDFS when the dataset injected into the HDFS datablocks is big data in small files.

wm = dbs − fs     (1)

Figure 3. The traditional datablocks hosting method [17]

To determine the depth of the problem, (1) is re-applied to the results in Figure 3: the summation of the wasted memory is 225 MB out of the total datablock capacity of 448 MB. Moreover, this 225 MB of wasted memory is only one of the three copies that HDFS must create for data high availability in case a read fails due to the unavailability of the desired data. Thus, 225 MB of waste turns into 675 MB of total wasted memory [15]. These numbers show a huge amount of wasted memory just to host a small dataset [11]. The formula flow for the datablock wasted memory problem is shown in Table 1.
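As a sketch of (1) and the replication effect, the following applies wm = dbs − fs per file and triples the total for the three HDFS copies. The file sizes below are hypothetical, chosen only so the totals reproduce the 225 MB and 675 MB figures above.

```python
DATABLOCK_MB = 64     # default HDFS datablock size
REPLICATION = 3       # HDFS high-availability copies

def wasted_memory(file_sizes_mb):
    """Eq. (1) summed over one small file per datablock: wm = dbs - fs."""
    per_copy = sum(DATABLOCK_MB - fs for fs in file_sizes_mb)
    return per_copy, per_copy * REPLICATION

# Hypothetical small-file sizes filling 7 datablocks (448 MB of capacity).
sizes = [60, 50, 40, 30, 20, 13, 10]
print(wasted_memory(sizes))   # (225, 675): 225 MB per copy, 675 MB in total
```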

RELATED STUDIES
There are some popular solutions to this problem. The first is the Hadoop archive (HAR). HAR compresses the small files into one HAR file inside one datablock, and two index IDs are used to access the desired file inside the HAR [13]. The second is dynamic partitioning, which adds a new node to the cluster, called the aggregator node, to determine whether each upcoming file is greater or smaller than the datablock size. The next section discusses the most popular techniques for solving the problem addressed in this study.

HAR file
Hadoop archive (HAR) is the earliest attempt to solve the problem above; it packs a number of small files into one archive file before pushing it into HDFS. However, the small files inside this archive file cannot be accessed directly, because two index IDs have to be searched before reaching the desired file. The access is done in main memory. Figure 4 presents the HAR file access diagram [21].
Basically, HAR was designed to solve the big data in small files issue. Thus, the technique adopted to solve that issue is also used to reduce HDFS wasted memory [16]. So, instead of placing every small file in a standalone datablock, the HAR file does the job by archiving all of them in one file. However, creating a HAR file involves running a MapReduce job on a target directory "/dir" that contains all the small files to be archived. To access a desired file in any HDFS archive, the user needs to go through two index files: the first is the master index and the second is the index table. Figure 5 shows the access method for each HAR archive. This access mechanism slows down HDFS data retrieval, so the HAR technique is only appropriate for rarely accessed data (cold data) such as log files [22]. Another shortcoming of HAR is that it cannot append more files to an existing archive. This means that if a new file is injected into HDFS, it will be hosted in a new datablock even if there is a HAR file that is not full yet, making that HAR unable to host more. Hence, any upcoming file cannot be appended to this HAR and will be directed to a new one. The unused memory in both the old HAR and the new one is considered wasted memory [23].
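The two-step access described above can be sketched as a toy model: a master index maps a file-name hash to a slice of the index table, and that slice is then scanned for the entry holding the file's offset and length. The hash function, field names, and layout here are illustrative, not the real HAR on-disk format.

```python
def build_har_index(files, buckets=2):
    """files: list of (name, offset, length). Group entries by name hash so the
    master index can point each hash bucket at a slice of the index table."""
    key = lambda name: sum(map(ord, name)) % buckets   # toy stand-in for HAR's hash
    index_table = sorted(files, key=lambda f: key(f[0]))
    master_index, pos = [], 0
    for b in range(buckets):
        n = sum(1 for f in files if key(f[0]) == b)
        master_index.append((pos, pos + n))
        pos += n
    return master_index, index_table

def har_lookup(master_index, index_table, name, buckets=2):
    """Two-step access: master index first, then a scan of one index-table slice."""
    start, end = master_index[sum(map(ord, name)) % buckets]
    for entry_name, offset, length in index_table[start:end]:
        if entry_name == name:
            return offset, length              # where the file's bytes live
    raise FileNotFoundError(name)
```

The extra lookup per access is the cost that makes HAR suitable mainly for cold data.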

Dynamic partitioning
The Hadoop cluster principle operates on one namenode and thousands of datanodes. However, the dynamic partitioning approach alters this principle by appending a decision-maker node, namely the aggregator node, located between the namenode and the rest of the cluster [22]. The aggregator node is the decision maker of the cluster: when a new file is injected into the namenode, the namenode splits it and passes the results to the aggregator node to check whether the size is greater than or equal to the 64 MB datablock size. If the dataset's size is greater than or equal to the datablock size, it is directly assigned to the targeted datablock. On the other hand, if the dataset is less than 64 MB [24], the aggregator node transfers it to the proposed datablock using a dynamic partitioning approach. Hence, the partitioning approach fits the size of the datablock to the upcoming dataset. This approach overcomes the HDFS fragmentation that causes memory wastage, as demonstrated in Figure 6 [11], [25].
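A minimal sketch of the aggregator node's decision rule described above, assuming the 64 MB default block size: large files are split across full-size blocks, while smaller files are passed to the dynamic-partitioning path that fits a block to the file.

```python
import math

DATABLOCK_MB = 64   # default HDFS datablock size assumed here

def aggregator_route(file_mb):
    """Return (path, allocated_mb) for one incoming file."""
    if file_mb >= DATABLOCK_MB:
        blocks = math.ceil(file_mb / DATABLOCK_MB)   # split across full blocks
        return "standard", blocks * DATABLOCK_MB
    return "dynamic-partition", file_mb              # block fitted to the file
```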
Dynamic partitioning alters the decision making of the namenode. The namenode is the master of Hadoop that uses yet another resource negotiator (YARN) to drive the rest of the cluster. With this approach, the decision is partially moved to the aggregator node, which does not match the aim of this study, that is, to solve the wasted memory problem without any major customization of Hadoop's default structure. Moreover, the reading process under dynamic partitioning is slow, because each analysis script and the attained results must go through the aggregator node; this reflects negatively on read/write performance, and the connection latency increases.

Sequence file
The sequence file does not target HDFS wasted memory directly; it was proposed as a solution to reduce the negative impact of big data in small files. Here, a file consisting of a set of small files is created, called a sequence file. This new file is reported to the namenode as a single file with one metadata file only, and this is a good solution for the big data in small files problem: one metadata file is sent to the namenode directory instead of hundreds of metadata files, each presenting one datablock to the namenode [1]. Thus, instead of hosting every small file inside a standalone datablock, the sequence file collects all of these small files, which act like a series of files with a unique ID for each, under the following rules: i) the sequence file cannot be greater than the datablock, and if it is, the system generates a new one in a new datablock; and ii) the reading mechanism adopts search without indexing, which means that each time the search process accesses a small file in the sequence file, it must search for the desired file from the beginning of the sequence file to the end [13], [26].
The sequence file reduces HDFS wasted memory well, but this solution has the following problems: i) the access method complexity is very high (O(n²)), and the search must include all files in the sequence file to find the desired one; and ii) since the datablock is not required to change (expand or shrink), the sequence file follows the default datablock size of 64 MB. Consequently, when the sequence file is smaller than the datablock size, the rest of the datablock's free memory is unreachable and is considered wasted memory [1].
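The scan-from-the-start access described above can be sketched as follows; the record layout is illustrative, not the real SequenceFile format. The returned scan count makes the cost visible: it grows with the record's position in the file.

```python
def seqfile_read(records, wanted_id):
    """records: list of (record_id, payload) in file order; no index exists,
    so the scan always starts at the head of the sequence file."""
    scanned = 0
    for record_id, payload in records:
        scanned += 1
        if record_id == wanted_id:
            return payload, scanned   # cost grows with the record's position
    raise KeyError(wanted_id)
```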
Based on the shortcomings of all the studied solutions (HAR, dynamic partitioning, and sequence file), this study proposes a solution that prevents the problem of slow file reading, which is the shortcoming of HAR and the sequence file, and the problem of cluster datablock capacity that cannot fit all of the small files, which is the shortcoming of dynamic partitioning. The next section provides the details of the new solution proposed in this study.

THE PROPOSED METHOD: TINY DATABLOCK (TD)
The adopted dataset is big data in small files: a huge amount of data (more than 10 TB) consisting of small files, each smaller than the default datablock size [22]. This kind of data is used in a quantitative methodology to find the gap between standard HDFS file hosting and the available file hosting methods, compared against the proposed method, as detailed in the next part. A solution called tiny datablock is proposed. This solution is based on the principle that every datablock drops its size to only 5 MB by default. However, the change to the default datablock size is made on top of HDFS.

The algorithm
This part presents the proposed method to overcome HDFS wasted memory and to save it for hosting more files in the Hadoop cluster. The proposed method adopts a shrinking mechanism to host each small file independently, and then merges them all into one datablock. The proposed solution uses two methods to reduce the problem highlighted in this study:

Memory shrinking
This method shrinks all of the available datablocks to 5 MB. The aim of shrinking the datablock size down to 5 MB is to match the most common size required for hosting big data in small files in each HDFS datablock. Figure 7 describes the shrinking action.

Datablock merging
This method is the next step after the execution of the memory shrinking method. Once the tiny datablocks are allocated to the new small files by the namenode, they gain the ability to merge with each other into one or more datablock(s) with only one metadata file (reference file), which also heads to the namenode. This means that if three files with a total size of 15 MB (small files) are injected into HDFS, then three datablocks are required to host them. These three datablocks can be merged into one datablock with a size of 15 MB, generating only one metadata file heading to the namenode, as shown in Figure 8.
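The 15 MB example above can be sketched as follows; the block and metadata field names are assumptions for illustration, not the actual HDFS metadata format.

```python
def merge_datablocks(blocks):
    """blocks: list of (block_id, file_bytes), one small file per tiny datablock.
    Returns one merged block that keeps the first block's ID and records the
    byte range of every absorbed file for the single metadata (reference) file."""
    merged = {"id": blocks[0][0], "data": b"", "chunks": []}
    offset = 0
    for block_id, data in blocks:
        merged["chunks"].append({"source": block_id,
                                 "start": offset, "end": offset + len(data)})
        merged["data"] += data
        offset += len(data)
    return merged
```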

Datablock reading
To read from a pre-merged datablock, this study adopts a byte-to-byte algorithm that counts every data chunk in the merged datablock as a stand-alone file. The byte-to-byte formula is a counter that captures the end point of each chunk inside the merged file based on a pre-given address supplied by the namenode. The byte-to-byte formula is explained in the data reading stage below. Meanwhile, applying the TD migration method in any Hadoop cluster requires that TD go through several stages and steps to complete the migration correctly. These stages begin with reading each upcoming small file's size, followed by placing the file into datablocks, and finally migrating the datablocks into a single datablock. The second stage relates to how to read files from the new datablock. Figure 9 shows the tiny datablock-HDFS (TD-HDFS) file writing.

Stage one (reading each new file size)
HDFS file injection and the addressing method begin with a positive response from the namenode about its ability to host an upcoming file(s) or dataset. The returned response carries a new address somewhere in the datanodes to host the new file(s). Then, Hadoop continues to the datablocks based on the given address (datanode, then the datablock inside it). Finally, Hadoop hosts the file. TD-HDFS alters this technique by appending a method called the size calculator, which works as follows:

Step 1 (block packages): the namenode is supported by a database consisting of capacity package records. Here, the namenode reads the new file's information and creates a metadata file. This metadata file consists of information about the file size and the datablocks required to store it. Table 2 accordingly shows the file size ranges and the volume database already installed in the namenode.

Step 2 (file addressing): the metadata file is now ready with all the data to be stored in the targeted datablock(s). These data consist of the datablock ID, the datablock reference, and the new item, namely the number of datablocks required to store the new file. As a scenario: a new file (20 MB) must be injected into HDFS using one of Hadoop's ecosystems, sqoop and/or flume. This new file is a structured database. Thus, the metadata file generated by the namenode assigns four datablocks for this new 20 MB file. Table 3 shows the differences between the old file addressing method and the TD one for hosting a 20 MB file. Based on Table 3, in the old file addressing method, HDFS books a 64 MB datablock to host the new 20 MB file. In the TD-HDFS file addressing method, HDFS books only four tiny datablocks, each of which is only 5 MB in size.

Step 3 (datablock merging): referring back to step two, the four datablocks must be located on the same datanode. In this regard, if the selected datanode has only three free tiny datablocks, the namenode must find another datanode with more side-by-side free datablocks. The datablock merging process sends the selected file to be stored in the pre-selected datablock IDs as shown in Table 3. HDFS splits the new 20 MB file into four related files such that each fits a tiny datablock. These datablocks are on the same datanode, and each of them hosts one file. TD-HDFS alters HDFS behavior with an equation called the "merger." The "merger" equation uses the metadata file generated by the namenode back in step one to merge the four datablocks above into one datablock with only one reference ID, the same as the earliest datablock ID. The goal of the "merger" equation is to minimize the metadata files returning to the namenode. Figure 10 shows the mechanism of the merger equation.
The merger mechanism equation merges all the pre-booked datablock1 to datablock4 into one datablock referred to as datablock1. The merging procedure covers the following: i) the merger starts reading all the datablocks that hold a part of the same file; in Figure 10 there are four datablocks; ii) the merger already knows that there are four datablocks with a total size of 20 MB; iii) the merger expands the first datablock (datablock1) to 20 MB; iv) if the datanode is full, the merger stops with an error message asking for free datablocks; v) if datablock1, datablock2, datablock3, and datablock4 are all related to each other (each one is a part of the original file), the next step (step four) is not applicable; vi) datablock1 is then ready to be read without the reading stage (later in this paper); vii) if each of datablock1 to datablock4 holds a non-related file, step four is required; viii) all of datablock2, datablock3, and datablock4 are migrated to the new datablock1 as new chunks; ix) the new datablock is then full of non-related chunks, where each chunk presents only one file; and x) by now, datablock2, datablock3, and datablock4 are useless and need to be assigned as empty blocks by the namenode.

Step 4 (classifier): before finalizing datablock1, the chunks inside it are separated; at this point, there is no relation between these chunks, but they reside as one piece that is not yet readable. Thus, the classifier equation adds extra metadata to the original datablock metadata to tell the next equation (stage three in this paper) how to read these non-related and non-addressed chunks. This extra metadata is presented in Table 4. Byte capacity is the capacity of each chunk inside the datablock, and this information is used by the namenode in stage three to read the chunks.
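The steps above can be condensed into a sketch: the size calculator books ceil(fs / 5 MB) tiny datablocks, and the merger combines the booked blocks under the first block's ID, leaving the classifier's per-chunk byte capacities as extra metadata. All names and the metadata layout here are illustrative assumptions.

```python
import math

TINY_MB = 5   # tiny datablock size proposed by TD-HDFS

def tiny_blocks_needed(file_mb):
    """Size calculator: a 20 MB file books ceil(20 / 5) = 4 tiny datablocks."""
    return math.ceil(file_mb / TINY_MB)

def merger(block_ids, chunk_sizes_mb):
    """Merge the booked blocks into one block under the first ID and emit the
    classifier's extra metadata (byte capacity per chunk, Table 4-style)."""
    return {"id": block_ids[0],
            "size_mb": sum(chunk_sizes_mb),
            "byte_capacity": [mb * 1_000_000 for mb in chunk_sizes_mb],
            "freed": block_ids[1:]}          # handed to the deleter in stage two
```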

Stage two (erase the datablock)
Since the merger does not have master permission to delete the useless datablocks, another equation assigns this job to the namenode-YARN ecosystem. This new equation is called the "deleter." It deceives the namenode by sending it the empty datablock IDs. The namenode then automatically updates HDFS with the following new situation: i) there is no change to datablock1 except that its size is expanded from 5 to 20 MB; ii) datablock2, datablock3, and datablock4 are assigned as free datablocks; thus, namenode-YARN assigns all of them as empty datablocks that are ready to use for other upcoming data; and iii) the final results are displayed in Figure 11.
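A sketch of the deleter's effect on the namenode's view; the free-list set is an assumed stand-in for the namenode-YARN block state, not the real API.

```python
def deleter(namenode_free, merged_id, absorbed_ids):
    """Report the absorbed tiny datablocks as empty; the merged block keeps
    its own ID and is never released."""
    for block_id in absorbed_ids:
        if block_id != merged_id:
            namenode_free.add(block_id)
    return namenode_free
```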

Stage three (data reading)
There is no way to read each chunk inside a single datablock, because the chunks are already separated but act like one file to the namenode. HDFS file hosting uses only 80% of the total datablock size, while the rest is used to host the metadata of the datablock, providing the namenode more information about it. To make the chunks inside each datablock readable and discernible to the namenode, the metadata in each datablock has to be altered, as shown in Table 4, before providing more information about the desired chunk in the datablock. TD-HDFS adds an equation called the "founder," controlled by namenode-YARN, to read out the metadata about each datablock that has already been updated by the "classifier" equation. The founder retrieves everything in the datablock as metadata into the namenode memory in order to read it correctly. Moreover, the "classifier" equation tells the namenode how to separate the chunks inside each datablock to fetch the desired chunk's data. The founder uses the byte-to-byte formula to separate the chunks from each other based on the given metadata. The byte-to-byte formula reads the whole data as a counted byte array; the array is the data inside the desired datablock. Thus, byte-to-byte obtains the information about each chunk's volume, and at which byte it starts and ends. Table 5 shows an example of the chunks in a single datablock, with byte information about each chunk. The byte information is retrieved from the datablock metadata. The founder, via the byte-to-byte formula, starts separating the chunks from each other; the beginning of the datablock array is the beginning of the first chunk. The byte counter counts 5,000,000 bytes before it stops and checks out this chunk as a separated file. The second 5,000,000 bytes are counted in a similar manner. The byte counting ends when it reaches the total datablock byte capacity. The outcome comprises four separated files that are ready to be shown to the end user. The byte-to-byte formula does not run if the datablock holds only one file, as mentioned in step three of the first stage. Figure 12 describes the steps to read the desired file in each datablock.
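The founder's byte-to-byte counting can be sketched as cutting one byte array at the recorded capacities; in the 20 MB example above each capacity would be 5,000,000 bytes. The short test strings below are illustrative.

```python
def byte_to_byte(datablock, byte_capacities):
    """Treat the merged datablock as one byte array and cut out each chunk by
    counting its recorded byte capacity from the end of the previous chunk."""
    chunks, pos = [], 0
    for capacity in byte_capacities:
        chunks.append(datablock[pos:pos + capacity])
        pos += capacity
    return chunks
```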

TESTING AND COMPARING
This paper attempts to fit the current HDFS files to the available in-charge datablocks. The TD technique utilizes a comprehensive model to absorb many small files within the same total datablock capacity, without the traditional memory wasting of the standard HDFS technique. Both clusters run the same hardware. Normally, Hadoop uses commodity hardware to run the analysis jobs. Typically, each machine in the cluster must have no less than 16 GB of RAM, a 500 GB hard disk, and a recent Core i5 processor. Each node of the cluster runs an OS (Ubuntu 20) that is the best practice for Hadoop performance and reliability. Figure 13 presents a full series of traditional HDFS filing as compared to the TD technique. The TD technique shows that more time is added to the file injection steps because of the extra steps (steps one to four) and the two additional stages (stage one and stage two) appended to the whole system.

Figure 13. A parallel test between HDFS and TD-HDFS

Data reading
In data reading (stage three), there are two scenarios for the TD technique. The first involves a datablock with related chunks, which is easy to read. The second involves a datablock with non-related chunks. Reading non-related chunks inside the datablock requires the metadata generated by the classifier: i) metadata about every chunk inside the datablock (in the namenode metadata file) and ii) metadata about the byte capacity of every chunk (in the datablock metadata part).
The first metadata resides in the namenode memory and makes every chunk well known to the namenode, so that it can easily be found in the desired datablock. Suppose the namenode wants to read chunk number 3 in datablock 1. In Hadoop, there is no actual reading mechanism at the namenode, as data processing happens in the datablock; the namenode's job is to send scripts to where the data resides to be processed, and a readable insight is obtained. Thus, the namenode sends the "founder" script to the datablock, and byte-to-byte counting is applied to reach the desired chunk. As shown in Figure 14 and Table 5, byte-to-byte counting already knows that the desired chunk 3 is 4 MB long. Thus, byte-to-byte starts from the end of chunk 2 and counts 4 MB of bytes before stopping the count and fetching the results.

Figure 14. Check reading series
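The chunk-3 read above reduces to offset arithmetic: the start of chunk i is the sum of the earlier chunks' capacities, and counting the chunk's own capacity from there yields the file. A sketch, with capacities in bytes:

```python
def read_chunk(datablock, byte_capacities, index):
    """Fetch one chunk: start where the previous chunks end, count its bytes."""
    start = sum(byte_capacities[:index])
    return datablock[start:start + byte_capacities[index]]
```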

RESULTS AND DISCUSSION
The result of this study rests on a comparison between the previous solutions for reducing HDFS memory wasting and the newly proposed TD-HDFS, as presented in Table 6. Figure 15 transforms the results in Table 6 into a meaningful form to clarify the positive memory saving of TD-HDFS against the other primary studies in this field. In Figure 15, TD-HDFS has the best results in reading complexity. The steps taken during the writing step generate more metadata on the desired chunk ID, and the byte-by-byte counter makes it easy to use this metadata for quick access to that file as a chunk in a combined datablock. However, these extra steps increase the writing complexity and take more time to complete a single file write. Nevertheless, the most important question in data hosting and analytics is "How fast can the system retrieve the desired file?" In this regard, based on the results above, TD-HDFS provides the best performance in data retrieval. The test sample requires almost 1.25 GB of datablock capacity (20 datablocks) to host the 20 files, but with TD-HDFS, the required resources reduce to only 2 datablocks with a total capacity of 110 MB. This result can change up and/or down depending on the following parameters: i) the number of files, ii) the size of each file, iii) the default datablock capacity for either standard HDFS or TD-HDFS, and iv) the total file capacity (for big and small files).
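The standard-HDFS capacity figure above follows directly from the defaults: 20 small files book one 64 MB block each. The TD-HDFS side (110 MB in 2 datablocks) is taken from the reported results rather than derived here.

```python
STANDARD_MB, FILES, FILE_MB = 64, 20, 5

standard_capacity = FILES * STANDARD_MB     # one 64 MB block per small file
payload = FILES * FILE_MB                   # actual data: 100 MB
wasted_per_copy = standard_capacity - payload

print(standard_capacity, wasted_per_copy)   # 1280 1180 (MB booked / MB wasted)
```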
In Figure 17, HAR works well in comparison with standard HDFS, because HAR archives all of the small files in one file. On the other hand, HAR treats the files as a file, unlike TD-HDFS, which treats the file as a datablock. Thus, in the HAR case, the archive file may hold only a few files, leaving the rest of the datablock free but unreachable. TD-HDFS treats the new files as datablocks and combines the datablocks depending on the new file's size. The final results of comparing TD-HDFS with HAR show that TD-HDFS does better than HAR, given that a HAR file is just an archive inside a datablock with a chance of not filling it completely, depending on the file capacities. One last comparison is between TD-HDFS and dynamic partitioning. Dynamic partitioning classifies the upcoming data files based on size, to be stored in a suitable datablock. The results show a slight difference between dynamic partitioning and TD-HDFS in memory saving, because dynamic partitioning does not have the combine step for merging related datablocks. Moreover, if dynamic partitioning is fed many small files that are smaller than the smallest datablock in the cluster, the datablocks cannot be combined to release more memory. In other words, a small file takes its place in the datablock and leaves the rest free but unreachable.

This study adopted three parameters to compare TD-HDFS with the other solutions in the same area: reading, writing, and wasted memory reduction. In data reading, all of the other studies are slower than TD-HDFS. Reaching the final chunk in TD-HDFS and reading it are easy, because all the required IDs and addresses are provided in the metadata file already stored in the namenode; thus, it is easy to find the desired datablock and fetch the pointed chunk to read it. Meanwhile, the writing step is complex, because it involves setting up the merging steps and generating the metadata file with all IDs and addresses, which consumes more time compared to the other studies. In reducing HDFS wasted memory, which is the core of this study, TD-HDFS with all the equations presented above was used to fit the required memory to the new files. Meanwhile, HAR cannot fit the required memory to the new files' volume, because HAR may archive only a couple of files in one datablock and keep the rest of it free but unusable.

Data integrity
Moving the data from one datablock to another does not cause any data corruption, because the moving mechanism is not cut-paste but copy-paste followed by deletion of the original. The copy-paste mechanism is the same mechanism used in the HDFS file replication manner, already adopted in all Hadoop file transactions. This means that the chunks hold a file with a start offset and an end offset, away from any customization.

Cold and hot data
TD is a customized HDFS behavior that adds extra steps to the existing system for writing purposes in order to reduce HDFS memory wasting. These extra steps add more complexity to write performance, as they slow down the daily file feeding. On the other hand, TD-HDFS works much better in data reading, as there is no added complexity there. Since one of Hadoop's principles is "write once, read many," focusing on data reading performance matters more than data writing; this makes TD-HDFS reliable for data reading with both hot and cold data.

CONCLUSION
There are several popular solutions to the HDFS wasted memory problem. The first is the HAR file, which basically targets the big data in small files issue but is only able to save some wasted memory. The aim of this study is to solve the HDFS wasted memory problem without placing a major upgrade on the HDFS structure. Both HAR and dynamic partitioning moved the HDFS wasted memory problem toward a good solution and saved some memory. However, the complexity of reading from HAR makes this solution appropriate only for cold data. The other solution, namely dynamic partitioning, alters the HDFS communication structure by adding a new node, called the aggregator node, between the namenode and the datanodes, but this can expand the connection latency between the original nodes of HDFS, which slows down the reading process. Comparatively, TD-HDFS was able to resolve the problem of HDFS wasted memory without majorly altering HDFS behavior. TD-HDFS does have a complex writing mechanism during new data injection into HDFS because of the new metadata created and appended to the original datablock metadata. However, this new metadata simplifies the analysis and reading stage, making the reading mechanism quick in fetching the required data from each combined datablock.

Figure 11. Chunks inside each datablock as a final step

Figure 12. TD-HDFS data reading

Figure 13 shows a parallel test of pushing only a 1 GB dataset of small files into a Hadoop cluster. The first cluster runs standard HDFS, while the second runs a customized HDFS with the TD technique (TD-HDFS). Both clusters run a native copy of Hadoop version 3.1.2. Since Hadoop is open source and written in Java, the TD-HDFS cluster received the updated equations on the default size of each datablock. The newly added equations are: − MergingEquation(), included in the Hadoop datanodes; − FounderEquation(), included in the Hadoop namenode and running the byte-to-byte formula; and − ClassifierEquation(), used by the namenode metadata.

Figure 15. Reading/writing complexity results for all related studies

Figure 16 presents the memory saving between HDFS and TD-HDFS. The comparison is based on the injection of 20 files into the Hadoop cluster, followed by the examination of the results of standard HDFS vs. TD-HDFS. Each file in the testing sample is 5 MB in size.

Figure 16. Memory saving in HDFS and TD-HDFS

Figure 17. Memory saving in HAR and TD-HDFS


This returns to the first square of the problem. Figure 18 shows the final result for dynamic partitioning in comparison with TD-HDFS.

Table 1. Formula flow of the wasted memory problem

Table 3. Old file addressing vs TD-HDFS file addressing

Table 5. An example of each chunk inside each datablock

Table 6. A results table for all previous studies and TD-HDFS