Big data anonymization using Spark for enhanced privacy protection

Abdelmadjid Guessoum Graba, Adil Toumouh


This article presents a solution for anonymizing large-scale sensitive data that addresses the limitations of traditional approaches on very large datasets. Using the Spark distributed computing framework, we propose a method that parallelizes the anonymization process to improve efficiency and scalability. Built on Spark's resilient distributed datasets (RDD), the approach combines two primary operations, Map_RDD and ReduceByKey_RDD, to execute the anonymization tasks. Our experimental evaluation demonstrates the solution's effectiveness and improved performance in preserving privacy while balancing data utility and confidentiality. A significant contribution of the study is that it offers data owners a wide range of candidate solutions: for a 500 MB dataset at an anonymity level of K=100, the method produces 832 unique solutions. The study also opens avenues for future research on applying other privacy models, such as l-diversity and t-closeness, within the Spark ecosystem.
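
The abstract does not spell out how the map and reduce-by-key steps are combined, so the following is only an illustrative sketch of one common pattern, not the authors' implementation. It assumes a k-anonymity check in which the map step generalizes each record's quasi-identifiers and emits a (key, 1) pair, and the reduce-by-key step sums the counts per key. The records, the generalization rules, and the helper names (`generalize`, `reduce_by_key`, `is_k_anonymous`) are all hypothetical. Plain Python stands in for Spark here; in PySpark the same shape would typically be written as `rdd.map(generalize).reduceByKey(lambda a, b: a + b)`.

```python
# Hypothetical records: (age, zip_code, diagnosis). Age and zip code are
# treated as quasi-identifiers; diagnosis is the sensitive attribute.
RECORDS = [
    (34, "13001", "flu"),
    (36, "13002", "asthma"),
    (38, "13009", "flu"),
    (51, "22010", "diabetes"),
    (55, "22011", "flu"),
    (58, "22019", "asthma"),
]


def generalize(record):
    """Map step: emit (generalized quasi-identifiers, 1) for one record.

    The generalization rules (10-year age bands, 3-digit zip prefix) are
    assumptions chosen for illustration only.
    """
    age, zip_code, _sensitive = record
    decade = age // 10 * 10
    key = (f"{decade}-{decade + 9}", zip_code[:3] + "**")
    return key, 1


def reduce_by_key(pairs, func):
    """Local stand-in for Spark's reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def is_k_anonymous(records, k):
    """Return (verdict, group sizes); every group must hold at least k records."""
    counts = reduce_by_key(map(generalize, records), lambda a, b: a + b)
    return all(size >= k for size in counts.values()), counts


ok, counts = is_k_anonymous(RECORDS, 3)
print(ok, counts)
```

Under these assumptions, a generalization level would be accepted as a candidate solution when every equivalence class contains at least K records; repeating the check over different generalization levels is one way a set of alternative solutions could be enumerated.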


Big data privacy; Data anonymization; Parallel computing; Privacy-preserving algorithms; Resilient distributed datasets; Spark distributed computing




This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578

This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).