Submitted: 01 May 2025
Posted: 02 May 2025
Abstract
Keywords:
1. Introduction
2. Background
2.1. Data Portability and Interoperability in the Context of Cloud Computing and Data Protection
2.2. Security by Design
2.3. Infrastructure
2.4. Format Preserving Encryption
2.5. Data Lake Architecture
2.6. Data Ingestion
2.7. Data Governance
3. Methodology
3.1. Design Science
3.1.1. Conceptual Analysis
3.1.2. Solution Design
3.1.3. Design Validation
3.1.4. Implementation
3.2. Systematic Mapping of the Literature
3.2.1. Research Questions
3.2.2. Inclusion and Exclusion Criteria
Inclusion criteria:
- (i) Language: Only studies published in English or Spanish are included, so that the documents are accessible and relevant to the international academic literature.
- (ii) Publication Date: Only works published between 2014 and January 2024 are considered, in order to cover the relevant documents on data protection in Big Data and Data Lake environments.
- (iii) Sources: Only works from scientific journals and conferences are accepted.

Exclusion criteria:
- (i) Study Domain: Works that do not focus on information security in Big Data environments and Data Lake repositories are excluded, keeping the research on the topic of interest.
- (ii) Accessibility: Documents that could not be fully accessed or were not relevant to the analysis were excluded.
- (iii) Duplication: Documents returned by more than one academic search engine were deduplicated, retaining a single copy.
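The criteria listed above amount to a filter over the retrieved records. The following is a minimal sketch of such a filter, not the authors' tooling: the `Record` fields, the language codes, the exact window boundaries (1 January 2014 to 31 January 2024), and the title-based duplicate check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Assumed encodings of the criteria; the paper does not prescribe them.
ALLOWED_LANGUAGES = {"en", "es"}
ALLOWED_SOURCES = {"journal", "conference"}
WINDOW_START = date(2014, 1, 1)
WINDOW_END = date(2024, 1, 31)


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record returned by a search engine."""
    title: str
    language: str
    published: date
    source_type: str
    in_domain: bool            # focuses on security in Big Data / Data Lakes
    full_text_available: bool


def passes_criteria(record: Record) -> bool:
    """Apply the inclusion criteria, then the domain and accessibility exclusions."""
    return (
        record.language in ALLOWED_LANGUAGES
        and WINDOW_START <= record.published <= WINDOW_END
        and record.source_type in ALLOWED_SOURCES
        and record.in_domain
        and record.full_text_available
    )


def select(records: List[Record]) -> List[Record]:
    """Keep the first copy of each accepted record, dropping later duplicates."""
    seen = set()
    kept = []
    for record in records:
        key = record.title.strip().casefold()  # naive duplicate key (assumption)
        if key in seen:
            continue
        if passes_criteria(record):
            seen.add(key)
            kept.append(record)
    return kept
```

In practice, duplicate detection across search engines would more likely rely on DOIs than on titles; the title key is used here only to keep the sketch self-contained.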
3.2.3. Search and Selection Process
3.2.4. Classification Scheme
- Type of Contribution and Approach: The analyzed documents are classified by the type of contribution they make and the approach they adopt. Three contribution categories are identified: (1) Methodology, which encompasses systematic techniques and tools for addressing problems; (2) Method or framework, which provides consistent structures or principles for solving specific problems; and (3) Technique, which covers specific improvements such as algorithms or implementations. A document adopts one of three approaches: (1) Innovative, introducing significant advancements through new ideas, methods, or technologies; (2) Positional, analyzing phenomena from a particular viewpoint in relation to the context and existing practices; or (3) Canon, based on established practices, methods, or accepted standards in the field. Each document may make more than one type of contribution but adopts only one approach.
- Encryption Techniques: In the context of Big Data and Data Lake repositories, the encryption techniques used to protect personal and sensitive data include Advanced Encryption Standard (AES), recognized for its effectiveness and performance; Homomorphic Encryption (HE), which allows operations on encrypted data without the need to decrypt it; Format-Preserving Encryption (FPE), which maintains the original format of the data, facilitating its integration with existing systems; Elliptic-Curve Cryptography (ECC), which stands out for offering a level of security comparable to other traditional cryptographic techniques but with smaller keys, reducing storage and processing requirements, ideal for resource-limited environments; and Attribute-Based Encryption (ABE), which ensures fine-grained encrypted access control to externalized data. Other emerging techniques are also identified, expanding the available options according to the specific requirements presented by the documents.
- Format Requirements: Format requirements for data are classified according to their state. For data in use, the requirements focus on the needs for analysis and machine learning, ensuring that the data can be processed efficiently without fully decrypting it. For data at rest, the requirements focus on the data’s structure, ensuring its correct integration and storage while maintaining its integrity. Finally, for data in transit, the requirements are grouped according to the communication protocols and technologies employed, ensuring secure transmission of data across networks or between systems.
- Other Protection Strategies: This category covers other ways of protecting data in the context of the research: anonymization, which modifies the original data to hide sensitive information and prevent the identification of individuals or entities; access control, which encompasses the policies and mechanisms that determine who can access the data and under what conditions, so that only authorized individuals reach the information; and security audits, which continuously monitor and review data access and usage in order to identify vulnerabilities and ensure compliance with security policies.
- Domain of the Document’s Development: The application domain of the document can be classified into three areas: Industrial, Healthcare, and Academic. The Industrial domain refers to documents where the research focus is developed in the context of an organization or industrial sector. The Healthcare domain refers to research focused on medical data or data protection within the healthcare field. Finally, the Academic domain encompasses documents aimed at presenting general research without a specific focus on the industry or healthcare sector.
- Challenges and Gaps: The challenges and gaps identified in the reviewed documents are classified into four key areas: Costs, which limit the adoption of advanced technologies; Data Standards, which are necessary to ensure interoperability and facilitate information exchange between systems; Security and Regulatory Compliance, which is essential for protecting sensitive data and complying with regulations such as Chilean Law No. 19.628; and Data Management and Analysis, which refers to the challenges of efficiently managing and processing large volumes of data in Big Data and Data Lake contexts. This scheme highlights the most relevant areas for future research and the development of technological solutions.
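The classification scheme can also be read as a record type in which a document carries one or more contribution types but exactly one approach. The sketch below is illustrative only: the names `Contribution`, `Approach`, `Domain`, and `ClassifiedDocument` are hypothetical and do not come from the study's data extraction form.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Contribution(Enum):
    METHODOLOGY = "methodology"
    METHOD_OR_FRAMEWORK = "method or framework"
    TECHNIQUE = "technique"


class Approach(Enum):
    INNOVATIVE = "innovative"
    POSITIONAL = "positional"
    CANON = "canon"


class Domain(Enum):
    INDUSTRIAL = "industrial"
    HEALTHCARE = "healthcare"
    ACADEMIC = "academic"


@dataclass(frozen=True)
class ClassifiedDocument:
    """Hypothetical record for one reviewed document."""
    title: str
    contributions: FrozenSet[Contribution]  # one or more contribution types
    approach: Approach                      # exactly one approach
    domain: Domain
    encryption: FrozenSet[str] = frozenset()  # e.g. {"AES", "FPE"}

    def __post_init__(self) -> None:
        # Enforce the rule stated in the scheme: at least one contribution.
        if not self.contributions:
            raise ValueError("a document needs at least one contribution type")
        if not all(isinstance(c, Contribution) for c in self.contributions):
            raise TypeError("contributions must be Contribution values")
        if not isinstance(self.approach, Approach):
            raise TypeError("approach must be a single Approach value")
```

Modelling `approach` as a single value rather than a set makes the "only one approach" rule a property of the type instead of a check performed during analysis.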
4. Proposal
- Unsecured Path: Represented in Figure 3, this path gives access to the data without encryption. It is reserved exclusively for extraordinary cases, such as requests from entities with superior authority for judicial, law enforcement, or legal compliance purposes.
- Secured Path: Represented in Figure 4, this path masks the data with Format-Preserving Encryption (FPE) and transforms it into the Delta Lake format. This allows controlled and secure data consumption under the supervision of Data Stewards, who regulate access and ensure compliance with security policies.
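To illustrate what "format preserving" means for the Secured Path, the sketch below masks the digits of a value while leaving its length and separators untouched. It is a toy Feistel construction over decimal digits with HMAC-SHA256 as the round function, written only for illustration: it is not the NIST FF1 or FF3-1 algorithm, it is not the scheme used in the proposal, and it should not be used to protect real data. All function names and the sample identifier are hypothetical.

```python
import hashlib
import hmac

ROUNDS = 8  # even, so both halves end with their original lengths


def _round_value(key: bytes, round_index: int, half: str) -> int:
    """Keyed pseudo-random integer derived from the round number and one half."""
    message = bytes([round_index]) + half.encode("ascii")
    return int.from_bytes(hmac.new(key, message, hashlib.sha256).digest(), "big")


def _check(digits: str) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two ASCII digits")


def encrypt_digits(key: bytes, digits: str) -> str:
    """Map a digit string to another digit string of the same length."""
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else v
        c = (int(a) + _round_value(key, i, b)) % 10**m
        a, b = b, str(c).zfill(m)
    return a + b


def decrypt_digits(key: bytes, digits: str) -> str:
    """Invert encrypt_digits by running the rounds backwards."""
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _round_value(key, i, a)) % 10**m
        a, b = str(c).zfill(m), a
    return a + b


def mask_value(key: bytes, value: str, decrypt: bool = False) -> str:
    """Transform only the digits of value; separators stay where they are."""
    positions = [i for i, ch in enumerate(value) if ch.isascii() and ch.isdigit()]
    digits = "".join(value[i] for i in positions)
    transform = decrypt_digits if decrypt else encrypt_digits
    out = list(value)
    for position, ch in zip(positions, transform(key, digits)):
        out[position] = ch
    return "".join(out)


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"
    original = "12.345.678-5"  # made-up identifier layout
    masked = mask_value(demo_key, original)
    print(masked)
    print(mask_value(demo_key, masked, decrypt=True))
```

Because the masked value keeps the same length and punctuation as the input, downstream schemas and validations that expect that layout keep working, which is the property the Secured Path relies on.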

4.1. Ingestion Layer
4.2. Persistence Layer
4.3. Data Access Layer
4.4. Consumer Layer
5. Results
5.1. Systematic Mapping Results
5.1.1. Data Extraction and Mapping
5.1.2. Analysis and Discussion
5.2. Survey Results
5.2.1. Participant Profile
5.2.2. Usability Assessment by Role
5.2.3. Quality Assessment by Role
5.2.4. Analysis of Outlier Responses in Usability Evaluation
6. Discussion
6.1. Analysis of Outlier Responses Regarding Protocol Usability
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| ABE | Attribute-Based Encryption |
| AES | Advanced Encryption Standard |
| BD | Big Data |
| BI | Business Intelligence |
| CISO | Chief Information Security Officer |
| DL | Data Lake |
| ECC | Elliptic-Curve Cryptography |
| FPE | Format-Preserving Encryption |
| GDPR | General Data Protection Regulation |
| HE | Homomorphic Encryption |
| IaaS | Infrastructure as a Service |
| IT | Information Technology |
| SBD | Secure by Design |
| SRA | Software Reference Architecture |
| SUS | System Usability Scale |
Appendix A

References
- Chen, J.; Wang, H. Guest Editorial: Big Data Infrastructure I. IEEE Trans. Big Data 2018, 4, 148–149. [CrossRef]
- Rawat, R.; Yadav, R. Big data: Big data analysis, issues and challenges and technologies. In Proceedings of the IOP Conference Series: Materials Science and Engineering. IOP Publishing, 2021, Vol. 1022, p. 012014.
- Lagos, J.; San Martin, D.; Aillapán, G. BDS-Analytics: Towards a PySpark Library for a Preliminary Exploratory Big Data Analysis. In Proceedings of the Developments and Advances in Defense and Security; Rocha, Á.; Vaseashta, A., Eds., Singapore, 2025; pp. 369–379.
- Panwar, A.; Bhatnagar, V. Data lake architecture: a new repository for data engineer. International Journal of Organizational and Collective Intelligence (IJOCI) 2020, 10, 63–75.
- Guamán, M.A.; Vaca, M.N.; Salazar, K.V.; Yuquilema, J.B. Systematic mapping of literature of a data lake. mktDESCUBRE 2018, 1, 50–66.
- Moreno, J.; Fernandez, E.B.; Serrano, M.A.; Fernandez-Medina, E. Secure development of big data ecosystems. IEEE Access 2019, 7, 96604–96619. [CrossRef]
- Gupta, S.; Jain, S.; Agarwal, M. Ensuring data security in databases using format preserving encryption. In Proceedings of the 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2018, pp. 1–5.
- Kumar, D.; Li, S. Separating storage and compute with the databricks lakehouse platform. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2022, pp. 1–2.
- Mouratidis, H.; Kang, M. Secure by Design: Developing Secure Software Systems from the Ground Up. Int. J. Secur. Softw. Eng. 2011, 2, 23–41.
- Shirtz, D.; Koberman, I.; Elyashar, A.; Puzis, R.; Elovici, Y. Enhancing Energy Sector Resilience: Integrating Security by Design Principles. ArXiv 2024, abs/2402.11543.
- Awaysheh, F.M.; Aladwan, M.N.; Alazab, M.; Alawadi, S.; Cabaleiro, J.C.; Pena, T.F. Security by design for big data frameworks over cloud computing. IEEE Transactions on Engineering Management 2021, 69, 3676–3693. [CrossRef]
- Bellare, M.; Ristenpart, T.; Rogaway, P.; Stegers, T. Format-preserving encryption. In Proceedings of the Selected Areas in Cryptography: 16th Annual International Workshop, SAC 2009, Calgary, Alberta, Canada, August 13-14, 2009, Revised Selected Papers 16. Springer, 2009, pp. 295–312.
- Weiss, M.; Rozenberg, B.; Barham, M. Practical solutions for format-preserving encryption. arXiv preprint arXiv:1506.04113 2015.
- Cui, B.; Zhang, B.; Wang, K. A data masking scheme for sensitive big data based on format-preserving encryption. In Proceedings of the 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC). IEEE, 2017, Vol. 1, pp. 518–524.
- Wu, M.; Huang, J. A Scheme of Relational Database Desensitization Based on Paillier and FPE. In Proceedings of the 2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI). IEEE, 2021, pp. 374–378.
- Wieringa, R. Design science as nested problem solving. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, 2009, pp. 1–12.
- Wieringa, R.J. Design science methodology for information systems and software engineering; Springer, 2014.
- Wohlfarth, M. Data Portability on the Internet: An Economic Analysis. In Proceedings of the International Conference on Interaction Sciences, 2017.
- Wohlfarth, M. Data Portability on the Internet. Business & Information Systems Engineering 2019, 61, 551–574.
- Bozman, J.; Chen, G. Cloud computing: The need for portability and interoperability. IDC Executive Insights 2010, pp. 74–75.
- Huth, D.; Stojko, L.; Matthes, F. A Service Definition for Data Portability. In Proceedings of the International Conference on Enterprise Information Systems, 2019.
- Kadam, S.P.; Joshi, S.D. Secure by design approach to improve security of object oriented software. 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom) 2015, pp. 24–30.
- Kern, C. Secure by Design at Google. Technical report, Google Security Engineering, 2024.
- Arostegi, M.; Torre-Bastida, A.I.; Bilbao, M.N.; Ser, J.D. A heuristic approach to the multicriteria design of IaaS cloud infrastructures for Big Data applications. Expert Systems 2018, 35.
- Megahed, M.E.; Badry, R.M.; Gaber, S.A. Survey on Big Data and Cloud Computing: Storage Challenges and Open Issues. In Proceedings of the 2023 4th International Conference on Communications, Information, Electronic and Energy Systems (CIEES). IEEE, 2023, pp. 1–6.
- Zagan, E.; Danubianu, M. Cloud DATA LAKE: The new trend of data storage. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA). IEEE, 2021, pp. 1–4.
- Dworkin, M. Recommendation for Block Cipher Modes of Operation: Methods and Techniques; NIST Special Publication 800-38A, 2001.
- Konduru, S.S.; Saraswat, V. Privacy preserving records sharing using blockchain and format preserving encryption. Cryptology ePrint Archive 2023.
- Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. Journal of Intelligent Information Systems 2021, 56, 97–120. [CrossRef]
- Giebler, C.; Gröger, C.; Hoos, E.; Schwarz, H.; Mitschang, B. Leveraging the data lake: current state and challenges. In Proceedings of the Big Data Analytics and Knowledge Discovery: 21st International Conference, DaWaK 2019, Linz, Austria, August 26–29, 2019, Proceedings 21. Springer, 2019, pp. 179–188.
- Madsen, M. How to Build an enterprise data lake: important considerations before jumping in. Third Nature Inc 2015, pp. 13–17.
- Gupta, S.; Giri, V. Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake; Apress, 2018.
- Lagos, J.; Cravero, A. Process Formalization Proposal for Data Ingestion in a Data Lake. In Proceedings of the 2022 41st International Conference of the Chilean Computer Science Society (SCCC). IEEE, 2022, pp. 1–8.
- Anisetti, M.; Ardagna, C.A.; Braghin, C.; Damiani, E.; Polimeno, A.; Balestrucci, A. Dynamic and scalable enforcement of access control policies for big data. In Proceedings of the 13th International Conference on Management of Digital EcoSystems, 2021, pp. 71–78.
- Quinto, B.; Quinto, B. Big data governance and management. Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark 2018, pp. 495–506.
- Muñoz, A.P.; Martí, L.; Sánchez-Pi, N. Data Governance, a Knowledge Model Through Ontologies. In Proceedings of the Congreso Internacional de Tecnologías e Innovación, 2021.
- Mahanti, R. Data Governance Implementation – Critical Success Factors. Software Quality Professional Magazine 2018, 20.
- Saed, K.A.; Aziz, N.A.; Ramadhani, A.W.; Hassan, N.H. Data Governance Cloud Security Assessment at Data Center. 2018 4th International Conference on Computer and Information Sciences (ICCOINS) 2018, pp. 1–4.
- N. Maniam, J.; Singh, D. Towards data privacy and security framework in big data governance. International Journal of Software Engineering and Computer Systems 2020.
- Liu, W. How Data Security Could Be Achieved in The Process of Cloud Data Governance? In Proceedings of the 2022 2nd International Conference on Management Science and Software Engineering (ICMSSE 2022). Atlantis Press, 2022, pp. 114–120.
- Dingre, S.S. Exploration of Data Governance Frameworks, Roles, and Metrics for Success. Journal of Artificial Intelligence & Cloud Computing 2023.
- Khatri, V.; Brown, C.V. Designing data governance. Communications of the ACM 2010, 53, 148–152.
- Petersen, K.; Feldt, R.; Mujtaba, S.; Mattsson, M. Systematic mapping studies in software engineering. In Proceedings of the 12th international conference on evaluation and assessment in software engineering (EASE). BCS Learning & Development, 2008.
- Sommerville, I. Software Engineering, 10th ed.; 2015.
- Steurer, J. The Delphi method: an efficient procedure to generate knowledge. Skeletal Radiology 2011, 40, 959–961. https://doi.org/10.1007/s00256-011-1145-z. [CrossRef]
- Nadal, S.; Herrero, V.; Romero, O.; Abelló, A.; Franch, X.; Vansummeren, S.; Valerio, D. A software reference architecture for semantic-aware Big Data systems. Information and software technology 2017, 90, 75–92. [CrossRef]
- Brooke, J. SUS: A ’Quick and Dirty’ Usability Scale. In Usability Evaluation In Industry; CRC Press, 1996; pp. 207–212. https://doi.org/10.1201/9781498710411-35.
- Lagos, J.; Cravero, A. Reference architecture for data ingestion in Data Lake. In Proceedings of the 2023 18th Iberian Conference on Information Systems and Technologies (CISTI). IEEE, 2023, pp. 1–9.
- Bangor, A.; Kortum, P.; Miller, J. Determining what individual SUS scores mean. Journal of Usability Studies 2009. https://doi.org/10.5555/2835587.2835589. [CrossRef]
- Panwar, A.; Bhatnagar, V. A cognitive approach for blockchain-based cryptographic curve hash signature (BC-CCHS) technique to secure healthcare data in Data Lake. Soft Computing 2021, p. 1. [CrossRef]
- Rieyan, S.A.; News, M.R.K.; Rahman, A.M.; Khan, S.A.; Zaarif, S.T.J.; Alam, M.G.R.; Hassan, M.M.; Ianni, M.; Fortino, G. An advanced data fabric architecture leveraging homomorphic encryption and federated learning. Information Fusion 2024, 102, 102004. [CrossRef]
- Yeng, P.K.; Diekuu, J.B.; Abomhara, M.; Elhadj, B.; Yakubu, M.A.; Oppong, I.N.; Odebade, A.; Fauzi, M.A.; Yang, B.; El-Gassar, R. HEALER2: A Framework for Secure Data Lake Towards Healthcare Digital Transformation Efforts in Low and Middle-Income Countries. In Proceedings of the 2023 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC). IEEE, 2023, pp. 1–9.
- Shang, X.; Subenderan, P.; Islam, M.; Xu, J.; Zhang, J.; Gupta, N.; Panda, A. One stone, three birds: Finer-grained encryption with apache parquet@ large scale. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data). IEEE, 2022, pp. 5802–5811.
- Hamadou, H.B.; Pedersen, T.B.; Thomsen, C. The danish national energy data lake: Requirements, technical architecture, and tool selection. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data). IEEE, 2020, pp. 1523–1532.
- Revathy, P.; Mukesh, R. Analysis of big data security practices. In Proceedings of the 2017 3rd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT). IEEE, 2017, pp. 264–267.
- Rawat, D.B.; Doku, R.; Garuba, M. Cybersecurity in big data era: From securing big data to data-driven security. IEEE Transactions on Services Computing 2019, 14, 2055–2072. [CrossRef]
- Zhao, X.; Zhang, C.; Guan, S. A data lake-based security transmission and storage scheme for streaming big data. Cluster Computing 2024, 27, 4741–4755. [CrossRef]
- Kai, L.; Liang, Z.; Yaojing, Y.; Dazhu, Y.; Min, Z. Research on Federated Learning Data Management Method Based on Data Lake Technology. In Proceedings of the 2023 International Conference on Computers, Information Processing and Advanced Education (CIPAE). IEEE, 2023, pp. 385–390.
- Ancán, O.; Reyes, M. Cabuplot: Categorical Bubble Plot for systematic mapping studies, 2020.

| Question ID | Question Text | Response Type |
|---|---|---|
| SQ1.1 | Regardless of whether you were a graduate or not, since approximately what date have you been working in IT? | Short answer (date) |
| SQ1.2 | Regardless of whether you were a graduate or not, since approximately what date have you been working in Big Data? | Short answer (date) |
| SQ1.3 | Regardless of whether you were a graduate or not, since approximately what date have you been working with Data Lake? | Short answer (date) |
| SQ1.4 | Which role within the company most closely matches your functions? | Multiple choice |

| Question ID | Statement |
|---|---|
| SQ2.1 | I think that I would like to use this protocol frequently. |
| SQ2.2 | I found the protocol unnecessarily complex. |
| SQ2.3 | I thought the protocol was easy to use. |
| SQ2.4 | I think that I would need the support of a technical person to be able to use this protocol. |
| SQ2.5 | I found the various functions of this protocol were well integrated. |
| SQ2.6 | I thought there was too much inconsistency in this protocol. |
| SQ2.7 | I would imagine that most people would learn to use this protocol very quickly. |
| SQ2.8 | I found the protocol very cumbersome to use. |
| SQ2.9 | I felt very confident using the protocol. |
| SQ2.10 | I needed to learn a lot of things before I could get going with this protocol. |
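
The ten statements above follow the System Usability Scale (SUS), whose standard scoring alternates between positively worded (odd) and negatively worded (even) items. The sketch below shows that standard calculation, assuming each response is coded from 1 (strongly disagree) to 5 (strongly agree); the coding and the function name `sus_score` are assumptions, since the table itself does not state the response scale.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Return the SUS score (0 to 100) for one participant.

    responses holds the answers to SQ2.1 through SQ2.10 in order,
    each coded 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

A participant who answers 3 to every item lands exactly at the midpoint of the scale, which is a quick sanity check for any implementation.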

| Question ID | Quality Attribute | Statement |
|---|---|---|
| SQ3.1 | Usefulness | The presented protocol would be useful in my work. |
| SQ3.2 | Satisfaction | Overall, I feel satisfied with the presented protocol. |
| SQ3.3 | Trust | I would trust the protocol to handle my work with sensitive data. |
| SQ3.4 | Perceived Relative Benefit | Using the proposed protocol would be an improvement with respect to my current way of handling and analyzing sensitive data. |
| SQ3.5 | Functional Completeness | In general, the proposed protocol covers the needs of my work. |
| SQ3.6 | Functional Appropriateness | The proposed protocol facilitates the management of the work with sensitive data. |
| SQ3.7 | Willingness to Adopt | I would like to adopt the protocol in my work. |

| ID | Research Question |
|---|---|
| RQ1 | What types of contributions are found in the selected documents? |
| RQ2 | What encryption techniques are used in Big Data tools for processing personal and sensitive data? |
| RQ3 | What encryption techniques are applied to data in Data Lake repositories? |
| RQ4 | What data format requirements are applied to data in Data Lake repositories? |
| RQ5 | What other strategies for protecting personal and sensitive data are found in the selected documents? |
| RQ6 | In which industry domains is personal and sensitive data protection applied? |
| RQ7 | What types of challenges and gaps are presented for future work in the reviewed documents? |

| Search Engine | Results After Query | Results After Inclusion/Exclusion Criteria |
|---|---|---|
| Scopus | 533 | 5 |
| WoS | 10 | 1 |
| IEEE | 9 | 2 |
| ACM | 27 | 1 |

| Company Position | Quantity | Exp. in IT (yrs) | Exp. in BD (yrs) | Exp. in DL (yrs) |
|---|---|---|---|---|
| Big Data Consultant | 13 | 5.47 | 1.87 | 1.77 |
| Director | 1 | 16.00 | 16.00 | 3.50 |
| In Training | 4 | 0.47 | 0.41 | 0.38 |
| Web and App Developer | 6 | 1.17 | 0.64 | 0.58 |
| Technical Lead | 2 | 12.68 | 7.18 | 5.39 |
| BI Consultant | 2 | 9.86 | 3.26 | 3.26 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).