Submitted: 18 February 2024
Posted: 19 February 2024
Abstract
Keywords:
1. Introduction
- distributed data domains;
- data-as-a-product;
- self-serve data infrastructure;
- federated computational governance.
- definition, access, and collaboration on schemas, semantic knowledge, lineage, etc.;
- tools to automatically enforce the security, transparency, legal policies, etc.
2. Background
2.1. Data Mesh Principles Explained
2.1.1. Distributed Data Domains
2.1.2. Data-as-a-Product
2.1.3. Shared Data Infrastructure
2.1.4. Federated Computational Data Governance
2.2. Running Example
2.3. Related works
3. Data Mesh Governance: Requirements and Challenges
3.1. Properties of Data Governance System
- Semantic Enrichment (SE) provides descriptive information that defines the meaning of the underlying data and facilitates product comprehension and utilisation. At the basic level, semantic tagging helps to relate different data sources, e.g. 'Financial' data, 'Medical' data, etc. At a more advanced level, the use of semantic web technologies and knowledge graphs is envisioned, which can provide automatic product descriptions, data linking, relationship discovery, etc.
- Data Indexing (DI) is a core element for efficient metadata querying and navigation. Almost all modern data storage providers have indexing functionality that enables fast data look-ups. In the data mesh case, the metadata system should provide search capabilities to retrieve data product information based on a diverse set of user requests: keywords, history, ownership, usage patterns, etc.
- Link Generation (LG) helps to identify and integrate related information. For instance, identical semantic tags help to cluster the products into groups. Data lineage traces the history of upstream dependencies and enables user alerts or automatic product suspension in case of breaking changes. Product linking can also include similarity measurements, which extend product clustering with a degree of relevance.
- Data Polymorphism (DP), or polyglot data, concerns the operation and management of multiple product facets. Since in a data mesh data is consumed via connecting ports, the same product can have multiple consumers as well as multiple forms. For instance, real-time analytics may require a streaming data interface, while monthly reports do not need such time precision. In addition, the data can vary in quality, representation (tabular data or a vector), maturity, etc.
- Data Versioning (DV) describes the management of metadata updates within the system while retaining the previous metadata states. It is highly relevant since it ensures system state recovery, reproducibility, and auditing. Moreover, versioning allows branching and enables backward compatibility and parallel data product evolution.
- Usage Tracking (UT) consists of providing and tracking access to the underlying data resources. It is the backbone for implementing data security and protection. Data security comprises multiple elements, such as information and communication encryption, firewalls, etc. From the metadata management perspective, it is important to keep records of product access, user identities, usage patterns, locations, time schedules, etc., in order to detect and prevent unauthorized activities.
- Computational Policies (CP) play a vital role in automatic governance execution. Beyond access control enforcement, they also enable verification of data quality, consistency, and uniqueness, lifecycle management, service contract tests, etc. This reflects the need to define rules at each level, global and local, which are then applied in the mesh. Such governance execution also requires an appropriate platform infrastructure.
- Independently Deployable (ID) products, in the context of micro-service architecture, provide the opportunity to update the provided services without interrupting the overall system (e.g. canary deployment). In the context of data mesh, this means the option to deploy new data products without affecting the established data consumption. This requirement also applies to metadata registration and policy updates. Ideally, new information and rules should not interrupt the existing data flows unless this is specifically intended by the domain owners.
- Automatically Testable (AT) platform design ensures the correctness of the future system state upon module upgrades. For instance, when implementing a new data resolution module, e.g. a transition from IPv4 to IPv6, the address space and links of the old resources should continue to work. To be sure that the introduction of a new module will not break the operation of the system, automatic tests and verification must be performed as if the upgrade had already taken place, while in reality the old system keeps functioning.
- Contract Management (CM) provides a way to negotiate, to participate in, and to ensure the correctness of the delivered data products. It usually includes the outlined service level objectives and agreements, covering the quality of data, schema, update frequency, intended purposes of use, etc. As a part of the platform governance, it overlaps with the metadata management and computational execution modules.
- Product Compositionality (PC) helps to speed up product development and to prevent dataflow interruption. Automatic verification of contract composition enables advanced interoperability features and helps to prevent unauthorized behaviour, e.g. recovery of PII. In cases of schema composition, it automatically enriches the existing products or prevents the introduction of breaking changes.
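Two of the properties listed above lend themselves to a compact illustration. The following is a minimal sketch under stated assumptions, not the implementation described in this paper: it assumes a hypothetical `ProductMeta` record, uses Jaccard similarity over semantic tags as one possible "relevance degree" for Link Generation, and walks lineage edges to find the products that would need an alert after a breaking change. All names and sample products are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductMeta:
    """Hypothetical metadata record for one data product."""
    name: str
    tags: frozenset
    upstream: tuple = ()  # names of the products this one is derived from


def tag_similarity(a: ProductMeta, b: ProductMeta) -> float:
    """Jaccard similarity of two tag sets: |A & B| / |A | B|."""
    union = a.tags | b.tags
    return len(a.tags & b.tags) / len(union) if union else 0.0


def downstream_of(changed: str, catalog: list) -> set:
    """Names of all products that depend on `changed`, directly or transitively."""
    affected = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for product in catalog:
            if current in product.upstream and product.name not in affected:
                affected.add(product.name)
                frontier.append(product.name)
    return affected


# Illustrative catalog: claims -> invoices -> monthly_report
catalog = [
    ProductMeta("claims", frozenset({"Financial", "Medical"})),
    ProductMeta("invoices", frozenset({"Financial"}), upstream=("claims",)),
    ProductMeta("monthly_report", frozenset({"Financial", "Report"}), upstream=("invoices",)),
]

# Products to alert (or suspend) if "claims" introduces a breaking change
to_alert = downstream_of("claims", catalog)
```

Tag similarity here is only one candidate measure; a knowledge-graph based approach, as envisioned for Semantic Enrichment, would replace `tag_similarity` without changing the lineage traversal.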
3.2. Challenges of Federated Data Governance
4. Introducing Blockchain-Powered Mesh Platform
4.1. Defining the Data Mesh Governance Types
4.1.1. Type I - Centralized Metadata Repository
- is a set of data products and is a set of product versions
- is a set of metadata records and is a set of metadata versions
- is a set of data contracts
- is a set of system users
- is a function returning a permissions map for any given pair
- is a function returning the contract for any given pair of data products or an empty set if it does not exist
- is a function returning the metadata description of a data product
- is a function returning a visibility map for any given pair of user and metadata
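The sets and functions listed above can be read as the interface of a single repository object. The sketch below is an assumption-laden illustration, not the formal model itself: the class name, attribute names and sample data are hypothetical, the permissions function is assumed to take a (user, product) pair, the contract function is assumed to be directional (provider, consumer), and an empty `dict` stands in for the empty set.

```python
from dataclasses import dataclass, field


@dataclass
class CentralRepository:
    """Type I sketch: a single repository owns every set and lookup function."""
    products: set = field(default_factory=set)
    users: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)   # product -> metadata description
    contracts: dict = field(default_factory=dict)  # (provider, consumer) -> contract
    grants: dict = field(default_factory=dict)     # (user, product) -> permissions map
    hidden: dict = field(default_factory=dict)     # (user, product) -> hidden metadata fields

    def permissions(self, user: str, product: str) -> dict:
        """Permissions map for a (user, product) pair; empty if nothing was granted."""
        return dict(self.grants.get((user, product), {}))

    def contract(self, provider: str, consumer: str) -> dict:
        """Contract between two products, or an empty dict if none exists."""
        return self.contracts.get((provider, consumer), {})

    def describe(self, product: str) -> dict:
        """Metadata description of a data product."""
        return self.metadata[product]

    def visibility(self, user: str, product: str) -> dict:
        """Visibility map: metadata field -> whether `user` may see it."""
        hidden = self.hidden.get((user, product), set())
        return {key: key not in hidden for key in self.metadata[product]}


# Illustrative data only
repo = CentralRepository()
repo.products |= {"orders", "orders_report"}
repo.users.add("alice")
repo.metadata["orders"] = {"schema": ["id", "amount"], "owner": "sales"}
repo.grants[("alice", "orders")] = {"read": True, "write": False}
repo.hidden[("alice", "orders")] = {"owner"}
repo.contracts[("orders", "orders_report")] = {"update_frequency": "daily"}
```

Because every lookup goes through one object, the repository is also a single point of control and of failure, which is the defining trait of the centralized type.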
4.1.2. Type II - Federated Metadata Repository
- is a set of unified governing policies
- is a set of metadata nodes that host the replicas
- is a set of metadata repository replicas
- is a set of methods that enforce the policies or contracts
- is a function returning a global policy b or a data contract c that is being enforced by the method t on a given node n
- is a function returning the consistency or synchronization state for any two given replicas.
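To make the last two functions of the federated type concrete, the sketch below shows one way they could be realised. It is an illustration under assumptions: replicas are modelled as plain dictionaries, consistency is reduced to a boolean "identical content" check via a SHA-256 digest of a canonical JSON encoding, and the method names, node names and rule labels are hypothetical.

```python
import hashlib
import json


def replica_digest(replica: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(replica, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def in_sync(replica_a: dict, replica_b: dict) -> bool:
    """Consistency state for two replicas: True when their contents are identical."""
    return replica_digest(replica_a) == replica_digest(replica_b)


# enforcement[(method, node)] -> the global policy or data contract applied there
enforcement = {
    ("access_check", "node-eu"): "policy:mask-pii",
    ("schema_test", "node-us"): "contract:orders->orders_report",
}


def enforced_rule(method: str, node: str):
    """Rule enforced by `method` on `node`, or None if the method does not run there."""
    return enforcement.get((method, node))


# Illustrative replicas hosted on different nodes
replica_eu = {"orders": {"version": 2, "owner": "sales"}}
replica_us = {"orders": {"owner": "sales", "version": 2}}
replica_stale = {"orders": {"version": 1, "owner": "sales"}}
```

Comparing digests rather than full contents means nodes only need to exchange a short hash to detect divergence; a richer synchronization state (e.g. which replica is ahead) would need version information on top of this.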
4.1.3. Type III - Decentralized Metadata Repository
- with being the repository associated with and
- is a set of all data domains
- is a function establishing the link presence or absence between a pair of metadata records that belong to different repositories and respectively.
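The link function of the decentralized type can be sketched as a small registry of cross-repository references. This is a hypothetical illustration: `RecordRef` and `LinkRegistry` are invented names, each domain is assumed to own exactly one repository, and links are assumed to be symmetric, which the definition above does not state.

```python
from typing import NamedTuple


class RecordRef(NamedTuple):
    """Reference to a metadata record inside the repository of one domain."""
    domain: str
    record_id: str


class LinkRegistry:
    """Tracks which records in different domain repositories are linked."""

    def __init__(self, domains):
        self.domains = set(domains)
        self._links = set()  # unordered pairs, stored as frozensets

    def add_link(self, a: RecordRef, b: RecordRef) -> None:
        if a.domain == b.domain:
            raise ValueError("links connect records from different repositories")
        if not {a.domain, b.domain} <= self.domains:
            raise KeyError("unknown domain")
        self._links.add(frozenset((a, b)))

    def linked(self, a: RecordRef, b: RecordRef) -> bool:
        """Link presence (True) or absence (False) for a pair of records."""
        return frozenset((a, b)) in self._links


# Illustrative domains and records
registry = LinkRegistry({"sales", "finance"})
orders = RecordRef("sales", "orders")
ledger = RecordRef("finance", "ledger")
budget = RecordRef("finance", "budget")
registry.add_link(orders, ledger)
```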
4.2. Using Blockchain for Policy and Metadata Management
4.3. Advantages of Hyperledger Fabric
4.4. The Fybrik Data Platform
5. Implementing Federated Data Mesh Governance
5.1. System Architecture
- createAsset is used for registering a new product, e.g. when the user executes the notebook and the corresponding workload persists the data;
- getAssetInfo returns the metadata information which is used for product discovery, workload processing, etc.;
- updateAsset updates any existing metadata record within the catalog;
- deleteAsset is used for deleting the metadata about the product.
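The behaviour of these four operations can be sketched with an in-memory stand-in. This is not the catalog's actual code: the class, the Python-style method names (mirroring createAsset, getAssetInfo, updateAsset and deleteAsset) and the append-only log are assumptions chosen to mimic how a ledger-backed catalog retains history.

```python
import copy


class MetadataCatalog:
    """In-memory stand-in for the catalog; every change is appended to a log."""

    def __init__(self):
        self._state = {}  # asset_id -> current metadata
        self._log = []    # append-only history of (operation, asset_id, metadata)

    def create_asset(self, asset_id, metadata):
        """Register a new product; refuses to overwrite an existing one."""
        if asset_id in self._state:
            raise KeyError(f"asset {asset_id!r} already exists")
        self._state[asset_id] = copy.deepcopy(metadata)
        self._log.append(("create", asset_id, copy.deepcopy(metadata)))

    def get_asset_info(self, asset_id):
        """Return a copy of the current metadata of a product."""
        return copy.deepcopy(self._state[asset_id])

    def update_asset(self, asset_id, changes):
        """Merge `changes` into the existing metadata record."""
        if asset_id not in self._state:
            raise KeyError(f"asset {asset_id!r} does not exist")
        self._state[asset_id].update(copy.deepcopy(changes))
        self._log.append(("update", asset_id, copy.deepcopy(self._state[asset_id])))

    def delete_asset(self, asset_id):
        """Remove the current record; the history in the log is kept."""
        del self._state[asset_id]
        self._log.append(("delete", asset_id, None))

    def history(self, asset_id):
        """All logged operations for one asset, oldest first."""
        return [entry for entry in self._log if entry[1] == asset_id]


# Illustrative usage
catalog = MetadataCatalog()
catalog.create_asset("orders", {"schema": ["id"], "tags": ["Financial"]})
catalog.update_asset("orders", {"tags": ["Financial", "Sales"]})
info = catalog.get_asset_info("orders")
catalog.delete_asset("orders")
operations = [op for op, _, _ in catalog.history("orders")]
```

Keeping the log separate from the current state is what allows deleted or overwritten metadata to remain auditable, in line with the Data Versioning property.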
5.2. The Data Product Quantum
- the product Metadata - schema, semantic tags, lineage, etc;
- Notebook code for the data manipulation which is deployed to workload computing nodes (created automatically by Fybrik);
- FybrikApplication specification defining the source and destination identifiers and the intended operations, provided in YAML format;
- Policies define the product usage rules, e.g. what is allowed to which consumers.
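The four parts of the product quantum can be pictured as fields of one record. The sketch below is a hypothetical illustration: the `DataProduct` class and its field names are assumptions, the FybrikApplication document is kept as opaque text because its contents are not reproduced here, and policies are simplified to a map from consumer to allowed actions.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical container for the four parts of a data product quantum."""
    name: str
    metadata: dict         # schema, semantic tags, lineage, etc.
    notebook_code: str     # code deployed to the workload computing nodes
    application_spec: str  # FybrikApplication document, kept as opaque YAML text
    policies: dict = field(default_factory=dict)  # consumer -> set of allowed actions

    def is_allowed(self, consumer: str, action: str) -> bool:
        """Usage rule check: is `action` allowed to `consumer`?"""
        return action in self.policies.get(consumer, set())


# Illustrative product; all values are placeholders
product = DataProduct(
    name="orders_report",
    metadata={"schema": ["month", "total"], "tags": ["Financial"], "lineage": ["orders"]},
    notebook_code="# notebook cells would go here",
    application_spec="# FybrikApplication YAML document (contents omitted)",
    policies={"finance": {"read"}},
)
```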
- request the data asset access, e.g. based on the protocol outlined in [16];
- when the request is approved, the provider’s metadata and policy are used to form a new data contract;
- submit the Notebook and FybrikApplication documents for provisioning a new computing process;
- register a new data product in the catalog by providing asset’s metadata and policies.
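The four steps above can be sketched as two small functions. This is a simplified illustration, not the protocol of [16]: the function names, the dictionary layout of the contract, the `approve` callback and the `submit` callback (standing in for the submission of the Notebook and FybrikApplication documents) are all assumptions.

```python
def request_access(provider, consumer, approve):
    """Request access; on approval, form a contract from the provider's metadata and policy."""
    if not approve(provider, consumer):
        return None
    return {
        "provider": provider["name"],
        "consumer": consumer,
        "schema": provider["metadata"]["schema"],
        "allowed": sorted(provider["policies"].get(consumer, [])),
    }


def publish_product(catalog, contract, name, metadata, policies, submit):
    """Provision the workload, then register the new product in the catalog."""
    if contract is None:
        raise PermissionError("access request was not approved")
    submit(name)  # stand-in for submitting the Notebook and FybrikApplication documents
    catalog[name] = {
        "metadata": metadata,
        "policies": policies,
        "derived_from": contract["provider"],
    }
    return catalog[name]


# Illustrative run: the "analytics" consumer derives a report from "orders"
provider = {
    "name": "orders",
    "metadata": {"schema": ["id", "amount"]},
    "policies": {"analytics": {"read"}},
}
catalog = {}
submitted = []


def approve_if_listed(prov, cons):
    """Toy approval rule: approve consumers that appear in the provider's policies."""
    return cons in prov["policies"]


contract = request_access(provider, "analytics", approve_if_listed)
publish_product(
    catalog, contract, "orders_report",
    {"schema": ["month", "total"]}, {"finance": {"read"}},
    submit=submitted.append,
)
```

Recording `derived_from` at registration time is one simple way to feed the lineage information that the catalog later uses for product discovery.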
6. Contribution Discussion
7. Conclusions and Further Research
References
- Miloslavskaya, N.; Tolstoy, A. Big data, fast data and data lake concepts. 7th annual international conference on biologically inspired cognitive architectures (BICA 2016). Procedia Computer Science, 2016.
- Inmon, W.; Strauss, D.; Neushloss, G. DW 2.0: The architecture for the next generation of data warehousing; Elsevier, 2010.
- Madera, C.; Laurent, A. The next information architecture evolution: the data lake wave. Proceedings of the 8th international conference on management of digital ecosystems, 2016, pp. 174–180.
- Armbrust, M.; Ghodsi, A.; Xin, R.; Zaharia, M. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. Proceedings of CIDR, 2021, Vol. 8.
- Dehghani, Z. Data Mesh: Delivering Data-Driven Value at Scale; O’Reilly Media, Inc., 2022.
- Evans, E. Domain-driven design: tackling complexity in the heart of software; Addison-Wesley Professional, 2004.
- Driessen, S.W.; Monsieur, G.; Van Den Heuvel, W.J. Data market design: a systematic literature review. IEEE Access 2022, 10, 33123–33153.
- DAMA-International. DAMA-DMBOK: Data Management Body of Knowledge; Technics Publications, 2017.
- Araújo Machado, I.; Costa, C.; Santos, M.Y. Advancing Data Architectures with Data Mesh Implementations. International Conference on Advanced Information Systems Engineering. Springer, 2022, pp. 10–18.
- Wider, A.; Verma, S.; Akhtar, A. Decentralized data governance as part of a data mesh platform: Concepts and approaches. 2023 IEEE International Conference on Web Services (ICWS). IEEE, 2023, pp. 746–754.
- Goedegebuure, A. Data Mesh: Systematic Gray Literature Study, Reference Architecture, and Cloud-based Instantiation at ASML. Master’s thesis, School of Economics and Management, Tilburg University, 2022.
- Butte, V.K.; Butte, S. Enterprise Data Strategy: A Decentralized Data Mesh Approach. 2022 International Conference on Data Analytics for Business and Industry (ICDABI). IEEE, 2022, pp. 62–66.
- Hooshmand, Y.; Resch, J.; Wischnewski, P.; Patil, P. From a Monolithic PLM Landscape to a Federated Domain and Data Mesh. Proceedings of the Design Society 2022, 2, 713–722.
- Dolhopolov, A.; Castelltort, A.; Laurent, A. Exploring the Benefits of Blockchain-Powered Metadata Catalogs in Data Mesh Architecture. Proceedings of the 15th International Conference on Management of Digital EcoSystems. Springer, 2023.
- Dolhopolov, A.; Castelltort, A.; Laurent, A. Trick or Treat: Centralized Data Lake vs Decentralized Data Mesh. Proceedings of the 15th International Conference on Management of Digital EcoSystems. Springer, 2023.
- Dolhopolov, A.; Castelltort, A.; Laurent, A. Implementing a Blockchain-Powered Metadata Catalog in Data Mesh Architecture. International Congress on Blockchain and Applications. Springer, 2023, pp. 348–360.
- Machado, I.A.; Costa, C.; Santos, M.Y. Data mesh: concepts and principles of a paradigm shift in data architectures. Procedia Computer Science 2022, 196, 263–271.
- Machado, I.; Costa, C.; Santos, M.Y. Data-driven information systems: the data mesh paradigm shift 2021.
- Driessen, S.; Monsieur, G.; van den Heuvel, W.J. Data Product Metadata Management: An Industrial Perspective. Service-Oriented Computing–ICSOC 2022 Workshops: ASOCA, AI-PA, FMCIoT, WESOACS 2022, Sevilla, Spain, November 29–December 2, 2022 Proceedings. Springer, 2023, pp. 237–248.
- Sawadogo, P.; Darmont, J. On data lake architectures and metadata management. Journal of Intelligent Information Systems 2021, 56, 97–120.
- Hai, R.; Geisler, S.; Quix, C. Constance: an intelligent data lake system. International conference on management of data. ACM Digital Library, 2016.
- Zhao, Y. Metadata Management for Data Lake Governance. PhD thesis, Toulouse 1, 2021.
- Sawadogo, P.N.; Darmont, J.; Noûs, C. Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake. European Conference on Advances in Databases and Information Systems. Springer, 2021, pp. 88–101.
- Eichler, R.; Giebler, C.; Gröger, C.; Schwarz, H.; Mitschang, B. Modeling metadata in data lakes—A generic model. Data & Knowledge Engineering 2021.
- Mehmood, H.; Gilman, E.; Cortes, M.; Kostakos, P.; Byrne, A.; Valta, K.; Tekes, S.; Riekki, J. Implementing big data lake for heterogeneous data sources. 2019 IEEE 35th International Conference on Data Engineering Workshops (ICDEW). IEEE, 2019.
- Halevy, A.Y.; Korn, F.; Noy, N.F.; Olston, C.; Polyzotis, N.; Roy, S.; Whang, S.E. Managing Google’s data lake: an overview of the Goods system. IEEE Data Eng. Bull. 2016, 39, 5–14.
- Apache Software Foundation. Apache Atlas – Data Governance and Metadata framework for Hadoop. https://atlas.apache.org. Accessed: 14.08.2023.
- DataHub Project. The Metadata Platform for the Modern Data Stack. https://datahubproject.io/. Accessed: 14.08.2023.
- Truong, H.L.; Comerio, M.; De Paoli, F.; Gangadharan, G.; Dustdar, S. Data contracts for cloud-based data marketplaces. International Journal of Computational Science and Engineering 2012, 7, 280–295.
- Abbas, A.E.; Agahari, W.; Van de Ven, M.; Zuiderwijk, A.; De Reuver, M. Business data sharing through data marketplaces: A systematic literature review. Journal of Theoretical and Applied Electronic Commerce Research 2021, 16, 3321–3339.
- Desai, H.; Liu, K.; Kantarcioglu, M.; Kagal, L. Adjudicating violations in data sharing agreements using smart contracts. 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData). IEEE, 2018, pp. 1553–1560.
- Androulaki, E.; Barger, A.; Bortnikov, V.; Cachin, C.; Christidis, K.; De Caro, A.; Enyeart, D.; Ferris, C.; Laventman, G.; Manevich, Y.; others. Hyperledger fabric: a distributed operating system for permissioned blockchains. Proceedings of the thirteenth EuroSys conference, 2018, pp. 1–15.
Footnote 3: An extended description of the Netflix platform was accessed at: https://netflixtechblog.com/data-movement-in-netflix-studio-via-data-mesh-3fddcceb1059

| Property ⇓ / Source ⇒ | Our System | Zalando [17] | Netflix [17] | Machado et al. [9] | Wider et al. [10] | Butte and Butte [12] | Driessen et al. [19] | Hooshmand et al. [13] | Goedegebuure et al. [11] |
|---|---|---|---|---|---|---|---|---|---|
| Data Indexing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Usage Tracking | ✓ | ✓ | ✓ | ✧ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Semantic Enrichment | ✧ | ✗ | ✧ | ✧ | ✧ | ✓ | ✓ | ✓ | ✓ |
| Link Generation | ✧ | ✗ | ✧ | ✧ | ✧ | ✓ | ✓ | ✓ | ✓ |
| Data Polymorphism | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Data Versioning | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✧ | ✧ | ✗ |
| Independently Deployable | ✓ | ✗ | ✗ | ✗ | ✗ | - | ✓ | ✓ | - |
| Computational Policies | ✧ | ✗ | ✗ | ✗ | ✧ | ✧ | ✗ | ✗ | ✧ |
| Composable Products | ✗ | ✗ | ✧ | ✗ | ✗ | ✗ | ✧ | ✗ | ✗ |
| Automatically Testable | ✧ | ✗ | ✗ | ✗ | ✗ | - | ✗ | ✗ | - |
| Contract Management | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Total | (9)/11 | 2/11 | (5)/11 | (4)/11 | (6)/11 | (6)/11 | (8)/11 | (6)/11 | (6)/11 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).