The American futurologist Robert Anton Wilson used to say that “the measure of a system’s viability is the measure of information propagation”.
In tech, data, i.e. information translated into a form that is efficient to move and process, is the most crucial element of all the systems we are building, and will be building, for the foreseeable future.
In this article we will take a look at a new way of handling data and at the decentralized environments built to better disseminate it across our organisations: the data mesh.
Prior to the changes that took place in the data domain in the late 2010s, numerous organisations dedicated resources to building centralized Data Warehouses and Data Lakes that were later outpaced by the complexity of their businesses and the multitude of data sources, not to mention ambitious goals such as harnessing the power of AI.
In most cases the problem was not the underlying tech stack or a lack of skilled hands in the central teams governing data. The real issue was how we approach, store and distribute data within the warehouses and lakes themselves.
The situation called for a more scalable approach that would help us resist our systems’ rising entropy.
Usually, after some initial successes once a centralized data platform had been introduced, the first obstacles would arise, with the central data team getting overwhelmed by the multitude of analytical queries from management and product stakeholders. This bottleneck was — and still is — a serious challenge, as making prompt data-informed decisions is the baseline for every thriving business.
On top of that, data teams had to learn to comprehend the particular business domain behind each inquiry.
Consequently, companies were making significant investments in domain-driven design, autonomous domain teams, and decentralized microservice architectures.
It was certainly a major step towards a better future. With domain teams understanding their respective areas, including the business’s information requirements, they were able to ship their web applications and APIs without experiencing major bottlenecks. A big change, isn’t it? Alas, these teams still had to rely on the overburdened central data team for crucial data-driven insights.
Despite the missing bits in the equation, the domain-driven approach put us at the threshold of something bigger.
This concept of domain-driven decentralization for analytical data, similar to APIs in a microservice architecture, was then picked up and developed by Zhamak Dehghani under the data mesh umbrella in her article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Thus, a new foundation on which data engineering teams could operate was created.
“The menu is not the meal”, though, and in order to realise the data mesh’s potential, we all need to become the cooks.
Let us then take a look at production insights from a data mesh environment setup that has been up and running for over three years now.
To grasp the essence of data mesh, it’s vital to understand the major distinction between operational and analytical data: operational data supports the day-to-day running of applications and source systems, while analytical data provides an aggregated, historical view of the business, optimized for reporting and insight generation.
Modern data architectures reflect this sharp divide, and even though the data mesh’s creator herself encourages mesh practitioners to challenge the status quo, the scheme below reflects our take on the analytical side.
As mentioned, since data mesh is a set of practices and patterns rather than a fixed framework, there is a multitude of ways of putting it into practice.
In this particular case, the team constructed a robust, scalable, and efficient data environment without complicating things too much. All the main mesh principles, i.e. decentralized data ownership and product-oriented thinking supported by a self-serve data infrastructure, are inherent parts of the setup. The tools of the trade are presented below:
Google Cloud Platform (GCP): At the foundation of things lies GCP, a suite of cloud computing services that provides the scalability, flexibility, and innovation pace required for modern data solutions. GCP’s services, including Google Cloud Storage (GCS) for data storage, Cloud Composer for workflow orchestration, and Secret Manager for secure management of sensitive information, form the backbone of the presented data infrastructure.
Snowflake: Complementing GCP, Snowflake serves as our Data Warehouse (DWH) solution. Snowflake nails it with its ability to scale compute and storage independently and effortlessly. It’s where our final data products reside, and from where our data-sharing capabilities extend.
Choosing Snowflake felt very natural, as the company aims at multi-cloud architectures and independence.
Did we mention top performance and concurrency for data analytics?
In case you need more details on other data platforms that do a swell job, and how Snowflake compares with them, check out this link.
Apache Airflow: For data workflow orchestration, the team uses Apache Airflow, specifically managed within GCP’s Cloud Composer service. This integration allows the team to automate, monitor, and manage complex data pipelines efficiently, ensuring data is processed and moved through systems reliably.
Terraform: Infrastructure as Code (IaC) is critical for managing cloud resources consistently and transparently. HashiCorp’s Terraform enables the team to define the infrastructure using a high-level configuration language, which is used to manage both GCP and Snowflake resources, ensuring that infrastructure provisioning and updates are reproducible and version-controlled.
Data Build Tool (dbt): dbt plays a pivotal role in transforming data in our Snowflake DWH. It allows data engineers to transform data using SQL — which is then run through Airflow — making it easier to model, test, and version control our transformations. The data enters Snowflake in raw form and then goes through a couple of layers before being transformed at the final stage, which opens up the possibility of transforming data as many times as required, marking the shift from ETL to ELT processing.
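To make this concrete, here is a minimal, hypothetical sketch of an Airflow DAG running dbt for one data product. The DAG id, schedule and dbt paths are our illustrative assumptions, not the team’s actual code.

```python
# A minimal, hypothetical Airflow DAG for one data product's ELT run.
# The DAG id, schedule and dbt paths are illustrative assumptions only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/home/airflow/gcs/dags/dbt"  # assumed location of the dbt project in Composer

with DAG(
    dag_id="finance_data_product_elt",  # hypothetical data product
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Raw data has already landed in Snowflake (the E and L); dbt handles the T.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # Data tests guard the final layer before it is exposed to consumers.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    dbt_run >> dbt_test
```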
By integrating these technologies, the team crafted a self-serve data infrastructure that empowers data teams to focus on delivering value rather than being bogged down by the complexities of data platform management.
How does the above setup relate to mesh theory itself? Let us go a tad broader and bring up the main principles of data mesh.
We have woven in elements of the above tech setup as points of reference.
Central to data mesh is the principle of decentralization, where data ownership is distributed across business domains. Each domain becomes responsible for its analytical data, fostering agility and scalability.
The logical architecture of data mesh revolves around domains, each with its own interfaces for operational and analytical data. In the case of the presented setup, the most notable domain teams are finance, procurement, stock management, and logistics. The teams comprise business analysts with domain knowledge, data engineers, and BI specialists responsible for the reporting layer. Each team is cross-functional within its domain.
Data from each such unit is later used as a product to feed ML and AI processes.
A data product is an entity containing all the necessary components to process and store domain-specific data, designed for analytical or data-intensive applications. A crucial facet of data products is their accessibility to other teams within the organization through designated output ports.
Data products in the presented setup, and generally in each and every mesh environment, should establish connections with various data sources, such as operational systems or other data products, and undertake the task of data transformation. They then serve these processed data sets through one or multiple output ports. These output ports typically manifest as structured data sets, defined by a specific data contract.
For the sake of the exercise, you can envision a data product as a module or microservice, specifically tailored for handling analytical data.
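Sticking with that analogy, the sketch below models an output port and its data contract as plain Python dataclasses. The field names and values are purely illustrative assumptions, not a standard prescribed by the setup.

```python
# Illustrative sketch only: a data product's output port described as a
# data contract. Field names and values are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""


@dataclass
class DataContract:
    product: str                          # e.g. "finance.invoices"
    owner: str                            # owning domain team
    output_format: str                    # e.g. "parquet" or "snowflake_table"
    location: str                         # GCS URI or Snowflake schema.table
    schema: list[Column] = field(default_factory=list)
    update_frequency: str = "daily"


invoices_contract = DataContract(
    product="finance.invoices",
    owner="finance",
    output_format="parquet",
    location="gs://finance-data-products/invoices/",   # hypothetical bucket
    schema=[
        Column("invoice_id", "STRING", "Unique invoice identifier"),
        Column("amount_eur", "NUMERIC", "Invoice amount in EUR"),
        Column("issued_at", "TIMESTAMP", "Issue timestamp (UTC)"),
    ],
)
```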
Each data product of the presented setup has its own GCP project, including resources such as service accounts for interacting with cloud components, GCP secrets necessary for running the data product, GCS buckets for landing source data or exporting the final data product, and access to the Composer project.
The automated provisioning of Snowflake resources includes schema setup, staging, storage integration for GCS, file formats, and table creation. Deployment and execution of dbt models take place through Airflow.
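To make that list of Snowflake objects tangible, the sketch below issues the equivalent DDL through the Snowflake Python connector. In the described setup these objects are provisioned via Terraform, not ad hoc DDL, and every name and credential here is a hypothetical stand-in.

```python
# Illustration of the Snowflake objects a single data product typically needs:
# schema, GCS storage integration, file format, stage and a raw table.
# In the described setup these are provisioned by Terraform; plain DDL via the
# Snowflake Python connector is shown here only to make the objects explicit.
import os

import snowflake.connector

ddl_statements = [
    "CREATE SCHEMA IF NOT EXISTS FINANCE_RAW",
    """
    CREATE STORAGE INTEGRATION IF NOT EXISTS GCS_FINANCE_INT
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'GCS'
      ENABLED = TRUE
      STORAGE_ALLOWED_LOCATIONS = ('gcs://finance-landing-bucket/')
    """,
    "CREATE FILE FORMAT IF NOT EXISTS FINANCE_RAW.PARQUET_FMT TYPE = PARQUET",
    """
    CREATE STAGE IF NOT EXISTS FINANCE_RAW.GCS_LANDING
      URL = 'gcs://finance-landing-bucket/'
      STORAGE_INTEGRATION = GCS_FINANCE_INT
      FILE_FORMAT = (FORMAT_NAME = 'FINANCE_RAW.PARQUET_FMT')
    """,
    """
    CREATE TABLE IF NOT EXISTS FINANCE_RAW.INVOICES (
      INVOICE_ID STRING,
      AMOUNT_EUR NUMBER(18, 2),
      ISSUED_AT TIMESTAMP_NTZ
    )
    """,
]

conn = snowflake.connector.connect(
    account="my_account",                        # hypothetical account identifier
    user="provisioning_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],   # secrets would come from Secret Manager
    role="ACCOUNTADMIN",                         # a role allowed to create integrations
    warehouse="PROVISIONING_WH",
    database="FINANCE",
)
try:
    cur = conn.cursor()
    for stmt in ddl_statements:
        cur.execute(stmt)
    cur.close()
finally:
    conn.close()
```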
Each data product from the setup is under the ownership of a domain team within the organization.
As data mesh strongly encourages flexibility in the approach, the landscape of self-serve data platforms on which we run our mills also varies.
And, to truly enable teams to independently manage their data products, they must be equipped with an abstraction layer for the basic infrastructure. This layer simplifies the complexity and reduces the effort involved in managing data product lifecycles, promoting a principle of self-serve data infrastructure to foster domain independence.
Moreover, for effective usage of data products across domains, domain teams must be able to seamlessly access, integrate, and query data products from other domains. The platform should support, monitor, and document these cross-domain interactions and the usage of data products.
For the team, the winning setup is one in which the entire infrastructure, including GCP and Snowflake resources, is managed through Terraform Infrastructure as Code (IaC).
For the data product DAGs, the team uses Cloud Composer — a fully managed workflow orchestration service built on Apache Airflow.
On top of that, all the aspects mentioned are fully automated and managed by an orchestrated pipeline, allowing data teams to leverage a higher-level DSL for resource provisioning. This DSL is integrated with the CI/CD pipeline, eliminating the need for custom Terraform code development.
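The DSL itself is internal to the team, so the snippet below is only a hypothetical illustration of the kind of declarative input such a pipeline might consume; none of the keys or values come from the actual implementation.

```python
# Purely hypothetical shape of a data product declaration that such a
# provisioning pipeline could translate into Terraform resources.
data_product_spec = {
    "name": "finance-invoices",
    "domain": "finance",
    "gcp": {
        "project_id": "dp-finance-invoices",         # dedicated GCP project
        "buckets": ["finance-landing-bucket"],       # landing / export buckets
        "secrets": ["snowflake-service-account"],    # stored in Secret Manager
    },
    "snowflake": {
        "database": "FINANCE",
        "schemas": ["RAW", "STAGING", "MART"],
        "read_access": ["PROCUREMENT_READER"],       # cross-domain consumer roles
    },
    "orchestration": {
        "composer_env": "shared-composer",
        "dag": "finance_data_product_elt",
        "schedule": "@daily",
    },
}
```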
In a distributed data mesh approach, the need for transparency and a proper level of interfacing between independent data products is a crucial one.
Corporate as it sounds, it does work in practice, under the condition that the federated governance group serves as a guild comprising representatives from all teams involved in the data mesh ecosystem.
This group collaboratively establishes global policies, which serve as the guiding principles within the data mesh environment. These policies outline the standards that domain teams must adhere to when building their respective data products.
Furthermore, interoperability policies serve as the cornerstone of this governance approach. They ensure that data products can be utilized consistently across different domain teams. For instance, a global policy might dictate that data should be provided in a standard format, such as a Parquet file stored in GCS within a bucket managed by the corresponding domain team.
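As a concrete, hypothetical example of such a policy in action, a consuming domain could read another team’s Parquet output port straight from the producing team’s GCS bucket. The bucket, file and column names below are our own illustrative assumptions.

```python
# Hypothetical consumer-side read of another domain's output port: a Parquet
# data set in a GCS bucket owned by the producing (finance) team.
# Requires pandas plus the gcsfs and pyarrow packages.
import pandas as pd

invoices = pd.read_parquet(
    "gs://finance-data-products/invoices/invoices.parquet"  # hypothetical output port location
)

# The consuming domain (e.g. procurement) can now aggregate freely;
# issued_at is assumed to be a timestamp column per the data contract.
monthly_totals = (
    invoices
    .assign(month=invoices["issued_at"].dt.to_period("M"))
    .groupby("month")["amount_eur"]
    .sum()
)
print(monthly_totals.head())
```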
In order to establish secure and uniform access to the actual data product, a standardized approach employing role-based access control makes the most sense.
For the setup presented herein, the above principles’ implementation takes place through IAM policies managed by Terraform IaC, focusing on role-based access control and ensuring that data management practices adhere to organizational and regulatory standards.
Taking it one step lower, access to a data product is simply managed in its repository via pull requests from other data products. All pull requests are governed by role-based access controls and must be approved by the data team owning the repo. Thus, the team knows who accesses what, and when.
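Conceptually, an approved access pull request boils down to a handful of Snowflake grants like the ones sketched below. In the real setup they are applied through the Terraform pipeline rather than by hand, and all names are hypothetical.

```python
# Conceptual equivalent of what an approved access pull request grants:
# read-only access on the producing product's mart schema for the consuming
# product's role. Applied via the IaC pipeline in practice; names are hypothetical.
READ_ACCESS_GRANTS = [
    "GRANT USAGE ON DATABASE FINANCE TO ROLE PROCUREMENT_READER",
    "GRANT USAGE ON SCHEMA FINANCE.MART TO ROLE PROCUREMENT_READER",
    "GRANT SELECT ON ALL TABLES IN SCHEMA FINANCE.MART TO ROLE PROCUREMENT_READER",
    "GRANT SELECT ON FUTURE TABLES IN SCHEMA FINANCE.MART TO ROLE PROCUREMENT_READER",
]
```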
The concept of a data mesh is really a function of the organic flow of things in tech, and a natural consequence of the more siloed approach in whose complacency we have been resting a bit too long.
Think microservices, think blockchain, think the uberization of all the business sectors that lean on tech — the change is taking place now, and data mesh is a strong vector in the data world, adding to the general landscape. And it’s here to stay.
Having tested data mesh for over three years in a production environment, we are surely hooked.
More granular articles on the subject matter will follow.
At insightify.io we design architectures for data-intensive applications with a strong focus on data governance in distributed data mesh environments. Don’t hesitate to contact us to get more info!