by Sundip Gorai
A bit of data history – past, present, and future
The Past:
In the last thirty years, we have seen the rapid evolution of various data and analytics paradigms. At the beginning of the 1990s, the world was dominated by mainframes and databases. The need for more query-driven MIS set the stage for the data warehouse – an aggregate store of data for reporting, KPIs, and dashboards. We saw the rapid implementation of data warehouses, with fierce debate raging between a top-down and a bottom-up approach to executing a warehouse (also dubbed the debate between the Kimball and Inmon architecture frameworks).
At the turn of the century, we saw the rapid rise of tools and technologies that aided the data warehouse paradigm: BI applications, reporting tools, and data integration tools, along with a gradual rise in metadata, master data, and data quality systems.
At the end of the first decade of this century, the desire to harness new data challenges – Volume (zettabyte-scale data), Velocity (batch and streaming datasets), Variety (text, video, voice; structured, semi-structured, and unstructured data), and Veracity (data quality) – led to the birth of big data ecosystems. These changed the paradigm of the warehouse from SCHEMA ON WRITE (model the data before you bring it into the data warehouse) to SCHEMA ON READ (don’t model the data up front; bring everything into the data lake, the storage repository, and then make it ready for analytics) – a shift driven by the demand for vertical and horizontal scalability. This change was first driven by the big data (HDFS/Hadoop) revolution, which later gave way to the more nimble-footed Apache Spark and other Apache frameworks. All of this culminated in modernization to the cloud, a force still very much in play that will remain so for at least another five years.
In parallel, rapid advances in data science and AI techniques, which draw data from the above receptacles of storage (the data warehouse, the data lake, and others), are taking enterprises – and preparing to take humanity – to the next frontier.
The Present:
In summary, where we stand today:
- Most enterprises are modernizing to the cloud
- They continue to service their existing data warehouses while building new ones in, or shifting old ones to, the cloud
- In parallel, they are building their data science ecosystems to solve compelling problems for the enterprise
The Future:
While the above architectural changes have delivered great advances for the enterprise, newer challenges – and newer ideas to address them – have emerged:
- Can we eliminate the redundancy between the warehouse and the data lake?
- Can we monitor and manage the enterprise MIS with a domain-driven paradigm?
To address these two challenges, four new paradigms have emerged in the recent past:
- The Data Lakehouse – a construct in which, instead of storing data separately in the warehouse and the lake, we create a Lakehouse, a unified receptacle where data is stored to meet both traditional reporting needs and analytics needs
- The Data Mesh architecture – a paradigm to manage data by domain, in a distributed fashion, instead of centrally storing the data
- Data Vault – a paradigm to incrementally create an agile warehouse, after staging the data, and before creating data marts
- Logical warehouse – a data management architecture in which an architectural layer sits on top of a traditional data warehouse, enabling access to multiple, diverse data sources while appearing as one “logical” data source to users
In today’s article, we will discuss another new paradigm gaining momentum – the Data Mesh, conceived by Zhamak Dehghani. Enterprises stand at a crossroads with myriad data architecture and sub-architecture alternatives, pondering which one to adopt and which one to let go – and, lo and behold, we have a new paradigm, the Data Mesh. It challenges the existing paradigms of data – questioning centralized storage, team structures based on tasks, and more. As we discuss below, the Data Mesh inverts the concept of data management and espouses distributed, domain-based control of data, unlike a central warehouse or lake.
We discuss below the key dimensions of this architecture framework:
- What challenges are encountered in today’s data ecosystems
- What a data mesh is and how it can help solve those challenges
- The key reference architecture components of a Data Mesh
- How to execute a Data Mesh – steps and deliverables
- Summary
1. Challenges in today’s Data Ecosystem
Over the last thirty years, we have seen the data world divided primarily into two camps – the operational data camp and the analytical data camp. The unfortunate problem is that core data management sees itself as a technology problem, ignoring the domain construct of data. Domain reemerges in reporting, analytics, and data science, but in between – from source to analytics – data goes through many stages of transformation and storage, creating a chasm in managing lineage and resilience to change, among other problems.
2. Solving the challenge of this data gap
The Data Mesh construct suggests that one way of solving this challenge is to look at data as a product – a data product is an encapsulated abstraction of the domain data it belongs to. This gives teams the agility to address change, linking a consumer’s analytic ask directly to the data that serves it – an inverted model and topology based on domains rather than on the technology stack. The unit that does this is the DATA PRODUCT QUANTUM – an encapsulated entity that works as an independent unit to receive and serve data, whether via query, API, or events. Unlike a data warehouse, this is a more peer-to-peer construct. It solves the problem of repeatedly copying organizational data and of massive, meaningless parallel processing without domain intent, eliminating the centralization of the warehouse and reducing governance effort. All in all, it leads to freedom and autonomy for teams.
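To make the idea concrete, below is a minimal sketch in Python of what a data product quantum’s serving surface might look like – one encapsulated unit with query, API, and event ports. All class and method names here are illustrative assumptions; there is no standard Data Mesh API.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable

# Sketch of a "data product quantum": one encapsulated unit that owns its
# domain data and serves it through explicit ports (query, API, events).
# Names are illustrative only, not a standard interface.

class DataProductQuantum(ABC):
    """Encapsulates a domain's data, metadata, and serving interfaces."""

    domain: str  # the business domain this product belongs to

    @abstractmethod
    def serve_query(self, sql: str) -> list[dict[str, Any]]:
        """Answer an analytical query against this product's data."""

    @abstractmethod
    def serve_api(self, resource: str, params: dict[str, Any]) -> dict[str, Any]:
        """Serve a record-level request over a REST-style interface."""

    @abstractmethod
    def subscribe(self, topic: str, handler: Callable[[dict[str, Any]], None]) -> None:
        """Push domain events to downstream consumers as they occur."""


class OrdersDataProduct(DataProductQuantum):
    """Hypothetical product owned by an 'orders' domain team."""

    domain = "orders"

    def serve_query(self, sql: str) -> list[dict[str, Any]]:
        # In practice this would delegate to the product's own storage engine.
        return [{"order_id": 1, "status": "shipped"}]

    def serve_api(self, resource: str, params: dict[str, Any]) -> dict[str, Any]:
        return {"resource": resource, "params": params}

    def subscribe(self, topic: str, handler: Callable[[dict[str, Any]], None]) -> None:
        handler({"topic": topic, "event": "order_created"})


orders = OrdersDataProduct()
print(orders.serve_query("SELECT * FROM orders"))  # [{'order_id': 1, 'status': 'shipped'}]
```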
3. Reference components of a Data Mesh
The Data Mesh requires defining:
- Owners (who own the code, data, metadata, and policies). This enables resilience, change stability, agility, and the freedom for data owners to align with the context of business analytic needs.
- Data products – a data product is a logical, business-domain encapsulation of data, metadata, and policies (see the sketch after this list). This creates more accessibility and flexibility of data as the organization navigates change.
- Self-serve ecosystem – enable an ecosystem of developers over an independent platform, vis-à-vis a monolithic architecture
- Federated computational governance
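As a rough illustration of the data product definition above, here is a hedged sketch of a descriptor bundling owner, data contract, and policies into one logical unit. The field names are assumptions made for this sketch, not part of any Data Mesh specification.

```python
from dataclasses import dataclass, field

# Illustrative descriptor for a data product: owner, output ports, schema
# contract, and policies kept together as one logical unit. All field names
# are sketch-level assumptions.

@dataclass
class DataProductDescriptor:
    domain: str                      # business domain, e.g. "orders"
    owner: str                       # accountable team or person
    output_ports: list[str]          # e.g. ["sql", "rest", "events"]
    schema: dict[str, str]           # column name -> type: the product's contract
    policies: dict[str, str] = field(default_factory=dict)  # e.g. retention, PII handling

orders = DataProductDescriptor(
    domain="orders",
    owner="orders-team@example.com",
    output_ports=["sql", "rest", "events"],
    schema={"order_id": "int", "status": "string", "created_at": "timestamp"},
    policies={"retention": "400d", "pii": "none"},
)
print(orders.owner)  # orders-team@example.com
```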
4. Data Mesh Guiding Principles
Data domains need to be designed and logically constructed with the following views:
- Source-aligned domain data: business facts that are generated by the source systems
- Aggregate domain data: analytical data that is an aggregate of multiple upstream domains, for example, data in marts or the data warehouse
- Consumer-aligned domain data: analytical data transformed to fit the needs of one or more specific use cases and consuming applications. This is also called fit-for-purpose domain data.
Once data products are created, it is important to honor the guiding principles for data in a data mesh (illustrated in the sketch after this list). Data products should be:
- Discoverable
- Understandable
- Addressable
- Secure
- Interoperable
- Trustworthy
- Accessible
- Valuable to own
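A toy registry, sketched below with hypothetical names, shows how two of these principles – addressable and discoverable – can surface directly in code: every product is registered under a stable address and can be found by keyword without knowing who owns it.

```python
from dataclasses import dataclass

# Toy mesh registry illustrating two principles from the list above:
# addressable (every product gets a stable, unique address) and discoverable
# (consumers can search for products). Names are sketch-level assumptions.

@dataclass
class Product:
    domain: str
    owner: str

class MeshRegistry:
    def __init__(self) -> None:
        self._products: dict[str, Product] = {}

    def register(self, product: Product) -> str:
        # Addressable: a stable address derived from the domain name.
        address = f"mesh://{product.domain}"
        self._products[address] = product
        return address

    def discover(self, keyword: str) -> list[str]:
        # Discoverable: consumers find products without knowing their owners.
        return [addr for addr, p in self._products.items() if keyword in p.domain]

registry = MeshRegistry()
registry.register(Product(domain="orders", owner="orders-team"))
print(registry.discover("ord"))  # ['mesh://orders']
```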
5. Data Platform
This leads to the need for a pliable platform through which the Data Mesh architecture can be consumed. The data product quantum requires useful architecture, useful code, and a platform that lets teams build with independence and autonomy. For example, just as microservices today rely on a container architecture, a data product should behave like a container, agnostic of the underlying platform variability.
In essence, a data mesh platform needs to reduce data complexity by creating the following core layers (sketched in code after the list):
- A self-serve layer (discovery, exploration, security, and others), which sits on
- A data product development experience plane – where developers define the data product – which in turn relies on
- A data infrastructure plane that builds and monitors things like CI/CD, networking, transformation orchestration, and access control
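One way to picture this layering, purely as a sketch with assumed names rather than a reference implementation, is as three planes, each delegating to the one below it:

```python
# Sketch of the three platform planes described above, each delegating to
# the plane beneath it. Class and method names are illustrative assumptions.

class InfrastructurePlane:
    """Lowest plane: provisioning, CI/CD, networking, access control."""
    def provision_storage(self, domain: str) -> str:
        return f"s3://mesh-{domain}"  # placeholder resource identifier

    def run_pipeline(self, domain: str) -> None:
        print(f"running CI/CD and orchestration for {domain}")

class ProductDevelopmentPlane:
    """Middle plane: where a domain team declares and builds its product."""
    def __init__(self, infra: InfrastructurePlane) -> None:
        self.infra = infra

    def create_product(self, domain: str) -> str:
        location = self.infra.provision_storage(domain)
        self.infra.run_pipeline(domain)
        return location

class SelfServePlane:
    """Top plane: discovery, exploration, and security for consumers."""
    def __init__(self, dev: ProductDevelopmentPlane) -> None:
        self.dev = dev
        self.catalog: dict[str, str] = {}

    def publish(self, domain: str) -> None:
        self.catalog[domain] = self.dev.create_product(domain)

    def discover(self) -> list[str]:
        return sorted(self.catalog)

platform = SelfServePlane(ProductDevelopmentPlane(InfrastructurePlane()))
platform.publish("orders")
print(platform.discover())  # ['orders']
```

The point of the chain is that a domain team touches only the top plane; provisioning and pipelines are handled underneath, which is what gives teams their independence and autonomy.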
To sum up, the Data Mesh will lead to:
- Serving data at the source, instead of just letting data flow through a series of transformations
- Moving from canonical models to distributed models
- Moving from a single source of truth to the most relevant source
- Shifting focus from pipeline to domain
- Transforming architecture from technology-oriented, lifeless data Lego blocks to domain Lego blocks