by Sundip Gorai
The Past:
In the last thirty years, we have seen the rapid evolution of various data and analytics paradigms. At the beginning of the 1990s, the world was dominated by mainframes and databases. The need for more query-driven MIS set the stage for the data warehouse – an aggregate store of data for reporting, KPIs, and dashboards. We saw the rapid implementation of data warehouses, with fierce debate raging between a top-down and a bottom-up approach to executing a warehouse (also dubbed the debate between a Kimball and an Inmon architecture framework).
At the turn of the century, we saw the rapid rise of tools and technologies that aided the data warehouse paradigm: BI applications, reporting tools, and data integration tools, along with a gradual rise in metadata, master data, and data quality systems.
At the end of the first decade of this century, the desire to harness new data challenges – Volume (zettabyte scale), Velocity (batch and streaming datasets), Variety (text, video, voice; structured, semi-structured, and unstructured), and Veracity (data quality) – led to the birth of big data ecosystems. These changed the paradigm of the warehouse, which hitherto was driven mainly by a schema-on-write approach (model the data before you bring it into the data warehouse), to schema-on-read (don't model the data; bring everything into the data lake, the storage repository, and then make it ready for analytics), all driven by a paradigm shift in the ask for vertical and horizontal scalability. This change was first driven by the big data (HDFS/Hadoop) revolution, which later gave way to the more nimble-footed Apache Spark and other Apache frameworks. All this finally culminated in modernization to the cloud, a force still very much in the reckoning, and one that will remain so for at least another five years.
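The schema-on-write versus schema-on-read distinction can be sketched in a few lines of Python. This is a toy illustration only, under assumed names: the `order_id`/`amount` fields, the in-memory `warehouse` and `lake` lists, and the helper functions are all invented for this example.

```python
import json

# Schema-on-write: model and validate the record BEFORE it enters the warehouse.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}

def write_to_warehouse(warehouse, record):
    # Coerce to the modeled schema up front; anything off-schema fails here.
    typed = {field: cast(record[field]) for field, cast in WAREHOUSE_SCHEMA.items()}
    warehouse.append(typed)

# Schema-on-read: land the raw payload as-is in the lake...
def land_in_lake(lake, raw_payload):
    lake.append(raw_payload)  # no modeling at ingest time

# ...and impose structure only when a consumer reads it.
def read_from_lake(lake):
    for raw in lake:
        doc = json.loads(raw)
        yield {"order_id": int(doc.get("order_id", 0)),
               "amount": float(doc.get("amount", 0.0))}

warehouse, lake = [], []
write_to_warehouse(warehouse, {"order_id": "7", "amount": "19.99"})
land_in_lake(lake, '{"order_id": 7, "amount": 19.99, "note": "anything goes"}')
print(warehouse[0])
print(next(read_from_lake(lake)))
```

Note how the lake happily accepts the extra `note` field at ingest, while the warehouse would reject it at write time; that flexibility is the trade the big data ecosystems made.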
In parallel, rapid advances in data science and AI techniques, which draw data from the above storage receptacles (data warehouse, data lake, and others), are taking enterprises, and preparing to take humanity, to the next frontier.
The Present:
In summary, where we stand today:
The Future:
While the above architectural changes have brought great advances to the enterprise, newer challenges have emerged, along with newer thinking about how to address them.
In today's article, we will discuss another new paradigm gaining momentum: the Data Mesh, conceived by Zhamak Dehghani. As enterprises stand at a crossroads with myriad data architecture and sub-architecture alternatives, pondering which to adopt and which to let go, lo and behold, we have a new paradigm, Data Mesh. It challenges the existing paradigms of data, questioning centralized storage, task-based team structures, and more. As we discuss below, Data Mesh inverts the concept of data management and espouses distributed, domain-based control of data, unlike a central warehouse or lake.
We discuss below the ten dimensions of this architecture framework.
Over the last thirty years, the data world has been divided into primarily two camps: the operational data camp and the analytic data camp. The unfortunate problem is that the core data management layer sees itself as a technology problem, ignoring the domain construct of data. Domain reemerges in reporting, analytics, and data science, but in between, from source to analytics, data goes through many stages of transformation and storage, creating a chasm in managing lineage and resilience to change, among other problems.
The Data Mesh construct suggests that one way to solve this challenge is to look at data as a product: a data product is an encapsulated abstraction pertaining to the data of the domain it belongs to. This gives the agility to address change, linking a consumer's analytic ask directly to the data that serves it – an inverted model and topology based on domains, not the technology stack. To deliver this, a data product quantum, the encapsulated entity, works as an independent unit to receive and serve data, whether through a query, an API, or events. Unlike a data warehouse, this is a more peer-to-peer construct. It solves the problem of repeatedly copying organizational data and of massive, meaningless parallel processing without domain intent, eliminating the centralization of the warehouse and reducing governance effort. All in all, it leads to freedom and autonomy for teams.
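The data product quantum described above can be sketched as a small Python class. This is a minimal, hypothetical illustration, not Dehghani's specification: the class name, the `ingest`/`query`/`get`/`subscribe` methods, and the orders/shipping domains are all invented to show one domain-owned unit serving data via query, API-style lookup, and events.

```python
from collections import defaultdict

class DataProductQuantum:
    """Toy sketch: one domain-owned unit that both receives data
    and serves it via query, API-style lookup, or events."""

    def __init__(self, domain):
        self.domain = domain
        self._records = []                     # domain-owned storage
        self._subscribers = defaultdict(list)  # event consumers

    # -- receiving side ------------------------------------------------
    def ingest(self, record):
        self._records.append(record)
        for callback in self._subscribers["record_added"]:
            callback(record)                   # push to event subscribers

    # -- serving side: query, API-style lookup, events -----------------
    def query(self, predicate):
        return [r for r in self._records if predicate(r)]

    def get(self, key, value):
        # API-style point lookup by attribute
        return next((r for r in self._records if r.get(key) == value), None)

    def subscribe(self, event, callback):
        self._subscribers[event].append(callback)

# Two peer products exchanging data directly, with no central warehouse:
orders = DataProductQuantum("orders")
shipping = DataProductQuantum("shipping")
orders.subscribe("record_added",
                 lambda r: shipping.ingest({"order_id": r["id"], "status": "queued"}))
orders.ingest({"id": 42, "amount": 19.99})
print(shipping.get("order_id", 42))
```

The point of the sketch is the topology: the shipping product subscribes directly to the orders product, peer to peer, rather than both copying their data into a shared central store.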
The Data Mesh requires defining:
Data domains need to be designed and logically constructed with the views below.
Once data products are created, it is important to adhere to the guiding principles of data in a data mesh, which are:
This leads to an ask for a pliable platform for Data Mesh architecture consumption. The data product quantum requires architecture and code that are usable, and a platform that helps teams create them with independence and autonomy. For example, just as microservices today rely on a container architecture, a data product should behave like a container: agnostic of platform variability.
In essence, a data mesh platform needs to reduce data complexity by creating the following core layers:
To sum up, Data Mesh will lead to: