Enterprise Data Warehouse (EDW) architecture needs to be robust, scalable, and flexible to adapt any change in the business environments. Traditionally, the relational and structured data sources feed data to data warehouse. In latest time, there is a demand to integrate structured data with unstructured data coming mainly from social media, external contents available in public domain, documents, email, etc. The architecture built based on the normalized and dimensional design techniques neither respond to the fast changes to the business requirements nor applicable to solutions where Big Data is applicable. Data vault based architecture is suitable to address the fast changes to requirements.
A good data architecture determines what are the key building blocks in data environment and ensures that the data needs of the enterprise are met in data warehousing solution. Through that the design, data flow, data management and data governance processes are defined. A typical data warehousing solution follows either Kimball based solution or Inmon’s based solution.
Kimball has introduced an often-used, two-layer architecture for the data warehouse. The raw data from the source systems is loaded into stage layer before it gets cleansed, transformed and loaded into data warehouse. The advantage of Kimball's proposed architecture is that it is easy to build a dimensional store from a source system. However, the disadvantage is that it is complex to rework on the same dimensional model to incorporate the information for similar kind of sources or any modification of the source system. It has impacts to the existing ETL packages. This needs additional effort to rework on everything with less re-usability of existing solution.
In Inmon's defined three-layer architecture, data marts are developed on top of the data warehouse which is in 3NF form. The business users access the data marts using Business Intelligence (BI) tool, similar to the two-layered architecture. It is quite easy to develop data marts from the data warehouse in three-layer architecture as enterprise data is cleaned, standardized, and integrated. However, it is more time consuming and complex to work on data model and data processing for enterprise data warehouse in one go is suggested top-down approach.
Data vault architecture takes the best of solutions from both the architectures mentioned above and brings a new perspective to address the challenges in those architecture.
Data Vault architecture has the following layers:
Data vault design has three components: Hub, Link, and Satellite. Hub has on the business key, Link captures the relationship between the business keys and Satellite has the non-business keys to capture majorly descriptive information. In data vault model, multiple satellite tables are there by following rate of change of information and types of information. The satellite tables capture the historical updates like Type II Slowly Changing Dimension in case of dimensional design. Data Vault has both the properties of normalization and dimensional design concept. The data vault architecture provides many-to-many relationship among the business keys in Link table which brings flexibility and extendibility in solution [1].
As EDW is treated as single source of truth at enterprise level, it needs to be flexible, scalable, and robust to accept any change happening to upstream layer, be it source systems, data structures, business rules, or addition of new source applications. Unlike other architectures, Data Vault architecture based EDW can handle easily following scenarios where changes are unavoidable at enterprise level:
Major advantages of Data Vault architecture solution are:
Inmon 3NF EDW | Kimball Bus Architecture | Data Vault | |
---|---|---|---|
Implementation Strategy | Top-down approach The design needs big bang approach where it needs to be finalized at the start. |
Bottom-up approach with iterative and incremental implementation methodology Long development time due to redesigns caused by delayed requirements. |
Methodology follows agile development methods It supports pure incremental implementation [2]. |
Design | Fully normalized with optimal number of joins Design is not easy for data loading to dimensional model |
De-normalized with less joins Designed for analytical query processing |
Semi-normalized with many joins Unlike other designs, it is flexible to allow different relationships (one-to-one, one-to-many, many-to-one, many-to-many) without changing the data model structure. |
Data Management | Referential integrity is maintained properly | Expensive updates and deletes | Data load happens based on design pattern (Hub and Link before Satellite tables) |
Audit Mechanism | Duplicate data sources need priority which is difficult to maintain the audit tracking | Not fully auditable | Fully auditable |
Data Presentation | Not suitable for end-user queries. | Star schema is suitable for user queries and reports | Not suitable for end-user queries |
Following table covers a comparison of different architectures used for data warehouse implementation.
Data Vault architecture is a mature data architecture as it supports impact free, robustness, flexibility and scalability in the solution through its optimal design in the data model. Coforge has different best practices, pre-defined templates and guidelines for implementing data warehouse solutions. Based on the customer’s requirements and needs, Coforge implements a specific architecture. In recent years, most of the organizations are considering agile based implementations. Data Vault architecture is a suitable architecture to implement enterprise data warehouse in agile approach.
[1] Dan Linstedt, “Data Vault Series 4 – Link Tables”, TDAN.com, January 2004. https://tdan.com/data-vault-series-4-link-tables/5172
[2] Dan Linstedt, “Data Vault Series 2 – Data Vault Components”, TDAN.com, January 2003. https://tdan.com/data-vault-series-2-data-vault-components/5155
[3] Dan Linstedt and Michael Olschimke, “Building a Scalable Data Warehouse with Data Vault 2.0”, 2016.
[4] W. Inmon, D. Strauss, G. Neuschloss, “DW 2.0: The Architecture for the Next Generation of Data Warehousing”, Elsevier, 2008.