By Rajan Jindal & Ragavendran
Business Context
There is a growing need for Data Governance, particularly given the requirements mandated by regulations such as GDPR and CCPA (due to take effect in early 2020). As a result, organizations and regulatory bodies increasingly emphasize data privacy and protection in applications. While this may sound like a pure technology problem, it is equally a matter of people and process.
Leading analysts have been emphasizing the growing trends around data governance and metadata management. For example:
- Gartner: By 2020, most data and analytics use cases will require connecting to distributed data sources, leading enterprises to double their investments in metadata management.
- Gartner: By 2022, 45% of Fortune 500 companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.
- Verified Market Research: Global data governance market was valued at USD 1.24 Billion in 2018 and is projected to reach USD 5.80 Billion by 2026, growing at a CAGR of 21.2% from 2019 to 2026.
We look at the leading challenges in Data Governance and at how better, more automated metadata management strategies can address them.
Data Governance
As DAMA explains, Data Governance is the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets.
The challenges of implementing effective data governance stem from three aspects: Process, People and Technology.
Data Governance Framework
A typical data governance framework spans a lifecycle that runs from strategy, through formulating the right policies and processes, to technology. Putting together an effective mechanism for monitoring and control is equally important, as data governance is a journey and not a one-time initiative. The Chief Data Officer typically needs to remain invested in the data governance initiative, focusing on its different aspects and on changing business, technology and regulatory imperatives.
An effective Data Governance program focuses on the following aspects of technology solutions:
- Data Discovery: With digitization and cloud adoption, data is spread across many sources. Data discovery lets users find contextualized data at the right time, using easy-to-use search and tagging facilities across enterprise and cloud data sources.
- Data Classification: Solutions that classify data by level of sensitivity (personal and private data, confidential data, restricted data) according to varying levels of financial and legal risk, using the right combination of tools, technologies and processes.
- Data Cataloguing, Lineage and Metadata Management: To provide complete visibility into an organization’s data landscape, a data catalog must be maintained that contains the business glossary, functional and technical metadata, and lineage, along with search and discovery capabilities.
- Master Data Management & Reference Data Management: These solutions help define consistent and integrated definitions, values and processes for enterprise-level masters such as customer, product, account and location. Reference data management handles special types of master data such as airport codes, baggage special handling codes and ISINs.
- Data Quality Management: The tools, technologies and processes for data profiling, cleansing, standardization, enrichment, validation and monitoring.
- Data Access Management: Authentication and authorization that grant access to data assets only to the right data consumers.
- Auditing: Monitoring, auditing and tracking of data assets to proactively identify potential threats before they result in business, legal or reputational loss, and to ensure the overall effectiveness of controls.
- Data Protection: Data access management alone is not sufficient; it must be complemented by data protection solutions such as data masking, encryption of data at rest and in motion, and permanent deletion.
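To make the classification and protection aspects above more concrete, here is a minimal sketch of rule-based classification followed by masking. Everything in it is an assumption made for illustration: the rule patterns, the sensitivity labels and the function names `classify_column` and `mask_value` are hypothetical, and real classifiers usually inspect data values as well as column names.

```python
import hashlib
import re

# Hypothetical sensitivity rules: a regex over column names mapped to a label.
# Rules are checked in order, so the most sensitive label is listed first.
SENSITIVITY_RULES = [
    (re.compile(r"ssn|passport|national_id", re.I), "restricted"),
    (re.compile(r"email|phone|birth|address", re.I), "personal"),
    (re.compile(r"salary|account|card", re.I), "confidential"),
]


def classify_column(column_name: str) -> str:
    """Return the first matching sensitivity label, or 'public' if none match."""
    for pattern, label in SENSITIVITY_RULES:
        if pattern.search(column_name):
            return label
    return "public"


def mask_value(value: str, label: str) -> str:
    """Mask a value according to its sensitivity label."""
    if label == "public":
        return value
    if label == "personal":
        # Keep only the last two characters visible.
        return "*" * max(len(value) - 2, 0) + value[-2:]
    # Confidential and restricted values are replaced by a truncated SHA-256
    # digest. This is for illustration only; a production system would use
    # keyed hashing, tokenization or encryption instead.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


row = {
    "customer_name": "Asha",
    "email_address": "asha@example.com",
    "card_number": "4111111111111111",
}
masked = {col: mask_value(val, classify_column(col)) for col, val in row.items()}
```

Keeping classification separate from masking means the same labels can also drive access policies and audit rules, not only the masking step.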
Metadata Management
By definition, Metadata “includes information about technical and business processes, data rules and constraints, and logical and physical data structures.”
Gartner defines metadata as information that describes various facets of an information asset in order to improve its usability throughout its life cycle. In simpler terms, metadata is structured information that describes and explains data in order to make it easier to locate, retrieve, use and manage an information resource. It gives data the context and meaning needed to derive insights.
Metadata Management answers questions about the overall context of enterprise data assets:
- Who: e.g. Who created this data?
- What: e.g. What is the business definition of this column?
- Where: e.g. Where is this data element used?
- Why: e.g. Why is this data element needed – its usage and purpose?
- When: e.g. When was this data element created, updated or accessed?
- How: e.g. How many applications need this data element?
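One way to picture these six questions is as fields on a metadata record. The sketch below is only an assumption about how such a record could look; the class name `ColumnMetadata`, its fields and the sample values are hypothetical and do not describe any particular product's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ColumnMetadata:
    """Hypothetical metadata record for a single data element."""

    name: str
    created_by: str                                   # Who created this data?
    business_definition: str                          # What does it mean?
    used_in: list[str] = field(default_factory=list)  # Where is it used?
    purpose: str = ""                                 # Why is it needed?
    created_at: datetime = field(                     # When was it created?
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def consumer_count(self) -> int:
        """How many distinct applications use this data element?"""
        return len(set(self.used_in))


claim_amount = ColumnMetadata(
    name="claim_amount",
    created_by="claims-etl",
    business_definition="Amount claimed by the policy holder",
    used_in=["report.claims_dashboard", "ml.fraud_features", "report.claims_dashboard"],
    purpose="Claims reporting and fraud scoring",
)
```

Here `claim_amount.consumer_count()` counts distinct consumers, so the repeated dashboard entry is counted once.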
Current metadata management solutions focus on data warehouses, BI and ETL tools. With the rise of social media, IoT and cloud systems, future metadata management solutions will shift their focus towards Big Data platforms, media files, social media, machine learning/AI and so on. This shift requires tracking metadata and lineage across heterogeneous systems and calls for embedding Machine Learning, AI, NLP and similar techniques in these solutions.
Keeping this fundamental shift in mind, we are currently building accelerators for metadata management solutions for our customers.
Key capabilities of Metadata Management solutions:
- Metadata repository
- Business Glossary
- Data Lineage
- Impact Analysis
- Metadata ingestion and translation
- Metadata Exchange
- Business Rules
- Workflow Management
- User Experience
Our Accelerator - MetaXpress
Our Data & Analytics Practice is currently building a next-generation Data Discovery Platform. This platform helps customers answer questions of the following kinds:
- Search and tagging based, e.g. I am looking for a table with data on policy claims
- Lineage based, e.g. If this event is down, what datasets are going to be impacted?
- Network based, e.g. I want the list of all tables my peer or my manager uses
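The lineage-based question above is essentially a graph traversal. The following sketch assumes lineage is stored as a simple mapping from each asset to the assets it feeds; the asset names and the function `downstream_impact` are hypothetical and are not taken from MetaXpress.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "events.claims_submitted": ["raw.claims"],
    "raw.claims": ["dw.policy_claims"],
    "raw.customers": ["dw.policy_claims"],
    "dw.policy_claims": ["report.claims_dashboard", "ml.fraud_features"],
}


def downstream_impact(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}
    queue = deque([asset])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against cycles and shared children
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted_by_event = downstream_impact("events.claims_submitted", LINEAGE)
```

Breadth-first order lists the closest dependents first, which is usually the order in which owners need to be notified.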
Some of the salient features of MetaXpress are:
- Search for artefacts based on tags, name, description, owners, usage etc.
- Search across a variety of sources such as data stores, dashboards/reports, events/schemas, streams, ELT/ETL flows and users
- Identify owners and usage patterns
- Impact analysis of any change
- Data profiling statistics
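As a rough illustration of searching artefacts by tags, name, description and owner, the sketch below scores each catalog entry by how many query terms it contains. The `Artefact` class, the sample catalog and the scoring rule are all assumptions made for this example; a real platform would more likely rely on a dedicated search index.

```python
from dataclasses import dataclass, field


@dataclass
class Artefact:
    """Hypothetical catalog entry."""

    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)


def search(catalog: list[Artefact], query: str) -> list[str]:
    """Return artefact names ranked by the number of query terms they match."""
    terms = query.lower().split()
    scored = []
    for artefact in catalog:
        # Search over name, description, owner and tags together.
        haystack = " ".join(
            [artefact.name, artefact.description, artefact.owner, *artefact.tags]
        ).lower()
        score = sum(1 for term in terms if term in haystack)
        if score:
            scored.append((score, artefact.name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


catalog = [
    Artefact("dw.policy_claims", "One row per insurance policy claim", "claims-team", {"claims", "policy"}),
    Artefact("dw.customers", "Customer master records", "crm-team", {"customer", "pii"}),
    Artefact("report.claims_dashboard", "Weekly claims volume report", "claims-team", {"claims", "dashboard"}),
]
results = search(catalog, "policy claims")
```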
As part of MetaXpress, we have ready-made extractors and models to ingest metadata from the following data sources:
- Relational Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
- NoSQL: MongoDB
- Cloud Sources: Google BigQuery, AWS Athena, AWS Presto, Snowflake
- Hadoop: Hive
- Real-time: Kafka
- Reporting, Visualization: Microsoft SQL Server Reporting Services, Apache Superset
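The article does not describe how the MetaXpress extractors are built. As a generic illustration of what a relational metadata extractor does, the sketch below reads table and column metadata from an in-memory SQLite database, used here only as a stand-in because it ships with Python. The function name `extract_table_metadata` and the output shape are assumptions.

```python
import sqlite3


def extract_table_metadata(conn: sqlite3.Connection) -> list[dict]:
    """Collect table and column metadata from a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    metadata = []
    for (table_name,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        metadata.append(
            {
                "table": table_name,
                "columns": [
                    {
                        "name": col[1],
                        "type": col[2],
                        "not_null": bool(col[3]),
                        "primary_key": bool(col[5]),
                    }
                    for col in columns
                ],
            }
        )
    return metadata


# Build a small sample schema to extract from.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policy_claims ("
    "claim_id INTEGER PRIMARY KEY, policy_no TEXT NOT NULL, amount REAL)"
)
catalog_entries = extract_table_metadata(conn)
```

Other sources expose similar information through their own system catalogs or APIs, so an extractor for each source would mainly differ in how it queries that catalog.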
Summary
Amid growing global and local regulation and the emergence of hybrid cloud environments, organizations need to manage their Data Governance strategy and processes more efficiently. This requires them to better understand and use their metadata, which helps them through their journey of data discovery, quality and governance. Our accelerator MetaXpress is designed to help clients achieve that with ease and efficiency.