In today’s dynamic business environment, data is an invaluable new currency whose exponentially growing impact dominates our lives. Organizations are vying with each other to utilize large volumes of data to gain deep business insights that inform their decisions.
Open Source Big Data technologies like Hadoop, Data Lakes, and NoSQL have made implementing Big Data architectures more affordable, but organizations continue to face challenges in moving data into these systems. Data security, handling large data volumes, variable speeds of data availability, and latency are other critical architectural intricacies to deal with.
In the process, most organizations find themselves juggling multiple repositories and platforms, creating more data silos than unified platforms for insights. Disparate technology stacks tend to hamper timely, integrated data delivery to business users, customers, and partners. As a result, data is scattered across multiple operational and analytical systems, which is the biggest challenge today’s organizations face.
Traditional Technologies are Falling Short
To get an integrated view of the siloed data, organizations have traditionally used bulk/batch ETL (Extract, Transform, and Load) mechanisms before moving the processed data into an Enterprise Data Warehouse (EDW). However, many organizations are quick to realize that physical consolidation and replication of data are impractical for meeting the stringent needs of data integration and business agility. Moreover, a typical EDW implementation takes 12-24 months, which is too long to wait for businesses trying to move their decision making into near real time. There is also a high likelihood that the business requirements will have changed by the time EDW development is finally complete, leaving no choice but to rework all the downstream dashboards and analytics.
Significant technologies have emerged in the wake of the mainstream bulk/batch offerings. Though they may represent a significant shift in vision and execution, they do not address all data integration delivery requirements. Organizations will need more sophisticated architectures to integrate the multi-channel interactions that crisscross our lives today.
Distributed mobile devices, consumer apps and applications, multi-channel interactions, and even social media interactions are driving organizations to build highly sophisticated integration architectures. According to a recent survey, as many as 40% of organizations are utilizing Data Virtualization, message queues, or simple replication with layers of data processing afterward.
Data Virtualization Delivers on its Promise
Data Virtualization (DV) technology, which combines disparate data sources in a logical layer or virtualized database, has made great strides lately. DV maps the data from disparate sources (on-premises, cloud, or external) into a virtualized layer which can then be seamlessly exposed to consumer applications. This is a much faster approach since data need not physically move out of its source systems. However, it should be understood that DV is not a replacement for the EDW, but complementary to it. Instead of creating a historical data layer, it sets up a virtualized layer over the operational systems, including the EDW, which can serve as a common, integrated source of information that all downstream applications and reporting tools can draw from.
Many of the new-age Data Virtualization tools now boast massively parallel processing (MPP) capabilities, along with dynamic query optimization, performance optimization, incremental caching of large datasets, read/write access to data sources (including Hadoop, NoSQL, cloud data stores, etc.), advanced security, metadata functionality that allows users to inventory distributed data assets, and persistence of federated data stores.
Zeroing in on Provider Segments
According to Gartner, through 2019, 90% of the information assets from Big Data analytics efforts will be siloed and practically unusable across multiple business processes. It is also expected that through 2020, 50% of enterprises will implement some form of Data Virtualization as an enterprise production option for data integration. As Data Virtualization solutions gain traction and add features and functionality, the technology has become one of interest for many traditional data integration, database, and application tool vendors. Overall, four distinct provider segments have emerged:
Standalone solution providers - These include thought leaders and pioneers like Denodo and Tibco.
Traditional data integration vendors - These vendors are incorporating Data Virtualization capabilities into their existing solutions, either as a separate product or as a complementary capability. They include data integration stalwarts like Informatica, IBM, and SAS.
Database vendors - Leading database vendors such as Oracle and Microsoft are extending access to data through Data Virtualization via database links.
BI/Analytics providers - All new-age BI/Analytics tools provide a virtual semantic layer which developers or advanced business users can use to develop reports, dashboards, or other ad-hoc analyses.
Data Virtualization: A Logical Representation
The picture below represents the typical logical components offered by various Data Virtualization providers:
Data Ingestion:
This layer includes connectors to enable access to data from various databases, Big Data systems (Hadoop, NoSQL etc.), enterprise service bus, message queues, enterprise applications (including ERPs and CRMs), data warehouses and data marts, mainframes, cloud data systems, SaaS applications, and various file formats.
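To make the idea concrete, here is a minimal Python sketch of how an ingestion layer might register heterogeneous sources behind a uniform interface. The source names, connection details, and fetch functions are illustrative assumptions, not any particular vendor’s API.

```python
# Minimal sketch of a connector registry for the data ingestion layer.
# Source names, paths, and URLs below are hypothetical.
import json
import sqlite3
from urllib.request import urlopen

def fetch_from_rdbms(query):
    # Relational source: a local SQLite file stands in for an EDW or data mart.
    with sqlite3.connect("warehouse.db") as conn:
        return conn.execute(query).fetchall()

def fetch_from_rest_api(url):
    # SaaS / cloud application exposed over HTTP, returning JSON.
    with urlopen(url) as resp:
        return json.load(resp)

def fetch_from_file(path):
    # Flat-file source (e.g. an exported CSV) read line by line.
    with open(path) as f:
        return [line.rstrip("\n").split(",") for line in f]

# The virtualization layer sees every source through the same interface.
CONNECTORS = {
    "edw_orders": lambda: fetch_from_rdbms("SELECT * FROM orders"),
    "crm_customers": lambda: fetch_from_rest_api("https://crm.example.com/api/customers"),
    "legacy_extract": lambda: fetch_from_file("legacy_extract.csv"),
}
```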
Security:
This stack provides authentication and authorization mechanisms.
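As a rough illustration, the sketch below shows a simple role-based authorization check on virtual views; the roles, view names, and permission table are hypothetical.

```python
# Minimal sketch of view-level authorization in the security stack.
# Roles, view names, and the permission table are illustrative assumptions.
VIEW_PERMISSIONS = {
    "customer_360": {"analyst", "underwriter"},
    "risk_exposure": {"risk_officer"},
}

def authorize(user_roles, view_name):
    """Return True if any of the user's roles may query the given virtual view."""
    allowed = VIEW_PERMISSIONS.get(view_name, set())
    return bool(allowed & set(user_roles))

# Example: an analyst can read customer_360 but not risk_exposure.
assert authorize({"analyst"}, "customer_360")
assert not authorize({"analyst"}, "risk_exposure")
```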
Federated and distributed query engine:
This is the core component of any Data Virtualization technology. It accepts incoming queries and creates the most efficient execution plan by breaking the incoming query into multiple sub-queries, which are then sent to the source systems via the data ingestion layer. The retrieved datasets are then joined in memory to create a composite data view, through which data is made available to all client applications.
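The following Python sketch illustrates the idea under simplified assumptions: a request for a composite “customer orders” view is broken into two sub-queries, each pushed down to its own source, and the results are joined in memory. The source names and schemas are invented for the example.

```python
# Minimal sketch of federation: one logical view, two sub-queries, in-memory join.
import sqlite3

def query_crm(conn):
    # Sub-query sent to the CRM database: customer attributes only.
    return conn.execute("SELECT customer_id, name, segment FROM customers").fetchall()

def query_orders(conn):
    # Sub-query sent to the order system: aggregated order amounts.
    return conn.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()

def federated_customer_orders(crm_conn, orders_conn):
    customers = {cid: (name, segment) for cid, name, segment in query_crm(crm_conn)}
    totals = dict(query_orders(orders_conn))
    # In-memory join on customer_id produces the composite data view.
    return [
        {"customer_id": cid, "name": name, "segment": segment,
         "total_spend": totals.get(cid, 0)}
        for cid, (name, segment) in customers.items()
    ]
```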
Data Objects:
A typical implementation will have a hierarchy of views and data services that encapsulate the business logic.
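A minimal sketch of such a hierarchy, assuming hypothetical view and field names: base views wrap raw source access, and a derived view composes them while encoding a business rule.

```python
# Minimal sketch of a view hierarchy over the connector layer.
# `fetch` is assumed to be a callable that resolves a source name to rows;
# the record shapes below are assumptions for the example.
def base_customers(fetch):
    # Base view: thin wrapper over the source connector (list of dicts with "id").
    return fetch("crm_customers")

def base_orders(fetch):
    # Base view: (customer_id, amount) tuples from the order source.
    return fetch("edw_orders")

def derived_high_value_customers(fetch, threshold=10_000):
    # Derived view: composes the base views and encodes the business rule
    # "high value" = lifetime spend above the threshold.
    spend = {}
    for customer_id, amount in base_orders(fetch):
        spend[customer_id] = spend.get(customer_id, 0) + amount
    return [c for c in base_customers(fetch) if spend.get(c["id"], 0) > threshold]
```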
Caching and optimization:
Caching and optimization help improve the performance of incoming queries. This includes both in-memory and physical storage caches, along with an option to configure full or partial caching of views. Physical storage can use files in a proprietary format or standard RDBMS systems.
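The sketch below illustrates one common caching pattern, a time-to-live result cache keyed by view and parameters; the TTL value and the in-memory dictionary standing in for a cache store are assumptions made for the example.

```python
# Minimal sketch of result caching for virtual views.
import time

_cache = {}  # (view_name, params) -> (expiry_timestamp, rows)

def cached_query(view_name, params, execute, ttl_seconds=300):
    # `params` must be hashable (e.g. a tuple); `execute` runs the federated query.
    key = (view_name, params)
    entry = _cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                      # cache hit: skip the sources entirely
    rows = execute(view_name, params)        # cache miss: go back to the sources
    _cache[key] = (time.time() + ttl_seconds, rows)
    return rows
```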
Data Distribution:
The Data Distribution layer exposes the data in response to the queries received through various protocols. A typical consumption layer publishes data in various formats, including JSON and XML, and supports web services such as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP), as well as Java Database Connectivity (JDBC), Open Database Connectivity (ODBC), OData, and message queues.
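As an illustration of the consumption side, the following sketch publishes a virtual view as JSON over a REST endpoint using Python’s standard HTTP server; the view name and handler wiring are hypothetical, and real products also expose JDBC/ODBC, OData, SOAP, and message queue interfaces.

```python
# Minimal sketch of the distribution layer: a REST endpoint serving a view as JSON.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_view(name):
    # Stand-in for the federated query engine described above.
    return [{"customer_id": 1, "total_spend": 12000}]

class ViewHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /views/customer_orders returns that view as JSON.
        view_name = self.path.rsplit("/", 1)[-1]
        body = json.dumps(run_view(view_name)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ViewHandler).serve_forever()
```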
Design and administration tools:
This includes tools for graphical design and import/export of data objects, integration with source code configuration management tools like Subversion and TFS, and administration and configuration. Some of the advanced tools also provide intelligent design recommendations.
Data Source Write Back:
The ability to write transformed data back to the source from which it was initially extracted is becoming more common. Many of the new-age tools now allow users to write data back either as new records or as updates to existing ones.
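A minimal sketch of write-back, assuming a relational source and invented table and column names: each changed row is applied as an update, and rows with no existing match are written back as new records.

```python
# Minimal sketch of source write-back to a relational system.
import sqlite3

def write_back(conn, rows):
    # `rows` is assumed to be a list of dicts with "customer_id" and "segment".
    for row in rows:
        updated = conn.execute(
            "UPDATE customers SET segment = ? WHERE customer_id = ?",
            (row["segment"], row["customer_id"]),
        ).rowcount
        if updated == 0:
            # No existing record: write the row back as a new record instead.
            conn.execute(
                "INSERT INTO customers (customer_id, segment) VALUES (?, ?)",
                (row["customer_id"], row["segment"]),
            )
    conn.commit()
```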
Metadata Catalog:
Data Virtualization tools primarily support two types of Metadata:
Design and configurations: These include Metadata related to source systems, the mapping of source systems to the logical data objects created during design, and run-time execution data. This could be stored either in XML or a proprietary format, which also helps in sharing the Metadata with other data management tools performing data quality, ETL, or Master Data Management functions.
Additional catalog: The design and configuration data sometimes needs to be queried. Most tools provide additional cataloging features that enable business users to tag and categorize Metadata elements. Frequently used Metadata queries can also be saved.
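To illustrate the cataloging side, here is a small sketch in which business users tag logical data objects and save a frequent metadata query for reuse; the catalog structure and object names are assumptions for the example.

```python
# Minimal sketch of a metadata catalog with tagging and saved queries.
CATALOG = {
    "customer_360": {"source": "crm_customers + edw_orders", "tags": set()},
    "risk_exposure": {"source": "risk_mart", "tags": set()},
}
SAVED_QUERIES = {}

def tag(object_name, label):
    CATALOG[object_name]["tags"].add(label)

def find_by_tag(label):
    return [name for name, meta in CATALOG.items() if label in meta["tags"]]

def save_query(query_name, label):
    # A frequent metadata query (e.g. "everything tagged PII") saved for reuse.
    SAVED_QUERIES[query_name] = lambda: find_by_tag(label)

tag("customer_360", "PII")
save_query("pii_objects", "PII")
assert SAVED_QUERIES["pii_objects"]() == ["customer_360"]
```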
Making Strides with Agile Data Architecture
Today’s organizations are increasingly leveraging Data Virtualization for Data Services for Underwriting Desktop, Agile Sales Reporting, Risk Analytics, Operational Effectiveness, Advanced Customer Analytics, Risk Data Aggregation, Mergers, Acquisitions, and Migrations, and to insure against risk during legacy modernization. Data Virtualization is also finding a place in several horizontal use cases, such as providing controlled access to data for data governance, analytics, data lake/Big Data, cloud solutions, and data services.
Given its ability to fit into multiple data-related use case architectures, Data Virtualization is a unique technology. It is likely to see further innovation and large-scale adoption in a very short time.