Data Virtualization for Hybrid Data Integration

A perfect storm of growing data sources, volumes, and varieties, along with the empowerment of self-service business analytics, is driving the need for rapid, agile access to hybrid integrated data sources. Organizations are no longer willing to wait months or years for physical data warehouses to be developed. Bulk and batch data movement techniques might not even be viable. To address these issues, defy data gravity challenges, deliver unified dimensional views, and simplify complex hybrid data integration, logical data warehouses and data virtualization are emerging in enterprise data strategies.

Compelling Time and Cost Advantages

Data virtualization is modern data integration. It performs many of the same transformation functions as traditional data integration approaches such as extract, transform, and load (ETL), data replication, data federation, or Enterprise Service Bus (ESB), but it uses newer technologies to deliver real-time data integration at lower cost, with more speed and flexibility.

According to industry research, time-to-value for logical data warehouse and data virtualization solutions can be 5 to 10 times faster than traditional data warehouse development approaches. There are also significant cost reductions from virtual data entity reuse and the elimination of data replication processes and redundant data storage.

Hybrid Data Integration

A logical data warehouse is a combination of hardware and software. Much like a traditional data warehouse, a logical data warehouse organizes data by subject and honors the concepts of time intelligence. Data architects define logical data views or deliver data-as-a-service to many different information consumers. Notably, the logical data warehouse entities are not materialized, and the original source data is not copied.

Source: Denodo

Data virtualization logical views execute distributed queries that combine structured and unstructured data sources in real time, without requiring programming or complicated ETL processes. The data virtualization logical layer serves as a semantic layer that buffers reporting applications from data source changes. With point-and-click ease, difficult hybrid integration across many different types of data sources becomes fast and flexible (a conceptual sketch of such a federated query follows the list below):

  • On-premises or cloud databases
  • IoT, NoSQL or Hadoop data sources
  • SaaS cloud application APIs
  • Web services, SOAP or RESTful data services
  • Web site page content
  • Google search results
  • XML, JSON, and BAPI queries
  • Structured, delimited or Excel files
  • Unstructured document content
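
To make this concrete, below is a minimal, vendor-neutral Python sketch of what a virtualization layer does conceptually: it pulls rows from a relational source and from a REST API at query time and joins them in memory, with no ETL staging. The database file, endpoint URL, and column names are hypothetical placeholders, not any vendor's actual API.

```python
# Conceptual sketch of a federated, on-demand join across hybrid sources.
# The database file, REST endpoint, and column names are hypothetical.
import sqlite3

import pandas as pd
import requests


def query_virtual_view() -> pd.DataFrame:
    # Structured source: a relational table queried in place (nothing copied to a warehouse).
    with sqlite3.connect("sales.db") as conn:
        orders = pd.read_sql_query("SELECT customer_id, order_total FROM orders", conn)

    # Semi-structured source: a SaaS/REST API returning JSON at request time.
    response = requests.get("https://api.example.com/customers", timeout=30)
    customers = pd.DataFrame(response.json())  # expects records containing a customer_id field

    # The "virtual view": a real-time join performed only when a consumer asks for it.
    return orders.merge(customers, on="customer_id", how="inner")


if __name__ == "__main__":
    print(query_virtual_view().head())
```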

To see how straightforward it is for data architects to define virtual views across many different types of structured and unstructured data sources, watch a quick 10-minute demonstration of Cisco’s Composite Software data virtualization offering. The image below is a peek into Denodo’s data virtualization administration view.

Source: Denodo

For self-service reporting users, hybrid data integration views can be queried from data virtualization platforms with simple ODBC connections. Logical data source views look just like physical data source views in self-service reporting tools such as Tableau, Excel, Power BI Desktop, or TIBCO Spotfire.
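
For example, a script or report can query a published virtual view over a plain ODBC connection, just as it would a physical table. The sketch below assumes the platform's ODBC driver is installed and registered as a DSN; the DSN, credentials, and view name are placeholders.

```python
# Querying a published virtual view over ODBC; the DSN, credentials, and view name are placeholders.
import pyodbc

# The data virtualization server is exposed through a standard ODBC data source name (DSN).
conn = pyodbc.connect("DSN=DataVirtualizationServer;UID=report_user;PWD=secret")
cursor = conn.cursor()

# To the consumer, the logical view behaves exactly like a physical table.
cursor.execute(
    "SELECT region, SUM(revenue) AS total_revenue FROM vw_sales_summary GROUP BY region"
)
for region, total_revenue in cursor.fetchall():
    print(region, total_revenue)

conn.close()
```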

Views on Views

A data virtualization semantic layer typically organizes its views into several layers (a minimal sketch of this layering follows the reference architecture diagram below):

  • Physical Layer or Connection Views that access source data
  • Business Layer or Integration Views that link data from various sources
  • Application Layer or Consumer Views that present user-friendly data
Composite Software’s Data Abstraction Reference Architecture
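
The layering itself can be illustrated with ordinary SQL views stacked on one another. The sketch below uses SQLite only to stay self-contained and runnable; the table, column, and view names are made up and do not reflect any vendor's view language.

```python
# Illustrating physical -> business -> application view layering with stacked SQL views.
# SQLite keeps the sketch self-contained; all names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, cust_id INTEGER, amount REAL);
    CREATE TABLE src_customers (cust_id INTEGER, cust_name TEXT);
    INSERT INTO src_orders VALUES (1, 10, 250.0), (2, 20, 75.5);
    INSERT INTO src_customers VALUES (10, 'Acme'), (20, 'Globex');

    -- Physical / connection layer: one view per source object, no business logic.
    CREATE VIEW phys_orders    AS SELECT * FROM src_orders;
    CREATE VIEW phys_customers AS SELECT * FROM src_customers;

    -- Business / integration layer: link data from the source views.
    CREATE VIEW biz_customer_orders AS
        SELECT c.cust_name, o.amount
        FROM phys_orders o JOIN phys_customers c ON o.cust_id = c.cust_id;

    -- Application / consumer layer: user-friendly naming and shaping.
    CREATE VIEW app_sales_by_customer AS
        SELECT cust_name AS customer, SUM(amount) AS total_sales
        FROM biz_customer_orders GROUP BY cust_name;
""")

print(conn.execute("SELECT * FROM app_sales_by_customer").fetchall())
```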

Data Gravity Challenges

Data virtualization is a good option for data architects in proofs of concept, and when development time is limited, source data is well defined, requirements can change, real-time query access is needed, and reporting queries return smaller result sets.

Data virtualization does not replace ETL or bulk data operations, since virtualization platforms are not designed for sophisticated data transformations, complex business logic, data cleansing, or bulk data transfer. Logical views are typically created by joining, merging, or unioning hybrid data sets.

Note that if a source system does not maintain historical data, data virtualization can only provide as-is, current-state views when queried. Historical data views would require traditional data warehousing techniques, such as slowly changing dimension logic, so that users can accurately report on that data source over time.
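
To illustrate the gap, contrast an as-is current view with a warehouse-style type 2 slowly changing dimension, which keeps effective-date ranges so that history can be queried as of any point in time. The table, values, and dates below are invented for this sketch.

```python
# A type 2 slowly changing dimension retains effective-date ranges so history is reportable;
# a virtual view over a source that keeps only the latest value cannot reconstruct this.
# All names, values, and dates are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer_scd2 (
        cust_id INTEGER, region TEXT, valid_from TEXT, valid_to TEXT, is_current INTEGER
    );
    -- The customer moved regions; both versions of the row are retained.
    INSERT INTO dim_customer_scd2 VALUES
        (10, 'East', '2015-01-01', '2016-06-30', 0),
        (10, 'West', '2016-07-01', '9999-12-31', 1);
""")

# "As-of" reporting against the retained history.
as_of = "2016-01-15"
row = conn.execute(
    "SELECT region FROM dim_customer_scd2 WHERE cust_id = 10 AND valid_from <= ? AND valid_to >= ?",
    (as_of, as_of),
).fetchone()
print(row)  # ('East',) -- the region the customer belonged to on that date
```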

In the real world, data location, network speeds, and throughput capabilities cannot be ignored when dealing with hybrid data integration. Although the performance of distributed data virtualization views has historically been challenging to optimize, these technologies are improving with advanced memory caching, throttling, prioritization, and resource management capabilities. Newer intelligent, dynamic query optimizers can determine the best execution plan at query time.

Enterprise Data Catalogs and More

Several data virtualization vendors also include searchable enterprise data catalogs that let reporting users find published data and enrich its metadata documentation with their own subject matter expertise. Additional data virtualization platform features include data governance, data lineage, security, logging, and auditing.

For more information on industry leading data virtualization platforms, check out Informatica, Denodo, Cisco Composite Software, or IBM (InfoSphere Federation Server).