Business Intelligence and Data Virtualization
What is data virtualization and what are its advantages and disadvantages?

Disclaimer: This article was translated from the Czech language by AI.
We have divided this article into two parts. The first part is a managerial summary, where you will learn what data virtualization actually is and what its advantages and disadvantages are, without being overwhelmed by technical terms. If you’re the technical type, however, we recommend skipping the managerial summary and going straight to the more detailed technical explanation of what data virtualization is all about.
Managerial summary
The foundation of Business Intelligence is quality data. Companies invest heavily in reporting and data analytics to better optimize business processes, find new opportunities, minimize risks and generally improve performance. Those who have relevant information can also react faster and gain a competitive advantage.
Over time, a number of variations of Business Intelligence approaches and architectures have evolved, all aiming at the same thing: to deliver reliable information to the target user. The most common is an architecture built around a data warehouse. Corporate data is imported on a regular, usually daily, basis into a relational database (the data warehouse), from which reports are then generated and analyses are performed. This architecture, like any other, has its advantages and disadvantages. Advantages include consolidated, cleansed data from different systems in one user-friendly model, data historization and data security.
The disadvantages of data warehouses, however, are often the time and cost of expanding with new data and generally implementing any changes. From a business perspective, a company needs to react very quickly. If the market situation changes, there is a need to adapt to it or take advantage of a sudden new opportunity. A “data driven” company will want to use as much information as possible to make an informed decision. However, adding a new area of data to the data warehouse is usually a matter of weeks, in many corporations more like months, which is desperately slow in turbulent times.
Data warehouses are also usually built for daily data processing. At night, when there is the least traffic in the data warehouse and in corporate applications, data is transferred to the data warehouse by automated processes, cleaned and prepared for users. This is a good approach until, for the sake of response speed, the data needs to be processed the moment it changes. A traditional data warehouse cannot respond to this; real-time data integration is usually handled by other tools such as a real-time ODS, but this is another element of the architecture that requires a high development investment and operational costs.
There have been tools on the market for many years that can help us address the above shortcomings, but only recently, due to the increasing pressure for speed and flexibility, have they gained more popularity. These are tools for data virtualization (you may also encounter the term data federation). In virtualization tools we create an imaginary (virtual) data model. The source data remains in the source systems and is not copied anywhere. Using the virtual model, the data is only linked and offered to the target user. Let’s take the example of a CRM system with a customer table and an accounting system. Both systems work independently and are unaware of each other. In the virtualization tool, we link these applications, or rather the data from these applications, and provide the user with a consolidated view of the accounting data through the customers from the CRM, as if all the data were in one system. At the same time, we can cleanse and adjust the data for better presentation. Because we are looking directly at the source data through this virtual model, there is no information lag. We are looking at real data directly from the sources.
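To make the idea concrete, here is a minimal sketch of what such a virtual view does behind the scenes: the CRM and the accounting system stay untouched, and the join happens only at the moment the user asks for the data. This is purely illustrative; the connection strings, table and column names are hypothetical, and a real virtualization tool does this declaratively rather than in hand-written code.

```python
import pandas as pd
import sqlalchemy as sa

crm_engine = sa.create_engine("postgresql://crm-host/crm")            # CRM database (hypothetical)
acct_engine = sa.create_engine("postgresql://accounting-host/acct")   # accounting database (hypothetical)

def consolidated_customer_invoices(customer_ids=None) -> pd.DataFrame:
    """Build the consolidated view on demand; nothing is copied or stored anywhere."""
    customers = pd.read_sql("SELECT customer_id, name, segment FROM customers", crm_engine)
    invoices = pd.read_sql("SELECT invoice_id, customer_id, amount, issued_at FROM invoices", acct_engine)
    if customer_ids is not None:
        customers = customers[customers["customer_id"].isin(customer_ids)]
    # link the two systems through the shared customer identifier, only at query time
    return customers.merge(invoices, on="customer_id", how="left")
```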
Developing views or models is very fast. Literally in a few clicks we are able to create a business view of a new data area and connect it to other data in the data warehouse, data lake, ODS or anywhere else. Need to edit a data area and add more information? No problem. We simply drag and drop a new column from the source into the resulting model. Not satisfied with this model and want a new one? No problem either. We can create as many virtual models as we want.
The big advantage is that we don’t have to physically move the data anywhere, develop complex ETL procedures or deal with the impact of changes. But this approach also has its disadvantages. It does not address data historization or complex transformations. Connecting the virtualization tool directly to the source application databases also carries some risks, but you can read about that in the more technical section of this article.
So what if you need to keep track of historical data, cleansed and transformed in a data warehouse, but you also want to be able to quickly connect it to any data in your systems? Or perhaps some applications you can access directly, while with others you have to compromise? Or do you want to link data from your data warehouse, data lake, ODS, internal systems and external sources? Almost anything can be a source for data virtualization, from text files to powerful analytical databases. Thus, a data warehouse with historical data can be one of the sources for data virtualization, and the virtualization tool will connect your historical data with the current data from the source applications. There can be many such scenarios; it just depends on the needs of the end user.
If you are interested in data virtualization and would like to read how it all works, you can continue to the technical part of the article.
Data virtualization in technical detail
For quality BI analysis and display of business data we need to include data from various sources such as ERP systems, accounting systems, CRM systems, publicly available statistical data from the Internet or manual input from users and managers. Standard BI architecture tells us that we should report from a data warehouse or from data marts. ETL processes are the standard way to get data into the data warehouse. The pain of this architecture is the overly long process of implementing new business requirements. In general, the larger the corporation, the longer the time from business requirement approval to physical implementation. Another disadvantage is that data is processed on a daily basis, so the data consumer receives it with some delay.

In addition to the BI tool, we usually have other data consumers in the company using the same data sources. These can be, for example, websites, intranets, mobile applications or other enterprise applications. The problem is the same as with BI tools. If we need to use data that isn’t ready in the data warehouse, or we need data faster than once a day and we don’t want to violate the agreed architecture, we’re probably out of luck.

But there is an interesting possibility to solve this problem elegantly. That option is data virtualization, and it is provided to us by virtualization tools. A virtualization tool works on the principle of a virtual data model. In fact, you can have a physical data warehouse with a dimensional model next to a model from a virtualization tool and you won’t know the difference. If you’ve ever come across the term logical data warehouse or virtual data warehouse, it is probably data virtualization, where the logical data warehouse model is just an abstract logical layer on top of the data in the source systems.

On the input side of the virtualization tool are all possible available sources and technologies. This is actually one of the main requirements for such a system: the tool should be able to connect to data sources regardless of whether it is a data warehouse, the application database of an ERP system, a data lake, an on-premise or a cloud environment. A virtualization server usually has three logical layers. This is not always necessary, but it is a model proven by practice. Each logical layer plays a role in the overall solution.
The lowest/source layer is the source data image. This layer is the interface to the sources. After defining the connectivity to a source, we import from it the objects we want to use later. We only import object metadata, not data. Adding a new object to this layer is a matter of seconds. If we query the data through the lowest layer, we always get the original data from the source system. The only difference is that the data is in the form of a “virtual” relational table even if, for example, the original source data is a Parquet file on Hadoop, an Excel file, or some other type of non-relational data.
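As a rough illustration of what the source layer keeps, here is a small Python sketch that registers two hypothetical source objects, a Parquet file and an Excel sheet, and stores only their column metadata, never their rows. The file names and the shape of the metadata dictionary are invented for the example.

```python
import pandas as pd
import pyarrow.parquet as pq

def register_parquet_source(path: str) -> dict:
    """Import the schema of a Parquet file without loading a single row."""
    schema = pq.read_schema(path)
    return {name: str(dtype) for name, dtype in zip(schema.names, schema.types)}

def register_excel_source(path: str, sheet_name=0) -> dict:
    """Read only the header row of an Excel sheet to learn its columns."""
    header = pd.read_excel(path, sheet_name=sheet_name, nrows=0)
    return {name: "unknown" for name in header.columns}

# the source layer holds metadata only; the data itself stays in the files
source_layer = {
    "sales_parquet": register_parquet_source("sales.parquet"),   # hypothetical file
    "budget_excel": register_excel_source("budget_2021.xlsx"),   # hypothetical file
}
```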
The middle/integration layer is all about data integration and standardization. The output is a virtual table that can be used in the other logical layers. In the integration layer, we deal with linking the physical tables by setting up relationships between them. We also deal with text unification, filling in missing values, filling in default values according to business logic, and standardizing values by translating them or by using a code list or mapping table.
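The following small sketch shows the kind of standardization the integration layer typically performs on a virtual table: unifying text, filling in defaults and translating raw codes through a code list. The column names, codes and the use of pandas are of course only illustrative.

```python
import pandas as pd

# hypothetical code list (mapping table) for country codes
COUNTRY_CODE_LIST = {"CZ": "Czech Republic", "SK": "Slovakia", "DE": "Germany"}

def integrate_customers(raw: pd.DataFrame) -> pd.DataFrame:
    out = raw.copy()
    out["name"] = out["name"].str.strip().str.title()              # text unification
    out["segment"] = out["segment"].fillna("UNKNOWN")              # default value per business logic
    out["country"] = out["country_code"].map(COUNTRY_CODE_LIST)    # translation via the code list
    return out
```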
The topmost/presentation layer serves as an interface towards the user. In this layer, we create virtual models that serve as output for data consumers. The virtual models are based on the integration layer and their creation is very simple. We usually prepare the model by dragging and dropping columns from tables in the integration layer into a new table in the presentation layer. Where possible, the virtualization tool handles the joins and everything else for us. We have a lot of freedom in creating models and can create them exactly as the consumer needs them. If the consumer needs a model “a la” star schema, we create a star schema. Similarly, we can model a cross-section of business processes using dimensional modeling and thus create a virtual data warehouse. Alternatively, we can create very wide tables with many columns for the needs of some business application.
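To round off the three layers, here is a sketch of one possible presentation-layer model: a wide, report-friendly virtual table composed from the integration-layer views above. Everything here (view names, columns) is hypothetical; in a real tool this would be assembled with a few drags and drops rather than code.

```python
import pandas as pd

def sales_report_view(customers: pd.DataFrame, invoices: pd.DataFrame) -> pd.DataFrame:
    """customers and invoices stand in for integration-layer virtual tables."""
    wide = invoices.merge(customers, on="customer_id", how="inner")
    # keep and rename only the columns the consumer asked for
    return wide[["invoice_id", "issued_at", "amount", "name", "segment", "country"]].rename(
        columns={"name": "customer_name"}
    )
```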
Let’s give a small example of how virtualization works in practice.
Let’s imagine a company that has a BI tool for displaying company reports. The BI tool must get data from data sources before displaying the report. It does this by generating an SQL query to the source; the source processes the received SQL query and sends the requested data back to the BI tool. The BI tool then visualizes the data, for example, in the form of a report or dashboard. In a classic architecture, the BI tool is usually connected to a data warehouse or data mart.
In our virtualization example, the BI tool is not directly connected to the data warehouse, but to the virtualization server. If the reporting tool sends a query to the virtualization server and the virtualization server recognizes that the data comes from three different systems, it splits the input query into three smaller queries and sends them to the source systems in the language that each source system works in. When the data from the source systems arrives back at the virtualization server, it is merged or otherwise combined in the integration layer and the result is sent back through the presentation layer to the reporting tool as a single output. What goes on in the background of the virtualization server is hidden from the data consumer. All transformation steps such as filtering, masking, calculations, text functions, aggregations and data joins are done live, on demand. This is the main idea of data virtualization. Classic ETL also performs these transformations, but unlike a virtualization tool, it always stores the results somewhere before we can use them.
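Conceptually, the splitting and recombining described above could look like the following sketch: one incoming report query is answered by issuing one sub-query per source and merging the partial results inside the virtualization layer. The sources, sub-queries and join keys are all invented, and a real virtualization server would also translate each sub-query into the dialect of its source.

```python
import pandas as pd
import sqlalchemy as sa

# three hypothetical source systems behind one virtual model
SOURCES = {
    "crm": sa.create_engine("postgresql://crm-host/crm"),
    "erp": sa.create_engine("postgresql://erp-host/erp"),
    "billing": sa.create_engine("postgresql://billing-host/billing"),
}

# one sub-query per source (a real tool would generate these in each source's own dialect)
SUB_QUERIES = {
    "crm": "SELECT customer_id, name FROM customers",
    "erp": "SELECT customer_id, order_id, amount FROM orders",
    "billing": "SELECT order_id, paid FROM payments",
}

def federated_result() -> pd.DataFrame:
    parts = {name: pd.read_sql(sql, SOURCES[name]) for name, sql in SUB_QUERIES.items()}
    # combine the partial results inside the virtualization layer
    merged = parts["crm"].merge(parts["erp"], on="customer_id")
    return merged.merge(parts["billing"], on="order_id")
```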
An added value of virtualization tools is, for example, data lineage, which gives us information about the mapping of target attributes to source attributes. This is of course very helpful in impact analyses when technical changes are made to the data source objects. Furthermore, data virtualization can serve as an integration link when migrating data from on-premise environments to the cloud. Many virtualization tools have an integrated data catalog for easier retrieval of information, business concepts and so on. Some tools have direct integration with BI tools (for example Denodo -> Tableau). Virtualization tools can also address data security, so we don’t have to handle row-level security at the database level on each source separately, but can handle it in one place. Not to mention sources where security cannot be implemented at all. Most tools support the most commonly used authentication methods such as Native, LDAP/Active Directory, Kerberos, Windows SSO, OAuth 2.0, SAML and SPNEGO. On the authorization side, they support roles, integration with LDAP user groups, different access rights to functions within the application (metadata, execute, insert, create data source), security at different data levels (schema, view, column, row), dynamically changing security and more.
Of course, data virtualization raises many questions about whether its performance is sufficient.
Let’s take one example. An application sends a query that consists of 8 operations – aggregation, value replacement, filtering, etc. The virtualization server decomposes this query and finds that 6 of the 8 operations can be done by the source database server. The virtualization server is then left with two operations. Now the same example, but the data source is a filesystem. The filesystem is not capable of doing any complex operations, so in this case the virtualization server must do 6 of the 8 operations itself. So we are not able to determine the exact performance in advance. When we talk about the performance of a virtualization solution, we have to consider the performance of the virtualization server, but also the performance of the source server. The virtualization server always tries to push as much work as possible to the source, and everyone probably understands that it makes a big difference whether the source is an Excel file or Exadata. Virtualization tools always try to decompose the request into its component parts and send the request to the source in an optimal form.
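The example above is essentially a pushdown decision, which could be modelled with a toy sketch like this: for each operation in the decomposed query, check whether the source can execute it, and keep the rest on the virtualization server. The operation and capability lists are made up to mirror the 6-of-8 and 2-of-8 split from the example.

```python
# operations making up the decomposed query (invented names)
QUERY_OPERATIONS = ["filter", "project", "join", "aggregate",
                    "replace_values", "sort", "window", "mask"]

# what each hypothetical source is able to execute itself
SOURCE_CAPABILITIES = {
    "exadata": {"filter", "project", "join", "aggregate", "replace_values", "sort"},
    "filesystem": {"filter", "project"},  # plain files: almost nothing can be pushed down
}

def split_work(source: str):
    caps = SOURCE_CAPABILITIES[source]
    pushed = [op for op in QUERY_OPERATIONS if op in caps]      # delegated to the source
    local = [op for op in QUERY_OPERATIONS if op not in caps]   # done by the virtualization server
    return pushed, local

print(split_work("exadata"))     # 6 operations pushed down, 2 stay local
print(split_work("filesystem"))  # 2 operations pushed down, 6 stay local
```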
For example, if you are joining two tables from different sources and filtering on an attribute from table “A” while joining the two tables through that attribute, the virtualization server automatically applies that filter on source “B” as well, so that it joins as little data as possible at its own level. Another example might be a join of two tables from different sources, where an inner join with one table filters the other. The virtualization server first pulls data from one source and then filters the query to the other source based on the result. For joining large tables, a different trick is used. Instead of the virtualization server downloading the two large data sets and working on them itself, it “teaches” the two sources to communicate with each other and lets one of the servers handle the task. If it has the ability to create temp tables in one of the databases, it can transfer the table from the second database into a temp table in the first database and perform the join there.
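The second trick mentioned above (filtering one source based on the result from the other) roughly corresponds to this sketch: fetch the filtered keys from source A first and push them down as a filter to source B, so that only small result sets ever reach the virtualization server. Connection strings, tables and columns are hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

engine_a = sa.create_engine("postgresql://source-a/db")   # holds the customers table
engine_b = sa.create_engine("postgresql://source-b/db")   # holds the invoices table

def join_with_filter_pushdown(segment: str) -> pd.DataFrame:
    # 1. apply the user's filter on source A and fetch only the columns we need
    customers = pd.read_sql(
        sa.text("SELECT customer_id, name FROM customers WHERE segment = :seg"),
        engine_a, params={"seg": segment},
    )
    if customers.empty:
        return customers  # nothing matched, no need to query source B at all
    # 2. push the resulting keys down to source B instead of reading its whole table
    keys = ", ".join(str(k) for k in customers["customer_id"])
    invoices = pd.read_sql(
        f"SELECT customer_id, amount FROM invoices WHERE customer_id IN ({keys})", engine_b
    )
    # 3. only now join the two already-reduced result sets locally
    return customers.merge(invoices, on="customer_id")
```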
Another useful feature that provides higher performance in virtualization is caching. Caching can protect, for example, transactional databases from too many queries. We can ask the virtualization server to cache some data, for example an entire virtual table. Today, everybody associates caching with in-memory processing, meaning that the data is loaded into memory and held there for a period of time. This is not our case. In virtualization, caching means that the virtual data from virtual tables is physically stored, and of course we can define where. Any other query that wants to use this virtual view will query the cache table and no longer query the source database. An analogy from the database world is a materialized view. And we could give countless such examples. The mechanism inside a virtualization server always tries to minimize the data flow over the network and transfer the smallest possible data blocks. It is quite a science, and virtualization tools have to be good at it. And they really are. This clever mechanism is called the optimizer.
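A minimal sketch of this kind of caching, in the spirit of a materialized view, might look as follows: the result of a virtual view is written once into a physical cache table, and later consumer queries read the cache instead of hitting the protected transactional source. Engines, table names and the refresh policy are all hypothetical.

```python
import pandas as pd
import sqlalchemy as sa

source_engine = sa.create_engine("postgresql://oltp-host/app")     # protected transactional database
cache_engine = sa.create_engine("postgresql://cache-host/cache")   # where the cached data is stored

def refresh_cache() -> None:
    """Materialize the virtual view physically, e.g. on a schedule or on demand."""
    view = pd.read_sql("SELECT customer_id, name, segment FROM customers", source_engine)
    view.to_sql("customers_cache", cache_engine, if_exists="replace", index=False)

def query_customers() -> pd.DataFrame:
    """Consumer queries are now served from the cache table, not from the source system."""
    return pd.read_sql("SELECT * FROM customers_cache", cache_engine)
```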
The key tasks of the optimizer are:
- Inspecting and decomposing incoming SQL queries into smaller chunks when necessary or beneficial for performance.
- Retrieving metadata from the source tables: information about the data source, primary keys, foreign keys, indexes and partitions.
- Retrieving data statistics, such as data size estimates, for the cost-based optimizer.
- Finding out whether the source can process all the data, whether it is read-only or can create a temp table, whether it is a massively parallel processing database, the size of the database cluster, and other important information (a sketch of such metadata gathering follows below).
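For one relational source, the metadata and statistics gathering from the list above could be approximated with SQLAlchemy’s inspector, as in this hedged sketch. The connection string and table name are hypothetical, and the plain row count stands in for the richer statistics a real cost-based optimizer would collect.

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql://source-host/db")  # hypothetical relational source

def describe_source_table(table: str) -> dict:
    insp = sa.inspect(engine)
    info = {
        "columns": insp.get_columns(table),            # column names and types
        "primary_key": insp.get_pk_constraint(table),
        "foreign_keys": insp.get_foreign_keys(table),
        "indexes": insp.get_indexes(table),
    }
    # a crude size estimate; a real optimizer would read the source's statistics instead
    with engine.connect() as conn:
        info["row_estimate"] = conn.execute(sa.text(f"SELECT COUNT(*) FROM {table}")).scalar()
    return info
```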
At the beginning of this article, we talked about the logical data warehouse. So, is it possible to replace a traditional data warehouse with data virtualization? The unequivocal answer is no. Data virtualization is not a complete solution; you will always need additional components in your data architecture. However, incorporating it into your enterprise data architecture can bring benefits. For example, when combined with a data warehouse, it can provide you with a comprehensive view of historized data and live data at query time. Virtualization allows you to make rapid prototypes of data models that you can later implement permanently in the data warehouse. It also lets you do more complex analyses that won’t be repeated in the future, without any heavy model modification in the data warehouse. The downside of data virtualization is that it is not designed for overly complicated queries, for data with complex relationships or for data with poor data quality, and it cannot easily handle data historization. This is where the classic data warehouse in conjunction with ETL excels. For lovers of modern open source technologies and data warehouse enthusiasts, let me also mention some interesting ways of connecting a virtualization system with, for example, Hadoop. Technologies built around Hadoop can be used as a source of historical data (HDFS, Hive, Impala, Presto, SparkSQL). Hadoop can also be used for data caching.
If you decide to try this technology and later implement it, you should follow best practices. For a POC, choose a task with “representative” complexity. Use ready-made reports for testing. Select a task that will test both functionality and performance. When testing functionality, focus on the SQL queries generated by the reporting tool and the execution plans in the target database. When testing performance, test on realistic data volumes. Do a stress test. Test on real outputs and real usage. Compare performance with caching turned on and off. Simply put, test in a way that simulates real-world use as much as possible.
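One of the suggested measurements, comparing response times with caching on and off, could be scripted roughly like this. The endpoint and the report query are hypothetical and reuse the naming of the earlier sketches; a real POC would of course run the actual reports against the actual virtualization server.

```python
import time
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql://virtualization-host/presentation")  # hypothetical endpoint

REPORT_QUERY = "SELECT segment, SUM(amount) AS revenue FROM sales_report_view GROUP BY segment"

def average_response_time(query: str, runs: int = 5) -> float:
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pd.read_sql(query, engine)                  # execute the report query end to end
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# run once with the virtual table uncached and once with caching enabled,
# then compare the two averages
```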
And how to choose the right virtualization tool? Features and functions are very important. Focus mainly on the features you want to use. Be it different data sources, data transformation options or security. Performance is also important. There’s no point in choosing a solution that won’t perform, or worse, will cause performance issues in other business applications. The product should be locally supported. The experience of other users will also tell you a lot. Consider also the extra software and hardware you will need for master data management, data cleaning, data security, special connectors and drivers, database server for reference tables and caches, etc…
For the sake of completeness, let’s list some of the tools that are available on the market. The most dominant in Europe are DataVirtuality, Denodo, Fraxses and TIBCO; globally, it is Denodo and TIBCO. Below is a list of the top 10 virtualization tools in use globally today.
- AtScale
- DataVirtuality
- Denodo Platform
- Dremio
- Fraxses
- IBM InfoSphere Federation Server & IBM Data Virtualization Manager for z/OS
- Red Hat JBoss Data Virtualization
- Stone Bond Enterprise Enabler Virtuoso
- TIBCO Data Virtualization
- Zetaris
Sources:
TDWI München 2021 conference – Data virtualization in real-life projects / Rick F. van der Lans
http://vianovaarchitectura.nl/page/what-is-data-virtualization
Denodo Platform 8.0 – Demo Overview – https://www.youtube.com/watch?v=_ro0bqUQ1J0
Denodo 8.0 Standard Demonstration – https://www.youtube.com/watch?v=qKXkOAcdQl8
TIBCO Data Virtualisation Demonstration – https://www.youtube.com/watch?v=-Sx3ykvVUhs
Author: Michal Machata (Czech version)