Want to be Data-driven?

So be data-informed & data-inspired


What is Datalake

Datalake is a buzzword today, but ideas of what it actually means can be vastly different.


Disclaimer: This article was translated from the Czech language by AI.

In recent years, we have encountered many different opinions about what a datalake is. Since datalake is a buzzword nowadays, it is discussed by IT specialists as well as company management, yet ideas about what it means can be diametrically opposed: from the view that a datalake is just smarter storage, to the view that it is a modern replacement for the data warehouse. Much has been written about the datalake, so let’s summarize the facts.

The term datalake was coined by Pentaho’s CTO James Dixon. He introduced it on his blog in October 2010*, when his research around Hadoop highlighted the shortcomings of the traditional data architecture built around data warehouses and datamarts. He articulated these shortcomings in several points.

  • In the past, the standard way of reporting and analyzing data was to build a data warehouse and datamarts. One of the traditional rules of BI architecture was to store only the data relevant for analysis, then identify the most interesting attributes and aggregate them into datamarts. The disadvantage of this approach is that only a subset of attributes is examined, so only predefined questions can be answered. In addition, the data in the datamarts is aggregated, so the lowest level of detail is lost. In the future, many as-yet-unknown questions will arise that this architecture will not be able to answer quickly.
  • Companies have many data consumers from different departments and areas, with different levels of technical expertise. Covering all their needs in the traditional way, for example by creating a universal data model in the DWH, is very time-consuming and costly. Moreover, many data analyses will never be repeated, or will differ so much that building a data model for them makes no sense.
  • In the past, companies dealt overwhelmingly with structured or semi-structured data, while unstructured data was neglected, even though it is an important source of information for the company.
  • Over time, data has become so voluminous that it technically and/or economically no longer fits into traditional relational databases.

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.


* James Dixon, CTO, Pentaho (Blog, 2010)

So the original idea behind the datalake concept implies a philosophy different from that of the data warehouse. In a datalake, we store data in its original form and at the lowest possible level of detail. A datalake supports all forms of data – structured, semi-structured and unstructured – and storing it is economically advantageous. The datalake is based on the Hadoop philosophy and tools. It can serve as a data source for a traditional data warehouse, and it can run new types of analytical tasks alongside it. This definition clearly stated what a datalake is and how it should be used.
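To make the "original form, lowest detail" philosophy concrete, here is a minimal sketch of how a raw zone of a datalake is often organized: events are written unchanged, one object per event, under a partitioned path. The path layout and field names are illustrative assumptions, not a prescribed standard.

```python
from datetime import datetime

def raw_zone_key(source: str, event_time: datetime, event_id: str) -> str:
    """Build an object-store key for one raw event.

    No modelling, cleansing or aggregation happens at this stage -- the
    record lands exactly as produced, partitioned only by source and date.
    """
    return f"raw/{source}/{event_time:%Y/%m/%d}/{event_id}.json"

# A hypothetical IoT reading, stored as-is at full detail.
event = {"id": "e-001", "payload": {"sensor": "t1", "temp_c": 21.5}}
key = raw_zone_key("iot", datetime(2024, 5, 17), event["id"])
print(key)  # raw/iot/2024/05/17/e-001.json
```

Because nothing is thrown away or pre-aggregated, questions that were unknown at load time can still be answered later from the same raw objects.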

In 2015, James Dixon published a case study on his blog that puts the datalake in a very different position. The basic idea is to collect data from all enterprise systems, including all changes made over time. This creates the ability to get the state of every field of (potentially) every business application in the enterprise across time. He called this state the “Union of the State”. The service or machine providing it would be called an “Enterprise time machine”. There are obvious benefits we can capitalize on in trend, historical, or predictive analyses, because we have all the attributes, even the seemingly unimportant ones. He also outlined an architectural design for how this can be achieved.

The proposal is loosely translated as follows:

  • The first step is to write the current state of the application – a snapshot – to a relational or NoSQL repository. This is done without affecting application functionality, if possible.
  • The next step is to log all events that happen in the application, ideally in real time. This data is then stored in the datalake.
  • By processing the logs in parallel, we allow analysts to access the state of all attributes at a specific point in time.
  • We apply the above method to all enterprise data in general.
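The steps above amount to event sourcing at enterprise scale: a snapshot plus an append-only change log is enough to reconstruct any attribute at any moment. The following is a minimal single-machine sketch of that reconstruction (in Dixon's design this replay would run as a massively parallel job over the datalake); the entity, attributes and timestamps are made-up illustrations.

```python
from datetime import datetime

# Snapshot of the current state at a known point in time (step 1).
snapshot = {"customer_42": {"email": "old@example.com", "tier": "basic"}}
snapshot_time = datetime(2015, 1, 1)

# Append-only log of change events captured from the application (step 2).
events = [
    {"ts": datetime(2015, 3, 10), "entity": "customer_42",
     "attr": "tier", "value": "gold"},
    {"ts": datetime(2015, 6, 1), "entity": "customer_42",
     "attr": "email", "value": "new@example.com"},
]

def state_at(entity: str, at: datetime) -> dict:
    """Replay logged events over the snapshot up to time `at` (step 3)."""
    state = dict(snapshot[entity])
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["entity"] == entity and snapshot_time < ev["ts"] <= at:
            state[ev["attr"]] = ev["value"]
    return state

# In April 2015 the tier has already changed, the email has not yet.
print(state_at("customer_42", datetime(2015, 4, 1)))
# {'email': 'old@example.com', 'tier': 'gold'}
```

Applied to every attribute of every system (step 4), this is exactly the "Union of the State" the article describes.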

Datalake architecture, James Dixon, CTO, Pentaho (Blog, 2015)


With this idea, he effectively said that what we commonly do today in data warehouses, operational databases and datahubs could be replaced by a new concept using the datalake as its basis. We would no longer historicize data in data warehouses using dimensional models, the real-time image of the source databases would not be replicated to the datastores, and the datalake would also take over the function of the operational databases. Everything would be served by massively parallel log processing over the datalake.

This idea is very revolutionary, but it only solves the retrieval of all possible attributes across time, not the many practical problems that have been solved for years in proven concepts such as ODS, DWH and DATAHUB. On the contrary, it confronts IT engineers with the same, often already-solved, problems that they would have to solve again with new technologies and a new concept.

Over time, it has become apparent that many IT architects have given free rein to their imagination, implementing the datalake in their solutions in very different ways, and calling what is really a “concept built on top of a datalake” a datalake. In fact, if we take James Dixon’s words literally, until 2015 a datalake is a powerful analytical storage built on top of Hadoop that stores all possible data formats. From 2015 onwards, it is additionally a repository that forms part of the “Enterprise time machine” concept, storing the change logs of ERP systems; the initial snapshot of the ERP system is not even stored in the datalake, since by design it belongs in a classic relational database. So is it still just the familiar storage built on Hadoop, or is it something more, a part of new innovative concepts? In the introduction of this article I mentioned that a datalake is sometimes seen as a modern replacement for a data warehouse.

In 2020, Databricks introduced a new architecture concept built on its products and the datalake at the same time. This concept is called the datalakehouse. It is based on a combination of the data warehouse concept, the datalake and, in part, the operational database. The whole architecture is built on technologies such as Spark and Hadoop and a combination of other cloud tools. All data is stored and processed in one place, the datalake. The concept is a bit reminiscent of the previously mentioned “Enterprise time machine”, i.e. the ability to analyze across time and in real time, while adding layers similar to a classic data warehouse and datamarts. The datalakehouse concept is now being marketed across companies as a modern replacement for the data warehouse.


Datalake today

The definition of the datalake is still not precisely specified; different consulting firms and companies give different definitions on their websites. The datalake is also very often associated with internal marketing and presented as an opportunity for a fresh start where, for example, a DWH project has failed. Nevertheless, concepts like the DWH or ODS will keep their place next to the datalake and will continue to perform their function. Sometimes the term datalake is also misused to mean “let’s migrate to the cloud”. This may be related to the fact that we have more implementation options today: technically, we can either build a datalake on Apache Hadoop or choose one of the available cloud services such as Microsoft Azure Data Lake, AWS S3 or Google Cloud Storage.

So let’s list the most common uses of datalake in practice:

  • Standalone analytical storage where data scientists play with data. This use case is closest to the original intent: storing varied data in varied formats in the datalake and providing a space for data scientists to perform big data analytics.
  • IoT data support – processing and analysis of data streams, again completely fulfilling the original intent. Data from IoT devices has limited structure, sometimes none at all. Depending on the nature of the IoT device, huge amounts of data can be processed and stored at very high granularity. The datalake, for example in combination with Spark, then allows this data to be analysed efficiently.
  • Data archiving from a DWH or datamart (“Hadumping”). This is a very simple and effective method of archiving, where we store archive data from an ERP system or data warehouse in the datalake as an alternative to backing up data to tape.
  • Functionality similar to the ETL process in the DWH concept. The data is first stored in the datalake, then cleansing, aggregation and transformation take place over it. The result is stored back in the datalake and is then used as a source for filling the DWH or other systems.
  • An addition to relational database functionality: combining historical and real-time data, or combining structured and unstructured data.
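The in-lake transformation pattern from the list can be sketched in a few lines. This is a deliberately library-free illustration (in practice such a job would typically run on Spark over the datalake); the records and field names are invented for the example.

```python
# Raw records as landed in the datalake -- untouched, including bad rows.
raw_orders = [
    {"order_id": "1", "amount": "100.0", "country": "CZ"},
    {"order_id": "2", "amount": None, "country": "CZ"},  # broken record
    {"order_id": "3", "amount": "50.5", "country": "DE"},
]

# Cleanse: drop records that cannot be parsed, cast amounts to numbers.
clean = [
    {**o, "amount": float(o["amount"])}
    for o in raw_orders
    if o["amount"] is not None
]

# Aggregate: revenue per country, ready to load into a DWH fact table.
revenue: dict[str, float] = {}
for o in clean:
    revenue[o["country"]] = revenue.get(o["country"], 0.0) + o["amount"]

print(revenue)  # {'CZ': 100.0, 'DE': 50.5}
```

The cleansed and aggregated result is written back to the datalake, and only this refined layer is pushed onward to the DWH or other consuming systems.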

Maybe in the coming years we will see a more precise specification, so that it is immediately clear to everyone what a datalake is and how it differs from a concept built on top of one. Perhaps the concept will specify itself, simply through the most common uses of the datalake in practice. Until that happens, it will always be necessary to clarify how a datalake is used in a particular project and what role it plays, so that everyone involved has the same view of it – whether it is just a dumb repository or a data warehouse built on top of a datalake.


Author: Michal Machata (Czech version)
