Databricks admits it: the Data Lakehouse concept, formulated in 2020, stems mainly from marketing positioning.
“When Databricks launched, we called it the unified analytics architecture,” recalls Nicolas Maillard, AVP Field Engineering, Solution Architect at Databricks. “The word Lakehouse sounded good and summed up our thoughts. It’s marketing, but marketing that’s caught on.”
The neologism, however, is not meaningless. This contraction of Data Lake and Data Warehouse indicates that the Databricks data processing platform is a data lake capable of ACID transactions, that is, of guaranteeing the atomicity, consistency, isolation and durability of the data.
In other words, a data lake has historically been a NoSQL data store. With a Lakehouse, the goal is therefore to query structured and unstructured data hosted on what is essentially NoSQL technology. But the story does not end there. For Databricks, a Data Lakehouse must above all support, simultaneously, all the use cases of the two paradigms it merges, while minimizing data movement.
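How a table format brings ACID guarantees to a data lake can be sketched in a few lines. The following is a loose illustration of the write-new-version-then-atomic-rename pattern that formats such as Delta Lake use on top of object storage; the function names and the JSON-file layout here are hypothetical, not Delta Lake's actual on-disk protocol.

```python
import json
import os
import tempfile


def atomic_commit(table_dir: str, version: int, rows: list) -> None:
    """Commit a new table version atomically: write to a temp file,
    then rename it into place. Readers either see the previous version
    or the new one, never a half-written file."""
    os.makedirs(table_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    # os.replace is atomic on POSIX filesystems: this is the "commit"
    os.replace(tmp_path, os.path.join(table_dir, f"{version:06d}.json"))


def read_latest(table_dir: str) -> list:
    """Read the highest committed version (roughly, snapshot isolation)."""
    versions = sorted(f for f in os.listdir(table_dir) if f.endswith(".json"))
    if not versions:
        return []
    with open(os.path.join(table_dir, versions[-1])) as f:
        return json.load(f)
```

A writer crashing mid-commit leaves only an orphan temp file behind; readers keep seeing the last fully committed version, which is the atomicity and durability the Lakehouse pitch is about.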
“Our vision was relatively simple, and it’s kind of how the founders of Databricks, the creators of Apache Spark, envisioned it,” points out Nicolas Maillard. “We had to provide a complete IT system capable of managing all the processing used to define a data application.”
Originally, Apache Spark was designed to allow engineers and data scientists to perform various data transformations in Scala or Python (and now SQL) in a single environment.
In essence, the Lakehouse extends this concept to all types of processing (AI and machine learning, BI, analytics, etc.), data models and analytical techniques, while maintaining the ACID guarantees historically provided by a data warehouse. As a result, Nicolas Maillard believes a Lakehouse should foster collaboration between data scientists and business users.
Databricks adds two more criteria. “In our opinion, there are two other foundations of the Lakehouse,” says the manager. “The Lakehouse needs to be open to many layers and standards, and to be multicloud. I need the different instances in the different clouds to be identical in scope, to be equally fast, and to be able to interoperate.”
Therefore, the Databricks Data Lakehouse supports various data formats and open source table formats.
The obvious success of a marketing concept
The concept hit the market in 2022. “Databricks is the primary proponent of the Lakehouse concept,” confirms Gartner in its 2022 Magic Quadrant for Cloud Database Management Systems, released Dec. 13, 2022. But the vendor is no longer the only one to use the name. “As the Lakehouse concept grew in popularity, other vendors rushed to develop their own versions of this architecture,” the analysts note. AWS, Google Cloud, Teradata, Oracle, Dremio, Snowflake and Cloudera have in turn adopted the terminology. Some, like Teradata, Oracle, GCP and Cloudera, have created new offerings. Others, like Snowflake, believe their existing product already meets most of the criteria formulated by Databricks.
This is not necessarily done light-heartedly: having to use a competitor’s terminology is certainly irritating.
Proof of this is that Teradata CTO Stephan Brobst doesn’t like the term Lakehouse, though he has come to terms with it. “Even though I don’t like the term, the Lakehouse concept is important,” he told MagIT. “There should be little friction moving from data lake to data product.”
According to Cécil Bove, Sales Engineering Director at Snowflake, this is a natural evolution.
“The Enterprise Data Warehouse, historically implemented on-premises, has evolved every three to five years based on estimated storage and compute needs. The stored data was controlled – the companies knew the volume and knew what they wanted to do with it,” he sums up.
“Then came data lakes. As companies generated more and more data but didn’t know right away what they wanted to do with it, because the data wasn’t structured in the same way, they simply piled it up in these lakes,” he continues. “Then came the public cloud, which really offers the ability to store data elastically for later use.”
“The Lakehouse concept aims to erase the differences between the enterprise data warehouse and the data lake. We will be able to store structured, semi-structured and unstructured data in the same place and use it in the same way,” confirms the Snowflake manager.
Two main Data Lakehouse families
But Stephan Brobst says there are fundamental differences between the various products available on the market.
“This notion of a unified architecture is important, yet we have to recognize that there are different technologies, different implementations.”
A teasing observer might say that there is Lakehouse on one side and “Houselake” on the other, depending on whether the underlying data management system is relational or not.
Even as vendors promote their capabilities to process virtually any type of data, these technological origins still linger, Gartner notes.
“Some vendors’ data lakehouses are emerging from data lakes, but they don’t support all the transactional consistency or robust workload management capabilities that data and analytics managers expect from their data warehouses,” the analysts note in the firm’s 2022 data management Hype Cycle, released in late June 2022. “Other Lakehouse platforms are good at data warehousing, but they don’t support the extensive data models and data science or data engineering capabilities of a data lake.”
Stephan Brobst, for his part, believes that the Databricks processing model based on three-phase data refinement (bronze for the raw data, silver for the filtered and augmented data, then gold for the reference data) “is not efficient”. Some add a fourth level for the semantic layer. “I don’t think you need four copies of the data. Raw data and reference data must be stored with care. The silver layer should be non-persistent, while the semantic layer can be physical or virtualized as needed,” he explains.
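Brobst’s point can be illustrated with a toy medallion pipeline in plain Python (no Spark; the data and function names are invented for illustration): only the bronze and gold layers are materialized, while the silver layer is a non-persistent generator that is computed on the fly.

```python
# Hypothetical mini-pipeline: bronze (raw) and gold (reference) data are
# materialized; silver is streamed, never stored, echoing Brobst's advice.

bronze = [  # raw events, as ingested: duplicates, bad records and all
    {"user": "a", "amount": "10"},
    {"user": "a", "amount": "10"},   # duplicate
    {"user": "b", "amount": None},   # unusable record
    {"user": "b", "amount": "5"},
]


def silver(records):
    """Filtered, deduplicated and typed records, yielded rather than stored."""
    seen = set()
    for r in records:
        if r["amount"] is None:
            continue
        key = (r["user"], r["amount"])
        if key in seen:
            continue
        seen.add(key)
        yield {"user": r["user"], "amount": int(r["amount"])}


def gold(records):
    """Aggregated reference data, materialized for consumers."""
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals


print(gold(silver(bronze)))  # totals per user, e.g. {'a': 10, 'b': 5}
```

Whether the intermediate layer is persisted or virtualized is exactly the design choice the two camps argue over: persistence buys reproducibility and auditability at the cost of an extra copy of the data.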
Different modes of multicloud and hybrid approaches
Beyond the differences in analytical and storage capabilities, it is also necessary to distinguish between technical and commercial approaches. In general, players like Databricks, Snowflake or Google Cloud have encouraged their customers to “centralize” their data in the cloud. This is the position Ali Ghodsi, CEO of Databricks, has defended with MagIT for three years. Benoît Dageville, co-founder of Snowflake, makes much the same argument, considering that even if storage must be separated from compute, ideally all of it should take place in the cloud for performance and cost reasons. Behind this position lies the fact that these vendors prefer customers “captive” to their solutions and to the cloud, without however renouncing portability paths for those who would one day like to leave. These vendors also regularly tout their ability to support a large number of open source table and data formats (Apache Iceberg, Hudi, Delta Lake, Parquet, Avro, ORC).
This isn’t necessarily the position of traditional vendors like Teradata and Cloudera, who are trying to convince the market that their products are no longer (necessarily) locked down and expensive. Their customers maintain hybrid data processing architectures, which forces these vendors to design solutions suited to this reality.
“We are changing our positioning,” said Steve McMillan, CEO of Teradata. “We want to offer an open, multi-cloud analytics platform. Our strength is our query engine, and we can bring this engine closer to the data, wherever it is stored, instead of [systematically] sending the data to the cloud,” he adds.
Ironically, both approaches rely on the same technology: object storage. More specifically, it is the S3 protocol that makes it possible not only to reduce storage costs, but also to facilitate the analysis of data recorded in different formats. These buckets can be hosted in the cloud or on-premises in S3-compatible instances. For its part, Snowflake has entered into partnerships with suppliers of on-premises storage solutions to access data in external tables and process it in the cloud. Databricks is investigating a similar solution, but Nicolas Maillard believes the use cases for that feature are still immature.
Data Lakehouse vs. Logical Data Warehouse
In fact, few companies have a single platform that covers all their analytical needs. The vast majority of large groups have built and are building a multi-product architecture with their warehouses and data lakes at the center.
Most of them find themselves dealing with disparate data sources. However, they want to be able to correlate or analyze data like a lakehouse. In the data warehousing world, this practice is identified as Logical Data Warehouse (LDW).
The concept of logical data warehouse was defined in 2009 by Mark Beyer, an analyst at Gartner. An LDW is “an analytics-dedicated data management architecture that combines the strengths of traditional warehouses with alternative strategies for managing and accessing data.” Simply put, a logical data warehouse is “an architecture for consolidating and virtualizing data from multiple analytical systems.” It allows you to centrally integrate data from different sources into a logical repository rather than a physical location for analytical purposes. Virtualization tools like those from Denodo or TIBCO can support this approach.
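The consolidation-without-copying idea behind an LDW can be sketched in a few lines of plain Python (the sources and field names are invented for illustration; real LDWs rely on virtualization engines such as those from Denodo or TIBCO): data stays in its source systems, and a virtual view federates it at query time.

```python
# Two in-memory lists stand in for two analytical systems:
# a warehouse holding curated sales figures, and a lake holding raw clicks.

warehouse_sales = [
    {"region": "EU", "revenue": 120},
    {"region": "US", "revenue": 200},
]
lake_clickstream = [
    {"region": "EU", "clicks": 3400},
    {"region": "US", "clicks": 5100},
]


def logical_view():
    """Join the two sources on the fly; nothing is copied or persisted.
    This is the 'logical repository rather than a physical location'."""
    clicks_by_region = {r["region"]: r["clicks"] for r in lake_clickstream}
    for row in warehouse_sales:
        yield {**row, "clicks": clicks_by_region.get(row["region"])}


for row in logical_view():
    print(row)
```

The analyst queries one view; where each column physically lives is an implementation detail, which is precisely what distinguishes the LDW approach from physically consolidating everything into one Lakehouse.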
In its 2022 Hype Cycle, Gartner considers a Lakehouse to be an “opportunistically constructed subset of LDW.” Until proven otherwise, a logical data warehouse serves more analytical use cases. “LDW remains a mature architecture and a best practice,” the analysts advise. In fact, the Logical Data Warehouse concept was dropped from the 2022 Hype Cycle simply because it no longer has anything to prove. Just as the data lake did not replace the data warehouse, the Lakehouse is likely to coexist with the LDW for a long time.