Decomposing The Lakehouse
“Do not collect weapons or practice with weapons beyond what is useful.” Miyamoto Musashi, Dokkodo
“Students of the Ichi school Way of Strategy should train from the start with the (normal) sword and the long sword in either hand. This is a truth: when you sacrifice your life, you must make fullest use of your weaponry. It is false not to do so, and to die with a weapon yet undrawn.” Miyamoto Musashi, The Book of Five Rings
“Absorb what is useful, discard what is useless and add what is specifically your own” Bruce Lee
Databricks introduced the Lakehouse to describe a unique set of principles that have emerged in the industry. At its core, it is a hybrid of two classic design patterns: the data warehouse and the data lake.
The data warehouse was “fathered” by Bill Inmon and further developed by Ralph Kimball. The purpose of a data warehouse is to store curated, focused data for reporting and analysis. The data lake, on the other hand, came out of the emergence of large amounts of very cheap storage. With the emergence of the data lake came Hadoop, and with Hadoop's demise, several successors came on the scene, most notably Apache Spark, Apache Flink, and Presto.
Each design pattern has its limitations, advantages, and use cases. A data warehouse can take significant time to deploy and has limited scope. Even once a data warehouse has been deployed, its usefulness beyond reporting and targeted analysis is limited by the architect's foresight, and even the most skilled architect cannot predict every possible question asked of the data.
Data lakes, on the other hand, can store all raw data, removing the need to go back to the source data store, which is almost always a painful process with limited results. Data lakes welcome all types of data without discrimination. They also come with significant issues, mostly centered around the lack of management and control.
The Lakehouse aims to address these limitations by combining the strengths of both patterns.
Being open is a fundamental principle in designing systems for long-term use. Openness guarantees that the data is not only stored in a universal standard but also available to all methods of data consumption. By storing raw data and making it available to anyone with permission, we lay the groundwork for data democratization.
What types of data should I be able to use?
- Structured (relational tables)
- Semistructured (JSON, XML)
- Unstructured text
- Graph
- Geospatial
A shocking fact is that although many systems claim to be friendly to nontraditional data, the truth is something very different. As of this writing, almost no relational databases or data warehouses support structured and semistructured data equally. We won't even bother talking about graph and geospatial data. The issue comes when you read the fine print: they only support schema-on-read for some types of nontraditional data.
Schema-on-read alone is not good enough for an enterprise system meant to stand the test of time. A lack of schema-on-write support is one of the main contributors to data swamps. A Lakehouse must enforce schema on both read and write for a reasonable range of nontraditional data types.
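To make the distinction concrete, here is a minimal, purely illustrative sketch of schema-on-write in plain Python: records are validated against a declared schema at ingestion time, instead of deferring validation to read time. The schema, field names, and helper function are all hypothetical, not any real table format's API.

```python
# Hypothetical event schema: field name -> expected Python type.
EVENT_SCHEMA = {"user_id": int, "event_type": str, "payload": dict}

def write_with_schema(table: list, record: dict, schema: dict) -> None:
    """Append a record only if it conforms to the declared schema (schema-on-write)."""
    if set(record) != set(schema):
        raise ValueError(f"field mismatch: {sorted(record)} vs {sorted(schema)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise TypeError(f"{field!r} must be {expected.__name__}")
    table.append(record)

events: list = []
write_with_schema(
    events, {"user_id": 1, "event_type": "click", "payload": {}}, EVENT_SCHEMA
)

try:
    # A malformed record is rejected at write time, keeping the table clean.
    write_with_schema(
        events, {"user_id": "oops", "event_type": "click", "payload": {}}, EVENT_SCHEMA
    )
except TypeError:
    pass
```

With schema-on-read, the bad record would land in storage and every downstream reader would have to cope with it; enforcing the check at write time is what keeps a lake from turning into a swamp.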
A common misconception I often hear is, “we are just going to make dimensional models out of the data; why bother with strong support for diverse data?” This perspective is limited and sees data warehousing as the only useful outcome for the data.
What types of workflows should I be able to do on my data platform?
- Data warehousing
- Machine Learning / AI
- Data analysis
- Graph analysis
- Geospatial analysis
What languages should I be able to use on my data platform?
- SQL (over structured and semistructured data)
- JVM Languages (Scala, Java, ..)
It's impossible to support every option, but care must be taken to avoid boxing yourself in.
Decoupling Storage from Processing
Decoupling opens the door to flexibility.
The storage of data should have no bearing on the processing of that data. Decoupling is a double-edged sword, since moving data is often expensive; in a cloud-based platform, however, this isn't typically a concern. By decoupling your data platform, you open the door to a whole ecosystem of options:
- A data scientist transforming source data and storing it in a feature store for machine learning / AI
- The marketing department requesting a data pipeline into a graph database for graph analysis
- Running some workloads on one processing engine, such as Flink or Spark, while leveraging technologies like Presto or AWS Athena for inexpensive, fast ad-hoc queries
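The essence of the decoupling above can be sketched in a few lines: storage is just an open format at a shared location, and any number of independent "engines" can consume it without the writer knowing they exist. This is an illustrative toy, assuming newline-delimited JSON on local disk as a stand-in for Parquet on object storage; the file layout and function names are invented for the example.

```python
import json
import tempfile
from pathlib import Path

def write_table(path: Path, rows: list) -> None:
    """The writer commits only to an open format and a location."""
    path.write_text("\n".join(json.dumps(r) for r in rows))

def adhoc_count(path: Path, event_type: str) -> int:
    """One 'engine': a cheap ad-hoc query, Presto/Athena style."""
    return sum(
        1 for line in path.read_text().splitlines()
        if json.loads(line)["event"] == event_type
    )

def feature_vector(path: Path) -> list:
    """Another 'engine': a feature extractor for ML, reading the same bytes."""
    return [json.loads(line)["value"] for line in path.read_text().splitlines()]

# Shared storage location, written once.
storage = Path(tempfile.mkdtemp()) / "events.jsonl"
write_table(storage, [
    {"event": "click", "value": 3},
    {"event": "view", "value": 7},
    {"event": "click", "value": 1},
])

# Two independent consumers, no coordination with the writer.
clicks = adhoc_count(storage, "click")
features = feature_vector(storage)
```

Because the writer and the readers agree only on the format and the path, you can swap either side independently, which is exactly what decoupled storage buys you.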
Streaming is a fundamental service for many companies, so having a data platform that is friendly to streaming data isn't optional. Many companies that are not dealing with streaming data today will require it shortly.
Data lakes have classically failed when dealing with concurrent processing; they simply can't achieve the same level of reliability as databases and data warehouses. A Lakehouse is built on an ACID, transaction-based architecture, which presents a consistent view of the data to concurrent readers and writers. Transactions also open the door to a full lineage for each table.
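The transactional idea can be sketched with a toy, log-based commit protocol of the kind Lakehouse table formats use: every commit is a new, atomically created log entry, so two concurrent writers can never both claim the same table version. This is a hedged simplification, not Delta Lake's actual implementation; the class and file naming are invented, and atomic file creation (`O_EXCL`) stands in for the storage system's real atomic-commit primitive.

```python
import json
import os
import tempfile
from pathlib import Path

class TransactionLog:
    """Toy optimistic-concurrency log: one JSON file per committed version."""

    def __init__(self, log_dir: Path):
        self.log_dir = log_dir
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self) -> int:
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions, default=-1)

    def try_commit(self, version: int, actions: dict) -> bool:
        """Atomically create the version file; fail if another writer won the race."""
        path = self.log_dir / f"{version:020d}.json"
        try:
            # O_EXCL makes creation atomic: exactly one writer can own a version.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # lost the race; caller must re-read state and retry
        with os.fdopen(fd, "w") as f:
            json.dump(actions, f)
        return True

log = TransactionLog(Path(tempfile.mkdtemp()) / "_log")
v = log.latest_version() + 1
first = log.try_commit(v, {"add": ["part-0000.parquet"]})
second = log.try_commit(v, {"add": ["part-0001.parquet"]})  # same version: rejected
```

Readers see only fully committed versions of the log, which is what gives concurrent workloads a consistent view, and the log itself doubles as the table's history and lineage.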
The Lakehouse addresses the main issues plaguing both the data lake and the data warehouse. The basic principles of the Lakehouse can be used individually or together. An example of using one principle in isolation is making your data platform open by creating raw Delta Lake tables before ingesting data into your data warehouse. You gain several benefits, including more robust interoperability and a more productive path to machine learning. Most notably, the Lakehouse is 100% compatible with data warehousing techniques.