Cloud Data Platform Showdown: Databricks
Databricks was founded by the original developers of the Apache Spark project. The company supports the open source community in several ways: it contributes to Apache Spark and has created new open source projects such as MLflow and Delta Lake. On the other hand, to stay competitive, Databricks has chosen to optimize and rewrite code and to offer proprietary features to clients. The Databricks platform is vendor-neutral with respect to cloud providers and has a very small footprint. Currently, all storage and infrastructure are located in the client's cloud account.
The architecture promoted when working with Databricks is the Medallion architecture. That said, Kappa and Lambda architectures will of course also work perfectly well. The Medallion architecture’s key feature is the lakehouse design. Key characteristics of the lakehouse are decoupled storage and compute, ACID transactions, schema on read and write, data governance, streaming, support for unstructured and semi-structured data, and BI support. This philosophy isn’t limited to a particular platform and is fully compatible with approaches like Data Mesh.
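As a rough illustration of the Medallion idea, the sketch below moves a few records through the commonly described bronze (raw), silver (cleaned), and gold (aggregated) layers. It uses plain Python lists instead of Spark or Delta tables so it can run anywhere. The sample records, field names, and function names are all hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical raw events as they might land from a source system.
RAW_EVENTS = [
    {"order_id": "1", "region": "eu", "amount": "10.50"},
    {"order_id": "2", "region": "us", "amount": "not-a-number"},  # bad record
    {"order_id": "3", "region": "eu", "amount": "4.50"},
    {"order_id": "3", "region": "eu", "amount": "4.50"},  # duplicate
]


def to_bronze(events):
    """Bronze: keep the raw data unchanged, only tagging where it came from."""
    return [dict(event, _layer="bronze") for event in events]


def to_silver(bronze_rows):
    """Silver: cast types, drop invalid rows, and de-duplicate on order_id."""
    seen, silver_rows = set(), []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        silver_rows.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver_rows


def to_gold(silver_rows):
    """Gold: business-level aggregate, here revenue per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


GOLD = to_gold(to_silver(to_bronze(RAW_EVENTS)))
```

On a real lakehouse each layer would typically be a table rather than a list, but the shape of the flow is the same: raw data is preserved first, and cleaning and aggregation happen in later layers.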
- ANSI SQL
One of the unique aspects of Spark-based offerings like Databricks is the ability to use and mix a variety of languages. Data Science and Machine Learning can be run natively in the languages and with the libraries that are best of breed. Yet on the same platform, you can build pipelines using mostly SQL, with strong support for tools like dbt. dbt is so well supported that even the Workflows service can call it.
Unique Data Engineering features
Notebooks are a common tool for ad-hoc queries, troubleshooting, and in some cases production usage. If notebooks are not to your liking, you can use Databricks jobs, which run a Spark application. Both options have more unique features than can be listed here.
There are also features that reduce the DevOps footprint by creating and managing infrastructure for you. Some examples are Autoloader, Workflows, and Delta Live Tables. Autoloader is a unique stream/batch-style ingestion service. It automatically creates a metadata- and queue-backed ingestion framework in your cloud account. Autoloader can be run in streaming (microbatch) or batch mode, which provides a wide range of flexibility.
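The streaming-versus-batch choice described above might look roughly like the sketch below. It assumes a Databricks runtime where `spark` is an active session and the `cloudFiles` source is available. The helper names, paths, and table name are hypothetical, and the option keys reflect my understanding of the Autoloader options, so check them against the current documentation before relying on them.

```python
def autoloader_options(file_format, schema_location, use_notifications=True):
    """Build the reader options as a plain dict so they are easy to inspect.

    use_notifications=True asks for the queue-backed file notification mode;
    False falls back to directory listing.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def ingest(spark, source_path, target_table, checkpoint_path, streaming=False):
    """Start an ingestion from source_path into target_table.

    `spark` is assumed to be the session provided by a Databricks cluster;
    this function is only defined here, not called, so the sketch itself
    runs without Spark installed.
    """
    options = autoloader_options("json", checkpoint_path + "/schema")
    stream = spark.readStream.format("cloudFiles").options(**options).load(source_path)
    writer = stream.writeStream.option("checkpointLocation", checkpoint_path)
    if not streaming:
        # Batch style: process everything currently available, then stop.
        writer = writer.trigger(availableNow=True)
    return writer.toTable(target_table)


# Hypothetical schema location, for illustration only.
OPTIONS = autoloader_options("json", "/mnt/checkpoints/orders/schema")
```

The same function covers both modes: with `streaming=True` the query keeps running in microbatches, and with the default it drains the backlog once and exits, which suits a scheduled job.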
Workflows is a pipeline tool for Data Engineering, Data Science, and Machine Learning, designed to run a DAG-like set of tasks. A task within the pipeline can currently be a JVM jar, a Python package, a notebook, a dbt job, or even a Delta Live Tables pipeline. Workflows can fully mix Data Engineering tasks with Machine Learning tasks, all within the same minimalist infrastructure. Beyond the infrastructure-as-code issues mentioned in the DevOps section, Workflows do not allow you to choose unique clusters per task. This might seem trivial at first, but it can become suboptimal when tasks have very different resource needs.
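To make the DAG idea concrete, here is a minimal sketch of a multi-task job definition that mixes task types, written as a Python dict. The field names follow my understanding of the Databricks Jobs API 2.1 payload; the job name, task keys, paths, and IDs are hypothetical. The small helper only computes a valid execution order from the `depends_on` entries and is not part of any Databricks library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job definition mixing notebook, dbt, pipeline, and wheel tasks.
JOB_SPEC = {
    "name": "nightly_sales",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "dbt_task": {"commands": ["dbt run"]},
        },
        {
            "task_key": "quality",
            "depends_on": [{"task_key": "ingest"}],
            "pipeline_task": {"pipeline_id": "<pipeline-id>"},
        },
        {
            "task_key": "train",
            "depends_on": [{"task_key": "transform"}, {"task_key": "quality"}],
            "python_wheel_task": {"package_name": "demo_ml", "entry_point": "train"},
        },
    ],
}


def run_order(job_spec):
    """Return one valid execution order implied by the depends_on edges."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job_spec["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


ORDER = run_order(JOB_SPEC)
```

Here `ingest` has no dependencies and `train` depends on everything else, so any valid order starts with the former and ends with the latter, while `transform` and `quality` are free to run in parallel.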
Other pipeline tooling such as Apache Airflow, Azure Data Factory, and Dagster, among others, is fully supported. If your pipeline tooling is not supported, as in the case of AWS Step Functions, you can interact with the REST API and fully integrate with little effort.
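As a sketch of what that REST integration could look like from an external orchestrator, the snippet below builds (but does not send) a request that triggers an existing job. The endpoint path and body reflect my understanding of the Jobs API 2.1 `run-now` call; the workspace URL, token placeholder, job ID, and helper name are hypothetical.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id):
    """Build an HTTP request that asks Databricks to start an existing job."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical workspace URL, token, and job ID, for illustration only.
REQUEST = build_run_now_request(
    "https://example.cloud.databricks.com", "<personal-access-token>", 123
)

# Sending is left out on purpose; in a real integration it would be:
# with urllib.request.urlopen(REQUEST) as response:
#     run_info = json.load(response)
```

Because this is a plain HTTPS call, any tool that can issue an authenticated POST, including a step in an external state machine, can trigger and then poll a job in the same way.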
Delta Live Tables is a declarative, code-based approach to creating data processing pipelines and managing data quality using Python. Delta Live Tables uses the Delta Lake open standard. Delta Lake is a layer on top of the typical Parquet-backed data lake table. What makes Delta Lake unique is:
- ACID transactions
- audit log
- 0 infrastructure to manage
- schema evolution
- 100% schema on read/write.
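One way to picture how transactions and an audit log can sit on top of plain Parquet files is an ordered commit log in which each commit records the data files it adds and removes. The toy model below is only an illustration of that idea and not Delta Lake's actual implementation; the class, method, and file names are hypothetical.

```python
class ToyDeltaLog:
    """A toy, in-memory commit log: list index == table version."""

    def __init__(self):
        self.commits = []

    def commit(self, adds=(), removes=(), info=""):
        """Record one atomic change and return the new version number."""
        self.commits.append(
            {"add": list(adds), "remove": list(removes), "info": info}
        )
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits up to `version` to find the files that make up the table."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for entry in self.commits[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return sorted(files)

    def history(self):
        """The audit log: which operation produced which version."""
        return [(version, entry["info"]) for version, entry in enumerate(self.commits)]


log = ToyDeltaLog()
log.commit(adds=["part-0.parquet"], info="WRITE")
log.commit(adds=["part-1.parquet"], info="APPEND")
log.commit(
    adds=["part-2.parquet"],
    removes=["part-0.parquet", "part-1.parquet"],
    info="COMPACT",
)
```

A reader that replays the log sees either all of a commit or none of it, which is the intuition behind the transactional guarantee, and replaying only up to an earlier version shows the table as it was at that point.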
It is refreshing to see that, alongside the traditional relational data methodology, there is strong support for semi-structured data (full schema on read and write), geospatial data, and even graph data.
Unique Machine Learning Offerings
Databricks comes with a wide variety of Data Science and Machine Learning features designed by Databricks rather than by a third party.
Machine learning can be accomplished through several tools. Spark ML, a native Spark library, can always be used. Libraries like scikit-learn, XGBoost, and Hyperopt can be parallelized and integrated into workflows. Databricks offers a glass-box style of AutoML, which gives the data scientist full visibility into the AutoML process. Databricks also offers MLOps features like MLflow and a feature store.
For analysts and BI developers, Databricks offers SQL Analytics, a SQL-only service for analysis and BI dashboarding. This BI solution may not have as many features as Tableau or Power BI, but it definitely gives them a run for their money when viewed as part of the overall ecosystem.
DevOps
Databricks offers several infrastructure-light features, such as Workflows, Delta Live Tables, and Autoloader. The design goal is to reduce infrastructure management. The infrastructure is still present in your cloud account; it's just managed by Databricks. Databricks offers REST API access for use with tools like Terraform, a popular infrastructure-as-code tool. Many Databricks features also allow for integrating with JVM, Python, and R libraries produced by CI/CD. This means your pipeline can be fully vetted through tests, linters, and a pull-request process, just like any other software you would produce.
Sadly, there are large gaps in the capabilities offered through Terraform and the REST API. This translates to manual interactions with the web GUI. That works well for small groups and some other use cases, but it will not scale and can’t be automated.
Databricks is a powerful data platform that brings a ton of unique features to a very wide variety of workloads. There are rough edges, but they should not stop you from using this platform.