Snowflake and Databricks Spark: A technical perspective
Note: I have avoided discussing the many Spark offerings available on the market and instead focus on Databricks, because they offer a very good, easy-to-use product and are vendor-neutral. In this article, I will refer to Spark on Databricks simply as Spark.
Snowflake is a cloud-based, vendor-neutral, easy-to-use, high-concurrency data warehouse. Both products follow the adage that you pay for what you use. Snowflake is really a modern iteration of the classic data warehouse. In roughly 90% of use cases, your Snowflake data and resources will live as a tenant in Snowflake's cloud account. Moreover, the data is not directly accessible to you, nor is it stored in an open format. Out of the box, Snowflake requires little setup, because unlike Databricks it does not run in your own cloud account. Conceptually, Snowflake provides a familiar data warehouse feel to customers. With Snowflake, you will have a significantly higher price point, which makes sense because you are getting a turn-key data warehouse solution.
Spark is a cloud-based, vendor-neutral, easy-to-use parallel processing framework. It offers all the modern tools of data science and data engineering at scale, for a reasonable price. Spark can be the heart of your data lake or of a DIY data warehouse, and its adherence to open standards for storage allows for maximum longevity and preservation of data. Spark has also drastically reduced the advanced knowledge once needed to operate it. As of Spark 3.0 on Databricks, several ease-of-use and speed enhancements stand out in the industry (adaptive query execution, partition pruning, the Delta Engine). When using Databricks, you always have access to your data and resources in your own cloud account, and all data is stored using open standards. Databricks also has a very simple pricing model, and the cost is much lower.
Snowflake supports most ANSI SQL operations, and Spark 3.0 fully supports ANSI SQL. In both cases, the platforms also support unique operations of their own.
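On Spark 3.0, the stricter ANSI behavior is opt-in via a configuration flag. A minimal sketch (the cast below is just an illustration of the stricter semantics):

```sql
-- Spark 3.0: enable ANSI-compliant SQL behavior (off by default).
SET spark.sql.ansi.enabled = true;

-- Under ANSI mode, an invalid cast raises an error
-- instead of silently returning NULL:
SELECT CAST(2147483648 AS INT);
```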
Snowflake uses your preferred cloud vendor for its compute engine and storage. You are never given direct access to these resources or to the underlying storage, and when you end your contract you must request to have your data purged. This is because, in most scenarios, all resources and data live as a tenant in Snowflake's account.
Here are two examples of a common setup: one is a small data warehouse, the other a medium one. With some clients, I have even seen several XL-size data warehouses.
As you can see, compute makes up the majority of the cost when working with Snowflake. Keep in mind that you only pay for what you use.
Once you choose a cloud vendor, Spark will run all components through that vendor, on your own account. You will at all times have access to all your resources, which can be a benefit or a liability depending on the organization.
Spark can use whatever storage you choose, but in this example we will assume cloud object storage such as AWS S3.
With Spark, you own, control, and will always be the steward of your data.
Current AWS S3 pricing, per month:
- First 50 TB: $0.023 per GB
- Next 450 TB: $0.022 per GB
- Over 500 TB: $0.021 per GB
4 TB = 4,000 GB, which costs $92 a month.
65 TB = 65,000 GB, which costs about $1,480 a month (the first 50 TB at $0.023/GB, the remaining 15 TB at $0.022/GB).
So, to be clear, storage on AWS is practically free at this scale.
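As a quick sanity check, the tiered rates above can be worked out in a few lines of Python (a sketch; re-check the rates against the current AWS pricing page). Applying the tiers exactly prices 65 TB at $1,480, slightly below a flat first-tier estimate.

```python
# Sketch of the tiered AWS S3 standard-storage pricing listed above.
# Rates should be re-checked against the current AWS pricing page.
TIERS = [
    (50_000, 0.023),        # first 50 TB per month, $ per GB
    (450_000, 0.022),       # next 450 TB per month
    (float("inf"), 0.021),  # everything over 500 TB per month
]

def monthly_s3_cost(gb: float) -> float:
    """Monthly S3 storage cost in dollars for `gb` gigabytes."""
    cost, remaining = 0.0, gb
    for tier_size, rate in TIERS:
        billed = min(remaining, tier_size)
        cost += billed * rate
        remaining -= billed
        if remaining <= 0:
            break
    return round(cost, 2)

print(monthly_s3_cost(4_000))   # 4 TB
print(monthly_s3_cost(65_000))  # 65 TB
```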
What’s nice about Spark is that you can choose compute resources to match your needs. In this case, I will choose the m5d.4xlarge.
m5d.4xlarge: $0.904 per hour, 16 vCPUs, 64 GiB of memory.
The AWS compute cost in this case is:
4 instances x 10 hr x 20 days x $0.904 per hour = $723.20 for the month.
Adding storage, the total AWS cost is about $815.20 per month, or $9,782.40 annually; Databricks' own per-DBU platform charges come on top of this.
The AWS compute cost in this case is:
8 instances x 10 hr x 20 days x $0.904 per hour = $1,446.40 for the month.
Adding storage, the total AWS cost is about $1,538.40 per month, or $18,460.80 annually; again, Databricks DBU charges are additional.
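The compute math for both clusters can be sketched the same way, assuming the on-demand m5d.4xlarge rate quoted above (Databricks' per-DBU platform fee is billed separately and not included here):

```python
# Sketch of the monthly EC2 compute cost for a Databricks cluster.
# Assumes the on-demand m5d.4xlarge rate quoted above; Databricks'
# own per-DBU charge is billed separately and is not included.
M5D_4XLARGE_RATE = 0.904  # dollars per instance-hour

def monthly_compute_cost(instances: int, hours_per_day: int,
                         days: int, rate: float = M5D_4XLARGE_RATE) -> float:
    """Cost of running `instances` nodes `hours_per_day` hours a day for `days` days."""
    return round(instances * hours_per_day * days * rate, 2)

print(monthly_compute_cost(4, 10, 20))  # small cluster
print(monthly_compute_cost(8, 10, 20))  # medium cluster
```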
As you can see, Spark combines the pay-for-what-you-use philosophy of Snowflake with direct access to your data.
DML and DDL
For many engineers, SQL is the most important language for manipulating data. Both Snowflake and Spark fully support SQL for data processing.
Spark supports a wide variety of languages, and in several ways. Spark allows for the creation of full Spark applications in Python, Scala, Java, and R. This requires more technical skill but has significant benefits.
Spark has the additional ability to mix languages as needed, with small limitations. You could write optimized Spark code in Scala or Java and then call it from SQL or Python.
Snowflake has a very easy-to-use web GUI with a classic look and feel that allows you to run queries on your data and manage your resources.
Spark uses notebooks to run queries and to interact dynamically with your data and resources. These notebooks support all the Spark languages (SQL, Python, Scala, R) but not Java, for technical reasons. Notebooks allow for charting, as well as exporting and importing, and have several collaborative and interactive features for teams working together.
It’s very common for external applications to want access to your data. Through the use of SDKs and JDBC/ODBC drivers, this interaction can be accomplished very easily. Both Snowflake and Spark offer these options, but care must be taken to understand the scale of data being queried and transferred back to the application.
Accessing your Data
Snowflake stores all data in easy-to-use tables, views, temp tables, and external tables (read-only access to files in the data lake). Snowflake supports semistructured as well as structured data. For semistructured data, you have the choice of using a VARIANT column, which allows limited semistructured schema interaction, or simply flattening the data ahead of time.
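As a sketch of the VARIANT approach (table and field names here are illustrative, not from any real schema):

```sql
-- A Snowflake table holding raw JSON in a VARIANT column.
CREATE OR REPLACE TABLE raw_events (v VARIANT);

-- Path notation reaches into the JSON; :: casts to a SQL type.
SELECT
    v:customer.name::STRING AS customer_name,
    v:amount::NUMBER(10,2)  AS amount
FROM raw_events;

-- LATERAL FLATTEN explodes a nested array into rows.
SELECT f.value::STRING AS item
FROM raw_events,
     LATERAL FLATTEN(input => v:items) f;
```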
Spark supports just about any data source, including files, databases, and streaming sources like Kafka. You have the choice of creating tables, databases, and temporary tables with your data, just like in Snowflake. Spark keeps track of all metadata in its metastore, and it supports both reading and writing on all data sources.
Spark has a very strong semistructured-data API, as well as the ability to work with unstructured data. Spark can enforce a schema on read and on write, which can be very useful in taming semistructured data.
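Schema-on-read can be expressed directly in Spark SQL, for example by declaring a schema over a directory of JSON files (the path and column names below are illustrative):

```sql
-- Spark SQL: an unmanaged table with an explicit schema over JSON
-- files. The declared types are enforced when the data is read.
CREATE TABLE clicks (
    user_id STRING,
    ts      TIMESTAMP,
    page    STRING
)
USING json
OPTIONS (path '/mnt/landing/clicks/');
```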
Spark supports the open-standard Delta Lake format, which allows your data lake to have the following characteristics:
- ACID transactions
- Scalable metadata handling
- Unified Batch and streaming source
- Schema enforcement
- Time Travel (Data Lineage)
- Audit History
- Full DML Support
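Several of these features are plain SQL on a Delta table. A brief sketch (the table name is illustrative):

```sql
-- Create a Delta table in Spark SQL.
CREATE TABLE events (id BIGINT, ts TIMESTAMP, payload STRING) USING delta;

-- Full DML support:
UPDATE events SET payload = '{}' WHERE id = 42;
DELETE FROM events WHERE ts < '2020-01-01';

-- Time travel: query the table as of an earlier version.
SELECT * FROM events VERSION AS OF 3;

-- Audit history of every transaction on the table.
DESCRIBE HISTORY events;
```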
Both Snowflake and Spark support advanced SQL techniques such as CUBE and ROLLUP, which are variants of the GROUP BY command used in data warehouses.
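For example, ROLLUP adds subtotal and grand-total rows to an ordinary GROUP BY; CUBE does the same for every combination of the grouping columns. This runs in both Snowflake and Spark SQL (the table is illustrative):

```sql
-- One row per (region, product), plus a subtotal row per region
-- and a single grand-total row.
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY ROLLUP (region, product);
```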
Machine Learning, GIS and Graph Analytics
Spark can be used with any data source, which includes Snowflake itself.
Spark offers several high-quality advanced analytics capabilities, including machine learning, GIS, and graph analytics. Many popular machine learning tools are supported on Spark, including TensorFlow, PyTorch, scikit-learn, and MLlib, to name a few.
What stands out to me as the key difference between the two products is the DIY nature of Spark, whereas Snowflake is basically ready to use with little technology ramp-up required. Both products can be used to build powerful, scalable cloud-based data warehouses. Spark has the added ability to manage data lakes and bring advanced analytics to your data.