Choosing a data store, a guide for the curious
When starting a new project, it's a good idea to evaluate your data storage needs. I’m going to shy away from the term database and instead, I’ll use the term data store because oftentimes labels are loaded with baggage that will distract us. Before we begin I will warn you that I‘m going to be rather broad with how I define many concepts.
Before we even start out talking about data stores, let's first discuss a few general concepts.
Online Transaction Processing
An OLTP will focus typically on small transactions or a unit of work treated as one unit. You will often see CRUD (create, read, update, delete) transactions from a web app using an OLTP data store. In the relational world, data can be stored in normalized (3rd normal form) or denormalized depending on the requirements. There are Nonrelational data stores that could be considered OLTP also like MongoDB since it typically processes shorter transactions often using CRUD.
Online Analytical Processing
When we use the word analytical processing, we are normally talking about bigger queries over larger datasets. An OLAP unlike an OLTP system is typically designed to handle these larger queries. I’m going to be inclusive and include MPP (Massive Parallel Processing)as well as lump feature stores in the OLAP category, knowing full well this is a more complex discussion.
Typically an OLAP is used for business intelligence analysis, reporting, and data analysis. When we start talking about data warehouses and data lakes the lines become blurry, but for simplicity's sake, I”m going to say both a data warehouse and a data lake can be an OLAP.
What is a data warehouse?
“a copy of transaction data specifically structured for query and analysis” -Ralph Kimbell
“a centralized repository for the entire enterprise”-Bill Inmon
A data warehouse is a “central” place to store copies of historical data. Typically data is “wrangled” and curated into being organized and shaped for BI, reporting, and data analysis needs. Data is denormalized following a logical pattern typically, and semistructured data is not common.
The feature store is an evolution in the storage of data for machine learning. Generally speaking, machine learning is the process of cleaning and manipulated historical data into features, which are then used for predicting the unknown. The process of creating features can be similar to some of the work done to curate data in a data warehouse. Some of the key characteristics of a feature store are:
- Full data lineage of features
- Automating feature creation through a pipeline
- Storage of features
- Central access to the features
- Feature metadata, for example, versioning
- Feature Lifecycle monitoring
- Tracking the effect of new features
- The ability to share features across teams.
Now here is the dark horse in the room. Data Lakes are often thought of as simply files stored in a massive storage system (typically in the cloud). A data lake can be as flexible or as strict as the architect's desire. You are forced to choose the level of structure your data will have and how to manage that data. This freedom gives you the flexibility to use structured and semistructured, tables, and databases anyway you choose. In the same light, you can haphazardly throw data into your data lake leaving you with what is known as a data swamp. One misconception about a data lake is that they are more prone to data swamps than other data storage patterns. This really isn't the case, the rule of thumb always is garbage in garbage out.
It's common to use a data lake as a “layer” before transferring data into a data store like a graph database or a data warehouse. It's also common to see data warehouses built on top of data lakes examples of this are Snowflake and Spark Delta Tables among many others.
Relational Database Management Systems, ex PostgreSQL
The king of the data store is in my opinion the relational database. It is a general-purpose system that is ACID compliant. Data is stored in a structured manner, although many systems have some support for semi-structured data. In the case of PostgreSQL, there is support for semistructured data via BJSON and should make you strongly consider PostgreSQL for semistructured data. Relational databases speak in SQL, a high-level declarative language (the data language), which supports very complex queries. A relational database is designed to handle the joining of data and can require technical skill when modeling data. That being said, it is common to use an ORM when writing a software application to abstract the modeling of data. The limitation of ORM’s being that they are only useful for simple transactions. One common issue with typical relational databases is that scaling isn’t very easy and requires a great deal of effort. This makes sense given the context in which they were created, the colocation of servers in a data center.
General Document Data stores
General document data stores are designed for JSON documents ie semistructured data. With this strict direction, you gain a great deal of flexibility and limitation. With JSON documents you are given the freedom to use whatever schema you want for each document, or with some systems like MongoDB you can have strict schema enforcement. These data stores are very popular for web applications because you can have a flexible schema upfront, but without care, this can become an issue down the road. Many of these systems support an ORM like an interface for developers.
These data stores are semistructured data stores, which can “feel” like general-purpose document data stores. If you look under the hood you will find they do not work as normal document data stores, instead, they are designed to analyze the text in the document using fast indexing no matter how large the document is. This super-focused skill comes at a cost, scaling can be tricky. Also, joining data together is not possible when querying data like general document data stores. These limitations are normally why full-text search data stores are typically not used for a general-purpose.
In-memory data stores are focused on very fast access to very limited structures of data. Typically the data is stored in a key-value relationship. Redis on the other hand allows for more complex data types such as strings, lists, sets, hashes to name a few. A common use for this type of data store is a fast caching layer for an application. Another valid use case for this type of data store is the core of a Publish & Subscribe pattern for applications.
ex Cassandra, Scylla, Hbase
Columnar data stores have evolved out of the limitations of traditional RDBMS systems. They avoid reading data row by row and instead read data in groups of columns which gives them a significant speed increase for some types of queries. Scaling is another benefit of column focused reading. These data stores are typically easier to scale than a normal RDBMS system and are often used for OLTP requirements that require scaling. These gains come with some pains that should be considered. These databases are not fully acid compliant and normally follow the CAP theorem instead. You should expect eventual consistency with these data stores. Also, they only allow a SQL “like” query language. Nothing beyond simple queries is typically supported, and even joining two tables is not allowed. Data is denormalized so that the model matches the query expected to be run. Real semistructured data isn’t typically supported for this type of data store.
One of the new faces of the data store community is the distributed SQL (DSQL). The most notable of this category is cockroachDB, a geo-distributed DSQL that scales easily and quickly. These data stores are meant to bridge the gap between classic RDBMS data stores and wide column data stores. DSQL data stores are a fully acid complement and support ANSI SQL fully. They look and feel just like a normal relational database. The catch is under the hood they don’t act like a typical relational database, and instead, focus heavily on scaling. DSQL systems are focused on redundancy and thrive in a modern geo-distributed cloud OLTP environment. There is a catch, nothing is perfect and DSQL typically will not perform as well with very complex SQL queries or for OLAP uses.
Graph data stores use graph modeling to focus on the relationships that exist within the data. A graph is a system of nodes (data) and edges (relationships). This could be modeled in an RDBMS data store but a graph is much more natural and easier. These data stores are typically acid compliant and should have comparable performance and scaling as any other type of data store.
The idea of a multi-model data store is to be able to store data in whatever model it lives in. Multi-model data stores can handle graph, key-value, or semi-struct data. In the case of Fauna, a multi-model managed service, you access the data through a GraphQL API. The major advantage to Fauna is that you can literally don’t have to manage anything not even the modeling, schema control, or any of the complexity of managing data stores. Some multi-model data stores have schema control. The cost of using Fauna is significant depending on several factors. Another consideration is that the model can have some constraints but not at the same level as an RDBMS. It may seem like a perfect solution, but often using a single focus data store can be a better choice.
Data can be stored in a variety of ways, and care should be taken to choose something that is appropriate for your project's needs. They all have pros and cons that should be weighed carefully. When in doubt, I believe the default choice should always be a classic RDBMS like PostgreSQL.