Modern Data Engineering
Data engineering is a relatively new concept, although the skills have been around for some time. If you Google around, you will find that the skills, tools, and job responsibilities vary significantly. My approach to the data engineering role is broad and modern. Many hyperspecialized roles also exist, such as Data Warehouse Developer or Big Data Developer. Although those specialties are key components of modern data engineering, they are only pieces of a larger picture.
My modern data engineering philosophy has 6 pillars:
- Open standards over closed proprietary tools and languages:
Whenever you engineer a solution, following open standards helps the project stay high quality, flexible, and more likely to stand the test of time.
- Cloud over on-premises data centers:
The cloud isn't a trend; it's a revolution. Not being tied to supporting data centers is a major step forward. Data centers require not only manpower but also significant financial commitments. There is no benefit to avoiding the cloud.
- Managed infrastructure whenever possible:
Data infrastructure can be complex, and running it is really a sysadmin's job. Managed infrastructure lets engineers avoid being stuck in the sysadmin role and focus on relevant skills. Open products like MongoDB, Kafka, Apache Spark, PostgreSQL, and Redis all have managed services that are available on the major cloud platforms.
- Platform neutral solutions:
When engineering a solution, prefer options that can be moved to a new platform with little to no effort. Single-platform solutions are typically not the original projects but “copies” created by cloud providers. Supporting original projects over closed platform projects is a significant contribution to the community. In practical terms, this means choosing projects like MongoDB, Kafka, Apache Spark, PostgreSQL, and Redis, which are all open source projects. When you choose Confluent Cloud for Kafka over, for example, AWS Kinesis, you not only get a larger ecosystem of features and products, you also help contribute financially to the community.
- Code over GUI tools:
Although learning to write code can be a hard process, code is consistently a better choice than a GUI tool for engineering work. Code allows for version control, testing, peer review, continuous integration, and much more. GUIs are for non-engineers, and production deployments should never depend on GUI tools.
- Follow modern software engineering best practices:
A data engineer does not need to be a seasoned software engineer, but knowing the critical tools and practices that software engineers use is very important. The benefits of practices like automation, code standards, code review, and testing are extensive and can't be ignored.
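As a rough sketch of what the last two pillars buy you, here is a small transformation written as plain Python with a test sitting next to it. The function name `clean_order` and the field names are hypothetical and chosen only for illustration. The point is that a step like this can be version controlled, peer reviewed, and run in continuous integration, which a step configured in a GUI usually cannot.

```python
from decimal import Decimal


def clean_order(raw: dict) -> dict:
    """Normalize one raw order record (hypothetical field names)."""
    return {
        "order_id": int(raw["order_id"]),
        "customer": raw["customer"].strip().lower(),
        # Keep money as Decimal, rounded to two places.
        "amount": Decimal(raw["amount"]).quantize(Decimal("0.01")),
    }


def test_clean_order() -> None:
    # A test like this can run automatically on every commit.
    raw = {"order_id": "42", "customer": "  Ada ", "amount": "19.5"}
    assert clean_order(raw) == {
        "order_id": 42,
        "customer": "ada",
        "amount": Decimal("19.50"),
    }


if __name__ == "__main__":
    test_clean_order()
```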
The skills of a modern data engineer are a hybrid of several domains, much like the skills of a data scientist.
Relational Modeling and Data Warehousing: In the past, the world of data engineering lived exclusively in this domain. Although these skills are very important, they are limited in scope. Data is diverse just like people, and just like people, having a natural diversity is healthy. There is a time and place for documents over a traditional RDBMS, and understanding when each is appropriate is very important. That being said, it's critical to be able to normalize and denormalize your data as well as follow common data warehouse patterns.
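To make normalizing and denormalizing concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names are made up for illustration. The normalized tables store each customer once, and the wide table trades that for convenient reporting queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized: each customer is stored once and referenced by key.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);

    -- Denormalized: one wide table that repeats the customer name per order.
    CREATE TABLE orders_wide AS
    SELECT o.order_id, c.name AS customer_name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
    """
)

wide_rows = conn.execute(
    "SELECT order_id, customer_name, amount FROM orders_wide ORDER BY order_id"
).fetchall()
```

Renaming a customer touches one row in the normalized model but every matching row in `orders_wide`, which is the usual trade-off to weigh.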
Modeling for NoSQL/Document Stores and Graph Databases: Data can be represented and stored in many forms. Modern data modeling isn't limited to relational data. It includes NoSQL, document, graph, and geospatial data. The ability to model all forms of data is in high demand and is a key skill.
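As a small illustration of how the same domain looks in different models, the sketch below uses plain Python structures rather than a real document or graph database. The `user_doc` and `follows` data and the `reachable` helper are hypothetical examples.

```python
from collections import deque

# Document model: one self-contained, nested record per entity.
user_doc = {
    "_id": "u1",
    "name": "Ada",
    "addresses": [{"city": "London", "primary": True}],
}

# Graph model: entities are nodes and relationships are edges.
follows = {
    "u1": ["u2"],
    "u2": ["u3"],
    "u3": [],
}


def reachable(graph: dict, start: str) -> set:
    """Return every node reachable from start, using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}
```

The document shape suits reads that want the whole user at once, while the graph shape suits questions about relationships, such as who is reachable from `u1`.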
Streaming Data: Over time, streaming data has become a vital component of data engineering. The possible sources of streaming data vary significantly, and handling them requires its own methods.
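One of those methods is windowing. The sketch below is a simplified tumbling-window count in plain Python, not the API of any streaming platform. It assumes events arrive as `(timestamp_in_seconds, key)` pairs already ordered by timestamp, and the function name `tumbling_counts` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_seconds: int
) -> Iterator[Tuple[int, dict]]:
    """Yield (window_start, counts_by_key) for timestamp-ordered events."""
    current_start = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            # The stream moved to a new window, so emit the finished one.
            yield current_start, dict(counts)
            counts = defaultdict(int)
        current_start = window_start
        counts[key] += 1
    if current_start is not None:
        # Emit the final, still-open window when the input ends.
        yield current_start, dict(counts)


events = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (25, "view")]
windows = list(tumbling_counts(events, window_seconds=10))
```

Real streams also have to deal with late and out-of-order events, which this sketch deliberately ignores.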
Serving Data: Data is often consumed by a user through an application. Serving that data is becoming a more common responsibility in the data engineering field. There are several ways to serve data for consumption; examples include REST, GraphQL, gRPC, and Kafka.
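As a minimal sketch of the REST option, here is a tiny JSON endpoint written against Python's standard WSGI interface. The `ORDERS` data, the `/orders/<id>` route, and the `call` helper are assumptions made for illustration; a real service would sit on a proper framework and data store.

```python
import json
from wsgiref.util import setup_testing_defaults

# Hypothetical in-memory data standing in for a real data store.
ORDERS = {"10": {"order_id": 10, "amount": 20.0}}


def app(environ, start_response):
    """Tiny WSGI app that serves /orders/<id> as JSON."""
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if len(parts) == 2 and parts[0] == "orders" and parts[1] in ORDERS:
        status, payload = "200 OK", ORDERS[parts[1]]
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(
        status,
        [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
    )
    return [body]


def call(path: str):
    """Invoke the app in-process, without opening a network socket."""
    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)
```

To try it over HTTP, the same `app` can be passed to `wsgiref.simple_server.make_server` from the standard library.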
Software Engineering: The fundamentals of software engineering are the foundation for data engineering. Being able to write reliable, maintainable, testable code is not optional anymore. Even when using automation tools like Airflow and Jenkins, you must be able to rely on the code you run and the data management it provides.
DataOps: Applying DevOps best practices to data is a growing and indispensable skill.
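One way this shows up in practice is treating data quality checks as code that runs automatically before a batch is published. The sketch below is an assumption about what such a check could look like; `check_batch` and the field names are hypothetical.

```python
def check_batch(rows: list) -> list:
    """Return a list of human-readable problems found in a batch of rows."""
    problems = []
    if not rows:
        problems.append("batch is empty")
    seen_ids = set()
    for index, row in enumerate(rows):
        order_id = row.get("order_id")
        if order_id is None:
            problems.append(f"row {index}: missing order_id")
        elif order_id in seen_ids:
            problems.append(f"row {index}: duplicate order_id {order_id}")
        else:
            seen_ids.add(order_id)
        amount = row.get("amount")
        if amount is not None and amount < 0:
            problems.append(f"row {index}: negative amount")
    return problems
```

A pipeline step can fail the run whenever the returned list is not empty, the same way a failing unit test blocks a deployment.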
MLOps: MLOps is a growing field of organizing, managing, and automating machine learning projects. In many organizations, MLOps has emerged as another job duty of data engineers.
SQL: Structured Query Language is a high-level abstraction for performing DML (data manipulation) and DDL (data definition) actions on a data store. Dialects vary across most RDBMS data stores, and the ability to write SQL is a fundamental skill.
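To show the difference between the two kinds of statements, here is a short sketch using Python's built-in `sqlite3` module. The `events` table is a made-up example, and the syntax follows SQLite's dialect, so other databases may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL defines structure: here, a table and its columns.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT NOT NULL)")

# DML works with the rows themselves: insert, update, and delete.
conn.executemany(
    "INSERT INTO events (id, kind) VALUES (?, ?)",
    [(1, "click"), (2, "view"), (3, "click")],
)
conn.execute("UPDATE events SET kind = ? WHERE id = ?", ("purchase", 3))
conn.execute("DELETE FROM events WHERE id = ?", (2,))

# Reading the rows back (SELECT is often grouped with DML).
remaining = conn.execute("SELECT id, kind FROM events ORDER BY id").fetchall()
```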
Python: After SQL, Python is one of the most popular languages for working with data. Since it's a general-purpose scripting language, it's very easy to write applications quickly.
Scala: With the popularity of Kafka and Spark, Scala has risen in importance in the data engineering world. Scala is a JVM language that has strong similarities to Java, and it is both an object-oriented and a functional programming language. Scala is so prominent in job descriptions that it is one of the top three languages recommended in the data engineering field.
Kafka: Kafka has become the most popular streaming data platform. Kafka has a large ecosystem of supporting applications, including Schema Registry, Kafka Connect, and ksqlDB. Kafka is an open-source Apache project, and its largest contributor is Confluent, which offers a managed service on every major cloud provider.
Spark: When it comes to working with data at scale, the most popular choice is Apache Spark. Apache Spark builds on the lessons learned from Hadoop/MapReduce and takes novel approaches to resolving their issues. Apache Spark is a general processing platform that can be used for everything from ETL, graph analysis, and geospatial work to machine learning. Moreover, Apache Spark is available on every major cloud provider and is also offered by Databricks, the largest contributor to the project. Databricks also offers unique features and speed advantages exclusive to the Databricks platform.
Data engineering is a quickly evolving role that encompasses many disciplines. As the industry moves toward a modern approach, data engineers are adopting newer skills from other disciplines. Most importantly, modern data engineering is moving toward open standards and a best-practices approach.