MLOps Basics Part 2

Brian Lipp
Jun 9, 2021 · 6 min read

In our first article, we introduced the basics of MLOps. Now we will talk about the core application in our tech stack: Airflow. Airflow will be the central orchestrator for all batch-related tasks.

Swapping technologies

This tech stack is designed for flexibility and scalability, so there should be no issues using alternative tooling. For example, if you wanted to replace Airflow, you could swap it out for another pipelining tool like Dagster or a cloud-specific pipeline tool.

Why Airflow?

Airflow does very well as a batch-processing orchestrator: it isn't a GUI-driven tool, it's simple to use, and it's still very flexible. Airflow uses Python configuration files to represent pipelines. Although your pipeline is written in Python, only minimal Python knowledge is needed; you are not writing full Python applications. Airflow is not tied to a platform and doesn't do the processing itself, so you are free to use whatever platform or processing engine you feel is ideal for your project.

What about streaming pipelines?

In cases where you have a streaming task, I suggest a tool like Jenkins or GitLab CI/CD to deploy them, treating them like you would any normal backend application. In a future article, we will automate the tasks not covered by Airflow, but for now, we will focus on batch-related tasks.

DAG

A Directed Acyclic Graph (DAG) can be understood as a group of steps whose relationships flow in one direction and never loop back. A DAG is very useful when creating complex data pipelines with many parent-child relationships. You may think: my projects are not so complex, why do I need a DAG? Even if you have only 3–5 steps with varying relationships, a DAG will often make it easier to express those relationships.

In Airflow, your pipeline is represented as a DAG.

Here is a simple Pipeline as an example:
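A minimal sketch of what this DAG could look like with Airflow 2.0's BashOperator; the DAG id, task IDs, schedule, and file names are illustrative:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two bash tasks that each write the current date to a file in /tmp.
with DAG(
    dag_id="simple_bash_pipeline",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:

    bash_1 = BashOperator(
        task_id="bash_1",
        bash_command="date > /tmp/bash_1_date.txt",
    )

    bash_2 = BashOperator(
        task_id="bash_2",
        bash_command="date > /tmp/bash_2_date.txt",
    )

    # bash_1 must finish before bash_2 starts
    bash_1 >> bash_2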

In the above pipeline, we have two tasks, both bash operations, that write the current date to files in the /tmp folder. The relationship defines that task bash_1 happens before bash_2.

Operators & Tasks

A single step is represented in Airflow as an Operator. When an operator is instantiated, it is called a "task".

It's best practice to use operators that are tied to the specific task you want to accomplish, such as working with Snowflake. There is a general rule of thumb with Airflow: Airflow doesn't do the work. Airflow should not handle data, artifacts, deployables, or really any of the work you need done. Airflow is just the manager; it doesn't get its hands dirty.

In cases where a specific operator for a resource isn't available or doesn't support a feature, hooks are often available that give low-level access to the resource. If you are unable to use a hook, you then have the choice of writing your own operator or writing some simple Python code to interact with the resource.
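As a hedged sketch of that pattern, here is what calling the Databricks provider's DatabricksHook from a plain PythonOperator could look like (this assumes the Databricks provider package we install later in this article); the cluster settings and notebook path are placeholders, and submit_run simply posts a one-off run to the Databricks Jobs API:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.hooks.databricks import DatabricksHook

def _submit_one_off_run():
    # Low-level access: talk to the Databricks Jobs API through the hook
    # instead of a purpose-built operator.
    hook = DatabricksHook(databricks_conn_id="databricks_default")
    run_id = hook.submit_run({
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # placeholder runtime
            "node_type_id": "i3.xlarge",         # placeholder node type
            "num_workers": 1,
        },
        "notebook_task": {"notebook_path": "/Shared/example"},  # placeholder path
    })
    return run_id

with DAG(
    dag_id="hook_example",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    submit_run = PythonOperator(
        task_id="submit_run_via_hook",
        python_callable=_submit_one_off_run,
    )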

Astronomer.io

Quickstart Guide

For this series of articles, we will be using Astronomer.io. Astronomer is a company that contributes heavily to the Airflow project and offers a SaaS service as well as a customized Docker image. Astronomer's Airflow is 100% compatible with the open-source project, so if you prefer to use another distribution of Airflow 2.0, the core content should work fine. Astronomer.io doesn't bill for local development, but cloud deployments will be billed.

Before we begin, follow the quickstart guide to deploy a sample DAG and get familiar with the process.

Astronomer’s development flow works in the following way:

  1. Create an account.
  2. Create a workspace.
  3. Create your local dev environment:

astro dev init

  4. Create or modify your DAG.
  5. Verify locally:

astro dev start

  6. Create a Deployment (not covered in this article).
  7. Deploy your DAG to the cloud (not covered in this article):

astro deploy

Note: The Quickstart Guide explains how to enable CLI access to the cloud.

Dockerfile

Additionally, we are going to pin our Docker image to the 2.0 release; this will avoid any instabilities. I strongly recommend pinning your Docker image even for local development and testing.
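For example, with Astronomer's Certified image the first line of the project's Dockerfile pins the image tag; the exact tag below is only an example, so check Astronomer's registry for the 2.0 tag available to you:

FROM quay.io/astronomer/ap-airflow:2.0.0-buster-onbuild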

Databricks

Databricks Spark will be used as the universal heavy-lifting tool in our pipeline. Why Databricks Spark? It's extremely flexible, it's a managed service, and the ramp-up time to become productive is not extensive. We will cover Databricks in more depth in the next article, but for now we need to set up our Databricks operator in Airflow.

By default, most operators are not included with Airflow. A very large number of operators are available in provider packages, which add extra hooks, operators, and connection types. We will be adding the provider for Databricks.

Docs: PyPI Documentation, Astronomer Documentation, Databricks Documentation, Jobs REST API

Setting up access

You can sign up for Databricks using this free trial.

I will be using AWS with this Databricks deployment, but in theory GCP or Azure should work the same.

Additionally, you will need to set up your AWS deployment and Instance profile. These steps can take some time and can’t be skipped.

The previous steps are specific to AWS; there are similar steps for GCP and Azure. My other suggestion is to use cross-account access when setting up the deployment.

Connection information

In order for Airflow to connect to our Spark cluster(s), we must add an access token and host to an Airflow connection. We are going to take a simplistic approach initially and, in a later article, integrate with a secrets manager.

Here is a guide to setting up an access token.

Here is a guide to managing connections. The configuration is as follows:

connection_ID: databricks_default
username: token
host: https://XXXXXX.cloud.databricks.com/ (refer to your Databricks deployment)
extra: {"token": "XXXXXXXXXXXXXXXXX"}

DO NOT COMMIT TOKENS TO SOURCE CONTROL (GITHUB)
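If you prefer the CLI over the UI, the same connection can also be created with Airflow's connections command; a sketch using the placeholder values above:

airflow connections add 'databricks_default' \
    --conn-type 'databricks' \
    --conn-login 'token' \
    --conn-host 'https://XXXXXX.cloud.databricks.com/' \
    --conn-extra '{"token": "XXXXXXXXXXXXXXXXX"}'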

Install the Databricks provider package

Provider packages add features and functionalities to Airflow. In this case, we will add a pip package to be pulled from PyPI.

Add the following to the requirements.txt file in our local workspace to add the provider:

apache-airflow-providers-databricks
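Once the local image is rebuilt (the astro dev start step in the Test section below does this), you can sanity-check that the provider was installed; a quick check, assuming you exec into the webserver container the same way as in the REST API section later:

docker exec -it <webserver's id> bash
airflow providers list | grep databricks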

Our DAGs

We are going to deploy three pipelines. The first will build our model; this DAG expects an input with two hyperparameter values:

{"max_depth": <>, "n_estimators": <>}
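A hedged sketch of how this first DAG could pass those values to a Databricks job through the provider's DatabricksSubmitRunOperator; the DAG id, notebook path, and cluster settings are placeholders, and the hyperparameters are pulled from the DAG run's conf with Jinja templating:

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="build_model_pipeline",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,  # triggered manually or via the REST API
    catchup=False,
) as dag:

    build_model = DatabricksSubmitRunOperator(
        task_id="build_model",
        databricks_conn_id="databricks_default",
        json={
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",  # placeholder runtime
                "node_type_id": "i3.xlarge",         # placeholder node type
                "num_workers": 1,
            },
            "notebook_task": {
                "notebook_path": "/Shared/build_model",  # placeholder notebook
                # The hyperparameters arrive through the DAG run's conf.
                "base_parameters": {
                    "max_depth": "{{ dag_run.conf['max_depth'] }}",
                    "n_estimators": "{{ dag_run.conf['n_estimators'] }}",
                },
            },
        },
    )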

The second pipeline we deploy will perform a hyperparameter grid search on our data and record the results in MLflow. This process will not utilize the Spark cluster.

Lastly, we will create a pipeline that performs the hyperparameter grid search utilizing our Spark cluster and records the results in MLflow.

Test

Once we have created our local development environment and made all of the above additions, we need to test our new Airflow image locally:

astro dev start

If you find that there is an issue with your DAG, run the following, make the needed changes, and start the environment again:

astro dev stop

Triggering our pipeline

Airflow offers several options for triggering your pipeline. You can create a schedule (not covered in this article), you can trigger manually from the UI or CLI, and you can use the REST API.

Here is a Guide to the UI.

REST API

The following code should trigger your DAG. You can run curl against your local Docker container or the SaaS service.

We won't be covering the hosted option, but for educational purposes, here is Astronomer's Guide to using the REST API with its SaaS service.

Connect to your local Docker container

docker ps
docker exec -it <webserver's id> bash

Once you have bashed into the container, you will be able to run curl against the webserver. Note that the Airflow REST API requires authentication; the example below assumes the basic auth backend is enabled and uses the local development default admin/admin credentials, so adjust it for your setup.

curl --user "admin:admin" \
--header "Content-Type: application/json" \
--request POST \
--data '{
"dag_run_id": "build_model_pipeline",
"execution_date": "2019-08-24T14:15:22Z",
"conf": {"max_depth": 80, "n_estimators": 100}
}' \
http://localhost:8080/api/v1/dags/{dag_id}/dagRuns

We have covered a great deal of ground, but we still have much more to go. Airflow is very flexible, and you can expand upon this tutorial with new resources and use cases. In the next few articles, we will build out our Spark applications, add CI/CD, and deploy our model.
