MLOps Basics Part 3

Brian Lipp
6 min read · Jun 9, 2021
Photo by Joshua Newton on Unsplash

In previous articles, we covered the basics of MLOps and set up our orchestrator. Now we will put together our Python applications and their interaction with Databricks Spark.

Why Databricks Spark?

We will be using Databricks Spark as a general platform to run our code. In some cases, we will use Spark to run code across the cluster. The benefit here is that we get a very easy-to-use way to run Python code without the complexity of servers, containers, Kubernetes, etc. This approach isn't the best choice in all cases, but for teams that want a simple path to automating SQL/Python/R/JVM code, it is a powerful option.

Just to make this clear: Databricks will spin up AWS EC2 instances as needed, run our code, and then terminate them. Databricks also offers a serverless option for SQL-only workflows, which I plan to cover in a future article.

Python Code

The code provided focuses on the core logic of the applications, but I highly suggest you follow basic software engineering practices such as error handling and logging, among others.

Here are the applications we will be using:

Why not a notebook?

In this example, we are choosing Python applications over notebooks. If your workflow uses notebooks, you are more than welcome to use them. Databricks has a powerful cloud-based notebook platform that integrates with Airflow, and Airflow's Databricks operators can interact with notebooks, so the same framework works for notebooks as well.
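
As a rough illustration (not the exact DAG from this series), submitting either a Python file or a notebook from Airflow can look something like this; the connection id, cluster id, and paths are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG("databricks_example", start_date=datetime(2021, 6, 1), schedule_interval=None) as dag:
    # Submit a Python file stored in DBFS to an existing cluster
    run_python_app = DatabricksSubmitRunOperator(
        task_id="run_python_app",
        databricks_conn_id="databricks_default",
        existing_cluster_id="<cluster-id>",
        spark_python_task={"python_file": "dbfs:/datalake/code/preprocessing/app.py"},
    )
    # The same operator can point at a notebook instead
    run_notebook = DatabricksSubmitRunOperator(
        task_id="run_notebook",
        databricks_conn_id="databricks_default",
        existing_cluster_id="<cluster-id>",
        notebook_task={"notebook_path": "/Users/<username>/my_notebook"},
    )
    run_python_app >> run_notebook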

Databricks CLI

Databricks has a very easy-to-use CLI that lets you control just about every component in the Databricks ecosystem. In this article, we will use it to manage DBFS (the Databricks File System), control our cluster, and deploy our applications.

Here are useful guides for setting up your Databricks CLI:

Deploy Cluster

This is the cluster configuration I used for my testing; you can use it as a template. You can also use a single-node (local mode) cluster to save even more money. I have configured an auto-termination timeout on the cluster, which avoids leaving AWS servers running with no work to do. Keep in mind that Databricks offers a free trial period, but you will still pay for any AWS costs.

cluster.json:
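
A minimal cluster spec along these lines works as a starting point; the runtime version, node type, and auto-termination value below are placeholders to adjust for your account, not necessarily the exact values I tested with:

{
  "cluster_name": "mlops-demo",
  "spark_version": "8.3.x-cpu-ml-scala2.12",
  "node_type_id": "m5.xlarge",
  "num_workers": 1,
  "autotermination_minutes": 30,
  "aws_attributes": {
    "availability": "SPOT_WITH_FALLBACK"
  }
}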

To deploy your cluster, you can use the CLI:

databricks clusters create --json-file cluster.json

Library Deployment in Cluster

Now that we have our cluster deployed, we will need to install any needed libraries. We used a machine learning runtime for the cluster, so most of the libraries are provided by Databricks.

Here is the CLI command for installing the needed library on the cluster:

databricks libraries install --pypi-package yfinance --cluster-id <cluster-id>

To see your cluster IDs, you can run:

databricks clusters list

Deploy Spark Code

For simplicity's sake, we will be uploading the code to DBFS. Spark can also pull the code from cloud storage like S3. You can also incorporate an artifact repository, but there must be a basic application that Spark loads from DBFS or blob storage.

Here are our steps:

1. Create the “data lake”

Here we will create the needed folders in DBFS.

databricks fs rm dbfs:/datalake -r 
databricks fs mkdirs dbfs:/datalake
databricks fs mkdirs dbfs:/datalake/code
databricks fs ls dbfs:/datalake

2. Deploy the Preprocess code

Now we will deploy the preprocessing code to DBFS. The code pulls data from Yahoo Finance, does some basic feature engineering, and saves that data (a hypothetical sketch of this app follows the deployment steps below).

cd preprocessing_folder
databricks fs rm -r dbfs:/datalake/code/preprocessing
databricks fs mkdirs dbfs:/datalake/code/preprocessing
databricks fs cp app dbfs:/datalake/code/preprocessing -r

3. Deploy the Model code

Our model code pulls the data saved by the preprocessing task, takes hyperparameters as input from Airflow, and then builds our model. The app then records useful information in mlflow and saves the built model for future use.

cd model_folder 
databricks fs rm -r dbfs:/datalake/code/model
databricks fs mkdirs dbfs:/datalake/code/model
databricks fs cp app dbfs:/datalake/code/model -r

4. Deploy the Search

I have added two apps that will perform a search to find the ideal hyperparameter combination. This first version does not distribute any work across our Spark cluster.

cd search_folder 
databricks fs rm -r dbfs:/datalake/code/search
databricks fs mkdirs dbfs:/datalake/code/search
databricks fs cp app dbfs:/datalake/code/search -r

5. Deploy the Cluster Search

This code performs the same hyperparameter search, but distributed with Spark. Using our cluster lets us spread out the load of building the large number of models the search requires (see the distributed-search sketch after these deployment steps).

cd cluster_search_folder 
databricks fs rm -r dbfs:/datalake/code/cluster_search
databricks fs mkdirs dbfs:/datalake/code/cluster_search
databricks fs cp app dbfs:/datalake/code/cluster_search -r
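
To make step 2 above more concrete, here is a hypothetical sketch of a preprocessing app along those lines; the ticker, features, and output path are illustrative assumptions, not the project's actual code.

# Hypothetical preprocessing sketch: pull prices, add simple features, save as Delta
import yfinance as yf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ticker and date range are placeholders
history = yf.Ticker("AAPL").history(start="2015-01-01", end="2021-01-01")
history = history.reset_index()
history["Date"] = history["Date"].dt.date  # plain dates avoid timezone surprises in Spark

# Basic feature engineering on the pandas DataFrame
history["return_1d"] = history["Close"].pct_change()
history["ma_7"] = history["Close"].rolling(7).mean()
features = history[["Date", "Close", "Volume", "return_1d", "ma_7"]].dropna()

# Save the features for the model task; the path is an assumption
spark.createDataFrame(features).write.format("delta").mode("overwrite") \
    .save("dbfs:/datalake/stocks/preprocessed")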
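
For the cluster search in step 5, one common way to distribute a hyperparameter search on Databricks is Hyperopt with SparkTrials; the sketch below uses stand-in data and a stand-in model, so treat it as an illustration of the pattern rather than the project's actual search code.

# Illustrative distributed search with Hyperopt's SparkTrials (stand-in data and model)
import numpy as np
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in data; the real pipeline would load the preprocessed stock features
X = np.random.rand(500, 4)
y = np.random.rand(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

def objective(params):
    model = RandomForestRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    ).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    return {"loss": rmse, "status": STATUS_OK}

space = {
    "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}

# SparkTrials fans the individual trials out across the cluster's workers
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=32, trials=SparkTrials(parallelism=4))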

Triggering our Model

Now that our code is deployed, let's trigger the model pipeline to create a machine learning model. Take note of where the preprocessed data and the pickled model are stored.
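
If you are using the Airflow setup from the previous article, a manual trigger can be as simple as the following (the DAG id is a placeholder); you can also trigger the DAG from the Airflow UI.

airflow dags trigger <dag-id>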

Using the Model in Spark SQL

We are going to test our model using SQL and a little Python.

Being able to take a standard scikit-learn model and query it with pure SQL against just about any data source (Kafka, REST APIs, MongoDB, RDBMSs, data lakes, data warehouses, Neo4j, etc.) is something fairly unique to Spark.

Here we will register our pickled sklearn model as a Spark SQL UDF.

import mlflow.pyfunc

model_path = "/Users/<username>/datalake/stocks/experiments/<experiment_name>"
predict = mlflow.pyfunc.spark_udf(spark, model_path, result_type="double")
spark.udf.register("predictionUDF", predict)

Now we will use the path where our “preprocess” task saved the data to create a temporary view over the Delta table.

CREATE OR REPLACE TEMPORARY VIEW stocksParquet
USING DELTA
OPTIONS(
path = "..."
)

Now we will use our UDF to make predictions; the output will be written to a new table called predictions.

USE <database>;
DROP TABLE IF EXISTS predictions;
CREATE TABLE predictions AS (
SELECT *, cast(predictionUDF(cols..) AS double) AS prediction
FROM stocksParquet
LIMIT 10000
)

Finally, you will be able to see your predictions using the following SQL:

SELECT * FROM predictions;

Saving your model

In our next article, we will use our pickled model. First, we will manually download the model and save it locally.

Your ML model should be accessible in the Experiment:

workspaces/users/<username>/datalake/stocks/experiments/<name of experiment>

On the experiment screen, you will see several runs, among other things. Click the model that corresponds to the run you want to use.

You will then be presented with a screen that allows you to download the ML model. Click on the pickled model, model.pkl, then the download artifact icon.

This is a basic way to access your models. Save this model, because in our next article we will package it in a Docker image and deploy it in an AWS Lambda.
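
If you would rather script this step than click through the UI, something along these lines should also work; the run id and the "model/model.pkl" artifact path are assumptions based on the layout described above.

# Hypothetical programmatic alternative to the manual UI download
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses your configured Databricks/mlflow tracking settings
# "<run-id>" is the run you picked on the experiment screen
local_path = client.download_artifacts("<run-id>", "model/model.pkl", dst_path=".")
print(local_path)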

mlflow

In our Python code, we used mlflow to manage our machine learning lifecycle. In this section, we will go through some of the basics of mlflow.

mlflow is a powerful, minimalistic ML lifecycle platform. Its benefit over similar tools is that it is completely neutral with respect to the platform, language, or method you use to create your pipeline. mlflow lets you easily record your versioning, metrics, parameters, and models. There are two ways you can interact with mlflow:

  • REST API
  • Native libraries

mlflow uses the following concepts:

Experiments:

An experiment is a way of grouping the runs of an ML project. To set your experiment, use the following command:

mlflow.set_experiment("path to folder where stored..")

Run:

Within each experiment, you will have several iterations (runs) that you want to track:

with mlflow.start_run(run_name='arima_param'):
    ...

Parameter:

Parameters can be thought of as knobs for tuning your model; oftentimes they are hyperparameters. mlflow lets you record any parameter you use in your experiment.

mlflow.log_param('n_estimators', 7000)

Metric:

A metric is a way of measuring your run. We will be looking at metrics like RMSE, MAE, and R².

mlflow.log_metric("r2", r2)
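
Putting these pieces together, a run inside the model app could look roughly like this; the parameters and the sklearn model are illustrative stand-ins, not the exact code from the repo.

# Illustrative end-to-end mlflow usage: experiment, run, parameter, metric, model
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

mlflow.set_experiment("/Users/<username>/datalake/stocks/experiments/<experiment_name>")

X = np.random.rand(200, 3)  # stand-in features
y = np.random.rand(200)     # stand-in target

with mlflow.start_run(run_name="example_run"):
    n_estimators = 100
    mlflow.log_param("n_estimators", n_estimators)

    model = RandomForestRegressor(n_estimators=n_estimators).fit(X, y)
    mlflow.log_metric("r2", r2_score(y, model.predict(X)))

    # Stores model.pkl (plus metadata) under the run's "model" artifact folder
    mlflow.sklearn.log_model(model, "model")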

At this point, we have an orchestrator and all of our Spark applications deployed. In the upcoming articles, we will pick up where we left off, deploy our model to production, and then add some best practices to our workflow.
