MLOps Basics Part 1
MLOps (Machine Learning Operations) is the practice of applying the lessons learned from DevOps to putting machine learning into production. Its role is to bridge the gap between data scientists and the consumers of machine learning.
Machine Learning? Data Science?
Machine Learning can be understood as the process of applying a set of techniques to a body of data to create a limited “picture of how the world works”, called a model. Creating a model in this way is called training the model. Once you have a trained model, you can apply it to new data to better understand the past (Data Mining) or the future (Machine Learning). This description is simplistic, but it will work for our discussion.
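To make "training" and "using a model with new data" concrete, here is a minimal sketch in plain Python. It fits a straight line by ordinary least squares, which is only one of many possible techniques and not necessarily the one used later in this series. The function names `train` and `predict` and the price figures are made up for illustration.

```python
from statistics import mean


def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    # The "model" is just the two numbers learned from the data.
    return {"slope": slope, "intercept": intercept}


def predict(model, x):
    """Apply a trained model to a value it has not seen before."""
    return model["slope"] * x + model["intercept"]


# Hypothetical training data: day number -> closing price.
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]

model = train(days, prices)      # training: data in, model out
next_price = predict(model, 6)   # inference: model + new data in, estimate out
```

The point is the split between the two steps: training happens once on historical data, and the resulting model is a small artifact that can be stored, versioned, and reused on new data.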
A data scientist is typically a data analyst who also has knowledge of machine learning and data modeling. The role can vary, but its focus is often on analysis and research. Sometimes this research produces a model, paired with the data preprocessing that the model requires. Once the research is complete, the data scientist hands it off to an engineer to bring it into “production”. This last step is where MLOps steps in to bring all that work to life.
In simple terms, MLOps covers:
- Reproducibility of ML/DM
- Automation of ML/DM pipelines
- Versioning of ML/DM models and data
- ML/DM lifecycle management
- Monitoring of ML/DM
I won’t be able to give examples in all of those areas, but I will go through enough examples to paint a general picture.
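As a first taste of reproducibility and versioning, the sketch below uses only the Python standard library. It assumes a toy "training" step (averaging a random sample of prices) and a hypothetical `fingerprint` helper; neither is part of any tool in this series. The idea is that a fixed seed makes a run repeatable, and a hash of the data plus parameters gives each run a version id.

```python
import hashlib
import json
import random


def fingerprint(rows, params):
    """Derive a short, deterministic version id from the data and parameters."""
    payload = json.dumps({"rows": rows, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def train_run(rows, params):
    """Toy training run: average a seeded random sample of the rows."""
    rng = random.Random(params["seed"])  # local, seeded RNG: no global state
    sample = rng.sample(rows, k=params["sample_size"])
    model = {"mean_price": sum(sample) / len(sample)}
    return {"version": fingerprint(rows, params), "model": model}


# Hypothetical inputs, for illustration only.
rows = [10.0, 12.0, 14.0, 16.0, 18.0]
params = {"seed": 42, "sample_size": 3}

first = train_run(rows, params)
second = train_run(rows, params)             # same data + params: same result
changed = train_run(rows + [20.0], params)   # new data: new version id
```

Running the same data and parameters twice gives an identical result and version id, while any change to the data produces a different id. Real platforms track far more than this, but the principle is the same.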
This is the first in a series of articles on taking very basic steps toward MLOps. In our scenario, we have been hired to set up MLOps for an investing company, Tesseract Inc. Tesseract has a Snowflake data warehouse from which we will pull stock data.
Our researched A.I.
The A.I. used (patent pending) is as follows:
Preprocessing of data
Our MLOps Tech Stack:
- Airflow: A batch orchestration framework we will be using to automate our ML pipelines.
- Databricks Spark: A general-purpose parallel processing framework we will use to process the data in Snowflake and train our ML model.
- MLflow: An open-source platform for tracking experiments, packaging models, and managing the model lifecycle.
- AWS Lambda: A serverless platform we will use to offer access to our model.
- GitLab: A DevOps and version control platform, which we will use for CI-related tasks.
Our tech stack is built from flexible tools that work in a wide range of situations and alongside many other tools. Each one can be replaced with an alternative as needed.
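To give a rough idea of what a batch orchestrator does for a stack like this, here is a plain-Python stand-in. It is not Airflow's API. The task names, the dependency map, and the in-memory `ctx` dictionary are all assumptions made for this sketch, and the "Snowflake" extract is faked with a hard-coded list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract(ctx):
    # Stand-in for pulling stock prices from the warehouse.
    ctx["raw"] = [10.0, 12.0, None, 16.0]


def preprocess(ctx):
    # Drop missing values before training.
    ctx["clean"] = [p for p in ctx["raw"] if p is not None]


def train(ctx):
    # Toy "model": the mean of the cleaned prices.
    ctx["model"] = {"mean_price": sum(ctx["clean"]) / len(ctx["clean"])}


def publish(ctx):
    # Stand-in for registering and deploying the model.
    ctx["published"] = True


TASKS = {"extract": extract, "preprocess": preprocess,
         "train": train, "publish": publish}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {
    "extract": set(),
    "preprocess": {"extract"},
    "train": {"preprocess"},
    "publish": {"train"},
}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
```

Because each step is an ordinary function and the ordering lives in a separate dependency map, the thing that runs the steps can be swapped out without touching the steps themselves, which is the kind of replaceability described above.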
Our next article will go through the first steps of our ML pipeline and introduce Airflow as our core pipeline tool. Later articles will cover CI/CD, orchestration, and processing.