What is MLOps and Why Do We Need It?

Machine learning models are expected to evolve and become more precise in the data-driven business world as more data is collected. MLOps, the unification of machine learning workflows and DevOps principles, makes sure that expectation is met.

Almost all new business ideas nowadays revolve around how to effectively use the data from our surroundings to deliver better service to our customers. From the software development perspective, the soundest approach is to develop ML workflows and integrate them into the already existing DevOps process. This is most often referred to as Machine Learning Operations (MLOps).

MLOps combines the best of both worlds to enable faster experimentation and machine learning model management, rapid deployment of ML models into production, and top-notch quality assurance.

Read on to learn which problems MLOps solves, the phases of MLOps, how it compares to DevOps, and some of the most successful business applications of MLOps.

What is MLOps?

MLOps is a process that combines the best practices of machine learning model development, software development, and operations to enable data scientists and IT teams to work together and increase the efficiency of the ML workflow.

The word MLOps is a combination of machine learning (ML) and the software development practice DevOps. According to Gartner, MLOps is a subset of the more general term called ModelOps. Like DevOps, MLOps increases the speed of model development, enhances the workflow with continuous integration and deployment methods, and installs proper validation mechanisms alongside monitoring and overall management of the workflow.

MLOps started as a set of best practices and is rapidly evolving into an independent approach to managing the ML application lifecycle, from model generation to CI/CD, deployment, diagnostics, governance, orchestration, and business metrics.

Why do we need MLOps?

MLOps helps eliminate the so-called “deployment gap” in industries which are starting to use ML to meet business goals. In addition, MLOps tends to shorten the time to market and enables efficient team communication. Let’s consider each of these points in detail.

Deployment gap

As suggested by the Algorithmia report, most companies experimenting with ML and AI still haven’t found a way to meet their business goals. The main reason for this is that it’s difficult to bridge the gap between experimentation and the real-world deployment of ML models.

MLOps is addressing the ML deployment gap by offering tools for easier management of models in production. As a core component of MLOps, DevOps allows software companies to move from a monthly or quarterly release cycle to daily or weekly cycles.

Building CI/CD pipelines for machine learning as part of an MLOps process is more challenging than with traditional software. Still, the automation of data collection, model training, and model evaluation allows data scientists and ML engineers to focus their efforts on improving deployed ML models without having to worry about the underlying deployment processes.

Time to market

Developing ML models quickly and effectively has been enabled by the large ecosystem of available model development tools such as PyTorch and TensorFlow. While these tools are valuable, they only allow the rapid development of ad-hoc, single-versioned ML workflows.

However, ML production is much more than ad-hoc, single-versioned ML workflows. It is about continuously evolving machine learning capabilities. This is the area where Machine Learning Operations delivers most of its value.

Machine learning models usually use high-dimensional data, and data rarely remains static over long periods. MLOps offers tools to perform scheduled or on-demand retraining of the model to maintain accuracy and robustness. Furthermore, models may need to be retrained on a per-customer basis depending on customer-specific data. In these cases, MLOps techniques allow mass customization as part of the automation workflow.

Finally, a more extreme yet very likely scenario is to have one model architecture that runs in production using several versions of multiple datasets that need to be retrained seasonally. MLOps versioning of data and model parameters is the key to achieving this.
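One way to picture data and parameter versioning is to derive a version identifier from the content itself. The sketch below is a minimal illustration that assumes datasets and parameters are small and JSON-serializable; the names fingerprint and version_run, and the sample datasets, are hypothetical and not taken from any particular MLOps tool.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a short, deterministic hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def version_run(dataset_rows, params) -> dict:
    """Record which data and which parameters produced a given model."""
    return {
        "data_version": fingerprint(dataset_rows),
        "params_version": fingerprint(params),
    }


# Hypothetical seasonal datasets trained with the same parameters.
params = {"learning_rate": 0.01, "max_depth": 4}
winter = [{"sku": "A", "sales": 10}, {"sku": "B", "sales": 3}]
summer = [{"sku": "A", "sales": 25}, {"sku": "B", "sales": 1}]

winter_run = version_run(winter, params)
summer_run = version_run(summer, params)
```

Here the two runs share a params_version but differ in data_version, so each seasonal model can be traced back to the exact data it was trained on.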

Efficient team communication

One of the essential factors for accelerated software development is effective team communication. Merging the development and operations teams under one roof has enabled teams to use the same tools and automate processes that are traditionally slow and manual.

Machine Learning Operations allows data scientists, ML engineers, and software developers to work alongside each other and communicate effectively. MLOps prioritizes the development of pipelines over developing and deploying models in isolation. To achieve pipeline-based development and deployment, teams are often required to define every component as code and to manage all changes to those components the same way. Such an approach may require additional work compared to the development of models in isolation. However, it allows scalability.

The compartmentalization of work in MLOps resembles a microservices approach to the development of large projects; a data scientist is not required to develop the entire model by themselves. Instead, the team works on separate stages of the ML pipeline (preprocessing, training, testing). In this way, teams can develop and maintain more complex models and benefit in the long run.
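The split into stages can be sketched as plain functions with narrow interfaces, so that each stage can be owned and tested separately. This is only an illustration: the stage names follow the paragraph above, while the toy least-squares model and the sample data are assumptions made for the example.

```python
from statistics import mean


def preprocess(rows):
    """Drop rows with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return {"slope": slope, "intercept": y_bar - slope * x_bar}


def evaluate(model, rows):
    """Return the mean absolute error on held-out rows."""
    return mean(abs(model["slope"] * x + model["intercept"] - y) for x, y in rows)


def run_pipeline(train_rows, test_rows):
    """Chain the stages; each one only sees the previous stage's output."""
    model = train(preprocess(train_rows))
    return model, evaluate(model, preprocess(test_rows))


model, error = run_pipeline(
    [(1, 2.0), (2, 4.0), (None, 1.0), (3, 6.0)],
    [(4, 8.0), (5, 10.0)],
)
```

Because the stages communicate only through their inputs and outputs, one person can swap in a better preprocess or train function without touching the rest of the pipeline.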

MLOps vs DevOps: what’s the difference?

As a practice of developing and operating modern software systems, DevOps shortens development cycles and increases deployment speed. MLOps extends continuous integration (CI) and continuous delivery (CD) concepts to machine learning systems.

Despite apparent similarities between machine learning ops and DevOps, there are some differences in terms of team composition, development process, testing, and deployment.

Team composition

MLOps teams include data scientists and ML engineers who focus on developing stages inside ML pipelines, such as feature engineering, exploratory analysis, and model experimentation. Unlike members of DevOps teams, MLOps team members may not have the software knowledge and experience to build reliable production-grade software systems.

Development process

ML models are developed in an iterative and experiment-heavy fashion. Parameter tuning and feature engineering are an essential part of development. While only code and environments are versioned in DevOps, MLOps requires the versioning of data itself and the sets of tuning parameters. Any change in data, algorithms, modeling techniques or parameter configuration must trigger processes that deploy and monitor performance.

DevOps usually deals with deterministic systems, while MLOps is oriented towards probabilistic methods. Hence, the challenge of reproducibility and reusability of software is amplified in MLOps.

Testing

Testing in Machine Learning Operations is a much more demanding process than it is in DevOps. This is mainly because in ML, raw data needs to be validated to ensure that data is clean and does not contain anomalies that may result in poor model performance.

In addition, the resulting “clean” data needs to be tested against the expected statistical distribution characteristics. Finally, ML algorithms are tested, and performance metrics are tracked to ensure that the model fits the business problem and meets fairness and ethical requirements.
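The two kinds of data tests described above might look roughly like the following. This is a simplified sketch: the field names, the allowed ranges, and the use of a mean shift measured in standard deviations as the distribution check are all assumptions chosen for brevity, and real projects typically rely on richer statistical tests.

```python
from statistics import mean, stdev


def validate_rows(rows, required, ranges):
    """Return a list of human-readable problems found in the raw rows."""
    problems = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {i}: missing {field}")
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not low <= value <= high:
                problems.append(f"row {i}: {field}={value} outside [{low}, {high}]")
    return problems


def mean_shift_in_sigmas(reference, candidate):
    """How many reference standard deviations the candidate mean has moved."""
    return abs(mean(candidate) - mean(reference)) / stdev(reference)


# Hypothetical raw rows with one out-of-range value and one missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": -5, "income": 48000},
    {"age": 41, "income": None},
]
problems = validate_rows(rows, required=["age", "income"], ranges={"age": (0, 120)})
shift = mean_shift_in_sigmas([10, 12, 11, 13, 9], [20, 22, 21])
```

A CI job could fail the pipeline when problems is non-empty or when the shift exceeds a threshold agreed with the business.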

Deployment and production

MLOps deploys ML systems as multi-stage pipelines requiring automatic model retraining and re-deployment mechanisms. Models in production need to be monitored to identify various phenomena such as model and data shifts. These phenomena have no direct counterpart in traditional DevOps.

How MLOps works: phases

MLOps helps engineers and scientists implement the stages of the ML pipeline either as manual or fully automated processes. The process consists of at least five stages: data preparation, model development, model validation, model deployment, and model and data monitoring.

Data preparation

The starting point of the ML workflow is the data. Data comes in from various sources in a variety of formats. Therefore, inconsistencies in data need to be removed first. Next, data needs to be labeled and put in a format ML models can consume.

Machine Learning Operations enables continuous data quality improvement at the data preparation stage and offers automation mechanisms so that new and better data is used in model development. In addition, MLOps does the versioning of source data and metadata (data attributes).

Model development

Model development usually consists of several sub-steps such as feature engineering, ML algorithm selection, hyperparameter tuning, model fitting, and model evaluation. MLOps allows engineers to track metrics and learn from mistakes during the model development stage in the ML workflow.

Since developing models involves writing code, MLOps processes for versioning the source code come in handy. As well as the code, the environment, dependencies and data can also be versioned for reproducibility. With ML pipelines, MLOps helps engineers create “checkpoints” in the process and allows them to re-run only the necessary elements of the pipeline, reducing development time and increasing the overall efficiency of the model development process.
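The checkpoint idea can be sketched as a cache keyed by the step name and its input: a step runs again only when its input has changed. The code below is a toy illustration under stated assumptions; the in-memory dictionary stands in for a persistent artifact store, the names run_step and pipeline are hypothetical, and the cache key deliberately ignores changes to the step's code.

```python
import hashlib
import json

_cache = {}    # in-memory stand-in for a persistent artifact store
executed = []  # records which steps actually ran, for illustration only


def run_step(name, func, inputs):
    """Run func(inputs) unless the same step with the same input was already computed."""
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode("utf-8")
    ).hexdigest()
    if key not in _cache:
        executed.append(name)
        _cache[key] = func(inputs)
    return _cache[key]


def pipeline(raw, scale):
    """Two-step pipeline: drop missing values, then scale what is left."""
    cleaned = run_step("clean", lambda rows: [r for r in rows if r is not None], raw)
    return run_step(
        "scale",
        lambda args: [value * args["scale"] for value in args["rows"]],
        {"rows": cleaned, "scale": scale},
    )


first = pipeline([1, None, 2, 3], scale=10)
second = pipeline([1, None, 2, 3], scale=100)  # only the "scale" step re-runs
```

On the second call the cleaning step is served from the cache, which is the saving the paragraph above describes.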

Model validation

After data scientists have created models with suitable performance in the development environment, these models need to be deployed in a production environment, accessible by the end customer. Before real-world deployment, developed models usually go through a validation process. The models are validated from a business, technical, and, if necessary, ethical perspective. MLOps assists in validating the created models by offering techniques and tools for automating the validation. Perhaps the best thing is that MLOps accomplishes this in a reproducible and low-cost manner.
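An automated validation step can be as simple as a gate that compares a candidate model's metrics with agreed thresholds and with the model currently in production. The sketch below assumes the metrics have already been computed; the metric names and threshold values are hypothetical.

```python
def validation_gate(candidate, baseline, thresholds):
    """Return (approved, reasons) for a candidate model's metrics."""
    reasons = []
    # Business check: the model must be good enough to be useful.
    if candidate["accuracy"] < thresholds["min_accuracy"]:
        reasons.append("accuracy below business minimum")
    if candidate["accuracy"] < baseline["accuracy"]:
        reasons.append("accuracy worse than current production model")
    # Technical check: the model must respond fast enough.
    if candidate["latency_ms"] > thresholds["max_latency_ms"]:
        reasons.append("latency above technical limit")
    # Ethical check: performance should not differ too much between groups.
    if candidate["group_accuracy_gap"] > thresholds["max_group_gap"]:
        reasons.append("accuracy gap between groups too large")
    return (not reasons, reasons)


thresholds = {"min_accuracy": 0.90, "max_latency_ms": 50, "max_group_gap": 0.05}
baseline = {"accuracy": 0.91}
good = {"accuracy": 0.93, "latency_ms": 30, "group_accuracy_gap": 0.02}
slow = {"accuracy": 0.93, "latency_ms": 80, "group_accuracy_gap": 0.02}

approved, reasons = validation_gate(good, baseline, thresholds)
rejected, why = validation_gate(slow, baseline, thresholds)
```

Running the same gate on every candidate is what makes the validation reproducible: the decision depends only on the recorded metrics and thresholds.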

Model deployment

After the validation phase, the models are put into the machine learning production environment and run in customer applications. The validated models may provide classification or prediction results based on the supplied data. As part of the Machine Learning Operations strategy, the models can be deployed as microservices with well-defined application programming interfaces (APIs), or inside embedded elements such as mobile devices, wearables, or self-driving car electronic control units (ECUs).
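A model served as a microservice might expose a single prediction endpoint like the one sketched below, using only the Python standard library. Everything specific here is an assumption made for illustration: the fixed linear scorer standing in for a real model artifact, the feature names, the response fields, and the port.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical validated model: a fixed linear scorer standing in for a real artifact.
MODEL = {
    "version": "2024-01",
    "weights": {"amount": 0.002, "is_foreign": 0.5},
    "bias": -1.0,
}


def handle_predict(body: str):
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)
        score = MODEL["bias"] + sum(
            weight * float(features[name]) for name, weight in MODEL["weights"].items()
        )
    except (ValueError, KeyError, TypeError):
        return 400, json.dumps({"error": "expected JSON with amount and is_foreign"})
    return 200, json.dumps({"model_version": MODEL["version"], "flagged": score > 0})


class PredictHandler(BaseHTTPRequestHandler):
    """Minimal HTTP wrapper around handle_predict."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length).decode("utf-8"))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))


def serve(port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

Keeping the prediction logic in handle_predict, separate from the HTTP handler, means the same function can be unit-tested in CI or reused in an embedded deployment. Returning the model version with every response also helps when tracing predictions back to a specific model.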

Model and data monitoring

ML models continuously evolve and get better as more data becomes available. However, this process is not inherent to the model. As more data becomes available, a model’s performance may actually degrade, and only improve after its parameters are re-tuned according to the newly available data. For this reason, both model performance and the input data characteristics need to be continuously monitored.

This monitoring is done to spot performance issues and phenomena such as “model drift” and “data drift”. Model drift occurs when the business need evolves over time, so that a model no longer answers the question the business is asking and has to be retrained and redeployed. Data drift is slightly different and has to do with situations where the model is trained on a specific data distribution that changes due to the nature of the business (customer preferences change, seasons change, new products are added).

For monitoring, MLOps offers various tools that track the performance metrics, detect model or data drifts automatically, and trigger the retraining of the model to guarantee delivery of the desired performance.
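One common way to detect data drift on a single numeric feature is the population stability index (PSI), which compares how training-time and live values fall into a fixed set of bins. The sketch below is a minimal version under stated assumptions: equal-width bins taken from the reference sample, a small floor for empty bins, and a retraining threshold of 0.2, which is a frequently quoted rule of thumb rather than a fixed standard.

```python
import math


def psi(reference, current, bins=5):
    """Population stability index between two samples of one numeric feature."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # avoid zero width for constant data

    def shares(values):
        counts = [0] * bins
        for value in values:
            # Values outside the reference range fall into the first or last bin.
            index = min(max(int((value - low) / width), 0), bins - 1)
            counts[index] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-4) for count in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(reference, current, threshold=0.2):
    """Trigger retraining when the drift score exceeds the threshold."""
    return psi(reference, current) > threshold


training_values = list(range(100))
live_same = list(range(100))
live_shifted = [value + 60 for value in training_values]

stable = should_retrain(training_values, live_same)
drifted = should_retrain(training_values, live_shifted)
```

In a monitoring job, a check like this would run on a schedule for each important feature, and a positive result would start the retraining pipeline.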

Real-world MLOps implementations

MLOps delivers clear benefits for the majority of business cases. Due to its many advantages, MLOps is utilized by numerous companies around the world.

NVIDIA

NVIDIA utilizes MLOps to manage the AI lifecycle of its products. NVIDIA divides its ML workflow into components and has created a pipeline-based feedback loop where the operational output optimizes the original ML pipeline. Most of the pipeline components are based on Google’s MLOps Manifesto. The developed architecture at NVIDIA compresses the ML workflow and combines it with the end-user and the application that monitors the deployed machine learning models.

Spotify

Spotify utilizes machine learning to deliver value to the end-users of its platform. In a blog post, the company describes the evolution of its machine learning infrastructure, which can be compared with MLOps levels 0, 1, and 2.

In recent years, Spotify has transitioned to fully automated machine learning pipelines for model development and deployment, using the best practices and tools that MLOps offers.

Ocado

As the leading online-only supermarket, Ocado utilizes machine learning to efficiently handle millions of events generated every minute as their customers navigate the online shop, fill their virtual baskets, check out, and pay. Ocado uses ML to ensure a better shopping experience, secure transactions, and optimize its supply chain and manufacturing with AI.

Due to the changing nature of data and its business goals, the company implements MLOps to quickly retrain and redeploy its machine learning models as new data becomes available. The developed models are continuously monitored for distribution shift and retrained on demand.

Revolut

Revolut, an online-only banking services company, trains machine learning models that use transaction data to detect fraudulent card transactions. According to Revolut’s Dmitri Lihhatsov, the company built Sherlock, a fraud-detection machine learning system, in only nine months. The company operates Sherlock based on core Machine Learning Operations principles.

In addition, it has developed automated model deployment mechanisms, and actively monitors models and retrains them in production.

Netflix

As the world’s most popular TV show and movie streaming platform, Netflix utilizes machine learning models for almost every feature in its product offering, including a personalized experience. Thanks to MLOps, the company is capable of managing thousands of machine learning models. Each of these models operates with thousands of different datasets in the background. The training, deployment, and overall management of these models would be impossible without automated processes supported by MLOps.

Conclusion

MLOps, when implemented correctly, has the power to fully automate the ML workflow from data preparation to model deployment and monitoring. However, implementing MLOps in reality requires knowledge and experience in both machine learning and in the software development process. Lack of experience in these domains often prevents companies from achieving success in deploying robust machine learning models.

We at PixelPlex offer our services in designing and operating MLOps systems to companies of any size. Our team of skilled software developers and machine learning experts and our long track record of successful projects and awards in the domain position us as ideal partners for your MLOps needs.