MLOps: What Is It and Why Do We Need It?
Machine learning models are expected to evolve and become more precise in the data-driven business world as more data is collected. MLOps, the unification of machine learning workflows and DevOps principles, makes sure that expectation is met.
In the software-defined business world, the speed of delivery is a critical success factor. As a result, over the past decades software development has evolved from a slow and linear process, commonly known as the waterfall model, into a more iterative and continuous process guided by DevOps principles. DevOps empowers companies to tackle challenges associated with rapidly evolving market conditions. Through DevOps, software can be delivered and upgraded faster than ever before.
In parallel, almost all new business ideas nowadays revolve around how to effectively use the data from our surroundings to deliver better service to our customers. Therefore many companies apply machine learning (ML) techniques to provide valuable software features to their customers. The value is harnessed by collecting a vast amount of data and processing it to draw valuable insights that can be turned into product features.
From the software development perspective, the most sound approach is to develop ML workflows and integrate them as a part of the already existing DevOps process. This is most often referred to as Machine Learning Operations (MLOps), the unification of machine learning workflow and DevOps principles. MLOps combines the best of both worlds to enable faster experimentation and machine learning model management, rapid deployment of ML models into production, and top-notch quality assurance.
Read on to learn which problems MLOps solves, the phases of MLOps, how it compares to DevOps, and some of the most successful business applications of MLOps.
The importance of machine learning in industry
Even though artificial intelligence has existed since the 1950s, most progress in the field has been achieved recently. The main reasons for this are the rapidly decreasing cost of general-purpose computing hardware and increased communication throughput, which unlock new opportunities to create a data-driven world.
In its 2019 Digital Economy Compass, Statista has identified two main trends that have great potential to disrupt both the economic models and the way we live our lives:
- the data-driven world, fueled by the exponential growth of digitally collected data;
- the increasing importance of AI and ML that harness insight from the data.
Many companies are taking these opportunities seriously. Some of them have spent decades collecting data in their data lakes and now have the tools to analyze it. The most prominent feature of machine learning is the continuous improvement in the accuracy of results through the supply of new data. As new data becomes available, the prediction and classification results of ML algorithms can become even more fine-grained.
MLOps is a process that combines the best practices of machine learning model development, software development, and operations to enable data scientists and IT teams to work together and increase the efficiency of the ML workflow.
The word MLOps is a combination of machine learning (ML) and the software development practice DevOps. According to Gartner, MLOps is a subset of the more general term ModelOps. Like DevOps, MLOps increases the speed of model development, enhances the workflow with continuous integration and deployment methods, and installs proper validation mechanisms alongside monitoring and overall management of the workflow.
MLOps started as a set of best practices and is rapidly evolving into an independent approach to managing the ML application lifecycle, from model generation to CI/CD, deployment, diagnostics, governance, orchestration, and business metrics.
MLOps as the solution to industry challenges
Machine Learning Operations focuses on the delivery of high-quality and robust ML models that can be deployed in production. As a result, the process eliminates the so-called “deployment gap” in industries which are starting to use machine learning to meet business goals. In addition, MLOps tends to shorten the time to market and enables efficient team communication.
As the need for data-driven business decisions keeps increasing, many companies are starting to experiment with machine learning. However, to fully benefit from developed ML models, they must be deployed into the existing software system. Yet, as suggested by the Algorithmia report, most companies experimenting with ML and AI still haven’t found a way to meet their business goals. The main reason for this lies in the difficulty of bridging the gap between experimentation and the real-world deployment of ML models.
MLOps addresses the ML deployment gap by offering tools for easier management of models in production. As a core component of MLOps, DevOps allows software companies to move from a monthly or quarterly release cycle to daily or weekly cycles. Building CI/CD pipelines for machine learning as part of an MLOps process is more challenging than with traditional software. Still, the automation of data collection, model training, and model evaluation allows data scientists and ML engineers to focus their efforts on improving deployed ML models without having to worry about underlying deployment processes.
Time to market
Developing ML models quickly and effectively has been enabled by the large ecosystem of available model development tools such as PyTorch and TensorFlow. While these tools are valuable, they only allow the rapid development of ad-hoc, single-versioned ML workflows.
However, ML production is much more than ad-hoc, single-versioned ML workflows. It is about continuously evolving machine learning capabilities. This is the area where Machine Learning Operations delivers most of its value.
Machine learning models usually use high-dimensional data, and data rarely remains static over long periods. MLOps offers tools to perform scheduled or on-demand retraining of the model to maintain accuracy and robustness. Furthermore, models may need to be retrained on a per-customer basis depending on customer-specific data. In these cases, MLOps techniques allow mass customization as part of the automation workflow. Finally, a more extreme yet very likely scenario is to have one model architecture that runs in production using several versions of multiple datasets that need to be retrained seasonally. MLOps versioning of data and model parameters is the key to achieving this.
Necessary retraining efforts can grow rapidly with every new customer. In such cases, ad-hoc and single-versioned ML workflows will deliver poor results and management chaos. Scalable business models need different solutions. MLOps offers tools and approaches to deliver those solutions. Time to market for ML models is usually measured in terms of the speed of adapting to changing data and market demands, rather than the speed of delivery of a single working ML model.
Efficient team communication
One of the essential factors for accelerated software development is effective team communication. Merging the development and operations teams under one roof has enabled teams to use the same tools and automate processes that are traditionally slow and manual.
Machine Learning Operations allows data scientists, ML engineers, and software developers to work alongside each other and communicate effectively. MLOps prioritizes the development of pipelines over developing and deploying models in isolation. To achieve pipeline-based development and deployment, teams are often required to manage all components, and every change to them, as code. Such an approach may require additional work compared to the development of models in isolation. However, it allows scalability. The compartmentalization of work in MLOps resembles a microservices approach to the development of large projects; a data scientist is not required to develop the entire model by herself. Instead, the team works on separate stages of the ML pipeline (preprocessing, training, testing). In this way, teams can develop and maintain more complex models and benefit in the long run.
How MLOps differs from DevOps
As a practice of developing and operating modern software systems, DevOps shortens development cycles and increases deployment speed. MLOps extends continuous integration (CI) and continuous delivery (CD) concepts to machine learning systems. Despite apparent similarities between machine learning ops and DevOps, there are some differences.
Team composition: MLOps teams include data scientists and ML engineers who focus on developing stages inside ML pipelines, such as feature engineering, exploratory analysis, and model experimentation. Unlike DevOps, MLOps team members may not have the software knowledge and experience to build reliable production-grade software systems.
Development process: ML models are developed in an iterative and experiment-heavy fashion. Parameter tuning and feature engineering are an essential part of development. While only code and environments are versioned in DevOps, MLOps requires the versioning of data itself and the sets of tuning parameters. Any change in data, algorithms, modeling techniques or parameter configuration must trigger processes that deploy and monitor performance. DevOps usually deals with deterministic systems, while MLOps is oriented towards probabilistic methods. Hence, the challenge of reproducibility and reusability of software is amplified in MLOps.
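The idea of versioning data and tuning parameters together can be sketched in a few lines of Python. This is only an illustration, not the API of any particular MLOps tool: the `fingerprint` function, the sample rows, and the parameter names are all hypothetical. It derives one short identifier from the training data and the parameter set, so that any change to either produces a new version.

```python
import hashlib
import json


def fingerprint(data_rows, params):
    """Return a short, deterministic ID for a (data, parameters) pair."""
    digest = hashlib.sha256()
    # Hash every row in order so that any change to the data changes the ID.
    for row in data_rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    # sort_keys makes the hash independent of dictionary insertion order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


# Hypothetical sample data and hyperparameters.
rows = [{"amount": 12.5, "label": 0}, {"amount": 990.0, "label": 1}]
run_a = fingerprint(rows, {"learning_rate": 0.1, "max_depth": 3})
run_b = fingerprint(rows, {"max_depth": 3, "learning_rate": 0.1})   # same settings, different order
run_c = fingerprint(rows, {"learning_rate": 0.05, "max_depth": 3})  # one parameter changed
```

Here `run_a` and `run_b` are identical, while `run_c` differs. Such an identifier could be stored next to every trained model so that a result can later be traced back to the exact data and parameters that produced it.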
Testing: Testing in Machine Learning Operations is a much more demanding process than it is in DevOps. This is mainly because in ML, raw data needs to be validated to ensure that data is clean and does not contain anomalies that may result in poor model performance. In addition, the resulting “clean” data needs to be tested against statistical distribution characteristics. Finally, ML algorithms are tested, and performance metrics are tracked to ensure that the model fits the business problem and meets fairness or ethical requirements.
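As a rough illustration of the first two testing steps, the sketch below checks one numeric column for missing values and then compares the batch mean with a reference mean. The function name `validate_batch`, the reference statistics, and the threshold `max_z` are assumptions made for this example; real pipelines typically run many more checks per column.

```python
import math
import statistics


def validate_batch(values, ref_mean, ref_std, max_z=3.0):
    """Return a list of human-readable problems found in a numeric column.

    ref_mean and ref_std describe the data the model was trained on;
    ref_std is assumed to be greater than zero.
    """
    problems = []
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if len(clean) < len(values):
        problems.append(f"{len(values) - len(clean)} missing value(s)")
    if not clean:
        return problems + ["no usable values"]
    # Compare the batch mean with the reference mean, in units of the standard error.
    std_error = ref_std / math.sqrt(len(clean))
    z = abs(statistics.fmean(clean) - ref_mean) / std_error
    if z > max_z:
        problems.append(f"mean shifted (z={z:.1f})")
    return problems
```

An empty list means the batch passed both checks. A pipeline could refuse to train on, or score, any batch for which the list is not empty.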
Deployment and production: MLOps deploys ML systems as multi-stage pipelines requiring automatic model retraining and re-deployment mechanisms. Models in production need to be monitored to identify various phenomena such as model and data shifts. These phenomena do not exist in DevOps.
Overview of MLOps phases
Machine learning projects are developed as a part of the ML pipeline and tend to be experimental in nature. MLOps helps engineers and scientists implement the stages of the ML pipeline as either manual or fully automated processes. Machine learning pipelines consist of data preparation, model development, model validation, model deployment, and model and data monitoring phases.
The starting point of the ML workflow is the data. Data comes in from various sources in a variety of formats. Therefore, inconsistencies in data need to be removed first. Next, data needs to be labeled and put in a format ML models can consume. Machine Learning Operations enables continuous data quality improvement at the data preparation stage and offers automation mechanisms so that new and better data is used in model development. In addition, MLOps enables the versioning of source data and metadata (data attributes).
Model development usually consists of several sub-steps such as feature engineering, ML algorithm selection, hyperparameter tuning, model fitting, and model evaluation. Machine learning operations enable engineers to track metrics and learn from mistakes during the model development stage in the ML workflow. Since developing models involves writing code, MLOps processes for versioning the source code come in handy. As well as the code, the environment, dependencies and data can also be versioned for reproducibility. With the help of ML pipelines, MLOps helps engineers create “checkpoints” in the process and allows them to re-run only the necessary elements of the pipeline, reducing development time and increasing the overall efficiency of the model development process.
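The "checkpoint" idea can be sketched as a small cache keyed by the step name and its inputs. This is a simplified stand-in for what pipeline tools do, with hypothetical names throughout (`run_step`, `preprocess`, `train`) and an in-memory dictionary in place of a persistent artifact store.

```python
import hashlib
import json

_cache = {}   # in-memory stand-in for a persistent artifact store
calls = []    # records which steps actually ran


def run_step(name, func, inputs):
    """Run func(inputs) unless a checkpoint for the same step and inputs exists."""
    key = hashlib.sha256(
        json.dumps([name, inputs], sort_keys=True).encode("utf-8")
    ).hexdigest()
    if key not in _cache:
        calls.append(name)
        _cache[key] = func(inputs)
    return _cache[key]


def preprocess(raw):
    # Scale every value by the largest one.
    return [x / max(raw) for x in raw]


def train(config):
    # Toy "model": the mean of the features scaled by a hyperparameter.
    return config["scale"] * sum(config["features"]) / len(config["features"])


raw = [2, 4, 8]
features = run_step("preprocess", preprocess, raw)
model_a = run_step("train", train, {"features": features, "scale": 1.0})
# Changing only a hyperparameter re-runs training but reuses the preprocessing checkpoint.
features = run_step("preprocess", preprocess, raw)
model_b = run_step("train", train, {"features": features, "scale": 2.0})
```

After this run, `calls` contains `preprocess` once and `train` twice: the second preprocessing request was served from the checkpoint.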
After data scientists have created models with suitable performance in the development environment, these models need to be deployed in a production environment, accessible by the end customer. Before real-world deployment, developed models usually go through a validation process. The models are validated from a business, technical, and, if necessary, ethical perspective. MLOps assists in validating the created models by offering techniques and tools for automating the validation. Perhaps the best thing is that MLOps accomplishes this in a reproducible and low-cost manner.
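An automated validation step can be pictured as a gate that a candidate model must pass before it replaces the production model. The sketch below is one possible shape for such a gate; the metric names, the thresholds, and the function `promotion_decision` are illustrative assumptions, not a standard.

```python
def promotion_decision(candidate, production, min_gain=0.0,
                       max_latency_ms=100.0, max_group_gap=0.05):
    """Return (approved, reasons) for promoting a candidate model."""
    reasons = []
    # Business check: the candidate must not be worse than what is already deployed.
    if candidate["accuracy"] < production["accuracy"] + min_gain:
        reasons.append("business: accuracy not better than production")
    # Technical check: the candidate must respond within the latency budget.
    if candidate["latency_ms"] > max_latency_ms:
        reasons.append("technical: latency budget exceeded")
    # Ethical check: accuracy should be similar across user groups.
    per_group = candidate["group_accuracy"].values()
    if max(per_group) - min(per_group) > max_group_gap:
        reasons.append("ethical: accuracy gap between groups too large")
    return (not reasons, reasons)


# Hypothetical metrics for illustration only.
production = {"accuracy": 0.91}
good = {"accuracy": 0.93, "latency_ms": 40.0,
        "group_accuracy": {"a": 0.94, "b": 0.92}}
bad = {"accuracy": 0.90, "latency_ms": 250.0,
       "group_accuracy": {"a": 0.95, "b": 0.80}}
```

With these numbers, `good` is approved and `bad` is rejected for all three reasons. Because the gate is plain code, it gives the same verdict every time it runs on the same metrics.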
After the validation phase, the models are put into the machine learning production environment and run in customer applications. The validated models may provide classification or prediction results based on the supplied data. As part of the Machine Learning Operations strategy, the models can be deployed as microservices with well-defined application programming interfaces (APIs), or inside embedded elements such as mobile devices, wearables, or self-driving car electronic control units (ECUs).
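To make the "well-defined API" more concrete, here is a minimal sketch of the request-handling core of a prediction microservice. The web server itself is left out. Everything in the example is hypothetical: the feature names, the toy weights (not a trained model), the version label, and the response format.

```python
import json

MODEL_VERSION = "demo-1"                      # hypothetical version label
WEIGHTS = {"amount": 0.002, "night": 0.5}     # toy coefficients, not a trained model
BIAS = -1.0


def predict(features):
    """Toy linear score standing in for a real trained model."""
    return BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def handle_request(body):
    """Map a JSON request body to an (HTTP status, JSON response body) pair."""
    try:
        features = json.loads(body)
        score = predict(features)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing features, or non-numeric values.
        return 400, json.dumps({"error": "expected JSON object with keys: amount, night"})
    return 200, json.dumps({"model_version": MODEL_VERSION,
                            "score": score,
                            "flagged": score > 0})
```

Returning the model version with every prediction is a small design choice that pays off in monitoring, because each result can be attributed to the exact model that produced it.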
ML models continuously evolve and get better as more data becomes available. However, this process is not inherent to the model. As more data becomes available, a model’s performance may actually degrade, and only improve after its parameters are re-tuned according to the newly available data. For this reason, both model performance and the input data characteristics need to be continuously monitored.
This monitoring is done to spot performance issues and phenomena such as “model drift” and “data drift”. Model drift occurs when the business need evolves over time, so that models need to be retrained and redeployed. Data drift is slightly different: the model was trained on a specific data distribution, and that distribution changes due to the nature of the business (customer preferences change, seasons change, new products are added). For monitoring, MLOps offers various tools that track the performance metrics, detect model or data drift automatically, and trigger the retraining of the model to maintain the desired performance.
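One common way to quantify data drift for a single numeric feature is the Population Stability Index (PSI), which compares how a reference sample and a live sample are spread across the same bins. The sketch below is a simplified implementation under stated assumptions: equal-width bins derived from the reference sample, a small floor for empty bins, and a retraining threshold of 0.2, which is an often-quoted rule of thumb rather than a fixed standard.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]        # evenly spread over [0, 1)
same = [i / 100 for i in range(100)]             # identical distribution
shifted = [0.5 + i / 200 for i in range(100)]    # mass moved to the upper half

RETRAIN_THRESHOLD = 0.2  # rule-of-thumb value, treated here as an assumption
needs_retraining = psi(reference, shifted) > RETRAIN_THRESHOLD
```

For identical samples the index is zero. For the shifted sample it is far above the threshold, so `needs_retraining` is true and a monitoring job could start the retraining pipeline.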
MLOps levels of maturity
For some business cases, the ML workflow and building of ML pipelines can mainly be manual without a negative business impact. However, in more dynamic business cases where data and models can frequently change (such as retail and financial services), the ML workflow needs to be fully automated. Thus, the automation of machine learning operations reflects the maturity of the process divided into different levels: level 0 – manual process, level 1 – automated ML pipeline, and level 2 – full CI/CD pipeline automation.
Level 0 – No automation
This level of MLOps is customary for companies that are new to ML workflows. At this level, MLOps is almost entirely driven by data scientists. This level is suitable for models that do not need to change during their lifecycle. The models are built, deployed, and operated manually. The inspection of model and data shifts and retraining of the model when necessary are entirely manual.
- Manual process – the process is primarily manual. The data scientists produce scripts for data preparation, model training, and model validation.
- Distinct development and operations – after the model has been created and validated by the development team, it is deployed into production by the operations team. Thus, data scientists and operations engineers are in separate teams.
- Infrequent releases – since the ML models and data do not change often, retraining and redeployment are done on demand or several times per year.
- No CI – CI is ignored due to infrequent changes in implementation. Instead, code development is done via data scientist notebooks (often written with the help of Jupyter).
- No CD – CD is ignored due to rare releases.
- Scope of deployment – deployment includes only the deployment of the ML model and not the complete ML pipeline.
- Monitoring – monitoring mechanisms are rudimentary and usually include basic logging.
Challenges and remedies
Despite the initial assumption that models and data will rarely change, the frequency of change often proves to be underestimated. Poor performance of models in production often leads to customer dissatisfaction and revenue loss. To remedy this, the model needs to be actively monitored for quality while in production, and the models should be preemptively retrained and redeployed. Additionally, the teams should actively experiment with different model architectures and hyperparameter sets.
Level 1 – Automated ML pipeline
Level 1 MLOps eliminates the challenges associated with the manual processes and enables continuous delivery and machine learning model management. Machine Learning Operations at this level supports ML models in dynamic business environments where several factors trigger model and data changes.
- Continuous experimentation – ML experimentation steps are automated.
- Continuous training – machine learning in production demands ongoing and automatic retraining.
- Unified process – ML pipeline implementation is reflected in production. The true power of MLOps resides in the unification of development and operational environments.
- Modularity – ML pipeline components are modular and can be reused and shared across different projects.
- Continuous delivery – model deployment is automated.
- Pipeline deployment – as well as the models, the entire training ML pipeline is deployed to automatically retrain the models.
- Validation – data and model validation is automatic.
- Feature repository – centralized storage of features used by models for classification or prediction.
- ML pipeline triggers – pipelines are triggered by the availability of new data, scheduled events, model degradation, or on-demand.
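The four trigger types in the last item can be sketched as a single decision function. The field names, the class `PipelineState`, and the default thresholds (30 days, 10,000 new rows, 90% accuracy) are assumptions chosen for illustration and would be tuned per project.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PipelineState:
    last_trained: datetime
    new_rows: int              # rows that arrived since the last training run
    live_accuracy: float       # accuracy measured on recent labelled traffic
    manual_request: bool = False


def retrain_reasons(state, now, max_age=timedelta(days=30),
                    min_new_rows=10_000, min_accuracy=0.90):
    """Return the names of all triggers that currently fire."""
    reasons = []
    if state.manual_request:
        reasons.append("on-demand")
    if now - state.last_trained >= max_age:
        reasons.append("schedule")
    if state.new_rows >= min_new_rows:
        reasons.append("new-data")
    if state.live_accuracy < min_accuracy:
        reasons.append("degradation")
    return reasons
```

A scheduler could call `retrain_reasons` periodically and start the training pipeline whenever the returned list is not empty, logging the reasons for later analysis.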
Challenges and remedies
Level 1 offers MLOps mechanisms for deploying models based on new data, but it is unsuitable for rapidly trying out new machine-learning-based business ideas. When quick testing of new ideas is necessary, deployment in production needs to be implemented using CI/CD tools that allow automation.
Level 2 – Automated CI/CD pipeline
Level 2 includes all of the highlights of level 1 MLOps and eliminates the challenges associated with providing continuous delivery of both models and training pipelines. This is a fully automated level of MLOps based on CI/CD principles. This level of Machine Learning Operations is most suitable for dynamic business environments that experience a lot of change in data characteristics and service offerings.
- Continuous experimentation – ML experimentation steps and experiment deployment are both automated.
- Continuous delivery – both pipelines and models are part of the automated continuous delivery process.
- Automated triggers – pipelines are automatically executed in production based on scheduled, event-based, or on-demand triggers.
- Monitoring – statistics are collected based on the operational model and live data. Triggers are generated based on the statistics collected.
Challenges and remedies
The data analysis and model analysis steps are mainly manual and involve the work of data scientists. However, with further developments in these fields, it may be possible to automate these steps. MLOps delivers the right tools for their automation where possible.
Notable MLOps implementations
MLOps delivers clear benefits for the majority of business cases. Due to its many advantages, MLOps is utilized by numerous companies around the world.
NVIDIA utilizes MLOps to manage the AI lifecycle of its products. NVIDIA divides its ML workflow into components and has created a pipeline-based feedback loop where the operational output optimizes the original ML pipeline. Most of the pipeline components are based on Google’s MLOps Manifesto. The developed architecture at NVIDIA compresses the ML workflow and combines it with the end-user and the application that monitors the deployed machine learning models.
Spotify utilizes machine learning to deliver value to the end-users of its platform. In a blog post, the company describes the evolution of its machine learning infrastructure, which can be compared with MLOps levels 0, 1, and 2.
In recent years, Spotify has successfully transitioned to fully automated machine learning pipelines, model development, and deployment, using the best practices and tools that MLOps offers.
As the leading online-only supermarket, Ocado utilizes machine learning to efficiently handle millions of events generated every minute as their customers navigate the online shop, fill their virtual baskets, check out, and pay. Ocado uses ML to ensure a better shopping experience, secure transactions, and optimize its supply chain. Due to the changing nature of data and its business goals, the company implements MLOps to quickly retrain and redeploy its machine learning models as new data becomes available. The developed models are continuously monitored for distribution shift and retrained on demand.
Revolut, an online-only banking services company, trains machine learning models that use transaction data to detect fraudulent card transactions. According to Revolut’s Dmitri Lihhatsov, the company built Sherlock – a fraud-detection machine learning system – in only nine months. The company operates Sherlock based on core Machine Learning Operations principles. In addition, it has developed automated model deployment mechanisms, and it actively monitors models and retrains them in production.
As the world’s most popular TV show and movie streaming platform, Netflix utilizes machine learning models for almost every feature in its product offering, including a personalized experience. Thanks to MLOps, the company is capable of managing thousands of machine learning models. Each of these models operates with thousands of different datasets in the background. The training, deployment, and overall management of these models would be impossible without automated processes supported by MLOps.
A machine learning model is an element of software that relies on data to perform its duty. As with traditional software, the ML model when deployed in production naturally follows the DevOps process to shorten the system development lifecycle while providing continuous delivery of high-quality software. When the DevOps process is explicitly applied to machine learning, it is called Machine Learning Operations. The main difference between the two is that the underlying quality of data defines the quality of the ML model. Therefore, MLOps aims to understand, monitor, and improve the datasets in order to improve the accuracy and robustness of the ML model.
MLOps, when implemented correctly, has the power to fully automate the ML workflow from data preparation to model deployment and monitoring. However, implementing MLOps in reality requires knowledge and experience in both machine learning and in the software development process. Lack of experience in these domains often prevents companies from achieving success in deploying robust machine learning models.
We at PixelPlex offer our services in designing and operating MLOps systems to companies of any size. Our team of skilled software developers and machine learning experts and our long track record of successful projects in the domain position us as ideal partners for your MLOps needs. Contact us any time and we will be glad to help your business get ahead.