
MLOps Driven by Data Quality using ease.ml techniques

Cedric Renggli, Luka Rimanic, Nezihe Merve Gurel, Bojan Karlas, Wentao Wu, Ce Zhang (ETH Zurich and Microsoft Research)

Paper Link

ease.ml reference paper link

Image courtesy 99designs

Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train it or perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through the various stages of machine learning development. By jointly analyzing the impact of well-known data quality dimensions on the downstream machine learning process, we show that different components of a typical MLOps pipeline can be efficiently designed, providing both a technical and theoretical perspective.


Courtesy: Google


The term “MLOps” is used when this DevOps process is specifically applied to ML. Different from traditional software artefacts, the quality of an ML model (e.g., accuracy, fairness, and robustness) is often a reflection of the quality of the underlying data, e.g., noise, imbalances, and additional adversarial perturbations.

Therefore, one of the most promising ways to improve the accuracy, fairness, and robustness of an ML model is often to improve the dataset, via means such as data cleaning, integration, and label acquisition. As MLOps aims to understand, measure, and improve the quality of ML models, it is not surprising to see that data quality plays a prominent and central role in MLOps. In fact, many researchers have conducted fascinating and seminal work around MLOps by looking into different aspects of data quality. Substantial effort has been made in the areas of data acquisition with weak supervision (e.g., Snorkel), ML engineering pipelines (e.g., TFX), data cleaning (e.g., ActiveClean), data quality verification (e.g., Deequ), interaction (e.g., Northstar), and fine-grained monitoring and improvement (e.g., Overton), to name a few.


Independent of downstream ML models, researchers have studied different aspects of data quality that can naturally be split across the following four dimensions: 

(1) accuracy – the extent to which the data are correct, reliable and certified for the task at hand; 

(2) completeness – the degree to which the given data collection includes data that describes the corresponding set of real-world objects; 

(3) consistency – the extent of violation of semantic rules defined over a set of data; and 

(4) timeliness (also referred to as currency or volatility) – the extent to which data are up-to-date for a task.


The outcome is directly proportional to the quality of the data


MLOps challenges are bound to data management challenges: given the aforementioned strong dependency between the quality of ML models and the quality of data, the never-ending pursuit of understanding, measuring, and improving the quality of ML models often hinges on understanding, measuring, and improving the underlying data quality issues.


Key Elements in Measuring the Quality of Data for ML Models

Validation and Test 

Standard ML cookbooks suggest that the data should be represented by three disjoint sets to train, validate, and test. The validation set accuracy is typically used to choose the best possible set of hyperparameters used by the model trained on the training set. The final accuracy and generalization properties are then evaluated on the test set. Following this, we use the term validation for evaluating models in the pre-training phase, and the term testing for evaluating models in the post-training phase. 
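The three-way split described above can be sketched as follows; this is a minimal illustration (the function name, fractions, and seed are choices made here, not part of the original text):

```python
import numpy as np

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Partition data into disjoint train / validation / test sets.

    The validation set is used for hyperparameter selection in the
    pre-training phase; the test set is touched only once, post-training.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```

Because the three index sets are disjoint slices of one permutation, no sample can leak from the test set into training or model selection.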

Bayes Error Rate

Given a probability distribution p(X, Y), the lowest possible error rate achievable by any classifier is known in the literature as the Bayes Error Rate (BER). For more details, refer to the paper.
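For concreteness, the standard definition of the BER (not spelled out in this post, but consistent with the p(X, Y) notation above) averages, over the feature distribution, the probability mass not captured by the most likely class:

```latex
\mathrm{BER} \;=\; \mathbb{E}_{X}\!\left[\, 1 - \max_{y} \, p(Y = y \mid X) \,\right]
```

Whenever the most likely label is not certain given the features (i.e., the max is below 1 on a set of positive probability), the BER is strictly positive, no matter how powerful the classifier.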

Concept Shift 

The general idea of ML described so far assumes that the probability distribution P(X, Y) remains fixed over time, which is sometimes not the case in practice. Any change of the distribution over time is known as a concept shift. For more details, refer to the paper.
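A simple practical symptom of concept shift is a model whose rolling accuracy on fresh labeled data degrades over time. The monitor below is a minimal sketch of that idea (the class name, window size, and threshold are illustrative choices, not from the paper):

```python
from collections import deque

class DriftMonitor:
    """Flag a suspected concept shift when rolling accuracy drops below a threshold."""

    def __init__(self, window=100, min_acc=0.8):
        self.window = deque(maxlen=window)  # sliding window of 0/1 correctness
        self.min_acc = min_acc

    def update(self, prediction, label):
        """Record one (prediction, label) pair; return True if a shift is suspected."""
        self.window.append(prediction == label)
        if len(self.window) == self.window.maxlen:
            acc = sum(self.window) / len(self.window)
            return acc < self.min_acc
        return False  # not enough evidence yet
```

This only detects shifts that hurt accuracy and requires a stream of labels; distribution tests on the features alone are an alternative when labels arrive late.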


MLOps Task 1: Effective ML Quality Optimization

MLOps Challenge

Not all noisy or dirty samples matter equally to the quality of the final ML model. In other words, when “propagating” through the ML training process, noise and uncertainty of different input samples might have vastly different effects. As a result, simply cleaning the input data artifacts either randomly or agnostic to the ML training process might lead to a sub-optimal improvement of the downstream ML model. Since the cleaning task itself is often performed “semi-automatically” by human annotators, with guidance from automatic tools, the goal of a successful cleaning strategy from an MLOps perspective should be to minimize the amount of human effort. This typically leads to a partially cleaned dataset, with the property that cleaning additional training samples would not affect the outcome of the trained model (i.e., the predictions and accuracy on a validation set are maintained).

Solution Approach: Cleaning with CPClean

CPClean directly models the noise propagation: noise and incompleteness introduce multiple possible datasets, called possible worlds in relational database theory, and the impact of this noise on the final ML training is simply the entropy of training multiple ML models, one for each of these possible worlds. Intuitively, the smaller the entropy, the less impactful the input noise is to the downstream ML training process.
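The possible-worlds entropy can be sketched for a single validation point with a KNN classifier, which is also the model class CPClean reasons about efficiently. The following is a simplified illustration of the idea, not the CPClean algorithm itself (which avoids enumerating worlds explicitly); function and variable names are choices made here:

```python
import math
from collections import Counter

def prediction_entropy(possible_worlds, train_labels, val_point, k=3):
    """Entropy of a KNN prediction for one validation point across possible worlds.

    possible_worlds: list of candidate training feature matrices (lists of
    tuples), one per possible repair of the dirty data. Zero entropy means
    the prediction is identical in every world, so further cleaning cannot
    change it.
    """
    votes = []
    for X in possible_worlds:
        # k-NN classification within this possible world
        dists = [sum((a - b) ** 2 for a, b in zip(x, val_point)) for x in X]
        nearest = sorted(range(len(X)), key=dists.__getitem__)[:k]
        labels = [train_labels[i] for i in nearest]
        votes.append(Counter(labels).most_common(1)[0][0])
    counts = Counter(votes)
    n = len(votes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

In practice the number of possible worlds grows combinatorially with the number of dirty cells, which is exactly why CPClean's efficient, cleaning-prioritizing computation over KNN matters.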

MLOps Task 2: Preventing Unrealistic Expectations

MLOps Challenge 

One principled way to model the feasibility study problem for ML is to ask: given an ML task, defined by its training and validation sets, how can we estimate the error that the best possible ML model can achieve, without running expensive ML training?

The answer to this question is linked to a traditional ML problem, i.e., to estimate the Bayes error rate (also called irreducible error).


Non-Zero Bayes Error and Data Quality Issues

At first glance, even understanding why the BER is not zero for every task can be quite mysterious: if we have a large enough amount of data and a powerful ML model, what would stop us from achieving perfect accuracy?

The answer to this is deeply connected to data quality. There are two classical data quality dimensions that constitute the reasons for a non-zero BER: 

(1) completeness of the data, violated by an insufficient definition of either the feature space or label space, and 

(2) accuracy of the data, mirrored in the number of noisy labels.

Solution Approach: ease.ml/snoopy
A novel BER estimation method that 
(1) has no hyperparameters to tune, as it is based on nearest-neighbor estimators, which are non-parametric; 
(2) uses pre-trained embeddings, from public sources such as PyTorch Hub or TensorFlow Hub, to considerably decrease the dimension of the feature space.
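The core of a nearest-neighbor BER estimator can be sketched as follows. It relies on the classical Cover–Hart result that the asymptotic 1-NN error is at most twice the BER, so the measured 1-NN error brackets the BER from above and (roughly halved) from below; the embeddings are assumed to be precomputed. This is an illustrative sketch, not the exact ease.ml/snoopy estimator:

```python
import numpy as np

def one_nn_error(train_X, train_y, val_X, val_y):
    """1-NN error of a validation set against a training set of embeddings.

    By the Cover-Hart bound, the BER is asymptotically at most this error
    and at least roughly half of it. No hyperparameters to tune: 1-NN is
    non-parametric.
    """
    errors = 0
    for x, y in zip(val_X, val_y):
        d = ((train_X - x) ** 2).sum(axis=1)  # squared Euclidean distances
        if train_y[d.argmin()] != y:
            errors += 1
    return errors / len(val_y)
```

Running this once per candidate embedding and keeping the smallest error is the cheap feasibility check: no model training is required.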




MLOps Task 3: Rigorous Model Testing Against Overfitting
MLOps Challenge 
In order to generalize to the unknown underlying probability distribution when training an ML model, one has to be careful not to overfit to the (finite) training dataset. However, much less attention has been devoted to the statistical generalization properties of the test set. Following best ML practices, the ultimate testing phase of a new ML model should either be executed only once per test set or be completely obfuscated from the developer. Handling the test set in one way or the other ensures that no information about the test set is leaked to the developer, hence preventing potential overfitting. Unfortunately, in ML development environments it is often impractical to implement either of these two approaches.


Our Approach: Continuous Integration of ML Models with ease.ml/ci 
As part of the ease.ml pipeline, we designed a CI engine to address both aforementioned challenges. The workflow of the system is summarized in Figure 3. 
The key ingredients of our system lie in 
(a) the syntax and semantics of the test conditions and how to accurately evaluate them, and 
(b) an optimized sample-size estimator that yields a budget of test set re-uses before it needs to be refreshed.
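A baseline version of such a sample-size estimator can be derived from the Hoeffding inequality plus a union bound over the number of test-set reuses. This is a conservative textbook bound, not the optimized estimator used in ease.ml/ci (which achieves smaller sample sizes); the function name and signature are choices made here:

```python
import math

def required_test_size(epsilon, delta, num_evaluations):
    """Smallest n such that each of `num_evaluations` accuracy estimates on a
    test set of size n is within `epsilon` of the true accuracy with
    probability at least 1 - delta (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * num_evaluations / delta) / (2 * epsilon ** 2))
```

The logarithmic dependence on the number of evaluations is the key point: a modestly larger test set buys a whole budget of reuses before the set has to be refreshed.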

MLOps Task 4: Efficient Continuous Quality Testing

MLOps Challenge

While there has been extensive research on automatic domain adaptation, we identify a different challenge when presented with a collection of models, each of which could be a “stale” model or an automatically adapted model produced by some domain adaptation method. This scenario is quite common in many companies: they often train distinct models on different slices of data independently (for instance, one model for each season) and automatically adapt each of these models to new data using different methods.

Solution Approach: ease.ml/ModelPicker 
Model Picker is an online model selection approach to selectively sample instances that are informative for ranking pre-trained models. Specifically, given a set of pre-trained models and a stream of unlabeled data samples that arrive sequentially from a data source, the Model Picker algorithm answers when to query the label of an instance, in order to pick the best model under limited labelling budget.
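The online selection loop can be illustrated with a disagreement-based simplification: query a label only when the candidate models disagree, and rank models by their errors on the queried instances. The actual Model Picker algorithm uses a more refined randomized query strategy with guarantees on labeling cost; the class below is only a sketch with names invented here:

```python
class SimpleModelPicker:
    """Simplified online model selection over a stream of unlabeled samples.

    Labels are queried only on disagreement and only while budget remains;
    this is a sketch of the idea, not the Model Picker algorithm itself.
    """

    def __init__(self, models, budget):
        self.models = models
        self.budget = budget
        self.errors = [0] * len(models)  # errors observed on queried samples

    def observe(self, x, label_oracle):
        preds = [m(x) for m in self.models]
        if len(set(preds)) > 1 and self.budget > 0:  # informative sample
            self.budget -= 1
            y = label_oracle(x)
            for i, p in enumerate(preds):
                self.errors[i] += (p != y)

    def best_model_index(self):
        return min(range(len(self.models)), key=self.errors.__getitem__)
```

Samples on which all models agree are skipped for free, which is where the labeling budget savings come from.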

