Cedric Renggli, Luka Rimanic, Nezihe Merve Gurel, Bojan Karlas, Wentao Wu, Ce Zhang
ETH Zurich and Microsoft Research
Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train it or to perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through various stages of machine learning development. By jointly analyzing well-known data quality dimensions and the downstream machine learning process, we show that different components of a typical MLOps pipeline can be designed efficiently, providing both a technical and a theoretical perspective.
The term “MLOps” is used when this DevOps process is specifically applied to ML. Unlike traditional software artefacts, the quality of an ML model (e.g., accuracy, fairness, and robustness) is often a reflection of the quality of the underlying data, e.g., noise, imbalance, and adversarial perturbations.
Therefore, one of the most promising ways to improve the accuracy, fairness, and robustness of an ML model is often to improve the dataset, via means such as data cleaning, integration, and label acquisition. As MLOps aims to understand, measure, and improve the quality of ML models, it is not surprising that data quality plays a prominent and central role in MLOps. In fact, many researchers have conducted fascinating and seminal work around MLOps by looking into different aspects of data quality. Substantial effort has been made in the areas of data acquisition with weak supervision (e.g., Snorkel), ML engineering pipelines (e.g., TFX), data cleaning (e.g., ActiveClean), data quality verification (e.g., Deequ), interaction (e.g., Northstar), and fine-grained monitoring and improvement (e.g., Overton), to name a few.
Independent of downstream ML models, researchers have studied different aspects of data quality, which can naturally be split across the following four dimensions (a minimal illustrative check for each dimension is sketched after the list):
(1) accuracy – the extent to which the data are correct, reliable and certified for the task at hand;
(2) completeness – the degree to which the given data collection includes data that describes the corresponding set of real-world objects;
(3) consistency – the extent of violation of semantic rules defined over a set of data; and
(4) timeliness (also referred to as currency or volatility) – the extent to which data are up-to-date for a task.
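To make these dimensions concrete, the following is a minimal sketch (not from the paper) of how each dimension could be checked with simple pandas rules on a hypothetical customer table; the column names, value domains, and thresholds are illustrative assumptions.

```python
# Minimal sketch: illustrating the four data quality dimensions with simple
# pandas checks on a hypothetical customer table (columns and rules assumed).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, None, 47, 230],          # 230 is implausible, None is missing
    "country": ["CH", "US", "CH", "XX", "DE"],
    "updated_at": pd.to_datetime(
        ["2023-01-10", "2023-02-01", "2021-06-30", "2023-02-15", "2023-02-20"]),
})

# (1) Accuracy: share of values inside a plausible domain.
accuracy_age = df["age"].between(0, 120).mean()

# (2) Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# (3) Consistency: violations of a semantic rule (country must be a known code).
known_countries = {"CH", "US", "DE"}
consistency_violations = (~df["country"].isin(known_countries)).sum()

# (4) Timeliness: share of records updated within the last year.
timeliness = (df["updated_at"] > pd.Timestamp("2022-02-20")).mean()

print(accuracy_age, completeness.to_dict(), consistency_violations, timeliness)
```

In practice, such rules would be expressed in a dedicated data validation system (such as Deequ, mentioned above) rather than ad hoc pandas code; the snippet only illustrates what each dimension measures.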
Validation and Test
Standard ML cookbooks suggest splitting the data into three disjoint sets for training, validation, and testing. The validation set accuracy is typically used to choose the best possible set of hyperparameters for the model trained on the training set. The final accuracy and generalization properties are then evaluated on the test set. Following this, we use the term validation for evaluating models in the pre-training phase, and the term testing for evaluating models in the post-training phase.
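As an illustration, here is a minimal sketch, assuming scikit-learn, of the three-way split and its two roles: validation accuracy drives hyperparameter selection, and test accuracy gives the final estimate. The dataset, model, and hyperparameter grid are arbitrary illustrative choices, not taken from the article.

```python
# Minimal sketch: disjoint train / validation / test split; pick hyperparameters
# on the validation set, report the final accuracy on the test set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_val_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:                      # candidate hyperparameters
    model = LogisticRegression(C=C, max_iter=2000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)               # validation: model selection
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

test_acc = best_model.score(X_test, y_test)           # testing: final evaluation
print(best_val_acc, test_acc)
```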
Bayes Error Rate
Given a probability distribution P(X, Y), the lowest possible error rate achievable by any classifier is known in the literature as the Bayes Error Rate (BER). For a more detailed treatment, refer to the paper.
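Written out, the standard definition of the BER, in the notation used above, is:

```latex
% Standard definition of the Bayes error rate for classification.
\[
  \mathrm{BER} \;=\; \mathbb{E}_{x \sim P(X)}
  \Bigl[\, 1 - \max_{y \in \mathcal{Y}} P(Y = y \mid X = x) \,\Bigr]
\]
```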
Concept Shift
The general idea of ML described so far assumes that the probability distribution P(X, Y) remains fixed over time, which is sometimes not the case in practice. Any change of this distribution over time is known as a concept shift. For a more detailed treatment, refer to the paper.
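One common way to formalize this (a general statement, not necessarily the article's exact definition) is that the joint distribution differs between two points in time:

```latex
% A common formalization of concept shift: the joint distribution of features
% and labels changes between two time points t_1 and t_2.
\[
  P_{t_1}(X, Y) \;\neq\; P_{t_2}(X, Y)
  \quad \text{for some } t_1 \neq t_2 .
\]
```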
MLOps Challenge
One principled way to model the feasibility study problem for ML is to ask: given an ML task, defined by its training and validation sets, how can we estimate the error that the best possible ML model can achieve, without running expensive ML training?
The answer to this question is linked to a traditional ML problem: estimating the Bayes error rate (also called the irreducible error).
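As a rough illustration of how such an estimate can be obtained without training the downstream model, the following sketch uses the classical Cover-Hart relation between the 1-nearest-neighbor error and the BER; this is a generic nearest-neighbor-based approach, not necessarily the exact estimator used in the paper, and the dataset is an arbitrary placeholder.

```python
# Minimal sketch: bound the Bayes error rate from the 1-NN error using the
# classical Cover-Hart bounds, without training any downstream model.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
c = len(np.unique(y))                                  # number of classes

# Estimate the 1-NN error with cross-validation (a proxy for its asymptotic error).
nn_err = 1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()

# Cover-Hart: R_NN <= R* (2 - c/(c-1) R*), which yields the bounds below on
# the Bayes error rate R*.
lower = (c - 1) / c * (1 - np.sqrt(max(0.0, 1 - c / (c - 1) * nn_err)))
upper = nn_err
print(f"1-NN error: {nn_err:.4f}, BER bounds: [{lower:.4f}, {upper:.4f}]")
```

The estimated interval tells us, before any expensive training, roughly what error even the best possible model would still incur on this task.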
Non-Zero Bayes Error and Data Quality Issues
At first glance, even understanding why the BER is not zero for every task can be quite mysterious: if we have a sufficient amount of data and a powerful ML model, what would stop us from achieving perfect accuracy?
The answer to this is deeply connected to data quality. There are two classical data quality dimensions that constitute the reasons for a non-zero BER (illustrated by the toy example after the list):
(1) completeness of the data, violated by an insufficient definition of either the feature space or the label space, and
(2) accuracy of the data, mirrored in the number of noisy labels.
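A constructed toy example (not from the paper) makes the second point concrete: if labels are noisy, even the Bayes-optimal classifier must err.

```latex
% Toy example: suppose P(Y = 1 | X = x) = 0.8 for every x, i.e., 20% of the
% labels disagree with the majority label. The Bayes-optimal rule predicts
% y = 1 everywhere, yet its error is
\[
  \mathrm{BER}
  \;=\; \mathbb{E}_{x}\bigl[\, 1 - \max_{y} P(Y = y \mid X = x) \,\bigr]
  \;=\; 1 - 0.8
  \;=\; 0.2 .
\]
```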