Cedric Renggli, Luka Rimanic, Nezihe Merve Gurel, Bojan Karlas, Wentao Wu, Ce Zhang
ETH Zurich and Microsoft Research
Developing machine learning models can be seen as a process similar to the one established for traditional software development. A key difference between the two lies in the strong dependency between the quality of a machine learning model and the quality of the data used to train it or to perform evaluations. In this work, we demonstrate how different aspects of data quality propagate through various stages of machine learning development. By jointly analyzing well-known data quality dimensions and the downstream machine learning process, we show that different components of a typical MLOps pipeline can be designed efficiently, providing both a technical and a theoretical perspective.
The term “MLOps” is used when this DevOps process is specifically applied to ML. Unlike traditional software artefacts, the quality of an ML model (e.g., accuracy, fairness, and robustness) is often a reflection of the quality of the underlying data, e.g., noise, imbalance, and adversarial perturbations.
Therefore, one of the most promising ways to improve the accuracy, fairness, and robustness of an ML model is often to improve the dataset, via means such as data cleaning, integration, and label acquisition. As MLOps aims to understand, measure, and improve the quality of ML models, it is not surprising that data quality plays a prominent and central role in MLOps. In fact, many researchers have conducted fascinating and seminal work around MLOps by looking into different aspects of data quality. Substantial effort has been made in the areas of data acquisition with weak supervision (e.g., Snorkel), ML engineering pipelines (e.g., TFX), data cleaning (e.g., ActiveClean), data quality verification (e.g., Deequ), interaction (e.g., Northstar), and fine-grained monitoring and improvement (e.g., Overton), to name a few.
Independent of downstream ML models, researchers have studied different aspects of data quality, which can naturally be split across the following four dimensions (a minimal illustrative check for each dimension is sketched after the list):
(1) accuracy – the extent to which the data are correct, reliable and certified for the task at hand;
(2) completeness – the degree to which the given data collection includes data that describes the corresponding set of real-world objects;
(3) consistency – the extent of violation of semantic rules defined over a set of data; and
(4) timeliness (also referred to as currency or volatility) – the extent to which data are up-to-date for a task.
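To make these dimensions concrete, the following is a minimal sketch (not from the paper) of how each dimension could be checked with simple pandas rules on a hypothetical customer table; the column names, value domains, and thresholds are illustrative assumptions.

```python
# Minimal sketch: illustrating the four data quality dimensions with simple
# pandas checks on a hypothetical customer table (columns and rules assumed).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, None, 47, 230],          # 230 is implausible, None is missing
    "country": ["CH", "US", "CH", "XX", "DE"],
    "updated_at": pd.to_datetime(
        ["2023-01-10", "2023-02-01", "2021-06-30", "2023-02-15", "2023-02-20"]),
})

# (1) Accuracy: share of values inside a plausible domain.
accuracy_age = df["age"].between(0, 120).mean()

# (2) Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# (3) Consistency: violations of a semantic rule (country must be a known code).
known_countries = {"CH", "US", "DE"}
consistency_violations = (~df["country"].isin(known_countries)).sum()

# (4) Timeliness: share of records updated within the last year.
timeliness = (df["updated_at"] > pd.Timestamp("2022-02-20")).mean()

print(accuracy_age, completeness.to_dict(), consistency_violations, timeliness)
```

In practice, such rules would be expressed in a dedicated data validation system (such as Deequ, mentioned above) rather than ad hoc pandas code; the snippet only illustrates what each dimension measures.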
Validation and Test
Standard ML cookbooks suggest splitting the data into three disjoint sets for training, validation, and testing. The validation set accuracy is typically used to choose the best possible set of hyperparameters for the model trained on the training set. The final accuracy and generalization properties are then evaluated on the test set. Following this, we use the term validation for evaluating models in the pre-training phase, and the term testing for evaluating models in the post-training phase.
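As an illustration, here is a minimal sketch, assuming scikit-learn, of the three-way split and its two roles: validation accuracy drives hyperparameter selection, and test accuracy gives the final estimate. The dataset, model, and hyperparameter grid are arbitrary illustrative choices, not taken from the article.

```python
# Minimal sketch: disjoint train / validation / test split; pick hyperparameters
# on the validation set, report the final accuracy on the test set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_model, best_val_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:                      # candidate hyperparameters
    model = LogisticRegression(C=C, max_iter=2000).fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)               # validation: model selection
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

test_acc = best_model.score(X_test, y_test)           # testing: final evaluation
print(best_val_acc, test_acc)
```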
Bayes Error Rate
Given a probability distribution P(X, Y), the lowest possible error rate achievable by any classifier is known in the literature as the Bayes Error Rate (BER). For a more detailed treatment, refer to the paper.
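Written out, the standard definition of the BER, in the notation used above, is:

```latex
% Standard definition of the Bayes error rate for classification.
\[
  \mathrm{BER} \;=\; \mathbb{E}_{x \sim P(X)}
  \Bigl[\, 1 - \max_{y \in \mathcal{Y}} P(Y = y \mid X = x) \,\Bigr]
\]
```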
Concept Shift
The general idea of ML described so far assumes that the probability distribution P(X, Y) remains fixed over time, which is sometimes not the case in practice. Any change of this distribution over time is known as a concept shift. For a more detailed treatment, refer to the paper.
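One common way to formalize this (a general statement, not necessarily the article's exact definition) is that the joint distribution differs between two points in time:

```latex
% A common formalization of concept shift: the joint distribution of features
% and labels changes between two time points t_1 and t_2.
\[
  P_{t_1}(X, Y) \;\neq\; P_{t_2}(X, Y)
  \quad \text{for some } t_1 \neq t_2 .
\]
```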
MLOps Challenge
One principled way to model the feasibility study problem for ML is to ask: given an ML task, defined by its training and validation sets, how can we estimate the error that the best possible ML model can achieve, without running expensive ML training?
The answer to this question is linked to a traditional ML problem: estimating the Bayes error rate (also called the irreducible error).
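As a rough illustration of how such an estimate can be obtained without training the downstream model, the following sketch uses the classical Cover-Hart relation between the 1-nearest-neighbor error and the BER; this is a generic nearest-neighbor-based approach, not necessarily the exact estimator used in the paper, and the dataset is an arbitrary placeholder.

```python
# Minimal sketch: bound the Bayes error rate from the 1-NN error using the
# classical Cover-Hart bounds, without training any downstream model.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
c = len(np.unique(y))                                  # number of classes

# Estimate the 1-NN error with cross-validation (a proxy for its asymptotic error).
nn_err = 1.0 - cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()

# Cover-Hart: R_NN <= R* (2 - c/(c-1) R*), which yields the bounds below on
# the Bayes error rate R*.
lower = (c - 1) / c * (1 - np.sqrt(max(0.0, 1 - c / (c - 1) * nn_err)))
upper = nn_err
print(f"1-NN error: {nn_err:.4f}, BER bounds: [{lower:.4f}, {upper:.4f}]")
```

The estimated interval tells us, before any expensive training, roughly what error even the best possible model would still incur on this task.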
Non-Zero Bayes Error and Data Quality Issues
At first glance, even understanding why the BER is not zero for every task can be quite mysterious: if we have a sufficient amount of data and a powerful ML model, what would stop us from achieving perfect accuracy?
The answer to this is deeply connected to data quality. There are two classical data quality dimensions that constitute the reasons for a non-zero BER (illustrated by the toy example after the list):
(1) completeness of the data, violated by an insufficient definition of either the feature space or the label space, and
(2) accuracy of the data, mirrored in the number of noisy labels.
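A constructed toy example (not from the paper) makes the second point concrete: if labels are noisy, even the Bayes-optimal classifier must err.

```latex
% Toy example: suppose P(Y = 1 | X = x) = 0.8 for every x, i.e., 20% of the
% labels disagree with the majority label. The Bayes-optimal rule predicts
% y = 1 everywhere, yet its error is
\[
  \mathrm{BER}
  \;=\; \mathbb{E}_{x}\bigl[\, 1 - \max_{y} P(Y = y \mid X = x) \,\bigr]
  \;=\; 1 - 0.8
  \;=\; 0.2 .
\]
```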