
7 Myths in Machine Learning

-By Oscar Chang, Hod Lipson



This paper presents a few common myths in machine learning and tries to debunk each of them.


Myth 1: TensorFlow is a library for working with tensors.
Myth 2: Image datasets are representative of real images found in the wild.
Myth 3: Machine learning researchers do not use the test set for validation.
Myth 4: Neural network training uses all of the input data points.
Myth 5: Batch normalization is required to train very deep residual networks.
Myth 6: Attention is better than convolution.
Myth 7: Saliency maps are a robust way to interpret neural networks.

Myth 1: TensorFlow is a library for working with tensors


In fact, it is a library for working with matrices, and the difference is significant.
In Computing Higher Order Derivatives of Matrix and Tensor Expressions (Laue et al., NeurIPS 2018), the authors demonstrate that their automatic differentiation library, based on actual tensor calculus, produces much more compact expression trees. The reason is that tensor calculus uses index notation, which allows the forward and reverse modes to be handled in the same way.

Matrix calculus, by contrast, hides the indices for notational convenience, and as a result the automatic-differentiation expression trees often become needlessly complex.

Consider the matrix multiplication $C = AB$. We have




$$\dot{C} = \dot{A}\,B + A\,\dot{B}$$

for the forward mode and

$$\bar{A} = \bar{C}\,B^{T}, \qquad \bar{B} = A^{T}\,\bar{C}$$

for the reverse mode. To multiply the matrices correctly, one has to carefully respect the order of the operands and the transposes. For a machine learning practitioner this is merely a notational nuisance, but for the program it means extra computational cost.

A less trivial example: $c = \det(A)$. We have

$$\dot{c} = \operatorname{tr}\!\big(A^{-1}\dot{A}\big)$$

for the forward mode and

$$\bar{A} = \bar{c}\,c\,\big(A^{-1}\big)^{T}$$
 
for the reverse mode. In this case it is clearly impossible to reuse the same expression tree for both modes, since they are built from different operators.
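To make these rules concrete, here is a minimal sketch (not from the paper) that checks the hand-derived reverse-mode adjoints for matrix multiplication and the determinant against PyTorch's autograd; it assumes only that PyTorch is installed.

```python
import torch

torch.manual_seed(0)

# Reverse-mode rule for C = AB:  A_bar = C_bar B^T,  B_bar = A^T C_bar
A = torch.randn(3, 3, requires_grad=True)
B = torch.randn(3, 3, requires_grad=True)
C = A @ B
C_bar = torch.randn(3, 3)          # arbitrary upstream gradient
C.backward(C_bar)
assert torch.allclose(A.grad, C_bar @ B.detach().T)
assert torch.allclose(B.grad, A.detach().T @ C_bar)

# Reverse-mode rule for c = det(A):  A_bar = c_bar * c * inv(A)^T
A2 = torch.randn(3, 3, requires_grad=True)
c = torch.det(A2)
c_bar = torch.randn(())            # arbitrary upstream scalar gradient
c.backward(c_bar)
expected = c_bar * c.detach() * torch.inverse(A2.detach()).T
assert torch.allclose(A2.grad, expected, atol=1e-5)
print("hand-derived adjoints match autograd")
```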

In general, the way TensorFlow and other libraries (for example, Mathematica, Maple, Sage, SymPy, ADOL-C, TAPENADE, Theano, PyTorch, HIPS autograd) implement automatic differentiation leads to different and inefficient expression trees being built for the forward and reverse modes. Tensor calculus avoids these problems because index notation makes multiplication commutative. For the details of how this works, see the paper.

The authors evaluated their method by performing reverse-mode automatic differentiation, also known as backpropagation, on three different tasks and measuring the time it took to compute the Hessians.



In the first task, a quadratic function of the form $x^T A x$ was optimized; in the second, logistic regression; in the third, matrix factorization.
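To illustrate the kind of second-order quantity being benchmarked, here is a hedged sketch (not the authors' benchmark code) that computes the Hessian of the quadratic $x^T A x$ with PyTorch's torch.autograd.functional.hessian and checks it against the analytic answer A + A^T.

```python
import time
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
n = 500
A = torch.randn(n, n)

def quadratic(x):
    return x @ A @ x               # f(x) = x^T A x

x = torch.randn(n)
start = time.time()
H = hessian(quadratic, x)          # (n, n) matrix of second derivatives
print(f"Hessian of x^T A x computed in {time.time() - start:.3f}s")

# Analytically, the Hessian of x^T A x is A + A^T.
assert torch.allclose(H, A + A.T, atol=1e-4)
```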

On the CPU, their method was two orders of magnitude faster than such popular libraries as TensorFlow, Theano, PyTorch, and HIPS autograd.

They observed even greater acceleration on the GPU, by as much as three orders of magnitude.


Implications:

Computing derivatives of second or higher order with current deep learning libraries is more computationally expensive than it needs to be. This includes computing general fourth-order tensors of the Hessian type (for example, in MAML and in second-order Newton optimization). Fortunately, quadratic objectives are rare in deep learning. They are common, however, in "classical" machine learning: SVMs, least squares, LASSO, Gaussian processes, and so on.

Myth 2: Image datasets are representative of real images found in the wild

Many people like to think that neural networks have learned to recognize objects better than people. This is not true. They may beat humans on curated image sets such as ImageNet, but when it comes to recognizing objects in real photos from everyday life, they certainly cannot outperform an ordinary adult. This is because the images sampled in current datasets are not representative of the distribution of all possible images occurring naturally in the real world.

In a rather old paper, Unbiased Look at Dataset Bias (Torralba and Efros, CVPR 2011), the authors investigated the biases present in twelve popular image datasets by asking whether it is possible to train a classifier to identify the dataset a given image was taken from.





The chance of randomly guessing the correct dataset is 1/12 ≈ 8%, while the researchers themselves managed the task with more than 75% accuracy.

They trained an SVM on histogram of oriented gradients (HOG) features and found that the classifier identified the right dataset 39% of the time, far above chance. If the experiment were repeated today with state-of-the-art neural networks, the classifier's accuracy would surely be even higher.
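A minimal sketch of this "name that dataset" game is below, using HOG features and a linear SVM from scikit-image and scikit-learn; the random placeholder arrays stand in for images that would, in the real experiment, be sampled from the twelve datasets.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder stand-ins for images drawn from 12 datasets; a real experiment
# would load actual images resized to a common resolution.
n_per_dataset, n_datasets, size = 50, 12, 64
images = rng.random((n_per_dataset * n_datasets, size, size))
labels = np.repeat(np.arange(n_datasets), n_per_dataset)

def hog_features(imgs):
    # One HOG descriptor per image (parameters are illustrative, not the paper's).
    return np.array([hog(im, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for im in imgs])

X_train, X_test, y_train, y_test = train_test_split(
    hog_features(images), labels, test_size=0.2, random_state=0, stratify=labels)

clf = LinearSVC(C=1.0, max_iter=5000).fit(X_train, y_train)
print("Dataset-identification accuracy:", clf.score(X_test, y_test))
# Chance level with 12 datasets is ~8%; Torralba & Efros report 39% with HOG + SVM.
```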

If image datasets truly represented the images of the real world, we would not be able to tell which dataset a specific image came from.



However, each dataset has peculiarities that distinguish it from the others. For example, ImageNet contains an unusually large number of race cars, which can hardly be considered representative of the "typical" car.


The authors also measured the value of each dataset by how well a classifier trained on one dataset performs on images from the others. By this metric, LabelMe and ImageNet were the least biased, scoring 0.58 with a "basket of currencies" approach. All of the values were below one, which means that training on a different dataset always degrades performance. In an ideal world without dataset bias, some of these numbers should have exceeded 1.

The authors pessimistically concluded:

So what is the value of existing datasets for training algorithms meant to work in the real world? The answer that emerges can be summarized as "better than nothing, but not by much."

Myth 3: Machine learning researchers do not use the test set for validation

In Machine Learning 101, we are taught to split a dataset into training, validation, and test sets. The performance of a model trained on the training set and evaluated on the validation set helps the practitioner tune the model to maximize its real-world performance. The test set should not be touched until tuning is complete, so that it can provide an unbiased estimate of the model's real-world performance. If the practitioner cheats by using the test set during training or validation, the model risks overfitting to that particular dataset.
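A minimal sketch of that protocol, assuming scikit-learn and placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.random((1000, 20)), rng.integers(0, 2, 1000)   # placeholder data

# 60/20/20 split: carve off the test set first, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Tune the model and its hyperparameters against (X_val, y_val);
# touch (X_test, y_test) only once, for the final unbiased estimate.
```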

In the hyper-competitive world of ML research, new algorithms and models are often judged by their performance on the test set. As a result, researchers have no incentive to write or publish papers describing methods that perform poorly on test sets. This effectively means that the ML community as a whole is using the test set for validation.

What are the consequences of this collective cheating?



The authors of Do CIFAR-10 Classifiers Generalize to CIFAR-10? (Recht et al., arXiv 2018) investigated this question by creating a new test set for CIFAR-10, sampling the images from the Tiny Images repository.

They chose CIFAR-10 because it is one of the most widely used datasets in ML, the second most popular at NeurIPS 2017 (after MNIST). The CIFAR-10 creation process is also well documented and transparent, and the large Tiny Images repository contains sufficiently fine-grained labels, so a new test set can be reproduced while minimizing distribution shift.



They found that a wide range of neural network models showed a significant drop in accuracy (4% to 15%) on the new test set. Nevertheless, the relative ranking of the models' performance remained fairly stable.

In general, the better-performing models showed a smaller drop in accuracy than the worse-performing ones. This is encouraging, because it suggests that the loss of generalization caused by this cheating, at least in the case of CIFAR-10, becomes smaller as the community invents better ML methods and models.

Myth 4: Neural network training uses all of the input data points

Data is said to be the new oil: the more of it we have, the better we can train our deep learning models, which are currently sample-inefficient and overparametrized.

In An Empirical Study of Example Forgetting During Deep Neural Network Learning (Toneva et al., ICLR 2019), the authors demonstrate significant redundancy in several common small-image datasets. Strikingly, 30% of the CIFAR-10 data can simply be removed without any significant change in validation accuracy.



Forgetting histograms for (left to right) MNIST, permutedMNIST, and CIFAR-10.

A forgetting event happens when the neural network misclassifies an image at time t+1 after having classified it correctly at time t, where time is measured in SGD updates. To track forgetting, the authors ran their network on a small batch of examples after each SGD update rather than on every example in the dataset. Examples that are never forgotten are called unforgettable examples.
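Below is a simplified, hedged re-implementation of this bookkeeping (not the authors' code), using a toy classifier and random data as stand-ins for a real model and dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(512, 20)                   # toy "images"
y = torch.randint(0, 10, (512,))           # toy labels
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

forgetting_counts = torch.zeros(len(y), dtype=torch.long)
prev_correct = torch.zeros(len(y), dtype=torch.bool)

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        torch.nn.functional.cross_entropy(model(xb), yb).backward()
        optimizer.step()

        # After each SGD update, re-check predictions on the tracked examples.
        with torch.no_grad():
            correct = model(X).argmax(dim=1).eq(y)
        # Forgetting event: correct at step t, wrong at step t+1.
        forgetting_counts += (prev_correct & ~correct).long()
        prev_correct = correct

# Examples that were learned and never forgotten are the "unforgettable" ones.
unforgettable = (forgetting_counts == 0) & prev_correct
print(f"{unforgettable.float().mean().item():.1%} unforgettable examples (toy data)")
```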

They found that 91.7% of MNIST, 75.3% of permutedMNIST, 31.3% of CIFAR-10, and 7.62% of CIFAR-100 consist of unforgettable examples. This makes intuitive sense: as the diversity and complexity of a dataset increase, the network should forget more examples.

Forgettable examples appear to exhibit rarer and stranger features than unforgettable ones. The authors liken them to support vectors in an SVM, since they seem to trace out the decision boundary.

Unforgettable examples, in turn, encode mostly redundant information. If the examples are sorted by how unforgettable they are, the dataset can be compressed by removing the most unforgettable ones.

30% of CIFAR-10's data can be removed without affecting validation accuracy, and removing 35% causes only a slight 0.2% drop in validation accuracy. If 30% of the data were instead selected at random, removing it would cause a significant 1% loss of validation accuracy.
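Reusing forgetting_counts, X, and y from the sketch above, the pruning step might look roughly like this (a hypothetical illustration, not the authors' selection procedure):

```python
# Drop the 30% of examples with the fewest forgetting events, then retrain on the rest.
order = torch.argsort(forgetting_counts)      # least-forgotten examples first
keep = order[int(0.30 * len(order)):]         # discard the most unforgettable 30%
X_pruned, y_pruned = X[keep], y[keep]
```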

Similarly, 8% of the data can be removed from CIFAR-100 without any loss of validation accuracy.

These results show that there is considerable redundancy in neural network training data, much as in SVM training, where the non-support vectors can be removed without affecting the model's decisions.

Implications:

If we could determine which examples are unforgettable before training begins, we could save space by deleting them and save time by not training on them.

Myth 5: Batch normalization is required to train very deep residual networks

For a long time it was believed that "training a deep network to directly optimize only the supervised objective of interest (for example, the log probability of correct classification) by gradient descent, starting from random parameters, does not work very well."

Since then, an arsenal of clever random initialization schemes, activation functions, optimization techniques, and other innovations such as residual connections has made it easier to train deep neural networks with gradient descent.

But the real breakthrough came with the introduction of batch normalization (and other, subsequent normalization techniques), which constrains the size of the activations in every layer of the network to mitigate the vanishing and exploding gradient problems.

In the recent paper Fixup Initialization: Residual Learning Without Normalization (Zhang et al., ICLR 2019), the authors show that it is possible to train a 10,000-layer network with vanilla SGD, without applying any normalization.



The authors compared training residual networks of various depths on CIFAR-10 and found that, while standard initialization methods failed at 100 layers, both Fixup and batch normalization succeeded at 10,000 layers.

They carried out a theoretical analysis showing that "the gradient norm of certain layers is lower-bounded by a quantity that increases indefinitely with the network depth", which is precisely the exploding gradient problem. To prevent this, Fixup scales the weights of the m layers inside each of the L residual branches by a factor that depends on m and L.
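The sketch below illustrates that rescaling idea in PyTorch; the scaling factor L^(-1/(2m-2)) and the zero-initialized last layer follow the Fixup rule described in the paper, but the helper fixup_init and the toy residual branch are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

def fixup_init(branch: nn.Sequential, num_branches: int) -> None:
    """Rescale the weight layers inside one residual branch, Fixup-style."""
    weight_layers = [m for m in branch if isinstance(m, (nn.Linear, nn.Conv2d))]
    m = len(weight_layers)
    scale = num_branches ** (-1.0 / (2 * m - 2)) if m > 1 else 1.0
    for layer in weight_layers[:-1]:
        layer.weight.data.mul_(scale)          # shrink standard init by the depth-dependent factor
    nn.init.zeros_(weight_layers[-1].weight)   # last layer of the branch starts at zero

# Example: a two-layer residual branch in a network with 50 residual branches.
branch = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 16, 3, padding=1))
fixup_init(branch, num_branches=50)
```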


Fixup made it possible to train a 110-layer deep residual network on CIFAR-10 at a high learning rate, with behavior comparable to that of a network of the same architecture trained with batch normalization.

The authors further showed similar test results when using Fixup, without any normalization, on ImageNet and on English-to-German machine translation.

Myth 6: Attention is better than convolution

An idea gaining traction in the ML research community is that attention mechanisms are superior to convolutional neural networks. Vaswani et al. note that "the computational cost of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer."

Even state-of-the-art generative adversarial networks find self-attention superior to standard convolutions for modeling long-range dependencies.

The authors of Pay Less Attention with Lightweight and Dynamic Convolutions (Wu et al., ICLR 2019) question the parameter efficiency and effectiveness of self-attention for modeling long-range dependencies, and propose new variants of convolutions, partly inspired by self-attention, that are more parameter-efficient.







"Lightweight" convolutions are depthwise-separable, softmax-normalized across the temporal dimension, share weights across the channel dimension, and reuse the same weights at every time step (like recurrent networks). Dynamic convolutions are lightweight convolutions that use different weights at each time step.
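The sketch below is a simplified, hedged re-implementation of a lightweight convolution with these properties (a softmax-normalized kernel shared across groups of channels and reused at every time step); it omits dynamic weights and other details of the authors' implementation.

```python
import torch
import torch.nn.functional as F

class LightweightConv1d(torch.nn.Module):
    def __init__(self, channels: int, kernel_size: int, num_heads: int):
        super().__init__()
        assert channels % num_heads == 0
        self.channels, self.kernel_size, self.num_heads = channels, kernel_size, num_heads
        # One kernel per head, reused by all channels in that head and at every time step.
        self.weight = torch.nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x):                       # x: (batch, channels, time)
        w = F.softmax(self.weight, dim=-1)      # normalize over the kernel (time) dimension
        w = w.repeat_interleave(self.channels // self.num_heads, dim=0)  # share within heads
        return F.conv1d(x, w, padding=self.kernel_size // 2, groups=self.channels)

x = torch.randn(2, 8, 16)                       # (batch, channels, time)
print(LightweightConv1d(channels=8, kernel_size=3, num_heads=2)(x).shape)
```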

These tricks make lightweight and dynamic convolutions several orders of magnitude more parameter-efficient than standard non-separable convolutions.



The authors show that these new convolutions match or exceed self-attention networks in machine translation, language modeling, and abstractive summarization, while using the same number of parameters or fewer.

Myth 7: Saliency maps are a robust way to interpret neural networks

Although neural networks are commonly believed to be black boxes, a great many attempts have been made to interpret them. The most popular of these are saliency maps and other similar methods that assign importance scores to features or training examples.

It is tempting to conclude that an image was classified a certain way because of particular parts of the image that are salient to the neural network. There are several ways to compute saliency maps, often making use of the network's activations on a given image and the gradients flowing through the network.
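One of the simplest such methods is the vanilla gradient saliency map: the gradient of the top class score with respect to the input pixels. A minimal sketch, assuming PyTorch and torchvision >= 0.13 (random weights and a random tensor are used here as stand-ins; a real experiment would load pretrained weights and a preprocessed image):

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()                    # load pretrained weights for a real experiment
img = torch.randn(1, 3, 224, 224, requires_grad=True)    # stand-in for a preprocessed input image

score = model(img)[0].max()                               # score of the top-scoring class
score.backward()
saliency = img.grad.abs().max(dim=1).values               # (1, 224, 224): per-pixel importance
```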

In Interpretation of Neural Networks is Fragile (Ghorbani et al., AAAI 2019), the authors show that they can introduce an imperceptible perturbation into an image that nonetheless distorts its saliency map.



The monarch butterfly ends up being classified not because of the pattern on its wings, but because of some unimportant green leaves in the background.



High-dimensional images often lie close to the decision boundaries of deep neural networks, hence their vulnerability to adversarial attacks. While adversarial attacks push images across a decision boundary, adversarial interpretation attacks move them along the boundary's contour, staying within the same decision region.

The basic method the authors developed is a modification of Goodfellow's fast gradient sign method (FGSM), one of the first successful adversarial attack techniques.
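For reference, here is a minimal sketch of FGSM itself, applied to a toy linear classifier on random data; the authors' interpretation attack instead perturbs the image in directions that change the saliency map while keeping the prediction fixed, which this sketch does not implement.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    """One FGSM step: nudge the input in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

# Toy usage with a stand-in linear classifier on random data.
model = torch.nn.Linear(10, 3)
x, y = torch.randn(4, 10), torch.randint(0, 3, (4,))
x_adv = fgsm(model, x, y)
```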
