By Jonathan Frankle (MIT CSAIL), David J. Schwab (CUNY ITS), and Ari S. Morcos (Facebook AI Research)
Many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example:
- Sparse, trainable sub-networks emerge
- Gradient descent moves into a small subspace
- The network undergoes a critical period
The researchers examine the changes that deep neural networks undergo during this early phase of training.
Over the past decade, methods for successfully training large, deep neural networks have revolutionized machine learning. Yet despite their remarkable empirical performance, the underlying reasons for the success of these approaches remain poorly understood. A large body of work has focused on understanding what happens during the later stages of training, while the initial phase has been less explored.
The research builds on the lottery ticket hypothesis and its core procedure, iterative magnitude pruning (IMP) with rewinding.
What is the Lottery Ticket Hypothesis?
A randomly initialized, dense neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.
- Can reduce the parameter count by more than 90% without harming accuracy
- Decreases storage size, energy consumption, and inference time
- Sub-networks are typically identified using one-shot pruning (see the sketch below)
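For context, one-shot magnitude pruning removes a chosen fraction of the lowest-magnitude weights in a single pass after training, rather than over several prune-retrain cycles. A minimal sketch, assuming a PyTorch model (the helper name is illustrative, not a library function):

```python
# Sketch of one-shot global magnitude pruning (illustrative helper, not a library function).
import torch


def one_shot_prune(model, sparsity=0.9):
    """Return boolean masks that remove the `sparsity` fraction of smallest-magnitude weights."""
    weights = {name: p.detach() for name, p in model.named_parameters() if p.dim() > 1}
    all_mags = torch.cat([w.abs().flatten() for w in weights.values()])
    threshold = torch.quantile(all_mags, sparsity)
    return {name: w.abs() > threshold for name, w in weights.items()}
```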
The paper uses the rewinding variant of the lottery ticket procedure, namely:
Iterative magnitude pruning (IMP) with rewinding:
In order to test various hypotheses about the state of sparse networks early in training, the authors use the IMP with rewinding procedure to extract sub-networks from various points in training that could have learned on their own.
The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network,
and rewinding the remaining weights to their values from an earlier iteration k during the initial,
pre-pruning training run. This process is iterated to produce networks with high sparsity levels. As
demonstrated in Figure 1, IMP with rewinding leads to sparse sub-networks that can
train to high performance even at high sparsity levels (>90%).
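To make the procedure concrete, here is a minimal sketch of IMP with rewinding in PyTorch. This is illustrative only: the `train` routine, the mask handling, and the helper names are assumptions, not the authors' implementation.

```python
# Minimal sketch of IMP with rewinding (illustrative; not the authors' code).
# Assumes a PyTorch model and a user-supplied `train(model, masks, num_iterations)` routine
# that zeroes masked weights after every update.
import copy
import torch


def imp_with_rewinding(model, train, k, rounds, prune_frac=0.2):
    """Iteratively prune 20% of surviving weights, rewinding survivors to iteration k."""
    # Train for k iterations and store the rewind point.
    train(model, masks=None, num_iterations=k)
    rewind_state = copy.deepcopy(model.state_dict())

    # Start with all weights unpruned (prune weight matrices/convolutions only).
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        # 1. Train the masked network to completion.
        train(model, masks=masks, num_iterations=None)

        # 2. Globally rank surviving weights by magnitude; prune the lowest 20%.
        mags = torch.cat([(p.detach().abs() * masks[n]).flatten()
                          for n, p in model.named_parameters() if n in masks])
        threshold = torch.quantile(mags[mags > 0], prune_frac)
        for n, p in model.named_parameters():
            if n in masks:
                masks[n] = ((p.detach().abs() * masks[n]) > threshold).float()

        # 3. Rewind surviving weights to their values from iteration k.
        model.load_state_dict(rewind_state)

    return model, masks
```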
Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity when the procedure is performed for several rewinding iterations k. For k ≥ 500, sub-networks can match the performance of the original network with only 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).
The paper also draws on prior work on critical periods in deep learning, which found that perturbing the training process by providing corrupted data early in training can cause irrevocable damage to the final performance of the network. Architecture, learning rate schedule, and regularization (in particular weight decay and data augmentation) all modify the timing of this critical period.
After the initial iterations, gradient magnitudes
drop and the rate of change in each of the aforementioned quantities gradually slows through the
remainder of the period observed. Interestingly, gradient magnitudes reach a minimum after the
first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy
improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to
the final 91.5%. By 2000 iterations, accuracy approaches 80%.
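As an illustration of the kind of measurement involved, one might track the global gradient magnitude at each iteration as follows. This is a sketch assuming a standard PyTorch training loop, not the paper's measurement code:

```python
# Sketch: record the global L2 norm of the gradient at every iteration of early training.
import torch


def global_grad_norm(model):
    """L2 norm of all parameter gradients concatenated into one vector."""
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()


def train_and_log(model, loader, optimizer, criterion, num_iterations=2000):
    """Run `num_iterations` SGD steps and return the gradient magnitude at each step."""
    norms, it = [], 0
    while it < num_iterations:
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            norms.append(global_grad_norm(model))  # measured before the parameter update
            optimizer.step()
            it += 1
            if it >= num_iterations:
                break
    return norms
```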
Data dependency: the paper also examines how much this early phase depends on the data and its labels, by pre-training in several alternative ways:
- Random labels
- Self-supervised rotation prediction (sketched below)
- Sparse pretraining
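As one illustration, the self-supervised rotation-prediction pretext task replaces the true labels with the rotation applied to each image, so pre-training never sees task labels. A minimal PyTorch sketch (the function names are illustrative, and the model is assumed to have four output logits, one per rotation):

```python
# Sketch of self-supervised rotation prediction as a pre-training task (illustrative, not the paper's code).
import torch


def rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation index is the pseudo-label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels


def pretrain_step(model, images, optimizer, criterion):
    """One pre-training step that uses only the rotation pseudo-labels."""
    inputs, pseudo_labels = rotation_batch(images)
    optimizer.zero_grad()
    loss = criterion(model(inputs), pseudo_labels)  # model outputs 4 logits (one per rotation)
    loss.backward()
    optimizer.step()
    return loss.item()
```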
Conclusion
The results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets, as well as the ability to identify sub-networks at initialization. The observation that the weights are highly non-independent at the rewinding point suggests that they cannot be easily approximated, making approaches that attempt to “jump” directly to the rewinding point unlikely to succeed. However, the result that labels are not necessary to approximate the rewinding point suggests that learning during this phase does not require task-specific information, and that rewinding may not be necessary if networks are pre-trained appropriately.