
THE EARLY PHASE OF NEURAL NETWORK TRAINING

-By Jonathan Frankle, MIT CSAIL; David J. Schwab, CUNY ITS; and Ari S. Morcos, Facebook AI Research







Many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example:
  1. Sparse, trainable sub-networks emerge
  2. Gradient descent moves into a small subspace
  3. The network undergoes a critical period


In this paper, the authors examine the changes that deep neural networks undergo during this early phase of training.


Over the past decade, methods for successfully training large, deep neural networks have revolutionized machine learning. Yet surprisingly, despite remarkable empirical performance, the underlying reasons for the success of these approaches remain poorly understood. A large body of work has focused on understanding what happens during the later stages of training, while the initial phase has been less explored.

The research builds on the lottery ticket hypothesis and its basic experimental framework, iterative magnitude pruning (IMP) with rewinding.

What is the Lottery Ticket Hypothesis?

A randomly initialized, dense neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.


  1. Pruning can reduce the parameter count by more than 90% without harming accuracy
  2. This decreases storage size, energy consumption, and inference time
  3. Sub-networks are typically found using one-shot pruning (a minimal sketch follows this list)
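
Concretely, here is a minimal sketch of what one-shot global magnitude pruning looks like, assuming PyTorch; the layer names and sizes are made-up placeholders, not the paper's networks:

import torch

def global_magnitude_masks(weights: dict, prune_fraction: float) -> dict:
    """One-shot pruning: zero out the `prune_fraction` smallest-magnitude
    weights, with the cutoff computed globally across all layers."""
    all_mags = torch.cat([w.abs().flatten() for w in weights.values()])
    cutoff = all_mags.kthvalue(int(prune_fraction * all_mags.numel())).values
    return {name: (w.abs() > cutoff).float() for name, w in weights.items()}

# Toy example: prune 90% of the weights of two small layers in one shot.
weights = {"fc1": torch.randn(64, 32), "fc2": torch.randn(10, 64)}
masks = global_magnitude_masks(weights, prune_fraction=0.9)
print({name: round(float(m.mean()), 3) for name, m in masks.items()})  # surviving fraction per layer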

The paper uses the rewinding variant of the lottery ticket procedure, described next.

Iterative magnitude pruning with rewinding: 
To test hypotheses about the state of sparse networks early in training, the authors use iterative magnitude pruning with rewinding (IMP) to extract sub-networks from various points in training that could have learned on their own.

The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network, and rewinding the remaining weights to their values from an earlier iteration k during the initial, pre-pruning training run. This process is iterated to produce networks at increasingly high sparsity levels. As prior lottery ticket work demonstrated, IMP with rewinding finds sparse sub-networks that train to high performance even at sparsity levels above 90%. A rough sketch of the loop is shown below.
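
The following is a hedged, runnable sketch of the IMP-with-rewinding loop on a toy fully connected network with synthetic data. The model, data, step counts, and helper names are illustrative assumptions (the paper uses ResNet-20 on CIFAR-10); only the overall structure follows the procedure described above: train to completion, prune 20% of the surviving weights globally by magnitude, rewind the survivors to their values at iteration k, and repeat.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))              # synthetic data
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, masks, steps):
    """Plain SGD on the toy task; pruned weights are held at zero after each step."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(X), y).backward()
        opt.step()
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

def prune_remaining(model, masks, fraction=0.2):
    """Prune `fraction` of the currently surviving weights, globally by magnitude."""
    survivors = torch.cat([
        p.detach().abs()[masks.get(n, torch.ones_like(p)).bool()].flatten()
        for n, p in model.named_parameters() if "weight" in n])
    cutoff = survivors.kthvalue(int(fraction * survivors.numel())).values
    return {n: masks.get(n, torch.ones_like(p)) * (p.detach().abs() > cutoff).float()
            for n, p in model.named_parameters() if "weight" in n}

k, total = 50, 500                            # rewind iteration k and full training length
train(model, {}, k)                           # train for k iterations...
theta_k = copy.deepcopy(model.state_dict())   # ...and store the weights to rewind to
masks = {}
for _ in range(5):                            # each round keeps ~80% of remaining weights
    train(model, masks, total - k)            # train to completion
    masks = prune_remaining(model, masks)     # prune 20% of surviving weights
    model.load_state_dict(theta_k)            # rewind to iteration k...
    with torch.no_grad():                     # ...and re-apply the pruning mask
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
# After the loop, the rewound sparse network would be trained and evaluated.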

Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity when performing this procedure for several rewinding iterations k. For k ≥ 500, sub-networks can match the performance of the original network with only 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).
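
As a sanity check on that number: pruning 20% of the remaining weights per round leaves 0.8^8 ≈ 0.168 of the original weights after eight rounds, which matches the 16.8% sparsity level quoted above.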

The paper also builds on a second line of related work:

Work on critical periods in deep learning found that perturbing the training process, for example by providing corrupted data early in training, can result in irrevocable damage to the final performance of the network. Architecture, learning rate schedule, and regularization, in particular weight decay and data augmentation, all modify the timing of the critical period.
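
As a rough illustration of this kind of perturbation, the sketch below returns corrupted labels only during an initial window of epochs and clean labels afterwards. This is an assumed, simplified setup; the critical-period experiments use their own corruption protocols and schedules.

import torch

def labels_for_epoch(y_clean: torch.Tensor, epoch: int,
                     corrupt_epochs: int, num_classes: int) -> torch.Tensor:
    """Return random labels during the corrupted window, clean labels afterwards."""
    if epoch < corrupt_epochs:
        return torch.randint(0, num_classes, y_clean.shape)
    return y_clean

y = torch.tensor([3, 1, 4, 1, 5])
print(labels_for_epoch(y, epoch=2, corrupt_epochs=10, num_classes=10))   # corrupted phase
print(labels_for_epoch(y, epoch=20, corrupt_epochs=10, num_classes=10))  # clean phase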


After the initial iterations, gradient magnitudes drop and the rate of change in each of the measured quantities gradually slows through the remainder of the period observed. Interestingly, gradient magnitudes reach a minimum after the first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to the final 91.5%. By 2000 iterations, accuracy approaches 80%.
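
For reference, the per-iteration gradient magnitude can be tracked along the lines of the sketch below, assuming a standard PyTorch training loop; model, loader, loss_fn, and optimizer stand in for the reader's own objects and are not from the paper's code.

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm of the gradient taken over all parameters of the model."""
    sq = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
    return sq.sqrt().item()

def train_and_log(model, loader, loss_fn, optimizer, iterations):
    norms, it = [], 0
    while it < iterations:
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            norms.append(global_grad_norm(model))  # record before the update
            optimizer.step()
            it += 1
            if it >= iterations:
                break
    return norms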


Data Dependency:

  • RANDOM LABELS 
To evaluate whether this phase of training depends on underlying structure in the data, the authors, drawing inspiration from prior work on random labels, pre-trained networks on data with randomized labels. The experiment tests whether the input distribution of the training data alone is sufficient to put the network in a position from which IMP with rewinding can find a sparse, trainable sub-network, despite the presence of incorrect (not just missing) labels. The result suggests that, though it is still possible that labels are not required for this phase, the presence of incorrect labels is sufficient to prevent learning that approximates the early phase of training.
  • SELF-SUPERVISED ROTATION PREDICTION
What if we remove labels entirely? Here the authors pre-train with a self-supervised rotation-prediction task (see the sketch after this list). The result suggests that the labels for the ultimate task are themselves not necessary to put the network into such a state (although explicitly misleading labels are detrimental). The authors emphasize, however, that the required pre-training duration is an order of magnitude longer than the original rewinding iteration, suggesting that labels add important information which accelerates the learning process.
  • SPARSE PRETRAINING
Since sparse sub-networks are often challenging to train from scratch without the proper initialization, does pre-training make it easier for sparse neural networks to learn? The result suggests that while pre-training is sufficient to approximate the early phase of supervised training with an appropriately structured mask, it is not sufficient to do so with an inappropriate mask.
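
The sketch referenced in the rotation-prediction item above: each image is rotated by 0, 90, 180, or 270 degrees and the network is trained to predict which rotation was applied. This is an illustrative assumption about the pretext task setup, not the authors' code; shapes and batch sizes are made up.

import torch

def make_rotation_batch(images: torch.Tensor):
    """Return (rotated_images, rotation_labels) with labels in {0, 1, 2, 3}."""
    rotated, labels = [], []
    for k in range(4):  # k quarter-turns
        rotated.append(torch.rot90(images, k, dims=[2, 3]))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

x = torch.randn(8, 3, 32, 32)                # e.g. a CIFAR-10-sized batch
x_rot, y_rot = make_rotation_batch(x)
print(x_rot.shape, y_rot.shape)              # (32, 3, 32, 32) and (32,)
# A network with a 4-way output head is then trained on (x_rot, y_rot)
# before the usual supervised training / IMP procedure.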

Conclusion

These results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets, as well as the ability to identify sub-networks at initialization. The observation that weights are highly non-independent at the rewinding point suggests that the weights at this point cannot be easily approximated, making approaches which attempt to “jump” directly to the rewinding point unlikely to succeed. However, the result that labels are not necessary to approximate the rewinding point suggests that learning during this phase does not require task-specific information, and that rewinding may not be necessary if networks are pre-trained appropriately.



