
THE EARLY PHASE OF NEURAL NETWORK TRAINING

- By Jonathan Frankle (MIT CSAIL), David J. Schwab (CUNY ITS), and Ari S. Morcos (Facebook AI Research)

Many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example:
  1. Sparse, trainable sub-networks emerge
  2. Gradient descent moves into a small subspace
  3. The network undergoes a critical period


In this paper, the researchers examine the changes that deep neural networks undergo during this early phase of training.


Over the past decade, methods for successfully training large, deep neural networks have revolutionized machine learning. Yet despite their remarkable empirical performance, the underlying reasons for the success of these approaches remain poorly understood. A large body of work has focused on understanding what happens during the later stages of training, while the initial phase has been less explored.

The research builds on the Lottery Ticket Hypothesis and its core procedure, iterative magnitude pruning (IMP) with rewinding.

What is the Lottery Ticket Hypothesis?

A randomly initialized, dense neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.


In practice, such pruning:
  1. Can reduce the parameter count by more than 90% without harming accuracy
  2. Decreases storage size, energy consumption, and inference time
  3. Is typically performed one-shot, i.e., pruning the network once after training (see the sketch below)
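
For concreteness, here is a minimal sketch of one-shot global magnitude pruning using PyTorch's built-in pruning utilities. The two-layer model and the 90% sparsity level are illustrative choices, not taken from the paper.

```python
# Minimal sketch of one-shot global magnitude pruning in PyTorch.
# The model and the 90% sparsity level are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Prune the 90% of weights with the smallest magnitudes globally
# (pooled across layers), in a single shot.
parameters_to_prune = [
    (m, "weight") for m in model.modules() if isinstance(m, nn.Linear)
]
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.9,
)

# Each pruned module now carries a binary "weight_mask" buffer, so
# zeroed weights stay zero if training continues.
total = sum(m.weight.nelement() for m, _ in parameters_to_prune)
zeros = sum(int((m.weight == 0).sum()) for m, _ in parameters_to_prune)
print(f"global sparsity: {zeros / total:.1%}")
```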

The paper uses the rewinding variant of the lottery ticket procedure:

Iterative magnitude pruning with rewinding:
To test various hypotheses about the state of sparse networks early in training, the authors use the iterative magnitude pruning with rewinding (IMP) procedure to extract sub-networks from various points in training that could have learned on their own.

The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network, and rewinding the remaining weights to their values from an earlier iteration k during the initial, pre-pruning training run. This process is iterated to produce networks with high sparsity levels. As the authors demonstrate, IMP with rewinding leads to sparse sub-networks that can train to high performance even at sparsity levels above 90%. A sketch of the loop appears below.
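
The following is a hedged sketch of that loop in PyTorch-style Python. The helper `train_fn` is a hypothetical placeholder, assumed to apply the mask after every optimizer step; only the 20% per-round pruning rate and the rewind-to-iteration-k step come from the procedure described above.

```python
# Sketch of iterative magnitude pruning (IMP) with rewinding.
# `train_fn` is a hypothetical placeholder that trains the masked
# network (and re-applies the mask after each optimizer step).
import copy
import torch

def imp_with_rewinding(model, train_fn, k, rounds=10, prune_frac=0.2):
    # One binary mask per weight tensor; 1 = keep, 0 = pruned.
    mask = {n: torch.ones_like(p)
            for n, p in model.named_parameters() if "weight" in n}
    train_fn(model, mask, iterations=k)          # train to iteration k
    theta_k = copy.deepcopy(model.state_dict())  # rewinding checkpoint

    for _ in range(rounds):
        train_fn(model, mask, iterations=None)   # 1. train to completion
        # 2. Prune the 20% of surviving weights with the lowest
        #    magnitudes, pooled globally across layers.
        scores = torch.cat([(p.abs() * mask[n]).flatten()
                            for n, p in model.named_parameters() if n in mask])
        surviving = scores[scores > 0]
        thresh = surviving.kthvalue(
            max(1, int(prune_frac * surviving.numel()))).values
        for n, p in model.named_parameters():
            if n in mask:
                mask[n] = ((p.abs() > thresh) & (mask[n] > 0)).float()
        # 3. Rewind surviving weights to their values from iteration k.
        model.load_state_dict(theta_k)
    return model, mask
```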

Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity when performing this procedure for several rewinding values of k. For k ≥ 500, sub-networks can match the performance of the original network with only 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).

The paper also builds on a second line of work:

Critical periods in deep learning: prior work found that perturbing the training process, for example by providing corrupted data early in training, can cause irrevocable damage to the final performance of the network. Architecture, learning rate schedule, and regularization (in particular weight decay and data augmentation) all modify the timing of the critical period. A sketch of this kind of early-deficit experiment appears below.
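
To make this concrete, here is an illustrative sketch of the kind of early-deficit experiment such studies run. The Gaussian-blur corruption and the epoch counts are assumptions for illustration, not the exact protocol of the cited work.

```python
# Illustrative critical-period experiment: corrupt the inputs for the
# first `deficit_epochs`, then switch to clean data. The blur deficit
# and epoch counts are assumed values, not the cited paper's protocol.
import torchvision.transforms.functional as TF

def train_with_early_deficit(model, loader, optimizer, loss_fn,
                             total_epochs=100, deficit_epochs=20):
    for epoch in range(total_epochs):
        for x, y in loader:
            if epoch < deficit_epochs:
                x = TF.gaussian_blur(x, kernel_size=9)  # early deficit
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
```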


After the initial iterations, gradient magnitudes drop, and the rate of change in each of the measured quantities gradually slows through the remainder of the period observed. Interestingly, gradient magnitudes reach a minimum after the first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to the final 91.5%. By 2,000 iterations, accuracy approaches 80%.
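
A minimal sketch of how such a measurement might be set up follows; all names are illustrative, and the global gradient magnitude is taken here as the L2 norm over all parameter gradients treated as one flat vector.

```python
# Sketch: log the global gradient norm over the early iterations of
# training. All helper names are illustrative assumptions.
import torch

def global_grad_norm(model):
    # L2 norm over all parameter gradients, as one flat vector.
    return torch.norm(torch.stack([
        p.grad.norm() for p in model.parameters() if p.grad is not None
    ]))

def log_early_gradients(model, loader, optimizer, loss_fn, iterations=2000):
    norms = []
    for it, (x, y) in enumerate(loader):
        if it >= iterations:
            break
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        norms.append(float(global_grad_norm(model)))
        optimizer.step()
    return norms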


Data Dependency:

  • RANDOM LABELS 
To evaluate whether this phase of training depends on the underlying structure in the data, the authors drew inspiration from work on fitting random labels and pre-trained networks on data with randomized labels. This experiment tests whether the input distribution of the training data alone is sufficient to put the network in a position from which IMP with rewinding can find a sparse, trainable sub-network, despite the presence of incorrect (not just missing) labels. The result suggests that, though it remains possible that labels are not required for learning, the presence of incorrect labels is sufficient to prevent learning which approximates the early phase of training.
  • SELF-SUPERVISED ROTATION PREDICTION
What if we remove labels entirely? The result suggests that the labels for the ultimate task are not themselves necessary to put the network in such a state (although explicitly misleading labels are detrimental). The authors emphasize, however, that the required duration of this pre-training phase is an order of magnitude longer than the original rewinding iteration, suggesting that labels add important information which accelerates the learning process. A sketch of the rotation pretext task appears after this list.
  • SPARSE PRETRAINING
Since sparse sub-networks are often challenging to train from scratch without the proper initialization, does pre-training make it easier for sparse neural networks to learn? The result suggests that while pre-training is sufficient to approximate the early phase of supervised training with an appropriately structured mask, it is not sufficient to do so with an inappropriate mask.
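
To make the rotation-prediction pretext task concrete, here is a minimal sketch in the spirit of RotNet-style self-supervision; the helper names and the 4-way rotation head are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of self-supervised rotation prediction: rotate each image by
# 0/90/180/270 degrees and train the network to predict which rotation
# was applied, discarding the true class labels entirely.
import torch

def rotation_batch(x):
    """Return rotated copies of x (NCHW) and the rotation index as label."""
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(x.size(0))
    return torch.cat(rotations), labels

def pretrain_rotation(model, loader, optimizer, loss_fn, iterations=10000):
    it = 0
    while it < iterations:
        for x, _ in loader:                 # true labels are discarded
            xr, yr = rotation_batch(x)
            optimizer.zero_grad()
            loss = loss_fn(model(xr), yr)   # model ends in a 4-way head
            loss.backward()
            optimizer.step()
            it += 1
            if it >= iterations:
                break
```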

Conclusion

The results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets, as well as the ability to identify sub-networks at initialization. The observation that weights are highly non-independent at the rewinding point suggests that the weights at this point cannot be easily approximated, making approaches which attempt to "jump" directly to the rewinding point unlikely to succeed. However, the result that labels are not necessary to approximate the rewinding point suggests that learning during this phase does not require task-specific information, and that rewinding may not be necessary if networks are pre-trained appropriately.


