
THE EARLY PHASE OF NEURAL NETWORK TRAINING

By Jonathan Frankle (MIT CSAIL), David J. Schwab (CUNY ITS), and Ari S. Morcos (Facebook AI Research)







Many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example:
  1. Sparse, trainable sub-networks emerge
  2. Gradient descent moves into a small subspace
  3. The network undergoes a critical period


The authors examine the changes that deep neural networks undergo during this early phase of training.


Over the past decade, methods for successfully training large, deep neural networks have revolutionized machine learning. Yet surprisingly, the underlying reasons for the success of these approaches remain poorly understood, despite their remarkable empirical performance. A large body of work has focused on understanding what happens during the later stages of training, while the initial phase has been less explored.

The research builds on a basic framework, Iterative Magnitude Pruning (IMP) with rewinding, which derives from the Lottery Ticket Hypothesis.

What is the Lottery Ticket Hypothesis?

A randomly initialized, dense neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.


  1. Pruning can reduce the parameter count by more than 90% without harming accuracy
  2. It decreases storage size, energy consumption, and inference time
  3. It is typically performed one-shot, after training

The paper uses the rewinding variant of the lottery ticket procedure:

Iterative magnitude pruning with rewinding:
To test various hypotheses about the state of sparse networks early in training, the authors use the IMP with rewinding procedure to extract sub-networks from various points in training that could have learned on their own.

The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network, and rewinding the remaining weights to their values from an earlier iteration k during the initial, pre-pruning training run. This process is iterated to produce networks at high sparsity levels. As demonstrated in the paper, IMP with rewinding leads to sparse sub-networks that can train to high performance even at sparsity levels above 90%.
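The two core steps of one IMP iteration (global magnitude pruning, then rewinding the survivors to their iteration-k values) can be sketched in NumPy. This is an illustrative simplification, not the authors' released code; the helper names and the flat per-layer arrays are assumptions.

```python
import numpy as np

def global_magnitude_prune(trained_weights, mask, fraction=0.2):
    """One IMP pruning step: remove the lowest-magnitude surviving weights
    globally across all layers (hypothetical helper, illustrative only).

    trained_weights: list of per-layer arrays after training to completion.
    mask: list of 0/1 arrays of the same shapes (1 = weight survives).
    """
    # Collect magnitudes of currently surviving weights across every layer.
    surviving = np.concatenate(
        [np.abs(w[m == 1]) for w, m in zip(trained_weights, mask)])
    k = int(fraction * surviving.size)
    # The k-th smallest surviving magnitude becomes the pruning threshold.
    threshold = np.sort(surviving)[k]
    return [m * (np.abs(w) >= threshold)
            for w, m in zip(trained_weights, mask)]

def rewind(snapshot_at_iteration_k, mask):
    """Rewind: surviving weights take their values from iteration k of the
    original run; pruned weights are fixed to zero."""
    return [w_k * m for w_k, m in zip(snapshot_at_iteration_k, mask)]
```

Iterating prune-then-rewind-then-retrain roughly five to ten times yields the >90% sparsity levels discussed above, since each round removes 20% of the remaining weights.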

Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity for several rewinding values of k. For k ≥ 500, sub-networks can match the performance of the original network with only 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).

The paper also considers one more line of work:

Critical periods in deep learning: prior work found that perturbing the training process by providing corrupted data early in training can cause irrevocable damage to the final performance of the network. Architecture, learning rate schedule, and regularization (in particular weight decay and data augmentation) all modify the timing of this critical period.


After the initial iterations, gradient magnitudes drop, and the rate of change in each of the aforementioned quantities gradually slows through the remainder of the period observed. Interestingly, gradient magnitudes reach a minimum after the first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to the final 91.5%. By 2,000 iterations, accuracy approaches 80%.
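The quantity tracked above is the global gradient norm at each iteration. As a minimal sketch of what "gradient magnitude" means here, the snippet below computes it analytically for a toy linear regression (the model and helper name are assumptions, not from the paper):

```python
import numpy as np

def grad_norm_linear(W, X, y):
    """Global gradient norm for a toy linear model (illustrative only).

    loss = 0.5 * mean((X @ W - y)^2); returns ||dL/dW||_2, the scalar one
    would log per iteration to observe the early dip and later stabilization.
    """
    residual = X @ W - y
    grad = X.T @ residual / X.shape[0]  # gradient of the mean-squared loss
    return float(np.linalg.norm(grad))
```

In a real training loop the same measurement is taken by concatenating every layer's gradient into one vector and taking its L2 norm after each update.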


Data Dependency:

  • RANDOM LABELS 
To evaluate whether this phase of training depends on underlying structure in the data, the authors drew inspiration from prior work and pre-trained networks on data with randomized labels. The experiment tests whether the input distribution of the training data alone is sufficient to put the network in a position from which IMP with rewinding can find a sparse, trainable sub-network despite the presence of incorrect (not just missing) labels. The result suggests that, though labels may not be required for learning, the presence of incorrect labels is sufficient to prevent learning that approximates the early phase of training.
  • SELF-SUPERVISED ROTATION PREDICTION
What if we remove labels entirely? The result suggests that the labels for the ultimate task are not themselves necessary to put the network in such a state (although explicitly misleading labels are detrimental). The authors emphasize, however, that the duration of pre-training required is an order of magnitude larger than the original rewinding iteration, suggesting that labels add important information which accelerates the learning process.
  • SPARSE PRETRAINING
Since sparse sub-networks are often challenging to train from scratch without the proper initialization, does pre-training make it easier for sparse neural networks to learn? The result suggests that while pre-training is sufficient to approximate the early phase of supervised training with an appropriately structured mask, it is not sufficient to do so with an inappropriate mask.
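The rotation-prediction pretext task used for label-free pre-training is simple to illustrate: each image is shown at 0°, 90°, 180°, and 270°, and the network must predict the rotation index, so no ground-truth class labels are needed. A minimal batch-construction sketch (the helper name is an assumption):

```python
import numpy as np

def make_rotation_batch(images):
    """Build a self-supervised rotation-prediction batch (illustrative only).

    Each input image yields four training examples, one per rotation; the
    target is the rotation index 0-3, not the image's true class label.
    """
    rotated, targets = [], []
    for img in images:
        for r in range(4):
            rotated.append(np.rot90(img, k=r))  # rotate by r * 90 degrees
            targets.append(r)
    return np.stack(rotated), np.array(targets)
```

A classifier pre-trained on these synthetic targets is what the authors use to test whether task labels are needed to reach a state resembling the rewinding point.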

Conclusion

These results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets, as well as the ability to identify sub-networks at initialization. The observation that weights are highly non-independent at the rewinding point suggests that they cannot be easily approximated, making approaches that attempt to “jump” directly to the rewinding point unlikely to succeed. However, the result that labels are not necessary to approximate the rewinding point suggests that learning during this phase does not require task-specific information, and that rewinding may not be necessary if networks are pre-trained appropriately.


