By Jonathan Frankle (MIT CSAIL), David J. Schwab (CUNY ITS), and Ari S. Morcos (Facebook AI Research)
Many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example:
- Sparse, trainable sub-networks emerge
- Gradient descent moves into a small subspace
- The network undergoes a critical period
The researchers examine the changes that deep neural networks undergo during this early phase of training.
Over the past decade, methods for successfully training large, deep neural networks have revolutionized machine learning. Yet despite their remarkable empirical performance, the underlying reasons for the success of these approaches remain poorly understood. A large body of work has focused on understanding what happens during the later stages of training, while the initial phase has been less explored.
The research builds on the lottery ticket hypothesis and its core procedure, iterative magnitude pruning (IMP) with rewinding.
What is the Lottery Ticket Hypothesis?
A randomly initialized, dense neural network contains a sub-network that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.
- Can reduce the parameter count by more than 90% without harming accuracy
- Decreases storage size, energy consumption, and inference time
- Sub-networks are typically identified using one-shot pruning (see the sketch below)
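For context, one-shot magnitude pruning removes a chosen fraction of the lowest-magnitude weights in a single pass after training, rather than over several prune-retrain cycles. A minimal sketch, assuming a PyTorch model (the helper name is illustrative, not a library function):

```python
# Sketch of one-shot global magnitude pruning (illustrative helper, not a library function).
import torch


def one_shot_prune(model, sparsity=0.9):
    """Return boolean masks that remove the `sparsity` fraction of smallest-magnitude weights."""
    weights = {name: p.detach() for name, p in model.named_parameters() if p.dim() > 1}
    all_mags = torch.cat([w.abs().flatten() for w in weights.values()])
    threshold = torch.quantile(all_mags, sparsity)
    return {name: w.abs() > threshold for name, w in weights.items()}
```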
The paper uses the rewinding variant of the lottery ticket procedure, namely:
Iterative magnitude pruning (IMP) with rewinding:
In order to test various hypotheses about the state of sparse networks early in training, the authors use the IMP with rewinding procedure to extract sub-networks from various points in training that could have learned on their own.
The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network,
and rewinding the remaining weights to their values from an earlier iteration k during the initial,
pre-pruning training run. This process is iterated to produce networks with high sparsity levels. As
demonstrated in Figure 1, IMP with rewinding leads to sparse sub-networks that can
train to high performance even at high sparsity levels (>90%).
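To make the procedure concrete, here is a minimal sketch of IMP with rewinding in PyTorch. This is illustrative only: the `train` routine, the mask handling, and the helper names are assumptions, not the authors' implementation.

```python
# Minimal sketch of IMP with rewinding (illustrative; not the authors' code).
# Assumes a PyTorch model and a user-supplied `train(model, masks, num_iterations)` routine
# that zeroes masked weights after every update.
import copy
import torch


def imp_with_rewinding(model, train, k, rounds, prune_frac=0.2):
    """Iteratively prune 20% of surviving weights, rewinding survivors to iteration k."""
    # Train for k iterations and store the rewind point.
    train(model, masks=None, num_iterations=k)
    rewind_state = copy.deepcopy(model.state_dict())

    # Start with all weights unpruned (prune weight matrices/convolutions only).
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        # 1. Train the masked network to completion.
        train(model, masks=masks, num_iterations=None)

        # 2. Globally rank surviving weights by magnitude; prune the lowest 20%.
        mags = torch.cat([(p.detach().abs() * masks[n]).flatten()
                          for n, p in model.named_parameters() if n in masks])
        threshold = torch.quantile(mags[mags > 0], prune_frac)
        for n, p in model.named_parameters():
            if n in masks:
                masks[n] = ((p.detach().abs() * masks[n]) > threshold).float()

        # 3. Rewind surviving weights to their values from iteration k.
        model.load_state_dict(rewind_state)

    return model, masks
```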
Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity when the procedure is performed for several rewinding iterations k. For k ≥ 500, sub-networks can match the performance of the original network with only 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).
The paper also draws on prior work on critical periods in deep learning, which found that perturbing the training process by providing corrupted data early in training can cause irrevocable damage to the final performance of the network. Architecture, learning rate schedule, and regularization (in particular weight decay and data augmentation) all modify the timing of this critical period.
After the initial iterations, gradient magnitudes
drop and the rate of change in each of the aforementioned quantities gradually slows through the
remainder of the period observed. Interestingly, gradient magnitudes reach a minimum after the
first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy
improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to
the final 91.5%. By 2000 iterations, accuracy approaches 80%.
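As an illustration of the kind of measurement involved, one might track the global gradient magnitude at each iteration as follows. This is a sketch assuming a standard PyTorch training loop, not the paper's measurement code:

```python
# Sketch: record the global L2 norm of the gradient at every iteration of early training.
import torch


def global_grad_norm(model):
    """L2 norm of all parameter gradients concatenated into one vector."""
    grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads).norm().item()


def train_and_log(model, loader, optimizer, criterion, num_iterations=2000):
    """Run `num_iterations` SGD steps and return the gradient magnitude at each step."""
    norms, it = [], 0
    while it < num_iterations:
        for x, y in loader:
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            norms.append(global_grad_norm(model))  # measured before the parameter update
            optimizer.step()
            it += 1
            if it >= num_iterations:
                break
    return norms
```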
Data dependency: the paper also examines how much this early phase depends on the data and its labels, by pre-training in several alternative ways:
- Random labels
- Self-supervised rotation prediction (sketched below)
- Sparse pretraining
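As one illustration, the self-supervised rotation-prediction pretext task replaces the true labels with the rotation applied to each image, so pre-training never sees task labels. A minimal PyTorch sketch (the function names are illustrative, and the model is assumed to have four output logits, one per rotation):

```python
# Sketch of self-supervised rotation prediction as a pre-training task (illustrative, not the paper's code).
import torch


def rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation index is the pseudo-label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels


def pretrain_step(model, images, optimizer, criterion):
    """One pre-training step that uses only the rotation pseudo-labels."""
    inputs, pseudo_labels = rotation_batch(images)
    optimizer.zero_grad()
    loss = criterion(model(inputs), pseudo_labels)  # model outputs 4 logits (one per rotation)
    loss.backward()
    optimizer.step()
    return loss.item()
```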
Conclusion
The results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets, as well as the ability to identify sub-networks at initialization. The observation that the weights are highly non-independent at the rewinding point suggests that they cannot be easily approximated, making approaches that attempt to “jump” directly to the rewinding point unlikely to succeed. However, the result that labels are not necessary to approximate the rewinding point suggests that learning during this phase does not require task-specific information, and that rewinding may not be necessary if networks are pre-trained appropriately.