By Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin,
Brian C. Van Essen, Abdul A. S. Awwal, and Vijayan K. Asari
The paper covers a vast amount of information on deep learning, its history, techniques, and analysis. I present a miniature version of it here; for detailed information, refer to the original paper.
Since the 1950s, a small subset of Artificial Intelligence (AI), often called Machine Learning (ML), has revolutionized several fields in the last few decades. Neural Networks (NN) are a subfield of ML, and it was this subfield that spawned Deep Learning (DL). Since its inception, DL has been creating ever larger disruptions, showing outstanding success in almost every application domain. Fig. 1 of the paper shows the taxonomy of AI. DL (using either deep architectures of learning or hierarchical learning approaches) is a class of ML developed largely from 2006 onward.

Learning is a procedure consisting of estimating the model parameters so that the learned model (algorithm) can perform a specific task. For example, in Artificial Neural Networks (ANN), the parameters are the weight matrices (the w_ij's). DL, on the other hand, consists of several layers between the input and output layers, which allows for many stages of non-linear information processing units with hierarchical architectures that are exploited for feature learning and pattern classification [1, 2]. Learning methods based on representations of data can also be defined as representation learning. Recent literature states that DL-based representation learning involves a hierarchy of features or concepts, where high-level concepts can be defined from low-level ones and low-level concepts can be defined from high-level ones. In some articles, DL has been described as a universal learning approach that is able to solve almost all kinds of problems in different application domains. In other words, DL is not task specific.
Types of DL approaches
The paper categorizes deep learning approaches into deep supervised, semi-supervised, unsupervised, and deep reinforcement learning, and contrasts feature learning, where features are learned automatically from data, with traditional hand-engineered features.
When and where to apply DL
DL is employed in several situations where machine intelligence would be useful:
- Absence of a human expert (navigation on Mars)
- Humans are unable to explain their expertise (speech recognition, vision and language understanding)
- The solution to the problem changes over time (tracking, weather prediction, preference, stock price prediction)
- Solutions need to be adapted to particular cases (biometrics, personalization).
- The problem size is too vast for our limited reasoning capabilities (calculating webpage ranks, matching ads on Facebook, sentiment analysis).
Why Deep Learning
- Universal learning approach
This approach is sometimes called universal learning because it can be applied to almost any application domain.
- Robust
Deep learning approaches do not require features to be designed ahead of time. Features that are optimal for the task at hand are learned automatically. As a result, robustness to natural variations in the data is learned automatically as well.
- Generalization
The same deep learning approach can be used in different applications or with different data types. This approach is often called transfer learning. In addition, this approach is helpful where the problem does not have sufficient available data.
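As a rough illustration of the transfer-learning idea, here is a minimal sketch in PyTorch; the small backbone and new head below are hypothetical placeholders, not a model from the paper:

```python
import torch.nn as nn

# Hypothetical backbone assumed to be already trained on a large source dataset
backbone = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
)

for p in backbone.parameters():
    p.requires_grad = False          # freeze the features learned on the source task

new_head = nn.Linear(32, 5)          # small task-specific layer for the new task
model = nn.Sequential(backbone, new_head)
# Only new_head's parameters are trained on the (possibly small) target dataset.
```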
Challenges of DL
There are several challenges for deep learning:
- Big data analytics using Deep Learning
- Scalability of DL approaches
- Ability to generate data, which is important where data is not available for learning the system (especially for computer vision tasks such as inverse graphics).
- Energy efficient techniques for special purpose devices including mobile intelligence, FPGAs, and so on.
- Multi-task and transfer learning (generalization) or multi-module learning. This means learning from different domains or with different models together.
- Dealing with causality in learning.
Below is a brief history of neural networks highlighting key events:
- 1943: McCulloch & Pitts show that neurons can be combined to construct a Turing machine (using ANDs, ORs, & NOTs).
- 1958: Rosenblatt shows that perceptrons will converge if what they are trying to learn can be represented.
- 1969: Minsky & Papert show the limitations of perceptron’s, killing research in neural networks for a decade.
- 1985: The backpropagation algorithm by Geoffrey Hinton revitalizes the field.
- 1988: Neocognitron by Fukushima: a hierarchical neural network capable of visual pattern recognition.
- 1998: CNNs with backpropagation for document analysis by Yann LeCun.
- 2006: The Hinton lab solves the training problem for DNNs.
- 2012: AlexNet by Alex Krizhevsky.
Artificial neurons, which try to mimic the behaviour of the human brain, are the fundamental components for building ANNs. The basic computational element (neuron) is called a node (or unit); it receives inputs from external sources and has internal parameters (including weights and biases that are learned during training) that produce an output. This unit is called a perceptron.
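To make this concrete, here is a minimal sketch of a single perceptron in NumPy; the step activation and the example weights, bias, and inputs are illustrative assumptions:

```python
import numpy as np

def perceptron(x, w, b):
    """Single artificial neuron: weighted sum of the inputs plus a bias,
    passed through a step (threshold) activation."""
    z = np.dot(w, x) + b            # internal parameters: weights w and bias b
    return 1 if z > 0 else 0

# Example: a 3-input perceptron with hand-picked parameters
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, -0.6, 0.9])
b = 0.1
print(perceptron(x, w, b))          # outputs 1 here since the weighted sum is positive
```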
Gradient descent
The gradient descent approach is a first-order optimization algorithm which is used for finding the local minima of an objective function. It has been used successfully for training ANNs over the last couple of decades.
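A minimal sketch of plain gradient descent; the quadratic objective and step size are illustrative assumptions, not from the paper:

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to find a local minimum."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad_fn(w)     # move opposite to the gradient direction
    return w

# Example objective: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
grad = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad, w0=0.0))   # converges close to the minimizer w = 3
```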
Stochastic Gradient Descent (SGD)
Since a long training time is the main drawback of the traditional gradient descent approach, the SGD approach is used for training Deep Neural Networks (DNNs).
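As a minimal sketch of mini-batch SGD (the synthetic linear-regression data, batch size, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                       # synthetic inputs
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))                  # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        xb, yb = X[batch], y[batch]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)    # gradient on the mini-batch only
        w -= lr * grad                               # cheap, noisy update per batch
print(w)                                             # approaches [2.0, -1.0, 0.5]
```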
Back-propagation
DNNs are trained with the popular Back-Propagation (BP) algorithm combined with SGD. The pseudo-code of basic back-propagation is given in Algorithm III of the paper. In the case of MLPs, we can easily represent NN models using computation graphs, which are directed acyclic graphs. For that representation of DL, we can use the chain rule to efficiently calculate the gradients from the top to the bottom layers with BP.
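The paper's Algorithm III is not reproduced here; instead, a minimal sketch of applying the chain rule on a tiny two-layer MLP (the layer sizes, tanh activation, and squared-error loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2))           # 4 samples, 2 features
t = rng.normal(size=(4, 1))           # targets
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 1))

# Forward pass through the computation graph
h = np.tanh(x @ W1)                   # hidden layer
y = h @ W2                            # output layer
loss = 0.5 * np.mean((y - t) ** 2)

# Backward pass: chain rule from the top layer down to the bottom
dy = (y - t) / len(x)                 # dLoss/dy
dW2 = h.T @ dy                        # gradient w.r.t. the top-layer weights
dh = dy @ W2.T                        # error propagated back through W2
dW1 = x.T @ (dh * (1 - h ** 2))       # tanh'(z) = 1 - tanh(z)^2

# One SGD step using the back-propagated gradients
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
```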
Momentum
Momentum is a method which helps to accelerate the training process with the SGD approach. The main idea behind it is to use the moving average of the gradient instead of using only the current value of the gradient.
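A minimal sketch of the momentum update rule (the momentum coefficient 0.9 and the toy quadratic objective are illustrative assumptions):

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """Keep a moving average of past gradients (the velocity) and step along it."""
    v = beta * v + grad       # accumulate gradient history
    w = w - lr * v            # update with the smoothed direction
    return w, v

# Example: minimizing f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(100):
    w, v = momentum_step(w, v, grad=2.0 * w)
print(w)                      # approaches the minimum at w = 0
```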
Learning rate (η)
The learning rate is an important component for training DNNs (as explained in Algorithms I and II of the paper). The learning rate is the step size used during training; a larger step can make training faster, but too large a step makes it unstable.
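A tiny illustration of how the step size matters, using the toy objective f(w) = w^2 (the two rates below are illustrative assumptions):

```python
def run(lr, steps=20, w=2.0):
    for _ in range(steps):
        w = w - lr * 2.0 * w    # gradient of f(w) = w^2 is 2w
    return w

print(run(lr=0.1))    # a suitably small step converges toward 0
print(run(lr=1.1))    # too large a step overshoots and diverges
```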
Weight decay
Weight decay is used when training deep learning models as an L2 regularization approach, which helps to prevent overfitting of the network and improves model generalization.
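A minimal sketch of L2 weight decay folded into the gradient step (the decay coefficient and placeholder gradient are illustrative assumptions):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.01, decay=1e-4):
    """L2 regularization: the penalty 0.5 * decay * ||w||^2 adds decay * w to the
    gradient, shrinking the weights toward zero at every update."""
    return w - lr * (grad + decay * w)

w = np.array([1.0, -2.0, 3.0])
g = np.array([0.1, 0.0, -0.2])        # gradient of the data loss (placeholder values)
print(sgd_step_with_weight_decay(w, g))
```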
Popular CNN architectures
In general, most deep convolutional neural networks are made of a key set of basic layers, including the convolution layer, the sub-sampling (pooling) layer, dense layers, and the soft-max layer. The architectures typically consist of stacks of several convolutional and max-pooling layers followed by fully connected and SoftMax layers at the end. Some examples of such models are LeNet, AlexNet, VGG Net, NiN, and the all-convolutional network (All Conv). Other, more efficient advanced architectures have been proposed, including GoogLeNet with Inception units, Residual Networks, DenseNet, and FractalNet.
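A minimal sketch of such a stack in PyTorch (convolution, max-pooling, dense, and soft-max layers); the layer sizes and 28x28 grayscale input are illustrative assumptions, not any specific architecture from the paper:

```python
import torch
import torch.nn as nn

# Conv -> pool stacks followed by a fully connected layer, in the spirit of
# LeNet/AlexNet-style architectures (much smaller here).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolution layer
    nn.ReLU(),
    nn.MaxPool2d(2),                              # sub-sampling (max-pooling) layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                    # dense layer for 10 classes
)

x = torch.randn(8, 1, 28, 28)                     # batch of 28x28 grayscale images
logits = model(x)
probs = torch.softmax(logits, dim=1)              # soft-max over the class scores
print(probs.shape)                                # torch.Size([8, 10])
```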
The paper dives deeper into these networks and many other aspects of deep learning.