Skip to main content

Capsule Neural Networks

-By Sara Sabour, Nicholas Frosst, 
Geoffrey E. Hinton 
Google Brain Toronto



Abstract 
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discrimininatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.

A Capsule Neural Network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic the biological neural organization.

The idea is to add structures called capsules to a convolutional neural network (CNN), and to reuse output from several of those capsules to form more stable (with respect to various perturbations) representations for higher order capsules. The output is a vector consisting of the probability of observation, and a pose for that observation. This vector is similar to what is done for example when doing classification with localization in CNNs.

Among other benefits, capsnets address the "Picasso problem" in image recognition: images that have all the right parts but that are not in the correct spatial relationship (e.g., in a "face", the positions of the mouth and one eye are switched). For image recognition, capsnets exploit the fact that while viewpoint changes have nonlinear effects at the pixel level, they have linear effects at the part/object level. This can be compared to inverting the rendering of an object of multiple parts.


An invariant is an object property that does not change as a result of some transformation. For example, the area of a circle does not change if the circle is shifted to the left.
Informally, an equivariant is a property that changes predictably under transformation. For example, the center of a circle moves by the same amount as the circle when shifted. A non-equivariant is a property whose value does not change predictably under a transformation. For example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter. In computer vision, the class of an object is expected to be an invariant over many transformations. I.e., a cat is still a cat if it is shifted, turned upside down or shrunken in size. However, many other properties are instead equivariant. The volume of a cat changes when it is scaled. Equivariant properties such as a spatial relationship are captured in a pose, data that describes an object's translation, rotation, scale, and reflection. The translation is a change in location in one or more dimensions. Rotation is a change in orientation. The scale is a change in size. Reflection is a mirror image.

Unsupervised capsnets learn a global linear manifold between an object and its pose as a matrix of weights. In other words, capsnets can identify an object independent of its pose, rather than having to learn to recognize the object while including its spatial relationships as part of the object. In capsnets, the pose can incorporate properties other than spatial relationships, e.g., color (cats can be of various colors). Multiplying the object by the manifold poses the object (for an object, in space).


Pooling


Capsnets reject the pooling layer strategy of conventional CNNs that reduces the amount of detail to be processed at the next higher layer. Pooling allows a degree of translational invariance (it can recognize the same object in a somewhat different location) and allows a larger number of feature types to be represented. Capsnet proponents argue that pooling


  • violates biological shape perception in that it has no intrinsic coordinate frame.
  • provides invariance (discarding positional information) instead of equivariance (disentangling that information). 
  • ignores the linear manifold that underlies many variations among images;
  • routes statically instead of communicating a potential "find" to the feature that can appreciate it.
  • damages nearby feature detectors, by deleting the information they rely upon.






To a CNN, both pictures are similar, since they both contain similar elements. Source 


Capsules

A capsule is a set of neurons that individually activate for various properties of a type of objects, such as position, size, and hue. Formally, a capsule is a set of neurons that collectively produce an activity vector with one element for each neuron to hold that neuron's instantiation value (e.g., hue). Graphics programs use instantiation value to draw an object. Capsnets attempt to derive these from their input. The probability of the entity's presence in a specific input is the vector's length, while the vector's orientation quantifies the capsule's properties.

Artificial neurons traditionally output a scalar, real-valued activation that loosely represents the probability of observation. Capsnets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement.

Because capsules are independent, when multiple capsules agree, the probability of correct detection is much higher. A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials. As the number of dimensions increase, the likelihood of a chance agreement across a larger cluster with higher dimensions decreases exponentially.

Capsules in higher layers take outputs from capsules at lower layers and accept those whose outputs cluster. A cluster causes the higher capsule to output a high probability of observation that an entity is present and also output a high-dimensional (20-50+) pose.

Higher-level capsules ignore outliers, concentrating on clusters. This is similar to the Hough transform, the RHT, and RANSAC from classic digital image processing



The capsule network is much better than other models at telling that images in top and bottom rows belong to the same classes, only the view angle is different. The latest papers decreased the error rate by a whopping 45%. Source




Squashing is a non-linearity. So instead of adding squashing to each layer like how you do in CNN, you add the squashing to a nested set of layers. So the squashing function gets applied to the vector output of each capsule.


Conclusion

Capsules introduce a new building block that can be used in deep learning to better model hierarchical relationships inside of internal knowledge representation of a neural network. The intuition behind them is very simple and elegant.

Hinton and his team proposed a way to train such a network made up of capsules and successfully trained it on a simple data set, achieving state-of-the-art performance. This is very encouraging.

Nonetheless, there are challenges. Current implementations are much slower than other modern deep learning models. Time will show if capsule networks can be trained quickly and efficiently. In addition, we need to see if they work well on more difficult data sets and in different domains.

In any case, the capsule network is a very interesting and already working model which will definitely get more developed over time and contribute to further expansion of deep learning application domain.

Comments

Thiyagu said…
New learning... good one... thank you.

Popular posts from this blog

ABOD and its PyOD python module

Angle based detection By  Hans-Peter Kriegel, Matthias Schubert, Arthur Zimek  Ludwig-Maximilians-Universität München  Oettingenstr. 67, 80538 München, Germany Ref Link PyOD By  Yue Zhao   Zain Nasrullah   Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada  Zheng Li jk  Northeastern University Toronto, Toronto, ON M5X 1E2, Canada I am combining two papers to summarize Anomaly detection. First one is Angle Based Outlier Detection (ABOD) and other one is python module that  uses ABOD along with over 20 other apis (PyOD) . This is third part in the series of Anomaly detection. First article exhibits survey that covered length and breadth of subject, Second article highlighted on data preparation and pre-processing.  Angle Based Outlier Detection. Angles are more stable than distances in high dimensional spaces for example the popularity of cosine-based sim...

Ownership at Large

 Open Problems and Challenges in Ownership Management -By John Ahlgren, Maria Eugenia Berezin, Kinga Bojarczuk, Elena Dulskyte, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Shan He, Ralf Lämmel, Erik Meijer, Silvia Sapora, and Justin Spahr-Summers Facebook Inc.  Software-intensive organizations rely on large numbers of software assets of different types, e.g., source-code files, tables in the data warehouse, and software configurations. Who is the most suitable owner of a given asset changes over time, e.g., due to reorganization and individual function changes. New forms of automation can help suggest more suitable owners for any given asset at a given point in time. By such efforts on ownership health, accountability of ownership is increased. The problem of finding the most suitable owners for an asset is essentially a program comprehension problem: how do we automatically determine who would be best placed to understand, maintain, ev...

Hybrid Approach to Automation, RPA and Machine Learning

- By Wiesław Kopec´, Kinga Skorupska, Piotr Gago, Krzysztof Marasek  Polish-Japanese Academy of Information Technology Paper Link Courtesy DZone   Abstract One of the more prominent trends within Industry 4.0 is the drive to employ Robotic Process Automation (RPA), especially as one of the elements of the Lean approach.     The full implementation of RPA is riddled with challenges relating both to the reality of everyday business operations, from SMEs to SSCs and beyond, and the social effects of the changing job market. To successfully address these points there is a need to develop a solution that would adjust to the existing business operations and at the same time lower the negative social impact of the automation process. To achieve these goals we propose a hybrid, human-centred approach to the development of software robots. This design and  implementation method combines the Living Lab approach with empowerment through part...