What are deep Residual Networks, or why are ResNets important?
Paper summary and code.
Deep convolutional neural networks have led to a series of breakthroughs for image classification tasks.
Challenges such as ILSVRC and COCO saw people exploiting deeper and deeper models to achieve better results. Clearly, network depth is of crucial importance.
Due to the difficult nature of the real-world tasks being thrown at deep neural networks, network size is bound to increase when one wants to attain high accuracy. The reason is that the network needs to extract a large number of critical patterns or features from raw input data, such as high-resolution colour images, and learn very complex yet meaningful representations from their combinations at later layers in the network.
Without considerable depth, the network will not be able to combine intricate low-, mid- and high-level features in increasingly complex ways and so cannot actually LEARN the inherent complexity of the problem being solved from raw data.
Hence, the first solution to solving complex problems was to make your neural networks deep, really Deeeeeeeeeeeeeeeeeeeeeeeeeep. For experiments and research purposes, some networks were pushed to depths of well over a hundred layers in order to model the problem at hand well and reach high training accuracy.
So, the authors of the Deep Residual Networks paper asked one very important but neglected question: Is learning better networks as easy as stacking more layers?
Trying this, however, proved highly inefficient and did not give the expected performance gains. The reason, you ask?
In theory, as the number of layers in a plain neural network increases, it should provide increasingly complex representations, resulting in better learning and thus higher accuracy. Contrary to this belief, experiments showed that beyond a certain depth, the network's training accuracy began to drop. See the image below.
From the paper: When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
So, to address the degradation problem, the authors introduced a deep residual learning framework.
The idea is that instead of letting a stack of layers learn the desired underlying mapping directly, we let it fit a residual mapping. So, if the desired mapping is H(x), the stacked layers are made to fit F(x) := H(x) - x, and the original mapping is recovered as H(x) = F(x) + x.
The approach is to add a shortcut, or skip connection, that lets information flow more easily: the input of a block is carried past a couple of layers and added back in, i.e. the data bypasses the intermediate layers while also flowing through the normal convolutional path.
A residual block:
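Below is a minimal Keras sketch of such a block (illustrative only, not the exact code from the repository linked later): two convolutions form F(x), and the block's input is added back before the final activation.

```python
# A minimal sketch of a residual (identity) block: the input `x` is added
# back to the output of two conv layers, so the layers only need to learn
# the residual F(x). Filter counts and kernel sizes here are illustrative.
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    shortcut = x  # the skip connection: carries x forward unchanged

    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)

    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)

    y = layers.Add()([shortcut, y])     # H(x) = F(x) + x
    return layers.Activation("relu")(y)
```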
How does this help?
As hypothesised by the authors: adding skip connections not only lets data flow between layers more easily, it also makes learning the identity function trivial, allowing the same information to pass through without transformation. The network learns better with the identity mapping at its side, because learning small perturbations with reference to an identity mapping is easier than learning each function from scratch.
ResNet uses two major building blocks to construct the entire network.
The first one is the identity block, the same as shown above.
The second component is the Conv block.
The Conv block additionally transforms the incoming data on the shortcut path (with a convolution) so that its dimensions match the output of the block's main path and the two can be added together.
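A hedged sketch of such a block in Keras (layer names and sizes are my own, illustrative choices): the main path downsamples and changes the channel count, and a 1x1 convolution on the shortcut does the same so the addition still works.

```python
# Sketch of a "conv block": when the main path changes spatial size or
# channel count, a 1x1 convolution on the shortcut reshapes x so the two
# tensors can still be added.
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size=3, stride=2):
    y = layers.Conv2D(filters, kernel_size, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)

    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)

    # projection shortcut: match both the spatial dims (via stride) and channels
    shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)
```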
These building blocks combine to give a deeper and more accurate model. The code I have included, however, uses a different arrangement of Batch Normalization and ReLU.
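For reference, one common alternative is the pre-activation ordering (Batch Normalization and ReLU applied before the convolution, as in the keras-resnet project linked below). I am assuming that is the variant meant here; a sketch:

```python
# Sketch of the pre-activation ordering: BN -> ReLU -> Conv instead of
# Conv -> BN -> ReLU. This is an assumption about the variant referred to above.
from tensorflow.keras import layers

def bn_relu_conv(x, filters, kernel_size=3, stride=1):
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Conv2D(filters, kernel_size, strides=stride, padding="same")(x)
```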
The results in the paper!
The 34-layer ResNet performed better than both the 18-layer ResNet and its 34-layer plain counterpart. In other words, the degradation problem was addressed: the deeper ResNet gained accuracy from its extra depth instead of losing it.
For deeper networks (50 layers and above), the authors introduced bottleneck architectures to keep the computational cost economical.
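A sketch of the bottleneck idea (filter counts are illustrative): a 1x1 convolution reduces the channel count, the 3x3 convolution then operates on the smaller tensor, and another 1x1 convolution restores the channels before the addition.

```python
# Sketch of a bottleneck block as used in ResNet-50 and deeper:
# 1x1 reduce -> 3x3 -> 1x1 restore, then add the shortcut.
from tensorflow.keras import layers

def bottleneck_block(x, filters):
    shortcut = x  # assumes x already has 4 * filters channels (identity variant)

    y = layers.Conv2D(filters, 1)(x)               # reduce channels
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)

    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)

    y = layers.Conv2D(4 * filters, 1)(y)           # restore channels (x4)
    y = layers.BatchNormalization()(y)

    y = layers.Add()([shortcut, y])
    return layers.Activation("relu")(y)
```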
Based on deep residual nets, the authors won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
The variants of ResNet: ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152.
THE CODE:
The repository is at: https://github.com/MANU-CHAUHAN/deep-residual-net-image-classification
The above code supports dynamic image sizes while training, i.e. the model is independent of the input image size!
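One common way to achieve this in Keras (a sketch under my own assumptions, not necessarily how the repository does it) is to leave the spatial dimensions unspecified and finish with global average pooling:

```python
# Sketch of an input-size-independent model: height/width are declared as
# None, and global average pooling collapses the variable spatial dims
# before the classifier. The 10-class head is purely illustrative.
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(None, None, 3))          # any image height/width
x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)
x = layers.GlobalAveragePooling2D()(x)                # removes the spatial dependence
outputs = layers.Dense(10, activation="softmax")(x)
model = models.Model(inputs, outputs)
```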
The initial code was inspired by various sources, mainly https://github.com/raghakot/keras-resnet
Conclusion:
ResNets achieve higher accuracies in comparison to plain nets and earlier approaches.
And one would think that the world lived happily after ResNets came into being.
This did not happen though! There were indeed better approaches waiting for the world to see, which I will cover in separate articles.