
Pranav Murali Rao 21BCE5102

Aviral Vishal Bansal 21BCE1832


Kathit Bhongale 21BCE1786

Deep Residual Learning for Image Recognition

Abstract
Training increasingly deep neural networks is challenging. To ease the training of networks that are substantially deeper than those used previously, we present a residual learning framework. Instead of learning unreferenced functions, we explicitly reformulate the layers to learn residual functions with reference to the layer inputs. We offer in-depth empirical evidence that these residual networks are easier to optimise and can gain accuracy from considerably increased depth. On the ImageNet dataset, we evaluate residual networks up to 152 layers deep, 8× deeper than VGG nets [40] yet still lower in complexity. On the ImageNet test set, an ensemble of these residual nets produces an error of 3.57%. This result took first place in the 2015 ILSVRC classification task.

We also discuss CIFAR-10 analysis with 100 and 1000 layers.

For many visual recognition tasks, the depth of representations is of central importance. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Our submissions to the ILSVRC & COCO 2015 competitions were built on deep residual nets, where we also won first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Introduction
Deep convolutional neural networks [22, 21] have driven a series of advances in image classification [21, 49, 39]. Deep networks naturally integrate low-, mid-, and high-level features and classifiers in an end-to-end multilayer fashion [49], and the "levels" of features can be enriched by the number of stacked layers (depth). The leading results [40, 43, 12, 16] on the challenging ImageNet dataset [35] all use "extremely deep" [40] models, with a depth of sixteen [40] to thirty [16]. Recent evidence [40, 43] demonstrates the critical importance of network depth, and several other nontrivial visual recognition tasks have also greatly benefited from very deep models.

The importance of depth prompts the following query: Is learning better networks as simple as
adding more layers?

An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [14, 1, 8], which hamper convergence from the start. However, normalised initialization [23, 8, 36, 12] and intermediate normalisation layers [16] have largely addressed this issue, enabling networks with tens of layers to start converging under stochastic gradient descent (SGD) with backpropagation [22].

When deeper networks are able to begin converging, a degradation issue is revealed: as network
depth increases, accuracy becomes saturated (which may not be surprising), at which point it rapidly
deteriorates.

Unexpectedly, such degradation is not caused by overfitting; as reported in [10, 41] and amply supported by our experiments, adding more layers to a suitably deep model leads to higher training error. Such an example is shown in Fig. 1.

The degradation of training accuracy indicates that not all systems are similarly easy to optimise. Consider a shallower architecture and a deeper counterpart built by adding more layers onto it. A solution to the deeper model exists by construction: the added layers are identity mappings, and the remaining layers are copied from the learnt shallower model.

The existence of this constructed solution implies that a deeper model should produce no higher training error than its shallower counterpart. Experiments, however, show that our current solvers are unable to find solutions that are comparably good or better than the constructed one (or are unable to do so in feasible time).

In this study, we address the degradation problem by introducing a deep residual learning framework. Rather than hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) − x, so that the original mapping is recast as F(x) + x. We hypothesise that the residual mapping is easier to optimise than the original, unreferenced mapping. In the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

By using feedforward neural networks with "shortcut connections," F(x) + x may be accomplished
(Fig. 2).

Connections that skip one or more layers are known as shortcut connections [2, 33, 48]. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcuts add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation without modifying the solvers, and can be easily implemented using common libraries (such as Caffe [19]).

We present comprehensive experiments on ImageNet [35] to show the degradation problem and evaluate our method. We show that: 1) our extremely deep residual nets are easy to optimise, whereas the counterpart "plain" nets (which simply stack layers) exhibit higher training error as the depth increases; and 2) our deep residual nets readily enjoy accuracy gains from greatly increased depth, producing results substantially better than those of previous networks.

Similar behaviours are also observed on the CIFAR-10 set [20], indicating that our method's impacts
and optimization challenges extend beyond a single dataset.

We further explore models with over 1000 layers and show that they can still be trained successfully on this dataset.
With extremely deep residual nets, we achieve excellent results on the ImageNet classification dataset [35]. Our 152-layer residual net is the deepest network yet presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble took first place in the 2015 ILSVRC classification competition with a top-5 error rate of 3.57% on the ImageNet test set.

Thanks to the excellent generalisation of these extraordinarily deep representations to other recognition tasks, we additionally won first place on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation at the ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect it to apply to a variety of visual and non-visual problems.

Related work
Residual Representations:
VLAD [18] is a representation used in image recognition that encodes residual vectors with respect to a dictionary, and Fisher Vector [30] can be viewed as a probabilistic version of VLAD [18]. Both are powerful shallow representations for image classification and retrieval [4, 47]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44, 45], which relies on variables that represent residual vectors between two scales. It has been shown [3, 44, 45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

Shortcut Connections:
Practices and theories that motivate shortcut connections have been studied for a long time [2, 33, 48]. An early practice for training multi-layer perceptrons (MLPs) was to add a linear layer connected from the network input to the output [33, 48]. In [43, 24], a few intermediate layers are directly connected to auxiliary classifiers to address vanishing/exploding gradients. The papers [38, 37, 31, 46] propose shortcut-connection-based methods for centering layer responses, gradients, and propagated errors. In [43], an "inception" layer is composed of a shortcut branch and a few deeper branches.

"Highway networks" [41, 42] propose shortcut links with gating features [15] concurrent with our
work.

Unlike our identity shortcuts, which have no parameters, these gates depend on data and contain
parameters.

The layers in highway networks indicate non-residual functions when a gated shortcut is "closed"
(getting near to zero). Contrarily, our formulation continuously picks up residual functions; our
identity shortcuts are never closed, and all data is continuously passed through, requiring the
learning of new residual functions. In addition, despite significantly increased depth, highway
networks have not shown accuracy gains.

Deep Residual Learning


Residual Learning
Consider H(x) as an underlying mapping to be fitted by a few stacked layers (not necessarily the entire net), with x denoting the input to the first of these layers. If one hypothesises that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesise that they can asymptotically approximate the residual function, i.e., H(x) − x (assuming the input and output have the same dimensions). So rather than expecting the stacked layers to approximate H(x), we explicitly let them approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesised), the ease of learning may differ.

The paradoxical behaviour regarding the degradation issue serves as the inspiration for this
reformulation (Fig. 1, left). As was mentioned in the introduction, a deeper model should have
training error that is no larger than that of its shallower counterpart if the additional layers can be
built as identity mappings. According to the degradation problem, solvers may struggle to
approximate identity mappings using several nonlinear layers. If identity mappings are the best
option, the solvers may simply push the weights of the several nonlinear layers toward zero to get
close to identity mappings using the residual learning reformulation.
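
Written out, the argument is simply that an optimal identity implies a zero residual:

```latex
\mathcal{H}^{*}(\mathbf{x}) = \mathbf{x}
\;\Longrightarrow\;
\mathcal{F}^{*}(\mathbf{x}) = \mathcal{H}^{*}(\mathbf{x}) - \mathbf{x} = \mathbf{0},
```

so driving the weights of the stacked nonlinear layers toward zero recovers the identity.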

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function anew. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

Identity Mapping by Shortcuts


We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper a building block is defined as:

y = F(x, {Wi}) + x    (1)

Here x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. For the two-layer example in Fig. 2, F = W2 σ(W1x), where σ denotes the ReLU nonlinearity [29] and the biases are omitted to simplify notation. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition, i.e., σ(y), as shown in Fig. 2.
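
As a concrete illustration, here is a minimal PyTorch-style sketch of such a building block, assuming two 3×3 convolutional layers with batch normalization; the class and variable names are ours, and the paper itself provides no code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two stacked 3x3 conv layers F(x, {W1, W2}) plus an identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # sigma(W1 x)
        out = self.bn2(self.conv2(out))         # F(x, {Wi}) = W2 sigma(W1 x)
        return F.relu(out + x)                  # second nonlinearity after the addition

# e.g. a 64-channel block applied to a 56x56 feature map
y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
```

The identity shortcut contributes no weights of its own, so the block has exactly the parameters of the plain two-layer stack.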

The shortcut connections in Eqn. (1) introduce neither extra parameters nor computational complexity. This is not only attractive in practice but also important for our comparisons between plain and residual networks: we can fairly compare plain/residual networks that have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

The dimensions of x and F must be equal in Eqn. (1). If this is not the case (for example, when changing the input/output channels), a linear projection Ws can be performed by the shortcut connection to match the dimensions:

y = F(x, {Wi}) + Wsx.    (2)

A square matrix Ws could also be used in Eqn. (1). However, we will show by experiments that the identity mapping is sufficient and economical for addressing the degradation problem; Ws is therefore only used when matching dimensions.
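
When the dimensions do change, Eqn. (2) can be sketched with a 1×1 convolution acting as Ws; again, this is an illustrative PyTorch-style sketch, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionBlock(nn.Module):
    """Residual block whose shortcut is a linear projection Ws (a 1x1 conv),
    used only when the input/output dimensions differ, e.g. when halving the
    feature map with stride 2 while doubling the channels."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)  # Ws

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.shortcut(x))   # y = F(x, {Wi}) + Ws x

# e.g. going from 64 channels at 56x56 to 128 channels at 28x28
y = ProjectionBlock(64, 128)(torch.randn(1, 64, 56, 56))
```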

The form of the residual function F is flexible. The function F used in the experiments of this paper has two or three layers (Fig. 5), although more layers are possible. However, if F has only a single layer, Eqn. (1) becomes similar to a linear layer, y = W1x + x, for which we have not observed advantages.

We also note that although the above notation refers to fully-connected layers for simplicity, it applies equally to convolutional layers. The function F(x, {Wi}) can represent multiple convolutional layers, in which case the element-wise addition is performed on two feature maps, channel by channel.

Network Architectures
We tested different plain/residual nets and found recurring phenomena. We describe two models
for ImageNet in the following in order to provide examples for discussion.

Plain Network
Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [40] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers with a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers in Fig. 3 is 34.
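
For concreteness, here is a quick tally of the 34 weighted layers, assuming the standard configuration from Table 1 (an initial 7×7 convolution followed by stages of 6, 8, 12, and 6 stacked 3×3 layers with 64, 128, 256, and 512 filters); the snippet is only a sketch of the bookkeeping:

```python
# (number of 3x3 conv layers, filters, output feature map size) per stage.
# Halving the map size doubles the filters, per design rule (ii).
stages = [
    (6,  64, 56),
    (8, 128, 28),
    (12, 256, 14),
    (6, 512, 7),
]
conv_layers = 1 + sum(n for n, _, _ in stages)   # 7x7 stem conv + stacked 3x3 convs
weighted_layers = conv_layers + 1                # + the 1000-way fully-connected layer
assert weighted_layers == 34
```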

It is worth noting that our model has fewer filters and lower complexity than VGG nets [40]. Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), only 18% of VGG-19's 19.6 billion.

Experiments
ImageNet Classification
Even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one, the 34-layer plain net has higher training error throughout the whole training procedure. We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures that forward propagated signals have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, Table 3 shows that the 34-layer plain net is still able to achieve competitive accuracy, suggesting that the solver works to some extent.

Deeper Bottleneck Architectures:

Next we describe our deeper nets for ImageNet. Because of concerns about the training time we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example where both designs have similar time complexity.


For the bottleneck designs, the parameter-free identity shortcuts are particularly important. If the identity shortcut in Fig. 5 (right) is replaced with a projection, one can show that the time complexity and model size are roughly doubled, as the shortcut connects the two high-dimensional ends.
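
A hedged sketch of such a bottleneck block (PyTorch-style; the 256/64 channel split follows the example in Fig. 5, and the remaining names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, with a parameter-free identity shortcut."""

    def __init__(self, channels=256, width=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, width, 1, bias=False)        # 1x1: shrink dimensions
        self.bn1 = nn.BatchNorm2d(width)
        self.conv = nn.Conv2d(width, width, 3, padding=1, bias=False)  # 3x3 bottleneck layer
        self.bn2 = nn.BatchNorm2d(width)
        self.restore = nn.Conv2d(width, channels, 1, bias=False)       # 1x1: restore dimensions
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.restore(out))
        # The identity shortcut is free; a projection here would connect the two
        # high-dimensional ends and roughly double the block's cost.
        return F.relu(out + x)

y = Bottleneck()(torch.randn(1, 256, 14, 14))
```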

101-layer and 152-layer ResNets:

We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).

The 50/101/152-layer ResNets are considerably more accurate than the 34-layer ones (Tables 3 and 4). Since the degradation problem is absent, we enjoy significant accuracy gains from greatly increased depth. The benefits of depth are witnessed for all evaluation metrics.

Comparisons with State-of-the-art Methods. Table 4 compares our results with the previous best single-model results.

Our baseline 34-layer ResNets achieve very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (with only two 152-layer ones at the time of submission).

This leads to a top-5 error of 3.57% on the test set (Table 5). This entry won first place in the 2015 ILSVRC classification competition.

CIFAR-10 and Analysis

We conducted further studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviour of extremely deep networks rather than on pushing state-of-the-art results, so we intentionally use simple architectures.

The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images with the per-pixel mean subtracted. The first layer is a 3×3 convolution. Then we use a stack of 6n layers with 3×3 convolutions on feature maps of sizes 32, 16, and 8 respectively, with 2n layers for each feature map size.
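
The depth of these nets is therefore 6n + 2 weighted layers (the first convolution, the 6n stacked layers, and the final classification layer), as the small check below illustrates:

```python
# Total weighted layers for the CIFAR-10 nets described above:
# one initial 3x3 conv, 6n stacked 3x3 conv layers (2n per map size),
# and the final classification layer, i.e. 6n + 2.
def cifar_depth(n):
    return 6 * n + 2

print([cifar_depth(n) for n in (3, 5, 7, 9, 18)])   # [20, 32, 44, 56, 110]
```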

We adopt the weight initialization of [12] and BN [16], with a weight decay of 0.0001 and momentum of 0.9, and no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.
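
One possible way to express this preprocessing with torchvision is sketched below; the normalization constants are commonly used CIFAR-10 channel statistics rather than the exact per-pixel mean subtraction of the text, so treat them as an assumption:

```python
import torchvision.transforms as T

normalize = T.Normalize(mean=(0.4914, 0.4822, 0.4465),
                        std=(0.2470, 0.2435, 0.2616))   # common CIFAR-10 statistics

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),     # pad 4 pixels per side, sample a random 32x32 crop
    T.RandomHorizontalFlip(),        # or its horizontal flip
    T.ToTensor(),
    normalize,
])

test_transform = T.Compose([T.ToTensor(), normalize])   # single view of the original image
```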

When n = 3, 5, 7, and 9, we compare networks of 20, 32, 44, and 56 layers. Fig. 6 (left) shows the behaviour of the plain nets: the deep plain nets suffer from increased depth, exhibiting higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [41]), suggesting that such an optimization difficulty is a fundamental problem.

We further explore n = 18, which leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging. So we warm up the training with a learning rate of 0.01 until the training error falls below 80% (about 400 iterations), then return to 0.1 and continue training. The rest of the learning schedule is as before.
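
A sketch of this schedule in PyTorch-style code (the optimizer setup and the stand-in model are illustrative assumptions; the paper does not provide training code):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 10)   # stand-in for the ResNet; illustrative only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

def learning_rate(iteration, warmed_up=True):
    """Piecewise-constant schedule from the text; training stops at 64k iterations."""
    if not warmed_up:
        return 0.01      # 110-layer warm-up (~400 iterations, until training error < 80%)
    if iteration < 32000:
        return 0.1
    if iteration < 48000:
        return 0.01
    return 0.001

for iteration in range(64000):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(iteration)
    # ... forward pass, loss, backward pass, optimizer.step() ...
```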


Examining Layer Responses

Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before the other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions.

As shown in Fig. 7, ResNets generally have smaller responses than their plain counterparts. These results support our basic motivation (Sec. 3.1) that the residual functions may in general be closer to zero than the non-residual functions.

The comparisons among ResNet-20, 56, and 110 in Fig. 7 also show that the deeper ResNet has smaller response magnitudes: when there are more layers, an individual layer of a ResNet tends to modify the signal less.
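
One way such response statistics could be collected (an assumption on our part, not the authors' analysis code) is by hooking the BatchNorm outputs, which correspond to the responses described above:

```python
import torch
import torch.nn as nn

def response_stds(model, inputs):
    """Collect the std of every BatchNorm output, i.e. the response of each
    3x3 layer after BN and before the following nonlinearity."""
    stds, handles = [], []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            handles.append(module.register_forward_hook(
                lambda mod, inp, out: stds.append(out.std().item())))
    with torch.no_grad():
        model(inputs)
    for handle in handles:
        handle.remove()
    return stds
```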

Object Detection on PASCAL and MS COCO

Our method shows good generalisation performance on other recognition tasks. Tables 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and on COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [40] with ResNet-101. The detection implementation (see appendix) is the same for both models, so the gains can only be attributed to better networks.

Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.

We took first place in numerous ILSVRC & COCO 2015 competition tracks using deep residual nets,
including ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
References
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[2] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[3] W. L. Briggs, S. F. McCormick, et al. A Multigrid Tutorial. SIAM, 2000.
[4] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, pages 303–338, 2010.
[6] R. Girshick. Fast R-CNN. In ICCV, 2015.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[8] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv:1302.4389, 2013.
[10] K. He and J. Sun. Convolutional neural networks at constrained time cost. In CVPR, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[14] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991.
[15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[17] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. TPAMI, 33, 2011.
[18] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. TPAMI, 2012.
[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[20] A. Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
[21] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[23] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
[24] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. arXiv:1409.5185, 2014.
[25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[28] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In NIPS, 2014.
[29] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[30] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[31] T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[33] B. D. Ripley. Pattern recognition and neural networks. Cambridge University Press, 1996.
[34] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
[36] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.
[37] N. N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
[38] N. N. Schraudolph. Centering neural network gradient factors. In Neural Networks: Tricks of the Trade, pages 207–226. Springer, 1998.
[39] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
[42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. arXiv:1507.06228, 2015.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[44] R. Szeliski. Fast surface interpolation using hierarchical basis functions. TPAMI, 1990.
[45] R. Szeliski. Locally adapted hierarchical basis preconditioning. In SIGGRAPH, 2006.
[46] T. Vatanen, T. Raiko, H. Valpola, and Y. LeCun. Pushing stochastic gradient towards second-order methods: backpropagation learning with transformations in nonlinearities. In Neural Information Processing, 2013.
[47] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms, 2008.
[48] W. Venables and B. Ripley. Modern applied statistics with S-Plus. 1999.
[49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. In ECCV, 2014.
