f(x; w, b) = x^\top w + b.
[Figure: the XOR problem in the original x space (axes x_1, x_2) and in the learned h space (axes h_1, h_2), together with the network graph with inputs x_1, x_2, hidden units h_1, h_2, and output y.]
g(z) = \max\{0, z\}
[Figure: the rectified linear activation function g(z) plotted against z.]
w = \begin{bmatrix} 1 \\ -2 \end{bmatrix},
and b = 0.
We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:
X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}.
The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:
XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}.
Next, we add the bias vector c, to obtain
\begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}.
In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:
\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}.
This transformation has changed the relationship between the examples: they no longer lie on a single line, and they now lie in a space where a linear model can solve the problem. We finish by multiplying by the weight vector w, obtaining the output
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}.
The neural network has obtained the correct answer for every example in the batch.
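As a quick sanity check, the whole forward pass above can be reproduced in a few lines of NumPy. This is only an illustrative sketch, and it assumes the values of W and c that the intermediate matrices above imply (W with all entries equal to 1, c = [0, -1]^T); it should print the hidden representation and the outputs [0, 1, 1, 0].

import numpy as np

# Design matrix: all four points of the XOR problem, one example per row.
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)

# Parameters of the solution described in the text.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

H = np.maximum(0.0, X @ W + c)   # hidden representation h = max{0, XW + c}
y_hat = H @ w + b                # final linear layer

print(H)      # [[0 0], [1 0], [1 0], [2 1]]
print(y_hat)  # [0. 1. 1. 0.] -- XOR of the two inputs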
In this example, we simply specified the solution, then showed that it obtains zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
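To illustrate the gradient-based alternative, here is a minimal sketch (not from the text) of training the same 2-2-1 rectified-linear network on XOR with plain gradient descent on mean squared error, using hand-derived gradients. The learning rate, initialization scale, and step count are arbitrary choices; most random initializations reach near-zero error, but the learned parameters are rarely the tidy integer solution above, and an occasional unlucky initialization may need a restart.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Randomly initialized parameters of a 2-2-1 ReLU network.
W = rng.normal(size=(2, 2))
c = np.zeros(2)
w = rng.normal(size=2)
b = 0.0
lr = 0.1

for step in range(5000):
    # Forward pass.
    Z = X @ W + c
    H = np.maximum(0.0, Z)
    y_hat = H @ w + b
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: hand-derived gradients of the MSE loss.
    d_yhat = 2.0 * (y_hat - y) / len(y)
    d_w = H.T @ d_yhat
    d_b = d_yhat.sum()
    d_H = np.outer(d_yhat, w)
    d_Z = d_H * (Z > 0)          # gradient through the ReLU
    d_W = X.T @ d_Z
    d_c = d_Z.sum(axis=0)

    # Gradient descent update.
    W -= lr * d_W; c -= lr * d_c; w -= lr * d_w; b -= lr * d_b

print(loss, np.round(y_hat, 3))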
\log \tilde{P}(y) = yz,
\tilde{P}(y) = \exp(yz),
P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1} \exp(y'z)},
P(y) = \sigma((2y - 1)z).
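The last identity is easy to verify numerically: normalizing the unnormalized probabilities exp(yz) over y in {0, 1} yields exactly the logistic sigmoid applied to (2y - 1)z. A small check, using illustrative values of z only:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for z in [-3.0, -0.5, 0.0, 1.7]:
    for y in (0, 1):
        # Normalize the unnormalized probabilities exp(y * z) over y in {0, 1}.
        p_normalized = np.exp(y * z) / (np.exp(0 * z) + np.exp(1 * z))
        # The closed form from the text: P(y) = sigma((2y - 1) z).
        p_sigmoid = sigmoid((2 * y - 1) * z)
        assert np.isclose(p_normalized, p_sigmoid)
print("P(y) = sigma((2y - 1) z) matches the normalized exponential.")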
z = W^\top h + b,
\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}.
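In practice the softmax is computed after subtracting max_i z_i from every z_i; this leaves the result unchanged, because the common factor cancels between numerator and denominator, but it prevents exp from overflowing on large inputs. A minimal sketch:

import numpy as np

def softmax(z):
    # Subtracting the maximum does not change the result but prevents
    # exp() from overflowing when some z_i are large.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([1.0, 2.0, 3.0, 1000.0])
print(softmax(z))          # ~[0, 0, 0, 1], with no overflow warnings
print(softmax(z).sum())    # 1.0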
h = g(W^\top x + b).
g(z)_i = \max_{j \in G^{(i)}} z_j,
where G^{(i)} is the set of indices into the inputs for group i, \{(i-1)k + 1, \ldots, ik\}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.
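A maxout layer is little more than an affine transformation followed by a max over each group of k pre-activations. The sketch below is an illustrative NumPy version; the shapes, names, and random initialization are my own, not from the text.

import numpy as np

def maxout_layer(x, W, b, k):
    """x: (n_in,) input; W: (n_in, n_out * k); b: (n_out * k,).
    Returns (n_out,): the max over each group of k pre-activations."""
    z = x @ W + b                      # (n_out * k,)
    return z.reshape(-1, k).max(axis=1)

rng = np.random.default_rng(0)
n_in, n_out, k = 5, 3, 4
W = rng.normal(size=(n_in, n_out * k))
b = np.zeros(n_out * k)
x = rng.normal(size=n_in)
h = maxout_layer(x, W, b, k)
print(h.shape)  # (3,)

With k = 2 and one filter of each pair fixed at zero, the max reduces to max{0, z}, which is one way to see the claim below that a two-piece maxout layer can implement a rectified linear layer.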
A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, the absolute value rectification function, or the leaky or parametric ReLU, or it can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.
g(z) = \sigma(z)
g(z) = \tanh(z).
Many other types of hidden units are possible but are used less frequently. In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.
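The cosine experiment mentioned above amounts to nothing more than swapping the activation function in an ordinary layer. The following is a hedged sketch of such a layer, not the authors' experiment code; the layer sizes and initialization scale are placeholder choices.

import numpy as np

def cosine_layer(x, W, b):
    # Hidden layer with a cosine activation: h = cos(W x + b).
    return np.cos(x @ W + b)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(784, 256))   # e.g. an MNIST-sized input
b = np.zeros(256)
x = rng.random(784)
h = cosine_layer(x, W, b)
print(h.shape)  # (256,)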
It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W^\top x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V^\top U^\top x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters.
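To make the parameter accounting concrete, the sketch below (with illustrative sizes of my own choosing, and g taken to be a rectified linear function) builds the factored layer and compares the np parameters of a full W against the (n + p)q parameters of U and V:

import numpy as np

n, p, q = 1000, 500, 50   # inputs, outputs, bottleneck width (illustrative sizes)

rng = np.random.default_rng(0)
U = rng.normal(size=(n, q))   # first, purely linear layer
V = rng.normal(size=(q, p))   # second layer's weights
b = np.zeros(p)

x = rng.normal(size=n)
h_factored = np.maximum(0.0, (x @ U) @ V + b)   # h = g(V^T U^T x + b), g = ReLU here

full_params = n * p              # a dense W would need n*p weights
factored_params = (n + p) * q    # U and V together
print(full_params, factored_params)   # 500000 vs 75000

The saving comes at the cost of constraining the linear transformation to have rank at most q.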
[Figure: test accuracy (percent) versus number of hidden layers.]
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
[Figure 6.7: test accuracy (percent) versus number of parameters, for models of depth 3 and 11.]
[Figure: examples of computational graphs, panels (a)-(d), built from nodes such as x, y, w, b, X, W, H, u^{(1)}, u^{(2)} and operations such as dot, +, relu, sum.]
Suppose x = f(w), y = f(x), and z = f(y). Applying the chain rule,
\frac{\partial z}{\partial w} = \frac{\partial z}{\partial y} \, \frac{\partial y}{\partial x} \, \frac{\partial x}{\partial w}
= f'(y)\, f'(x)\, f'(w)
= f'(f(f(w)))\, f'(f(w))\, f'(w).
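The nested expression is easy to check numerically. The sketch below picks f(t) = t^2 purely for illustration, multiplies the three local derivatives f'(y) f'(x) f'(w), and compares the product with a centered finite-difference estimate of dz/dw:

import numpy as np

f = lambda t: t ** 2          # an arbitrary choice of f, for illustration only
f_prime = lambda t: 2 * t     # its derivative

w = 1.3
x = f(w)
y = f(x)
z = f(y)

# Chain rule: dz/dw = f'(y) * f'(x) * f'(w) = f'(f(f(w))) * f'(f(w)) * f'(w).
chain = f_prime(y) * f_prime(x) * f_prime(w)

# Centered finite-difference check of dz/dw.
eps = 1e-6
numeric = (f(f(f(w + eps))) - f(f(f(w - eps)))) / (2 * eps)

print(chain, numeric)   # both ~ 8 * w**7 for this choice of f
assert np.isclose(chain, numeric, rtol=1e-4)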
[Figure: the symbol-to-symbol approach to computing derivatives, extending the graph w -> x -> y -> z (each edge an application of f) with nodes that compute dz/dy, dy/dx, dx/dw, and dz/dw.]
[Figure: computational graph for the cost of the example MLP, with matmul nodes producing U^{(1)} and U^{(2)}, a cross_entropy node producing J_MLE, and an addition node producing the total cost J.]
6.5.8 Complications
ABCD,