
[Figure 4 appears here: marginal error (blue) and total time in h plotted against learning rate, hidden size, and input noise standard deviation (columns) for TIMIT (Classification Error), IAM Online (Character Error Rate), and JSB Chorales (Negative Log Likelihood) (rows).]

Figure 4. Predicted marginal error (blue) and marginal time for different values of the learning rate, hidden size, and input noise (columns) for the test set of all three datasets (rows). The shaded area indicates the standard deviation between the tree-predicted marginals and thus the reliability of the predicted mean performance. Note that each plot is for the vanilla LSTM, but the curves for all variants that are not significantly worse look very similar.
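The caption refers to marginals predicted by an ensemble of regression trees and to the spread between those per-tree predictions. As a rough illustration of that idea (not the exact fANOVA tooling used in the paper), the following sketch computes a marginal curve and its standard deviation across the trees of a scikit-learn random forest; the function name, its arguments, and the clamp-and-average scheme are assumptions made for this example.

# Sketch: tree-predicted marginal of performance along one hyperparameter,
# with the standard deviation across trees (the shaded band in Figure 4).
# Illustrative only; not the paper's exact procedure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def tree_marginals(X, y, dim, grid, n_trees=100, seed=0):
    """X: (n_samples, n_hyperparams) sampled settings, y: observed test errors.
    dim: index of the hyperparameter of interest, grid: values to evaluate it at."""
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)

    per_tree = np.empty((n_trees, len(grid)))
    for t, tree in enumerate(forest.estimators_):
        for g, value in enumerate(grid):
            X_mod = X.copy()
            X_mod[:, dim] = value                         # clamp the hyperparameter of interest
            per_tree[t, g] = tree.predict(X_mod).mean()   # average out all the others

    # Mean curve and its spread across the individual trees.
    return per_tree.mean(axis=0), per_tree.std(axis=0)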

Figure 5. Pie charts showing which fraction of variance of the test set performance can be attributed to each of the hyperparameters. The percentage of variance that is due to interactions between multiple parameters is indicated as "higher order."
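As a rough illustration of the kind of decomposition behind Figure 5, the sketch below estimates how much of the predicted performance variance each hyperparameter accounts for on its own and lumps the remainder into a "higher order" share. It is a simplified stand-in for fANOVA, and the function name, the evaluation grid, and the use of a scikit-learn forest are assumptions for this example.

# Sketch: fraction of predicted variance attributable to each hyperparameter,
# a simplified stand-in for the fANOVA-style decomposition in Figure 5.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def variance_shares(X, y, n_trees=100, grid_size=20, seed=0):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    total_var = forest.predict(X).var()

    shares = {}
    for dim in range(X.shape[1]):
        grid = np.linspace(X[:, dim].min(), X[:, dim].max(), grid_size)
        marginal = []
        for value in grid:
            X_mod = X.copy()
            X_mod[:, dim] = value                        # clamp this hyperparameter ...
            marginal.append(forest.predict(X_mod).mean())  # ... average out the rest
        shares[dim] = np.var(marginal) / total_var       # single-parameter share

    # Whatever the single-parameter marginals do not explain is attributed
    # to interactions between multiple hyperparameters ("higher order").
    shares["higher order"] = max(0.0, 1.0 - sum(shares.values()))
    return shares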
For example, looking at the pair hidden size and learning rate on the left side for the TIMIT dataset, we can see that performance varies strongly along the x-axis (learning rate), first decreasing and then increasing again. This is what we would expect given the valley shape of the learning rate marginal in Figure 4. Along the y-axis (hidden size), performance seems to decrease slightly from top to bottom. Again, this is roughly what we would expect from the hidden size plot in Figure 4.

On the right side of Figure 6 we can see, for the same pair of hyperparameters, how their interaction differs from the case in which they are completely independent. This heat map exhibits less structure, and it may in fact be the case that we would need more samples to properly analyze the interplay between them. However, given our observations so far, this might not be worth the effort. In any case, it is clear from the plot on the left that varying the hidden size does not change the region of optimal learning rate.
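The two panels just described can be thought of as a joint marginal and what remains of it once the additive (independent) part is subtracted. The sketch below spells out that idea with the same forest-based marginals as above; it illustrates the general construction, not the paper's exact procedure, and all names are assumptions.

# Sketch: joint marginal over two hyperparameters (left-panel analogue) and
# the residual interaction after removing the additive part (right-panel analogue).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def interaction_map(X, y, dim_a, dim_b, grid_a, grid_b, n_trees=100, seed=0):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)

    def marginal(dims, values):
        # Clamp the given hyperparameters and average the prediction over the rest.
        X_mod = X.copy()
        for d, v in zip(dims, values):
            X_mod[:, d] = v
        return forest.predict(X_mod).mean()

    mean_pred = forest.predict(X).mean()
    joint = np.array([[marginal((dim_a, dim_b), (a, b)) for b in grid_b]
                      for a in grid_a])
    marg_a = np.array([marginal((dim_a,), (a,)) for a in grid_a])
    marg_b = np.array([marginal((dim_b,), (b,)) for b in grid_b])

    # Interaction = joint marginal minus the two single-parameter effects,
    # adding the grand mean back so the additive part cancels exactly.
    interaction = joint - marg_a[:, None] - marg_b[None, :] + mean_pred
    return joint, interaction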
One clear interaction pattern can be observed in the IAM Online and JSB Chorales datasets between learning rate and input noise. Here it can be seen that for high learning rates (≳ 10−4) lower input noise (≲ 0.5) is better, as also observed in the marginals from Figure 4. This trend reverses for lower learning rates, where higher values of input noise are beneficial. Though interesting, this is of no practical relevance, because performance is generally bad in that region of low learning rates. Apart from this, however, it is difficult to discern any regularities in the analyzed hyperparameter interactions. We conclude that there is little practical value in attending to the interplay between hyperparameters. So for practical purposes, hyperparameters can be treated as approximately independent and thus optimized separately.
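One practical consequence of treating hyperparameters as approximately independent is that each can be tuned on its own grid while the others are held at their current best values. The sketch below illustrates such a one-at-a-time search; the function name, the evaluate callback, and the example grids are hypothetical and are not part of the study.

# Sketch: one-at-a-time hyperparameter search under the independence assumption.
def tune_independently(evaluate, grids, defaults):
    """evaluate: callable returning a validation error for a full setting (user-supplied).
    grids: dict name -> candidate values; defaults: dict name -> starting value."""
    best = dict(defaults)
    for name, candidates in grids.items():
        scores = {}
        for value in candidates:
            trial = dict(best, **{name: value})   # vary one hyperparameter only
            scores[value] = evaluate(**trial)
        best[name] = min(scores, key=scores.get)  # keep the value with the lowest error
    return best

# Hypothetical usage (train_and_score and the grids are placeholders):
# best = tune_independently(
#     evaluate=train_and_score,
#     grids={"learning_rate": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
#            "hidden_size": [64, 128, 256, 512],
#            "input_noise_std": [0.0, 0.25, 0.5, 0.75, 1.0]},
#     defaults={"learning_rate": 1e-3, "hidden_size": 128, "input_noise_std": 0.0},
# )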
VI. CONCLUSION

This paper reports the results of a large scale study on variants of the LSTM architecture. We conclude that the most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets. None of the eight investigated modifications significantly improves performance. However, certain modifications, such as coupling the input and forget gates (CIFG) or removing peephole connections (NP), simplified LSTMs in our experiments without significantly decreasing performance.
