
Minds and Machines (2019) 29:515–553

https://doi.org/10.1007/s11023-019-09512-8

ORIGINAL PAPER

The Unbearable Shallow Understanding of Deep Learning

Alessio Plebe1 · Giorgio Grasso1

Received: 10 April 2019 / Accepted: 28 November 2019 / Published online: 12 December 2019
© Springer Nature B.V. 2019

Abstract
This paper analyzes the rapid and unexpected rise of deep learning within Artifi-
cial Intelligence and its applications. It tackles the possible reasons for this remark-
able success, providing candidate paths towards a satisfactory explanation of why
it works so well, at least in some domains. A historical account is given for the ups
and downs, which have characterized neural networks research and its evolution
from “shallow” to “deep” learning architectures. A precise account of “success” is
given, in order to sieve out aspects pertaining to marketing or sociology of research,
and the remaining aspects seem to certify a genuine value of deep learning, calling
for explanation. The alleged two main propelling factors for deep learning, namely
computing hardware performance and neuroscience findings, are scrutinized, and
evaluated as relevant but insufficient for a comprehensive explanation. We review
various attempts that have been made to provide mathematical foundations able to
justify the efficiency of deep learning, and we deem this is the most promising road
to follow, even if the current achievements are still scattered and relevant only to very
limited classes of deep neural models. The authors’ take is that most of what can
explain the very nature of why deep learning works at all and even very well across
so many domains of application is still to be understood and further research, which
addresses the theoretical foundation of artificial learning, is still very much needed.

Keywords Artificial neural networks · Deep learning · Philosophy of science · Heuristic appraisal · Visual system

* Alessio Plebe
aplebe@unime.it
Giorgio Grasso
gmgrasso@unime.it
1 Department of Cognitive Science, Università degli Studi di Messina, via Concezione 8, Messina 98121, Italy


1 Introduction

The most dramatic shift in computing in the last few years is due to a family of
techniques collected under the name of deep learning (Schmidhuber 2015), the
last evolution of the idea of artificial neural networks organized in layers (Rumel-
hart and McClelland 1986). By adopting deep learning techniques, computers are
endowed with the capability of acting without being explicitly programmed, con-
structing algorithms that adapt their functions from data, producing decisions or
predictions. Deep learning is responsible for the current AI Renaissance (Tan and
Lim 2018), the fast resurgence of Artificial Intelligence (AI) after several decades
of slow and unsatisfactory advances. Recently AI, thanks to the vast success of
deep learning, has appeared as a highlight on the covers of journals such as Science
(July 2015), Nature (January 2016), and The Economist (May 2015). The worldwide
investment in private companies focused on AI increased from $589 million in
2012 to over $5 billion in 2016 (from the CB Insights database). The AI economy is
dominated by deep learning, which, according to the McKinsey Global Institute
(Chui et al. 2018) accounts for about 40% of the annual value potentially created
by all analytics techniques, and can potentially enable the creation of between
$3.5 trillion and $5.8 trillion in value annually.
During this decade a number of exciting results have stirred a great deal of
attention on deep learning; we report here just a couple. In 2012, the group at the
University of Toronto led by Geoffrey Hinton, the inventor of deep learning, won
the most challenging image classification competition. Soon Hinton was invited
by Google, which adopted deep learning for its image search engine. In 2016, the
company DeepMind, founded by Demis Hassabis and soon acquired by Google,
defeated the world champion of Go, the Chinese board game much more
complex than chess (Silver et al. 2016). The leading Internet companies were
among the first in employing deep learning at scale (Hazelwood et al. 2018)
and are also the largest investors in research, well beyond their own applications. The
release of deep learning programming frameworks such as Google’s TensorFlow
(Abadi et al. 2015), Facebook’s PyTorch (Ketkar 2017), Apache’s MXNet (Chen
et al. 2015) boosted the deployment of deep learning in a vast range of applica-
tions (Liu et al. 2017; Jones et al. 2017).
The remarkable performances of deep learning were totally unexpected. As
we will describe in Sect. 2.1 it is a derivation from artificial neural networks,
a field that was stagnating at the beginning of this century. After decades of
intense development, briefly summarized in Sect. 2, artificial neural networks
seemed to have exhausted their potential, and how relatively small adjustments
have led to such an impressive resurrection remains largely unexplained. One may
argue that deep learning is, after all, a representative component of AI which
typically suffers recurrent periods of excessive enthusiasm, followed by rounds of
disappointment known as “AI winters”. Therefore, it might be that deep learn-
ing is just enjoying its evanescent turn of summertime hype. There are certainly
sociological and marketing factors contributing to the current fortunate period
of deep learning; we try in Sect. 3.1 to distill out of the various facets of its
"success" what really certifies its computational superiority compared to other
non-neural technologies. This analysis has led us to feel confident that deep
learning, indeed, works well, in fact so well that it is hard to understand why.
In the last few years, there has been an increasing focus on searching for valid
explanations, and the two most popular ones point to hardware performance
and neuroscience. We discuss the first explanation in Sect. 4, where we
express our skepticism about ascribing the success of deep learning to computa-
tional power only.
The second account, neuroscience, is also very common, and often asserted
as undisputed fact, as in Arel et al. (2010, p. 13):
Recent neuroscience findings have provided insight into the principles gov-
erning information representation in the mammal brain, leading to new
ideas for designing systems that represent information. [...] This discov-
ery motivated the emergence of the subfield of deep machine learning,
which focuses on computational models for information representation that
exhibit similar characteristics to that of the neocortex.
Deep learning models are, after all, artificial neural networks, where the connec-
tion with neuroscience is explicitly claimed. No doubt there is a connection;
however, the role of neuroscience in explaining the effectiveness of deep
learning is not obvious and direct, and deserves a careful analysis. By borrowing
from philosophy of science the distinction between the contexts of discovery
and justification (Reichenbach 1938; Schickore and Steinle 2006), we would
agree on a clear role of neuroscience in the context of the discovery of deep
learning, much less in the justification of its successful performances. In Sect. 2
we will first look at the influence of brain science on artificial neural networks
from a historical perspective, and in Sect. 5 we discuss their relationship under
the computational perspective. We will delve deeper into the case of vision, in
Sect. 5.2, because several recent findings suggest that analogies with biological
vision may explain performances of deep learning in visual object recognition.
The domain that looks more appropriate for a justification of the functioning
of deep learning is that of mathematics, that will be discussed in Sect. 6. We
will see that, as strange as it may seem, a formal explanation of why the
mathematics of deep learning is so effective is still missing.
Eventually, we will use a more refined category in philosophy of science, due
to Nickles (2006), which stands somehow in between the context of discovery
and of justification, termed heuristic appraisal. It shares with classical justifica-
tion the aim of an objective explanation of a theory, independent of the act of
conceiving or inventing it. However, heuristic appraisal does not foster the truth-
conducive features of traditional justification, distinguished by Nickles as epistemic
appraisal, but rather attends to a variety of heuristic and pragmatic considerations
in a research program. We will show in Sect. 7 that several of the inventions
composing the success of deep learning seem to fit well within this category.


2 An empiricist revenge

Deep learning evolved from artificial neural networks, therefore the name itself
reveals that neuroscience must have constituted, to some extent, a context for the
development of this technique. We try now to assess what has been the role of
neuroscience in the rise of deep learning. More historical details on the interplay
between neuroscience and computing can be found in Plebe and Grasso (2016).
The extraordinary developments of neuroscience at the beginning of the twentieth
century (Ramón y Cajal 1917; von Economo and Koskinas 1925; Lorente
de Nó 1938) have been very influential in many fields of knowledge of that time.
Before the existence of digital computers, a highly celebrated paper (McCull-
och and Pitts 1943) came up with a bold connection between neurobiology and
mathematical logic. Seemingly (Anderson and Rosenfeld 2000, p. 3), the work
of Leibniz (1666) was the first source of inspiration for Pitts, in the fascinating
enterprise of explicating thinking with logic at the physiological level. Moreover,
Pitts strengthened his skills in logic studying with Rudolf Carnap and tried to
adopt his specific formalism (Carnap 1938) in the paper with McCulloch. Even if
this paper did not advance neuroscience, and was basically flawed in the inter-
pretation of neural behavior (Lettvin et al. 1959), it has been highly influential on
the forthcoming field of artificial neural networks (Piccinini 2004).
In the brave new world of computing, Turing (1948) himself was the first to
advance the idea that computers can be designed borrowing hints from biological
neurons. He envisioned a machine based on distributed interconnected elements,
called B-type unorganized machine. Turing’s neurons were simple NAND gates
with two inputs, randomly interconnected; each NAND input could be con-
nected or disconnected, so that a learning method could "organize" the machine
by modifying the connections. His idea of learning generic algorithms by rein-
forcing useful links and by cutting useless ones was the most farsighted idea of this
report, anticipating the empiricist approach characteristic of deep learning. Not
so farsighted was his employer at the National Physical Laboratory, where the
report was produced, who dismissed the work as a “schoolboy essay”. Therefore,
this report remained hidden for decades, until upheld by Copeland and Proudfoot
(1996).
Under the influence of the newborn cognitive science, early AI opposed empir-
icism, in favor of a rationalist approach, as in the problem solving algorithms
(Newell and Simon 1972), and the research on artificial neural networks was mar-
ginalized (Minsky and Papert 1969). It was only in the late '80s that artificial
neural networks found their way, with the PDP (Parallel Distributed Processing)
project of Rumelhart and McClelland (1986). The basic structure of the “parallel
distributed” is made of simple units organized into distinct layers, with unidirec-
tional connections between each layer and the next one. This structure, known as
a feedforward network, is preserved in most deep learning models. The values of the
units, affectionately called “neurons”, are computed with the following equations:


Fig. 1  Examples of artificial neural feedforward networks: in (a) a "shallow" network with three layers, in (b) a "deep" network. In network (a) the mathematical symbols correspond to those used in Eq. (1) in the text

$$
\begin{aligned}
\mathbf{x}_1 &= \mathbf{A}^{(I)}\,\mathbf{x} + \mathbf{b}^{(I)}, \\
\hat{f}(\mathbf{x}) &= \mathbf{A}^{(O)}\,\mathbf{x}_N + \mathbf{b}^{(O)}, \\
x_{i,k} &= h\!\left(\mathbf{w}_{i,k} \cdot \mathbf{x}_{i-1} - \theta_{i,k}\right), \qquad 1 < i < N.
\end{aligned} \tag{1}
$$

The first layer, ruled by Eq. (1), simply provides input values to the network, nor-
malized with the linear operators 𝐀(I) and 𝐛(I). The top layer is where the output
data appear. The entire feedforward network can be expressed as a function f̂ (𝐱) of
the input vector 𝐱, and, as described by Eq. (1), it is normalized again to meet the
desired data range. The real dirty work is done by the layers in between, according
to Eq. (1). In a layer i, each unit xi,k sums up all the values from the previous level,
weighted by parameters 𝐰i,k , and the result is modified by a non-linear activation
function h(⋅), such as the sigmoid or the hyperbolic tangent. A sketch illustrating the
computation in feedforward networks is provided in Fig. 1a.
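As a purely illustrative sketch, not part of the original PDP formulation, the computation of Eq. (1) can be rendered in a few lines of Python/NumPy; the layer sizes, the choice of the hyperbolic tangent as h(·), and the random parameter values are arbitrary assumptions.

```python
import numpy as np

def forward(x, A_in, b_in, hidden, A_out, b_out):
    """Forward pass of a feedforward network in the spirit of Eq. (1).

    `hidden` is a list of (W, theta) pairs, one per hidden layer:
    W has one row of weights w_{i,k} per unit, theta holds the thresholds.
    """
    x_prev = A_in @ x + b_in                  # input layer: linear normalization
    for W, theta in hidden:                   # hidden layers, middle line of Eq. (1)
        x_prev = np.tanh(W @ x_prev - theta)  # h(w_{i,k} . x_{i-1} - theta_{i,k})
    return A_out @ x_prev + b_out             # output layer: linear renormalization

# Illustrative shapes: 3 inputs, one hidden layer of 4 units, 2 outputs.
rng = np.random.default_rng(0)
net = dict(
    A_in=np.eye(3), b_in=np.zeros(3),
    hidden=[(rng.normal(size=(4, 3)), np.zeros(4))],
    A_out=rng.normal(size=(2, 4)), b_out=np.zeros(2),
)
print(forward(np.array([0.1, -0.3, 0.7]), **net))
```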
Parallel distributed processing reestablished a strong empiricist account, with
models that learned from scratch any possible meaningful function just by experi-
ence. The success of PDP was largely due to an efficient mathematical rule, known
as backpropagation, for adapting the connections between units, from examples of
the desired function between known inputs and outputs. Let 𝐰 be the vector of all
learnable parameters in a network, such as 𝐰i,k in Eq. (1), and L(𝐱, 𝐰) a measure of
the error of the network with parameters 𝐰 when applied to the sample 𝐱; back-
propagation then updates the parameters iteratively, according to the following formula:

$$
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \nabla_{\mathbf{w}} L\!\left(\mathbf{x}_t, \mathbf{w}_t\right) \tag{2}
$$

where t spans over all available samples 𝐱t , and 𝜂 is the learning rate. Since the
learning rate is typically a small value, Eq. (2) produces a tiny modification in all
parameters 𝐰, such that the error made by the neural network on the current 𝐱t is
slightly reduced. By iterating this procedure many times, the network gradually
converges to an approximation of the unknown function sampled at the 𝐱t. We will delve into
historical details of this cornerstone learning method in Sect. 6.2. The mathematics
of learning in deep networks is an evolution and a refinement of the same backprop-
agation rule for learning in PDP models, and in fact Geoffrey Hinton himself was
one of the main contributors to the PDP project (Hinton et al. 1986).
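The update rule of Eq. (2) can be illustrated with a minimal sketch; the toy loss, its gradient, and the learning rate below are invented placeholders, and in actual backpropagation the gradient would be computed by propagating the error backwards through the layers of Eq. (1).

```python
import numpy as np

def grad_L(x, w):
    # Placeholder gradient of an error L(x, w); real backpropagation would
    # compute this layer by layer with the chain rule.
    return 2 * (w - x)              # gradient of the toy loss ||w - x||^2

def train_per_sample(samples, w, eta=0.01, epochs=10):
    """Iterate Eq. (2): w_{t+1} = w_t - eta * grad_w L(x_t, w_t), one sample at a time."""
    for _ in range(epochs):
        for x_t in samples:
            w = w - eta * grad_L(x_t, w)
    return w

samples = np.random.default_rng(1).normal(size=(100, 5))
print(train_per_sample(samples, np.zeros(5)))   # drifts towards the sample mean
```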

2.1 From shallow to deep

The “deep” addition to PDP style of feedforward network is just in the number of
layers between the input and output layers, usually called “hidden” layers. Neural
models can learn increasingly complex functions by augmenting the number of
units; this way, however, the number of parameters to optimize increases as well,
and learning becomes more difficult. In particular, it was observed that increasing
the number of units by adding layers was much less efficient than increasing the
width of a single hidden layer, as reported, for example, by de Villiers and Barnard (1992):
We have found no difference in the optimal performance of three- and four-
layered networks [...] four layer networks are more prone to the local min-
ima problem during training [...] The above points lead us to conclude that
there seems to be no reason to use four layer networks in preference to three
layers nets in all but the most esoteric applications.
In Fig. 1 two examples of networks are juxtaposed: the one in (a) respects the
golden rule of no more than one hidden layer; the one in (b) disregards this
rule, possibly aimed at esoteric applications, as conceded by de Villiers and Barnard.
The dogma of no more than three layers was broken by Hinton and Salakhutdi-
nov (2006), with a novel learning strategy, called deep belief network. The idea
derived from a neural architecture, called Boltzmann Machines (Aarts and Korst
1989), in which neurons have binary values that can change stochastically, with
probability given by the contributions of the other connected neurons. Boltzmann
Machines adapt their connections in an unsupervised way, with a sort of energy
minimization; this is the reason for the dedication to the great Austrian physicist.
The clever trick of Hinton was to take two adjacent layers in a feedforward net-
work, and train them as Boltzmann Machines. The procedure starts with the input
and the first hidden layer, so that it is possible to use the inputs of the dataset to
train the unsupervised Boltzmann Machine model. Then, this model is used to
generate a new dataset, just by processing all the inputs. This new set is used to
train the next pair of layers. This procedure is a sort of pre-training that gives a
first shape to all the connections in the network, to be further refined by ordinary
backpropagation using both the inputs and the known outputs of the dataset.
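A minimal sketch of the greedy layer-wise procedure just described is given below, with each pair of layers trained as a restricted Boltzmann Machine by one-step contrastive divergence; the layer sizes, the learning rate, the omission of bias terms, and the binary units are illustrative simplifications, not the exact recipe of Hinton and Salakhutdinov (2006).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, eta=0.05, epochs=5):
    """Train one RBM on `data` with one-step contrastive divergence (CD-1)."""
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        for v in data:
            h_prob = sigmoid(v @ W)                    # upward pass
            h = (rng.random(n_hidden) < h_prob) * 1.0  # stochastic binary units
            v_recon = sigmoid(W @ h)                   # downward reconstruction
            h_recon = sigmoid(v_recon @ W)
            W += eta * (np.outer(v, h_prob) - np.outer(v_recon, h_recon))
    return W

def pretrain_deep_net(data, layer_sizes):
    """Greedy layer-wise pre-training: train an RBM on the current data, then use
    its hidden activations as the dataset for the next pair of layers."""
    weights = []
    for n_hidden in layer_sizes:
        W = train_rbm(data, n_hidden)
        weights.append(W)
        data = sigmoid(data @ W)       # generate the dataset for the next layer
    return weights                     # to be refined by ordinary backpropagation

# Illustrative binary dataset: 200 samples of 20 "pixels", two hidden layers.
data = (rng.random((200, 20)) < 0.3) * 1.0
weights = pretrain_deep_net(data, layer_sizes=[16, 8])
print([W.shape for W in weights])      # [(20, 16), (16, 8)]
```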
Here we can dispense with the mathematical details of the deep belief network,
because this first success in training deep networks boosted the research, leading
to simpler yet effective solutions. The most popular in deep learning is just a
slight modification of backpropagation, in the following equation:


Fig. 2  Examples of artificial neural networks different from feedforward revitalized as deep. On the left the basic twin of layers of deep convolutional neural networks, with first a convolution, in this example with just two different kernels, followed by pooling for dimensional reduction. On the right the basic organization of a recurrent neural network, in this example the input vector has three components, and two consecutive time steps are shown. The hidden layer at time step 2 collects, in addition to the contributions from the input layer, its same values at the previous time step, with their own weights (for clarity only connections from the leftmost unit in the hidden layer are shown)

$$
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\, \frac{1}{M} \sum_{i=1}^{M} \nabla_{\mathbf{w}} L\!\left(\mathbf{x}_i, \mathbf{w}_t\right) \tag{3}
$$

where instead of computing the gradients over a single sample t, a stochastic estima-
tion is made over a random subset of size M of the entire dataset, and at each itera-
tion step t a different subset, with the same size, is sampled. This way, the parameters
at the next iteration step are determined by the parameters at the previous iteration
step adjusted by the mean sampled gradient over M samples. Despite strong similar-
ity between Eqs. (2) and (3), the term "backpropagation" is now out of fashion, and
techniques related to Eq. (3) are referred to as stochastic gradient descent. More will
be said about the climate conducive to the success of this technique in Sect. 6.2.
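A minimal sketch of the stochastic gradient descent update of Eq. (3) follows; the dataset, the placeholder loss gradient, the batch size M and the learning rate are illustrative assumptions.

```python
import numpy as np

def sgd(dataset, w, grad_L, eta=0.01, M=32, steps=1000):
    """Eq. (3): at each step, average the gradient over a random subset of size M."""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        batch = dataset[rng.choice(len(dataset), size=M, replace=False)]
        grad = np.mean([grad_L(x, w) for x in batch], axis=0)  # mean sampled gradient
        w = w - eta * grad
    return w

# Toy example with the same placeholder gradient as in the sketch of Eq. (2):
# the parameters drift towards the mean of the data.
data = np.random.default_rng(1).normal(loc=3.0, size=(1000, 5))
print(sgd(data, np.zeros(5), grad_L=lambda x, w: 2 * (w - x)))
```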
The first favourable results in training deep feedforward models paved the road
for the revitalization of other old neural models, by adding more hidden layers.
It is the case of DCNN (Deep convolutional neural network) for image process-
ing, that will be discussed in detail in Sect. 5.2, where instead of fully connected
feedforward units, layers are made of two-dimensional convolutions, as shown
in Fig. 2 on the left. Another case is the rejuvenation of RNN (Recurrent neu-
ral networks). These models are much like feedforward networks, with the addi-
tion of recursive connections in the hidden layer that allow for a sort of memory,
enabling the processing of temporal data. This architecture is sketched in Fig. 2
on the right. RNNs were first introduced by Elman (1990) for the simulation of
psycholinguistic phenomena, and today natural language processing is still the
prevailing domain where RNNs are deployed. There are a few variations on the basic
RNN, like LSTM (Long short-term memory) (Schmidhuber 2015) that uses addi-
tional controls on the recurrent connections, in order to maintain memory of sig-
nals over a long span of time. In all its variants and applications, deep learning
preserves the main philosophy of radical empiricism: its chances of functioning
depend entirely on learning from experience.
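The recurrent step sketched in Fig. 2 (right) can be illustrated as follows: the hidden layer at each time step combines the current input with its own values at the previous time step, each with their own weights. The dimensions, the tanh nonlinearity and the random weights are illustrative assumptions; an LSTM would add gating controls on the recurrent connections.

```python
import numpy as np

def rnn_run(inputs, W_in, W_rec, b):
    """Process a sequence: h_t = tanh(W_in x_t + W_rec h_{t-1} + b)."""
    h = np.zeros(W_rec.shape[0])        # hidden state starts empty
    states = []
    for x_t in inputs:                  # one step per element of the sequence
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        states.append(h)
    return np.array(states)

# Illustrative sizes: input vectors with 3 components, 5 hidden units, 4 time steps.
rng = np.random.default_rng(0)
W_in, W_rec, b = rng.normal(size=(5, 3)), 0.1 * rng.normal(size=(5, 5)), np.zeros(5)
sequence = rng.normal(size=(4, 3))
print(rnn_run(sequence, W_in, W_rec, b).shape)   # (4, 5)
```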


3 Success and limits of deep learning

In this section we try to assess the degree of success of deep learning, arguing that it
is so extraordinary as to deserve a thorough explanation. The requirement for an expla-
nation emerges from the coincidence of the intensity of the success on the one hand,
and the apparent lack of a drastic shift in technology on the other hand. As seen in
the previous sections, deep learning appears in close continuation with technolo-
gies existing 40 years ago. As we will try to show now, its success has been abrupt
and intense. It is commonplace among scholars of scientific discovery (Peirce 1935;
Hanson 1958; Simon 1977) that the quest for explanation typically stems from
observation of surprising facts. In the case of deep learning too, the surprise arose
from the coincidence just described.

3.1 Aspects of success

How to assess "success" is a topic widely discussed in philosophy of science, of
course in relation to the degree of success of a scientific theory. Probably Laudan
(1984) has drawn one of the most systematic accounts of "success", which should be
assessed by the capability of a theory to:

1. Acquire predictive control over parts of the world;
2. Acquire manipulative control over parts of the world so as to be able to modify
events;
3. Increase the precision of the parameters governing the explanations of natural
phenomena;
4. Integrate and simplify the various components of our picture of the world.

Even if the first two points might be of a certain relevance for deep learn-
ing—which is indeed exploited in predictions and controls—it is not a theory
of any part of the world. Still, Laudan comes in useful here, because his conceptual
account of “success” is precisely relevant in our case. Laudan has stressed that
success is not a valuational or a normative concept, but should be handled as
a relational concept. In the case of deep learning too, we will try to capitalize
on Laudan’s account, by relativizing aspects of “success” within some relevant
contexts. In the case of a scientific theory, the relation against which success is
evaluated is in the set of relevant goals, such as those listed above. Activities that
do not qualify as science in search of a theory have different goals, and this dis-
tinction has been much discussed in philosophy of science. For Niiniluoto (1993)
the distinction between basic and applied research can be best drawn in terms of
"utilities", which we can take as almost synonymous with Laudan's goals. Basic science
is characterized as the attempt to maximize “epistemic utilities”, for Niiniluoto
specifically as “truthlikeness”, the combination of truth and information. Applied
research is further differentiated in technology and applied science. For technol-
ogy the relevant utilities are practical utilities, first of all its effectiveness relative
to the intended use, plus other possible utilities such as economic, ergonomic,
aesthetic, ethical. Applied science can be evaluated both in terms of epistemic
and practical utilities, plus its own utilities such as simplicity or manageability.
A tripartite framework is also proposed by Hendricks et al. (2000) in terms of
pure science, applied science and engineering science, and their equivalent to
Laudan’s goals and Niiniluoto’s utilities are “values”. Prominent values for pure
science are truth and explicit justification, but could also include simplicity, unifica-
tion, and consistency. Conversely, salient values for engineering science are efficiency
and practical usefulness. Applied science shares some values of pure science and
engineering science.
Where does deep learning fit into this picture? The most straightforward and
simple answer is that deep learning is a pure engineering effort, therefore its goals
include practical but not epistemic utilities. Therefore, it would be correct to evalu-
ate the success of deep learning against practical utilities only. Behind this simple
answer, however, lies a set of more subtle considerations that will be dealt with in
Sect. 3.2, where possible epistemic utilities will be evaluated as well.
In the introduction we mentioned some of the more superficial aspects of success,
such as gaining headlines in famous magazines. Although these are certainly the
less measurable and less significant aspects of success, they still contribute to the
character of “surprise” of the rise of deep learning, mentioned at the beginning of
this section.
A first measurable aspect of pragmatic success is the trend of scientific publi-
cations (Hemlin 1996). Ziman (2000, p. 258) suggested that—ideally—the best
account of “scientific knowledge” is the accumulated archive of publications, there-
fore “scientific progress can be directly measured by the growth of the archive”.
However, we know how far from ideal the peer review process is (Cicchetti 1991).
Moreover, the volume of scientific production derives from competition for intellec-
tual and economic resources, driven by a variety of considerations related to research
and technology policies. Nevertheless, bibliometrics is supposed to provide—at least—
a rough and partial measure of scientific success (Daniel 2005). In order to rela-
tivize this measure, we analyze the trend of publications on deep learning together
with other components of AI. As discussed in Sect. 2, the two most outstanding
components correspond roughly to the well established philosophical traditions of
rationalism and empiricism, with deep learning inside the latter. Alongside rational-
ism and empiricism other philosophical guises can be found in AI, such as Darwin-
ism, followed by Holland (1975), who invented genetic algorithms (GA) in the 1960s.
For years GA remained marginal, developed only by the group of Holland at Ann Arbor,
and were not even considered part of AI, but in the late 1990s they gradually gained
attention, becoming an important component of AI (Booker et al. 2005).
Because of this plurality of components, the progress of AI over time sums up
different trends, with a periodic alternation of progression and stagnation between
components. This phenomenon is well observed for the two main constituents of
AI, the rationalist and the empiricist traditions. When rationalism was ascendant,
as between 1970 and 1990, or between 2000 and 2005, empiricism languished; during
rationalist stagnation, as between 1990 and 2000, empiricism thrived. Now it is
empiricism's fortunate time again, and rationalism languishes.


Fig. 3  Yearly number of publications from 1960 to 2018, searching Google Scholar for the following
keywords: “artificial intelligence” (AI), “artificial neural networks” (ANN), “expert systems” (ES),
“genetic algorithms” (GA), and “deep learning” (DL)

In Fig. 3 we collected the number of publications with keywords related to AI
and to its components, in the period between 1960 and 2017, using Google Scholar.
We used the keyword “expert system” as the best representative of the rationalist
component of AI, and we searched for “deep learning” and “artificial neural net-
work” independently. It is clear in Fig. 3 that the increase around 2008 for the key-
word “artificial intelligence” is largely boosted by deep learning. It is the only com-
ponent inside AI currently growing at an exponential rate. Artificial neural networks
are on a positive trend, but with a slower increase rate in recent years, and have been
outranked by deep learning in 2017. All the other components are in a decreasing
phase. These results are consistent with other recent bibliometric analyses of AI,
which are more limited in terms of time span (Niu et al. 2016) or keywords (Lu
2019). Thus, under the aspect of scientific production in relation with the other com-
ponents inside AI, and acknowledging the limitations of the bibliometric analysis
discussed above, deep learning is indeed successful.
The aspect of success that most of all characterizes deep learning in terms of its
scientific value pertains to its performance on benchmark tasks. The context of com-
parison is with competing methods, developed for the same tasks.
We collected a selection of representative tasks in Table 1, for each we show the
score of the best non neural algorithm (to the best of our knowledge), the first deep
learning algorithm experimented on the task, and the current best algorithm (again,
to the best of our knowledge). The topmost row is the ImageNet Large-Scale Visual
Recognition Challenge, whose dataset is made of more than one million images
grouped in a thousand categories, organized according to the hierarchy of nouns in
the lexical dictionary WordNet (Fellbaum 1998). It has been probably the most strik-
ing success of deep learning, both in terms of the impressive gap between the best

Table 1  Performances of deep learning on several benchmark tasks in the domains of image processing (upper rows) and natural language processing (bottom rows)

Benchmark | Best non-DL | First DL | Best DL
ILSVRC (Russakovsky et al. 2015) | 25.8 (Sánchez and Perronnin 2011) | 16.4 (Krizhevsky et al. 2012) | 0.02 (Hu et al. 2018)
CIFAR-10 (Krizhevsky and Hinton 2009) | 20.0 (Bo et al. 2011) | 11.2 (Cireşan et al. 2012) | 2.9 (Pham et al. 2018)
PASCAL-VOC (Everingham et al. 2010) | 64.2 (Cinbis et al. 2012) | 37.6 (Girshick 2015) | 13.2 (Pham et al. 2018)
MCTest-500 (Richardson et al. 2013) | 32.2 (Sachan et al. 2015) | 29.0 (Trischler et al. 2016) | 29.0 (Trischler et al. 2016)
MT-news-test-En-De (Bojar et al. 2014) | 20.7 [BLEU] (Durrani et al. 2014) | 20.7 [BLEU] (Zhou et al. 2016) | 28.4 [BLEU] (Vaswani et al. 2017)
SWITCHBOARD-Hub5 (Godfrey et al. 1992) | 23.9 (Hain et al. 2005) | 12.6 (Veselý et al. 2013) | 5.5 (Saon et al. 2017)

For each benchmark the best non-neural algorithm, the first deep learning algorithm (DL), and the current best DL algorithm are compared. Values, when not otherwise specified, are expressed in percentage error

non-neural and the first deep learning solution, and in terms of the improvements
achieved in subsequent deep models. The figures in Table 1 certify the success of
deep learning in terms of relative performances, indicating also a more rapid and
consistent gain in the domain of image processing, compared to natural language
processing.
Still, the table provides only a partial picture, for practical reasons. In any field
of computing, benchmarks evolve continuously in order to provide increasingly
challenging contexts, following technological improvements. For the task of inter-
est here, benchmark updating has been even faster paced in the period across 2012,
precisely for the sudden shift in performances brought about by deep learning. It has
to be stated that it has not been easy to find benchmark challenges stable for a period
covering both non-neural and deep learning winning algorithms. For example, the
MCTest of open-domain question/answer is now replaced by the larger and more
challenging RACE benchmark (Lai et al. 2017), dominated by neural algorithms
(Zhu et al. 2018). Today both ILSVRC and PASCAL VOC have been discontinued
and the current most popular challenges in image processing, like MSCOCO (Vin-
yals et al. 2016), are hopeless for non-neural methods.
There are several other domains introduced for evaluating and comparing AI
technologies, in which the competition is now restricted to deep learning algorithms
only, for their overwhelming superiority. It is the case of gaming, which has close
and long-standing ties to AI (Shannon 1950), and has more recently been proposed in the
format of modern computer games (Laird and van Lent 2001). We already included
in the events of major excitement for deep learning its success at the traditional
Chinese game Go. Advanced interactive computer games, however, are supposed
to offer more variety of tasks and domains, thus providing excellent challenges for
evaluating the development of general, domain-independent AI technology. The
Arcade Learning Environment (Bellemare et al. 2013) is one of the most popular
interfaces to computer game environments, set up as a benchmark for AI. A few years
after its introduction, a deep learning model developed by DeepMind, called DQN
(Deep Q-Network), not only surpassed the performances of all non-neural algo-
rithms on all games in the Arcade collection, it achieved a level comparable to that
of a professional human player across most of the games (Mnih et al. 2015). In
summary, results over current benchmarks in a variety of domains certify a sig-
nificant success of deep learning, under the goal of practical utilities.

3.2 Epistemic success

Up to now, we have assumed that only practical utilities pertain to deep learn-
ing. Epistemic utilities have been dismissed too easily, however. There are rea-
sons to look at deep learning as a potential source of new knowledge. First of
all, AI has long been conceived as a possible method for achieving understand-
ing and predicting the behavior of the mind (Simon 1996). But, above all, deep
learning is the descendant of a lineage—artificial neural networks—designed for
knowledge. The epistemic ambitions of the PDP project were boldly stated in the
subtitle of the book by Rumelhart and McClelland (1986): "Explorations in the
Microstructure of Cognition". For more than a decade, artificial neural networks
have been a hot topic in philosophy of mind (Fodor and Pylyshyn 1988; Pinker
and Prince 1988; Ramsey et al. 1991), psychology (Quinlan 1991; Karmiloff-
Smith 1992), and linguistics (Elman et al. 1996; MacWhinney 1999). Deep learn-
ing, however, did not stand in this tradition: its models are developed with engi-
neering goals in mind, without any ambition or interest in exploring cognition,
even if most of the protagonists are the same as those of earlier artificial neural networks,
like Hinton. This is the reason why, prima facie, we have excluded epistemic util-
ities in our discussion.
Nevertheless, the wave of success of deep learning, in the facets just described,
has raised the question of whether it might be the case that deep learning is advancing
our knowledge and our understanding of the mind. In fact, there is a widespread
opinion that the capability of deep learning of producing specialized AI applications,
which behave intelligently in many individual task areas, is nearing the realization of
AGI, human-like Artificial General Intelligence (Batin et al. 2017; Özkural 2018).
There are, however, critics who challenge the possibility of achieving AGI by simply
summing up many independent specialized AI applications, such as those provided by
deep learning (Lake et al. 2017; Landgrebe and Smith 2019). It is not surprising that
the highest skepticism about the progress in AGI possibly propelled by deep learn-
ing, comes from its opposite philosophical perspective, rationalism. In their recent
review of 84 different cognitive architectures proposed during the past 40 years,
Kotseruba and Tsotsos (2018) simply dismiss deep learning altogether, with the
following justification (p. 7):
Recently, claims have been made that deep learning is capable of solving AI
[...] However, the question is where does this work stand with respect to cogni-
tive architectures? [...] Although particular models already demonstrate cogni-
tive abilities in limited domains, at this point they do not represent a unified
model of intelligence.
Marcus (2018) presented ten arguments challenging the prospect of deep learning
providing epistemic utilities, and reaching AGI. The most compelling arguments
are all articulated as fundamental advantages of rationalist approaches over pure
empiricist approaches, such as difficulties of deep learning to deal with hierarchical
logical structures, to integrate with prior knowledge, or its inherent inability to dis-
tinguish causation from correlation.
These are interesting and important considerations, but a crucial difference with
respect to the debate on earlier neural networks is the lack of interest of the deep
learning community in defending epistemic utilities. On the contrary, leading figures
in the empiricist community of AI simply acknowledge some of the constitutional
limitations in deep learning highlighted by its critics. Chollet (2018, p. 325), who
developed Keras, one of the most popular frameworks for deep learning modeling,
writes:
In general, anything that requires reasoning – like programming or applying
the scientific method – long-term planning, and algorithmic data manipulation
is out of reach for deep-learning models, no matter how much data you throw
at them. [...] A deep-learning model can be interpreted as a kind of program;
but, inversely, most programs can't be expressed as deep-learning models.
In summary, the epistemic achievements of deep learning—if any—would be some
sort of by-product, given that the primary goals are engineering in kind. We will
explore in more depth possible epistemic successes as a quasi-secondary outcome of
deep learning in the case of vision, in Sect. 5.2. Our main point here is that deep
learning, even when suspending an evaluation of its epistemic merits, has gained an
intensity of success in applications that is so surprising as to call for explanation.

4 The serious side of ludic hardware

The two prevailing attempts to satisfy the desire for an explanation of the success
of deep learning point to hardware performance or neuroscience; in this section we
look in some detail at the first one. Allegedly, deep learning succeeded because com-
putational power has increased, allowing training of models with very many param-
eters, over large datasets, unfeasible at the time of previous generation artificial
neural networks. In fact, all along the history of artificial neural networks, efforts
have been taken to design computer hardware optimized for neural computations.
This endeavor has been always unsatisfactory, mainly due to the limited advantages
of neural hardware and its high cost, compared to standard processors (Plebe and
Grasso 2016). On the contrary, a great benefit for neural computation originates,
unexpectedly, from technologies conceived for the entertainment industry, specifi-
cally for videogaming.
An interesting earlier example of exchanging technologies between artificial neu-
ral networks and more mundane applications was given in the mid 90’s by Philips.
The L-Neuro chip (Theeten et al. 1990; Maudit et al. 1992) was developed within
the European Esprit project Galatea (Alippi and Vellasco 1992) as specialized
hardware for neural networks; it had no application success, but some of the design
concepts have been exploited by Philips' TriMedia family of processors, targeted for
digital television (Slavenburg et al. 1996).
More recently a new class of electronic devices, developed for gaming and com-
puter graphics applications, has been exploited for scientific computing and, more
in general, parallel computation. This technology, embedded in graphics process-
ing units (GPUs), consists of a specialized electronic circuit, designed to rapidly
manipulate and alter memory to accelerate the creation of images in a frame buffer,
intended for output to a display device. GPUs are used in many computing devices,
ranging from mobile phones, to personal computers, to workstations, and game
consoles. Modern GPUs are very efficient at manipulating computer graphics and
image processing, and their highly parallel structure makes them more efficient than
general-purpose CPUs for algorithms where the processing of large blocks of data
is done in parallel. This latter characteristic of GPUs, namely the ability to process
in parallel large volumes of data, is very well suited for implementing computa-
tional models that are intrinsically parallel, such those employed for artificial neural
networks.


In spite of this very high potential of GPUs in crunching ANN data at very high
speed, as late as 2008, a review of potential applications for GPUs, different from
gaming (Wu and Liu 2008), did not include neural networks at all. One of the most
important breakthroughs in the development of GPUs for neural computation, and
more generally for scientific computing overall, has been the introduction of the
CUDA framework by NVIDIA in 2007. CUDA is a parallel computing platform
and application programming interface (API) model created by NVIDIA. It allows
software developers to exploit GPUs, made by NVIDIA, for general purpose pro-
cessing—an approach termed GPGPU (General-Purpose computing on Graphics
Processing Units). The CUDA platform is a software infrastructure that gives direct
access to the GPU’s parallel processing capabilities, interfacing the virtual instruc-
tion set embedded in the graphics processor (Compute Unified Device Architecture)
(Sanders and Kandrot 2014).
It was in 2013 that a system developed at Stanford University, with the direct
collaboration of NVIDIA, paved the road to a new era in deep learning, powered by
what has been called “COTS HPC” (Commodity Off-The-Shelf High Performance
Computing). The system was build using 16 servers each with 4 NVIDIA GTX680
GPUs, and training a deep network with 11.2 billion parameters took less than 1 s
for a single mini-batch of 96 images (Coates et al. 2013).
Similar to the contamination of gaming and graphics hardware technology with
that of deep learning research, GPUs have contributed to the acceleration of other
fields, such as, for example, that related to blockchain frameworks. However, the
advantages specific to deep neural networks are such that complete GPU-based
computer systems have now been developed, such as NVIDIA Drive PX and Jet-
son AGX (both released in 2015), enabling deep learning intelligence inside robots,
drones and self-driving cars. It is important to note that this shift in GPU design
towards neural computation is a consequence of the success of deep learning in
applications, described in Sect. 3.1, and therefore cannot be used as an explanation
of the origin of the success itself.

5 Neuroscience and deep learning

Quite often the history of artificial neural networks is depicted as an interaction
between the fields of neuroscience and computer science, closer and more
collaborative than it actually has been. For example, Hassabis et al. (2017, p. 246)
narrate that “the implications of this method [backpropagation] for understanding
intelligence, including AI, were first appreciated by a group of neuroscientists and
cognitive scientists, working under the banner of parallel distributed processing
(PDP).” In fact, in the PDP group there was just one representative of neuroscience,
Terrence Sejnowski, although he came from a PhD in physics. Most of the
main players in the PDP group had a background in psychology (Geoffrey Hinton,
Michael Jordan, James McClelland, David Rumelhart), some in linguistics (Jeffrey
Elman, Paul Smolensky, although his Ph.D was in mathematical physics).
In this section we try to trace back the relations between neuroscience and arti-
ficial neural networks, pointing to the period in which a division between the two

different kinds of computation occurred. One followed the path taken by artificial
neural networks, the other served as an effective tool of investigation for neuroscien-
tists. The two paths, as we will see, gradually diverged. One can envisage a partial
reconciliation between deep learning and neuroscience in the case of vision, as dis-
cussed in Sect. 5.2.

5.1 Two diverging paths

Most, if not all, the protagonists of the developments from PDP up to deep learning
were fond admirers of neuroscience, and certainly were motivated in their research
by the suggestion of capturing, in software, various aspects of the brain. However,
it was a kind of unrequited love. James Bower (Miller and Bower 2013, p. 5) recalls
one of the first neural network meeting at Santa Barbara in 1983, where participants
“represented a remarkable mix of scientists and government officials [...] only two
made any claim to being real biologists, myself and Terry Sejnowski. [...] I presented
my work with Matt Wilson modeling the olfactory cortex and I remember distinctly
that it was news to many in the room that synaptic inputs could also be inhibitory.”
During the PDP project attempts were made to involve neuroscientists in the compu-
tational world; a good example was the NIPS (Neural Information Processing Systems)
series of conferences, started in 1986. For James Bower (Miller and Bower 2013, p. 6)
“the neurobiologists, including my friend John Miller, who I had invited to partici-
pate in the second NIPS meeting, found most of the talks either irrelevant to neuro-
biology or naive in their neurobiological claims.” On the other side, in neuroscience
there was a genuine interest in exploring the nature of the processing tasks executed
by nerve cells and systems with computations, something that could not be fulfilled by the simple
PDP models. For this aim, a new field was established, called Computational Neu-
roscience or sometimes Theoretical Neuroscience (Dayan and Abbott 2001), with its
own series of conferences, like CNS started in 1992 (Miller and Bower 2013).
Before the upsurge of PDP-style artificial networks, major advances in neu-
ral modeling were achieved by Rall (1957, 1964, 1969). He adapted an equation
describing the electric potential as a function of time and space in cables, derived by
William Thomson, Lord Kelvin of Largs (Kelvin 1855) for the project of the trans-
atlantic telegraph cable, for much smaller “cables”: dendrites. A first model using
the cable equation for "compartments", idealized cylinders composing dendrites, was
created by Rall and Shepherd (1968). Around the same time there were attempts to
model the equations of Hodgkin and Huxley (1952) within one single neural com-
partment (Connors and Stevens 1971), and eventually Traub (1977, 1979) combined
compartmental modeling with the Hodgkin–Huxley equations.
A major breakthrough came two decades after Traub's models, with the construction
of simulators that greatly propelled computational neuroscience: NEURON by Hines
and Carnevale (1997) and GENESIS by Bower and Beeman (1998). Both computational
frameworks provide environments for implementing in software biologically realis-
tic models of electrical and chemical signaling between neurons. The emergence of
the PDP enterprise was just in between 1977 and 1998, but the two research lines had
no interaction, with PDP-style networks and computational neuroscience taking two
diverging paths. It was not just a matter of scale, with computational neuroscience
simulators aimed at single neurons and PDP models aimed at networks of neurons.
GENESIS was designed to allow modeling at different levels of neural organization,
including large networks with even more neurons than typical PDP models (Proto-
papas et al. 1998). Moreover, an influential line of research within computational
neuroscience was aimed at designing the so-called canonical microcircuits of the
cerebral cortex (Shepherd 1988; Douglas et al. 1989; Douglas and Martin 2004;
Plebe 2018). This research focused on the highly repetitive circuital structure of the
cortex and attempted to identify a sort of prototype circuit and corresponding gov-
erning equations, able to explain its peculiar computational efficiency. One of
the highest achievements in this field of research is represented by the simulator of
the somatosensory cortex of the mouse, built by Markram et al. (2015) within the
European Human Brain Project.
Since the beginning of computational neuroscience there has been little cross-
fertilization between the artificial neural network community and the brain-friendly
neurocomputation enthusiasts. Indeed, several frameworks have been proposed
that mediate between the realistic level of detail of NEURON or GENESIS, and
the abstraction useful for simulating higher cognitive processes like the one used
in Topographica (Bednar 2009, 2014) and Nengo (Eliasmith et al. 2012; Eliasmith
2013). Deep learning did not encompass any of the features that make these interme-
diate models closer to the brain, like spiking neurons, lateral connections, Hebbian
learning, and so on. Recently, due to its large success in applications, deep learn-
ing has garnered attention within computational neuroscience, for a possible role in
future developments (Kass et al. 2018), despite its profound difference in design and
in scope from ordinary network models in neuroscience (Maex et al. 2010).
By using the distinction, mentioned in Sect. 1, between the contexts of discovery
and justification (Reichenbach 1938; Schickore and Steinle 2006), the short histori-
cal overview sketched here definitely depicts neuroscience as an important back-
ground in the context of the discovery of deep learning. At the same time, no clear
and identifiable elements, pertaining to neuroscience, appear as responsible for the
functioning and the efficiency of deep learning in the context of its justification.
A possible exception is in the relation between biological vision and the applica-
tion of deep learning to image processing, that will be discussed in the next section.

5.2 The case of vision

The case of vision deserves special analysis, for several reasons:

– Artificial networks for image processing, DCNN, have a peculiar architecture
with significant differences from ordinary layered neural networks;
– Vision is the most successful field of application for deep learning;
– Recent results show surprising similarities between patterns at various stages of
processing in DCNN’s and the visual system of primates, humans included.


In image processing convolution is performed by moving a matrix, often called
mask or kernel, from point to point in an image, computing the sum of the prod-
ucts of the matrix values and the corresponding underlying image pixel values.
In mathematics the convolution on continuous functions, often called composition
(Volterra 1930), became a powerful tool in the field of signal processing (Wiener
1949; Rabiner and Gold 1975) and its extension to two dimensions is one of the old-
est and most popular technique in image processing (Rosenfeld 1969). By choosing
appropriate convolution kernels, many different filtering and feature extraction
operations can be performed on an image (Rosenfeld and Kak 1982; Bracewell 2003). Integration
of convolution operations in artificial networks was first done by Fukushima (1980)
in the architecture called Neocognitron, where “neo” is with reference to his ear-
lier Cognitron (Fukushima 1975). The Neocognitron alternates layers of S-cell type
units with C-cell type units, whose naming is evocative of the classification into sim-
ple and complex cells by Hubel and Wiesel (1962, 1968). The S-units act as convo-
lution kernels, while the C-units downsample the images resulting from the convolu-
tion, by spatial averaging. The crucial difference from conventional convolution in
image processing is that the kernels are learned. The first version of the Neocogni-
tron learned by unsupervised self-organization, with a winner-take-all strategy: only
the weights of the maximum responding S units, within a certain area, are modified,
together with those of neighboring cells. A later version (Fukushima 1988) used a
weak form of supervision: at the beginning of the training the units to be modified in
the S-layer are selected manually rather than by winner-take-all; after this first sort
of seeding, training proceeds in an unsupervised way.
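The two basic operations just described can be illustrated with the following sketch: a two-dimensional convolution of an image with a kernel (the role of S-type units, whose kernels in a DCNN are learned rather than hand-chosen) followed by downsampling through spatial averaging (the role of C-type units). The image, the kernel and the pooling size are illustrative assumptions.

```python
import numpy as np

def convolve2d(image, kernel):
    """Move the kernel over the image, summing the products with the pixels below."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def average_pool(feature_map, size=2):
    """Downsample by spatial averaging over non-overlapping size x size blocks."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size      # drop borders that do not fit
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.mean(axis=(1, 3))

# Illustrative example: a random 8x8 "image" and a hand-chosen vertical-edge kernel.
image = np.random.default_rng(0).random((8, 8))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # in a DCNN this kernel would be learned
print(average_pool(convolve2d(image, kernel)).shape)   # (3, 3)
```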
So Neocognitron was a development independent from the PDP project, depart-
ing from the standard layered structure, but mostly it was not using the key factor of
the PDP success, the backpropagation learning algorithm. The convergence between
Neocognitron and PDP was done by LeCun et al. (1989), applying backpropagation
to an architecture composed of two layers of Fukushima's S-cell type, followed by
ordinary PDP neural layers. It was an early step towards DCNN. Like the artificial
neural networks of the PDP project, this mixture of Neocognitron and backpropaga-
tion met with relatively good success, especially in the field of character recogni-
tion (LeCun et al. 1998), but it was not the main choice within mainstream computer
vision. A major shift came about when DCNN’s, like ordinary layered networks,
became "deep", once again thanks to the work of Hinton, together with his
Ph.D. student Alex Krizhevsky (Krizhevsky et al. 2012). This model has five layers of convolu-
tions, each with a large number of different kernels, for example 384 in the third
and fourth layers, followed by three ordinary neural layers, with a total number of
60 million parameters. Therefore, a crucial effort was in training the network, over
the ILSVRC dataset that we described in Sect. 3.1. It was done in pure supervised
mode, by stochastic gradient descent (see Sect. 6.2) with few additional heuristics
(see Sect. 7). The model dominated the challenge, dropping the previous error rate
from 26.0% down to 16.4%, and is now known colloquially as AlexNet.
This first success steered computer vision towards DCNN and many new designs,
which stemmed from it, continued to improve performances. The model VGG-16
(Simonyan and Zisserman 2015), with thirteen convolutional layers and three ordi-
nary layers, and kernels smaller than AlexNet, achieved an error of 7.3% on the
2014 ImageNet challenge, further improved to 6.7% by the Inception (or GoogleNet)
model (Szegedy et al. 2015). An extensive review of variations and improvements in
DCNN can be found in Rawat and Wang (2017).
One of the first attempts to relate results of DCNN with the visual system was
based on the idea of adding, at a given level of an artificial network model, a layer
predicting responses in voxel space, and of training this layer on sets of images and
corresponding fMRI responses (Güçlü and van Gerven 2014). Using this method
Güçlü and van Gerven (2015) compared a model very similar to AlexNet (Chatfield
et al. 2014) with fMRI data, training the mapping to voxels on 1750 images. The
model responses were predictive of the voxels in the visual cortex above chance,
with a prediction accuracy slightly below 0.5 for area V1, and of slightly below 0.3
for area LO. The same technique has been further exploited, by generating artifi-
cial fMRI data, using stimuli of classical vision experiments, such as simple reti-
notopy or face/places contrast, for which good agreement between synthetic fMRI
responses and DCNN was found (Eickenberg et al. 2017).
The use of synthetic fMRI data is pursued also by Khan and Tripp (2017), but
with a different strategy, constructing a statistical model of the activity in the higher
visual cortex, by combining a wide range of information from previous studies.
This model allows the interpolation of novel responses as needed for experimental
purposes. Using this method Tripp (2017) was able to test similarities between corti-
cal responses and DCNN models, on various different properties: population sparse-
ness; orientation, size and position tuning; occlusion; clutter; and so on. The DCNNs
tested were AlexNet (Krizhevsky et al. 2012) and VGG-16 (Simonyan and Zisser-
man 2015). The results show some similarities, in particular for sparseness and size
tuning, but also differences, including scale and translation invariance, orientation
tuning, and responses to occlusion, and most of all clutter responses.
An alternative method for comparing DCNN models and fMRI responses was
offered by the representational similarity analysis, introduced by Kriegeskorte et al.
(2009) and Kriegeskorte (2009). This method can be applied to any sort of distributed
responses to stimuli, computing one minus the correlation between all pairs of stim-
uli. The resulting matrix is especially informative when the stimuli are grouped by
their known categorial similarities. The whole idea is that the responses across the
set of stimuli reflect an underlying space in which reciprocal relations correspond to
relations between the stimuli. This is exactly the idea of structural representations,
one of the fundamental concepts in cognitive science (Swoyer 1991; Gallistel 1990;
O’Brien and Opie 2004; Shea 2014; Plebe and De La Cruz 2018). The representa-
tional similarity analysis is applied by Khaligh-Razavi and Kriegeskorte (2014) in
comparing responses in the higher visual cortex, measured with fMRI in humans,
and with cell recordings in monkeys, with several artificial models. This study is very interesting because it includes, in addition to AlexNet, a few models with more biological plausibility.
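In code, the core of representational similarity analysis is only a few operations; the sketch below, with placeholder response matrices (stimuli on the rows), computes the two dissimilarity matrices and correlates their upper triangles.

    # Sketch of representational similarity analysis.
    # `model_resp` and `brain_resp` are placeholder arrays of shape
    # (n_stimuli x n_units) and (n_stimuli x n_voxels), same stimuli in the same order.
    import numpy as np
    from scipy.stats import spearmanr

    def rdm(responses):
        # one minus the correlation between the responses to all pairs of stimuli
        return 1.0 - np.corrcoef(responses)

    iu = np.triu_indices(len(model_resp), k=1)          # distinct stimulus pairs only
    similarity, _ = spearmanr(rdm(model_resp)[iu], rdm(brain_resp)[iu])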
As described in Sect. 5.1, models belonging to computational neuroscience are radically different from deep learning; however, they have never reached dimensions and structures large enough to be used as models of vision, even of early visual
cortical areas. Nevertheless, there is a long tradition of research in neural models
of vision that, although departing from the precise behavior of biological neurons,
include as much as possible realistic features of the visual system (Riesenhuber and
Poggio 1999; Rolls and Deco 2002; Miikkulainen et al. 2005; Cadieu et al. 2007;
Plebe and Domenella 2007). The study of Khaligh-Razavi and Kriegeskorte (2014)
included two notable models of this category.
The most biologically plausible model is VisNet (Wallis and Rolls 1997; Stringer and Rolls 2002; Rolls and Stringer 2006; Stringer et al. 2007), organized into five layers, whose connectivity approximates the sizes of receptive fields in V1, V2, V4, the posterior inferior temporal cortex, and the inferior temporal cortex. The network learns by unsupervised self-organization (von der Malsburg 1973; Willshaw and von der Malsburg 1976) with synaptic modifications derived from the Hebb (1949) rule. Learning includes a specific mechanism called trace memory, since the learning of a single cell is affected by a decaying trace of previous cell activity. This rule attempts to reproduce in a static network the natural dynamics of vision, where invariant recognition of objects is learned by seeing them move under various different perspectives.
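A minimal sketch of a Hebbian rule with a memory trace, for a single postsynaptic cell and with placeholder parameter values, may clarify the mechanism: the synaptic update is driven by a decaying trace of past activity rather than by the instantaneous output alone.

    # Sketch of a trace-memory Hebbian update (single postsynaptic cell).
    # x_t: presynaptic activity vector at time t; y_t: postsynaptic activity (scalar).
    import numpy as np

    def trace_update(w, x_t, y_t, trace, eta=0.6, alpha=0.01):
        trace = (1.0 - eta) * y_t + eta * trace      # decaying trace of past activity
        w = w + alpha * trace * x_t                  # Hebbian update driven by the trace
        return w / np.linalg.norm(w), trace          # normalization keeps weights bounded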
The second model included in the study endowed with some biological plausibility is HMAX (Riesenhuber and Poggio 1999), which resembles the Neocognitron in alternating S-cell layers and C-cell layers, but where the latter select only the maximum response from the connected S-cells. This form of neural selectivity is one
among the typical computations performed in biological neural assemblies (Kouh
and Poggio 2008). HMAX, like Neocognitron and VisNet, learns by unsupervised
self-organization, though the max operation is hardwired.
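The characterizing C-cell computation is just a maximum over the afferent S-cell responses, as in the toy sketch below (the numbers are invented).

    # Toy sketch of the HMAX C-cell operation.
    import numpy as np

    s_responses = np.array([[0.1, 0.7, 0.3],    # two S-cell templates, responses
                            [0.2, 0.4, 0.9]])   # at three spatial positions
    c_responses = s_responses.max(axis=1)       # each C-cell keeps the maximum only
    # array([0.7, 0.9]): position is discarded, template identity is preserved,
    # giving a degree of translation tolerance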
Khaligh-Razavi and Kriegeskorte (2014) constructed several representational
similarity matrices on a set of natural images, spanning multiple animate and inani-
mate categories, comparing the models (the study actually compared 37 different
models, of which only AlexNet, VisNet and HMAX are of interest here). The analysis revealed that AlexNet was significantly more similar to the IT structural representation of the categorical distinction animate/inanimate than the two more biologically plausible models (and than all the other compared models). The most plausible model, VisNet, scored the worst in matching the IT representational similarity. Other studies using the same comparison techniques (Cadieu et al. 2014; Yamins et al. 2014) compared HMAX and DCNN models in predicting representations in visual areas, and again the DCNN models correlated better with cortical representations.
Recently, the significance of representational similarity analysis has been ques-
tioned, because its pooling over many images misses the variation in difficulty across
images of the same object (Rajalingham et al. 2018). To overcome this limitation,
Rajalingham and co-workers collected a large number (over one million) of behavio-
ral trials of object discrimination tasks in humans and monkeys. The stimuli were 24
objects, each with 100 variations in orientation, position, and background. The same
tasks have been tested on seven DCNN, including AlexNet, VGG-16, and Inception, and no model was found to be predictive of the behavioral results of humans or monkeys.
More recently, investigations on similarities between DCNN and the visual system have also focused on the temporal dimension of the visual process. Of course a DCNN does not model time, but its hierarchy may correspond to
the time elapsed by the biological vision process, and this comparison is made
possible by high temporal resolution magnetoencephalography (MEG). Comparisons of models and MEG data began with HMAX and its variants (Clarke et al. 2015); the first study on DCNN (Cichy et al. 2016) used AlexNet as model (Krizhevsky et al. 2012), and applied representational similarity analysis over 118 natural
images, comparing the dissimilarity matrices of MEG and of all layers of AlexNet.
Surprisingly, the convolutional layers show a weak negative relationship, with the
first layer more correlated with later latency, and the fifth layer more correlated with
earlier latency. A strong positive correlation with latency is found, instead, in the
three non-convolutional layers. Yang et al. (2018) refined this analysis by decom-
posing DCNN features into three groups: components common between low and
high levels; low-level features that are roughly orthogonal to high-level ones; high-
level features that are roughly orthogonal to low-level features. This step is crucial in
that there is, obviously, a high correlation between features at the different levels in
DCNN. Moreover, they distinguished MEG time courses for the early visual cortex
and for higher cortical areas. A major novel result is that the early visual cortex, in the late time window, correlates better with the common components than with the low-level features orthogonal to high-level ones. This result can be explained by top-down influences on the early visual cortex.
This surge of studies on the analogies between DCNN and the visual system
has led to a broad discussion in the visual neuroscience community on the relevance
of deep learning models for their scientific objective. Positions span from a mostly
positive acceptance (Gauthier and Tarr 2016; VanRullen 2017), to a cautious inter-
est (Lehky and Tanaka 2016; Grill-Spector et al. 2018; Tacchetti et al. 2018), down
to more skeptical stances (Olshausen 2014; Robinson and Rolls 2015; Rolls 2016;
Conway 2018). A key argument in this discussion is the balance between similari-
ties and differences between DCNN and the visual system, taking into account the
existing more plausible models. However, the discussion does not involve the issue at stake here: the hypothesis that the analogies borrowed from biological vision can explain why DCNN works well. Of course, explaining deep learning is not a concern for vision scientists; nevertheless, in most of the literature just cited, it is taken for granted that the imitation of biological vision is, at least, one of the main reasons for the success of DCNN.
The similarities found between DCNN and the visual system, here reviewed, suggest some analogy between organizational features or processes in the visual pathway and in DCNN models. However, at the moment it is not possible to identify which kind of process or organizational feature is shared by the two systems. The identifi-
cation of this shared process would require a mechanistic mapping between DCNN
and the visual system. This mapping can be established if variables in the DCNN
model correspond to identifiable components, activities, and organizational features
of the visual system, and if the mathematical dependencies posited among these var-
iables in the DCNN model correspond to causal relations among the components
of the visual system (Piccinini 2006, 2007; Kaplan 2011; Kaplan and Craver 2011).
There is obviously a large number of structural features of the visual system that drastically depart from a DCNN model. Just to mention a few: visual maps in the
cortex have many strong interconnections and a very large number of weaker con-
nections (Felleman and Van Essen 1991; Van Essen and DeYoe 1994; Van Essen
2003; Markov et al. 2014); receptive field sizes change within a cortical map, and the degree of change is larger in higher cortical areas (Kay et al. 2013); receptive fields are also modulated by tasks (Klein et al. 2014); scene dynamics affects recognition areas, in addition to motion areas (Stigliani et al. 2017).
Moreover, as highlighted in Sect. 2, deep learning is the most extreme realization
of the empiricist perspective in Artificial Intelligence. Its success may indeed sup-
port a dominant role of experience in the development of cognitive capacities, like
visual recognition. However, a pure learning of visual capacities is unrealistic: there is evidence that visual learning in infancy combines experience with some sort of innate mechanism (Ullman et al. 2012).
In sum, it is difficult to sustain the analogy with biological vision as sufficient, or even important, in explaining the performance of deep learning, and of DCNN in particular. The observation that models with much more biological plausibility, like VisNet and HMAX, score worse than DCNN in the comparison with responses in the visual system may lead to a very different interpretation. Perhaps the lack of biological plausibility is an important factor for the success of deep learning. By freeing the model from constraints imposed by biological similarity, such as respecting receptive field sizes across layers or implementing plausible learning algorithms, the space of mathematical solutions becomes much wider.

6 The hidden mathematics of hidden layers

Deep learning comprises a series of mathematical techniques that gradually adapt, in a network of connected units, the connection parameters so as to achieve a desired function. Even if these techniques are fully formalized in mathematical terms, their formulations do not derive from assumptions that justify why they can work at all. For most of the techniques in deep learning, and in artificial neural networks in general, their degree of success has been an empirical assessment. Odd as it may seem, the mathematics for deep learning is relatively simple, but the mathematical justification of why it works is extremely elusive, and requires highly sophisticated mathematical frameworks.
There are two explanations sought from mathematics:

– Why the artificial neural network idea works well in general;
– Why the deep variant is superior to the shallow one.

The search for mathematical explanations of neural networks began when the PDP project had spread its results, successful enough to prompt the first of the above questions. Obviously, the second question had to wait for deep learning to exist before being asked, with the benefit of a first set of achieved results. For both questions, the mathematical investigation can follow two different paths:

– Try to characterize the class of functions that can be generated by a neural archi-
tecture, often called its expressivity;
– Try to analyze the efficiency of the search done by learning algorithms in the
space of the possible functions of a neural architecture, often called generaliza-
tion.

6.1 Which functions can be generated by a network

A first line of research on the mathematical reasons for the success of PDP networks
focused on the class of functions that can be approximated by feedforward networks
with one hidden layer, and sigmoidal activation function (Cybenko 1989; Hornik
et al. 1989; Stinchcombe and White 1989). The most advanced properties of these networks, proved by Stinchcombe (1999), concern the approximation of any continuous function on the compactification of the input space. Compactifications in topology are spaces for which every open cover contains a finite subcover. Intuitively, this property rules out functions such as sin(·), which can never be approximated by these networks over the whole real line. These properties demonstrate mathematically the universal power of neural networks; however, most theorems presuppose an arbitrary number of hidden units.
This theoretical research provided a good ground for trying to answer the new question: why deep networks work better than shallow ones.
One of the first results was given by Bianchini and Scarselli (2014a, b), using a topological property of the space of functions generated by neural networks, known as the Betti numbers. The term was introduced by Henri Poincaré after the work
of Betti (1872), and, informally, refers to the number of holes on a topological
surface, in a given dimension. Bianchini and Scarselli found different asymptotic
expressions for the Betti numbers of the topology generated by all functions in neu-
ral networks for shallow and deep networks. The analysis is limited to networks with
a single output, and the topology is investigated for the set of all 𝐱 for which the out-
put of the network is positive, as typical in a binary classifier. Calling Sn such set for
shallow networks, and Dn for deep ones, with n the overall number of units, the sum of Betti numbers B(·) is ruled by the following equations:

$$B(S_n) \in O\!\left(n^{D}\right), \tag{4}$$

$$B(D_n) \in \Omega\!\left(2^{n}\right), \tag{5}$$

where D is the dimension of the input vector. Equation (4) states that for shallow networks the sum of Betti numbers grows at most polynomially with respect to the number of hidden units n, while from Eq. (5) it turns out that for deep architectures it can grow exponentially in the number of hidden units. Therefore, by
increasing the number of hidden units, more complex functions can be generated
when the architectures are deep.
Eldan and Shamir (2016) demonstrated that a simple family of functions on ℝ^d is expressible by a feedforward neural network with two hidden layers but not by a network with one hidden layer, unless its width grows as O(e^d). The same group later extended (Safran and Shamir 2017) the results to hyperspherical and hyperelliptical functions. By using topological tools of analysis, other authors
(Petersen et al. 2018) have found that the set of functions that can be implemented by neural networks of a fixed size has, surprisingly, several undesirable properties. One is that the mapping from network parameters to network function is not inverse stable; in other words, two networks with very close functions may have large differences in their parameters.

6.2 How functions can be learned

The key to the success of shallow networks is the simple, yet powerful, backpropagation learning algorithm, first introduced by Rumelhart et al. (1986) and described here in Eq. (1), Sect. 2. In fact, the term back propagation was used earlier by Rosenblatt (1962), a pioneer of artificial neural networks in the pre-PDP period. Rosenblatt attempted with back propagation to generalize his perceptron architecture (Rosenblatt 1958), based on a single layer, to multiple layers. His attempt was dif-
ferent from Eq. (1) and not especially successful. Paul Werbos (1994), by titling his
book The Roots of Backpropagation, claimed to be the originator of the backpropa-
gation algorithm, in his Ph.D. thesis at Harvard (Werbos 1974). The domain of his research was social and political science; his supervisor was Karl Deutsch, one of
the leading social scientists of the twentieth century, and one of the first in intro-
ducing statistical methods and formal analysis in political and social sciences. The
novel technique developed by Werbos aimed at testing the Deutsch-Solow model of
national assimilation and political mobilization (Deutsch 1966) on real data. For this
purpose, he used an iterative technique, termed dynamic feedback, in which derivatives of the error estimates with respect to the parameters were computed. Therefore, even if the research domain was far away from networks and brains, the mathematics of Werbos was largely convergent with backpropagation, and Rumelhart et al. (1995)
recognized that this independent invention provided useful insights about general
properties of the algorithm. Both the algorithm of Werbos and the backpropagation
formulated by Rumelhart and Hinton have their roots in the gradient methods, devel-
oped in mathematics for the minimization of a continuously differentiable function
f(x). The first proposal of a gradient method dates back to Augustin-Louis Cauchy
(1847), and found renewed interest in the first half of the last century in several engineering applications. The Fire Control Design Division was the advanced military research division at the Frankford Arsenal, where high technologies such as LIDAR were invented. Two notable mathematicians in this department, Levenberg (1944) and Curry (1944), developed two independent refinements of the original method of Cauchy; further variants are collected in Polak (1971), including Polak's own. Therefore, there was a mature mathematical context in the '70s for using gradient methods to solve engineering problems, fertile enough to be imported into the domain of artificial neural networks.
Standard backpropagation loses its efficiency when shifting from shallow to
deep learning. As seen in Sect. 2.1, the first breakthrough was made by Hinton and
Salakhutdinov (2006) with the deep belief strategy: turning the full deep feedforward network into a stack of two-layered Boltzmann machines. These architectures are trained in an unsupervised way, using a technique different from backpropagation,
called Contrastive Divergence Learning (Carreira-Perpiñán and Hinton 2005). How-
ever, backpropagation soon reappeared in the deep learning world, powered by a few
modifications, shown in Eq. (2) in Sect. 2.1. The term “backpropagation” gradually
disappeared, and the deep learning community prefers to call its modification “sto-
chastic gradient descent”. This change in name gives credit to a different mathemat-
ical context, that of stochastic approximation, established by Robbins and Monro
(1951). The idea is to solve the equation f (𝐰) = 𝐚 for a vector 𝐰, in the case when
the function f is not observable, using samples of an auxiliary random function
g(𝐰) such that E[g(𝐰)] = f (𝐰). The solution is obtained by the following iterative
equation:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\alpha}{t}\left(g(\mathbf{w}_t) - \mathbf{a}\right). \tag{6}$$

Stochastic approximation was mostly developed in engineering domains, and has turned into an ample mathematical discipline (Kushner and Clark 1978; Benveniste et al. 1990).
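A toy instance of the iteration in Eq. (6) may make the idea tangible; in the sketch below, with invented numbers, the root of f(w) = E[g(w)] = a is estimated when only noisy evaluations of f are available.

    # Toy Robbins-Monro iteration (Eq. 6): find w such that E[g(w)] = a,
    # here with f(w) = 2w observed through additive noise and a = 4 (so w* = 2).
    import numpy as np

    rng = np.random.default_rng(0)
    a, w, alpha = 4.0, 0.0, 0.5
    for t in range(1, 10001):
        g = 2.0 * w + rng.normal()            # noisy observation of f(w)
        w = w - (alpha / t) * (g - a)         # decreasing step size alpha / t
    # w slowly approaches the true root w* = 2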
This mathematical domain provided a second fertile context for developing more
and more efficient variations of learning techniques for deep neural networks (Bot-
tou and LeCun 2004; Kingma and Ba 2014; Schmidt et al. 2017). Much like for the
class of functions generated by deep networks, it is extremely difficult to justify why
learning succeeds. The issue is well expressed by Hodas and Stinis (2018) in the
title of their paper: Doing the impossible: Why neural networks can be trained at all.
A strategy followed by Mei et al. (2018) consists in demonstrating the equivalence between the evolution of a neural network during stochastic gradient descent and a class of nonlinear partial differential equations, whose properties have been deeply
investigated (Ambrosio et al. 2008), for their relevance in describing interacting par-
ticle systems in physics. One important property is that of being endowed with the Was-
serstein metric (Jordan et al. 1998), which, intuitively, is a distance between two
distributions interpreted as two different ways of piling up a certain amount of earth,
computed by the cost of moving one pile into the other. Due to this interpretation, it
is also known as earth mover’s distance. By using this metric, Mei and co-workers
have shown how, under certain assumptions, the nonlinear partial differential equation converges in time towards a minimum of the average prediction error. Therefore, there are conditions under which the convergence of stochastic gradient descent is guaranteed. This result was in fact achieved for a network with one hidden layer only, but it has later been extended to multilayered networks (Nguyen 2019).
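The earth mover's intuition can be made concrete with a toy computation; the sketch below uses the one-dimensional Wasserstein distance available in SciPy on two invented samples.

    # Toy illustration of the Wasserstein (earth mover's) distance between two
    # one-dimensional empirical distributions.
    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)
    pile_a = rng.normal(loc=0.0, size=1000)
    pile_b = rng.normal(loc=3.0, size=1000)
    print(wasserstein_distance(pile_a, pile_b))   # about 3: the cost of shifting one pile onto the other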
There are also studies that attempt to take into consideration the expressivity and
the generalization of neural networks at the same time. In other words, these stud-
ies aim at constraining the richness in functions provided by deep networks with
the additional learning effort. One metric often used in this kind of study is the covering number, which can be characterized as the number of multi-dimensional spheres of a given radius necessary to cover a space. Covering numbers are used in kernel machine learning, as they can express bounds on the number of required samples and estimates of the errors (Zhou 2002). By computing covering numbers of neural networks, Lin (2018) compared two classes of models, one with a single hidden layer and the other with two hidden layers, showing that the latter can generate a
class of functions larger than that of the shallow model, at the same cost in terms of learning effort. This result has later been extended to networks with an arbitrary number of layers, realizing some specific data features such as rotation invariance or sparseness (Guo et al. 2019): deep models can realize more complex features while retaining the same covering numbers as the equivalent shallow models.

6.3 Analogies to theoretical physics

Analogical reasoning is a powerful tool in science; it requires a well-understood source domain, mapped onto the target domain to be explained. The most remarkable analogy made for explaining deep neural models is with theoretical physics, and specifically with the renormalization group (Mehta and Schwab 2014). This technique plays a fundamental role in contemporary physics, overcoming the problem of series in fundamental equations, like Dirac's quantum electrodynamics, that sum up to infinite probabilities. The renormalization group, first introduced by Stueckelberg and Petermann (1953), allows relating changes of a physical system that appear at different scales, yet exhibit scale invariance properties, under certain transformations. By applying renormalization it turned out that one set of terms in Dirac's equation that sums to infinity can be offset against another, and the remaining terms give a finite result. The renormalization group has later been successfully applied to other fundamental equations, such as quantum flavourdynamics and quantum chromodynamics. Moreover, the renormalization group is the best tool for the analysis of critical behavior of physical systems. Critical phenomena in physics are phase transitions at the boundary between the ordinary discontinuous behavior between phases and the continuum of phases observed at temperatures above a certain threshold. Near the
critical point physical systems have a remarkable invariance in scale of most of their
parameters, making the renormalization group very effective in connecting phenom-
ena which occur at quite different length scales (Wilson and Kogut 1974).
Mehta and Schwab worked on the classical Ising model (Ising 1925), made of units with a binary spin, organized in lattices. Neighboring units interact with each other, and can be subjected to an overall magnetic field. The application of the renormalization group to an Ising model can be intuitively described as the coalescence of a box of units into a single abstract unit, with its own spin. The corresponding deep neural model is a Restricted Boltzmann Machine, described in Sect. 2. The choice is obvious, not just because Boltzmann Machines are a physically inspired neural model, but rather because neural units are binary, and can therefore be directly interpreted as spin units. The affinity between this type of deep neural model and the
application of the renormalization group to the Ising spin model is exemplified in
Fig. 4. Mehta and Schwab worked out a precise mathematical mapping between lay-
ers of the Restricted Boltzmann Machine and applications of a specific formulation
of the renormalization group (Kadanoff 2000) to the Ising spin physical systems.
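The coalescence of a box of spins into a single abstract spin can be sketched in a few lines; the 2 × 2 block size and the majority rule below are illustrative choices, not the exact transformation used by Mehta and Schwab.

    # Sketch of one block-spin coarse-graining step on a toy Ising lattice.
    import numpy as np

    rng = np.random.default_rng(0)
    lattice = rng.choice([-1, 1], size=(8, 8))               # toy 8x8 spin configuration
    blocks = lattice.reshape(4, 2, 4, 2).sum(axis=(1, 3))    # sum over each 2x2 block
    coarse = np.where(blocks >= 0, 1, -1)                    # majority rule (ties -> +1)
    # `coarse` is a 4x4 lattice of abstract spins at the reduced scale, analogous
    # to the activity of the next machine in a stack of Boltzmann Machines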
This connection between artificial neural networks and the renormalization
group has led to speculations about the possibility that the success of deep learning
stems from its aptitude to capture the physical structure of the universe (Flack 2018;
López-Rubio 2018). This speculation is intriguing, but probably still too daring.
Fig. 4  The similarity between the renormalization group applied to the Ising spin model and deep neural Boltzmann Machines. In (a), at the bottom, there is an example of lattice sites, each with a possible up or down spin; by operating the renormalization group the top structure is obtained, where groups of physical sites are replaced by an abstract spin at a reduced scale. In (b) there is a stack of neural Boltzmann Machines, with the lower one at finer grain and the upper one at reduced resolution. The scale reduction operated by the deep neural network and by the renormalization group are quite alike

The connection established between the two domains of the analogy needs further
work, and there are controversial aspects. For example, Lin et al. (2017) argue that
the parallel between the renormalization group and Restricted Boltzmann Machines is flawed, because the former is essentially a supervised procedure while neural Boltzmann Machines are unsupervised. A recent work (Iso et al. 2018) compared the flow diagram of the renormalization group for the Ising model with that of Restricted Boltzmann Machines. Flow diagrams connect points in the phase space of a system, to which progressive renormalization transformations are applied. In the experiments of Iso and coworkers, the deep neural model generates a flow along which the temperature tends to the critical value, while the renormalization group displays the opposite flow, towards T = 0.

7 Heuristic appraisal in action

The account of heuristic appraisal proposed by Nickles (2006) is broad in scope, and
addresses aspects such as the economy and the politics of scientific research. Several
of the considerations included in heuristic appraisal may be applicable in an over-
all evaluation of deep learning; however, we will limit ourselves here to the analysis of what makes deep learning so effective. For this purpose, heuristic appraisal
addresses aspects that can be identified as contributing to the success of deep learn-
ing, but lack a definite and relevant context of their discovery, and resist a logical-
mathematical justification.
Many of the mathematical refinements and variations that continue to improve
deep learning derive neither from biological insights nor from theoretical grounds, but are typically heuristic. In some cases the new refinements replace previous
strategies that did come from biological inspirations. For example, we described in
Sect. 5.2 that selecting a maximum response, instead of the average, at the output of
a convolution, bears some biological plausibility, and it is the basis of the HMAX
model (Riesenhuber and Poggio 1999). This computation, now termed max pooling,
is adopted by AlexNet too. A number of alternative expedients have been explored
that mix in clever ways the average and the max operations. Among the most successful ones, Lee et al. (2018) add a tree structure on the output of a convolution, combining the values progressively with learned pooling filters; Williams and Li (2018) use a method traditional in computer vision, wavelet decomposition, to reduce data by pooling while preserving local details.
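The difference among these pooling choices fits in a few lines; the sketch below contrasts average, max, and a simple mixed pooling over a single window, with the mixing weight as an illustrative placeholder for the learned combinations cited above.

    # Sketch of pooling variants over one 2x2 window of convolution outputs.
    import numpy as np

    window = np.array([[0.2, 0.9],
                       [0.1, 0.4]])
    avg_pool = window.mean()                  # 0.4  (average pooling)
    max_pool = window.max()                   # 0.9  (max pooling, as in HMAX and AlexNet)
    alpha = 0.7                               # placeholder mixing weight, learned in mixed/gated schemes
    mixed_pool = alpha * max_pool + (1 - alpha) * avg_pool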
Other strategies adopted layers in the hierarchy different from the standard alternation of convolution and pooling. A significant improvement was achieved by He et al. (2016) with the idea of layers that learn residual functions with reference to their preceding layer inputs. This allowed errors to be propagated directly to the preceding units, facilitating learning and making it possible to stack more layers, 34 in the version that won the competition on ImageNet in 2015.
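A layer of this residual kind can be sketched as follows in PyTorch; the channel count is a placeholder, and batch normalization, present in the actual architecture, is omitted for brevity.

    # Sketch of a residual block: the layer learns a residual function F(x)
    # and outputs F(x) + x, letting errors propagate directly to preceding units.
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels=64):          # placeholder channel count
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            residual = self.conv2(self.relu(self.conv1(x)))   # F(x)
            return self.relu(residual + x)                    # F(x) + x via the shortcut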
There are several passages in the literature where heuristics were expected, from
a theoretical analysis, to have a negative impact, but the empirical results were found
in the opposite direction, and therefore the heuristics have been kept.
One early example is the method Hinton developed to train Boltzmann
machines, necessary in his first strategy for training deep architectures (Hinton
and Salakhutdinov 2006), a method called contrastive divergence (Hinton 2002).
Contrastive divergence learning should approximate the minimization of the Kull-
back–Leibler divergence, but a theoretical analysis carried out by Carreira-Perpiñán
and Hinton (2005, p. 40) showed differently:
Our first result is negative: for two types of Boltzmann machine we have
shown that, in general, the fixed points of CD [(Contrastive Divergence)] differ
from those of ML [(Maximum-Likelihood)], and thus CD is a biased algo-
rithm. This might suggest that CD is not a competitive method for ML estima-
tion of random fields. Our remaining, empirical results show otherwise: the
bias is generally very small
What counts in promoting a method is its empirical result, no matter if the theory shows otherwise, and this has been the case for contrastive divergence.
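A single contrastive divergence step (CD-1) for a Restricted Boltzmann Machine with binary units fits in a short sketch; the sampling choices and the learning rate below are illustrative, and the code is not meant as the exact procedure of Hinton (2002).

    # Sketch of one CD-1 update for a Restricted Boltzmann Machine.
    # v: batch of visible vectors (n_samples x n_visible); W, b, c: weights and biases.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v, W, b, c, lr=0.01, rng=np.random.default_rng(0)):
        p_h = sigmoid(v @ W + c)                           # positive phase
        h = (rng.random(p_h.shape) < p_h).astype(float)    # sample hidden states
        p_v = sigmoid(h @ W.T + b)                         # one Gibbs step back to the visible layer
        p_h_rec = sigmoid(p_v @ W + c)                     # negative phase
        W += lr * (v.T @ p_h - p_v.T @ p_h_rec) / len(v)
        b += lr * (v - p_v).mean(axis=0)
        c += lr * (p_h - p_h_rec).mean(axis=0)
        return W, b, c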
A more recent example is about DCNN. Most of the improvement from AlexNet
to GoogleNet is due to the Inception concept, named after the title of the film by Christopher Nolan, in which the main character, Dom Cobb, uttered the phrase “we need to go Deeper”. Going deeper in CNN was helped by a clever trick: substituting every 5 × 5 convolution by a 3 × 3 convolution that produces as output, on the same window, 3 × 3 values. These values can be reduced to one single result with a second 3 × 3 convolution, with a gain in the number of parameters from 25, in the case of the 5 × 5 convolution, to 18 (9 + 9) with the Inception trick. This replacement is in theory not neutral, since part of the linear space spanned by the original 5 × 5 convolution is lost, due to the application of non-linear rectifiers at the end of the first 3 × 3 convolution. Again, heuristic appraisal is applied (Szegedy et al. 2016, p. 2821):
Still, this setup raises two general questions: Does this replacement result
in any loss of expressiveness? If our main goal is to factorize the linear part

13
The Unbearable Shallow Understanding of Deep Learning 543

of the computation, would it not suggest to keep linear activation in the first
layer? We have ran several control experiments and using linear activation was
always inferior to using rectified linear units in all stages of the factorization.
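The parameter saving behind the factorization discussed above is easy to verify; the sketch below counts the weights for a single input and output channel, matching the 25 versus 9 + 9 figures.

    # Weights of a 5x5 convolution versus two stacked 3x3 convolutions
    # (single input and output channel, biases ignored).
    params_5x5 = 5 * 5                  # 25 weights
    params_two_3x3 = 3 * 3 + 3 * 3      # 18 weights (9 + 9)
    # the receptive field stays 5x5, while the parameters drop by 28%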
Pragmatics and heuristic search of improvements have become more common, in part as a consequence of the larger base of researchers worldwide. The number of papers on deep learning had a compound annual growth rate of 37% from 2014 to 2017, and the number of job openings requiring deep learning increased 34 times from 2015 to 2017. Certainly most of the novel additions and modifications to deep learning architectures experimented with around the world will not lead to improvements, and
will never be known at all. Only the few with empirical success will be published,
and contribute to the progress of deep learning.
Note that imputing the merits of deep learning entirely to a collection of scattered and disparate heuristics might be vulnerable to a sort of “no-miracle argument”.1 Since Putnam (1978, pp. 18–19), one of the preferred arguments of sci-
entific realists says, roughly, that without assuming that laws and objects of a theory
describe, approximately, the real world, the success of that theory in a multitude of
predictions, would be an utter miracle. Antirealists, obviously, developed their ways
to try to escape the no-miracle argument (Psillos 2000), and this is not a debate of
relevance here. In the context of deep learning, the miracle would be to impute its
series of top computational performances to a large collection of disconnected heu-
ristics, without acknowledging any core theoretical principle. Miracles are not sup-
posed to take place, yet the very point about deep learning is that a unifying working
principle has still to be identified, thus there is no miracle to be invoked as a replace-
ment. Nevertheless, several hypotheses of a core mathematical principle explaining
the efficiency of deep learning have been put forward, reviewed in Sect. 6. The heu-
ristics listed here do not rule out the possibility of a core theoretical principle; however, they certainly play a crucial role in the current progression of the performance of deep learning, and it would not be easy to discriminate their contribution from that of a putative core principle (or principles).

1 We are grateful to an anonymous reviewer for pointing this out.

8 Conclusions

This work has addressed the reasons underlying the enormous success that deep
learning has gained in recent years, within a variety of application domains. We have
ascertained that within this “success”, together with factors depending on market-
ing and sociology of scientific research, there are genuine elements of a remarkable
progress in performance. Within a relatively short timespan, a set of AI problems,
long considered out of reach for existing computational tools, have found efficient
solutions thanks to deep learning. Analyzing the history of neural networks, and the
recent shift from “shallow” to “deep”, there is no trace of a well defined innova-
tion capable of justifying such success. Both the current leading factors considered
to have propelled deep neural network development, namely computing hardware
performance and neuroscience findings, provide a largely unsatisfactory justification
for deep learning achievements, as extensively discussed. The field we found most
promising in providing a partial explanatory picture is that of mathematics. How-
ever, despite significant recent efforts, a full mathematical explanation of why deep
learning works is still missing. All demonstrated properties of deep models, although
of great relevance, are obtained on a limited class of neural networks only. Also,
it appears that the mathematical context, relevant for understanding deep learning,
is wide and diversified, spanning frameworks developed inside application domains
ranging from engineering to theoretical physics. A further aspect that makes it dif-
ficult to outline a comprehensive explanation is that the progress of deep learning is
characterized by the summation of a large number of heuristics, without any theo-
retical basis, justified by empirical results only. The conclusion that can be drawn from the current understanding of deep neural networks is that a comprehensive explanation of the true reasons for their success is yet to come.

References
Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. New York: Wiley.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., et al. (2015). TensorFlow: Large-scale
machine learning on heterogeneous systems. Technical report, Google Brain Team.
Alippi, C., & Vellasco, M. (1992). GALATEA neural VLSI architectures: Communication and control
considerations. Microprocessing and Microprogramming, 35, 175–181.
Ambrosio, L., Gigli, N., & Savaré, G. (2008). Gradient flows in metric spaces and in the space of prob-
ability measures. Basel: Birkhäuser.
Anderson, J. A., & Rosenfeld, E. (Eds.). (2000). Talking nets: An oral history of neural networks. Cam-
bridge: MIT Press.
Arel, I., Rose, D. C., & Karnowski, T. P. (2010). Deep machine learning-a new frontier in artificial intel-
ligence research. IEEE Computational Intelligence Magazine, 5, 13–18.
Batin, M., Turchin, A., Markov, S., Zhila, A., & Denkenberger, D. (2017). Artificial intelligence in life
extension: From deep learning to superintelligence. Informatica, 41, 401–417.
Bednar, J. A. (2009). Topographica: Building and analyzing map-level simulations from Python, C/C++,
MATLAB, NEST, or NEURON components. Frontiers in Neuroinformatics, 3, 8.
Bednar, J. A. (2014). Topographica. In D. Jaeger & R. Jung (Eds.), Encyclopedia of computational neu-
roscience (pp. 1–5). Berlin: Springer.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade learning environment: An
evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Benveniste, A., Metivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations.
Berlin: Springer.
Betti, E. (1872). Il nuovo cimento. Series, 2, 7.
Bianchini, M., & Scarselli, F. (2014a). On the complexity of neural network classifiers: A comparison
between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Sys-
tems, 25, 1553–1565.
Bianchini, M., & Scarselli, F. (2014b). On the complexity of shallow and deep neural network classifiers.
In Proceedings of European Symposium on Artificial Neural Networks (pp. 371–376).
Bo, L., Lai, K., Ren, X., & Fox, D. (2011). Object recognition with hierarchical kernel descriptors. In
Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (pp.
1729–1736).
Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., et al. (2014). Find-
ings of the 2014 workshop on statistical machine translation. In Proceedings of the Workshop on
Statistical Machine Translation (pp. 12–58).
Booker, L., Forrest, S., Mitchell, M., & Riolo, R. (Eds.). (2005). Perspectives on adaptation in natural
and artificial systems. Oxford: Oxford University Press.
Bottou, L., & LeCun, Y. (2004). Large scale online learning. In Advances in neural information process-
ing systems (pp. 217–224).
Bower, J. M., & Beeman, D. (1998). The book of GENESIS: Exploring Realistic Neural Models with the
GEneral NEural SImulation System (2nd ed.). New York: Springer.
Bracewell, R. (2003). Fourier analysis and imaging. Berlin: Springer.
Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., et al. (2014). Deep neu-
ral networks rival the representation of primate IT cortex for core visual object recognition. PLoS
Computational Biology, 10, e1003963.
Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M., & Poggio, T. (2007). A model of
V4 shape selectivity and invariance. Journal of Neurophysiology, 98, 1733–1750.
Carnap, R. (1938). The logical syntax of language. New York: Harcourt, Brace and World.
Carreira-Perpiñán, M., & Hinton, G. (2005). On contrastive divergence learning. In R. Cowell, & Z.
Ghahramani (Eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and
Statistics (pp. 33–40).
Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d’équations simultanées.
Comptes rendus des séances de l’Académie des sciences de Paris, 25, 536–538.
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details:
Delving deep into convolutional nets. CoRR arXiv​:abs/1405.3531.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., et al. (2015). MXNet: A flexible and efficient
machine learning library for heterogeneous distributed systems. CoRR arXiv​:abs/1512.01274​.
Chollet, F. (2018). Deep learning with python. Shelter Island (NY): Manning.
Chui, M., Manyika, J., Miremadi, M., Henke, N., Chung, R., Nel, P., et al. (2018). Notes from the AI
frontier: Insights from hundreds of use cases. Technical Reports. April, McKinsey Global Institute.
Cicchetti, D. V. (1991). The reliability of peer review for manuscript and grant submissions: A cross-
disciplinary investigation. Behavioral and Brain Science, 14, 119–186.
Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural net-
works to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical
correspondence. Scientific Reports, 6, 27755.
Cinbis, R.G., Verbeek, J., & Schmid, C. (2012). Segmentation driven object detection with fisher vectors.
In International Conference on Computer Vision, (pp. 2968–2975).
Cireşan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image clas-
sification. In Proceedings of IEEE International Conference on Computer Vision and Pattern
Recognition.
Clarke, A., Devereux, B. J., Randall, B., & Tyler, L. K. (2015). Predicting the time course of individual
objects with MEG. Cerebral Cortex, 25, 3602–3612.
Coates, A., Huval, B., Wang, T., Wu, D.J., Ng, A.Y., & Catanzaro, B. (2013). Deep learning with COTS
HPC systems. In International Conference on Machine Learning, (pp. 1337–1345).
Connors, J. A., & Stevens, C. F. (1971). Prediction of repetitive firing behaviour from voltage clamp data
on an isolated neurone soma. Journal of Physiology, 213, 31–53.
Conway, B. R. (2018). The organization and operation of inferior temporal cortex. Annual Review of
Vision Science, 4, 19.1–19.22.
Copeland, J., & Proudfoot, D. (1996). On Alan Turing’s anticipation of connectionism. Synthese, 108,
361–377.
Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly of
Applied Mathematics, 2, 258–261.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function, mathematics of control.
Signals and Systems, 2, 303–314.
Daniel, H. D. (2005). Publications as a measure of scientific advancement and of scientists’ productivity.
Learned Publishing, 18, 143–148.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge: MIT Press.
de Villers, J., & Barnard, E. (1992). Backpropagation neural nets with one and two hidden layers. IEEE
Transactions on Neural Networks, 4, 136–141.
Deutsch, K. W. (1966). The nerves of government: Models of political communication and control. New
York: Free Press.
Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Annual Review of Neurosci-
ence, 27, 419–451.
Douglas, R. J., Martin, K. A., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural
Computation, 1, 480–488.
Durrani, N., Haddow, B., Koehn, P., & Heafield, K. (2014). Edinburgh’s phrase-based machine transla-
tion systems for WMT-14. In Proceedings of the Workshop on Statistical Machine Translation (pp.
97–104).
Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network
layers map the function of the human visual system. NeuroImage, 152, 184–194.
Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. Journal of Machine
Learning Research, 49, 1–34.
Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition. Oxford:
Oxford University Press.
Eliasmith, C., Stewart, T. C., Choo, X., Bekolay, T., DeWolf, T., Tang, Y., et al. (2012). A large-scale
model of the functioning brain. Science, 338, 1202–1205.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–221.
Elman, J. L., Bates, E., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethink-
ing innateness: A connectionist perspective on development. Cambridge: MIT Press.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual
object classes (VOC) challenge. Journal of Computer Vision, 88, 303–338.
Fellbaum, C. (1998). WordNet. Malden: Blackwell Publishing.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral
cortex. Cerebral Cortex, 1, 1–47.
Flack, J. C. (2018). Coarse-graining as a downward causation mechanism. Philosophical transactions of
the Royal Society A, 375, 20160338.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cogni-
tion, 28, 3–71.
Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernet-
ics, 20, 121–136.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pat-
tern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recogni-
tion. Neural Networks, 1, 119–130.
Gallistel, C. R. (1990). The organization of learning. Cambridge (MA): MIT Press.
Gauthier, I., & Tarr, M. J. (2016). Visual object recognition: Do we (finally) know more now than we
did? Annual Review of Vision Science, 2, 16.1–16.20.
Girshick, R. (2015). Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition (pp. 1440–1448).
Godfrey, J., Holliman, E., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for
research and development. In International Conference on Acoustics, Speech and Signal Process-
ing (pp. 517–520).
Grill-Spector, K., Weiner, K. S., Gomez, J., Stigliani, A., & Natu, V. S. (2018). The functional neuro-
anatomy of face perception: From brain measurements to deep neural networks. Interface Focus,
8, 20180013.
Güçlü, U., & van Gerven, M. A. J. (2014). Unsupervised feature learning improves prediction of human
brain activity in response to natural images. PLoS Computational Biology, 10, 1–16.
Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of
neural representations across the ventral stream. Journal of Neuroscience, 35, 10005–10014.
Guo, Z. C., Shi, L., & Lin, S. B. (2019). Realizing data features by deep nets. CoRR arXiv​
:abs/1901.00139​.
Hain, T., Woodland, P. C., Evermann, G., Gales, M. J. F., Liu, X., Moore, G. L., et al. (2005). Automatic
transcription of conversational telephone speech. IEEE Transactions on Speech and Audio Process-
ing, 13, 1173–1185.
Hanson, N. R. (1958). Patterns of discovery. Cambridge: Cambridge University Press.
Hassabis, D., Kumaran, D., Summerfield, C., & Botvinick, M. (2017). Neuroscience-inspired artificial
intelligence. Neuron, 95, 245–258.
Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y.,
Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., & Wang, X. (2018).
Applied machine learning at Facebook: A datacenter infrastructure perspective. In IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA) (pp. 620–629).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Pro-
ceedings of IEEE International Conference on Computer Vision and Pattern Recognition (pp.
2818–2826).
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hemlin, S. (1996). Research on research evaluation. Social Epistemology, 10, 209–250.
Hendricks, V. F., Jakobsen, A., & Pedersen, S. A. (2000). Identification of matrices in science and engi-
neering. Journal for General Philosophy of Science, 31, 277–305.
Hines, M., & Carnevale, N. (1997). The NEURON simulation environment. Neural Computation, 9,
1179–1209.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Compu-
tation, 162, 83–112.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In D. E. Rumel-
hart & J. L. McClelland (Eds.) (pp. 77–109).
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.
Science, 28, 504–507.
Hodas, N., & Stinis, P. (2018). Doing the impossible: Why neural networks can be trained at all. CoRR
arXiv​:abs/1805.04928​.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of ion currents and its applications to
conduction and excitation in nerve membranes. Journal of Physiology, 117, 500–544.
Holland, J. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan
Press.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approx-
imators. Neural Networks, 2, 359–366.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of IEEE Interna-
tional Conference on Computer Vision and Pattern Recognition (pp. 7132–7142).
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the
cat’s visual cortex. Journal of Physiology, 160, 106–154.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of mokey striate cortex.
Journal of Physiology, 195, 215–243.
Ising, E. (1925). Beitrag zur Theorie des Rerromagnetismus. Zeitschrift für Physik, 31, 253–258.
Iso, S., Shiba, S., & Yokoo, S. (2018). Scale-invariant feature extraction of neural network and renormali-
zation group flow. Physical Review E, 97, 053304.
Jones, W., Alasoo, K., Fishman, D., & Parts, L. (2017). Computational biology: Deep learning. Emerging
Topics in Life Sciences, 1, 136–161.
Jordan, R., Kinderlehrer, D., & Otto, F. (1998). The variational formulation of the Fokker–Planck equa-
tion. SIAM Journal Mathematical Analysis, 29, 1–17.
Kadanoff, L. P. (2000). Statistical physics: Statics, dynamics and renormalization. Singapore: World Sci-
entific Publishing.
Kaplan, D. M. (2011). Explanation and description in computational neuroscience. Synthese, 183,
339–373.
Kaplan, D. M., & Craver, C. F. (2011). Towards a mechanistic philosophy of neuroscience. In S. French
& J. Saatsi (Eds.), Continuum companion to the philosophy of science (pp. 268–292). London:
Continuum Press.
Karmiloff-Smith, A. (1992). Beyond modularity: A developmental perspective on cognitive science.
Cambridge: MIT Press.
Kass, R. E., Amari, S. I., Arai, K., Diekman, E. N. B. C. O., Diesmann, M., Doiron, B., et al. (2018).
Computational neuroscience: Mathematical and statistical perspectives. Annual Review of Statis-
tics and Its Application, 5, 183–214.
Kay, K. N., Winawer, J., Mezer, A., & Wandell, B. A. (2013). Compressive spatial summation in human
visual cortex. Journal of Neurophysiology, 110, 481–494.
Ketkar, N. (2017). Introduction to PyTorch (pp. 195–208). Berkeley: Apress.
Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may
explain it cortical representation. PLoS Computational Biology, 10, e1003915.
Khan, S., & Tripp, B. P. (2017). One model to learn them all. CoRR arXiv​:abs/1706.05137​.
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. In Proceedings of Interna-
tional Conference on Learning Representations.
Klein, B., Harvey, B. M., & Dumoulin, S. O. (2014). Attraction of position preference by spatial attention
throughout human visual cortex. Neuron, 84, 227–237.
Kotseruba, I., & Tsotsos, J. K. (2018). 40 years of cognitive architectures: Core cognitive abilities and
practical applications. Artificial Intelligence Review,. https​://doi.org/10.1007/s1046​2-018-9646-y.
Kouh, M., & Poggio, T. (2008). A canonical neural circuit for cortical nonlinear operations. Neural Com-
putation, 20, 1427–1451.
Kriegeskorte, N. (2009). Relating population-code representations between man, monkey, and computa-
tional models. Frontiers in Neuroscience, 3, 363–373.
Kriegeskorte, N., Mur, M., & Bandettini, P. (2009). Representational similarity analysis-connecting the
branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical
Reports. Vol. 1, No. 4, University of Toronto.
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1090–1098).
Kushner, H. J., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained
systems. Berlin: Springer.
Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale reading comprehension data-
set from examinations. In Conference on Empirical Methods in Natural Language Processing (pp
796–805).
Laird, J. E., & van Lent, M. (2001). Human-level AI’s killer application: Interactive computer games. AI
Magazine, 22, 15–25.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn
and think like people. Behavioral and Brain Science, 40, 1–72.
Landgrebe, J., & Smith, B. (2019). Making AI meaningful again. Synthese,. https​://doi.org/10.1007/
s1122​9-019-02192​-y:1-21.
Laudan, L. (1984). Explaining the success of science: Beyond epistemic realism and relativism. In J. T.
Cushing, C. F. Delaney, & G. Gutting (Eds.), Science and reality: Recent work in the philosophy of
science (pp. 83–105). Notre Dame: University of Notre Dame Press.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backprop-
agation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86, 2278–2324.
Lee, C. Y., Gallagher, P. W., & Tu, Z. (2018). Generalizing pooling functions in CNNs: Mixed, gated, and
tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 863–875.
Lehky, S. R., & Tanaka, K. (2016). Neural representation for object recognition in inferotemporal cortex.
Current Opinion in Neurobiology, 37, 23–35.
Leibniz, G.W. (1666). De arte combinatoria. Ginevra, in Opera Omnia a cura di L. Dutens, 1768.
Lettvin, J., Maturana, H., McCulloch, W., & Pitts, W. (1959). What the frog’s eye tells the frog’s brain.
Proceedings of IRE, 47, 1940–1951.
Levenberg, K. (1944). A method for solution of certain non-linear problems in least squares. Quarterly of
Applied Mathematics, 2, 164–168.
Lin, H. W., Tegmark, M., & Rolnick, D. (2017). Why does deep and cheap learning work so well? Jour-
nal of Statistical Physics, 168, 1223–1247.
Lin, S. B. (2018). Generalization and expressivity for deep nets. IEEE Transactions on Neural Networks
and Learning Systems, 30, 1392–1406.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural network
architectures and their applications. Neurocomputing, 234, 11–26.
López-Rubio, E. (2018). Computational functionalism for the deep learning era. Minds and Machines,
28, 667–688.
Lorente de Nó, R. (1938). Architectonics and structure of the cerebral cortex. In J. Fulton (Ed.), Physiol-
ogy of the nervous system (pp. 291–330). Oxford: Oxford University Press.
Lu, Y. (2019). Artificial intelligence: A survey on evolution, models, applications and future trends. Jour-
nal of Management Analytics,. https​://doi.org/10.1080/23270​012.2019.15703​65:1-29.
MacWhinney, B. (Ed.). (1999). The emergence of language (2nd ed.). Mahwah: Lawrence Erlbaum
Associates.
Maex, R., Berends, M., & Cornelis, H. (2010). Large-scale network simulations in systems neuroscience.
In E. De Schutter (Ed.), Computational modeling methods for neuroscientists (pp. 317–354). Cam-
bridge: MIT Press.
Marcus, G. (2018). Deep learning: A critical appraisal. CoRR arXiv​:abs/1801.00631​.
Markov, N., Ercsey-Ravasz, M. M., Gomes, A. R. R., Lamy, C., Magrou, L., Vezoli, J., et al. (2014). A
weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cerebral Cortex,
24, 17–36.
Markram, H., Muller, E., Ramaswamy, S., Reimann, M. W., et al. (2015). Reconstruction and simulation
of neocortical microcircuitry. Cell, 163, 456–492.
Maudit, N., Duranton, M., Gobert, J., & Sirat, J. (1992). Lneuro1.0: A piece of hardware lego for building
neural network systems. IEEE Transactions on Neural Networks, 3, 414–422.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin
of Mathematical Biophysics, 5, 115–133.
Mehta, P., & Schwab, D. J. (2014). An exact mapping between the variational renormalization group and
deep learning. CoRR arXiv​:abs/1410.03831​.
Mei, S., Montanari, A., & Nguyen, P. M. (2018). A mean field view of the landscape of two-layer neural
networks. Proceedings of the National Academy of Sciences USA, 115, E7665–E7671.
Miikkulainen, R., Bednar, J., Choe, Y., & Sirosh, J. (2005). Computational maps in the visual cortex.
New York: Springer.
Miller, J., & Bower, J. M. (2013). Introduction: Origins and history of the CNS meetings. In J. M. Bower
(Ed.), 20 years of computational neuroscience (pp. 1–13). Berlin: Springer.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge: MIT Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-
level control through deep reinforcement learning. Nature, 518, 529–533.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs: Prentice Hall.
Nguyen, P. M. (2019). Mean field limit of the learning dynamics of multilayer neural networks. CoRR
arXiv:abs/1902.02880.
Nickles, T. (2006). Heuristic appraisal: Context of discovery or justification? In J. Schickore & F. Steinle
(Eds.), Revisiting discovery and justification (pp. 159–182). Dordrecht: Springer.
Niiniluoto, I. (1993). The aim and structure of applied research. Erkenntnis, 38, 1–21.
Niu, J., Tang, W., Xu, F., Zhou, X., & Song, Y. (2016). Global research on artificial intelligence from
1990–2014: Spatially-explicit bibliometric analysis. International Journal of Geo-Information, 5,
66.
O’Brien, G., & Opie, J. (2004). Notes toward a structuralist theory of mental representation. In H. Clapin,
P. Staines, & P. Slezak (Eds.), Representation in mind: New approaches to mental representation.
Amsterdam: Elsevier.
Olshausen, B. A. (2014). Perception as an inference problem. In M. S. Gazzaniga (Ed.), The cognitive
neurosciences (5th ed., pp. 295–304). Cambridge: MIT Press.
Özkural, E. (2018). The foundations of deep learning with a path towards general intelligence. In Pro-
ceedings of International Conference on Artificial General Intelligence (pp. 162–173).
Peirce, C. S. (1935). Pragmatism and abduction. In C. Hartshorne & P. Weiss (Eds.), Collected papers of
Charles Sanders Peirce (Vol. 5, pp. 112–128). Cambridge: Harvard University Press.
Petersen, P., Raslan, M., & Voigtlaender, F. (2018). Topological properties of the set of functions gener-
ated by neural networks of fixed size. CoRR arXiv:abs/1806.08459.
Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via
parameter sharing. CoRR arXiv:abs/1802.03268.
Piccinini, G. (2004). The first computational theory of mind and brain: A close look at McCulloch and
Pitts’s ’Logical calculus of ideas immanent in nervous activity’. Synthese, 141, 175–215.
Piccinini, G. (2006). Computational explanation in neuroscience. Synthese, 153, 343–353.
Piccinini, G. (2007). Computational modeling vs. computational explanation: Is everything a Turing
machine, and does it matter to the philosophy of mind? Australasian Journal of Philosophy, 85,
93–115.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed pro-
cessing model of language acquisition. Cognition, 28, 73–193.
Plebe, A. (2018). The search of “canonical” explanations for the cerebral cortex. History and Philosophy
of the Life Sciences, 40, 40–76.
Plebe, A., & De La Cruz, V. M. (2018). Neural representations beyond “plus X”. Minds and Machines,
28, 93–117.
Plebe, A., & Domenella, R. G. (2007). Object recognition by artificial cortical maps. Neural Networks,
20, 763–780.
Plebe, A., & Grasso, G. (2016). The brain in silicon: History, and skepticism. In F. Gadducci & M.
Tavosanis (Eds.), History and philosophy of computing (pp. 273–286). Berlin: Springer.
Polak, E. (1971). Computational methods in optimization: A unified approach. New York: Academic
Press.
Protopapas, A. D., Vanier, M., & Bower, J. M. (1998). Simulating large networks of neurons. In C. Koch
& I. Segev (Eds.), Methods in neuronal modeling from ions to networks (2nd ed.). Cambridge:
MIT Press.
Psillos, S. (2000). The present state of the scientific realism debate. British Journal for the Philosophy of
Science, 51, 705–728.
Putnam, H. (1978). Meaning and the moral sciences. London: Routledge.
Quinlan, P. (1991). Connectionism and psychology. Hemel Hempstead: Harvester Wheatshaft.
Rabiner, L. R., & Gold, B. (1975). Theory and application of digital signal processing. Englewood Cliffs:
Prentice Hall.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale,
high-resolution comparison of the core visual object recognition behavior of humans, monkeys,
and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38, 7255–7269.
Rall, W. (1957). Membrane time constant of motoneurons. Science, 126, 454.
Rall, W. (1964). Theoretical significance of dendritic trees for neuronal input-output relations. In R. F.
Reiss (Ed.), Neural theory and modeling (pp. 73–97). Stanford: Stanford University Press.
Rall, W. (1969). Time constants and electrotonic length of membrane cylinders and neurons. Biophysical
Journal, 9, 1483–1508.
Rall, W., & Shepherd, G. M. (1968). Theoretical reconstruction of field potentials and dendrodendritic
synaptic interactions in olfactory bulb. Journal of Neurophysiology, 31, 884–915.
Ramón y Cajal, S. (1917). Recuerdos de mi vida (Vol. II). Madrid: Imprenta y Librería de Nicolás Moya.
Ramsey, W., Stich, S. P., & Rumelhart, D. E. (Eds.). (1991). Philosophy and connectionist theory.
Mahwah: Lawrence Erlbaum Associates.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A compre-
hensive review. Neural Computation, 29, 2352–2449.
Reichenbach, H. (1938). Experience and prediction: An analysis of the foundations and the structure of
knowledge. Chicago: Chicago University Press.
Richardson, M., Burges, C.J., & Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain
machine comprehension of text. In Conference on Empirical Methods in Natural Language Pro-
cessing (pp. 193–203).
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neu-
roscience, 2, 1019–1025.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics,
22, 400–407.
Robinson, L., & Rolls, E. T. (2015). Invariant visual object recognition: Biologically plausible
approaches. Biological Cybernetics, 109, 505–535.
Rolls, E. (2016). Cerebral cortex: Principles of operation. Oxford: Oxford University Press.
Rolls, E., & Deco, G. (2002). Computational neuroscience of vision. Oxford: Oxford University Press.
Rolls, E. T., & Stringer, S. M. (2006). Invariant visual object recognition: A model, with lighting invari-
ance. Journal of Physiology, 100, 43–62.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organisation in
the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics: Perceptron and the theory of brain mechanisms.
Washington (DC): Spartan.
Rosenfeld, A. (1969). Picture processing by computer. New York: Academic Press.
Rosenfeld, A., & Kak, A. C. (1982). Digital picture processing (2nd ed.). New York: Academic Press.
Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. In
Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications
(pp. 1–34). Mahwah: Lawrence Erlbaum Associates.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating
errors. Nature, 323, 533–536.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the
microstructure of cognition. Cambridge: MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale
visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Sachan, M., Dubey, A., Xing, E.P., & Richardson, M. (2015). Learning answer-entailing structures for
machine comprehension. In Annual Meeting of the Association for Computational Linguistics
(pp. 239–249).
Safran, I., & Shamir, O. (2017). Depth-width tradeoffs in approximating natural functions with neural
networks. CoRR arXiv:abs/1610.09887.
Sánchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image
classification. In Proceedings of IEEE International Conference on Computer Vision and Pat-
tern Recognition (pp. 1665–1672).
Sanders, J., & Kandrot, E. (2014). CUDA by example: An introduction to general-purpose GPU pro-
gramming. Reading: Addison Wesley.
Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran,
B., et al. (2017). English conversational telephone speech recognition by humans and machines.
In Conference of the International Speech Communication Association (pp. 132–136).
Schickore, J., & Steinle, F. (Eds.). (2006). Revisiting discovery and justification: Historical and philo-
sophical perspectives on the context distinction. Berlin: Springer.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61,
85–117.
Schmidt, M., Roux, N. L., & Bach, F. (2017). Minimizing finite sums with the stochastic average gra-
dient. Mathematical Programming, 162, 83–112.
Shannon, C. (1950). Programming a computer for playing chess. Philosophical Magazine, 41,
256–275.
Shea, N. (2014). Exploitable isomorphism and structural representation. Proceedings of the Aristotelian
Society, 114, 123–144.
Shepherd, G. M. (1988). A basic circuit for cortical organization. In M. S. Gazzaniga (Ed.), Perspectives
on memory research (pp. 93–134). Cambridge: MIT Press.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering
the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
Simon, H. A. (1977). Models of discovery. Dordrecht: Reidel Publishing Company.
Simon, H. A. (1996). The sciences of the artificial (3rd ed.). Cambridge: MIT Press.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recogni-
tion. CoRR arXiv:abs/1409.1556.
Slavenburg, G.A., Rathnam, S., & Dijkstra, H. (1996). The Trimedia TM-1 PCI VLIW media processor.
In Hot Chips Symposium.
Stigliani, A., Jeska, B., & Grill-Spector, K. (2017). Encoding model of temporal processing in human
visual cortex. Proceedings of the National Academy of Sciences USA, 114, E11047–E11056.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous func-
tions on compactifications. Neural Networks, 12, 467–477.
Stinchcombe, M., & White, H. (1989). Universal approximation using feedforward networks with non-
sigmoid hidden layer activation functions. In Proceedings International Joint Conference on Neu-
ral Networks, San Diego (CA) (pp. 613–617).
Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views
of 3D objects. Neural Computation, 14, 2585–2596.
Stringer, S. M., Rolls, E. T., & Tromans, J. M. (2007). Invariant object recognition with trace learning and
multiple stimuli present during training. Network: Computation in Neural Systems, 18, 161–187.
Stueckelberg, E., & Petermann, A. (1953). La normalisation des constantes dans la théorie des quanta.
Helvetica Physica Acta, 26, 499–520.
Swoyer, C. (1991). Structural representation and surrogative reasoning. Synthese, 87, 449–508.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabi-
novich, A. (2015). Going deeper with convolutions. In Proceedings of IEEE International Confer-
ence on Computer Vision and Pattern Recognition (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architec-
ture for computer vision. In Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition (pp. 2818–2826).
Tacchetti, A., Isik, L., & Poggio, T. A. (2018). Invariant recognition shapes neural representations of
visual input. Annual Review of Vision Science, 4, 403–422.
Tan, K. H., & Lim, B. P. (2018). The artificial intelligence renaissance: Deep learning and the road to
human-level machine intelligence. APSIPA Transactions on Signal and Information Processing,
7, e6.
Theeten, J., Duranton, M., Maudit, N., & Sirat, J. (1990). The l-neuro chip: A digital VLSI with an on-
chip learning mechanism. In Proceedings of International Neural Network Conference (pp. 593–
596). Kluwer Academic.
Thomson Kelvin, W. (1855). On the theory of the electric telegraph. Proceedings of the Royal Society of
London, 7, 382–399.
Traub, R. D. (1977). Motorneurons of different geometry and the size principle. Biological Cybernetics,
25, 163–176.
Traub, R. D. (1979). Neocortical pyramidal cells: A model with dendritic calcium conductance repro-
duces repetitive firing and epileptic behavior. Brain Research, 173, 243–257.
Tripp, B.P. (2017). Similarities and differences between stimulus tuning in the inferotemporal visual
cortex and convolutional networks. In International Joint Conference on Neural Networks (pp.
3551–3560).
Trischler, A., Ye, Z., Yuan, X., He, J., Bachman, P., & Suleman, K. (2016). A parallel–hierarchical model
for machine comprehension on sparse data. CoRR arXiv:abs/1603.08884.
Turing, A. (1948). Intelligent machinery. Tech. rep., National Physical Laboratory, London. Reprinted in
D. C. Ince (Ed.), Collected works of A. M. Turing: Mechanical intelligence, Edinburgh University
Press, 1969.
Ullman, S., Harari, D., & Dorfman, N. (2012). From simple innate biases to complex visual concepts.
Proceedings of the National Academy of Sciences USA, 109, 18215–18220.
Van Essen, D. C. (2003). Organization of visual areas in macaque and human cerebral cortex. In L. Cha-
lupa & J. Werner (Eds.), The visual neurosciences. Cambridge: MIT Press.
Van Essen, D. C., & DeYoe, E. A. (1994). Concurrent processing in the primate visual cortex. In M. S.
Gazzaniga (Ed.), The cognitive neurosciences. Cambridge: MIT Press.
VanRullen, R. (2017). Perception science in the age of deep neural networks. Frontiers in Psychology, 8,
142.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin,
I. (2017). Attention is all you need. In Advances in neural information processing systems (pp.
6000–6010).
Veselý, K., Ghoshal, A., Burget, L., & Povey, D. (2013). Sequence-discriminative training of deep neural
networks. In Conference of the International Speech Communication Association (pp. 2345–2349).
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015
MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 39, 652–663.
Volterra, V. (1930). Theory of functionals and of integral and integro-differential equations. London:
Blackie & Son. (Translation by M. Long).
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kyber-
netik, 14, 85–100.
von Economo, C., & Koskinas, G. N. (1925). Die Cytoarchitektonik der Hirnrinde des erwachsenen
Menschen. Berlin: Springer.
Wallis, G., & Rolls, E. (1997). Invariant face and object recognition in the visual system. Progress in
Neurobiology, 51, 167–194.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences.
Ph.D. thesis, Harvard University.
Werbos, P. (1994). The roots of backpropagation: From ordered derivatives to neural networks. New
York: Wiley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of stationary time series. New York:
Wiley.
Williams, T., & Li, R. (2018). Wavelet pooling for convolutional neural networks. In International Con-
ference on Learning Representations.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-
organization. Proceedings of the Royal Society of London, B194, 431–445.
Wilson, K. G., & Kogut, J. (1974). The renormalization group and the 𝜖 expansion. Physics Reports, 12,
75–199.
Wu, E., & Liu, Y. (2008). Emerging technology about GPGPU. In IEEE Asia Pacific Conference on Cir-
cuits and Systems (pp. 618–622).
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Perfor-
mance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings
of the National Academy of Sciences USA, 111, 8619–8624.
Yang, Y., Tarr, M. J., Kass, R. E., & Aminoff, E. M. (2018). Exploring spatio-temporal neural dynamics
of the human visual cortex. bioRxiv 422576.
Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739–767.
Zhou, J., Cao, Y., Wang, X., Li, P., & Xu, W. (2016). Deep recurrent models with fast-forward connec-
tions for neural machine translation. Transactions of the Association for Computational Linguis-
tics, 4, 371–383.
Zhu, H., Wei, F., Qin, B., & Liu, T. (2018). Hierarchical attention flow for multiple-choice reading com-
prehension. In AAAI Conference on Artificial Intelligence (pp. 6077–6084).
Ziman, J. (2000). Real science: What it is and what it means. Cambridge: Cambridge University Press.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.