Alessio Plebe, Giorgio Grasso - The Unbearable Shallow Understanding of Deep Learning
https://doi.org/10.1007/s11023-019-09512-8
ORIGINAL PAPER
Received: 10 April 2019 / Accepted: 28 November 2019 / Published online: 12 December 2019
© Springer Nature B.V. 2019
Abstract
This paper analyzes the rapid and unexpected rise of deep learning within Artificial Intelligence and its applications. It tackles the possible reasons for this remarkable success, providing candidate paths towards a satisfactory explanation of why it works so well, at least in some domains. A historical account is given of the ups and downs which have characterized neural networks research and its evolution from "shallow" to "deep" learning architectures. A precise account of "success" is given, in order to sieve out aspects pertaining to marketing or the sociology of research, and the remaining aspects seem to certify a genuine value of deep learning, calling for explanation. The two alleged main propelling factors for deep learning, namely computing hardware performance and neuroscience findings, are scrutinized, and evaluated as relevant but insufficient for a comprehensive explanation. We review various attempts that have been made to provide mathematical foundations able to justify the efficiency of deep learning, and we deem this the most promising road to follow, even if the current achievements are too scattered and relevant only to very limited classes of deep neural models. The authors' take is that most of what can explain why deep learning works at all, and even very well, across so many domains of application is still to be understood, and further research addressing the theoretical foundations of artificial learning is still very much needed.
* Alessio Plebe
aplebe@unime.it
Giorgio Grasso
gmgrasso@unime.it
1 Department of Cognitive Science, Università degli Studi di Messina, via Concezione 8, Messina 98121, Italy
1 Introduction
The most dramatic shift in computing in the last few years is due to a family of techniques collected under the name of deep learning (Schmidhuber 2015), the latest evolution of the idea of artificial neural networks organized in layers (Rumelhart and McClelland 1986). By adopting deep learning techniques, computers are endowed with the capability of acting without being explicitly programmed, constructing algorithms that adapt their functions from data, producing decisions or predictions. Deep learning is responsible for the current AI Renaissance (Tan and Lim 2018), the fast resurgence of Artificial Intelligence (AI) after several decades of slow and unsatisfactory advances. Recently, thanks to the vast success of deep learning, AI has appeared as a highlight on the covers of journals such as Science (July 2015), Nature (January 2016), and The Economist (May 2015). Worldwide investment in private companies focused on AI increased from $589 million in 2012 to over $5 billion in 2016 (from the CB Insights database). The AI economy is dominated by deep learning, which, according to the McKinsey Global Institute (Chui et al. 2018), accounts for about 40% of the annual value potentially created by all analytics techniques, and could enable the creation of between $3.5 trillion and $5.8 trillion in value annually.
During this decade a number of exciting results have stirred a great deal of attention towards deep learning; we report here just a couple. In 2012, the group at the University of Toronto led by Geoffrey Hinton, the inventor of deep learning, won the most challenging image classification competition. Hinton was soon invited by Google, which adopted deep learning for its image search engine. In 2016, the company DeepMind, founded by Demis Hassabis and soon acquired by Google, defeated the world champion of Go, the Chinese board game far more complex than chess (Silver et al. 2016). The leading Internet companies were among the first to employ deep learning at scale (Hazelwood et al. 2018) and are also the largest investors in research, well beyond their own applications. The release of deep learning programming frameworks such as Google's TensorFlow (Abadi et al. 2015), Facebook's PyTorch (Ketkar 2017), and Apache's MXNet (Chen et al. 2015) boosted the deployment of deep learning in a vast range of applications (Liu et al. 2017; Jones et al. 2017).
The remarkable performances of deep learning were totally unexpected. As we will describe in Sect. 2.1, it is a derivation from artificial neural networks, a field that was stagnating at the beginning of this century. After decades of intense development, briefly summarized in Sect. 2, artificial neural networks seemed to have exhausted their potential, and how relatively small adjustments have led to such an impressive resurrection remains largely unexplained. One may argue that deep learning is, after all, a representative component of AI, which typically suffers recurrent periods of excessive enthusiasm, followed by rounds of disappointment known as "AI winters". Therefore, it might be that deep learning is just enjoying its evanescent turn of summertime hype. There are certainly sociological and marketing factors contributing to the current fortunate period of deep learning; we try in Sect. 3.1 to distill, out of the various facets of its
Deep learning evolved from artificial neural networks, whose very name reveals that neuroscience must have constituted, to some extent, a context for the development of this technique. We now try to assess the role that neuroscience has played in the rise of deep learning. More historical details on the interplay between neuroscience and computing can be found in Plebe and Grasso (2016).
The extraordinary developments of neuroscience at the beginning of the twentieth century (Ramón y Cajal 1917; von Economo and Koskinas 1925; Lorente de Nó 1938) were very influential in many fields of knowledge of that time. Before the existence of digital computers, a highly celebrated paper (McCulloch and Pitts 1943) came up with a bold connection between neurobiology and mathematical logic. Reportedly (Anderson and Rosenfeld 2000, p. 3), the work of Leibniz (1666) was the first source of inspiration for Pitts, in the fascinating enterprise of explicating thinking with logic at the physiological level. Moreover, Pitts strengthened his skills in logic studying with Rudolf Carnap, and tried to adopt his specific formalism (Carnap 1938) in the paper with McCulloch. Even if this paper did not advance neuroscience, and was basically flawed in its interpretation of neural behavior (Lettvin et al. 1959), it has been highly influential on the forthcoming field of artificial neural networks (Piccinini 2004).
In the brave new world of computing, Turing (1948) himself was the first to advance the idea that computers could be designed borrowing hints from biological neurons. He envisioned a machine based on distributed interconnected elements, called the B-type unorganized machine. Turing's neurons were simple NAND gates with two inputs, randomly interconnected; each NAND input could be connected or disconnected, so that a learning method could "organize" the machine by modifying the connections. His idea of learning generic algorithms by reinforcing useful links and cutting useless ones was the most farsighted part of this report, anticipating the empiricist approach characteristic of deep learning. Not so farsighted was his employer at the National Physical Laboratory, where the report was produced, who dismissed the work as a "schoolboy essay". As a consequence, the report remained hidden for decades, until rescued by Copeland and Proudfoot (1996).
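Turing's proposal is concrete enough to be sketched in a few lines of code. The following toy example is our own reconstruction, not Turing's notation: four two-input NAND gates in a fixed topology, where every gate input has a switch, and "organizing" the machine means searching for the switch setting that makes the circuit compute a desired function, here exclusive-or.

```python
import itertools

def nand(a, b):
    # Turing's B-type units are two-input NAND gates
    return 1 - (a & b)

def machine(switches, x1, x2):
    """Evaluate a tiny 4-gate circuit; a disabled input is held at 1."""
    inp = lambda value, on: value if on else 1
    g1 = nand(inp(x1, switches[0]), inp(x2, switches[1]))
    g2 = nand(inp(x1, switches[2]), inp(g1, switches[3]))
    g3 = nand(inp(x2, switches[4]), inp(g1, switches[5]))
    return nand(inp(g2, switches[6]), inp(g3, switches[7]))

def organize(target):
    """'Learning' = searching over connection switches (exhaustive here)."""
    for sw in itertools.product((0, 1), repeat=8):
        if all(machine(sw, a, b) == target(a, b)
               for a, b in itertools.product((0, 1), repeat=2)):
            return sw
    return None

xor_switches = organize(lambda a, b: a ^ b)
```

With all switches closed the circuit is the classic four-NAND realization of XOR, so the search is guaranteed to find at least one solution; Turing of course envisaged modification of random interconnections rather than exhaustive search.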
Under the influence of the newborn cognitive science, early AI opposed empiricism in favor of a rationalist approach, as in problem solving algorithms (Newell and Simon 1972), and research on artificial neural networks was marginalized (Minsky and Papert 1969). It was only in the late '80s that artificial neural networks found their way, with the PDP (Parallel Distributed Processing) project of Rumelhart and McClelland (1986). The basic "parallel distributed" structure is made of simple units organized into distinct layers, with unidirectional connections between each layer and the next. This structure, known as a feedforward network, is preserved in most deep learning models. The values of the units, affectionately called "neurons", are computed with the following equations:
Fig. 1 Examples of artificial neural feedforward networks: in (a) a "shallow" network with three layers, in (b) a "deep" network. In network (a) the mathematical symbols correspond to those used in Eq. (1) in the text
$$
\begin{aligned}
\mathbf{x}_1 &= \mathbf{A}^{(I)} \mathbf{x} + \mathbf{b}^{(I)}, \\
\hat{f}(\mathbf{x}) &= \mathbf{A}^{(O)} \mathbf{x}_N + \mathbf{b}^{(O)}, \\
x_{i,k} &= h\left( \mathbf{w}_{i,k} \cdot \mathbf{x}_{i-1} - \theta_{i,k} \right), \qquad 1 < i \le N.
\end{aligned}
\tag{1}
$$
The first layer, ruled by Eq. (1), simply provides input values to the network, normalized with the linear operators 𝐀(I) and 𝐛(I). The top layer is where the output data appear. The entire feedforward network can be expressed as a function f̂(𝐱) of the input vector 𝐱 and, as described by Eq. (1), its output is normalized again to meet the desired data range. The real dirty work is done by the layers in between, according to Eq. (1). In a layer i, each unit xi,k sums up all the values from the previous level, weighted by the parameters 𝐰i,k, and the result is modified by a non-linear activation function h(⋅), such as the sigmoid or the hyperbolic tangent. A sketch illustrating the computation in feedforward networks is provided in Fig. 1a.
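Eq. (1) can be transcribed almost literally into code. The sketch below is a minimal NumPy rendering of our own; the layer sizes and the choice of tanh as h(⋅) are arbitrary illustrations, not prescriptions from the text.

```python
import numpy as np

def feedforward(x, A_in, b_in, hidden, A_out, b_out, h=np.tanh):
    """Evaluate the layered network of Eq. (1).

    `hidden` is a list of (W, theta) pairs, one per hidden layer:
    the rows of W are the weight vectors w_{i,k}, theta the thresholds.
    """
    xi = A_in @ x + b_in          # input layer: x_1
    for W, theta in hidden:       # hidden layers: x_i = h(W x_{i-1} - theta)
        xi = h(W @ xi - theta)
    return A_out @ xi + b_out     # output layer: f_hat(x)

# A "shallow" network as in Fig. 1a: 2 inputs, one hidden layer of 3 units
rng = np.random.default_rng(0)
net = dict(
    A_in=np.eye(2), b_in=np.zeros(2),
    hidden=[(rng.normal(size=(3, 2)), np.zeros(3))],
    A_out=rng.normal(size=(1, 3)), b_out=np.zeros(1),
)
y = feedforward(np.array([0.5, -1.0]), **net)
```

A "deep" network as in Fig. 1b is obtained simply by putting more (W, theta) pairs into the `hidden` list.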
Parallel distributed processing reestablished a strong empiricist account, with models that learned any possible meaningful function from scratch, just by experience. The success of PDP was largely due to an efficient mathematical rule, known as backpropagation, for adapting the connections between units from examples of the desired mapping between known inputs and outputs. Let 𝐰 be the vector of all learnable parameters in a network, such as 𝐰i,k in Eq. (1), and L(𝐱, 𝐰) a measure of the error of the network with parameters 𝐰 when applied to the sample 𝐱; backpropagation updates the parameters iteratively, according to the following formula:
$$
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla_{\mathbf{w}} L\left( \mathbf{x}_t, \mathbf{w}_t \right)
\tag{2}
$$
where t spans all available samples 𝐱t, and 𝜂 is the learning rate. Since the learning rate is typically small, Eq. (2) produces a tiny modification of all parameters 𝐰, such that the error made by the neural network on the current 𝐱t is slightly reduced. By iterating this procedure many times, the network gradually converges to an approximation of the unknown function sampled at 𝐱t. We will delve into
historical details of this cornerstone learning method in Sect. 6.2. The mathematics
of learning in deep networks is an evolution and a refinement of the same backprop-
agation rule for learning in PDP models, and in fact Geoffrey Hinton himself was
one of the main contributors to the PDP project (Hinton et al. 1986).
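The effect of the update rule of Eq. (2) is easy to see on a toy problem. In the sketch below (our own minimal example, not from the PDP literature) the "network" is a single weight w, the function to be learned is f(x) = 3x, and L is the squared error, whose gradient is written out by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to be learned (unknown to the "network"): f(x) = 3x
samples = rng.uniform(-1, 1, size=200)
targets = 3.0 * samples

w = 0.0    # the single learnable parameter
eta = 0.1  # the learning rate

# Eq. (2), one sample at a time, with L(x, w) = (w*x - f(x))^2,
# whose gradient with respect to w is 2*(w*x - f(x))*x
for x, y in zip(samples, targets):
    grad = 2.0 * (w * x - y) * x
    w = w - eta * grad
```

After this single pass over the samples, w has moved from 0 to very nearly 3. Real backpropagation computes the same kind of gradient for every weight of every layer, propagating error terms backwards through the chain rule.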
The "deep" addition to the PDP style of feedforward network lies just in the number of layers between the input and output layers, usually called "hidden" layers. Neural models can learn increasingly complex functions by augmenting the number of units; this way, however, the number of parameters to optimize increases as well, and learning becomes more difficult. In particular, it was observed that increasing the number of units by adding layers was much less efficient than increasing the width of a single hidden layer, as reported, for example, by de Villiers and Barnard (1992):
We have found no difference in the optimal performance of three- and four-
layered networks [...] four layer networks are more prone to the local min-
ima problem during training [...] The above points lead us to conclude that
there seems to be no reason to use four layer networks in preference to three
layers nets in all but the most esoteric applications.
In Fig. 1 two examples of networks are juxtaposed: the one in (a) respects the golden rule of no more than one hidden layer; the one in (b) disregards this rule, possibly aiming at esoteric applications, as conceded by de Villiers and Barnard.
The dogma of no more than three layers was broken by Hinton and Salakhutdinov (2006), with a novel learning strategy called the deep belief network. The idea derived from a neural architecture called Boltzmann Machines (Aarts and Korst 1989), in which neurons have binary values that can change stochastically, with probability given by the contributions of the other connected neurons. Boltzmann Machines adapt their connections in an unsupervised way, through a sort of energy minimization; this is the reason for the dedication to the great Austrian physicist. Hinton's clever trick was to take two adjacent layers in a feedforward network and train them as a Boltzmann Machine. The procedure starts with the input and the first hidden layer, so that the inputs of the dataset can be used to train the unsupervised Boltzmann Machine model. Then, this model is used to generate a new dataset, just by processing all the inputs. This new set is used to train the next pair of layers. This procedure is a sort of pre-training that gives a first shape to all the connections in the network, to be further refined by ordinary backpropagation using both the inputs and the known outputs of the dataset.
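The layer-by-layer recipe can be sketched as follows. The code below is a strongly simplified illustration of our own: it uses restricted Boltzmann machines trained with one-step contrastive divergence (the actual ingredient of Hinton and Salakhutdinov's deep belief networks), ignores bias terms entirely, and all sizes, rates, and epoch counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=20, lr=0.05):
    """One-step contrastive divergence for a (bias-free) restricted
    Boltzmann machine: hidden states are sampled stochastically, the
    visible layer is reconstructed, and weights move towards the data."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.1, size=(n_visible, n_hidden))
    for _ in range(epochs):
        for v in data:
            ph = sigmoid(v @ W)                     # hidden probabilities
            hid = (rng.random(n_hidden) < ph) * 1.0 # stochastic binary states
            pv = sigmoid(W @ hid)                   # reconstruction of v
            ph2 = sigmoid(pv @ W)
            W += lr * (np.outer(v, ph) - np.outer(pv, ph2))
    return W

# Greedy pre-training: train the first pair of layers on the raw inputs,
# then use its hidden activations as a "new dataset" for the next pair.
data = (rng.random((50, 6)) < 0.5) * 1.0
W1 = train_rbm(data, n_hidden=4)
hidden_data = sigmoid(data @ W1)
W2 = train_rbm(hidden_data, n_hidden=2)
```

Each weight matrix then initializes one pair of layers of the feedforward network, before the final supervised refinement by backpropagation.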
Here we can dispense with the mathematical details of the deep belief network, because this first success in training deep networks boosted research, leading to simpler yet effective solutions. The most popular in deep learning is just a slight modification of backpropagation, given in the following equation:
Fig. 2 Examples of artificial neural networks different from feedforward revitalized as deep. On the left, the basic twin layers of deep convolutional neural networks, with first a convolution, in this example with just two different kernels, followed by pooling for dimensional reduction. On the right, the basic organization of a recurrent neural network; in this example the input vector has three components, and two consecutive time steps are shown. The hidden layer at time step 2 collects, in addition to the contributions from the input layer, its own values at the previous time step, with their own weights (for clarity only connections from the leftmost unit in the hidden layer are shown)
$$
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \frac{1}{M} \sum_{i=1}^{M} \nabla_{\mathbf{w}} L\left( \mathbf{x}_i, \mathbf{w}_t \right)
\tag{3}
$$
where, instead of computing the gradient over a single sample t, a stochastic estimation is made over a random subset of size M of the entire dataset, and at each iteration step t a different subset of the same size is sampled. This way, the parameters at the next iteration step are determined by the parameters at the previous iteration step, adjusted by the mean sampled gradient over M samples. Despite the strong similarity between Eqs. (2) and (3), the term "backpropagation" is now out of fashion, and techniques related to Eq. (3) are referred to as stochastic gradient descent. More will be said about the climate conducive to the success of this technique in Sect. 6.2.
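The rule of Eq. (3) can be illustrated on the same kind of toy problem as Eq. (2): a single weight w learning f(x) = 3x under squared error. This is a minimal sketch of our own; the dataset size, batch size M, step count, and learning rate are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: a single weight w has to learn f(x) = 3x
X = rng.uniform(-1, 1, size=1000)
Y = 3.0 * X

w, eta, M = 0.0, 0.5, 32   # M is the mini-batch size of Eq. (3)

for step in range(200):
    # a different random subset of size M is sampled at every step
    idx = rng.choice(len(X), size=M, replace=False)
    xb, yb = X[idx], Y[idx]
    # mean gradient of L(x, w) = (w*x - y)^2 over the mini-batch
    grad = np.mean(2.0 * (w * xb - yb) * xb)
    w -= eta * grad
```

Averaging over M samples makes the gradient estimate far less noisy than the single-sample update of Eq. (2), while remaining much cheaper than a pass over the whole dataset.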
The first favourable results in training deep feedforward models paved the road for the revitalization of other old neural models, by adding more hidden layers. This is the case of the DCNN (deep convolutional neural network) for image processing, which will be discussed in detail in Sect. 5.2, where instead of fully connected feedforward units, layers are made of two-dimensional convolutions, as shown in Fig. 2 on the left. Another case is the rejuvenation of the RNN (recurrent neural network). These models are much like feedforward networks, with the addition of recursive connections in the hidden layer that allow for a sort of memory, enabling the processing of temporal data. This architecture is sketched in Fig. 2 on the right. RNNs were first introduced by Elman (1990) for the simulation of psycholinguistic phenomena, and today natural language processing is still the prevailing domain where RNNs are deployed. There are a few variations on the basic RNN, like the LSTM (long short-term memory) (Schmidhuber 2015), which uses additional controls on the recurrent connections, in order to maintain memory of signals over a long span of time. In all its variants and applications, deep learning preserves the main philosophy of radical empiricism: its chances of functioning depend entirely on learning from experience.
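The recurrence that gives RNNs their memory can be sketched in a few lines. This is a bare-bones Elman-style layer of our own; the three-component input matches the example of Fig. 2, everything else (sizes, tanh, random weights) is an arbitrary choice.

```python
import numpy as np

def rnn_run(inputs, W_in, W_rec, h=np.tanh):
    """Run a minimal Elman-style recurrent layer over a sequence.

    At every time step the hidden layer combines the current input with
    its own values from the previous step, through the recurrent weights.
    """
    state = np.zeros(W_rec.shape[0])
    states = []
    for x in inputs:
        state = h(W_in @ x + W_rec @ state)  # recurrence = memory of the past
        states.append(state)
    return states

rng = np.random.default_rng(0)
W_in = rng.normal(size=(4, 3))    # 3-component input, 4 hidden units
W_rec = rng.normal(size=(4, 4))   # recurrent connections

a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
s_ab = rnn_run([a, b], W_in, W_rec)
s_ba = rnn_run([b, a], W_in, W_rec)
```

Because of the recurrent term, the final state after seeing the sequence [a, b] differs from the one after [b, a]: the hidden layer encodes the history of its inputs, which is exactly the "sort of memory" mentioned above.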
In this section we try to assess the degree of success of deep learning, arguing that it is so extraordinary as to deserve a thorough explanation. The requirement for an explanation emerges from the coincidence of the intensity of the success on one hand, and the apparent lack of a drastic shift in technology on the other. As seen in the previous sections, deep learning appears in close continuity with technologies existing 40 years ago. As we will try to show now, its success has been abrupt and intense. It is commonplace among scholars of scientific discovery (Peirce 1935; Hanson 1958; Simon 1977) that the quest for explanation typically stems from the observation of surprising facts. In the case of deep learning too, the surprise arose from the coincidence just described.
3.1 Aspects of success
Even if the first two points might be of a certain relevance for deep learning (which indeed is exploited in predictions and controls) it is not a theory of any part of the world. Still, Laudan proves useful here, because his conceptual account of "success" is precisely relevant in our case. Laudan has stressed that success is not a valuational or a normative concept, but should be handled as a relational concept. In the case of deep learning too, we will try to capitalize on Laudan's account, by relativizing aspects of "success" within some relevant contexts. In the case of a scientific theory, the relation against which success is evaluated is the set of relevant goals, such as those listed above. Activities that do not qualify as science in search of a theory have different goals, and this distinction has been much discussed in the philosophy of science. For Niiniluoto (1993) the distinction between basic and applied research can best be drawn in terms of "utilities", which we can take as almost synonymous with Laudan's goals. Basic science is characterized as the attempt to maximize "epistemic utilities", for Niiniluoto specifically "truthlikeness", the combination of truth and information. Applied research is further differentiated into technology and applied science. For technology the relevant utilities are practical, first of all effectiveness relative
to the intended use, plus other possible utilities such as economic, ergonomic, aesthetic, or ethical ones. Applied science can be evaluated both in terms of epistemic and practical utilities, plus utilities of its own such as simplicity or manageability. A tripartite framework is also proposed by Hendricks et al. (2000) in terms of pure science, applied science and engineering science, whose equivalent of Laudan's goals and Niiniluoto's utilities are "values". Prominent values for pure science are truth and explicit justification, but could also include simplicity, unification, and consistency. Conversely, salient values for engineering science are efficiency and practical usefulness. Applied science shares some values of pure science and engineering science.
Where does deep learning fit into this picture? The most straightforward and simple answer is that deep learning is a pure engineering effort, and that its goals therefore include practical but not epistemic utilities. It would then be correct to evaluate the success of deep learning against practical utilities only. Behind this simple answer, however, lies a set of more subtle considerations that will be dealt with in Sect. 3.2, where possible epistemic utilities will be evaluated as well.
In the introduction we mentioned some of the more superficial aspects of success, such as gaining headlines in famous magazines. Although these are certainly the least measurable and least significant aspects of success, they still contribute to the character of "surprise" of the rise of deep learning, mentioned at the beginning of this section.
A first measurable aspect of pragmatic success is the trend of scientific publications (Hemlin 1996). Ziman (2000, p. 258) suggested that, ideally, the best account of "scientific knowledge" is the accumulated archive of publications, therefore "scientific progress can be directly measured by the growth of the archive". However, we know how far from ideal the peer review process is (Cicchetti 1991). Moreover, the volume of scientific production derives from competition for intellectual and economic resources, driven by a variety of considerations related to research and technology policies. Nevertheless, bibliometry is supposed to provide at least a rough and partial measure of scientific success (Daniel 2005). In order to relativize this measure, we analyze the trend of publications on deep learning together with other components of AI. As discussed in Sect. 2, the two most outstanding components correspond roughly to the well established philosophical traditions of rationalism and empiricism, with deep learning within the latter. Alongside rationalism and empiricism, other philosophical guises can be found in AI, such as Darwinism, followed by Holland (1975), who invented genetic algorithms (GA) in the 1960s. For years GA remained marginal, developed only by Holland's group at Ann Arbor, and were not even considered part of AI, but in the late 1990s they gradually gained attention, becoming an important component of AI (Booker et al. 2005).
Because of this plurality of components, the progress of AI over time sums up different trends, with a periodic alternation of progression and stagnation between components. This phenomenon is well observed for the two main constituents of AI, the rationalist and the empiricist traditions. When rationalism was ascendant, as between 1970 and 1990, or between 2000 and 2005, empiricism languished; during rationalist stagnation, as between 1990 and 2000, empiricism thrived. Now it is empiricism's fortunate time again, and rationalism languishes.
Fig. 3 Yearly number of publications from 1960 to 2018, searching Google Scholar for the following
keywords: “artificial intelligence” (AI), “artificial neural networks” (ANN), “expert systems” (ES),
“genetic algorithms” (GA), and “deep learning” (DL)
Table 1 Performances of deep learning on several benchmark tasks in the domains of image processing (upper rows) and natural language processing (bottom rows)

| Benchmark | Best non-DL | First DL | Best DL |
|---|---|---|---|
| ILSVRC (Russakovsky et al. 2015) | 25.8 (Sánchez and Perronnin 2011) | 16.4 (Krizhevsky et al. 2012) | 0.02 (Hu et al. 2018) |
| CIFAR-10 (Krizhevsky and Hinton 2009) | 20.0 (Bo et al. 2011) | 11.2 (Cireşan et al. 2012) | 2.9 (Pham et al. 2018) |
| PASCAL-VOC (Everingham et al. 2010) | 64.2 (Cinbis et al. 2012) | 37.6 (Girshick 2015) | 13.2 (Pham et al. 2018) |
| MCTest-500 (Richardson et al. 2013) | 32.2 (Sachan et al. 2015) | 29.0 (Trischler et al. 2016) | 29.0 (Trischler et al. 2016) |
| MT-news-test-En-De (Bojar et al. 2014) | 20.7 [BLEU] (Durrani et al. 2014) | 20.7 [BLEU] (Zhou et al. 2016) | 28.4 [BLEU] (Vaswani et al. 2017) |
| SWITCHBOARD-Hub5 (Godfrey et al. 1992) | 23.9 (Hain et al. 2005) | 12.6 (Veselý et al. 2013) | 5.5 (Saon et al. 2017) |

For each benchmark the best non-neural algorithm, the first deep learning (DL) algorithm, and the current best DL algorithm are compared. Values, when not otherwise specified, are expressed in percentage error
non-neural and the first deep learning solution, and in terms of the improvements achieved in subsequent deep models. The figures in Table 1 certify the success of deep learning in terms of relative performances, indicating also a more rapid and consistent gain in the domain of image processing, compared to natural language processing.
Still, the table provides only a partial picture, for practical reasons. In any field of computing, benchmarks evolve continuously in order to provide increasingly challenging contexts, following technological improvements. For the task of interest here, benchmark updating has been even faster paced in the period around 2012, precisely because of the sudden shift in performances brought about by deep learning. It has to be said that it has not been easy to find benchmark challenges that remained stable for a period covering both non-neural and deep learning winning algorithms. For example, the MCTest of open-domain question answering has now been replaced by the larger and more challenging RACE benchmark (Lai et al. 2017), dominated by neural algorithms (Zhu et al. 2018). Today both ILSVRC and PASCAL VOC have been discontinued, and the current most popular challenges in image processing, like MSCOCO (Vinyals et al. 2016), are hopeless for non-neural methods.
There are several other domains introduced for evaluating and comparing AI technologies in which the competition is now restricted to deep learning algorithms only, owing to their overwhelming superiority. This is the case of gaming, which has close and long-standing ties to AI (Shannon 1950), and has more recently been proposed in the format of modern computer games (Laird and van Lent 2001). We already included among the events of major excitement for deep learning its success at the traditional Chinese game of Go. Advanced interactive computer games, however, are supposed to offer more variety of tasks and domains, thus providing excellent challenges for evaluating the development of general, domain-independent AI technology. The Arcade Learning Environment (Bellemare et al. 2013) is one of the most popular interfaces to computer game environments, set up as a benchmark for AI. A few years after its introduction, a deep learning model developed by DeepMind, called DQN (Deep Q-Network), not only surpassed the performances of all non-neural algorithms on all games in the Arcade collection, but achieved a level comparable to that of a professional human player across most of the games (Mnih et al. 2015). In summary, results over current benchmarks in a variety of domains certify a significant success of deep learning, under the goal of practical utilities.
3.2 Epistemic success
Up to now, we have assumed that only practical utilities pertain to deep learning. Epistemic utilities have been dismissed too easily, however. There are reasons to look at deep learning as a potential source of new knowledge. First of all, AI has long been conceived as a possible method for achieving understanding and predicting the behavior of the mind (Simon 1996). But, above all, deep learning is the descendant of a lineage, artificial neural networks, designed for knowledge. The epistemic ambitions of the PDP project were boldly stated in the subtitle of the book by Rumelhart and McClelland (1986): "Explorations in the Microstructure of Cognition".
The two prevailing attempts to satisfy the desire for an explanation of the success of deep learning point to hardware performance or to neuroscience; in this section we look in some detail at the first one. Allegedly, deep learning succeeded because computational power has increased, allowing the training of models with very many parameters over large datasets, unfeasible at the time of the previous generation of artificial neural networks. In fact, all along the history of artificial neural networks, efforts have been made to design computer hardware optimized for neural computations. This endeavor has always been unsatisfactory, mainly due to the limited advantages of neural hardware and its high cost, compared to standard processors (Plebe and Grasso 2016). On the contrary, a great benefit for neural computation originated, unexpectedly, from technologies conceived for the entertainment industry, specifically for videogaming.
An interesting earlier example of exchange of technologies between artificial neural networks and more mundane applications was given in the mid-1990s by Philips. The L-Neuro chip (Theeten et al. 1990; Maudit et al. 1992) was developed within the European Esprit project Galatea (Alippi and Vellasco 1992) as specialized hardware for neural networks; it had no applicative success, but some of its design concepts were exploited in Philips' TriMedia family of processors, targeted at digital television (Slavenburg et al. 1996).
More recently, a new class of electronic devices, developed for gaming and computer graphics applications, has been exploited for scientific computing and, more generally, parallel computation. This technology, embedded in graphics processing units (GPUs), consists of a specialized electronic circuit, designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in many computing devices, ranging from mobile phones, to personal computers, to workstations and game consoles. Modern GPUs are very efficient at manipulating computer graphics and image processing, and their highly parallel structure makes them more efficient than general-purpose CPUs for algorithms where the processing of large blocks of data is done in parallel. This latter characteristic of GPUs, namely the ability to process large volumes of data in parallel, is very well suited for implementing computational models that are intrinsically parallel, such as those employed for artificial neural networks.
In spite of this very high potential of GPUs for crunching ANN data at very high speed, as late as 2008 a review of potential applications for GPUs other than gaming (Wu and Liu 2008) did not include neural networks at all. One of the most important breakthroughs in the development of GPUs for neural computation, and more generally for scientific computing overall, has been the introduction of the CUDA framework by NVIDIA in 2007. CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to exploit GPUs made by NVIDIA for general purpose processing, an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). The CUDA platform is a software infrastructure that gives direct access to the GPU's parallel processing capabilities, interfacing the virtual instruction set embedded in the graphics processor (Compute Unified Device Architecture) (Sanders and Kandrot 2014).
It was in 2013 that a system developed at Stanford University, with the direct collaboration of NVIDIA, paved the road to a new era in deep learning, powered by what has been called "COTS HPC" (Commodity Off-The-Shelf High Performance Computing). The system was built using 16 servers, each with 4 NVIDIA GTX680 GPUs, and training a deep network with 11.2 billion parameters took less than 1 s for a single mini-batch of 96 images (Coates et al. 2013).
Similar to this contamination of deep learning research with gaming and graphics hardware technology, GPUs have contributed to the acceleration of other fields, such as, for example, that of blockchain frameworks. However, the advantages specific to deep neural networks are such that complete GPU-based computer systems have now been developed, such as NVIDIA Drive PX and Jetson AGX (both released in 2015), enabling deep learning intelligence inside robots, drones and self-driving cars. It is important to note that this shift in GPU design towards neural computation is a consequence of the success of deep learning in applications, described in Sect. 3.1, and therefore cannot be used as an explanation of the origin of the success itself.
530 A. Plebe, G. Grasso
different kinds of computation occurred. One followed the path taken by artificial neural networks; the other served as an effective tool of investigation for neuroscientists. The two paths, as we will see, gradually diverged. One can envisage a partial reconciliation between deep learning and neuroscience in the case of vision, as discussed in Sect. 5.2.
Most, if not all, the protagonists of the developments from PDP up to deep learning
were fond admirers of neuroscience, and certainly were motivated in their research
by the suggestion of capturing, in software, various aspects of the brain. However,
it was a kind of unrequited love. James Bower (Miller and Bower 2013, p. 5) recalls one of the first neural network meetings at Santa Barbara in 1983, where participants
“represented a remarkable mix of scientists and government officials [...] only two
made any claim to being real biologists, myself and Terry Sejnowski. [...] I presented
my work with Matt Wilson modeling the olfactory cortex and I remember distinctly
that it was news to many in the room that synaptic inputs could also be inhibitory.”
During the PDP project, attempts were made to involve neuroscientists in the computational world; a good example was the NIPS (Neural Information Processing Systems) conference series, started in 1986. For James Bower (Miller and Bower 2013, p. 6)
“the neurobiologists, including my friend John Miller, who I had invited to participate in the second NIPS meeting, found most of the talks either irrelevant to neurobiology or naive in their neurobiological claims.” On the other side, in neuroscience there was a genuine interest in exploring, with computations, the nature of the processing tasks executed by nerve cells and systems, an interest that could not be fulfilled by the simple PDP models. For this aim a new field was established, called Computational Neuroscience or sometimes Theoretical Neuroscience (Dayan and Abbott 2001), with its own series of conferences, like CNS, started in 1992 (Miller and Bower 2013).
Before the emergence of PDP-style artificial networks, major advances in neural modeling were achieved by Rall (1957, 1964, 1969). He adapted an equation describing the electric potential in cables as a function of time and space, derived by William Thomson, Lord Kelvin of Largs (Kelvin 1855) for the project of the transatlantic telegraph cable, to much smaller “cables”: dendrites. A first model using the cable equation for “compartments”, idealized cylinders composing dendrites, was created by Rall and Shepherd (1968). Around the same time there were attempts to model the equations of Hodgkin and Huxley (1952) within one single neural compartment (Connors and Stevens 1971), and eventually Traub (1977, 1979) combined compartmental modeling with the Hodgkin–Huxley equations.
A major breakthrough came two decades after Traub's models, with the construction of neural simulators that greatly propelled computational neuroscience: NEURON by Hines and Carnevale (1997) and GENESIS by Bower and Beeman (1998). Both computational frameworks provide environments for implementing in software biologically realistic models of electrical and chemical signaling between neurons. The emergence of the PDP enterprise falls just in between 1977 and 1998, but the two lines of research had
2014 ImageNet challenge, further improved to 6.7% by the Inception (or GoogleNet)
model (Szegedy et al. 2015). An extensive review of variations and improvements in
DCNN can be found in Rawat and Wang (2017).
One of the first attempts to relate the results of DCNNs to the visual system was based on the idea of adding, at a given level of an artificial network model, a layer predicting responses in voxel space, and training this layer on sets of images and corresponding fMRI responses (Güçlü and van Gerven 2014). Using this method, Güçlü and van Gerven (2015) compared a model very similar to AlexNet (Chatfield et al. 2014) with fMRI data, training the mapping to voxels on 1750 images. The model responses were predictive of the voxels in the visual cortex above chance, with a prediction accuracy slightly below 0.5 for area V1, and slightly below 0.3 for area LO. The same technique has been further exploited by generating artificial fMRI data, using stimuli of classical vision experiments, such as simple retinotopy or face/place contrasts, for which good agreement between synthetic fMRI responses and DCNNs was found (Eickenberg et al. 2017).
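The voxel-encoding approach can be illustrated with a minimal sketch. All the data below are synthetic stand-ins (random features and voxel responses), not actual network activations or fMRI recordings; the point is only the shape of the method: a regularized linear map from layer features to voxels, evaluated by per-voxel prediction accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: "layer activations" for a set of images, and
# voxel responses that are noisy linear functions of those activations.
# Shapes and data are illustrative only, not from the cited studies.
n_images, n_features, n_voxels = 200, 50, 10
X = rng.standard_normal((n_images, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.1 * rng.standard_normal((n_images, n_voxels))

# Ridge regression: closed-form fit of the feature-to-voxel mapping.
lam = 1.0
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ Y)

# Per-voxel prediction accuracy: correlation between predicted and
# "measured" responses on held-out images.
X_test = rng.standard_normal((100, n_features))
Y_test = X_test @ W_true + 0.1 * rng.standard_normal((100, n_voxels))
Y_pred = X_test @ W_hat
acc = [np.corrcoef(Y_pred[:, v], Y_test[:, v])[0, 1] for v in range(n_voxels)]
print(round(float(np.mean(acc)), 3))
```

With real data the accuracies are of course far lower, as the values around 0.5 and 0.3 reported above indicate.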
The use of synthetic fMRI data is also pursued by Khan and Tripp (2017), but with a different strategy: constructing a statistical model of the activity in the higher visual cortex by combining a wide range of information from previous studies.
This model allows the interpolation of novel responses as needed for experimental
purposes. Using this method, Tripp (2017) was able to test similarities between cortical responses and DCNN models on various properties: population sparseness; orientation, size, and position tuning; occlusion; clutter; and so on. The DCNNs tested were AlexNet (Krizhevsky et al. 2012) and VGG-16 (Simonyan and Zisserman 2015). The results show some similarities, in particular for sparseness and size tuning, but also differences, including scale and translation invariance, orientation tuning, responses to occlusion, and most of all clutter responses.
An alternative method for comparing DCNN models and fMRI responses was
offered by representational similarity analysis, introduced by Kriegeskorte et al. (2009) and Kriegeskorte (2009). This method can be applied to any sort of distributed
responses to stimuli, computing one minus the correlation between all pairs of stim-
uli. The resulting matrix is especially informative when the stimuli are grouped by
their known categorical similarities. The whole idea is that the responses across the
set of stimuli reflect an underlying space in which reciprocal relations correspond to
relations between the stimuli. This is exactly the idea of structural representations,
one of the fundamental concepts in cognitive science (Swoyer 1991; Gallistel 1990;
O’Brien and Opie 2004; Shea 2014; Plebe and De La Cruz 2018). The representational similarity analysis is applied by Khaligh-Razavi and Kriegeskorte (2014) in comparing responses in the higher visual cortex, measured with fMRI in humans and with cell recording in monkeys, with several artificial models. This study is very interesting because it includes, in addition to AlexNet, a few models with more biological plausibility.
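The core computation of representational similarity analysis is compact enough to sketch. Here the "responses" are synthetic vectors built from two category prototypes (an assumption for illustration, not data from any of the cited studies):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distributed responses: 8 stimuli (rows), responses of 100 units
# (columns). Stimuli 0-3 derive from one prototype pattern, 4-7 from
# another, mimicking two categories; data are synthetic.
proto = rng.standard_normal((2, 100))
responses = np.vstack([proto[i // 4] + 0.3 * rng.standard_normal(100)
                       for i in range(8)])

# Representational dissimilarity matrix: one minus the correlation
# between the responses of every pair of stimuli.
rdm = 1.0 - np.corrcoef(responses)

# Pairs within a category should be less dissimilar than pairs across.
within = np.mean([rdm[i, j] for i in range(4) for j in range(4) if i != j])
between = float(np.mean(rdm[:4, 4:]))
print(within < between)
```

Grouping the rows and columns of such a matrix by known category, as done in the studies above, makes the underlying representational structure visible at a glance.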
As described in Sect. 5.1, models belonging to computational neuroscience are radically different from deep learning; however, they have never reached dimensions and structures large enough to be used as a model of vision, even of early visual cortical areas. Nevertheless, there is a long tradition of research in neural models of vision that, although departing from the precise behavior of biological neurons,
include as much as possible realistic features of the visual system (Riesenhuber and
Poggio 1999; Rolls and Deco 2002; Miikkulainen et al. 2005; Cadieu et al. 2007;
Plebe and Domenella 2007). The study of Khaligh-Razavi and Kriegeskorte (2014)
included two notable models of this category.
The most biologically plausible model is VisNet (Wallis and Rolls 1997; Stringer and Rolls 2002; Rolls and Stringer 2006; Stringer et al. 2007), organized into five layers, whose connectivity approximates the sizes of receptive fields in V1, V2, V4, posterior inferior temporal cortex, and inferior temporal cortex. The network learns by unsupervised self-organization (von der Malsburg 1973; Willshaw and von der Malsburg 1976) with synaptic modifications derived from the Hebb (1949) rule. Learning includes a specific mechanism called trace memory: the learning of a single cell is affected by a decaying trace of the cell's previous activity. This rule attempts to reproduce, in a static network, the natural dynamics of vision, where invariant recognition of objects is learned by seeing them move under various different perspectives.
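A minimal sketch of such a trace rule, with invented parameter values, is the following; it is our paraphrase of the mechanism, not the actual VisNet code:

```python
import numpy as np

# Sketch of a Hebbian trace learning rule in the spirit of Wallis and
# Rolls (1997): the weight update uses a decaying trace of the cell's
# past activity instead of its instantaneous output, so that temporally
# adjacent views of a moving object get bound to the same cell.
# Parameter values are invented for illustration.
alpha, eta = 0.1, 0.8
w = np.zeros(4)
trace = 0.0

# Two "views" of the same object, presented in temporal succession,
# sharing the middle feature.
views = [np.array([1.0, 1.0, 0.0, 0.0]),
         np.array([0.0, 1.0, 1.0, 0.0])]

for x in views:
    y = float(w @ x) + 1.0             # cell activity (offset keeps it active)
    trace = (1 - eta) * y + eta * trace  # decaying trace of past activity
    w += alpha * trace * x             # Hebbian update driven by the trace

print(np.round(w, 3))
```

Because the trace carries activity over from the first view, the weights for the second view are strengthened as well, linking the two views to the same cell.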
The second model included in the study and endowed with some biological plausibility is HMAX (Riesenhuber and Poggio 1999), which resembles the Neocognitron in alternating S-cell layers and C-cell layers, except that the latter select the maximum response only from the connected S-cells. This form of neural selectivity is one among the typical computations performed in biological neural assemblies (Kouh and Poggio 2008). HMAX, like Neocognitron and VisNet, learns by unsupervised self-organization, though the max operation is hardwired.
Khaligh-Razavi and Kriegeskorte (2014) constructed several representational similarity matrices on a set of natural images, spanning multiple animate and inanimate categories, comparing the models (the study actually compared 37 different models, of which only AlexNet, VisNet, and HMAX are of interest here). The analysis revealed that AlexNet was significantly more similar to the IT structural representation of the categorical distinction animate/inanimate than the two more biologically plausible models (and than all the other compared models). The most plausible model, VisNet, scored the worst in matching the IT representational similarity. Other studies using the same comparison techniques (Cadieu et al. 2014; Yamins et al. 2014) compared HMAX and DCNN models in predicting representations in visual areas, and again the DCNN models correlated better with cortical representations.
Recently, the significance of representational similarity analysis has been ques-
tioned, because its pooling over many images misses the variation in difficulty across
images of the same object (Rajalingham et al. 2018). To overcome this limitation,
Rajalingham and co-workers collected a large number (over one million) of behavio-
ral trials of object discrimination tasks in humans and monkeys. The stimuli were 24
objects, each with 100 variations in orientation, position, and background. The same
tasks have been tested on seven DCNNs, including AlexNet, VGG-16, and Inception, and no model was found to be predictive of the behavioral results of humans or monkeys.
More recently, investigations of the similarities between DCNNs and the visual system have also focused on the temporal dimension of the visual process. Of course, a DCNN does not model time, but its hierarchy may correspond to the time elapsed in the biological vision process, and this comparison is made
2003; Markov et al. 2014); receptive field sizes change within a cortical map, and
the degree of change is larger in higher cortical areas (Kay et al. 2013); receptive fields are also modulated by tasks (Klein et al. 2014); scene dynamics affect recognition areas, in addition to motion areas (Stigliani et al. 2017).
Moreover, as highlighted in Sect. 2, deep learning is the most extreme realization
of the empiricist perspective in Artificial Intelligence. Its success may indeed sup-
port a dominant role of experience in the development of cognitive capacities, like
visual recognition. However, a pure learning account of visual capacities is unrealistic; there is evidence that visual learning in infancy combines experience with some sort of innate mechanisms (Ullman et al. 2012).
In sum, it is difficult to sustain the analogy with biological vision as sufficient, or even important, in explaining the performance of deep learning, and of DCNNs in particular. The observation that models with much more biological plausibility, like VisNet and HMAX, score worse than DCNNs in comparisons with responses in the visual system may lead to a very different interpretation. Perhaps the lack of biological plausibility is an important factor for the success of deep learning. By freeing the model from constraints imposed by biological similarity, such as respecting receptive field sizes across layers or implementing plausible learning algorithms, the space of mathematical solutions becomes much wider.
The search for mathematical explanations of neural networks began when the PDP project had spread its results, successful enough to prompt the first of the above questions. Obviously, the second question had to wait for deep learning to exist before being asked, with the benefit of a first set of achieved results. For both questions, the mathematical investigation can follow two different paths:
– Try to characterize the class of functions that can be generated by a neural architecture, often called its expressivity;
– Try to analyze the efficiency of the search done by learning algorithms in the space of the possible functions of a neural architecture, often called generalization.
A first line of research on the mathematical reasons for the success of PDP networks
focused on the class of functions that can be approximated by feedforward networks
with one hidden layer, and sigmoidal activation function (Cybenko 1989; Hornik
et al. 1989; Stinchcombe and White 1989). The most advanced properties of these networks, proved by Stinchcombe (1999), concern the approximation of any continuous function on a compactification of ℝ. A compactification is, in topology, an embedding of a space into a compact one, a space for which every open cover contains a finite subcover. Intuitively, this property rules out functions such as sin(·), which can never be approximated by such networks. These properties demonstrate mathematically the universal approximation power of neural networks; however, most theorems presuppose an arbitrary number of hidden units. These theoretical results provided a good ground for trying to answer the new question: why do deep networks work better than shallow ones?
One of the first results was given by Bianchini and Scarselli (2014a, b), using a topological property of the space of functions generated by neural networks, known as the Betti numbers. The term was introduced by Henri Poincaré after the work of Betti (1872), and, informally, refers to the number of holes of a given dimension on a topological surface. Bianchini and Scarselli found different asymptotic expressions for the Betti numbers of the topology generated by the functions of shallow and deep networks. The analysis is limited to networks with a single output, and the topology is investigated for the set of all 𝐱 for which the output of the network is positive, as typical in a binary classifier. Calling Sn such a set for shallow networks, and Dn for deep ones, with n the overall number of units, the sum of Betti numbers B(·) is ruled by the following equations:
B(Sn) ∈ O(n^D), (4)

B(Dn) ∈ Ω(2^n), (5)

where D is the dimension of the input vector. Equation (4) states that for shallow networks the sum of Betti numbers grows at most polynomially with respect to the number of hidden units n, while from Eq. (5) it turns out that for deep architectures the Betti numbers can grow exponentially in the number of hidden units. Therefore, by increasing the number of hidden units, more complex functions can be generated when the architectures are deep.
Eldan and Shamir (2016) demonstrated that a simple family of functions on ℝd is expressible by a feedforward neural network with two hidden layers but not by a network with one hidden layer, unless its width grows as O(e^d). The same group later extended the results (Safran and Shamir 2017) to hyperspherical and hyperelliptical functions. By using topological tools of analysis, other authors
(Petersen et al. 2018) have found that the set of functions that can be implemented by neural networks of a fixed size has, surprisingly, several undesirable properties. One is that the mapping from network parameters to network function is not inverse stable: in other words, two networks with very close functions may have large differences in their parameters.
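A toy illustration of how loosely network functions constrain parameters, a simpler phenomenon than the Petersen et al. construction, namely the permutation symmetry of hidden units, can be given in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two parameter settings far apart in parameter space that implement
# exactly the same function: permute the hidden units of a
# one-hidden-layer ReLU network. A toy illustration, not the
# construction of Petersen et al. (2018).
W1 = rng.standard_normal((8, 3))   # input -> hidden weights
w2 = rng.standard_normal(8)        # hidden -> output weights

def net(W1, w2, x):
    return w2 @ np.maximum(W1 @ x, 0.0)   # single-output ReLU network

perm = np.arange(8)[::-1]          # reverse the order of the hidden units
W1p, w2p = W1[perm], w2[perm]      # permuted copy: identical function

x = rng.standard_normal(3)
same_function = np.isclose(net(W1, w2, x), net(W1p, w2p, x))
param_distance = np.linalg.norm(W1 - W1p) + np.linalg.norm(w2 - w2p)
print(same_function, param_distance > 1.0)
```

The reverse direction, two nearly identical functions realized by distant parameters, is the inverse-stability failure discussed above.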
class of functions larger than the shallow model, at the same cost in terms of learning effort. This result has later been extended to networks with an arbitrary number of layers, realizing some specific data features such as rotation invariance or sparseness (Guo et al. 2019): deep models can realize more complex features while retaining the same covering numbers as the equivalent shallow models.
Fig. 4 The similarity between the renormalization group applied to the Ising spin model and deep neural Boltzmann Machines. In (a), the bottom shows an example of lattice sites, each with a possible up or down spin; applying the renormalization group yields the top structure, where groups of physical sites are replaced by an abstract spin at a reduced scale. In (b) there is a stack of neural Boltzmann Machines, with the lower one at a finer grain, and the upper one at a reduced resolution. The scale reductions operated by the deep neural network and by the renormalization group are quite alike
The connection established between the two domains of the analogy needs further work, and there are controversial aspects. For example, Lin et al. (2017) argue that the parallel between the renormalization group and Restricted Boltzmann Machines is flawed, because the former is essentially a supervised procedure while neural Boltzmann Machines are unsupervised. A recent work (Iso et al. 2018) compared the flow diagram of the renormalization group for the Ising model with that of Restricted Boltzmann Machines. Flow diagrams connect points in the phase space of a system to which progressive renormalization transformations are applied. In the experiments of Iso and coworkers, the deep neural model generates a flow along which the temperature tends to the critical value, while the renormalization group displays the opposite flow, towards T = 0.
The account of heuristic appraisal proposed by Nickles (2006) is broad in scope, and
addresses aspects such as the economy and the politics of scientific research. Several
of the considerations included in heuristic appraisal may be applicable in an overall evaluation of deep learning; however, we will limit ourselves here to the analysis of what makes deep learning so effective. For this purpose, heuristic appraisal addresses aspects that can be identified as contributing to the success of deep learning, but that lack a definite and relevant context of discovery, and resist a logical-mathematical justification.
Many of the mathematical refinements and variations that continue to improve deep learning derive neither from biological insights nor from theoretical grounds, but are typically heuristic. In some cases the new refinements replace previous strategies that did come from biological inspiration. For example, we described in Sect. 5.2 that selecting the maximum response, instead of the average, at the output of
a convolution bears some biological plausibility, and is the basis of the HMAX model (Riesenhuber and Poggio 1999). This computation, now termed max pooling, is adopted by AlexNet too. A number of alternative expedients have been explored that mix in clever ways the average and the max operations. Among the most successful ones, Lee et al. (2018) add a tree structure on the output of a convolution, combining the values progressively with learned pooling filters; Williams and Li (2018) use a method traditional in computer vision, wavelet decomposition, to reduce data by pooling while preserving local details.
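For reference, the two basic operations that these mixed strategies combine can be sketched with plain array manipulation (2 × 2 windows, stride equal to the window size):

```python
import numpy as np

# Max pooling versus average pooling over non-overlapping 2x2 windows,
# a minimal numpy sketch of the two operations discussed above.
def pool(x, op):
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)  # group into 2x2 windows
    return op(blocks, axis=(1, 3))            # reduce each window

x = np.array([[1., 2., 0., 1.],
              [4., 3., 1., 0.],
              [0., 1., 2., 2.],
              [1., 0., 3., 5.]])

print(pool(x, np.max))    # [[4., 1.], [1., 5.]]
print(pool(x, np.mean))   # [[2.5, 0.5], [0.5, 3.]]
```

Max pooling keeps the strongest response in each window, average pooling smooths over it; the mixed strategies above interpolate between these two behaviors.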
Other strategies adopted layers in the hierarchy different from the standard alternation of convolution and pooling. A significant improvement was achieved by He et al. (2016) with the idea of layers that learn residual functions with reference to their preceding layer inputs. This allows errors to be propagated directly to the preceding units, facilitating learning, and thus the stacking of many more layers, up to 152 in the version that won the competition on ImageNet in 2015.
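The residual idea itself fits in a couple of lines. The sketch below uses random placeholder weights; note that with a zero residual branch the block reduces exactly to the identity, which is what lets very deep stacks preserve the signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the residual idea of He et al. (2016): a block
# computes y = x + F(x), so the layer only has to learn the residual
# F with reference to its input. Weights here are placeholders.
def residual_block(x, W1, W2):
    h = np.maximum(W1 @ x, 0.0)   # inner transformation with ReLU
    return x + W2 @ h             # identity shortcut added back

d = 16
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d))
W2 = np.zeros((d, d))             # residual branch initialized to zero

# With F == 0 the block is exactly the identity mapping.
y = residual_block(x, W1, W2)
print(np.allclose(y, x))   # → True
```

The identity shortcut also gives gradients a direct path to earlier layers, which is the error-propagation benefit mentioned above.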
There are several passages in the literature where heuristics were expected, from a theoretical analysis, to have a negative impact, but the empirical results pointed in the opposite direction, and therefore the heuristics have been kept.
One early example is in the method Hinton developed to train Boltzmann
machines, necessary in his first strategy for training deep architectures (Hinton
and Salakhutdinov 2006), a method called contrastive divergence (Hinton 2002).
Contrastive divergence learning should approximate the minimization of the Kull-
back–Leibler divergence, but a theoretical analysis carried out by Carreira-Perpiñán
and Hinton (2005, p. 40) showed differently:
Our first result is negative: for two types of Boltzmann machine we have
shown that, in general, the fixed points of CD [(Contrastive Divergence)] differ
from those of ML [(Maximum-Likelihood)], and thus CD is a biased algo-
rithm. This might suggest that CD is not a competitive method for ML estima-
tion of random fields. Our remaining, empirical results show otherwise: the
bias is generally very small
What counts in promoting a method is its empirical results, no matter whether the theory shows otherwise, and this has been the case for contrastive divergence.
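For concreteness, a CD-1 update for a tiny binary RBM can be sketched as follows (sizes, data, and learning rate are arbitrary choices of ours, and biases are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step contrastive divergence (CD-1) for a tiny binary RBM, a
# sketch in the spirit of Hinton (2002). The update replaces the
# intractable model expectation with a single Gibbs step, which is
# the source of the small bias discussed by Carreira-Perpinan and
# Hinton (2005).
nv, nh, lr = 6, 3, 0.05
W = 0.01 * rng.standard_normal((nv, nh))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    ph0 = sigmoid(v0 @ W)                       # hidden probabilities
    h0 = (rng.random(nh) < ph0).astype(float)   # sampled hidden state
    v1 = sigmoid(W @ h0)                        # reconstruction (probabilities)
    ph1 = sigmoid(v1 @ W)
    # positive phase minus (one-step) negative phase
    return lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

data = (rng.random((20, nv)) < 0.5).astype(float)
for _ in range(50):
    for v in data:
        W += cd1_update(v)

print(W.shape)   # (6, 3)
```

The exact maximum-likelihood gradient would require sampling from the model's equilibrium distribution; truncating the Gibbs chain after one step is precisely the biased shortcut that the quoted analysis found, empirically, to matter very little.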
A more recent example concerns DCNNs. Most of the improvement from AlexNet to GoogleNet is due to the Inception concept, named after the title of the film by Christopher Nolan, in which the main character, Dom Cobb, utters the phrase “we need to go deeper”. Going deeper in CNNs was helped by a clever trick: substituting every 5 × 5 convolution with a 3 × 3 convolution that produces as output, on the same window, 3 × 3 values. These values can be reduced to one single result with a second 3 × 3 convolution, with a gain in the number of parameters from 25, in the case of the 5 × 5 convolution, to 18 (9 + 9) with the Inception trick. This replacement is not neutral in theory, since part of the linear space spanned by the original 5 × 5 convolution is lost, due to the application of non-linear rectifiers at the end of the first 3 × 3 convolution. Again,
heuristic appraisal is applied (Szegedy et al. 2016, p. 2821):
Still, this setup raises two general questions: Does this replacement result
in any loss of expressiveness? If our main goal is to factorize the linear part
of the computation, would it not suggest to keep linear activation in the first
layer? We have ran several control experiments and using linear activation was
always inferior to using rectified linear units in all stages of the factorization.
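The arithmetic of the factorization is easy to check: composing the two (linear) 3 × 3 kernels, which amounts to convolving them, yields a kernel with the same 5 × 5 support, at 18 parameters instead of 25:

```python
import numpy as np

# Composing two 3x3 convolutions covers the same 5x5 receptive field
# as a single 5x5 convolution. Composing the (linear) kernels amounts
# to convolving them, done here by hand; the kernel values are
# arbitrary examples.
def conv2d_full(a, b):
    ha, wa = a.shape
    hb, wb = b.shape
    out = np.zeros((ha + hb - 1, wa + wb - 1))
    for i in range(ha):
        for j in range(wa):
            out[i:i + hb, j:j + wb] += a[i, j] * b
    return out

k1 = np.arange(9.0).reshape(3, 3)
k2 = np.ones((3, 3))
combined = conv2d_full(k1, k2)

print(combined.shape)      # (5, 5): same support as one 5x5 kernel
print(k1.size + k2.size)   # 18 parameters, versus 25 for a 5x5 kernel
```

In the actual Inception networks a rectifier sits between the two 3 × 3 convolutions, which breaks this linear equivalence; that potential loss of expressiveness is exactly what the quoted control experiments probe.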
Pragmatics and the heuristic search for improvements have become more common, in part as a consequence of the larger base of researchers worldwide. The number of papers on deep learning had a compound annual growth rate of 37% from 2014 to 2017, and the number of job openings requiring deep learning increased 34 times from 2015 to 2017. Certainly most of the novel additions and modifications to deep learning architectures experimented with around the world will not lead to improvements, and will never be known at all. Only the few with empirical success will be published, and contribute to the progress of deep learning.
Note that imputing the merits of deep learning entirely to a collection of scattered and disparate heuristics might be vulnerable to a sort of “no-miracle argument”.1 Since Putnam (1978, pp. 18–19) onward, one of the preferred arguments of scientific realists says, roughly, that without assuming that the laws and objects of a theory
describe, approximately, the real world, the success of that theory in a multitude of predictions would be an utter miracle. Antirealists, obviously, developed their ways
to try to escape the no-miracle argument (Psillos 2000), and this is not a debate of
relevance here. In the context of deep learning, the miracle would be to impute its
series of top computational performances to a large collection of disconnected heu-
ristics, without acknowledging any core theoretical principle. Miracles are not sup-
posed to take place, yet the very point about deep learning is that a unifying working
principle has still to be identified, thus there is no miracle to be invoked as a replace-
ment. Nevertheless, several hypotheses of a core mathematical principle explaining
the efficiency of deep learning have been put forward, reviewed in Sect. 6. The heuristics listed here do not rule out the possibility of a core theoretical principle; however, they certainly play a crucial role in the current progression of the performance of deep learning, and it would not be easy to discriminate their contribution from that of a putative core principle (or principles).
8 Conclusions
This work has addressed the reasons underlying the enormous success that deep
learning has gained in recent years, within a variety of application domains. We have
ascertained that within this “success”, together with factors depending on market-
ing and sociology of scientific research, there are genuine elements of a remarkable
progress in performance. Within a relatively short timespan, a set of AI problems,
long considered out of reach for existing computational tools, have found efficient
solutions thanks to deep learning. Analyzing the history of neural networks, and the
recent shift from “shallow” to “deep”, there is no trace of a well-defined innovation capable of justifying such success. Both the current leading factors considered
1 We are grateful to an anonymous reviewer for pointing this out.
References
Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. New York: Wiley.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., et al. (2015). TensorFlow: Large-scale
machine learning on heterogeneous systems. Technical report, Google Brain Team.
Alippi, C., & Vellasco, M. (1992). GALATEA neural VLSI architectures: Communication and control
considerations. Microprocessing and Microprogramming, 35, 175–181.
Ambrosio, L., Gigli, N., & Savaré, G. (2008). Gradient flows in metric spaces and in the space of prob-
ability measures. Basel: Birkhäuser.
Anderson, J. A., & Rosenfeld, E. (Eds.). (2000). Talking nets: An oral history of neural networks. Cam-
bridge: MIT Press.
Arel, I., Rose, D. C., & Karnowski, T. P. (2010). Deep machine learning-a new frontier in artificial intel-
ligence research. IEEE Computational Intelligence Magazine, 5, 13–18.
Batin, M., Turchin, A., Markov, S., Zhila, A., & Denkenberger, D. (2017). Artificial intelligence in life
extension: From deep learning to superintelligence. Informatica, 41, 401–417.
Bednar, J. A. (2009). Topographica: Building and analyzing map-level simulations from Python, C/C++,
MATLAB, NEST, or NEURON components. Frontiers in Neuroinformatics, 3, 8.
Bednar, J. A. (2014). Topographica. In D. Jaeger & R. Jung (Eds.), Encyclopedia of computational neu-
roscience (pp. 1–5). Berlin: Springer.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The Arcade learning environment: An
evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Benveniste, A., Metivier, M., & Priouret, P. (1990). Adaptive algorithms and stochastic approximations.
Berlin: Springer.
Betti, E. (1872). Il nuovo cimento. Series, 2, 7.
Bianchini, M., & Scarselli, F. (2014a). On the complexity of neural network classifiers: A comparison
between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Sys-
tems, 25, 1553–1565.
Bianchini, M., & Scarselli, F. (2014b). On the complexity of shallow and deep neural network classifiers.
In Proceedings of European Symposium on Artificial Neural Networks (pp. 371–376).
Bo, L., Lai, K., Ren, X., & Fox, D. (2011). Object recognition with hierarchical kernel descriptors. In
Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (pp.
1729–1736).
Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., et al. (2014). Find-
ings of the 2014 workshop on statistical machine translation. In Proceedings of the Workshop on
Statistical Machine Translation (pp. 12–58).
Booker, L., Forrest, S., Mitchell, M., & Riolo, R. (Eds.). (2005). Perspectives on adaptation in natural
and artificial systems. Oxford: Oxford University Press.
Bottou, L., & LeCun, Y. (2004). Large scale online learning. In Advances in neural information process-
ing systems (pp. 217–224).
Bower, J. M., & Beeman, D. (1998). The book of GENESIS: Exploring Realistic Neural Models with the
GEneral NEural SImulation System (2nd ed.). New York: Springer.
Bracewell, R. (2003). Fourier analysis and imaging. Berlin: Springer.
Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., et al. (2014). Deep neu-
ral networks rival the representation of primate IT cortex for core visual object recognition. PLoS
Computational Biology, 10, e1003963.
Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M., & Poggio, T. (2007). A model of
V4 shape selectivity and invariance. Journal of Neurophysiology, 98, 1733–1750.
Carnap, R. (1938). The logical syntax of language. New York: Harcourt, Brace and World.
Carreira-Perpiñán, M., & Hinton, G. (2005). On contrastive divergence learning. In R. Cowell, & Z.
Ghahramani (Eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and
Statistics (pp. 33–40).
Cauchy, A. L. (1847). Méthode générale pour la résolution des systèmes d’équations simultanées.
Comptes rendus des séances de l’Académie des sciences de Paris, 25, 536–538.
Chatfield, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details:
Delving deep into convolutional nets. CoRR arXiv:abs/1405.3531.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., et al. (2015). MXNet: A flexible and efficient
machine learning library for heterogeneous distributed systems. CoRR arXiv:abs/1512.01274.
Chollet, F. (2018). Deep learning with python. Shelter Island (NY): Manning.
Chui, M., Manyika, J., Miremadi, M., Henke, N., Chung, R., Nel, P., et al. (2018). Notes from the AI
frontier: Insights from hundreds of use cases. Technical Reports. April, McKinsey Global Institute.
Cicchetti, D. V. (1991). The reliability of peer review for manuscript and grant submissions: A cross-
disciplinary investigation. Behavioral and Brain Science, 14, 119–186.
Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural net-
works to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical
correspondence. Scientific Reports, 6, 27755.
Cinbis, R.G., Verbeek, J., & Schmid, C. (2012). Segmentation driven object detection with fisher vectors.
In International Conference on Computer Vision, (pp. 2968–2975).
Cireşan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image clas-
sification. In Proceedings of IEEE International Conference on Computer Vision and Pattern
Recognition.
Clarke, A., Devereux, B. J., Randall, B., & Tyler, L. K. (2015). Predicting the time course of individual
objects with MEG. Cerebral Cortex, 25, 3602–3612.
Coates, A., Huval, B., Wang, T., Wu, D.J., Ng, A.Y., & Catanzaro, B. (2013). Deep learning with COTS
HPC systems. In International Conference on Machine Learning, (pp. 1337–1345).
Connor, J. A., & Stevens, C. F. (1971). Prediction of repetitive firing behaviour from voltage clamp data
on an isolated neurone soma. Journal of Physiology, 213, 31–53.
Conway, B. R. (2018). The organization and operation of inferior temporal cortex. Annual Review of
Vision Science, 4, 19.1–19.22.
Copeland, J., & Proudfoot, D. (1996). On Alan Turing’s anticipation of connectionism. Synthese, 108,
361–377.
Curry, H. B. (1944). The method of steepest descent for non-linear minimization problems. Quarterly of
Applied Mathematics, 2, 258–261.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303–314.
Daniel, H. D. (2005). Publications as a measure of scientific advancement and of scientists’ productivity.
Learned Publishing, 18, 143–148.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge: MIT Press.
de Villiers, J., & Barnard, E. (1992). Backpropagation neural nets with one and two hidden layers. IEEE
Transactions on Neural Networks, 4, 136–141.
Deutsch, K. W. (1966). The nerves of government: Models of political communication and control. New
York: Free Press.
Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Annual Review of Neurosci-
ence, 27, 419–451.
13
546 A. Plebe, G. Grasso
Douglas, R. J., Martin, K. A., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural
Computation, 1, 480–488.
Durrani, N., Haddow, B., Koehn, P., & Heafield, K. (2014). Edinburgh’s phrase-based machine transla-
tion systems for WMT-14. In Proceedings of the Workshop on Statistical Machine Translation (pp.
97–104).
Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network
layers map the function of the human visual system. NeuroImage, 152, 184–194.
Eldan, R., & Shamir, O. (2016). The power of depth for feedforward neural networks. Journal of Machine
Learning Research, 49, 1–34.
Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition. Oxford:
Oxford University Press.
Eliasmith, C., Stewart, T. C., Choo, X., Bekolay, T., DeWolf, T., Tang, Y., et al. (2012). A large-scale
model of the functioning brain. Science, 338, 1202–1205.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–221.
Elman, J. L., Bates, E., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethink-
ing innateness: A connectionist perspective on development. Cambridge: MIT Press.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The Pascal visual
object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
Fellbaum, C. (1998). WordNet. Malden: Blackwell Publishing.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral
cortex. Cerebral Cortex, 1, 1–47.
Flack, J. C. (2018). Coarse-graining as a downward causation mechanism. Philosophical transactions of
the Royal Society A, 375, 20160338.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cogni-
tion, 28, 3–71.
Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernet-
ics, 20, 121–136.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pat-
tern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recogni-
tion. Neural Networks, 1, 119–130.
Gallistel, C. R. (1990). The organization of learning. Cambridge (MA): MIT Press.
Gauthier, I., & Tarr, M. J. (2016). Visual object recognition: Do we (finally) know more now than we
did? Annual Review of Vision Science, 2, 16.1–16.20.
Girshick, R. (2015). Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition (pp. 1440–1448).
Godfrey, J., Holliman, E., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for
research and development. In International Conference on Acoustics, Speech and Signal Process-
ing (pp. 517–520).
Grill-Spector, K., Weiner, K. S., Gomez, J., Stigliani, A., & Natu, V. S. (2018). The functional neuro-
anatomy of face perception: From brain measurements to deep neural networks. Interface Focus,
8, 20180013.
Güçlü, U., & van Gerven, M. A. J. (2014). Unsupervised feature learning improves prediction of human
brain activity in response to natural images. PLoS Computational Biology, 10, 1–16.
Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of
neural representations across the ventral stream. Journal of Neuroscience, 35, 10005–10014.
Guo, Z. C., Shi, L., & Lin, S. B. (2019). Realizing data features by deep nets. CoRR arXiv:abs/1901.00139.
Hain, T., Woodland, P. C., Evermann, G., Gales, M. J. F., Liu, X., Moore, G. L., et al. (2005). Automatic
transcription of conversational telephone speech. IEEE Transactions on Speech and Audio Process-
ing, 13, 1173–1185.
Hanson, N. R. (1958). Patterns of discovery. Cambridge: Cambridge University Press.
Hassabis, D., Kumaran, D., Summerfield, C., & Botvinick, M. (2017). Neuroscience-inspired artificial
intelligence. Neuron, 95, 245–258.
Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y.,
Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., & Wang, X. (2018).
Applied machine learning at Facebook: A datacenter infrastructure perspective. In IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA) (pp. 620–629).
The Unbearable Shallow Understanding of Deep Learning 547
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (pp. 770–778).
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hemlin, S. (1996). Research on research evaluation. Social Epistemology, 10, 209–250.
Hendricks, V. F., Jakobsen, A., & Pedersen, S. A. (2000). Identification of matrices in science and engi-
neering. Journal for General Philosophy of Science, 31, 277–305.
Hines, M., & Carnevale, N. (1997). The NEURON simulation environment. Neural Computation, 9,
1179–1209.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14, 1771–1800.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. In D. E. Rumel-
hart & J. L. McClelland (Eds.) (pp. 77–109).
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks.
Science, 313, 504–507.
Hodas, N., & Stinis, P. (2018). Doing the impossible: Why neural networks can be trained at all. CoRR
arXiv:abs/1805.04928.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500–544.
Holland, J. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan
Press.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approx-
imators. Neural Networks, 2, 359–366.
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of IEEE Interna-
tional Conference on Computer Vision and Pattern Recognition (pp. 7132–7142).
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the
cat’s visual cortex. Journal of Physiology, 160, 106–154.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex.
Journal of Physiology, 195, 215–243.
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31, 253–258.
Iso, S., Shiba, S., & Yokoo, S. (2018). Scale-invariant feature extraction of neural network and renormali-
zation group flow. Physical Review E, 97, 053304.
Jones, W., Alasoo, K., Fishman, D., & Parts, L. (2017). Computational biology: Deep learning. Emerging
Topics in Life Sciences, 1, 136–161.
Jordan, R., Kinderlehrer, D., & Otto, F. (1998). The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29, 1–17.
Kadanoff, L. P. (2000). Statistical physics: Statics, dynamics and renormalization. Singapore: World Sci-
entific Publishing.
Kaplan, D. M. (2011). Explanation and description in computational neuroscience. Synthese, 183,
339–373.
Kaplan, D. M., & Craver, C. F. (2011). Towards a mechanistic philosophy of neuroscience. In S. French
& J. Saatsi (Eds.), Continuum companion to the philosophy of science (pp. 268–292). London:
Continuum Press.
Karmiloff-Smith, A. (1992). Beyond modularity: A developmental perspective on cognitive science.
Cambridge: MIT Press.
Kass, R. E., Amari, S. I., Arai, K., Brown, E. N., Diekman, C. O., Diesmann, M., Doiron, B., et al. (2018).
Computational neuroscience: Mathematical and statistical perspectives. Annual Review of Statis-
tics and Its Application, 5, 183–214.
Kay, K. N., Winawer, J., Mezer, A., & Wandell, B. A. (2013). Compressive spatial summation in human
visual cortex. Journal of Neurophysiology, 110, 481–494.
Ketkar, N. (2017). Introduction to PyTorch (pp. 195–208). Berkeley: Apress.
Khaligh-Razavi, S. M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may
explain IT cortical representation. PLoS Computational Biology, 10, e1003915.
Khan, S., & Tripp, B. P. (2017). One model to learn them all. CoRR arXiv:abs/1706.05137.
Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. In Proceedings of Interna-
tional Conference on Learning Representations.
Klein, B., Harvey, B. M., & Dumoulin, S. O. (2014). Attraction of position preference by spatial attention
throughout human visual cortex. Neuron, 84, 227–237.
Kotseruba, I., & Tsotsos, J. K. (2018). 40 years of cognitive architectures: Core cognitive abilities and
practical applications. Artificial Intelligence Review. https://doi.org/10.1007/s10462-018-9646-y.
Kouh, M., & Poggio, T. (2008). A canonical neural circuit for cortical nonlinear operations. Neural Com-
putation, 20, 1427–1451.
Kriegeskorte, N. (2009). Relating population-code representations between man, monkey, and computa-
tional models. Frontiers in Neuroscience, 3, 363–373.
Kriegeskorte, N., Mur, M., & Bandettini, P. (2009). Representational similarity analysis: Connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical
Reports. Vol. 1, No. 4, University of Toronto.
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet classification with deep convolutional
neural networks. In Advances in neural information processing systems (pp. 1090–1098).
Kushner, H. J., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained
systems. Berlin: Springer.
Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). RACE: Large-scale reading comprehension data-
set from examinations. In Conference on Empirical Methods in Natural Language Processing (pp
796–805).
Laird, J. E., & van Lent, M. (2001). Human-level AI’s killer application: Interactive computer games. AI
Magazine, 22, 15–25.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn
and think like people. Behavioral and Brain Sciences, 40, 1–72.
Landgrebe, J., & Smith, B. (2019). Making AI meaningful again. Synthese, 1–21. https://doi.org/10.1007/s11229-019-02192-y.
Laudan, L. (1984). Explaining the success of science: Beyond epistemic realism and relativism. In J. T.
Cushing, C. F. Delaney, & G. Gutting (Eds.), Science and reality: Recent work in the philosophy of
science (pp. 83–105). Notre Dame: University of Notre Dame Press.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backprop-
agation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86, 2278–2324.
Lee, C. Y., Gallagher, P. W., & Tu, Z. (2018). Generalizing pooling functions in CNNs: Mixed, gated, and
tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 863–875.
Lehky, S. R., & Tanaka, K. (2016). Neural representation for object recognition in inferotemporal cortex.
Current Opinion in Neurobiology, 37, 23–35.
Leibniz, G. W. (1666). De arte combinatoria. Geneva. In Opera Omnia, edited by L. Dutens, 1768.
Lettvin, J., Maturana, H., McCulloch, W., & Pitts, W. (1959). What the frog’s eye tells the frog’s brain.
Proceedings of IRE, 47, 1940–1951.
Levenberg, K. (1944). A method for solution of certain non-linear problems in least squares. Quarterly of
Applied Mathematics, 2, 164–168.
Lin, H. W., Tegmark, M., & Rolnick, D. (2017). Why does deep and cheap learning work so well? Jour-
nal of Statistical Physics, 168, 1223–1247.
Lin, S. B. (2018). Generalization and expressivity for deep nets. IEEE Transactions on Neural Networks
and Learning Systems, 30, 1392–1406.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., & Alsaadi, F. E. (2017). A survey of deep neural network
architectures and their applications. Neurocomputing, 234, 11–26.
López-Rubio, E. (2018). Computational functionalism for the deep learning era. Minds and Machines,
28, 667–688.
Lorente de Nó, R. (1938). Architectonics and structure of the cerebral cortex. In J. Fulton (Ed.), Physiol-
ogy of the nervous system (pp. 291–330). Oxford: Oxford University Press.
Lu, Y. (2019). Artificial intelligence: A survey on evolution, models, applications and future trends. Journal of Management Analytics, 1–29. https://doi.org/10.1080/23270012.2019.1570365.
MacWhinney, B. (Ed.). (1999). The emergence of language (2nd ed.). Mahwah: Lawrence Erlbaum
Associates.
Maex, R., Berends, M., & Cornelis, H. (2010). Large-scale network simulations in systems neuroscience.
In E. De Schutter (Ed.), Computational modeling methods for neuroscientists (pp. 317–354). Cam-
bridge: MIT Press.
Marcus, G. (2018). Deep learning: A critical appraisal. CoRR arXiv:abs/1801.00631.
Markov, N., Ercsey-Ravasz, M. M., Gomes, A. R. R., Lamy, C., Magrou, L., Vezoli, J., et al. (2014). A
weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cerebral Cortex,
24, 17–36.
Markram, H., Muller, E., Ramaswamy, S., Reimann, M. W., et al. (2015). Reconstruction and simulation
of neocortical microcircuitry. Cell, 163, 456–492.
Mauduit, N., Duranton, M., Gobert, J., & Sirat, J. (1992). Lneuro 1.0: A piece of hardware LEGO for building
neural network systems. IEEE Transactions on Neural Networks, 3, 414–422.
McCulloch, W., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin
of Mathematical Biophysics, 5, 115–133.
Mehta, P., & Schwab, D. J. (2014). An exact mapping between the variational renormalization group and deep learning. CoRR arXiv:abs/1410.3831.
Mei, S., Montanari, A., & Nguyen, P. M. (2018). A mean field view of the landscape of two-layer neural
networks. Proceedings of the National Academy of Sciences USA, 115, E7665–E7671.
Miikkulainen, R., Bednar, J., Choe, Y., & Sirosh, J. (2005). Computational maps in the visual cortex.
New York: Springer.
Miller, J., & Bower, J. M. (2013). Introduction: Origins and history of the CNS meetings. In J. M. Bower
(Ed.), 20 years of computational neuroscience (pp. 1–13). Berlin: Springer.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge: MIT Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-
level control through deep reinforcement learning. Nature, 518, 529–533.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs: Prentice Hall.
Nguyen, P. M. (2019). Mean field limit of the learning dynamics of multilayer neural networks. CoRR
arXiv:abs/1902.02880.
Nickles, T. (2006). Heuristic appraisal: Context of discovery or justification? In J. Schickore & F. Steinle
(Eds.), Revisiting discovery and justification (pp. 159–182). Dordrecht: Springer.
Niiniluoto, I. (1993). The aim and structure of applied research. Erkenntnis, 38, 1–21.
Niu, J., Tang, W., Xu, F., Zhou, X., & Song, Y. (2016). Global research on artificial intelligence from
1990–2014: Spatially-explicit bibliometric analysis. International Journal of Geo-Information, 5,
66.
O’Brien, G., & Opie, J. (2004). Notes toward a structuralist theory of mental representation. In H. Clapin,
P. Staines, & P. Slezak (Eds.), Representation in mind: New approaches to mental representation.
Amsterdam: Elsevier.
Olshausen, B. A. (2014). Perception as an inference problem. In M. S. Gazzaniga (Ed.), The cognitive
neurosciences (fifth ed., pp. 295–304). Cambridge: MIT Press.
Özkural, E. (2018). The foundations of deep learning with a path towards general intelligence. In Pro-
ceedings of International Conference on Artificial General Intelligence (pp. 162–173).
Peirce, C. S. (1935). Pragmatism and abduction. In C. Hartshorne & P. Weiss (Eds.), Collected papers of
Charles Sanders Peirce (Vol. 5, pp. 112–128). Cambridge: Harvard University Press.
Petersen, P., Raslan, M., & Voigtlaender, F. (2018). Topological properties of the set of functions gener-
ated by neural networks of fixed size. CoRR arXiv:abs/1806.08459.
Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via
parameter sharing. CoRR arXiv:abs/1802.03268.
Piccinini, G. (2004). The first computational theory of mind and brain: A close look at McCulloch and
Pitts’s ’Logical calculus of ideas immanent in nervous activity’. Synthese, 141, 175–215.
Piccinini, G. (2006). Computational explanation in neuroscience. Synthese, 153, 343–353.
Piccinini, G. (2007). Computational modeling vs. computational explanation: Is everything a Turing machine, and does it matter to the philosophy of mind? Australasian Journal of Philosophy, 85, 93–115.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed pro-
cessing model of language acquisition. Cognition, 28, 73–193.
Plebe, A. (2018). The search of “canonical” explanations for the cerebral cortex. History and Philosophy
of the Life Sciences, 40, 40–76.
Plebe, A., & De La Cruz, V. M. (2018). Neural representations beyond “plus X”. Minds and Machines,
28, 93–117.
Plebe, A., & Domenella, R. G. (2007). Object recognition by artificial cortical maps. Neural Networks,
20, 763–780.
Plebe, A., & Grasso, G. (2016). The brain in silicon: History, and skepticism. In F. Gadducci & M.
Tavosanis (Eds.), History and philosophy of computing (pp. 273–286). Berlin: Springer.
Polak, E. (1971). Computational methods in optimization: A unified approach. New York: Academic
Press.
Protopapas, A. D., Vanier, M., & Bower, J. M. (1998). Simulating large networks of neurons. In C. Koch
& I. Segev (Eds.), Methods in neuronal modeling from ions to networks (second ed.). Cambridge:
MIT Press.
Psillos, S. (2000). The present state of the scientific realism debate. British Journal for the Philosophy of
Science, 51, 705–728.
Putnam, H. (1978). Meaning and the moral sciences. London: Routledge.
Quinlan, P. (1991). Connectionism and psychology. Hemel Hempstead: Harvester Wheatshaft.
Rabiner, L. R., & Gold, B. (1975). Theory and application of digital signal processing. Englewood Cliffs:
Prentice Hall.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale,
high-resolution comparison of the core visual object recognition behavior of humans, monkeys,
and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38, 7255–7269.
Rall, W. (1957). Membrane time constant of motoneurons. Science, 126, 454.
Rall, W. (1964). Theoretical significance of dendritic trees for neuronal input-output relations. In R. F.
Reiss (Ed.), Neural theory and modeling (pp. 73–97). Stanford: Stanford University Press.
Rall, W. (1969). Time constants and electrotonic length of membrane cylinders and neurons. Biophysical Journal, 9, 1483–1508.
Rall, W., & Shepherd, G. M. (1968). Theoretical reconstruction of field potentials and dendrodendritic
synaptic interactions in olfactory bulb. Journal of Neurophysiology, 31, 884–915.
Ramón y Cajal, S. (1917). Recuerdos de mi vida (Vol. II). Madrid: Imprenta y Librería de Nicolás Moya.
Ramsey, W., Stich, S. P., & Rumelhart, D. E. (Eds.). (1991). Philosophy and connectionist theory.
Mahwah: Lawrence Erlbaum Associates.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A compre-
hensive review. Neural Computation, 29, 2352–2449.
Reichenbach, H. (1938). Experience and prediction: An analysis of the foundations and the structure of
knowledge. Chicago: Chicago University Press.
Richardson, M., Burges, C.J., & Renshaw, E. (2013). MCTest: A challenge dataset for the open-domain
machine comprehension of text. In Conference on Empirical Methods in Natural Language Pro-
cessing (pp. 193–203).
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neu-
roscience, 2, 1019–1025.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics,
22, 400–407.
Robinson, L., & Rolls, E. T. (2015). Invariant visual object recognition: Biologically plausible
approaches. Biological Cybernetics, 109, 505–535.
Rolls, E. (2016). Cerebral cortex: Principles of operation. Oxford: Oxford University Press.
Rolls, E., & Deco, G. (2002). Computational neuroscience of vision. Oxford: Oxford University Press.
Rolls, E. T., & Stringer, S. M. (2006). Invariant visual object recognition: A model, with lighting invari-
ance. Journal of Physiology, 100, 43–62.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organisation in
the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics: Perceptron and the theory of brain mechanisms.
Washington (DC): Spartan.
Rosenfeld, A. (1969). Picture processing by computer. New York: Academic Press.
Rosenfeld, A., & Kak, A. C. (1982). Digital picture processing (2nd ed.). New York: Academic Press.
Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (1995). Backpropagation: The basic theory. In
Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications
(pp. 1–34). Mahwah: Lawrence Erlbaum Associates.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating
errors. Nature, 323, 533–536.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the
microstructure of cognition. Cambridge: MIT Press.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale
visual recognition challenge. International Journal of Computer Vision, 115, 211–252.
Sachan, M., Dubey, A., Xing, E.P., & Richardson, M. (2015). Learning answer-entailing structures for
machine comprehension. In Annual Meeting of the Association for Computational Linguistics
(pp. 239–249).
Safran, I., & Shamir, O. (2017). Depth-width tradeoffs in approximating natural functions with neural
networks. CoRR arXiv:abs/1610.09887.
Sánchez, J., & Perronnin, F. (2011). High-dimensional signature compression for large-scale image
classification. In Proceedings of IEEE International Conference on Computer Vision and Pat-
tern Recognition (pp 1665–1672).
Sanders, J., & Kandrot, E. (2014). CUDA by example: An introduction to general-purpose GPU pro-
gramming. Reading: Addison Wesley.
Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran,
B., et al. (2017). English conversational telephone speech recognition by humans and machines.
In Conference of the International Speech Communication Association (pp 132–136).
Schickore, J., & Steinle, F. (Eds.). (2006). Revisiting discovery and justification: Historical and philo-
sophical perspectives on the context distinction. Berlin: Springer.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61,
85–117.
Schmidt, M., Roux, N. L., & Bach, F. (2017). Minimizing finite sums with the stochastic average gra-
dient. Mathematical Programming, 162, 83–112.
Shannon, C. (1950). Programming a computer for playing chess. Philosophical Magazine, 41,
256–275.
Shea, N. (2014). Exploitable isomorphism and structural representation. Proceedings of the Aristotelian
Society, 114, 123–144.
Shepherd, G. M. (1988). A basic circuit for cortical organization. In M. S. Gazzaniga (Ed.), Perspectives
on memory research (pp. 93–134). Cambridge: MIT Press.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering
the game of Go with deep neural networks and tree search. Nature, 529, 484–489.
Simon, H. A. (1977). Models of discovery. Dordrecht: Reidel Publishing Company.
Simon, H. A. (1996). The sciences of the artificial (third ed.). Cambridge: MIT Press.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recogni-
tion. CoRR arXiv:abs/1409.1556.
Slavenburg, G.A., Rathnam, S., & Dijkstra, H. (1996). The Trimedia TM-1 PCI VLIW media processor.
In Hot Chips Symposium.
Stigliani, A., Jeska, B., & Grill-Spector, K. (2017). Encoding model of temporal processing in human
visual cortex. Proceedings of the National Academy of Sciences USA, 114, E11047–E11056.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous func-
tions on compactifications. Neural Networks, 12, 467–477.
Stinchcombe, M., & White, H. (1989). Universal approximation using feedforward networks with non-
sigmoid hidden layer activation functions. In Proceedings International Joint Conference on Neu-
ral Networks, S. Diego (CA) (pp. 613–617).
Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views
of 3d objects. Neural Computation, 14, 2585–2596.
Stringer, S. M., Rolls, E. T., & Tromans, J. M. (2007). Invariant object recognition with trace learning and
multiple stimuli present during training. Network: Computation in Neural Systems, 18, 161–187.
Stueckelberg, E., & Petermann, A. (1953). La normalisation des constantes dans la théorie des quanta.
Helvetica Physica Acta, 26, 499–520.
Swoyer, C. (1991). Structural representation and surrogative reasoning. Synthese, 87, 449–508.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabi-
novich, A. (2015). Going deeper with convolutions. In Proceedings of IEEE International Confer-
ence on Computer Vision and Pattern Recognition (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architec-
ture for computer vision. In Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition (pp. 2818–2826).
Tacchetti, A., Isik, L., & Poggio, T. A. (2018). Invariant recognition shapes neural representations of
visual input. Annual Review of Vision Science, 4, 403–422.
Tan, K. H., & Lim, B. P. (2018). The artificial intelligence renaissance: Deep learning and the road to
human-level machine intelligence. APSIPA Transactions on Signal and Information Processing,
7, e6.
Theeten, J., Duranton, M., Maudit, N., & Sirat, J. (1990). The l-neuro chip: A digital VLSI with an on-
chip learning mechanism. In Proceedings of International Neural Network Conference (pp. 593–
596). Kluwer Academic.
Thomson Kelvin, W. (1855). On the theory of the electric telegraph. Proceedings of the Royal Society of
London, 7, 382–399.
Traub, R. D. (1977). Motorneurons of different geometry and the size principle. Biological Cybernetics,
25, 163–176.
Traub, R. D. (1979). Neocortical pyramidal cells: A model with dendritic calcium conductance reproduces repetitive firing and epileptic behavior. Brain Research, 173, 243–257.
Tripp, B.P. (2017). Similarities and differences between stimulus tuning in the inferotemporal visual
cortex and convolutional networks. In International Joint Conference on Neural Networks (pp.
3551–3560).
Trischler, A., Ye, Z., Yuan, X., He, J., Bachman, P., & Suleman, K. (2016). A parallel–hierarchical model
for machine comprehension on sparse data. CoRR arXiv:abs/1603.08884.
Turing, A. (1948). Intelligent machinery. Technical report, National Physical Laboratory, London. Reprinted in D. C. Ince (Ed.), Collected works of A. M. Turing: Mechanical intelligence. Edinburgh University Press, 1969.
Ullman, S., Harari, D., & Dorfman, N. (2012). From simple innate biases to complex visual concepts.
Proceedings of the National Academy of Sciences USA, 109, 18215–18220.
Van Essen, D. C. (2003). Organization of visual areas in macaque and human cerebral cortex. In L. Cha-
lupa & J. Werner (Eds.), The visual neurosciences. Cambridge: MIT Press.
Van Essen, D. C., & DeYoe, E. A. (1994). Concurrent processing in the primate visual cortex. In M. S.
Gazzaniga (Ed.), The cognitive neurosciences. Cambridge: MIT Press.
VanRullen, R. (2017). Perception science in the age of deep neural networks. Frontiers in Psychology, 8,
142.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin,
I. (2017). Attention is all you need. In Advances in neural information processing systems (pp.
6000–6010).
Veselý, K., Ghoshal, A., Burget, L., & Povey, D. (2013). Sequence-discriminative training of deep neural
networks. In Conference of the International Speech Communication Association (pp. 2345–2349).
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015
MSCOCO image captioning challenge. IEEE Transaction on Pattern Analysis and Machine Intel-
ligence, 39, 652–663.
Volterra, V. (1930). Theory of functionals and of integral and integro-differential equations. London:
Blackie & Son. (Translation by M. Long).
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
von Economo, C., & Koskinas, G. N. (1925). Die Cytoarchitektonik der Hirnrinde des erwachsenen
Menschen. Berlin: Springer.
Wallis, G., & Rolls, E. (1997). Invariant face and object recognition in the visual system. Progress in
Neurobiology, 51, 167–194.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences.
Ph.D thesis, Harvard University.
Werbos, P. (1994). The roots of backpropagation: From ordered derivatives to neural networks. New
York: Wiley.
Wiener, N. (1949). Extrapolation, interpolation and smoothing of stationary time series. New York:
Wiley.
Williams, T., & Li, R. (2018). Wavelet pooling for convolutional neural networks. In International Con-
ference on Learning Representations.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-
organization. Proceedings of the Royal Society of London, B194, 431–445.
Wilson, K. G., & Kogut, J. (1974). The renormalization group and the 𝜖 expansion. Physics Reports, 12,
75–199.
Wu, E., & Liu, Y. (2008). Emerging technology about GPGPU. In IEEE Asia Pacific Conference on Cir-
cuits and Systems (pp. 618–622).
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences USA, 111, 8619–8624.
Yang, Y., Tarr, M. J., Kass, R. E., & Aminoff, E. M. (2018). Exploring spatio-temporal neural dynamics of the human visual cortex. bioRxiv 422576.
Zhou, D. X. (2002). The covering number in learning theory. Journal of Complexity, 18, 739–767.
Zhou, J., Cao, Y., Wang, X., Li, P., & Xu, W. (2016). Deep recurrent models with fast-forward connec-
tions for neural machine translation. Transactions of the Association for Computational Linguis-
tics, 4, 371–383.
Zhu, H., Wei, F., Qin, B., & Liu, T. (2018). Hierarchical attention flow for multiple-choice reading com-
prehension. In AAAI Conference on Artificial Intelligence (pp. 6077–6084).
Ziman, J. (2000). Real science: What it is and what it means. Cambridge: Cambridge University Press.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.