
Neural Computing and Applications (2022) 34:10257–10277

https://doi.org/10.1007/s00521-021-06359-y

S. I. : EFFECTIVE AND EFFICIENT DEEP LEARNING

On the post-hoc explainability of deep echo state networks for time
series forecasting, image and video classification

Alejandro Barredo Arrieta1 · Sergio Gil-Lopez1 · Ibai Laña1 · Miren Nekane Bilbao2 · Javier Del Ser1

1 TECNALIA, Basque Research and Technology Alliance (BRTA), 48160 Derio, Spain
2 University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain
Corresponding author: Javier Del Ser (javier.delser@tecnalia.com)

Received: 16 February 2021 / Accepted: 23 July 2021 / Published online: 6 August 2021
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021

Abstract
Since their inception, learning techniques under the reservoir computing paradigm have shown a great modeling capability
for recurrent systems without the computing overheads required for other approaches, especially deep neural networks.
Among them, different flavors of echo state networks have attracted much attention over time, mainly due to the simplicity
and computational efficiency of their learning algorithm. However, these advantages do not compensate for the fact that
echo state networks remain black-box models whose decisions cannot be easily explained to the general audience. This
issue is even more involved for multi-layered (also referred to as deep) echo state networks, whose more complex
hierarchical structure hinders even further the explainability of their internals to users without expertise in machine
learning or even computer science. This lack of explainability can jeopardize the widespread adoption of these models in
certain domains where accountability and understandability of machine learning models is a must (e.g., medical diagnosis,
social politics). This work addresses this issue by conducting an explainability study of echo state networks when applied
to learning tasks with time series, image and video data. Among these tasks, we stress the latter (video classification)
which, to the best of our knowledge, has never been tackled before with echo state networks in the related literature.
Specifically, the study proposes three different techniques capable of eliciting understandable information about the
knowledge grasped by these recurrent models, namely potential memory, temporal patterns and pixel absence effect.
Potential memory addresses questions related to the effect of the reservoir size in the capability of the model to store
temporal information, whereas temporal patterns unveil the recurrent relationships captured by the model over time.
Finally, pixel absence effect attempts at evaluating the effect of the absence of a given pixel when the echo state network
model is used for image and video classification. The benefits of the proposed suite of techniques are showcased over three
different domains of applicability: time series modeling, image and, for the first time in the related literature, video
classification. The obtained results reveal that the proposed techniques not only allow for an informed understanding of the
way these models work, but also serve as diagnostic tools capable of detecting issues inherited from data (e.g., presence of
hidden bias).

Keywords Explainable artificial intelligence · Randomization-based machine learning · Reservoir computing · Echo state networks

1 Introduction

Since their inception [1, 2], echo state networks (ESNs) have been frequently proposed as an efficient replacement for traditional Recurrent Neural Networks (RNNs). As opposed to conventional gradient-based RNN training, the recurrent part (reservoir) of ESNs is not updated via gradient backpropagation, but is simply initialized at random, given that certain mathematical properties are met. This randomized nature renders ESNs within the realm of


randomized neural networks, along with random vector functional links, stochastic configuration networks and random features for kernel approximation [3, 4]. As a result, their training overhead for real life applications becomes much less computationally demanding than that of RNNs. Over the well-known Mackey–Glass chaotic time series prediction benchmark, ESN has been shown to improve the accuracy scores achieved by multi-layer perceptrons (MLPs), support vector machines (SVMs), backpropagation-based RNNs and other learning approaches by a factor of over 2000 [5]. These proven benefits have appointed ESN as a top contending dynamical model for performance and computational efficiency reasons when compared to other modeling counterparts.

Unfortunately, choosing the right parameters to initialize this reservoir falls a bit on the side of luck and past experience of the scientist [6], and less on that of sound reasoning. As stated in [7] and often referred thereafter, the current approach for assessing whether a reservoir is suited for a particular task is to observe if it yields accurate results, either by handcrafting the values of the reservoir parameters or by automating their configuration via an external optimizer. All in all, this poses tough questions to address when developing an ESN for a certain application, since knowing whether the created structure is optimal for the problem at hand is not possible without actually training it. Furthermore, despite recent attempts made in this direction [8, 9], there is no clear consensus on how to guide the search for good reservoir-based models.

Concerns in this matter go a step beyond the ones exposed above about the configuration of these models. Model design and development should orbit around a deep understanding of the multiple factors hidden below the surface of the model. For this purpose, a manifold of techniques have been proposed under the Explainable Artificial Intelligence (xAI) paradigm for easing the understanding of decisions issued by existing AI-based models. The information delivered by xAI techniques allows improving the design/configuration of AI models, extracting augmented knowledge about their outputs, accelerating debugging processes, or achieving a better outreach and adoption of this technology by non-specialized audiences [10]. Although the activity in this research area has been vibrant for explaining many black-box machine learning models, there is no prior work on the development of techniques of this sort for dynamical approaches. The need for providing explanatory information about the knowledge learned by ESNs remains unaddressed, even though recent advances in the construction of multi-layered reservoirs (Deep ESN [11]) make these models even more opaque than their single-layered counterparts.

Given the above context, this manuscript takes a step ahead by presenting a novel suite of xAI techniques suited to issue explanations of already trained Deep ESN models. The proposed techniques elicit visual information that permits assessing the memory properties of these models, visualizing their detected patterns over time, and analyzing the importance of an individual input in the model's output. Mathematical definitions of the xAI factors involved in the tools of the proposed framework are given, complemented by examples that help illustrate their applicability and comprehensibility to the general audience. This introduction of the overall framework is complemented by several experiments showcasing the use of our xAI framework in four different real applications: (1) battery cell consumption prediction, (2) road traffic flow forecasting, (3) image classification and (4) video classification. The results are conclusive: the outputs of the proposed xAI techniques confirm, in a human-readable fashion, that the Deep ESN models capture temporal dependencies existing in data that could be expected due to prior knowledge, and that this captured information can be summarized to deepen the understanding of a general practitioner/user consuming the model's output. The novel ingredients of this work can be summarized as follows:

1. We propose three xAI methods for ESNs:

• Potential memory, which permits to quantitatively assess the estimated memory retained in a trained ESN, and that can be extrapolated to any recursive model.
• Temporal patterns, which examines the correlative patterns captured by the ESN, and visualizes them to aid the training and explanation process. This tool allows checking whether the model captures from data what the practitioner could expect as per his/her expert knowledge or prior intuition told by the application at hand.
• Pixel absence effect, which evaluates the effect of each dimension of the input instance on the output of the model. This improves the granularity of the analysis, uncovering which parts of the input are most influential for the prediction.

2. The aforementioned tools are evaluated over data of distinct nature: time series, image and video data. The latter one (video) implies transforming images that flow over time into a multidimensional time series.
3. Explanations issued by the proposed techniques are proven not only to inform further future studies dealing with these recurrent randomized neural networks, but to also unveil important modeling issues (e.g., the presence of bias) that are often not easy to detect from sequential data.


The rest of the paper is organized as follows: Sect. 2 provides the reader with the required background on ESN literature, and sets common mathematical grounds for the rest of the work. Section 3 introduces the framework and analyses each of the techniques proposed in its internal subsections. Section 4 presents the experiments designed to ensure the viability of this study with real data. Section 5 analyzes and discusses the obtained results. Finally, Sect. 6 ends the paper by drawing conclusions and outlining future research lines rooted in our findings.

2 Background

Before proceeding with the description of the proposed suite of xAI techniques, this section briefly revisits the fundamentals of ESN and Deep ESN models (Sect. 2.1), notable advances in the explainability of recurrent neural networks and models for time series (Sect. 2.2), and techniques used for quantifying the importance of features (Sect. 2.3).

2.1 Fundamentals of echo state networks

In 2001, Wolfgang Maass and Herbert Jaeger independently introduced Liquid State Machines [12] and ESNs [13], respectively. The combination of these studies with research on computational neuroscience and machine learning [14, 15] brought up the field of reservoir computing. Methods belonging to this field consist of a set of sparsely connected, recurrent neurons capable of mapping high-dimensional sequential data to a low-dimensional space, over which a learning model can be trained to capture patterns that relate this low-dimensional space to a target output. This simple yet effective modeling strategy has been harnessed for regression and classification tasks in a diversity of applications, such as road traffic forecasting [16], human recognition [17] or smart grids [18], among others [11].

Besides their competitive modeling performance in terms of accuracy/error, reservoir computing models are characterized by a less computationally demanding training process than other recursive models: in these systems, only the learner mapping the output of the reservoir to the target variable of interest needs to be trained. Neurons composing the reservoir are initialized at random under some stability constraints. This alternative not only alleviates the computational complexity of recurrent neural networks, but also circumvents one of the downsides of gradient backpropagation, namely exploding and vanishing gradients.

Fig. 1 Schematic diagram showing a canonical ESN (upper plot), and the multi-layered stacking architecture of a Deep ESN model (below)

To support the subsequent explanation of the proposed xAI techniques, we now define mathematically the internals and procedure followed by ESNs to learn a mapping from a K-dimensional input u(t) (with t denoting the index within the sequence) to an L-dimensional output y(t) by a reservoir of N neurons. Following Fig. 1, the reservoir state is updated as:

$$\mathbf{x}(t+1) = \alpha f\left(\mathbf{W}^{in}\mathbf{u}(t+1) + \mathbf{W}\mathbf{x}(t) + \mathbf{W}^{fb}\mathbf{y}(t)\right) + (1-\alpha)\,\mathbf{x}(t), \qquad (1)$$

where x(t) denotes the N-sized state vector of the reservoir, f(·) denotes an activation function, W (of size N × N) is the matrix of internal weights, W^in (N × K) are the input connection weights, and W^fb (N × L) is a feedback connection matrix. Parameter α ∈ (0, 1] denotes the leaking rate, which allows setting different learning dynamics in the above recurrence [19]. The output of the ESN at index t can be computed once the state of the reservoir has been updated, yielding:

$$\hat{\mathbf{y}}(t) = g\left(\mathbf{W}^{out}[\mathbf{x}(t);\mathbf{u}(t)]\right) = g\left(\mathbf{W}^{out}\mathbf{z}(t)\right), \qquad (2)$$

where [a; b] is a concatenation operator between vectors a and b, W^out (of size L × (K + N)) is a matrix containing the output weights, and g(·) denotes an activation function. Weights belonging to the aforementioned matrices can be adjusted as per a training dataset with examples {(u(t), y(t))}. However, as opposed to other recurrent neural networks, not all weights inside the matrices W^in, W, W^fb and W^out are adjusted. Instead, the weight values of the input, hidden state and feedback matrices are drawn initially at random, whereas those of the output matrix W^out are the only ones tuned during the training phase, using Moore–Penrose pseudo-inversion (for classification) or regularized least squares (regression). In this latter case:

$$\min_{\mathbf{w}^{out}_{l}} \sum_{t=1}^{T}\left(\sum_{j=1}^{K+N} w^{out}_{l,j}\, z_j(t) - y_l(t)\right)^2 + \lambda\left\|\mathbf{w}^{out}_{l}\right\|_2^2, \qquad (3)$$

where l ∈ {1, ..., L}; w^out_l ∈ R^{K+N}; ‖·‖_2 denotes the L2 norm; z_j(t) is the j-th entry of z(t); w^out_l = [w^out_{l,1}, ..., w^out_{l,K+N}]^T denotes the l-th row of W^out; and λ ≥ 0 permits to adjust the importance of the L2 regularization term in the minimization.
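To make Expressions (1)–(3) concrete, the following minimal NumPy sketch implements a single-reservoir leaky ESN with a ridge-regression readout. It is only an illustrative rendering of the equations above: class and parameter names such as `ESN` or `ridge` are ours, the feedback matrix W^fb is omitted for brevity, and the authors' own scripts are those released in the repository cited in Sect. 4.1.

```python
import numpy as np

class ESN:
    """Minimal leaky ESN sketch: state update as in Eq. (1) (without feedback),
    readout as in Eq. (2), ridge-regression training of the readout as in Eq. (3)."""

    def __init__(self, n_inputs, n_neurons, n_outputs,
                 leak_rate=0.9, spectral_radius=0.9, ridge=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        self.a = leak_rate
        self.ridge = ridge
        # Random input and reservoir weights, kept fixed after initialization
        self.W_in = rng.uniform(-0.5, 0.5, size=(n_neurons, n_inputs))
        W = rng.uniform(-0.5, 0.5, size=(n_neurons, n_neurons))
        # Rescale internal weights so that their largest absolute eigenvalue
        # equals the prescribed spectral radius
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.W_out = np.zeros((n_outputs, n_inputs + n_neurons))

    def _update(self, x, u):
        # Eq. (1): x(t+1) = a*f(W_in u(t+1) + W x(t)) + (1 - a)*x(t)
        return self.a * np.tanh(self.W_in @ u + self.W @ x) + (1 - self.a) * x

    def _collect_states(self, U):
        x = np.zeros(self.W.shape[0])
        Z = []
        for u in U:                              # U has shape (T, n_inputs)
            x = self._update(x, u)
            Z.append(np.concatenate([x, u]))     # z(t) = [x(t); u(t)]
        return np.array(Z)

    def fit(self, U, Y):
        # Eq. (3): regularized least squares on the readout weights only
        Z = self._collect_states(U)
        A = Z.T @ Z + self.ridge * np.eye(Z.shape[1])
        self.W_out = np.linalg.solve(A, Z.T @ Y).T
        return self

    def predict(self, U):
        # Eq. (2) with a linear output activation g
        return self._collect_states(U) @ self.W_out.T
```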


More recently, a multi-layered architecture based on the leaky ESN model described above was introduced in [11]. In essence, the Deep ESN model embodies a stacking ensemble of N_L reservoirs, set one after the other, forming a hierarchically layered series of reservoirs. As a result of this concatenation, the state vector of the global Deep ESN model is given by [x^(1)(t); ...; x^(N_L)(t)], compactly denoted as [x^(l)(t)] for l = 1, ..., N_L, which can be conceived as a multi-scale representation of the input u(t). It is this property, together with other advantages such as a lower computational burden of their training algorithm and the enriched reservoir dynamics of stacked reservoirs [20], what lends Deep ESNs a competitive modeling performance when compared to other RNNs.

As shown in Fig. 1, in what follows (l) indicates that the parameter featuring this index belongs to the l-th layer of the Deep ESN. The first stacked ESN is hence updated as:

$$\mathbf{x}^{(1)}(t+1) = \left(1-\alpha^{(1)}\right)\mathbf{x}^{(1)}(t) + \alpha^{(1)} f\left(\mathbf{W}^{(1)}\mathbf{x}^{(1)}(t) + \mathbf{W}^{(1),in}\mathbf{u}(t+1)\right), \qquad (4)$$

whereas the recurrence in layers l ∈ {2, ..., N_L} is given by:

$$\mathbf{x}^{(l)}(t+1) = \left(1-\alpha^{(l)}\right)\mathbf{x}^{(l)}(t) + \alpha^{(l)} f\left(\mathbf{W}^{(l)}\mathbf{x}^{(l)}(t) + \mathbf{W}^{(l),in}\mathbf{x}^{(l-1)}(t+1)\right), \qquad (5)$$

and the Deep ESN output yields as:

$$\hat{\mathbf{y}}(t) = g\left(\mathbf{W}^{out}\left[\mathbf{x}^{(1)}(t); \ldots; \mathbf{x}^{(N_L)}(t)\right]\right), \qquad (6)$$

where W^out (of size L × (N_L · N)) is the weight matrix that maps the concatenation of state vectors of the stacked reservoirs to the target output. Analogously to the canonical single-layer ESN, weights in W^(l) and W^(l),in for each layer l = 1, ..., N_L are initialized at random and rescaled to fulfill the so-called Echo State Property [21], i.e.,

$$\max_{l\in\{1,\ldots,N_L\}} \rho\left(\left(1-\alpha^{(l)}\right)\mathbf{I} + \alpha^{(l)}\mathbf{W}^{(l)}\right) < \rho_{max}, \qquad (7)$$

with ρ(·) denoting the largest absolute eigenvalue of the matrix set at its argument, and ρ_max < 1 the so-called spectral radius of the model. Once W^(l) and W^(l),in have been set at random fulfilling the above property, they are kept fixed for the rest of the training process, whereas weights in W^out are tuned over the training data by means of regularized least-squares regression as per Expression (3) (or any other low-complexity learning model alike). Although it has not been covered in the literature, the readout layer can be designed, selected or tailored as per the needs of the problem or task at hand. In this work, multiple readout layers are considered to tackle multi-class classification (random forests) and regression problems (linear regression).

Despite the simplicity and computational efficiency of the ESN and Deep ESN training process, the composition of the network itself, namely the selection of the number of layers and the value of hyperparameters such as α^(l), is a matter of study that remains so far without a clear answer [7, 13, 22]. Some automated approaches have been lately proposed, either relying on the study of the frequency spectrum of concatenated reservoirs [23] or by means of heuristic wrappers [24]. However, no previous work can be found on explainability measures that can be drawn from an already trained Deep ESN to elucidate diverse properties of its captured knowledge. When the audience for which such measures are produced is embodied by machine learning experts, the suite of xAI techniques can help them discern what these models observe at their input, quantify relevant features (e.g., their memory depth) and, thereby, ease the process of configuring them properly for the task at hand.
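The stacked recurrence of Expressions (4)–(7) can be sketched in the same spirit: the first reservoir is fed with u(t), every subsequent reservoir is fed with the state of the previous one, and the readout of Eq. (6) consumes the concatenation of all states. The snippet below is a minimal illustration, not the authors' implementation; note that, as a common simplification, each W^(l) is rescaled to a fixed spectral radius, which keeps the leaky matrix of Eq. (7) below 1 rather than enforcing the bound exactly.

```python
import numpy as np

def init_layer(rng, n_in, n_neurons, rho=0.9):
    """Random weights for one reservoir. W is rescaled to spectral radius rho;
    with rho < 1 the leaky matrix (1 - a)I + a*W of Eq. (7) then has spectral
    radius at most (1 - a) + a*rho < 1 (a simplified proxy for Eq. (7))."""
    W_in = rng.uniform(-0.5, 0.5, size=(n_neurons, n_in))
    W = rng.uniform(-0.5, 0.5, size=(n_neurons, n_neurons))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def deep_esn_states(U, layers, leaks):
    """Stacked recurrence of Eqs. (4)-(5); returns the concatenated state
    [x^(1)(t); ...; x^(NL)(t)] for every t, i.e., the input of Eq. (6)."""
    xs = [np.zeros(W.shape[0]) for _, W in layers]
    states = []
    for u in U:                                   # U: array of shape (T, K)
        signal = u
        for l, (W_in, W) in enumerate(layers):
            a = leaks[l]
            xs[l] = (1 - a) * xs[l] + a * np.tanh(W_in @ signal + W @ xs[l])
            signal = xs[l]                        # layer l+1 feeds on x^(l)(t+1)
        states.append(np.concatenate(xs))
    return np.array(states)

# Example: NL = 3 reservoirs of N = 100 neurons for a one-dimensional input
rng = np.random.default_rng(0)
sizes, leaks, layers, n_in = [100, 100, 100], [0.9, 0.9, 0.9], [], 1
for N in sizes:
    layers.append(init_layer(rng, n_in, N))
    n_in = N
states = deep_esn_states(np.sin(0.1 * np.arange(500)).reshape(-1, 1), layers, leaks)
# A linear readout (Eq. (6)) can now be fit on `states`, e.g., by ridge regression.
```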


2.2 Explainability of recurrent neural networks

Several application domains have traditionally shown a harsh reluctance against embracing the latest advances in machine learning due to the opaque, black-box nature of the models emerging over the years. This growing concern with the need for explaining black-box models has been mainly showcased in deep learning models for image classification, wherein explanations produced by xAI techniques can be inspected easily. However, the explainability of models developed for sequence data (e.g., time series) has also been studied, albeit in a significantly smaller corpus of literature. A notable milestone is the work in [25], where a post-hoc xAI technique originally developed for image classification was adopted for LSTM architectures, thereby allowing for the generation of local explanations in sequential data modeling and regression. Stimulated in part by this work, several contributions have hitherto proposed different approaches for the explainability of recurrent neural networks, including gradient-based approaches [26, 27], ablation-based estimations of the relevance of particular sequence variables for the predicted output [28, 29], or the partial linearization of the activations through the network, permitting to isolate the contribution at different time scales to the output [30].

In addition to standard deep learning approaches resorting to gradient backpropagation [31], time series data are also processed by means of other modeling approaches, mostly traditional data mining methods for time series analysis. By moving away from Deep Neural Networks, further possibilities emerge for achieving interpretable methods by design, which allow for a better understandability of their inner workings by the audience without requiring external tools for the purpose [32]. Among them, Symbolic Aggregate Approximation (SAX) [33, 34] and Fuzzy Logic appear to be the best contenders. SAX works by first transforming the time series into its piece-wise aggregate approximation [35]. Then the resulting transformation is converted into a string of symbols assigned equiprobably among the regions of the discretization. Finally, the obtained representation permits a clear visualization of the recurrent patterns hand in hand with a stream-like implementation with minimal overheads. Fuzzy logic, fuzzy sets and computing with words [36–39] are methods that improve the interpretability of otherwise hard-to-understand systems, by bringing them closer to how human reasoning operates when inspecting multiple dimensions over time. By transforming continuous values into meaningful levels, fuzzy computing is able to tackle many types of problems without hindering the interpretability of data and operations held over them through the model of choice.

Despite the numerous attempts made recently at explaining dynamic systems, most of them are fabricated in an ad-hoc manner, thus leaving aside other flavors of time-series and recurrent neural computation whose intrinsically abstract nature also calls for studies on their explainability. Indeed, this noted lack of proposals to elicit explanations for alternative models is in close agreement with the conclusions and prospects made in comprehensive surveys on xAI [10, 40]. This is the purpose of the set of techniques presented in the following section: to provide different interpretable indicators of the properties of a trained Deep ESN model, as well as information that permits to visualize and summarize the knowledge captured by its reservoirs.

2.3 Importance attribution methods

Due to one of the techniques in the proposed framework (pixel absence effect), it is compulsory that importance attribution methods are discussed in this background. Importance attribution methods are not exclusive to neural networks or randomized neural networks for that matter. There has been extended use of this concept for Tree Ensembles and Support Vector Machines [41, 42], constructing a well-developed body of research around these techniques. The assessment of the validity of a machine learning model's output is paramount for a well-structured system validation/improvement workflow. Feature attribution methods intend to give interpretability cues by measuring the importance each feature has in the final prediction outcome. Many techniques have been proposed in the last few years to gauge the importance of features in different learning tasks [43–52], yet none of them has been utilized with the recurrent randomized neural networks that are at the core of this study.

3 Proposed xAI framework

The framework proposed in this work is composed of three xAI techniques, each tackling some of the most common aspects that arise when training an ESN model or understanding its captured knowledge. These techniques cover three main characteristics that help understand the strengths and weaknesses of these models:

1. Potential memory, which is a simple test that closely relates to the amount of sequential memory retained within the network. The potential memory can be quantified by addressing how the model behaves when its input fades abruptly. The intuition behind this concept recalls the notion of stored information, i.e., the time needed by the system to return to its resting state.
2. Temporal patterns, which target the visualization of patterns occurring inside the system by means of recurrence plots [53, 54]. Temporal patterns permit to examine the multiple scales at which patterns are captured by the Deep ESN model through its layers, easing the inspection of the knowledge within the data that is retained by the neuron reservoir. This technique also helps determine when the addition of a layer does not contribute to the predictive performance of the overall model.
3. Pixel absence effect, which leverages the concept of local explanations [10] and extends it to ESN and Deep ESN models. This technique computes the prediction difference resulting from the suppression of one of the inputs of the model. The generated matrix of deviations will be contained within the dimensions of the input signal, allowing for the evaluation of the importance of the absence of a single component of the input signal.


The following subsections provide rationale for the above set of xAI techniques, their potential uses, and examples with synthetic data previewing the explanatory information they produce.

3.1 Potential memory

One of the main concerns when training an ESN model is to evaluate the potential memory the network retains within its reservoir(s). Quantifying this information becomes essential to ensure that the model possesses enough modeling complexity to capture the patterns that best correlate with the target to be predicted.

Intuitively, a reservoir is able to hold information to interact with the newest inputs being fed to the network in order to generate the current output. It follows that, upon the removal of these new inputs, the network should keep outputting the information retained in the reservoir, until it is flushed out. The multi-layered structure of the network does not allow relating the flow through the Deep ESN model directly to its memory, since the state in which this memory is conserved would most probably reside in an unknown abstract space. This is the purpose of the proposed potential memory indicator: to elicit a hint of the memory contained in the stacked reservoirs. When informed to the audience, the potential memory can improve the confidence of the user with respect to the application at hand, especially in those cases where a hypothesis on the memory that the model should possess can be drawn from intuition. For instance, in an order-ten nonlinear autoregressive-moving average (NARMA) system, the model should be able to store at least ten past steps to produce a forecast for the next time step.

Before proceeding further, it is important to note that a high potential memory is not sufficient for the model to produce accurate predictions for the task under consideration. However, it is a necessary condition to produce estimations based on inputs further away than the number of sequence steps upper bounded by this indicator.

For the sake of simplicity, let us assume a single-layer ESN model without leaking rate (α = 1) nor feedback connection (i.e., W^fb = 0_{N×L}). These assumptions simplify the recurrence in Expression (1) to:

$$\mathbf{x}(t+1) = f\left(\mathbf{W}^{in}\mathbf{u}(t+1) + \mathbf{W}\mathbf{x}(t)\right), \qquad (8)$$

which can be seen as an expansion of the current input u(t+1) and its history represented by x(t). If we further assume that the network is initially set to ŷ(0) = 0 when u(0) = 0 (for clarity, any consideration of the bias term is avoided in this statement), the system should evolve gradually to the same state when the input signal u(t) is set to 0 at time t = T0. The time taken for the model to return to its initial state is the potential memory of the network, representing the information of the history kept by the model after training. Mathematically:

Definition 1 (Potential memory) Given an already trained ESN model with parameters {α^(l), W^(l),in, W^(l), W^fb, W^out}, and initial state set to ŷ(0) = y0 ∈ R^L when u(0) = 0, the potential memory (PM) of the model at evaluation time T0 is given by:

$$PM(T_0) = \arg\min_{t > T_0}\left\{\left\|\hat{\mathbf{y}}(t) - \mathbf{y}_0\right\|_2 < \epsilon\right\} - T_0, \qquad (9)$$

where ε is the tolerance below which convergence of the measure is declared.

In order to illustrate the output of this first technique, we consider a single ESN model with a varying number of neurons in its reservoir to learn the phase difference of 50 samples between an input and an output sinusoid. Plots nested in Fig. 2 depict the fade dynamics of different ESN models trained with N = 5 (plots in Fig. 2a, b), N = 50 (plots in Fig. 2c, d) and N = 500 neurons (plots in Fig. 2e, f) in their reservoirs. Specifically, the graph on the top represents the actual signal to be predicted, while the bottom graphs display the behavior of the output of each ESN model when the input signal is zeroed at time T0 = 10100, along with the empirical distribution of the potential memory computed for T0 ∈ (10000, 10100]. When the reservoir is composed of just N = 5 neurons, the potential memory of the network is low given the quick transition of the output signal to its initial state after the input is set to 0. When increasing the size of the reservoir to 50 neurons, a rather different behavior is noted in the output of the ESN model when flushing its contained knowledge, not reaching its steady state until 20 time steps for the example depicted in Fig. 2c. Finally, an ESN comprising 500 neural units in its reservoir does not come with a significant increase of its potential memory; however, the model is able to compute the target output in a more precise fashion with respect to the previous case. This simple experiment shows that the potential memory is a useful metric to evaluate the past range of the sequence modeled and retained at its input. As later exposed in Sect. 5, this technique is of great help when fitting a model, since only when the potential memory is large enough does the model succeed on the testing.
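Definition 1 can be evaluated empirically with a few lines of code: run the trained model on the input sequence, force the input to zero from T0 onward, and count the steps until the output returns to within ε of its resting value y0. The sketch below assumes a generic `predict` function mapping a full input sequence to the model output (a stand-in for any trained ESN or Deep ESN forward pass); it only operationalizes Expression (9) and is not taken from the authors' code.

```python
import numpy as np

def potential_memory(predict, u, T0, eps=0.05, y0=0.0, horizon=500):
    """Empirical potential memory of Expression (9): number of steps after T0
    (where the input is forced to zero) until the model output falls within
    eps of its resting output y0. `predict` maps an input sequence of shape
    (T, K) to an output sequence of shape (T, L) or (T,)."""
    u_cut = u.copy()
    u_cut[T0:] = 0.0                       # input fades abruptly at t = T0
    y = predict(u_cut)
    for t in range(T0, min(T0 + horizon, len(y))):
        if np.linalg.norm(y[t] - y0) < eps:
            return t - T0                  # steps needed to reach the resting state
    return horizon                         # did not converge within the horizon

# Example over a sliding set of cut points, as in the histograms of Fig. 2:
# pms = [potential_memory(esn.predict, U, T0) for T0 in range(10000, 10100)]
```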


Fig. 2 (Top) Input and target sinusoids used for training an ESN model with α = 0.5 and a single reservoir with N ∈ {5, 50, 500} neurons. In the rest of the plots, reservoir memory dynamics when the input signal is set to 0 at t = 10100, along with the histogram of PM(T0) values obtained for T0 ∈ (10000, 10100] and ε = 0.05

3.2 Temporal patterns

One important aspect when designing and building a model is to ascertain whether it has been able to capture the temporal patterns observed in the dynamics of the data being modeled. To shed light on this matter, the devised temporal patterns technique resorts to a well-established tool for the analysis of dynamical systems: Recurrence Plots [53, 54]. This component of the suite aims at tackling two main problems: (1) to determine whether the model has captured any temporal patterns; and (2) to determine whether the depth of the model (as per its number of stacked layers) is adding new knowledge that contributes to the predictive performance of the model.

Before proceeding further with the mathematical basis of this technique, we pause at some further motivating intuition behind temporal patterns. Given the black-box nature of neural networks, their training often brings up the question whether the model is capturing the patterns expected to faithfully model the provided data. From logic it can be deduced that, the deeper the network is, the more detail it will be able to hold on to. However, this does not always result in a better predictive model. Since any given layer feeds on the previous layer's high-dimensional representation (with the exception of the first layer), the patterns found in these latter representations should potentially be more intricate, since their input is already more detailed. This is best understood from a simile with the process of examining a footprint. The human cortex has much definition to extract information from the visualized footprint. However, when staring at an already enlarged image, by the effect of a magnifying glass, the human visual cortex can pick on features it could not detect before, hence obtaining a similar effect to that of layering multiple reservoirs. The depth of the network, as well as the magnification optics, has to be appropriately sized for the task: there is no point in searching for a car using a microscope, as there is no point in using a deep structure to model a simple phenomenon.

Following the findings from [11, 55], in which their authors already found how Deep ESN architectures developed multiple time-scale representations that grew with layering, it seems of utmost importance from an explainability perspective to dispose of a tool that will help us inspect this property in already trained models. For this purpose, recurrence plots are used to analyze the temporal patterns found within the different layers of the network. When the layers capture valuable information, their recurrence plots show more focused patterns than their previous layers. This accounts for the fact that each layer is able to better focus on patterns, leaving aside noise present in data. When this growing detail in the recurrence plots disappears at a point of the hierarchy of reservoirs, adding more layers seems counterproductive and does not potentially yield any modeling gain whatsoever.

Mathematically we embrace the definition of recurrence plots in [53]. Specifically, the recurrence plot R^(l,l')_{t,t'} between the states x^(l)(t) and x^(l')(t') is given by:

$$R^{(l,l')}_{t,t'} = \begin{cases} 1 & \text{if } \left\|\mathbf{x}^{(l)}(t) - \mathbf{x}^{(l')}(t')\right\|_2 < w, \\ 0 & \text{otherwise,} \end{cases} \qquad (10)$$

i.e., as an N × N binary matrix in which value 1 is set for those indices corresponding to time steps (t, t') where the hidden state vectors of the reservoirs l and l' are equal up to an error w. Obviously, in the case when l = 1 (i.e., the first signal is the input data u(t)), the dimensions of the recurrence plot can be enlarged to cover all its length. Likewise, the case with heterogeneously sized reservoirs also fits in the above definition by simply adjusting accordingly the size of the matrix. For our particular case, recurrence plots run for each of the states of a certain layer pair l, l' throughout the test history portray the repetitions in phase space that happened throughout a given time window.
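Expression (10) translates directly into code: given the hidden-state sequences collected from two layers over a test window, the recurrence plot is a binary matrix obtained by thresholding pairwise distances. The sketch below is a minimal NumPy version (assuming equally sized reservoirs); the `layer_contribution` helper illustrates one simple way to compare the average plot values of consecutive layers, in the spirit of the score discussed around Fig. 4, without claiming to reproduce the exact aggregation used in the experiments.

```python
import numpy as np

def recurrence_plot(X_l, X_lp, w=0.1):
    """Binary recurrence plot of Eq. (10) between two state sequences.
    X_l and X_lp have shape (T, N): one hidden state vector per time step.
    Entry (t, t') is 1 when the two states differ by less than w."""
    # Pairwise Euclidean distances between x^(l)(t) and x^(l')(t')
    d = np.linalg.norm(X_l[:, None, :] - X_lp[None, :, :], axis=-1)
    return (d < w).astype(int)

def layer_contribution(state_sequences, w=0.1):
    """Average recurrence-plot value per layer, plus the differences between
    consecutive layers; a vanishing difference suggests that adding further
    layers no longer sharpens the captured temporal patterns."""
    means = [recurrence_plot(X, X, w).mean() for X in state_sequences]
    return means, np.diff(means)
```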


Figure 3 exemplifies the output of the recurrence plots when applied to the output layers of an ESN trained to map a noisy sinusoidal signal with an increasing additive trend to its noiseless version. The first thing that can be extracted from the figure is that the system being modeled is non-stationary (for the inspected scope). As shown in the signal and the recurrence plots, both present an upward slope. The second property to notice is the periodicity of the plots. By observing the processed and predicted signal (processed: gray, predicted: red), one may note that the four oscillations are captured by the model, and that they are mostly symmetric. A last feature shown by the model is that the signal is noisy, hence those dots close to the valleys, meaning that there are agglomerations of readings in those spots that a clean signal would not have. The figure clearly shows that the recurrent behavior of the signal can be inspected on the recurrence plots of its hidden states. This type of analysis helps detect the properties of the phenomena producing the data under analysis: stationary or non-stationary nature, periodicity, trend, strong fluctuations, similar/inverse evolution through time, static states and similarity between states but at different rates. The extraction of such features from the recurrence plot perfectly couples with the notion of deepening our understanding of these models, giving a further insight on the temporal patterns happening within the data.

Fig. 3 Representation of an ESN model trained to model a trended sinusoid by means of recurrence plots with three different layer configurations (1 layer: top, 2 layers: middle, 3 layers: bottom), displaying the evolution of the patterns seen on the recurrence plots along with an indication of the mean of the recurrence plots

This tool serves a two-way purpose that will be further discussed in the experiment section. First, it allows inspecting whether the model is actually capturing the features that could be expected from the system; secondly, once these features are captured correctly, it will allow for a better understanding of the system's dynamics through time. For example, the above image clearly shows in its best configuration (two layers) that the model is clearly capturing the system's non-stationary condition along with its low-amplitude fluctuations that present a recurrent behavior through time. A further analysis of this tool when applied to real data will be covered in Sect. 5.

A byproduct derived from this tool is a numerical score that determines whether an added layer is contributing to the modeling process. A preliminary experiment is devised to support the claim that the sharpness of a certain layer's recurrence plot adds to the validity of such layer's addition to the predictive power of the model. Given input and output signals, an ESN of equal characteristics is run a number of times for each number of layers. For each layer's recurrence plot R^(l,l')_{t,t'}, its average value is computed, which relates to how sharp the recurrence plot is. Results in Fig. 4, run for the example in Fig. 3, reveal that the average test error computed over 50 trials correlates with the decrease of the average difference between recurrence plots, which validates the intuition about the relative usefulness of every layer. In summary, a clear positive correlation exists between the decrease in the recurrence patterns' average and the prediction error rate, which means that testing whether the average of the subsequent recurrence plots is decreasing could be of use when determining the contribution of adding the layer to the ESN.

Fig. 4 Mean absolute error as a function of the number of reservoirs N_L (top), and evolution of the average difference between recurrence plots corresponding to layers l − 1 and l + 1 for N_L ∈ {4, ..., 10} (bottom). There is a clear positive correlation between the test error and the RP difference, suggesting that performance improvements occur whenever subsequent reservoirs yield a more detailed recurrence plot


3.3 Pixel absence effect

Finally, we delve into the third technique that comprises the proposed xAI suite for ESN-based models. The explanation departs from the thought that the analysis of a predictive model always brings up the question of whether an input is causing any effect on the model's prediction. Naturally, the goal is to find out whether the inputs deemed to be the most important ones for the task at hand are actually the ones on which the model focuses most when eliciting its predictions. However, discovering whether this is actually true in the trained model is not an easy task.

Intuitively, if a step back is taken from the architecture to simply stare at the difference caused by the cancellation of a certain input, it should be possible to observe the individual influence of such input on the issued prediction. Similarly, inputs could be canceled in groups to evaluate the importance of each neighboring region of the input space. Given that the result of this technique can be laid directly over the input of the model, the results can be readily read and understood by any type of user, disregarding his/her technical background. This capability makes this xAI approach promising from an explainability standpoint. Indeed, the term pixel in the name of the technique comes purposely from the noted suitability of the technique to be depicted along with the input data, applying naturally when dealing with image or video data. However, this importance attribution technique can be applied to other input data flavors (e.g., time series) without requiring any modification of its steps.

Before delving further into this technique, it is of paramount importance to understand that the transformation of an image into a time series is needed to harness the modeling potential of ESN models. For this, the transformation presented by [56] is adopted: in accordance with Fig. 5, a gray-scale (or RGB) image with dimensions W × H (pixels) can be represented as a W · H-length time series u(t) (correspondingly, a vector-valued u(t) for the RGB case) by taking its rows/columns and concatenating them serially. For this work, images were transformed on a row basis, which means that for W rows, each of the pixels in the given row are left as a series of pixels, stitching the last pixel of the given row with the first pixel of the next, until every row has been concatenated.

Fig. 5 Schematic diagram showing the process of converting an RGB image to a K = 3 time series u(t)

Implicitly, this transformation is not unique to image data, since what is being processed is actually a time series. However, this technique is adopted over image data since it allows reverting all the computed importance levels back to the image without having to quantify the relevance of this attribution over time.

In accordance with the previous notation, this technique is now mathematically described starting from the input-output relationship of a single-layered ESN model in Expression (2), which is included below for convenience:

$$\hat{\mathbf{y}}(t) = g\left(\mathbf{W}^{out}[\mathbf{x}(t);\mathbf{u}(t)]\right) = g\left(\mathbf{W}^{out}\mathbf{z}(t)\right). \qquad (11)$$

The cancellation of a certain point T of the input sequence u(t) would affect not only the sequence itself, but also future predictions ŷ(t) ∀t > T due to the propagation of the altered value throughout the reservoir. Namely:

$$\hat{\mathbf{y}}^{*}(t) = g\left(\mathbf{W}^{out}[\mathbf{x}^{*}(t);\mathbf{u}^{*}(t)]\right), \qquad (12)$$

where u*(t) denotes the input signal with u*(T) = 0_{K×1}, and ŷ*(t) results from applying recurrence (1) to this altered signal. The intensity of the effect of the suppression on the output signal at time T can be quantified as:

$$\mathbf{e}(t; T) = \left[e_l(t; T)\right]_{l=1}^{L} = \hat{\mathbf{y}}^{*}(t) - \hat{\mathbf{y}}(t), \quad \forall t \geq T, \qquad (13)$$

where e(t; T) ∈ R^{L×1}, and its sign indicates the direction in which the output is pushed as per the modification of the input. Clearly, when dealing with classification tasks (i.e., y(t) ∈ N^{L×1}), a similar rationale can be followed by conceiving the output of the ESN-based model as the class probabilities elicited for the input sequence. Therefore, the intensity computed as per (13) denotes whether the modification of the input sequence increases (e_l(t; T) > 0) or decreases (e_l(t; T) < 0) the probability associated to class l ∈ {1, ..., L}.

At this point it is important to remark that, given that this third xAI tool derives from the perturbation of individual pixels, possible correlations between pixels over the image are neglected, thereby quantifying importances that do not consider the plausibility of such perturbations. To solve this matter, the pixel absence analysis can be extended by bootstrapping: pixels to be altered are sampled by bootstrapping (in a sample size Np established beforehand, e.g., ten at a time), for a total of Nb bootstraps. The resulting probability changes are accumulated for each given pixel, giving rise to an enhanced quantification of the overall importance of every pixel, considering the effect of pixel-to-pixel interactions on the output of the ESN model. The description of this technique is summarized in Pseudocode 1.


Pseudocode 1: Bootstrap pixel absence effect

Data: Input series u(t) (e.g., coming from an image-to-time-series transformation), output ŷ(t) of the model at time t, time T at which the xAI tool is queried.
Result: Vector e(t; T) of accumulated relative output probability differences.
1  for nb = 1, ..., Nb (number of bootstraps) do
2      Sample uniformly at random Np positions from u(t) (sampled positions)
3      Cancel the sampled positions in u(t) by interpolating their values as per their closest neighboring points in the time series
4      Compute the ESN output ŷ*(t) for the perturbed input
5      Measure the difference between the old and new (perturbed) outputs as per Expression (13)
6      Update vector e(t; T) by accumulating the differences at the sampled positions
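A direct Python rendering of Pseudocode 1 could look as follows. It assumes a hypothetical `predict` function returning the class probabilities of the trained ESN for a full input sequence, and uses the average of the two closest neighbors to cancel each sampled position; it is a sketch of the procedure, not the authors' released implementation.

```python
import numpy as np

def pixel_absence_effect(predict, u, class_idx, n_bootstraps=100,
                         n_per_bootstrap=10, seed=0):
    """Bootstrap pixel absence effect (Pseudocode 1). `u`: input series of
    shape (T,) or (T, K); `predict`: maps such a series to class probabilities
    of shape (T, L); `class_idx`: class whose probability change is tracked.
    Returns the accumulated signed differences (Expression (13)) per position."""
    rng = np.random.default_rng(seed)
    y_ref = predict(u)[:, class_idx]             # unperturbed class probability
    T = len(u)
    effect = np.zeros(T)
    for _ in range(n_bootstraps):
        pos = rng.choice(T, size=n_per_bootstrap, replace=False)
        u_pert = u.copy()
        for p in pos:
            left, right = max(p - 1, 0), min(p + 1, T - 1)
            # Cancel the sampled position by interpolating its closest neighbors
            u_pert[p] = 0.5 * (u[left] + u[right])
        y_pert = predict(u_pert)[:, class_idx]
        # Expression (13): signed effect of the suppression on the output
        effect[pos] += (y_pert - y_ref)[pos]
    return effect

# For image data, `effect` can be reshaped back to (H, W) and displayed over
# the original image, as done in Sect. 5.2.
```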


3.4 Benefits of the framework for different audiences

All the xAI tools composing the proposed framework yield different insights and uses depending on the audience to which the issued explanations are presented. This section elaborates further on the benefits that such explanations can bring to different user profiles.

To begin with, the output of the potential memory tool can be thought of as a quantitative measure of the amount of information retained in the ESN reservoirs through time. For a data scientist, the size of the ESN's reservoir(s) is arguably the first parameter to be tuned when designing ESN architectures. This being said, this parameter has to be settled before focusing on other parameters. By being informed about the potential memory of the model, a data scientist can ascertain whether the reservoir is large enough for the application at hand whenever intuition and prior expert knowledge can give a hint of the minimum memory that an ESN should possess. A similar reasoning applies to the case of a general user without a background in machine learning: prior intuition and knowledge can tell whether an ESN can realistically model a pattern over time, increasing the confidence of the user in the model's outcomes.

In what refers to recurrent patterns, a data scientist can harness the generated recurrence plot to further validate the model's efficacy by matching the features found in the raw data with the patterns encountered in its recurrence plots, including stationarity, recurrence, and other time-dependent properties alike. Likewise, a general user can discover information about the behavior of the system being modeled, since recurrent patterns showcase certain properties that, once identified from the information displayed in the recurrence plots, are easy to detect from data.

Finally, pixel absence gives further insights about the model's behavior, and allows a data scientist to tune the parameters of the model in a more effective manner. Most interestingly, the output of this tool helps discover biases hidden in data and propagated to the model's output through the architecture. When the computed feature importances are shown to a general user, they support the fast interpretation of the most relevant parts of the input sequences with respect to the prediction, and the verification that such influential parts conform to prior knowledge of the user.

4 Experimental setup

We assess the performance of the set of proposed techniques described previously by means of several computer experiments where different ESN-based models are used to model datasets of diverse nature. All experiments have been run several times before showing the results, verifying that the explanations issued by the developed xAI techniques are stable and do not vary under different randomized values of the reservoirs' parameters. As such, experiments are divided in three use cases:

• To begin with, experiments deal with the most explored nature of data in ESN modeling: time series. When dealing with tasks formulated over time series data (particularly forecasting), it is of utmost importance to understand the behavior of the ESN architecture. To this end, the output of potential memory and temporal patterns is exemplified when used to explain an ESN-based model for three different regression tasks: one comprising a variable-amplitude sinusoid, and two real-world datasets for battery consumption and traffic flow forecasting.
• Second, experiments focus on the explanation of ESN models when used for image classification. This application is not new [57–59], but considering this task as a benchmark for the developed xAI suite permits to show its utility to unveil strengths and weaknesses of these architectures in such a setup.
• Finally, video classification is approached with ESN-based models. Differently than the other two tasks under consideration, we will not only analyze the knowledge captured by the ESN-based classifier, but also compare it to a Conv2DLSTM model [60] used as a baseline for video classification.


The following subsections describe the setup and datasets used for the tasks described above.

4.1 Time series forecasting

As mentioned before, time series analysis stands at the core of the ESN architecture's main objective: modeling sequential data. Therefore, three experiments with different time series data sources are devised to gain understanding upon being modeled by means of ESNs. Specifically, such experiments consider (1) a synthetic NARMA system of order 10 fed with a variable noisy sinusoidal signal; (2) a dataset composed of real traffic flow data collected by different sensors deployed over the city of Madrid, Spain [61], and (3) a dataset of lithium-ion battery sensor readings. A regression task can be formulated over the above datasets: to predict the next value of the time series data, given the history of past values.

To begin with, we need a controllable dynamical system producing data to be modeled via a reservoir-based approach. Opting for this initial case eases the process of verifying whether the knowledge of the trained reservoir visualized with the xAI tools conforms to what could be expected from the dynamical properties underneath. For this purpose, a first system is chosen featuring a recurrent behavior to see its implications on the potential memory and temporal patterns. A NARMA system governed by the following expression meets this recurrent nature sought for the time series:

$$y(t+1) = \tanh\left(0.3\, y(t) + 0.05\, y(t) \sum_{i=0}^{9} y(t-i) + 1.5\, u(t-9)\, u(t) + 0.1\right), \qquad (14)$$

fed with a one-dimensional time series u(t) given by:

$$u(t) = \sin(2\pi F t) + 0.02\, n(t), \qquad (15)$$

with F = 4 Hz and n(t) ~ N(0, 1), i.e., a Gaussian distribution with mean 0 and unit variance. The recurrence as per Expression (14) imposes that any model used for modeling the relationship between u(t) and y(t+1) should focus on at least ten prior time steps when producing its outcome, and should reflect the behavior of the signal amplitude changes over time. A segment of the time series resulting from the NARMA system is shown in Fig. 6 (top plot).

The second and third datasets, however, deal with the task of forecasting the next value of time series data generated by a real-life recurrent phenomenon. On one hand, a battery dataset is built with recorded electric current and temperature measurements of a lithium-ion battery cycled to depletion following different discharging profiles. This type of data allows for an inspection of a multivariate ESN that presents a very stable working regime, featuring very abrupt changes (as the middle plot in Fig. 6 clearly shows). On the other hand, traffic flow data collected by a road sensor in Madrid is used. This data source is interesting since it contains long-term recurrence dynamics (bottom plot in Fig. 6), suited to be modeled via ESN-based approaches. These datasets have been made available in a public GitHub repository, along with all scripts needed for replicating the experiments (https://github.com/alejandrobarredo/xAI4ESN).

Methodologically, the experiments with the time series datasets discussed in Sect. 5.1 are structured in the following way. First, a potential memory analysis is carried out for the three datasets, elaborating on how the results relate to the characteristics of the data being modeled. Then, a follow-up analysis is performed around the temporal patterns emerging from the reservoir dynamics, also crosschecking whether such patterns conform to intuition. All figures summarizing the outputs of these xAI techniques are depicted together on the same page for a better readability of the results and an easier verification of the claims made thereon.
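For reference, the synthetic benchmark of Expressions (14)–(15) can be generated with a few lines of NumPy. The sampling step of the sinusoid is not stated in the text, so the value used below is an assumption.

```python
import numpy as np

def narma10_series(T=10000, F=4.0, dt=0.01, seed=0):
    """Input/output pair of the order-10 NARMA system of Eqs. (14)-(15).
    The sampling step dt (here 0.01 s) is an assumption, not taken from the paper."""
    rng = np.random.default_rng(seed)
    t = np.arange(T) * dt
    u = np.sin(2 * np.pi * F * t) + 0.02 * rng.standard_normal(T)   # Eq. (15)
    y = np.zeros(T)
    for n in range(10, T - 1):
        # Eq. (14): order-10 nonlinear autoregressive-moving average recurrence
        y[n + 1] = np.tanh(0.3 * y[n]
                           + 0.05 * y[n] * np.sum(y[n - 9:n + 1])
                           + 1.5 * u[n - 9] * u[n]
                           + 0.1)
    return u, y
```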
4.2 Image classification

As mentioned before, ESN models have been explored for image classification in the past, attaining competitive accuracy levels with a significantly reduced training complexity [57–59]. To validate whether an ESN-based model is capturing features from its input image that intuitively relate to the target class to be predicted, the second experiment focuses on the application of the developed pixel absence technique to an ESN image classifier trained to solve the well-known MNIST digit classification task [70]. This dataset comprises 60,000 train images and 10,000 test images of 28 × 28 pixels, each belonging to one among 10 different categories.

Since these recurrent models are rather used for modeling sequential data, there is a prior need for encoding image data to sequences, so that the resulting sequence preserves most of the spatial correlation that could be exploited for classification. This being said, a column encoding strategy is selected, so that the input sequence is built by the columnwise concatenation of the pixels of every image. This process yields, for every image, a one-dimensional sequence of 28 × 28 = 784 points, which is input to a Deep ESN model with N_L = 4 reservoirs of N = 100 neurons each. Once trained, the Deep ESN model achieves a test accuracy of 86%, which is certainly lower than that of CNN models used for the same task, but comes along with a significant decrease in training complexity. Training and testing over this dataset was performed in less than 5 s using a parallelized Python implementation of the Deep ESN model and an i7 2.8 GHz processor with 16 GB of RAM.
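The column encoding described above amounts to flattening each 28 × 28 image in column order, which yields one 784-step sequence per digit. A minimal sketch follows (the random digit is only a stand-in for an MNIST image loaded from any source):

```python
import numpy as np

def image_to_sequence(img, order="column"):
    """Flatten a (H, W) gray-scale image into a 1-D sequence by column-wise
    (or row-wise) concatenation, preserving local spatial correlation."""
    flat = img.T.flatten() if order == "column" else img.flatten()
    return flat.astype(np.float32) / 255.0       # (H*W,) sequence scaled to [0, 1]

# Example with a random stand-in for an MNIST digit (28 x 28 = 784 points):
digit = np.random.randint(0, 256, size=(28, 28))
u = image_to_sequence(digit)                     # fed to the Deep ESN as u(t)
assert u.shape == (784,)
```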


Fig. 6 Segments of the different time series datasets used in the experiments: (top) input u(t) (black) and output y(t) (red) of a NARMA model of order 10; (middle) electrical current (black) and temperature (red) from a battery; and (bottom) road traffic flow (vehicles per hour) collected over time (color figure online)

In this case, the results and the discussion held on them focus on the pixel absence effect computed not only for single pixels (namely, points of the converted input sequence), but also for 2 × 2, 4 × 4 and 8 × 8 pixel blobs, showing the importance the model granted to regions of the input image with increasing granularity. As elaborated in Sect. 5.2, the absence effects noted in this dataset exemplify further uses beyond explainability, connecting with the robustness of the model against adversarial attacks.

4.3 Video classification

With this third kind of data a further step is taken beyond the state of the art, assessing the performance of ESN-based models for video classification. The transformation of an image to sequences paves the way toward exploring means to follow similar encoding strategies for the classification of stacked images, which essentially gives rise to the structure of a video.

In its seminal approach, the video classification task comprises not only predicting the class associated to an image (frame), but also aggregating the predictions issued for every frame over the length of every video. These two steps can be performed sequentially or, instead, jointly within the model, using for this purpose assorted algorithmic strategies such as joint learnable frame classification and aggregation models. In this case, similarly to image classification, each video (i.e., a series of frames) is encoded to a multidimensional sequence that embeds the evolution of each pixel of the frames over the length of the video. Specifically, each component u_k(t) of u(t) is a sequence denoting the value of the k-th pixel as a function of the frame index t. Therefore, the dimension K of the input u(t) to the ESN model is equal to the number of pixels of the frames composing the video under analysis. Figure 7 depicts a diagram of this video encoding process.

Fig. 7 Schematic diagram showing the process of converting a sequence of images (video) to a multi-dimensional time series u(t)
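In practice, the frame-stacking encoding of Fig. 7 reduces to reshaping the video tensor so that every pixel (and channel) becomes one input dimension evolving with the frame index. A minimal sketch with illustrative names:

```python
import numpy as np

def video_to_sequence(frames):
    """Convert a video of shape (T, H, W) or (T, H, W, C) into the
    multidimensional series u(t) of Fig. 7: one row per frame index t,
    one column per pixel (and channel), so K = H*W(*C)."""
    T = frames.shape[0]
    return frames.reshape(T, -1).astype(np.float32) / 255.0

# Example: a 40-frame 28 x 28 gray-scale clip becomes a (40, 784) sequence
clip = np.random.randint(0, 256, size=(40, 28, 28))
U = video_to_sequence(clip)      # input u(t) to the ESN-based video classifier
assert U.shape == (40, 784)
```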
comprises not only predicting the class associated to an noted performance gaps to the modeling power of one or
image (frame), but also aggregating the predictions issued another classification model.
for every frame over the length of every video. These two To avoid this problem and focus the scope of the dis-
steps can be performed sequentially or, instead, jointly cussion, the ESN-based architecture is compared to a
within the model, using for this purpose assorted algo- Conv2DLSTM deep neural network [60] having at its input
rithmic strategies such as joint learnable frame classifica- the original, unprocessed sequence of video frames. The
tion and aggregation models. In this case, similarly to CNN part learns visual features from each of the frames,
image classification, each video (i.e., a series of frames) is whereas the LSTM part learns to correlate the learned
encoded to a multidimensional sequence data that embeds visual features over time with the target class. Models are
the evolution of each pixel of the frames over the length of trained with just the information available in each of the
the video. Specifically, each component uk ðtÞ of uðtÞ is a datasets, without data augmentation nor pretraining. This
sequence denoting the value of the k-th pixel as a function comparison allows comparing the strengths of these raw
of the frame index t. Therefore, the dimension K of the modeling architectures without any impact of other pro-
input uðtÞ to the ESN model is equal to the number of cessing elements along the video classification pipeline.
pixels of the frames composing the video under analysis. Table 2 summarizes the characteristics of these models
Figure 7 depicts a diagram of this video encoding process.
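As a concrete illustration of the encoding sketched in Fig. 7, the snippet below flattens each frame so that component u_k(t) holds the value of the k-th pixel at frame index t. It is a minimal sketch: grayscale frames and the (T, H, W) array layout are assumptions, since any per-pixel ordering kept consistent across frames would serve the same purpose.

```python
import numpy as np

def video_to_sequence(frames):
    """Convert a video into the multi-dimensional input sequence u(t).

    frames: array of shape (T, H, W) with one grayscale image per frame.
    Returns an array of shape (T, K) with K = H * W, where column k holds the
    temporal evolution of the k-th pixel across the T frames.
    """
    frames = np.asarray(frames, dtype=float)
    t, h, w = frames.shape
    return frames.reshape(t, h * w)

# Example: a synthetic 10-frame, 28 x 28 video becomes a 784-dimensional sequence.
u = video_to_sequence(np.random.rand(10, 28, 28))
print(u.shape)  # (10, 784)
```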


Table 1 List of action recognition video datasets considered in the experiments, along with their characteristics

Dataset | # classes | # videos | Frame size | Description
KTH [62] | 6 | 600 | 160 × 120 | Videos capturing 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in 4 different scenarios
WEIZMANN [63] | 10 | 90 | 180 × 144 | Videos of 9 different people, each performing 10 natural actions (e.g., run, walk, skip, jumping-jack) divided in examples of 8 frames
IXMAS [64] | 13 | 1650 | 320 × 240 | Lab-generated multi-orientation videos of 5 calibrated and synchronized cameras recording common human actions (e.g., checking-watch, crossing-arms, scratching-head)
UCF11 [65] | 11 | 3040 | 320 × 240 | Action recognition dataset of realistic action videos, collected from YouTube, having a variable number of action categories
UCF50 [66] | 50 | 6676 | – | –
UCF101 [67] | 101 | 13320 | – | –
UCFSPORTS [68] | 16 | 800 | 320 × 240 | Sport videos, collected from YouTube, having 16 sport categories
HMDB51 [69] | 51 | 6766 | 320 × 240 | Videos collected from different sources (mainly YouTube and movies) divided into 51 actions that can be grouped in five types: general facial actions, facial actions with object, general body movement, body with object, human body interactions
SquaresVsCrosses (new in this work) | 2 | 2000 | 28 × 28 | Synthetic video dataset containing two classes (squares and crosses) with clear spatial separability. The squares class contains videos of a 2 × 2 pixel blob moving randomly in a square trajectory of equal horizontal and vertical length and radius, centered in the middle of the frame. The crosses class contains an equally sized 2 × 2 blob moving randomly in a cross motion through the center of the frame.
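For illustration, the snippet below shows one way a SquaresVsCrosses-like dataset could be generated. The frame size, blob size and the two trajectory shapes follow the description above, whereas the number of frames per video and the random sampling of positions along each trajectory are assumptions of this sketch, not specifications taken from the paper.

```python
import numpy as np

def make_squares_vs_crosses(n_videos=2000, frames=16, size=28, blob=2, seed=0):
    """Hypothetical generator for a SquaresVsCrosses-like dataset: a blob of
    `blob` x `blob` pixels moves along either a square path (class 0) or a
    cross path (class 1) centered in the frame."""
    rng = np.random.default_rng(seed)
    c, r = size // 2, size // 4  # center of the frame and half-length of the paths
    # Waypoints of the two trajectories (corners of a square / arms of a cross).
    square = [(c - r, c - r), (c - r, c + r), (c + r, c + r), (c + r, c - r)]
    cross = [(c - r, c), (c + r, c), (c, c - r), (c, c + r), (c, c)]
    videos = np.zeros((n_videos, frames, size, size), dtype=np.float32)
    labels = rng.integers(0, 2, n_videos)
    for v in range(n_videos):
        path = square if labels[v] == 0 else cross
        for t in range(frames):
            # The blob jumps randomly between waypoints of its trajectory.
            i, j = path[rng.integers(len(path))]
            videos[v, t, i:i + blob, j:j + blob] = 1.0
    return videos, labels
```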

In order to assess its predictive accuracy, an ESN-based model with NL = 4 reservoirs and N = 500 neurons each is run over several known video classification datasets and a synthetic dataset built in an attempt to showcase the ability of the pixel absence effect to discover biases. These datasets are thoroughly described in Table 1. However, achieving a fair comparison between other video classification models reported in the literature and the ESN-based model is not straightforward. Most existing counterparts are deep neural networks that exploit assorted video preprocessing strategies (e.g., local space-time features [62] or the so-called universal background model [71]) and/or massively pretrained weights [72], which permit finely tuning the network for the task at hand. Including these ingredients in the benchmark would not allow for a fair attribution of the noted performance gaps to the modeling power of one or another classification model.

To avoid this problem and focus the scope of the discussion, the ESN-based architecture is compared to a Conv2DLSTM deep neural network [60] having at its input the original, unprocessed sequence of video frames. The CNN part learns visual features from each of the frames, whereas the LSTM part learns to correlate the learned visual features over time with the target class. Models are trained with just the information available in each of the datasets, without data augmentation nor pretraining. This comparison allows assessing the strengths of these raw modeling architectures without any impact of other processing elements along the video classification pipeline.

Table 2 Considered ESN and Conv2DLSTM models

Experiment | NL | N | α | ρmax
NARMA | 1 | 100 | 0.99 | 0.95
Battery | 5 | 200 | 0.90 | 0.9
Traffic | 2 | 200 | 0.99 | 0.9
MNIST | 4 | 100 | 0.95 | 0.9
Video | 4 | 500 | 0.99 | 0.9
Conv2DLSTM: 64 3 × 3 kernels + LSTM, tanh activation + Dropout (0.2) + Dense (256 neurons), ReLU activation + Dropout (0.3) + Dense (# classes), softmax output. Optimizer: SGD, lr = 0.001; categorical cross-entropy, 20 epochs

Table 2 summarizes the characteristics of these models under comparison, as well as the parameters of the ESN-based approaches used in the preceding tasks. In regard to explainability, the developed pixel absence technique is applied to the ESN model trained for video classification to extract insights about the model's learning abilities in this setup. The discussion about the results of the benchmark is presented in Sect. 5.3.

5 Results and discussion

This section goes through and discusses the results obtained for the experiments introduced previously. The section is structured in three main parts: time series forecasting (Sect. 5.1), image classification (Sect. 5.2) and video classification (Sect. 5.3).

5.1 Time series analysis

As mentioned before, in this first set of experiments two different studies are carried out. The first one is centered around the potential memory technique, whereas the second one focuses on the temporal patterns method.

5.1.1 Experiment 1: potential memory

Reservoir size is one of the most important parameters when designing an ESN-based model, mostly because it is a compulsory albeit not sufficient condition for the model to be able to model a system's behavior. It is compulsory because there is always a minimum amount of memory required to learn a recurrent behavior, yet it is not sufficient because memory does not model behavior by itself, but only allows for its persistence.
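To make the role of the parameters in Table 2 concrete, the sketch below implements a generic leaky-integrator ESN reservoir in which N fixes the amount of state (and hence memory) available, ρmax rescales the recurrent weights, and α controls how quickly past states are forgotten. This is a textbook-style reservoir written for illustration, not the authors' implementation; a deep ESN stacks several such reservoirs, feeding the states of one layer as input to the next.

```python
import numpy as np

def init_reservoir(n_inputs, n_neurons, spectral_radius, seed=0):
    """Random input and recurrent weights; the recurrent matrix is rescaled so
    that its largest eigenvalue magnitude equals the target spectral radius
    (rho_max in Table 2)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-1.0, 1.0, (n_neurons, n_inputs))
    w = rng.uniform(-0.5, 0.5, (n_neurons, n_neurons))
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(u, w_in, w, alpha):
    """Leaky-integrator update x(t) = (1 - alpha) * x(t-1)
    + alpha * tanh(W_in u(t) + W x(t-1)), with alpha the leaky rate of Table 2.

    u: input sequence of shape (T, n_inputs). Returns the (T, N) state matrix."""
    states = np.zeros((u.shape[0], w.shape[0]))
    x = np.zeros(w.shape[0])
    for t, u_t in enumerate(u):
        x = (1.0 - alpha) * x + alpha * np.tanh(w_in @ u_t + w @ x)
        states[t] = x
    return states
```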


In most cases, this value is set manually, at most blindly automated by means of wrappers, without any further interpretable information given about the appropriateness of its finally selected value. This experiment attempts at bringing light into this matter.

For this purpose, different reservoir sizes were explored when fitting an ESN to the three datasets to be modeled. For every dataset, the reservoir size varies while keeping the rest of the parameters fixed. The rest of the parameters are chosen through experimentation. A spectral radius ρmax equal to 0.95 is established, since all systems have long-term interactions between input and target, as well as a leaky rate α of 0.99 to ensure a fast plasticity of the reservoirs.

As shown in Fig. 8a, b, c, the size of the reservoir can be considered a major issue when fitting the model, until a certain threshold is surpassed. The potential memory technique allows monitoring this circumstance by estimating the model's memory capacity. After surpassing a memory threshold, the potential memory of the reservoirs becomes predictable and does not change any longer. This phenomenon seems to be related with the improvement in error rate, although this second aspect should be considered with caution, since the process of fitting an ESN-based model involves several other parameters that interact with each other. In any case, the technique presented in this first study provides some hints that might help the user understand when a workable reservoir size is selected.

5.1.2 Experiment 2: temporal patterns

When designing a deep ESN model, the next question arising along the process deals with the understanding of whether the model has captured all the dynamics inside the dataset so as to successfully undertake the modeling task. This second experiment analyzes the extent to which a Deep ESN is able to capture the underlying patterns of a system by means of the proposed temporal patterns. In this experiment, different parameters are varied to show the usefulness of this technique. This puts in perspective two interesting aspects. First, the technique allows for a quick inspection of whether the model is capturing the patterns of the system; as per the results exposed in what follows, the clarity with which such patterns show up in the plot seems to be related to how well the model is capturing them.

Figure 9a/d/g, b/e/h, c/f/i summarizes the results of different reservoir configurations for the three time series datasets under consideration. Figures are ordered in three different sets {a, b, c}, {d, e, f} and {g, h, i}. The first set corresponds to ESN models with equal parameters and different layer configurations (1 and 2 layers). The second set analyzes system configurations with two different ρmax values. The third set regards differences in the value of the α parameter. A simple visual inspection of these plots uncovers that d.2, e.2 and f.2 present clear patterns (in comparison to the other configurations), yielding the smallest errors for each dataset.

From the experimentation, two main findings can be underscored. First, stacking several reservoirs may jeopardize the tuning process for the other parameters of the model. Figure 9a, b, c shows that adding a new layer without changing any other parameter has decreased the error and even sharpened the patterns captured by the reservoirs. This effect occurs for every dataset, although it is hard to see the differences in the patterns due to the complexity of the models being explained. Figure 9d, e, f shows the difference between a high and a low ρmax, demonstrating the relevance of this parameter to attain a further level of detail in the patterns captured by the reservoirs. Although the plots in Fig. 9e, f are harder to analyze, both expose more detailed patterns, along with the decrease in error for the highest ρmax configuration. Finally, Fig. 9g, h, i examines the differences found when tweaking the α parameter. All three cases present blurred patterns, along with newly emerging artifacts. Since the α parameter is linked to the decay, tuning its value may induce new patterns captured in the reservoirs that may or may not be informative for the audience, or even for the overall modeling performance of the network. This roughly depends on the dynamics of the system and the task under consideration.

Apart from the usability of showing interpretable information to the user about the capability of the model to capture the patterns within data, these plots enable another analysis. If we focus on the battery dataset, it is quite clear that the signal is repetitive and stationary, and that the system's recurrent patterns are also visible in the recurrence plots. In the case of the traffic data, this phenomenon is even more important: by just staring at the traffic signal it is not easy to realize that the signal also contains certain stationary events (weekends) that produce different behaviors in traffic. By simply staring at the recurrence plots, it is straightforward to realize the periodic structure of the signal, and how these events happen at equally distant time steps throughout the signal (as happens with weekends), with an especially large effect at the top right of the recurrence plot corresponding to a long weekend.

The experiments discussed above should not be limiting in what refers to the different features that can be extracted from recurrence plots. Within these plots, large- and small-scale patterns can be distinguished. Following [54], large-scale patterns are considered topological features, whereas small-scale ones are regarded as textures. Manifold topological features can be ascertained from recurrence plots, such as homogeneity (stationarity), the presence of periodic and quasi-periodic structures that represent oscillating systems, drifts (non-stationarities evolving at a slow speed over time) and abrupt changes that can be of interest to see whether the model is over-fit to rare events in the modeled sequences.
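These temporal patterns build on recurrence plots of the reservoir dynamics [53, 54]. A minimal sketch of how such a plot can be derived from a matrix of reservoir states is given below; the Euclidean distance and the percentile-based threshold are common choices assumed here, not prescriptions taken from the paper.

```python
import numpy as np

def recurrence_plot(states, threshold=None):
    """Binary recurrence matrix R[i, j] = 1 when the reservoir states at time
    steps i and j are closer (in Euclidean distance) than `threshold`.

    states: array of shape (T, N) holding the reservoir state at each time step.
    If no threshold is given, the 10th percentile of all pairwise distances is
    used, a common heuristic for recurrence plots.
    """
    diff = states[:, None, :] - states[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    if threshold is None:
        threshold = np.percentile(dist, 10)
    return (dist <= threshold).astype(int)

# Diagonal lines in the resulting matrix expose recurrent (periodic) states,
# whereas homogeneous regions indicate stationary behavior.
```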


Fig. 8 Potential memory analysis of the ESN model trained over the a NARMA, b battery and c traffic datasets for different reservoir sizes NL = {1, 10, 50, 500, 1000} and constant reservoir parameters
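A reservoir-size analysis like the one in Fig. 8 can be approximated with the classical short-term memory capacity measure used in the ESN literature: linear readouts are trained to reconstruct increasingly delayed copies of the input, and their squared correlations are accumulated. The sketch below follows that standard recipe and should be read as a proxy, not necessarily as the exact potential memory computation proposed in this work.

```python
import numpy as np

def memory_capacity(states, u, max_delay=50):
    """Short-term memory capacity proxy: for each delay d, fit a linear readout
    (least squares) that reconstructs u(t - d) from the reservoir state x(t),
    and accumulate the squared correlation of the reconstruction.

    states: (T, N) reservoir states collected while driving the reservoir with
            the scalar input sequence u of length T.
    """
    t = len(u)
    capacity = 0.0
    for d in range(1, max_delay + 1):
        x, y = states[d:], u[:t - d]                 # pair x(t) with u(t - d)
        w, *_ = np.linalg.lstsq(x, y, rcond=None)    # linear readout for this delay
        corr = np.corrcoef(x @ w, y)[0, 1]
        capacity += 0.0 if np.isnan(corr) else corr ** 2
    return capacity
```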


Fig. 9 Temporal patterns analysis over the a, d, g NARMA; b, e, h battery; and c, f, i traffic datasets, presenting different configuration parameters. It can be observed that the spectral radius impacts directly on the detail of the recurrence plots, to the extent of causing a significant performance degradation if not selected properly


Regarding textures, from the recurrence plots one can infer rare and isolated correlated states that can be associated with noise (single dots), phase spaces that are visited at different times (recurrent states evinced by diagonal lines) and vertical/horizontal lines, which represent a steady state throughout a given time frame. When occurring in recurrence plots, a proper interpretation of these events usually requires intuition and domain knowledge, as we have exemplified in our prior experiments with battery consumption and traffic data.

5.2 Image classification

As introduced before, this second experiment analyzes a Deep ESN model for image classification over the MNIST dataset. Once trained, this model is used to examine the output of the third technique comprising the proposed suite of xAI techniques: pixel absence effect.

Figure 10 shows the results of computing the pixel absence effect for single pixels, wherein the color represents the importance of canceling each pixel. Specifically, the scale in the upper left corner shows the amount of change that the cancellation of a certain pixel has caused on the probability of every class. The blue color indicates the pixels that are important to predict the image as the desired class: the darker the blue color of a certain pixel, the higher the decrease in probability for the desired class when that pixel is canceled. The red color means exactly the opposite. Therefore, the output of this technique must be conceived as an adversarially driven local explanation technique that pursues to explain which regions of a certain input, upon their cancellation, push the output class probability distribution toward or against a certain desired class.

Fig. 10 Pixel absence effect over a Deep ESN model trained to classify MNIST digits for single pixels

We exemplify the output of the experiment for a given example of digit 8. The reason for this choice is that this digit has several coinciding visual features with other digits. As such, it was expected that a great effect would be discerned at the center of the digit itself, favoring the classification toward digit 0, since the elimination of the center pixels in digit 8 should directly render the image as a 0. However, it is important to note that this does not happen for ESN architectures, which are sensitive to space transformations. This sensitivity impedes any ability of the model to extract spatially invariant features from the digit image, thereby circumscribing its knowledge within the sequential column-wise patterns found in examples of its training set. This conclusion emphasizes the relevance of finding transformation strategies from the original domain of data to sequences, not only for ensuring a good performance of the learned ESN-based model, but also to be able to elicit explanations that conform to general intuition.

5.3 Video classification

As stated in the design of the experimental setup, the discussion ends with a video classification task approached via ESN-based models. Since, to our knowledge, this is the first time in the literature that video classification is tackled with ESN architectures, a benchmark is first run to compare its predictive performance to that of a Conv2DLSTM network, in both cases trained only over the unprocessed video data provided in the datasets of Table 1. As argued before, it would not be fair to include preprocessing elements that could bias the extracted conclusions about the modeling power of both approaches under comparison.

Table 3 summarizes the results of the benchmark, both in terms of test predictive accuracy and the size of the trained models (in terms of trainable parameters). It can be observed that Deep ESN models consistently outperform their Conv2DLSTM counterparts in all the considered datasets, using significantly fewer trainable parameters, which translates to dramatically reduced training latencies. Clearly, these results are far from the state-of-the-art levels of performance attained by deep neural networks benefiting from pretrained weights (from, e.g., ImageNet). Nevertheless, they serve as an unprecedented preview of the high potential of Deep ESN models to undertake complex tasks with sequence data beyond those for which they have been utilized so far (mostly, time series and text modeling).


Table 3 Accuracy and number of trainable parameters of models tested over each video classification dataset

Dataset | Deep ESN | Conv2DLSTM | Difference | State of the art | Technique
KTH | 56% | 22% | +34% | 94.39% | Differential gating [73]
WEIZMANN | 25% | 12% | +13% | 98.63% | Bio-inspired hierarchy [74]
IXMAS | 40% | 36% | +4% | 94.16% | Structural information [75]
UCF11 | 54% | 26% | +28% | 84.96% | GoogLeNet + RNN [76]
UCF50 | 75% | 63% | +12% | 95.24% | Trajectory descriptors [77]
UCF101 | 45% | 42% | +3% | 95.10% | Trajectory descriptors [78]
HMDB51 | 25% | 17% | +8% | 65.90% | Trajectory descriptors [78]
UCFSPORTS | 32% | 23% | +9% | 96.6% | Grassmann graph embedding [79]
Trainable parameters | Min: 12,000 / Max: 200,000 | Min: 305,626,418 / Max: 1,205,453,326 | Min: +293,626,418 / Max: +1,005,453,326 | |
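The parameter gap reported in Table 3 follows from the fact that only the readout of a Deep ESN is trained. A back-of-the-envelope check, assuming a linear readout (bias terms ignored) over the concatenated states of the NL = 4 reservoirs with N = 500 neurons used for video (Table 2), recovers the order of magnitude of the reported minimum and maximum counts.

```python
# Only the linear readout of a Deep ESN is trained: one weight per reservoir
# neuron (across all stacked reservoirs) and per output class.
n_layers, n_neurons = 4, 500          # Deep ESN configuration used for video (Table 2)
state_size = n_layers * n_neurons     # 2000 concatenated reservoir states

for name, n_classes in [("KTH", 6), ("UCF101", 101)]:
    print(name, n_classes * state_size)   # 12000 and 202000 readout weights
```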

The developed pixel absence effect test was run over the trained Deep ESN model for video classification. The core of the discussion is set on the results for a single video (local explanation) belonging to the UCF50 dataset, which should ideally determine what zones of the video frame are most important for the current prediction.

The results from this pixel absence analysis depicted in Fig. 11 uncover an unintended yet interesting result. Although the image shown in this picture is just one frame of the video, the region of the frame which is declared to be relevant by the analysis is the same throughout the whole video, the difference being the changing colors of its constituent pixel blobs. This effect unveils that the Deep ESN model has learned to focus on the floor to produce its predictions, which suggests that its knowledge is severely subject to an evident bias in the UCF50 dataset. This bias can be detected visually by simply observing the images depicted in Fig. 11b, c, which are samples of frames of the same class that also contain a similar floor feature that might cause a bias in the model.

Fig. 11 Pixel absence analysis of a video frame showing an unintended result that gives importance to sections of the image that apparently should not be important

The conclusions drawn above are further elaborated by undertaking a simplified video classification experiment with a dataset synthetically furnished for this purpose. This dataset, denoted as SquaresVsCrosses, is composed of videos of two different classes. Examples of one of these classes consist of a 3 × 3-sized blob of pixels with a square trajectory centered in the middle of the frame, whereas examples of the other class move over the frame by outlining a cross-shaped trajectory. To showcase the bias problem, two different experiments are run. In the first one, classes are set spatially disjoint from each other, i.e., the space over the frame in which examples of one class move over time is totally different from the space delimited by the trajectories of the examples of the other class. This first experiment should exemplify the problem of spatial bias. The second experiment is run with a dataset in which examples of both classes occupy the same space over the frame; hence, spatial bias is removed.

Figure 12 illustrates the results for these two different dataset configurations, where Fig. 12a, c corresponds to the squares class, and Fig. 12b, d to the crosses class. In Fig. 12a, b, examples of both classes are spatially disjoint: squares move along the border of the frame, while crosses move in the center of the frame. These two first plots reveal that there is no entanglement between the classes. Their pixel absence heat map contains only one color for each class, meaning that all the pixels contained in the image under analysis push toward one class (the true class) and are not present in the other. A bias emerges from the trajectory followed by the blob: although the shape of the blob should be informative with respect to the class (square or cross), it does not convey any predictive power when compared to the trajectory itself.

Contrarily, the second pair of videos in Fig. 12, namely Fig. 12c, d, correspond to a dataset where the trajectories of examples of both classes spatially overlap, hence removing the trajectory bias. The apparent difference is that this time, both heat maps show mixed colors when analyzing them with the pixel absence tool.


Consequently, some pixels occur more frequently in examples of one class than of the other, hence pushing the prediction toward that class when those pixels are activated. Contrarily to the case in which spatial bias was present, the analysis showed pixel importance patterns that are more complex than a simple spatial separation. This experiment concurs with the intuition surfacing from Fig. 11 and confirms that pixel absence might be a suitable tool for bias detection over the data from which the ESN model has learned. In summary: the use of the pixel absence effect tool can help the user detect sources of bias systematically, so that countermeasures to avoid their effect on the generalization capability of the model can be triggered.

Fig. 12 Pixel absence analysis of four different videos: a and b correspond to a model trained with spatially disjoint classes, whereas c and d correspond to a model with spatially overlapping classes

6 Conclusions and future research lines

This manuscript has elaborated on the need for explaining the knowledge modeled over sequential data by ESN-based models by means of a set of novel post-hoc xAI techniques. Through the lenses of the proposed methods, hidden strengths and weaknesses have been discovered for these types of models. To begin with, the modeling power of these models has been assessed over a diversity of data sources, with results above what could be expected given their inherent random nature. However, the suite of xAI techniques has also revealed that random reservoirs composed of recurrently connected neural units undergo architectural limitations to model data sources that call for spatially invariant feature learning, such as image and video. Indeed, ESN-based models achieve reasonable levels of predictive accuracy given their low training complexity, especially for video classification without any prior preprocessing, data augmentation and/or pretraining stages along the pipeline. A deeper inspection of their learned knowledge by means of the proposed xAI tools has identified huge data biases across classes in video data, showing that such superior scores might not be extrapolable to videos recorded in other contextual settings.

Several research paths are planned for the future departing from the findings herein reported. To begin with, different ways will be investigated to leverage the design flexibility of the reservoirs toward inducing expert knowledge contained in a transparent model into the initialization parameters of a Deep ESN. In this regard, elements from model distillation will be explored to convey such expert knowledge, possibly by driving the reservoir initialization process not entirely at random. Furthermore, explanations generated by the proposed tools can be used for other machine learning models that are also partly governed by random processes (e.g., random vector functional link networks, or random convolutional kernels). Another interesting research direction is to derive new strategies to transform spatially correlated data into sequences, as the results reported in this study have found that an inherent loss of information occurs as a result of this transformation. Finally, a close look will be taken at the interplay between explainability and epistemic uncertainty, the latter especially present in reservoir computing models. As in other randomization-based machine learning approaches for sequential data, it is a matter of describing memories from the past in a statistically consistent, understandable fashion.

Acknowledgements This work has received funding support from the Basque Government (Eusko Jaurlaritza) through the Consolidated Research Group MATHMODE (IT1294-19) and the ELKARTEK program (3KIA project, KK-2020/00049).

Author contributions A. Barredo Arrieta, J. Del Ser, and M. N. Bilbao contributed to conceptualization; A. Barredo Arrieta, S. Gil-Lopez, I. Laña, and J. Del Ser provided methodology; A. Barredo Arrieta and J. Del Ser performed formal analysis and investigation; A. Barredo Arrieta performed writing (original draft preparation); S. Gil-Lopez, I. Laña, M. N. Bilbao, and J. Del Ser performed writing (review and editing); and J. Del Ser contributed to funding acquisition and supervision.


Declarations

Conflict of interest The authors declare no conflict of interest.

References

1. Jaeger H (2003) Adaptive nonlinear system identification with echo state networks. In: Advances in neural information processing systems, pp 609–616
2. Lukoševičius M, Jaeger H (2009) Reservoir computing approaches to recurrent neural network training. Comput Sci Rev 3(3):127–149
3. Gallicchio C, Scardapane S (2020) Deep randomized neural networks. Recent Trends Learn Data, pp 43–68
4. Zhang L, Suganthan PN (2016) A survey of randomized algorithms for training neural networks. Inf Sci 364:146–155
5. Jaeger H, Haas H (2004) Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667):78–80
6. Wu Q, Fokoue E, Kudithipudi D (2018) On the statistical challenges of echo state networks and some potential remedies. arXiv:1802.07369
7. Jaeger H (2005) Reservoir riddles: suggestions for echo state network research. In: Proceedings of the 2005 IEEE international joint conference on neural networks, vol 3, pp 1460–1462. IEEE
8. Thiede LA, Parlitz U (2019) Gradient based hyperparameter optimization in echo state networks. Neural Netw 115:23–29
9. Öztürk MM, Cankaya IA, Ipekci D (2020) Optimizing echo state network through a novel Fisher maximization based stochastic gradient descent. Neurocomputing
10. Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López S, Molina D, Benjamins R, et al (2020) Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion 58:82–115
11. Gallicchio C, Micheli A, Pedrelli L (2017) Deep reservoir computing: a critical experimental analysis. Neurocomputing 268:87–99
12. Maass W, Natschläger T, Markram H (2002) Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput 14(11):2531–2560
13. Jaeger H (2001) The "echo state" approach to analysing and training recurrent neural networks - with an erratum note. German National Research Center for Information Technology, Bonn, Germany, GMD Technical Report 148(34):13
14. Dominey PF (1995) Complex sensory-motor sequence learning based on recurrent state representation and reinforcement learning. Biol Cybern 73(3):265–274
15. Steil JJ (2004) Backpropagation-decorrelation: online recurrent learning with O(N) complexity. In: 2004 IEEE international joint conference on neural networks (IEEE Cat. No. 04CH37541), vol 2, pp 843–848. IEEE
16. Del Ser J, Laña I, Manibardo EL, Oregi I, Osaba E, Lobo JL, Bilbao MN, Vlahogianni EI (2020) Deep echo state networks for short-term traffic forecasting: performance comparison and statistical assessment. In: IEEE international conference on intelligent transportation systems (ITSC), pp 1–6. IEEE
17. Palumbo F, Gallicchio C, Pucci R, Micheli A (2016) Human activity recognition using multisensor data fusion based on reservoir computing. J Ambient Intell Smart Environ 8(2):87–107
18. Crisostomi E, Gallicchio C, Micheli A, Raugi M, Tucci M (2015) Prediction of the Italian electricity price for smart grid applications. Neurocomputing 170:286–295
19. Jaeger H, Lukoševičius M, Popovici D, Siewert U (2007) Optimization and applications of echo state networks with leaky-integrator neurons. Neural Netw 20(3):335–352
20. Gallicchio C, Micheli A (2019) Richness of deep echo state network dynamics. In: International work-conference on artificial neural networks, pp 480–491
21. Gallicchio C, Micheli A (2017) Echo state property of deep reservoir computing networks. Cognit Comput 9(3):337–350
22. Jaeger H (2002) Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach, vol 5. GMD-Forschungszentrum Informationstechnik, Bonn
23. Gallicchio C, Micheli A, Pedrelli L (2018) Design of deep echo state networks. Neural Netw 108:33–47
24. Liu K, Zhang J (2020) Nonlinear process modelling using echo state networks optimised by covariance matrix adaption evolutionary strategy. Comput Chem Eng 135:106730
25. Arras L, Montavon G, Müller K-R, Samek W (2017) Explaining recurrent neural network predictions in sentiment analysis. In: Proceedings of the 8th workshop on computational approaches to subjectivity, sentiment and social media analysis, pp 159–168
26. Li J, Chen X, Hovy E, Jurafsky D (2016) Visualizing and understanding neural models in NLP. In: Proceedings of NAACL-HLT, pp 681–691
27. Denil M, Demiraj A, De Freitas N (2014) Extraction of salient sentences from labelled documents. arXiv:1412.6815
28. Li J, Monroe W, Jurafsky D (2016) Understanding neural networks through representation erasure. arXiv:1612.08220
29. Kádár A, Chrupała G, Alishahi A (2017) Representation of linguistic form and function in recurrent neural networks. Comput Linguist 43(4):761–780
30. Murdoch WJ, Liu PJ, Yu B (2018) Beyond word importance: contextual decomposition to extract interactions from LSTMs. arXiv:1801.05453
31. Hassaballah M, Awad AI (2020) Deep learning in computer vision: principles and applications. CRC Press, Boca Raton
32. Rojat T, Puget R, Filliat D, Del Ser J, Gelin R, Díaz-Rodríguez N (2021) Explainable artificial intelligence (XAI) on time series data: a survey. arXiv:2104.00950
33. Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Mining Knowl Discov 15(2):107–144
34. Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery, pp 2–11
35. Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3(3):263–286
36. Zadeh LA (1988) Fuzzy logic. Computer 21(4):83–93
37. Herrera F, Herrera-Viedma E, Martinez L (2000) A fusion approach for managing multi-granularity linguistic term sets in decision making. Fuzzy Sets Syst 114(1):43–58
38. Herrera F, Alonso S, Chiclana F, Herrera-Viedma E (2009) Computing with words in decision making: foundations, trends and prospects. Fuzzy Optim Decis Making 8(4):337–364
39. Mencar C, Alonso JM (2018) Paving the way to explainable artificial intelligence with fuzzy modeling. In: International workshop on fuzzy logic and applications, pp 215–227. Springer
40. Samek W, Montavon G, Vedaldi A, Hansen LK, Müller K-R (2019) Explainable AI: interpreting, explaining and visualizing deep learning, vol 11700. Springer
41. Chang Y-W, Lin C-J (2008) Feature ranking using linear SVM. In: Causation and prediction challenge, pp 53–64. PMLR


42. Lundberg SM, Erion GG, Lee S-I (2018) Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888
43. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) SmoothGrad: removing noise by adding noise. arXiv:1706.03825
44. Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B (2018) Sanity checks for saliency maps. arXiv:1810.03292
45. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: the all convolutional net. arXiv:1412.6806
46. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
47. Montavon G, Lapuschkin S, Binder A, Samek W, Müller K-R (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit 65:211–222
48. Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034
49. Ancona M, Ceolini E, Öztireli C, Gross M (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv:1711.06104
50. Baehrens D, Schroeter T, Harmeling S, Kawanabe M, Hansen K, Müller K-R (2010) How to explain individual classification decisions. J Mach Learn Res 11:1803–1831
51. Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: International conference on machine learning, pp 3145–3153. PMLR
52. Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144
53. Marwan N, Romano MC, Thiel M, Kurths J (2007) Recurrence plots for the analysis of complex systems. Phys Rep 438(5–6):237–329
54. Eckmann J-P, Kamphorst SO, Ruelle D, et al (1995) Recurrence plots of dynamical systems. World Sci Ser Nonlinear Sci Ser A 16:441–446
55. Gallicchio C, Micheli A (2016) Deep reservoir computing: a critical analysis. In: ESANN
56. Schaetti N, Salomon M, Couturier R (2016) Echo state networks-based reservoir computing for MNIST handwritten digits recognition. In: IEEE international conference on computational science and engineering (CSE), pp 484–491. IEEE
57. Woodward A, Ikegami T (2011) A reservoir computing approach to image classification using coupled echo state and back-propagation neural networks. In: International conference on image and vision computing, Auckland, New Zealand, pp 543–458
58. Souahlia A, Belatreche A, Benyettou A, Curran K (2016) An experimental evaluation of echo state network for colour image segmentation. In: 2016 International joint conference on neural networks (IJCNN), pp 1143–1150. IEEE
59. Tong Z, Tanaka G (2018) Reservoir computing with untrained convolutional neural networks for image recognition. In: International conference on pattern recognition (ICPR), pp 1289–1294. IEEE
60. Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-C (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. arXiv:1506.04214
61. Laña I, Del Ser J, Padró A, Vélez M, Casanova-Mateo C (2016) The role of local urban traffic and meteorological conditions in air pollution: a data-based case study in Madrid, Spain. Atmos Environ 145:424–438
62. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition (ICPR 2004), vol 3, pp 32–36. IEEE
63. Blank M, Gorelick L, Shechtman E, Irani M, Basri R (2005) Actions as space-time shapes. In: Tenth IEEE international conference on computer vision (ICCV'05), vol 2, pp 1395–1402. IEEE
64. Weinland D, Ronfard R, Boyer E (2006) Free viewpoint action recognition using motion history volumes. Comput Vis Image Understand 104(2–3):249–257
65. Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos "in the wild". In: 2009 IEEE conference on computer vision and pattern recognition, pp 1996–2003. IEEE
66. Reddy KK, Shah M (2013) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981
67. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos in the wild. arXiv:1212.0402
68. Rodriguez MD, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE conference on computer vision and pattern recognition, pp 1–8. IEEE
69. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: 2011 International conference on computer vision, pp 2556–2563. IEEE
70. LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
71. Han D, Bo L, Sminchisescu C (2009) Selection and context for action recognition. In: 2009 IEEE 12th international conference on computer vision, pp 1933–1940
72. Ghadiyaram D, Tran D, Mahajan D (2019) Large-scale weakly-supervised pre-training for video action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12046–12055
73. Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: International workshop on human behavior understanding, pp 29–39. Springer
74. Shu N, Tang Q, Liu H (2014) A bio-inspired approach modeling spiking neural networks of visual cortex for human action recognition. In: 2014 International joint conference on neural networks (IJCNN), pp 3450–3457. IEEE
75. Liu J, Shah M (2008) Learning human actions via information maximization. In: IEEE conference on computer vision and pattern recognition, pp 1–8. IEEE
76. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv:1511.04119
77. Shi Y, Zeng W, Huang T, Wang Y (2015) Learning deep trajectory descriptor for action recognition in videos using deep neural networks. In: 2015 IEEE international conference on multimedia and expo (ICME), pp 1–6. IEEE
78. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
79. Harandi MT, Sanderson C, Shirazi S, Lovell BC (2013) Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognit Lett 34(15):1906–1915

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
