https://doi.org/10.1007/s00521-021-06359-y
Received: 16 February 2021 / Accepted: 23 July 2021 / Published online: 6 August 2021
The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021
Abstract
Since their inception, learning techniques under the reservoir computing paradigm have shown great modeling capability for recurrent systems without the computing overheads required by other approaches, especially deep neural networks. Among them, different flavors of echo state networks have attracted much attention over time, mainly due to the simplicity and computational efficiency of their learning algorithm. However, these advantages do not compensate for the fact that echo state networks remain black-box models whose decisions cannot be easily explained to the general audience. This issue is even more involved for multi-layered (also referred to as deep) echo state networks, whose more complex hierarchical structure further hinders the explainability of their internals to users without expertise in machine learning or even computer science. This lack of explainability can jeopardize the widespread adoption of these models in certain domains where accountability and understandability of machine learning models is a must (e.g., medical diagnosis, social politics). This work addresses this issue by conducting an explainability study of echo state networks when applied to learning tasks with time series, image and video data. Among these tasks, we emphasize the latter (video classification) which, to the best of our knowledge, has never before been tackled with echo state networks in the related literature. Specifically, the study proposes three different techniques capable of eliciting understandable information about the knowledge grasped by these recurrent models, namely potential memory, temporal patterns and pixel absence effect. Potential memory addresses questions related to the effect of the reservoir size on the capability of the model to store temporal information, whereas temporal patterns unveil the recurrent relationships captured by the model over time. Finally, pixel absence effect evaluates the effect of the absence of a given pixel when the echo state network model is used for image and video classification. The benefits of the proposed suite of techniques are showcased over three different domains of applicability: time series modeling, image and, for the first time in the related literature, video classification. The obtained results reveal that the proposed techniques not only allow for an informed understanding of the way these models work, but also serve as diagnostic tools capable of detecting issues inherited from data (e.g., presence of hidden bias).
1 Introduction
… randomized neural networks, along with random vector functional links, stochastic configuration networks and random features for kernel approximation [3, 4]. As a result, their training overhead for real-life applications becomes much less computationally demanding than that of RNNs. Over the well-known Mackey–Glass chaotic time series prediction benchmark, ESN has been shown to improve the accuracy scores achieved by multi-layer perceptrons (MLPs), support vector machines (SVMs), backpropagation-based RNNs and other learning approaches by a factor of over 2000 [5]. These proven benefits have positioned ESN as a top contender among dynamical models for performance and computational efficiency reasons when compared to other modeling counterparts.

Unfortunately, choosing the right parameters to initialize this reservoir falls more on the side of luck and past experience of the scientist [6] than on that of sound reasoning. As stated in [7] and often referred to thereafter, the current approach for assessing whether a reservoir is suited for a particular task is to observe whether it yields accurate results, either by handcrafting the values of the reservoir parameters or by automating their configuration via an external optimizer. All in all, this poses tough questions to address when developing an ESN for a certain application, since knowing whether the created structure is optimal for the problem at hand is not possible without actually training it. Furthermore, despite recent attempts made in this direction [8, 9], there is no clear consensus on how to guide the search for good reservoir-based models.

Concerns in this matter go a step beyond the ones exposed above about the configuration of these models. Model design and development should orbit around a deep understanding of the multiple factors hidden below the surface of the model. For this purpose, a manifold of techniques have been proposed under the Explainable Artificial Intelligence (xAI) paradigm for easing the understanding of decisions issued by existing AI-based models. The information delivered by xAI techniques allows improving the design/configuration of AI models, extracting augmented knowledge about their outputs, accelerating debugging processes, or achieving a better outreach and adoption of this technology by non-specialized audiences [10]. Although the activity in this research area has been vibrant for explaining many black-box machine learning models, there is no prior work on the development of techniques of this sort for dynamical approaches. The need for providing explanatory information about the knowledge learned by ESNs remains unaddressed, even though recent advances in the construction of multi-layered reservoirs (Deep ESN [11]) make these models even more opaque than their single-layered counterparts.

Given the above context, this manuscript takes a step ahead by presenting a novel suite of xAI techniques suited to issue explanations of already trained Deep ESN models. The proposed techniques elicit visual information that permits assessing the memory properties of these models, visualizing their detected patterns over time, and analyzing the importance of an individual input in the model's output. Mathematical definitions of the xAI factors involved in the tools of the proposed framework are given, complemented by examples that help illustrate their applicability and comprehensibility to the general audience. This introduction of the overall framework is complemented by several experiments showcasing the use of our xAI framework in four different real applications: (1) battery cell consumption prediction, (2) road traffic flow forecasting, (3) image classification and (4) video classification. The results are conclusive: the outputs of the proposed xAI techniques confirm, in a human-readable fashion, that the Deep ESN models capture temporal dependencies existing in data that could be expected due to prior knowledge, and that this captured information can be summarized to deepen the understanding of a general practitioner/user consuming the model's output. The novel ingredients of this work can be summarized as follows:

1. We propose three xAI methods for ESNs:

• Potential memory, which permits quantitatively assessing the estimated memory retained in a trained ESN, and which can be extrapolated to any recursive model.

• Temporal patterns, which examine the correlative patterns captured by the ESN and visualize them to aid the training and explanation process. This tool allows checking whether the model captures from data what the practitioner could expect as per his/her expert knowledge or the prior intuition suggested by the application at hand.

• Pixel absence effect, which evaluates the effect of each dimension of the input instance on the output of the model. This improves the granularity of the analysis, uncovering which parts of the input are most influential for the prediction.

2. The aforementioned tools are evaluated over data of distinct nature: time series, image and video data. The latter (video) implies transforming images that flow over time into a multidimensional time series.

3. Explanations issued by the proposed techniques are proven not only to inform further future studies dealing with these recurrent randomized neural networks, but also to unveil important modeling issues (e.g., the presence of bias) that are often not easy to detect from sequential data.
The rest of the paper is organized as follows: Sect. 2 provides the reader with the required background on the ESN literature, and sets common mathematical grounds for the rest of the work. Section 3 introduces the framework and analyzes each of the proposed techniques in its internal subsections. Section 4 presents the experiments designed to ensure the viability of this study with real data. Section 5 analyzes and discusses the obtained results. Finally, Sect. 6 ends the paper by drawing conclusions and outlining future research lines rooted in our findings.

2 Background

Before proceeding with the description of the proposed suite of xAI techniques, this section briefly revisits the fundamentals of ESN and Deep ESN models (Sect. 2.1), notable advances in the explainability of recurrent neural networks and models for time series (Sect. 2.2), and techniques used for quantifying the importance of features (Sect. 2.3).

2.1 Fundamentals of echo state networks

In 2001, Wolfgang Maass and Herbert Jaeger independently introduced Liquid State Machines [12] and ESNs [13], respectively. The combination of these studies with research on computational neuroscience and machine learning [14, 15] brought up the field of reservoir computing. Methods belonging to this field consist of a set of sparsely connected, recurrent neurons capable of mapping high-dimensional sequential data to a low-dimensional space, over which a learning model can be trained to capture patterns that relate this low-dimensional space to a target output. This simple yet effective modeling strategy has been harnessed for regression and classification tasks in a diversity of applications, such as road traffic forecasting [16], human recognition [17] or smart grids [18], among others [11].

Besides their competitive modeling performance in terms of accuracy/error, reservoir computing models are characterized by a less computationally demanding training process than other recursive models: in these systems, only the learner mapping the output of the reservoir to the target variable of interest needs to be trained. Neurons composing the reservoir are initialized at random under some stability constraints. This alternative not only alleviates the computational complexity of recurrent neural networks, but also circumvents one of the downsides of gradient backpropagation, namely exploding and vanishing gradients.

To support the subsequent explanation of the proposed xAI techniques, we now define mathematically the internals and procedure followed by ESNs to learn a mapping from a K-dimensional input u(t) (with t denoting the index within the sequence) to an L-dimensional output y(t) via a reservoir of N neurons. Following Fig. 1, the reservoir state is updated as:

\[ \mathbf{x}(t+1) = \alpha\, f\big(\mathbf{W}^{in}\mathbf{u}(t+1) + \mathbf{W}\mathbf{x}(t) + \mathbf{W}^{fb}\mathbf{y}(t)\big) + (1-\alpha)\,\mathbf{x}(t), \tag{1} \]

where x(t) denotes the N-sized state vector of the reservoir, f(·) denotes an activation function, W ∈ R^{N×N} is the matrix of internal weights, W^in ∈ R^{N×K} holds the input connection weights, and W^fb ∈ R^{N×L} is a feedback connection matrix. Parameter α ∈ (0, 1] denotes the leaking rate, which allows setting different learning dynamics in the above recurrence [19]. The output of the ESN at index t can be computed once the state of the reservoir has been updated, yielding:

\[ \hat{\mathbf{y}}(t) = g\big(\mathbf{W}^{out}[\mathbf{x}(t); \mathbf{u}(t)]\big) = g\big(\mathbf{W}^{out}\mathbf{z}(t)\big), \tag{2} \]

where [a; b] is a concatenation operator between vectors a and b, W^out ∈ R^{L×(K+N)} is a matrix containing the output weights, and g(·) denotes an activation function. Weights belonging to the aforementioned matrices can be adjusted as per a training dataset with examples {(u(t), y(t))}. However, as opposed to other recurrent neural networks, not all weights inside the matrices W^in, W, W^fb and W^out are adjusted. Instead, the weight values of the input, hidden state and feedback matrices are drawn at random upon initialization, whereas those of the output matrix W^out are the only ones tuned during the training phase, using the Moore–Penrose pseudo-inverse.

Fig. 1 Schematic diagram showing a canonical ESN (upper plot), and the multi-layered stacking architecture of a Deep ESN model (below)
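To make Expressions (1) and (2) concrete, the following minimal sketch shows one possible NumPy implementation. It is our illustration only, not the authors' released code, and it makes several simplifying assumptions: no feedback connections (W^fb = 0), an identity output activation g, and a ridge-regularized least-squares readout in place of the plain Moore–Penrose pseudo-inverse.

Listing 1: Minimal leaky ESN (illustrative Python sketch)

    import numpy as np

    class MinimalESN:
        """Sketch of a leaky ESN following Expressions (1)-(2), with W_fb = 0."""

        def __init__(self, n_inputs, n_reservoir=100, spectral_radius=0.95,
                     alpha=0.99, ridge=1e-6, seed=0):
            rng = np.random.default_rng(seed)
            self.alpha = alpha                  # leaking rate in Expression (1)
            self.ridge = ridge                  # regularization for the readout fit
            self.n_reservoir = n_reservoir
            self.W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
            W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
            # Rescale internal weights so their spectral radius meets the
            # stability constraint mentioned above
            W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
            self.W = W
            self.W_out = None

        def _states(self, U):
            """Run the reservoir over a (T, K) input array, collecting z(t) = [x(t); u(t)]."""
            x = np.zeros(self.n_reservoir)
            states = []
            for u in U:  # Expression (1) with W_fb = 0 and f = tanh
                x = self.alpha * np.tanh(self.W_in @ u + self.W @ x) + (1 - self.alpha) * x
                states.append(np.concatenate([x, u]))
            return np.asarray(states)

        def fit(self, U, Y):
            """Tune only W_out, via ridge-regularized least squares."""
            Z = self._states(U)
            self.W_out = np.linalg.solve(
                Z.T @ Z + self.ridge * np.eye(Z.shape[1]), Z.T @ Y).T
            return self

        def predict(self, U):
            return self._states(U) @ self.W_out.T  # Expression (2) with g = identity

Rescaling the internal weights to a prescribed spectral radius is the usual way of enforcing the stability constraints mentioned above for randomly initialized reservoirs.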
2.2 Explainability of recurrent neural networks and models for time series

… explainability of recurrent neural networks, including gradient-based approaches [26, 27], ablation-based estimations of the relevance of particular sequence variables for the predicted output [28, 29], or the partial linearization of the activations through the network, permitting to isolate the contribution of different time scales to the output [30].

In addition to standard deep learning approaches resorting to gradient backpropagation [31], time series data are also processed by means of other modeling approaches, mostly traditional data mining methods for time series analysis. By moving away from deep neural networks, further possibilities emerge for achieving methods that are interpretable by design, which allow for a better understandability of their inner workings by the audience without requiring external tools for the purpose [32]. Among them, Symbolic Aggregate Approximation (SAX) [33, 34] and fuzzy logic appear to be the best contenders. SAX works by first transforming the time series into its piecewise aggregate approximation [35]. The resulting transformation is then converted into a string of symbols assigned equiprobably among the regions of the discretization. Finally, the obtained representation permits a clear visualization of the recurrent patterns, hand in hand with a stream-like implementation with minimal overheads. Fuzzy logic, fuzzy sets and computing with words [36–39] are methods that improve the interpretability of otherwise hard-to-understand systems, by bringing them closer to how human reasoning operates when inspecting multiple dimensions over time. By transforming continuous values into meaningful levels, fuzzy computing is able to tackle many types of problems without hindering the interpretability of data and of the operations held over them through the model of choice.

Despite the numerous attempts made recently at explaining dynamic systems, most of them are fabricated in an ad-hoc manner, thus leaving aside other flavors of time series and recurrent neural computation whose intrinsically abstract nature also calls for studies on their explainability. Indeed, this noted lack of proposals to elicit explanations for alternative models is in close agreement with the conclusions and prospects made in comprehensive surveys on xAI [10, 40]. This is the purpose of the set of techniques presented in the following section: to provide different interpretable indicators of the properties of a trained Deep ESN model, as well as information that permits visualizing and summarizing the knowledge captured by its reservoirs.

2.3 Importance attribution methods

Due to one of the techniques in the proposed framework (pixel absence effect), importance attribution methods must also be covered in this background. Importance attribution methods are not exclusive to neural networks, or randomized neural networks for that matter. There has been extended use of this concept for tree ensembles and support vector machines [41, 42], constructing a well-developed body of research around these techniques. The assessment of the validity of a machine learning model's output is paramount for a well-structured system validation/improvement workflow. Feature attribution methods intend to give interpretability cues by measuring the importance each feature has in the final prediction outcome. Many techniques have been proposed in the last few years to gauge the importance of features in different learning tasks [43–52], yet none of them has been utilized with the recurrent randomized neural networks that are at the core of this study.

3 Proposed xAI framework

The framework proposed in this work is composed of three xAI techniques, each tackling some of the most common aspects that arise when training an ESN model or understanding its captured knowledge. These techniques cover three main characteristics that help understand the strengths and weaknesses of these models:

1. Potential memory, which is a simple test that closely relates to the amount of sequential memory retained within the network. The potential memory can be quantified by addressing how the model behaves when its input fades abruptly. The intuition behind this concept recalls the notion of stored information, i.e., the time needed by the system to return to its resting state.

2. Temporal patterns, which target the visualization of patterns occurring inside the system by means of recurrence plots [53, 54] (a minimal sketch of this computation is given right after this list). Temporal patterns permit examining the multiple scales at which patterns are captured by the Deep ESN model through its layers, easing the inspection of the knowledge within the data that is retained by the neuron reservoir. This technique also helps determine when the addition of a layer does not contribute to the predictive performance of the overall model.

3. Pixel absence effect, which leverages the concept of local explanations [10] and extends it to ESN and Deep ESN models. This technique computes the prediction difference resulting from the suppression of one of the inputs of the model. The generated matrix of deviations will be contained within the dimensions of the input signal, allowing for the evaluation of the importance of the absence of a single component of the input signal.
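As a preview of the temporal patterns technique, the recurrence plot underlying it can be sketched in a few lines of Python. This is our illustration, assuming a hypothetical states array of shape T × N holding the reservoir states collected while the trained model consumes a sequence:

Listing 2: Recurrence plot over reservoir states (illustrative Python sketch)

    import numpy as np

    def recurrence_plot(states, eps=None):
        """Binary recurrence matrix R[i, j] = 1 iff ||x(i) - x(j)|| < eps,
        following the classical definition of recurrence plots [53, 54].
        Note: the pairwise distance tensor is O(T^2); compute blockwise
        for long sequences."""
        d = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
        if eps is None:
            eps = 0.1 * d.max()  # illustrative threshold: 10% of the largest distance
        return (d < eps).astype(int)

The resulting matrix can be rendered with any image viewer (e.g., matplotlib's imshow) to inspect diagonal lines, periodic structures and other textures discussed later in this paper.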
The following subsections provide the rationale for the above set of xAI techniques, their potential uses, and examples with synthetic data previewing the explanatory information they produce.

3.1 Potential memory

One of the main concerns when training an ESN model is to evaluate the potential memory the network retains within its reservoir(s). Quantifying this information becomes essential to ensure that the model possesses enough modeling complexity to capture the patterns that best correlate with the target to be predicted.

Intuitively, a reservoir is able to hold information to interact with the newest inputs being fed to the network in order to generate the current output. It follows that, upon the removal of these new inputs, the network should keep outputting the information retained in the reservoir, until it is flushed out. The multi-layered structure of the network does not allow relating the flow through the Deep ESN model directly to its memory, since the state in which this memory is conserved would most probably reside in an unknown abstract space. This is the purpose of the proposed potential memory indicator: to elicit a hint of the memory contained in the stacked reservoirs. When informed to the audience, the potential memory can improve the confidence of the user with respect to the application at hand, especially in those cases where a hypothesis on the memory that the model should possess can be drawn from intuition. For instance, in an order-ten nonlinear autoregressive-moving average (NARMA) system, the model should be able to store at least ten past steps to produce a forecast for the next time step.

Before proceeding further, it is important to note that a high potential memory is not sufficient for the model to produce accurate predictions for the task under consideration. However, it is a necessary condition to produce estimations based on inputs further away than the number of sequence steps upper bounded by this indicator.

For the sake of simplicity, let us assume a single-layer ESN model without leaking rate (α = 1) nor feedback connections (i.e., W^fb = 0_{N×L}). These assumptions simplify the recurrence in Expression (1) to:

\[ \mathbf{x}(t+1) = f\big(\mathbf{W}^{in}\mathbf{u}(t+1) + \mathbf{W}\mathbf{x}(t)\big), \tag{8} \]

which can be seen as an expansion of the current input u(t+1) and its history represented by x(t). If we further assume that the network is initially set to ŷ(0) = 0 when u(0) = 0, the system should gradually evolve back to the same state when the input signal u(t) is set to 0 at time t = T₀. The time taken for the model to return to its initial state is the potential memory of the network, representing the information of the history kept by the model after training. Mathematically:

Definition 1 (Potential memory) Given an already trained ESN model with parameters {α^{(l)}, W^{(l),in}, W^{(l)}, W^{fb}, W^{out}} and initial state set to ŷ(0) = y₀ ∈ R^L when u(0) = 0, the potential memory (PM) of the model at evaluation time T₀ is given by:

\[ PM(T_0) = \arg\min_{t > T_0}\big\{\, \|\hat{\mathbf{y}}(t) - \mathbf{y}_0\|_2 < \epsilon \,\big\} - T_0, \tag{9} \]

where ε is the tolerance below which convergence of the measure is declared.

In order to illustrate the output of this first technique, we consider a single ESN model with a varying number of neurons in its reservoir to learn the phase difference of 50 samples between an input and an output sinusoid. Plots nested in Fig. 2 depict the fade dynamics of different ESN models trained with N = 5 (plots in Fig. 2a, b), N = 50 (plots in Fig. 2c, d) and N = 500 neurons (plots in Fig. 2e, f) in their reservoirs. Specifically, the graph on the top represents the actual signal to be predicted, while the bottom graphs display the behavior of the output of each ESN model when the input signal is zeroed at time T₀ = 10100, along with the empirical distribution of the potential memory computed for T₀ ∈ (10000, 10100]. When the reservoir is composed of just N = 5 neurons, the potential memory of the network is low, given the quick transition of the output signal to its initial state after the input is set to 0. When increasing the size of the reservoir to 50 neurons, a rather different behavior is noted in the output of the ESN model when flushing its contained knowledge, not reaching its steady state until 20 time steps for the example depicted in Fig. 2c. Finally, an ESN comprising 500 neural units in its reservoir does not come with a significant increase of its potential memory; however, the model is able to compute the target output in a more precise fashion with respect to the previous case. This simple experiment shows that the potential memory is a useful metric to evaluate the past range of the sequence modeled and retained at its input. As later exposed in Sect. 5, this technique is of great help when fitting a model, since only when the potential memory is large enough does the model succeed on the testing.
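Definition 1 translates almost verbatim into code. The sketch below is our illustration, not the authors' implementation: y_hat is assumed to hold the model outputs recorded after the input is zeroed at T0, and y0 the resting output.

Listing 3: Potential memory as per Expression (9) (illustrative Python sketch)

    import numpy as np

    def potential_memory(y_hat, y0, T0, eps=1e-3):
        """Number of steps after T0 (input zeroed) until the output returns
        to within eps of its resting value y0, following Definition 1."""
        for t in range(T0 + 1, len(y_hat)):
            if np.linalg.norm(y_hat[t] - y0) < eps:
                return t - T0          # first re-entry into the resting state
        return len(y_hat) - T0         # did not converge within the evaluated horizon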
3.2 Temporal patterns
3.3 Pixel absence effect

Finally, we delve into the third technique comprising the proposed xAI suite for ESN-based models. The explanation departs from the thought that the analysis of a predictive model always brings up the question of whether an input is causing any effect on the model's prediction. Naturally, the goal is to find out whether the inputs deemed to be the most important ones for the task at hand are actually the ones on which the model focuses most when eliciting its predictions. However, discovering whether this actually holds in the trained model is not an easy task.

Intuitively, if a step back is taken from the architecture to simply stare at the difference caused by the cancellation of a certain input, it should be possible to observe the individual influence of such an input on the issued prediction. Similarly, inputs could be canceled in groups to evaluate the importance of each neighboring region of the input space. Given that the result of this technique can be laid directly over the input of the model, the results can be readily read and understood by any type of user, regardless of his/her technical background. This capability makes this xAI approach promising from an explainability standpoint. Indeed, the term pixel in the name of the technique comes purposely from the noted suitability of the technique to be depicted along with the input data, applying naturally when dealing with image or video data. However, this importance attribution technique can be applied to other input data flavors (e.g., time series) without requiring any modification of its steps.

Before delving further into this technique, it is of paramount importance to understand that the transformation of an image into a time series is needed to harness the modeling potential of ESN models. For this, the transformation presented in [56] is adopted: in accordance with Fig. 5, a gray-scale (or RGB) image with dimensions W × H pixels can be represented as a time series u(t) (correspondingly, a vector-valued series u(t) for RGB) of length W · H by taking its rows/columns and concatenating them serially. For this work, images were transformed on a row basis, which means that for W rows, the pixels in each given row are left as a series of pixels, stitching the last pixel of the given row to the first pixel of the next, until every row has been concatenated.

Fig. 5 (schematic) Row-wise transformation of an image into a time series u(t): the pixels of each RGB channel are serially concatenated over time

Implicitly, this transformation is not unique to image data, since what is being processed is actually a time series. However, this technique is adopted over image data since it allows reverting all the computed importance levels back to the image, without having to quantify the relevance of this attribution over time.

In accordance with the previous notation, this technique is now mathematically described starting from the input-output mapping of a single-layered ESN model in Expression (2), which is included below for convenience:

\[ \hat{\mathbf{y}}(t) = g\big(\mathbf{W}^{out}[\mathbf{x}(t);\mathbf{u}(t)]\big) = g\big(\mathbf{W}^{out}\mathbf{z}(t)\big). \tag{11} \]

The cancellation of a certain point T of the input sequence u(t) would affect not only the sequence itself, but also future predictions ŷ(t) ∀t > T, due to the propagation of the altered value throughout the reservoir. Namely:

\[ \hat{\mathbf{y}}^*(t) = g\big(\mathbf{W}^{out}[\mathbf{x}^*(t);\mathbf{u}^*(t)]\big), \tag{12} \]

where u*(t) denotes the input signal with u(T) = 0_{K×1}, and ŷ*(t) results from applying recurrence (1) to this altered signal. The intensity of the effect of the suppression on the output signal at time T can be quantified as:

\[ \mathbf{e}(t;T) = [e_l(t;T)]_{l=1}^{L} = \hat{\mathbf{y}}(t) - \hat{\mathbf{y}}^*(t), \quad \forall t \geq T, \tag{13} \]

where e(t; T) ∈ R^{L×1}, and its sign indicates the direction in which the output is pushed as per the modification of the input. Clearly, when dealing with classification tasks (i.e., y(t) ∈ N^{L×1}), a similar rationale can be followed by conceiving the output of the ESN-based model as the class probabilities elicited for the input sequence. Therefore, the intensity computed as per (13) denotes whether the modification of the input sequence increases (e_l(t;T) > 0) or decreases (e_l(t;T) < 0) the probability associated to class l ∈ {1, ..., L}.

At this point it is important to remark that, given that this third xAI tool derives from the perturbation of individual pixels, possible correlations between pixels over the image are neglected, thereby quantifying importances that do not consider the plausibility of such perturbations. To solve this matter, the pixel absence analysis can be extended by bootstrapping: the pixels to be altered are sampled by bootstrapping (in a sample of size Np established beforehand, e.g., ten at a time), for a total of Nb bootstraps. The resulting probability changes are accumulated for each given pixel, giving rise to an enhanced quantification of the overall importance of every pixel that considers the effect of pixel-to-pixel interactions on the output of the ESN model. The description of this technique is summarized in Pseudocode 1.
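As a preliminary to Pseudocode 1, the row-wise image-to-time-series transformation described above can be sketched as follows (our illustration, assuming NumPy images in H × W or H × W × C layout):

Listing 4: Row-wise image flattening (illustrative Python sketch)

    import numpy as np

    def image_to_series(img):
        """Row-wise flattening of an image into a time series, as in [56]:
        each row is appended after the previous one, so a W x H (x C) image
        becomes a sequence of W*H scalar (or C-dimensional) samples."""
        if img.ndim == 2:                        # gray-scale: u(t) is scalar
            return img.reshape(-1)               # NumPy's C order concatenates rows
        return img.reshape(-1, img.shape[2])     # RGB: one C-dimensional sample per pixel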
Pseudocode 1: Bootstrap pixel absence effect
  Data: Input series u(t) (e.g., coming from an image-to-time-series transformation), output ŷ(t) of the model at time t, time T at which the xAI tool is queried.
  Result: Vector e(t; T) of accumulated relative output probability differences.
  1 for nb = 1, ..., Nb (number of bootstraps) do
  2   Sample uniformly at random Np positions from u(t) (sampled positions)
  3   Cancel the sampled positions in u(t) by interpolating their values as per their closest neighboring points in the time series
  4   Compute the ESN output ŷ*(t) for the perturbed input
  5   Measure the difference between the old and new (perturbed) outputs as per Expression (13)
  6   Update vector e(t; T) by accumulating the differences at the sampled positions

… that, once identified from the information displayed in the recurrence plots, are easy to detect from data.

Finally, pixel absence gives further insights about the model's behavior, and allows a data scientist to tune the parameters of the model in a more effective manner. Most interestingly, the output of this tool helps discover biases hidden in data and propagated to the model's output through the architecture. When the computed feature importances are shown to a general user, they support the fast interpretation of the most relevant parts of the input sequences with respect to the prediction, and the verification that such influential parts conform to prior knowledge of the user.
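A possible Python rendering of Pseudocode 1 is sketched below. It is not the authors' implementation: it assumes a hypothetical predict_proba callable returning the class probabilities elicited by the trained ESN after consuming a one-dimensional input series, and it accumulates absolute probability deviations as one way of aggregating Expression (13) over classes.

Listing 5: Bootstrap pixel absence effect (illustrative Python sketch)

    import numpy as np

    def bootstrap_pixel_absence(u, predict_proba, n_boot=200, n_p=10, seed=0):
        """Accumulated probability deviation per input position (Pseudocode 1)."""
        rng = np.random.default_rng(seed)
        base = predict_proba(u)                     # reference prediction
        effect = np.zeros(len(u))
        for _ in range(n_boot):                     # Nb bootstraps
            pos = rng.choice(len(u), size=n_p, replace=False)  # Np sampled positions
            u_star = u.copy()
            # Cancel sampled positions by interpolating from neighboring points
            keep = np.setdiff1d(np.arange(len(u)), pos)
            u_star[pos] = np.interp(pos, keep, u[keep])
            e = base - predict_proba(u_star)        # Expression (13), per class
            effect[pos] += np.sum(np.abs(e))        # accumulate at sampled positions
        return effect

For image inputs, the returned vector can be reshaped back to the original W × H frame to obtain the importance maps discussed in Sect. 5.2.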
4 Experimental setup
The following subsections describe the setup and datasets used for the tasks described above.

4.1 Time series forecasting

As mentioned before, time series analysis stands at the core of the ESN architecture's main objective: modeling sequential data. Therefore, three experiments with different time series data sources are devised to gain understanding upon their being modeled by means of ESNs. Specifically, such experiments consider (1) a synthetic NARMA system of order 10 fed with a variable noisy sinusoidal signal; (2) a dataset composed of real traffic flow data collected by different sensors deployed over the city of Madrid, Spain [61]; and (3) a dataset of lithium-ion battery sensor readings. A regression task can be formulated over the above datasets: to predict the next value of the time series data, given the history of past values.

To begin with, we need a controllable dynamical system producing data to be modeled via a reservoir-based approach. Opting for this initial case eases the process of verifying whether the knowledge of the trained reservoir visualized with the xAI tools conforms to what could be expected from the dynamical properties underneath. For this purpose, a first system is chosen featuring a recurrent behavior, in order to see its implications on the potential memory and temporal patterns. A NARMA system governed by the following expression meets this recurrent nature sought for the time series:

\[ y(t+1) = \tanh\Big(0.3\,y(t) + 0.05\,y(t)\sum_{i=0}^{9} y(t-i) + 1.5\,u(t-9)\,u(t) + 0.1\Big), \tag{14} \]

fed with a one-dimensional time series u(t) given by:

\[ u(t) = \sin(2\pi F t) + 0.02\,n(t), \tag{15} \]

with F = 4 Hz and n(t) ~ N(0, 1), i.e., a Gaussian distribution with zero mean and unit variance. The recurrence as per Expression (14) imposes that any model used for modeling the relationship between u(t) and y(t+1) should focus on at least ten prior time steps when producing its outcome, and should reflect the behavior of the signal amplitude changes over time. A segment of the time series resulting from the NARMA system is shown in Fig. 6 (top plot).

The second and third datasets, however, deal with the task of forecasting the next value of time series data generated by a real-life recurrent phenomenon. On one hand, a battery dataset is built with recorded electric current and temperature measurements of a lithium-ion battery cycled to depletion following different discharging profiles. This type of data allows for an inspection of a multivariate ESN that presents a very stable working regime, punctuated by very abrupt changes (as the middle plot in Fig. 6 clearly shows). On the other hand, traffic flow data collected by a road sensor in Madrid is used. This data source is interesting since it contains long-term recurrence dynamics (bottom plot in Fig. 6), suited to be modeled via ESN-based approaches. These datasets have been made available in a public GitHub repository, along with all scripts needed for replicating the experiments (https://github.com/alejandrobarredo/xAI4ESN).

Methodologically, the experiments with the time series datasets discussed in Sect. 5.1 are structured in the following way. First, a potential memory analysis is carried out for the three datasets, elaborating on how the results relate to the characteristics of the data being modeled. Then, a follow-up analysis is performed around the temporal patterns emerging from the reservoir dynamics, also crosschecking whether such patterns conform to intuition. All figures summarizing the outputs of these xAI techniques are depicted together on the same page for better readability of the results and easier verification of the claims made thereon.

4.2 Image classification

As mentioned before, ESN models have been explored for image classification in the past, attaining competitive accuracy levels with a significantly reduced training complexity [57–59]. To validate whether an ESN-based model is capturing features from its input image that intuitively relate to the target class to be predicted, the second experiment focuses on the application of the developed pixel absence technique to an ESN image classifier trained to solve the well-known MNIST digit classification task [70]. This dataset comprises 60,000 train images and 10,000 test images of 28 × 28 pixels, each belonging to one among 10 different categories.

Since these recurrent models are rather used for modeling sequential data, there is a prior need for encoding image data to sequences, so that the resulting sequence preserves most of the spatial correlation that could be exploited for classification. This being said, a column encoding strategy is selected, so that the input sequence is built by the columnwise concatenation of the pixels of every image. This process yields, for every image, a one-dimensional sequence of 28 × 28 = 784 points, which is input to a Deep ESN model with N_L = 4 reservoirs of N = 100 neurons each. Once trained, the Deep ESN model achieves a test accuracy of 86%, which is certainly lower than CNN models used for the same task, but comes along with a significant decrease in training complexity. Training and testing over this dataset was performed in less …
Fig. 6 Segments of the different time series datasets used in the experiments: (top) input u(t) (black) and output y(t) (red) of a NARMA model of order 10; (middle) electrical current (black) and temperature (red) from a battery; and (bottom) road traffic flow (vehicles per hour) collected over time (color figure online)
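For reference, the NARMA-10 data of Expressions (14)–(15), shown in the top plot of Fig. 6, can be generated with a short script. This is a sketch under assumptions: the paper only specifies F = 4 Hz, so the sampling step dt below is ours.

Listing 6: NARMA-10 data generator (illustrative Python sketch)

    import numpy as np

    def narma10(n_steps, F=4.0, dt=1e-3, seed=0):
        """Synthetic system of Expressions (14)-(15)."""
        rng = np.random.default_rng(seed)
        t = np.arange(n_steps) * dt
        u = np.sin(2 * np.pi * F * t) + 0.02 * rng.standard_normal(n_steps)
        y = np.zeros(n_steps)
        for k in range(10, n_steps - 1):
            y[k + 1] = np.tanh(0.3 * y[k]
                               + 0.05 * y[k] * np.sum(y[k - 9:k + 1])  # last 10 outputs
                               + 1.5 * u[k - 9] * u[k]
                               + 0.1)
        return u, y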
Table 1 List of action recognition video datasets considered in the experiments, along with their characteristics

Dataset | # classes | # videos | Frame size | Description
KTH [62] | 6 | 600 | 160×120 | Videos capturing 6 human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in 4 different scenarios
WEIZMANN [63] | 10 | 90 | 180×144 | Videos of 9 different people, each performing 10 natural actions (e.g., run, walk, skip, jumping-jack) divided in examples of 8 frames
IXMAS [64] | 13 | 1650 | 320×240 | Lab-generated multi-orientation videos of 5 calibrated and synchronized cameras recording common human actions (e.g., checking-watch, crossing-arms, scratching-head)
UCF11 [65] | 11 | 3040 | 320×240 | Action recognition dataset of realistic action videos, collected from YouTube, having a variable number of action categories
UCF50 [66] | 50 | 6676 | — | (shared description with UCF11)
UCF101 [67] | 101 | 13320 | — | (shared description with UCF11)
UCFSPORTS [68] | 16 | 800 | 320×240 | Sport videos, collected from YouTube, having 16 sport categories
HMDB51 [69] | 51 | 6766 | 320×240 | Videos collected from different sources (mainly YouTube and movies) divided into 51 actions that can be grouped in five types: general facial actions, facial actions with object, general body movement, body with object, human body interactions
SquaresVsCrosses (new in this work) | 2 | 2000 | 28×28 | Synthetic video dataset containing two classes (squares and crosses) with clear spatial separability. The squares class contains videos of a 2×2 pixel blob moving randomly along a square trajectory of equal horizontal and vertical length, centered in the middle of the frame. The crosses class contains an equally sized 2×2 blob moving randomly in a cross motion through the center of the frame
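The generation scripts for SquaresVsCrosses should be available in the repository referenced in Sect. 4.1; purely as an illustration, a generator consistent with its description in Table 1 could look as follows (the number of frames and the trajectory half-width are our assumptions):

Listing 7: SquaresVsCrosses 'squares'-class generator (illustrative Python sketch)

    import numpy as np

    def make_square_video(n_frames=32, size=28, half=8, seed=0):
        """One video of the 'squares' class: a 2x2 blob traversing a square
        trajectory centered in the frame, starting at a random position."""
        c = size // 2
        # Perimeter of the square path: top, right, bottom and left edges
        path = ([(c - half, x) for x in range(c - half, c + half)] +
                [(y, c + half) for y in range(c - half, c + half)] +
                [(c + half, x) for x in range(c + half, c - half, -1)] +
                [(y, c - half) for y in range(c + half, c - half, -1)])
        start = np.random.default_rng(seed).integers(len(path))
        video = np.zeros((n_frames, size, size), dtype=np.float32)
        for f in range(n_frames):
            y, x = path[(start + f) % len(path)]
            video[f, y:y + 2, x:x + 2] = 1.0    # draw the 2x2 blob
        return video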
… given of the appropriateness of its finally selected value. This experiment attempts at bringing light into this matter. For this purpose, different reservoir sizes were explored when fitting an ESN to the three datasets to be modeled. For every dataset, the reservoir size varies while keeping the rest of the parameters fixed. The rest of the parameters are chosen through experimentation. A spectral radius ρ_max equal to 0.95 is established, since all systems have long-term interactions between input and target, as well as a leaking rate of 0.99 to ensure a fast plasticity of the reservoirs.

As shown in the images in Fig. 8a, b, c, the size of the reservoir can be considered a major issue when fitting the model, until a certain threshold is surpassed. The potential memory technique allows monitoring this circumstance by estimating the model's memory capacity. After surpassing a memory threshold, the potential memory of the reservoirs becomes predictable and does not change any longer. This phenomenon seems to be related to the improvement in error rate, although this second aspect should be considered with caution, since the process of fitting an ESN-based model involves several other parameters that interact with each other. In any case, the technique presented in this first study provides some hints that might help the user understand when a workable reservoir size is selected.

5.1.2 Experiment 2: temporal patterns

When designing a deep ESN model, the next question arising along the process deals with understanding whether the model has captured all the dynamics inside the dataset so as to successfully undertake the modeling task. This second experiment analyzes the extent to which a Deep ESN is able to capture the underlying patterns of a system by means of the proposed temporal patterns. In this experiment, different parameters are varied to show the usefulness of this technique. This puts in perspective two interesting aspects: first, the technique allows for a quick inspection of whether the model is capturing the patterns of the system. As per the results exposed in what follows, the clarity with which such patterns show up in the plot seems to be related to how well the model is capturing them.

Figure 9 (panels a/d/g, b/e/h and c/f/i) summarizes the results for different reservoir configurations over the three time series datasets under consideration. Figures are ordered in three different sets {a, b, c}, {d, e, f} and {g, h, i}. The first set corresponds to ESN models with equal parameters and different layer configurations (1 and 2 layers). The second set analyzes system configurations with two different ρ_max values. The third set regards differences in the value of the α parameter. A simple visual inspection of these plots uncovers that d.2, e.2 and f.2 present clear patterns (in comparison to the other configurations), yielding the smallest errors for each dataset.

From the experimentation, two main findings can be underscored. First, stacking several reservoirs may jeopardize the tuning process for the other parameters of the model. Figure 9a, b, c shows that adding a new layer without changing any other parameter has decreased the error and even sharpened the patterns captured by the reservoirs. This effect occurs for every dataset, although it is hard to see the differences in the patterns due to the complexity of the models being explained. Figure 9d, e, f shows the difference between a high and a low ρ_max, demonstrating the relevance of this parameter to attain a further level of detail in the patterns captured by the reservoirs. Although the plots in Fig. 9e, f are harder to analyze, both expose more detailed patterns, along with the decrease in error for the highest ρ_max configuration. Finally, Fig. 9g, h, i examines the differences found when tweaking the α parameter. All three cases present blurred patterns, along with newly emerging artifacts. Since the α parameter is linked to the decay, tuning its value may induce new patterns captured in the reservoirs that may, or may not, be informative for the audience or even for the overall modeling performance of the network. This roughly depends on the dynamics of the system and the task under consideration.

Apart from the usability of showing interpretable information to the user about the capability of the model to capture the patterns within data, these plots enable another analysis. If we focus on the battery dataset, it is quite clear that the signal is repetitive and stationary, and that the system's recurrent patterns are also visible in the recurrence plots. In the case of the traffic data, this phenomenon is much more important. By just staring at the traffic signal it is not easy to realize that the signal also contains certain stationary events (weekends) that produce different behaviors in traffic. By simply staring at the recurrence plots, it is straightforward to realize the periodic structure of the signal, and how these events happen at equally distant time steps throughout the signal (as happens with weekends), with an especially large effect at the top right of the recurrence plot corresponding to a long weekend.

The experiments discussed above should not be limiting in what refers to the different features that can be extracted from recurrence plots. Within these plots, large- and small-scale patterns can be distinguished. Following [54], large-scale patterns are considered topological features, whereas small-scale ones are regarded as textures. Manifold topological features can be ascertained from recurrence plots, such as homogeneity (stationarity), the presence of periodic and quasi-periodic structures that represent oscillating systems, drifts (non-stationarities evolving at a slow speed over time) and abrupt changes that can be of interest to see whether the model is over-fit to rare events in the modeled sequences.
Fig. 8 Potential memory analysis of the ESN model trained over the (a) NARMA, (b) battery and (c) traffic datasets, for different reservoir sizes N ∈ {1, 10, 50, 500, 1000} and constant reservoir parameters
Fig. 9 Temporal patterns analysis over the (a, d, g) NARMA, (b, e, h) battery, and (c, f, i) traffic datasets, presenting different configuration parameters. It can be observed that the spectral radius impacts directly on the detail of the recurrence plots, to the extent of causing a significant performance degradation if not selected properly
Regarding textures, from the recurrence plots one can infer rare and isolated correlated states that can be associated with noise (single dots), phase spaces that are visited at different times (recurrent states evinced by diagonal lines) and vertical/horizontal lines, which represent a steady state throughout a given time frame. When occurring in recurrence plots, a proper interpretation of these events usually requires intuition and domain knowledge, as we have exemplified in our prior experiments with battery consumption and traffic data.

5.2 Image classification

As introduced before, this second experiment analyzes a Deep ESN model for image classification over the MNIST dataset. Once trained, this model is used to examine the output of the third technique comprising the proposed suite of xAI techniques: pixel absence effect.

Figure 10 shows the results of computing the pixel absence effect for single pixels, wherein the color represents the importance of canceling each pixel. Specifically, the scale on the upper left corner shows the amount of change the cancellation of a certain pixel has caused on the probability of every class. The blue color indicates the pixels that are important to predict the image as the desired class: the darker the blue color of a certain pixel, the higher the decrease in probability for the desired class. The red color means exactly the opposite. Therefore, the output of this technique must be conceived as an adversarially driven local explanation technique that pursues to explain which regions of a certain input, upon their cancellation, push the output class probability distribution toward or against a certain desired class.

We exemplify the output of the experiment for a given example of digit 8. The reason for this choice is that this digit shares several visual features with other digits. As such, it was expected that a great effect would be discerned at the center of the digit itself, favoring the classification for digit 0, since the elimination of the center pixels in digit 8 should directly render the image as a 0. However, it is important to note that this does not happen for ESN architectures that are sensitive to space transformations. This sensitivity impedes any ability of the model to extract spatially invariant features from the digit image, thereby circumscribing its knowledge within the sequential column-wise patterns found in examples of its training set. This conclusion emphasizes the relevance of finding transformation strategies from the original domain of data to sequences, not only for ensuring a good performance of the learned ESN-based model, but also to be able to elicit explanations that conform to general intuition.

Fig. 10 Pixel absence effect over a Deep ESN model trained to classify MNIST digits, for single pixels

5.3 Video classification

As stated in the design of the experimental setup, the discussion ends with a video classification task approached via ESN-based models. Since, to our knowledge, this is the first time in the literature that video classification is tackled with ESN architectures, a benchmark is first run to compare its predictive performance to a Conv2DLSTM network, in both cases trained only over the unprocessed video data provided in the datasets of Table 1. As argued before, it would not be fair to include preprocessing elements that could bias the extracted conclusions about the modeling power of both approaches under comparison.

Table 3 summarizes the results of the benchmark, both in terms of test predictive accuracy and the size of the trained models (in terms of trainable parameters). It can be observed that Deep ESN models consistently outperform their Conv2DLSTM counterparts in all the considered datasets, using significantly fewer trainable parameters that translate to dramatically reduced training latencies. Clearly, these results are far from state-of-the-art levels of performance attained by deep neural networks benefiting from pretrained weights (from, e.g., ImageNet). Nevertheless, they serve as an unprecedented preview of the high potential of Deep ESN models to undertake complex tasks with sequence data beyond those for which they have been utilized so far (mostly, time series and text modeling).

The developed pixel absence effect test was run over the trained Deep ESN model for video classification. The core of the discussion is set on the results for a single video (local explanation) belonging to the UCF50 dataset, which …
Table 3 Accuracy and number of trainable parameters of the models tested over each video classification dataset

Dataset | Deep ESN | Conv2DLSTM | Difference | State of the art | Technique
42. Lundberg SM, Erion GG, Lee S-I (2018) Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888
43. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) Smoothgrad: removing noise by adding noise. arXiv:1706.03825
44. Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B (2018) Sanity checks for saliency maps. arXiv:1810.03292
45. Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: the all convolutional net. arXiv:1412.6806
46. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
47. Montavon G, Lapuschkin S, Binder A, Samek W, Müller K-R (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit 65:211–222
48. Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034
49. Ancona M, Ceolini E, Öztireli C, Gross M (2017) Towards better understanding of gradient-based attribution methods for deep neural networks. arXiv:1711.06104
50. Baehrens D, Schroeter T, Harmeling S, Kawanabe M, Hansen K, Müller K-R (2010) How to explain individual classification decisions. J Mach Learn Res 11:1803–1831
51. Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: International conference on machine learning, pp 3145–3153. PMLR
52. Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1135–1144
53. Marwan N, Romano MC, Thiel M, Kurths J (2007) Recurrence plots for the analysis of complex systems. Phys Rep 438(5–6):237–329
54. Eckmann J-P, Kamphorst SO, Ruelle D et al (1995) Recurrence plots of dynamical systems. World Sci Ser Nonlinear Sci Ser A 16:441–446
55. Gallicchio C, Micheli A (2016) Deep reservoir computing: a critical analysis. In: ESANN
56. Schaetti N, Salomon M, Couturier R (2016) Echo state networks-based reservoir computing for MNIST handwritten digits recognition. In: IEEE international conference on computational science and engineering (CSE), pp 484–491. IEEE
57. Woodward A, Ikegami T (2011) A reservoir computing approach to image classification using coupled echo state and back-propagation neural networks. In: International conference on image and vision computing, Auckland, New Zealand, pp 543–458
58. Souahlia A, Belatreche A, Benyettou A, Curran K (2016) An experimental evaluation of echo state network for colour image segmentation. In: 2016 International joint conference on neural networks (IJCNN), pp 1143–1150. IEEE
59. Tong Z, Tanaka G (2018) Reservoir computing with untrained convolutional neural networks for image recognition. In: International conference on pattern recognition (ICPR), pp 1289–1294. IEEE
60. Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-C (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. arXiv:1506.04214
61. Laña I, Del Ser J, Padró A, Vélez M, Casanova-Mateo C (2016) The role of local urban traffic and meteorological conditions in air pollution: a data-based case study in Madrid, Spain. Atmos Environ 145:424–438
62. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition (ICPR 2004), volume 3, pp 32–36. IEEE
63. Blank M, Gorelick L, Shechtman E, Irani M, Basri R (2005) Actions as space-time shapes. In: Tenth IEEE international conference on computer vision (ICCV'05), volume 2, pp 1395–1402. IEEE
64. Weinland D, Ronfard R, Boyer E (2006) Free viewpoint action recognition using motion history volumes. Comput Vis Image Understand 104(2–3):249–257
65. Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos "in the wild". In: 2009 IEEE conference on computer vision and pattern recognition, pp 1996–2003. IEEE
66. Reddy KK, Shah M (2013) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981
67. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos in the wild. arXiv:1212.0402
68. Rodriguez MD, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE conference on computer vision and pattern recognition, pp 1–8. IEEE
69. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: 2011 International conference on computer vision, pp 2556–2563. IEEE
70. LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
71. Han D, Bo L, Sminchisescu C (2009) Selection and context for action recognition. In: 2009 IEEE 12th international conference on computer vision, pp 1933–1940
72. Ghadiyaram D, Tran D, Mahajan D (2019) Large-scale weakly-supervised pre-training for video action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12046–12055
73. Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: International workshop on human behavior understanding, pp 29–39. Springer
74. Shu N, Tang Q, Liu H (2014) A bio-inspired approach modeling spiking neural networks of visual cortex for human action recognition. In: 2014 International joint conference on neural networks (IJCNN), pp 3450–3457. IEEE
75. Liu J, Shah M (2008) Learning human actions via information maximization. In: IEEE conference on computer vision and pattern recognition, pp 1–8. IEEE
76. Sharma S, Kiros R, Salakhutdinov R (2015) Action recognition using visual attention. arXiv:1511.04119
77. Shi Y, Zeng W, Huang T, Wang Y (2015) Learning deep trajectory descriptor for action recognition in videos using deep neural networks. In: 2015 IEEE international conference on multimedia and expo (ICME), pp 1–6. IEEE
78. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314
79. Harandi MT, Sanderson C, Shirazi S, Lovell BC (2013) Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognit Lett 34(15):1906–1915

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.