Modeling and Optimization in Science and Technologies

Luca Oneto

Model
Selection
and Error
Estimation in a
Nutshell
Modeling and Optimization in Science
and Technologies

Volume 15

Series Editors
Srikanta Patnaik, SOA University, Bhubaneswar, India
Ishwar K. Sethi, Oakland University, Rochester, USA
Xiaolong Li, Indiana State University, Terre Haute, USA

Editorial Board
Li Chen, The Hong Kong Polytechnic University, Hong Kong
Jeng-Haur Horng, National Formosa University, Yulin, Taiwan
Pedro U. Lima, Institute for Systems and Robotics, Lisbon, Portugal
Mun-Kew Leong, Institute of Systems Science, National University of Singapore,
Singapore
Muhammad Nur, Diponegoro University, Semarang, Indonesia
Luca Oneto, University of Genoa, Italy
Kay Chen Tan, National University of Singapore, Singapore
Sarma Yadavalli, University of Pretoria, South Africa
Yeon-Mo Yang, Kumoh National Institute of Technology, Gumi, Korea (Republic of)
Liangchi Zhang, The University of New South Wales, Australia
Baojiang Zhong, Soochow University, Suzhou, China
Ahmed Zobaa, Brunel University London, Uxbridge, Middlesex, UK
The book series Modeling and Optimization in Science and Technologies (MOST)
publishes basic principles as well as novel theories and methods in the fast-evolving
field of modeling and optimization. Topics of interest include, but are not limited to:
methods for analysis, design and control of complex systems, networks and
machines; methods for analysis, visualization and management of large data sets;
use of supercomputers for modeling complex systems; digital signal processing;
molecular modeling; and tools and software solutions for different scientific and
technological purposes. Special emphasis is given to publications discussing novel
theories and practical solutions that, by overcoming the limitations of traditional
methods, may successfully address modern scientific challenges, thus promoting
scientific and technological progress. The series publishes monographs, contributed
volumes and conference proceedings, as well as advanced textbooks. The main
targets of the series are graduate students, researchers and professionals working at
the forefront of their fields.
Indexed by SCOPUS. The books of the series are submitted for indexing to
Web of Science.

More information about this series at http://www.springer.com/series/10577


Luca Oneto

Model Selection and Error


Estimation in a Nutshell

Luca Oneto
DIBRIS
Università degli Studi di Genova
Genoa, Italy

ISSN 2196-7326 ISSN 2196-7334 (electronic)


Modeling and Optimization in Science and Technologies
ISBN 978-3-030-24358-6 ISBN 978-3-030-24359-3 (eBook)
https://doi.org/10.1007/978-3-030-24359-3
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Do not worry Dad.
You know.
I understood you.
Even your mistakes.
And at the end, I am like you.
Luca Oneto
Foreword

The field of Machine Learning plays an increasingly important role in science and
engineering. Over the past two decades, the availability of powerful computing
resources has opened the door to synergistic interactions between empirical and
theoretical studies of Machine Learning, showing the value of the “learning from
example” paradigm in a wide variety of applications. Nevertheless, in many such
practical scenarios, the performance of many algorithms often depends crucially on
manually engineered features or hyperparameter settings, which often make the
difference between bad and state-of-the-art performance. Tools from statistical
learning theory allow us to estimate the statistical performance of learning algorithms
and provide a means to better understand the factors that influence their behavior,
ultimately suggesting ways to improve the algorithms or design novel ones. This book
reviews in an intelligible and synthetic way the problem of tuning and assessing the
performance of an algorithm, allowing both young researchers and experienced data
scientists to gain a broad overview of the key problems underlying model selection,
state-of-the-art solutions, and key open questions.

London, UK Prof. Massimiliano Pontil


May 2019

Massimiliano Pontil is a Senior Researcher at Istituto Italiano di Tecnologia, where he leads the
CSML research group, and a part-time Professor of Computational Statistics and Machine Learning
in the Department of Computer Science at University College London. He received a Ph.D. in
Physics from the University of Genoa in 1999 (Advisor Prof. A. Verri). His research interests are in
the areas of machine learning, with a focus on statistical learning theory, kernel methods, multitask
and transfer learning, online learning, learning over graphs, and sparsity regularization. He also has
some interests in approximation theory, numerical optimization, and statistical estimation, and he has
pursued machine learning applications arising in computer vision, bioinformatics, and user mod-
eling. Massimiliano has been on the program committee of the main machine learning conferences,
including COLT, ICML, and NIPS, he is an Associate Editor of the Machine Learning Journal, an
Action Editor for the Journal of Machine Learning Research and he is on the Scientific Advisory
Board of the Max Planck Institute for Intelligent Systems Germany. Massimiliano received the
Edoardo R. Caianiello award for the Best Italian Ph.D. Thesis on Connectionism in 2002, an EPSRC
Advanced Research Fellowship in 2006–2011, and a Best Paper Runner Up from ICML in 2013.

Preface

How can we select the best performing data-driven model? How can we rigorously
estimate its generalization error? Statistical Learning Theory (SLT) answers these
questions by deriving non-asymptotic bounds on the generalization error of a model
or, in other words, by upper bounding the true error of the learned model based just
on quantities computed on the available data. However, for a long time, SLT has
been considered only an abstract theoretical framework, useful for inspiring new
learning approaches, but with limited applicability to practical problems. The
purpose of this book is to give an intelligible overview of the problems of Model
Selection (MS) and Error Estimation (EE), by focusing on the ideas behind the
different SLT-based approaches and simplifying most of the technical aspects with
the purpose of making them more accessible and usable in practice. We will start by
presenting the seminal works of the 80s and proceed up to the most recent results, then
discuss open problems and finally outline future directions of this field of research.

Genoa, Italy Prof. Luca Oneto


May 2019

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 The “Five W” of MS and EE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Induction and Statistical Inference . . . . . . . . . . . . . . . . . . . . . . 13
3.2 The Supervised Learning Problem . . . . . . . . . . . . . . . . . . . . . . 17
3.3 What Methods Will Be Presented and How? . . . . . . . . . . . . . . . 21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Complexity-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Union and Shell Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.2 V. N. Vapnik and A. Chervonenkis Theory . . . . . . . . . . . . . . . . . 39
5.3 Rademacher Complexity Theory . . . . . . . . . . . . . . . . . . . . . . . . . 44
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 Compression Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7 Algorithmic Stability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8 PAC-Bayes Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
9 Differential Privacy Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


10 Conclusions and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . 99


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Appendix A: Bayesian and Frequentist: An Example . . . . . . . . . . . . . . . 101
Appendix B: The Wrong Set of Rules Could Be the Right One . . . . . . . . 107
Appendix C: Properties and Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . 115
About the Author

Luca Oneto was born in Rapallo, Italy in 1986. He received his B.Sc. and M.Sc. in
Electronic Engineering at the University of Genoa, Italy, respectively, in 2008 and
2010. In 2014, he received his Ph.D. from the same university in the School of
Sciences and Technologies for Knowledge and Information Retrieval with the
thesis “Learning Based On Empirical Data”. In 2017, he obtained the Italian
National Scientific Qualification for the role of Associate Professor in Computer
Engineering and in 2018, he obtained the one in Computer Science. He worked as
an Assistant Professor in Computer Engineering at University of Genoa from 2016
to 2019. In 2018, he was a co-founder of the spin-off ZenaByte s.r.l. In 2019, he
obtained the Italian National Scientific Qualification for the role of Full Professor in
Computer Engineering. He is currently an Associate Professor in Computer Science
at University of Pisa. His first main topic of research is the Statistical Learning
Theory with particular focus on the theoretical aspects of the problems of (Semi)
Supervised Model Selection and Error Estimation. His second main topic of
research is Data Science with particular reference to the solution of real world
problems by exploiting and improving the most recent Learning Algorithms and
Theoretical Results in the fields of Machine Learning and Data Mining.

Chapter 1
Introduction

How can we select the best performing data-driven model and quantify its general-
ization error? This question has received a solid answer from the field of statistical
inference since the last century and before [1, 2].
The now classic approach of parametric statistics [3–5] identifies a family of
models (e.g. linear functions), a noise assumption (e.g. Gaussian) and, given some
data, easily provides a criterion for choosing the best model, along with a quantification
of the uncertainty or, in modern terms, an estimation of the generalization error in
terms of a confidence interval. On the contrary, non-parametric statistics addresses the
problem of deriving all the information directly from the data, without any assumption
on the model family or any other information external to the data set itself
[6, 7].
With the advent of the digital information era, this approach has gained more
and more popularity, up to the point of suggesting that effective data-driven models,
with the desired accuracy, can be generated by simply collecting more and more
data (see the work of Dhar [8] for some insights on this provocative and inexact but,
unfortunately, widespread belief).
However, is it really possible to perform statistical inference for building predictive
models without any assumption? Unfortunately, the series of no-free-lunch theorems
provided a negative answer to this question [9]. They also showed that, in general,
it is not even possible to solve apparently simpler problems, like differentiating noise
from data, no matter how large the data set is [10].
SLT addresses exactly this problem, by trying to find necessary and sufficient
conditions for non-parametric inference to build data-driven models from data or,
using the language of SLT, learn an optimal model from data [11]. The main SLT
results have been obtained by deriving non-asymptotic bounds on the generalization
error of a model or, to be more precise, upper and lower bounds on the excess risk
between the optimal predictor and the learned model, as a function of the, possibly
infinite and unknown, family of models and the number of available samples [11].


An important byproduct of SLT has been the (theoretical) possibility of applying


these bounds for solving the problems raised by our first question, about the quality
and the performance of the learned model. However, for a long time, SLT has been
considered only a theoretical, albeit very sound and deep, statistical framework,
without any real applicability to practical problems [12]. Only in the last decade, with
important advances in this field, it has been shown that SLT can provide practical
answers [13–18].
We review here the main results of SLT for the purpose of selecting the best perform-
ing data-driven model and quantifying its generalization error. Note that we will not
cover all the approaches available in the literature, since that would be almost impossi-
ble and also not very useful; rather, we will cover only the ground-breaking results in the
field, since all the other approaches can be seen as small modifications or combinations
of these ground-breaking achievements. Inspired by the idea of Shewchuk [19],
our purpose is to provide an intelligible overview of the ideas, the hypotheses, the
advantages and the disadvantages of the different approaches developed in the SLT
framework. SLT is still an open field of research but can be the starting point for a
better understanding of the methodologies able to rigorously assess the performance
and reliability of data-driven models.

References

1. Casella G, Berger RL (2002) Statistical inference. Duxbury Pacific Grove, CA


2. Reid N, Cox DR (2015) On some principles of statistical inference. Int Stat Rev 83(2):293–308
3. Draper NR, Smith H, Pownell E (1966) Applied regression analysis. Wiley, New York
4. Geisser S, Johnson WO (2006) Modes of parametric statistical inference, vol 529. Wiley
5. Tsybakov AB (2008) Introduction to nonparametric estimation. Springer Science & Business
Media
6. Wolfowitz J (1942) Additive partition functions and a class of statistical hypotheses. Ann.
Math. Stat. 13(3):247–279
7. Randles RH, Hettmansperger TP, Casella G (2004) Introduction to the special issue: nonpara-
metric statistics. Stat. Sci. 19(4):561–561
8. Dhar V (2013) Data science and prediction. Commun. ACM 56(12):64–73
9. Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural
Comput. 8(7):1341–1390
10. Magdon-Ismail M (2000) No free lunch for noise prediction. Neural Comput. 12(3):547–564
11. Vapnik VN (1998) Statistical learning theory. Wiley, New York
12. Valiant LG (1984) A theory of the learnable. Commun. ACM 27(11):1134–1142
13. Bartlett PL, Boucheron S, Lugosi G (2002) Model selection and error estimation. Mach. Learn.
48(1–3):85–113
14. Langford J (2005) Tutorial on practical prediction theory for classification. J. Mach. Learn.
Res. 6:273–306
15. Guyon I, Saffari A, Dror G, Cawley G (2010) Model selection: beyond the bayesian/frequentist
divide. J. Mach. Learn. Res. 11:61–87

16. Anguita D, Ghio A, Oneto L, Ridella S (2012) In-sample and out-of-sample model selection
and error estimation for support vector machines. IEEE Trans. Neural Netw. Learn. Syst.
23(9):1390–1406
17. Oneto L, Ghio A, Ridella S, Anguita D (2015) Fully empirical and data-dependent stability-
based bounds. IEEE Trans. Cybern. 45(9):1913–1926
18. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A (2015) The reusable holdout:
preserving validity in adaptive data analysis. Science 349(6248):636–638
19. Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing
pain
Chapter 2
The “Five W” of MS and EE

Before starting with the technical parts we would like to answer five simple questions
regarding the problem of MS and EE:
• What is MS and EE?
• Why should we care?
• Who should care?
• Where and When this problem has been addressed?
Concerning the first “W”, MS and EE can be defined, respectively, as the
problem of selecting the data-driven model with the highest accuracy and the problem
of estimating the error that the selected model will exhibit on previously unseen data
by relying only on the available data. Examples of MS problems are: choosing between
different learning algorithms (e.g. Support Vector Machines [1], Random Forests [2],
Neural Networks [3], Gaussian Processes [4], Nearest Neighbor [5], and Decision
Trees [6]), setting the hyperparameters of a learning algorithm (e.g. the regularization
in Support Vector Machines, the number of layers and neurons per layer in a Neural
Network, the depth of a Decision Tree, and k in k-Nearest Neighbor), and choosing
the structure of a learning algorithm (e.g. the type of regularizer in Regularized Least
Squares [7, 8] and the families of kernels in Multiple Kernel Learning [9]). Once all
the choices have been performed during the MS phase and the data-driven model has
been built, the EE phase deals with the problem of estimating the error that this model
will exhibit on previously unseen data based on different tools such as: probability
inequalities (e.g. Hoeffding Inequalities [10], Clopper-Pearson Inequalities [11]),
concentration inequalities (e.g. Bounded Difference Function [12], Self Bounding
Functions [13–16]), and moment inequalities [17]. Note that, in this work, we will
deal only with methods which do not make any assumption on the noise that corrupts
the data and that require only quantities that can be computed on the data themselves.
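To make the two phases concrete, here is a minimal, self-contained Python sketch (not taken from the book; the data, the toy learner, and the split sizes are all hypothetical) in which MS is performed on a validation split and EE on a separate hold-out set that is never used for any choice:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 samples, 5 features, binary labels (illustration only).
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

# One split for learning, one for MS (validation), one for EE (hold-out).
X_learn, y_learn = X[:600], y[:600]
X_val, y_val = X[600:800], y[600:800]
X_test, y_test = X[800:], y[800:]

def train(X, y, reg):
    # Toy learner: ridge-regularized least squares on labels in {-1, +1}.
    t = 2 * y - 1
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ t)

def error(w, X, y):
    # Fraction of misclassified points (0/1 loss).
    return np.mean((X @ w > 0).astype(int) != y)

# Model Selection: pick the hyperparameter with the smallest validation error.
candidates = [0.01, 0.1, 1.0, 10.0]
best_reg = min(candidates, key=lambda r: error(train(X_learn, y_learn, r), X_val, y_val))

# Error Estimation: assess the selected model on data never used for any choice.
w = train(np.vstack([X_learn, X_val]), np.hstack([y_learn, y_val]), best_reg)
print("selected reg:", best_reg, "hold-out error:", error(w, X_test, y_test))

Retraining on the union of the learning and validation data after MS is a common design choice; the essential point is that the set used for EE must not influence any decision.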
Regarding the second “W”, MS and EE are a fundamental issue when it comes
to building data-driven models. The first reason is that even advanced practitioners and
researchers in the field often fail to perform MS and EE in the correct way. A quite


representative evidence of this fact is a recent work published in one of the best
journals in the field [18], where the performance of state-of-the-art learning algo-
rithms has been compared on a large series of benchmark datasets. In this work
it is stated that Random Forests is the best performing algorithm, but in a recent
work [19] other researchers showed that the original work [18] contained a pretty
serious flaw in how the MS and EE had been performed: in particular, a hold-out
dataset was missing, leading to biased results. The second reason is rather simple:
any data scientist wants to build the best data-driven model in terms of accuracy and
to have an idea of its actual performance when the model is exploited in a real-world
environment. The third reason is that a series of no-free-lunch theorems [20,
21] ensures that MS and EE will always be necessary, since there will never exist a
golden learning algorithm able to solve all data-related problems in the optimal
way. The last, but not least important, reason is that in the literature, due to the lack of
understanding of the problem of MS and EE, most of the research findings are false or
report biased results [22, 23].
Regarding the “Who”, the answer is a simple consequence of the “Why”, since
every data scientist, machine learner, data miner, and every practitioner who uses
data should understand these concepts in order to provide reliable analyses, results,
insights, and models. Probably the most technical issues of MS and EE are not
fundamental for practitioners, but having a general idea of the problems and of the
state-of-the-art tools and techniques able to address them is crucial for everyone in
order to obtain rigorous and unbiased results.
Finally, for the “Where” and “When” the answer is simpler. The problems of
MS and EE have been addressed in very theoretical Machine Learning and advanced
Statistics papers mainly from the 1960s until today. Because of its intrinsic theoretical
foundations, it is prohibitive for a practitioner to read and comprehend the notation
and the ideas proposed in these papers. This book is born from this observation and
from a quite inspiring work on the conjugate gradient method [24]. In fact, most
of the time, the ideas behind the methods are quite simple and easy to apply, but
the technicalities and the knowledge taken for granted make most of the available
literature hard to comprehend. In this book we will try to keep the presentation as
simple as possible by presenting the problems, the hypotheses, and the ideas behind
the methods without going into too many technical details, but still presenting the
state-of-the-art results in the field. In particular, we will start from the first works of
the 1960s about probability inequalities [10, 11, 25–40], proceed with the asymptotic
analysis [1, 41–43] of the 1970s and the concentration inequalities [13–17, 44–48] of the 1980s,
then move to the finite sample analysis [49–103] of the 1990s, the milestone results of
the 2000s about learnability [31, 104–111], until the most recent results of the 2010s
on interactive data analysis [22, 112–126].

References

1. Vapnik VN (1998) Statistical learning theory. Wiley, New York


2. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
3. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press
4. Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT press
Cambridge
5. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory
13(1):21–27
6. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106
7. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Society Ser B
(Methodol) 267–288
8. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc
Ser B (Stat Methodol) 67(2):301–320
9. Gönen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–
2268
10. Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am
Stat Assoc 58(301):13–30
11. Clopper CJ, Pearson ES (1934) The use of confidence or fiducial limits illustrated in the case
of the binomial. Biometrika 404–413
12. McDiarmid C (1989) On the method of bounded differences. Surv Comb 141(1):148–188
13. Talagrand M (1995) Concentration of measure and isoperimetric inequalities in product
spaces. Publ Mathématiques de l’Institut des Hautes Etudes Sci 81(1):73–205
14. Boucheron S, Lugosi G, Massart P (2000) A sharp concentration inequality with applications.
Random Struct Algorithms 16(3):277–292
15. Bousquet O (2002) A bennett concentration inequality and its application to suprema of
empirical processes. Comptes Rendus Math 334(6):495–500
16. Klein T, Rio E (2005) Concentration around the mean for maxima of empirical processes.
Ann Probab 33(3):1060–1077
17. Boucheron S, Lugosi G, Massart P (2013) Concentration inequalities: a nonasymptotic theory
of independence. Oxford University Press
18. Fernández-Delgado M, Cernadas E, Barro S, Amorim D (2014) Do we need hundreds of
classifiers to solve real world classification problems. J Mach Learn Res 15(1):3133–3181
19. Wainberg M, Alipanahi B, Frey BJ (2016) Are random forests truly the best classifiers? J
Mach Learn Res 17(1):3837–3841
20. Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural
Comput 8(7):1341–1390
21. Forster MR (2005) Notice: no free lunches for anyone. University of Wisconsin-Madison
Madison Department of Philosophy, Bayesians Included
22. Hardt M, Ullman J (2014) Preventing false discovery in interactive data analysis is hard. In:
IEEE Annual Symposium on Foundations of Computer Science
23. Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8):e124
24. Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing
pain
25. Bennett G (1962) Probability inequalities for the sum of independent random variables. J Am
Stat Assoc 57(297):33–45
26. Bernstein S (1924) On a modification of chebyshev’s inequality and of the error formula of
laplace. Ann Sci Inst Sav Ukraine, Sect Math 1(4):38–49
27. Maurer A, Pontil M (2009) Empirical bernstein bounds and sample variance penalization.
arXiv preprint arXiv:0907.3740
28. Bentkus V (2004) On hoeffding’s inequalities. Ann Probab 32(2):1650–1673
29. Talagrand M (1995) The missing factor in hoeffding’s inequalities. Ann l’IHP Probabilités
Stat 31(4):689–702

30. Massart P (1998) Optimal constants for Hoeffding type inequalities. Université de Paris-Sud,
Département de Mathématique
31. Oneto L, Ghio A, Ridella S, Anguita D (2015) Fully empirical and data-dependent stability-
based bounds. IEEE Trans Cybern 45(9):1913–1926
32. Anguita D, Boni A, Ridella S (2000) Evaluating the generalization ability of support vector
machines through the bootstrap. Neural Process Lett 11(1):51–58
33. Efron B, Tibshirani RJ (1994) An introduction to the bootstrap. CRC Press
34. Efron B (1982) The jackknife, the bootstrap and other resampling plans. SIAM
35. Efron B (1992) Bootstrap methods: another look at the jackknife. Break Stat
36. Anguita D, Ghelardoni L, Ghio A, Oneto L, Ridella S (2012) The ‘k’ in k-fold cross validation.
In: European Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN)
37. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model
selection. In: International Joint Conference on Artificial Intelligence
38. Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat
Surv 4:40–79
39. Arlot S (2008) V-fold cross-validation improved: V-fold penalization. arXiv preprint
arXiv:0802.0566
40. Devroye L, Wagner T (1979) Distribution-free inequalities for the deleted and holdout error
estimates. IEEE Trans Inf Theory 25(2):202–207
41. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK (1989) Learnability and the vapnik-
chervonenkis dimension. J ACM 36(4):929–965
42. Abu-Mostafa YS (1989) The vapnik-chervonenkis dimension: information versus complexity
in learning. Neural Comput 1(3):312–317
43. Floyd S, Warmuth M (1995) Sample compression, learnability, and the vapnik-chervonenkis
dimension. Mach Learn 21(3):269–304
44. Talagrand M (1996) New concentration inequalities in product spaces. Inven Math
126(3):505–563
45. Talagrand M (1996) A new look at independence. Ann Probab 24(1):1–34
46. Ledoux M (1997) On talagrand’s deviation inequalities for product measures. ESAIM: Probab
Stat 1:63–87
47. Ledoux M (1996) Isoperimetry and gaussian analysis. In: Lectures on probability theory and
statistics, pp 165–294
48. Bobkov SG, Ledoux M (1998) On modified logarithmic sobolev inequalities for bernoulli
and poisson measures. J Funct Anal 156(2):347–365
49. Bartlett PL, Boucheron S, Lugosi G (2002) Model selection and error estimation. Mach Learn
48(1–3):85–113
50. Oneto L, Ghio A, Ridella S, Anguita D (2015) Global rademacher complexity bounds: from
slow to fast convergence rates. Neural Process Lett 43(2):567–602
51. Oneto L, Ghio A, Ridella S, Anguita D (2015) Local rademacher complexity: sharper risk
bounds with and without unlabeled samples. Neural Netw 65:115–125
52. Oneto L, Anguita D, Ridella S (2016) A local vapnik-chervonenkis complexity. Neural Netw
82:62–75
53. Oneto L, Ghio A, Anguita D, Ridella S (2013) An improved analysis of the rademacher
data-dependent bound using its self bounding property. Neural Netw 44:107–111
54. Anguita D, Ghio A, Oneto L, Ridella S (2014) A deep connection between the vapnik-
chervonenkis entropy and the rademacher complexity. IEEE Trans Neural Netw Learn Syst
25(12):2202–2211
55. Bartlett PL, Bousquet O, Mendelson S (2002) Localized rademacher complexities. In: Com-
putational Learning Theory
56. Blanchard G, Massart P (2006) Discussion: local rademacher complexities and oracle inequal-
ities in risk minimization. Ann Stat 34(6):2664–2671
57. Koltchinskii V (2006) Local rademacher complexities and oracle inequalities in risk mini-
mization. Ann Stat 34(6):2593–2656

58. Bartlett PL, Bousquet O, Mendelson S (2005) Local rademacher complexities. Ann Stat
33(4):1497–1537
59. Bartlett PL, Mendelson S (2003) Rademacher and gaussian complexities: risk bounds and
structural results. J Mach Learn Res 3:463–482
60. Koltchinskii V (2011) Oracle inequalities in empirical risk minimization and sparse recovery
problems. Springer
61. Shawe-Taylor J, Bartlett PL, Williamson RC, Anthony M (1998) Structural risk minimization
over data-dependent hierarchies. IEEE Trans Inf Theory 44(5):1926–1940
62. Anguita D, Ghio A, Oneto L, Ridella S (2012) In-sample and out-of-sample model selec-
tion and error estimation for support vector machines. IEEE Trans Neural Netw Learn Syst
23(9):1390–1406
63. Van Erven T (2014) Pac-bayes mini-tutorial: a continuous union bound. arXiv preprint
arXiv:1405.1580
64. Shawe-Taylor J, Williamson RC (1997) A pac analysis of a bayesian estimator. In: Compu-
tational Learning Theory
65. Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation
for the effectiveness of voting methods. Ann Stat 26(5):1651–1686
66. Gelman A, Carlin JB, Stern HS, Rubin DB (2014) Bayesian data analysis, vol 2. Taylor &
Francis
67. Berend D, Kontorovitch A (2014) Consistency of weighted majority votes. In: Neural Infor-
mation Processing Systems
68. Nitzan S, Paroush J (1982) Optimal decision rules in uncertain dichotomous choice situations.
Int Econ Rev 23(2):289–97
69. Germain P, Lacoste A, Marchand M, Shanian S, Laviolette F (2011) A pac-bayes sample-
compression approach to kernel methods. In: International Conference on Machine Learning
70. Audibert JY, Bousquet O (2003) Pac-bayesian generic chaining. In: Neural Information
Processing Systems
71. Bégin L, Germain P, Laviolette F, Roy JF (2014) Pac-bayesian theory for transductive learning.
In: International Conference on Artificial Intelligence and Statistics
72. Morvant E (2013) Apprentissage de vote de majorité pour la classification supervisée et
l’adaptation de domaine: approches PAC-Bayésiennes et combinaison de similarités. Aix-
Marseille Université
73. Audibert JY (2010) Pac-bayesian aggregation and multi-armed bandits. arXiv preprint
arXiv:1011.3396
74. Oneto L, Ridella S, Anguita D (2016) Tuning the distribution dependent prior in the pac-bayes
framework based on empirical data. In: ESANN
75. Maurer A (2004) A note on the pac bayesian theorem. arXiv preprint cs/0411099
76. Bégin L, Germain P, Laviolette F, Roy JF (2016) Pac-bayesian bounds based on the rényi
divergence. In: International Conference on Artificial Intelligence and Statistics
77. Germain P, Lacasse A, Laviolette F, Marchand M, Roy JF (2015) Risk bounds for the majority
vote: from a pac-bayesian analysis to a learning algorithm. J Mach Learn Res 16(4):787–860
78. Germain P, Lacasse A, Laviolette F, Marchand M (2009) Pac-bayesian learning of linear
classifiers. In: International Conference on Machine Learning
79. Lacasse A, Laviolette F, Marchand M, Germain P, Usunier N (2006) Pac-bayes bounds for
the risk of the majority vote and the variance of the gibbs classifier. In: Neural Information
Processing Systems
80. Laviolette F, Marchand M (2007) Pac-bayes risk bounds for stochastic averages and majority
votes of sample-compressed classifiers. J Mach Learn Res 8(7):1461–1487
81. Laviolette F, Marchand M (2005) Pac-bayes risk bounds for sample-compressed gibbs clas-
sifiers. In: International Conference on Machine learning
82. Langford J, Seeger M (2001) Bounds for averaging classifiers. Technical report, Carnegie
Mellon, Departement of Computer Science
83. Shawe-Taylor J, Langford J (2002) Pac-bayes & margins. In: Neural information processing
systems

84. Roy JF, Marchand M, Laviolette F (2011) From pac-bayes bounds to quadratic programs for
majority votes. In: International Conference on Machine Learning
85. Lever G, Laviolette F, Shawe-Taylor J (2010) Distribution-dependent pac-bayes priors. In:
Algorithmic Learning Theory
86. London B, Huang B, Taskar B, Getoor L, Cruz S (2014) Pac-bayesian collective stability. In:
Artificial Intelligence and Statistics
87. McAllester DA (1998) Some pac-bayesian theorems. In: Computational learning theory
88. McAllester DA (2003) Pac-bayesian stochastic model selection. Mach Learn 51(1):5–21
89. McAllester DA (2003) Simplified pac-bayesian margin bounds. In: Learning Theory and
Kernel Machines
90. Ralaivola L, Szafranski M, Stempfel G (2010) Chromatic pac-bayes bounds for non-iid data:
applications to ranking and stationary β-mixing processes. J Mach Learn Res 11:1927–1956
91. Seeger M (2002) Pac-bayesian generalisation error bounds for gaussian process classification.
J Mach Learn Res 3:233–269
92. Seeger M (2003) Bayesian Gaussian process models: PAC-Bayesian generalisation error
bounds and sparse approximations. PhD thesis, University of Edinburgh
93. Seldin Y, Tishby N (2009) Pac-bayesian generalization bound for density estimation with
application to co-clustering. In: International Conference on Artificial Intelligence and Statis-
tics
94. Seldin Y, Tishby N (2010) Pac-bayesian analysis of co-clustering and beyond. J Mach Learn
Res 11:3595–3646
95. Seldin Y, Auer P, Shawe-Taylor JS, Ortner R, Laviolette F (2011) Pac-bayesian analysis of
contextual bandits. In: Neural Information Processing Systems
96. Seldin Y, Laviolette F, Cesa-Bianchi N, Shawe-Taylor J, Auer P (2012) Pac-bayesian inequal-
ities for martingales. IEEE Trans Inf Theory 58(12):7086–7093
97. Tolstikhin IO, Seldin Y (2013) Pac-bayes-empirical-bernstein inequality. In: Neural Informa-
tion Processing Systems
98. Younsi M (2012) Proof of a combinatorial conjecture coming from the pac-bayesian machine
learning theory. arXiv preprint arXiv:1209.0824
99. McAllester DA, Akinbiyi T (2013) Pac-bayesian theory. In: Empirical Inference
100. Ambroladze A, Parrado-Hernández E, Shawe-Taylor J (2006) Tighter pac-bayes bounds. In:
Advances in neural information processing systems
101. Parrado-Hernández E, Ambroladze A, Shawe-Taylor J, Sun S (2012) Pac-bayes bounds with
data dependent priors. J Mach Learn Res 13(1):3507–3531
102. Catoni O (2007) Pac-Bayesian Supervised Classification. Inst Math Stat
103. Lever G, Laviolette F, Shawe-Taylor J (2013) Tighter pac-bayes bounds through distribution-
dependent priors. Theor Comput Sci 473:4–28
104. Shalev-Shwartz S, Shamir O, Srebro N, Sridharan K (2010) Learnability, stability and uniform
convergence. J Mach Learn Res 11:2635–2670
105. Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526
106. Poggio T, Rifkin R, Mukherjee S, Niyogi P (2004) General conditions for predictivity in
learning theory. Nature 428(6981):419–422
107. Mukherjee S, Niyogi P, Poggio T, Rifkin R (2006) Learning theory: stability is sufficient for
generalization and necessary and sufficient for consistency of empirical risk minimization.
Adv Comput Math 25(1):161–193
108. Maurer A (2017) A second-order look at stability and generalization. In: Conference on
Learning Theory, pp 1461–1475
109. Elisseeff A, Evgeniou T, Pontil M (2005) Stability of randomized learning algorithms. J Mach
Learn Res 6:55–79
110. Tomasi C (2004) Learning theory: past performance and future results. Nature 428(6981):378–
378
111. Kearns M, Ron D (1999) Algorithmic stability and sanity-check bounds for leave-one-out
cross-validation. Neural Comput 11(6):1427–1453

112. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A (2015) Generalization in


adaptive data analysis and holdout reuse. In: Neural Information Processing Systems
113. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A (2015) Preserving statistical
validity in adaptive data analysis. In: Annual ACM Symposium on Theory of Computing
114. Rao RB, Fung G, Rosales R (2008) On the dangers of cross-validation: an experimental
evaluation. In: International Conference on Data Mining
115. Langford J (2005) Clever methods of overfitting. http://hunch.net/?p=22
116. Chaudhuri K, Vinterbo SA (2013) A stability-based validation procedure for differentially
private machine learning. In: Neural Information Processing Systems
117. Hardt M, Ligett K, McSherry F (2012) A simple and practical algorithm for differentially
private data release. In: Neural Information Processing Systems
118. Williams O, McSherry F (2010) Probabilistic inference and differential privacy. In: Neural
Information Processing Systems
119. Blum A, Hardt M (2015) The ladder: a reliable leaderboard for machine learning competitions.
In: International Conference on Machine Learning
120. Steinke T, Ullman J (2015) Interactive fingerprinting codes and the hardness of preventing
false discovery. In: Conference on Learning Theory
121. Nissim K, Stemmer U (2015) On the generalization properties of differential privacy. arXiv
preprint arXiv:1504.05800
122. Oh S, Viswanath P (2015) The composition theorem for differential privacy. In: International
Conference on Machine Learning
123. Jain P, Thakurta AG (2014) (Near) dimension independent risk bounds for differentially
private learning. In: International Conference on Machine Learning
124. Feldman V, Xiao D (2014) Sample complexity bounds on differentially private learning via
communication complexity. In: Conference on Learning Theory
125. Chaudhuri K, Hsu D (2011) Sample complexity bounds for differentially private learning. In:
Conference on Learning Theory
126. Dwork C, Feldman V, Hardt M, Pitassi T, Reingold O, Roth A (2015) The reusable holdout:
preserving validity in adaptive data analysis. Science 349(6248):636–638
Chapter 3
Preliminaries

In this section we will give an overview of the problem of learning based on empir-
ical data. In particular, we will first discuss the inference problem in general, with
particular reference to the inductive case and to the statistical tools exploited to
assess the performance of the induction process. Then we will describe, in detail, the
Supervised Learning (SL) framework, which represents one of the most successful
uses of inductive reasoning. Here we will also introduce the main subject
of this monograph: the MS and EE problems.

3.1 Induction and Statistical Inference

Inference is defined as the act or process of deriving logical conclusions from


premises known or assumed to be true [1]. According to the philosopher Peirce [2]
there are three main approaches to inference (see Fig. 3.1): deductive, induc-
tive, and abductive reasoning.
In the deductive reasoning based on a rule it is possible to map a case into a result.
An example of deductive reasoning is:
• rule: all the swans in that lake are white
• case: there is a swan coming from that lake
• result: this swan is white
In the inductive reasoning a rule is inferred based on one or more examples of the
mapping between case and result. An example of inductive reasoning is:
• case: this swan comes from that lake
• result: this swan is white
• rule: all the swans in that lake are white


Fig. 3.1 Human inference approaches based on the philosopher Charles Sanders Peirce: deduction, induction, and abduction, relating the case (input), the rule (model), and the result (output)

In the abductive reasoning a possible case is inferred based on a result and a rule.
An example of abductive reasoning is:
• result: this swan is white
• rule: all the swans in that lake are white
• case: this swan comes from that lake
Deduction [2–4] is the simplest inference approach since it does not imply any risk.
The process is exact in the sense that there is no possibility of making mistakes. Once
the rule is assumed to be true and the case is available there is no source of uncertainty
in deriving, through deduction, the result. In other words deduction is only able to
infer a statement that is already necessarily true from the premises. Mathematics, for
example, is one of the most important examples of deductive
reasoning. Inductive reasoning [2, 5], instead, implies a certain level of uncertainty,
since we are inferring something that is not necessarily true but probable. In particular,
we only know that there exists at least one case (the one observed) where the inferred
rule is correct. Inductive reasoning is a simple inference procedure that allows us to
increase our level of knowledge, since induction infers something that
cannot be logically deduced from the premises alone. Since the rule cannot
be proved, it can be falsified, which means that we can test it with other observations
of cases and results [5]. The inductive reasoning is the cornerstone of the scientific
method where a phenomenon is observed and a theory is proposed and remains valid
until one case falsifies it. Consequently another theory must be developed in order
to explain the observations which have falsified the original one. Abduction [2, 5],
finally, is the most complex inference approach, since abductive reasoning tries to
derive a result which, as in the inductive case, cannot be deduced from the premises
and, moreover, for which there is not even a single observation that can tell us that the statement
is true. Unlike the inductive case, the statement cannot be falsified simply by repeating
the observations. In order to falsify it, another experiment
must be designed; in other words, the phenomenon must be observed from another
point of view. Abductive inference is at the basis of the most modern scientific
theories where, based on the observed facts and current theories, another theory about
unobserved phenomena is built and the experiments for falsifying it are developed.

In this book we will deal only with the problem of induction, since it is the
only one that can be exploited for increasing our level of knowledge starting from
some observations, and that can be also falsified based on them. In particular we
will deal with the problem of inferring the rule, given some examples of cases
and results. In the field of learning this problem is called SL, where the rule is an
unknown system which maps a point from an input space (the case) to a point in
an output space (the result). But this does not conclude the learning process. The
quality of the learning procedure must be assessed and tuned by estimating its error
based on the desired metric. In order to reach this goal we will make use of statistical
inference techniques. Among others, it is possible to identify two main approaches
to statistical inference:
• the Bayesian one, where the observations are fixed while the unknown parameters
that we want to estimate are described probabilistically;
• the Frequentist one, where the observations are a repeatable random sample (there is
the concept of frequency) while the unknown parameters that we want to estimate
remain constant during this repeatable process (there is no way that probabilities
can be associated with them).
Bayesian inference derives the posterior probability P{H|E}, which represents
the probability of the event H given that we observed the event E (note that, since E
has been observed, this implies that P{E} ≠ 0), based on Bayes' theorem, which
states that:

P{H|E} = P{E|H} P{H} / P{E}.    (3.1)

where
• P{E|H } is the likelihood, which is the probability of observing E given H ;
• P{H } is the prior probability, that is the probability of H before observing E;
• P{E} is the marginal likelihood (or evidence), computable from P{E|H } and P{H }.
The proof of the Bayes’ theorem is trivial and comes from the definition of conditional
probability. In fact by definition we have that

P{H|E} = P{H, E} / P{E},    (3.2)
P{E|H} = P{H, E} / P{H}.    (3.3)

Note that if P{E} = 0 it automatically implies that

P{H, E} = P{H|E} = P{E|H} = 0.    (3.4)

Analogously, if P{H} = 0 we have that

P{H, E} = P{H|E} = P{E|H} = 0.    (3.5)

Consequently we can state that

P{H |E}P{E} = P{E|H }P{H }, (3.6)

which implies the Bayes’ theorem. The result of a Bayesian approach is the complete
probability distribution of the event H given the available observation E. From the
distribution it is possible to obtain any information about H . If H is the value of a
parameter that characterizes E we can compute the expected value of the parameter,
its credible interval, etc.
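As a purely illustrative numerical instance of Eq. (3.1) (all numbers are invented, not from the book), let H be the event "a sample is anomalous" and E the event "a detector fires":

# Illustrative Bayes' rule computation (Eq. 3.1); all numbers are made up.
p_H = 0.01             # prior P{H}
p_E_given_H = 0.90     # likelihood P{E|H}
p_E_given_notH = 0.05  # false-alarm rate P{E|not H}

# Marginal likelihood P{E} via total probability.
p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)

p_H_given_E = p_E_given_H * p_H / p_E
print(p_H_given_E)  # about 0.154: the posterior is much larger than the prior,
                    # but still far from certainty.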
Frequentist inference has been associated with the frequentist interpretation of
probability. In particular the observation of an experiment can be considered as one
of an infinite sequence of possible repetitions of the same experiment. Based on this
interpretation, the aim of the frequentist approach is to draw conclusions from data
which are true with a given (high) probability, among this infinite set of repetitions
of the experiment. Another interpretation, which does not rely on the frequency
interpretation of probability, states that probability relates to a set of random events
which are defined before the observation takes place. In other words an experiment
should include, before performing the experiment, decisions about exactly which
steps will be taken to reach a conclusion from future observations. In a frequentist
approach to inference, unknown parameters are considered as fixed and unknown
values that cannot be treated as random variables in any sense. The result of a
frequentist approach is either a true-or-false conclusion from a significance test or
a conclusion in the form that a given sample-derived confidence interval covers the
true value: both these conclusions have a given probability of being correct.
Finally we would like to underline that Bayesian and Frequentist statistics can
be also described in a more informal way. Bayesian statistics tries to describe in a
rigorous way the initial prior belief about a problem and the update of that belief,
as observations about the problem are made, into a posterior belief. Fre-
quentist statistics, instead, concerns itself with methods that have guarantees about
future observations, no matter the personal belief. Both approaches are quite natural:
regarding Frequentist statistics, if you compute something that occurs many times
then you can be assured that it will continue to occur. However, also, the idea of belief
is very natural: if you have prior knowledge about your data then you should try to
incorporate it. Moreover a typical error while applying the Frequentist approach is
to run an experiment, look at the output, and change the experiment design. This
represents the act of incorporating prior belief without taking it into account. With
Bayesian statistics, you can explicitly deal with beliefs.
However in this book we will mainly adopt the Frequentist approach. One can refer
to other works for a general review [6] of the Bayesian approach and the difference
with the Frequentist one.

3.2 The Supervised Learning Problem

In the SL framework the goal is to approximate a generally unknown system, or rule,


S : X → Y, which maps a point x from an input space X into a point y of an output
space Y, through another rule R : X → Y learned by observing S, which again
maps a point x ∈ X into a point ŷ ∈ Y (see Fig. 3.2). The space Z is defined as the
cartesian product between the input and the output space Z = X × Y and z ∈ Z is a
point in this space. There are many kinds of unknown rules S: the deterministic ones,
where a point x ∈ X is mapped into a single point y ∈ Y, and the probabilistic ones,
where a point x ∈ X can be mapped to different points y ∈ Y. In this monograph
we assume to always deal with probabilistic rules, since they can model also the
deterministic and hybrid ones. Consequently we suppose that S can be modelled
as a probability distribution μ over Z. Note that, in real world applications, the use
of probabilistic rules is necessary, for example, because X can be just a subspace
of all the inputs of the system S and then, even if S is a deterministic system, it
may degenerate into a probabilistic one. Moreover the inputs and outputs of a real
system must be measured, stored and processed, consequently some noise is always
introduced and it must be modelled through probabilistic tools.
Based on the properties of the output space it is possible to define different SL
problems: the classification problems, where the output space consists of a finite
set of possibilities and there is no hierarchy between the different y ∈ Y, and the
regression problems, where Y is a subset of a possibly infinite set of outputs and
there is a hierarchy between the different y ∈ Y.
From S it is possible to retrieve a series of n examples of mapping, which are
called labelled samples Dn . Note that, in general, n is fixed, hence we cannot ask
for more samples. The goal in the SL framework is to map Dn ∈ Z n into a rule
R belonging to a set of possible ones R during the so called learning phase. The
mapping is performed by a learning algorithm: A : Z n → R. Note that R can be:

• a deterministic rule, where R is a deterministic function f : X → Y and R is a set
of deterministic functions F (or hypothesis space). In other words, once f∗ ∈ F is
chosen by A, in order to predict the true label y ∈ Y associated to a point x ∈ X
we have to apply f∗ to x and thus we obtain ŷ = f∗(x) with ŷ ∈ Y. Note that each
point x ∈ X is always associated with the same ŷ ∈ Y, even if we classify the
point x ∈ X many times;
• a probabilistic rule (or randomized function), where R is a probability distribution
Q over a set of deterministic functions F and R is a set of possible distributions Q.
In other words, once a Q∗ ∈ Q is chosen by A, in order to predict the label y ∈ Y
associated to a point x ∈ X we have to sample one function f ∈ F according to
Q∗ and then apply f to x in order to obtain ŷ = f(x) with ŷ ∈ Y. Note that each
point x ∈ X can be associated with a different ŷ ∈ Y if we classify the point
x ∈ X multiple times (a minimal sketch of both cases is given right after this list).
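A minimal Python sketch of the two kinds of rule, with a hypothetical hypothesis space F of threshold functions and an arbitrary distribution Q (an illustration only, not code from the book):

import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite hypothesis space F of threshold classifiers on the reals.
F = [lambda x, t=t: int(x > t) for t in (-1.0, 0.0, 1.0)]

# Deterministic rule: one fixed f in F; the same x always gets the same label.
f_star = F[1]

# Probabilistic (randomized) rule: a distribution Q over F; each prediction first
# samples an f according to Q, so repeated queries on the same x may differ.
Q = np.array([0.2, 0.5, 0.3])

def predict_deterministic(x):
    return f_star(x)

def predict_randomized(x):
    f = F[rng.choice(len(F), p=Q)]
    return f(x)

x = 0.5
print([predict_deterministic(x) for _ in range(5)])  # identical outputs
print([predict_randomized(x) for _ in range(5)])     # may vary across calls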

During the presentation we will deal only with deterministic and probabilistic rules.
The same reasoning can be done for A , since it can be:
• a deterministic algorithm, which means that A returns always the same R if Dn
does not change;
• a probabilistic algorithm (or randomized algorithm) where A may return a differ-
ent R even if we provide the same dataset Dn to the algorithm A .
The quality of R in approximating S is measured with reference to a loss function
ℓ : R × Z → R. Mostly we will use a [0, 1]-bounded loss function, since the extension
to the [a, b]-bounded case is trivial, while to analyze the case of unbounded losses one
usually truncates the values at a certain threshold and bounds the probability of
exceeding that threshold [7].
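As a small illustration of the truncation just mentioned (the loss and the threshold b are arbitrary choices, not the book's), an unbounded loss can be clipped into a bounded one:

import numpy as np

def squared_loss(yhat, y):
    # An unbounded loss.
    return (yhat - y) ** 2

def truncated_loss(yhat, y, b=1.0):
    # Clip the loss into [0, b]; the probability of the untruncated loss
    # exceeding b is then bounded separately.
    return np.minimum(squared_loss(yhat, y), b)

print(truncated_loss(np.array([0.0, 3.0]), np.array([0.0, 0.0])))  # [0. 1.]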
Hence it is possible to introduce the concept of generalization (true) error. For a
deterministic rule R ∈ R it is defined as

L(R) = L(f) = Ez∼μ{ℓ(f, z)}.    (3.7)

For probabilistic rule, instead, it is defined as

L(R) = L(Q) = Ez∼μ Ef∼Q{ℓ(f, z)}.    (3.8)

Note that the generalization error is one of the most informative quantities that
we can estimate about R. It allows us to give a rigorous measure of the error of
our approximation by stating that, on average, the error of our approximation with
reference to a loss function ℓ is equal to L(R).
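When μ is known, as in a simulation, Eq. (3.7) can be approximated by Monte Carlo sampling; this gives a useful mental picture of the quantity that the rest of the book bounds using only a finite Dn. The distribution, the rule and the loss below are hypothetical:

import numpy as np

rng = np.random.default_rng(0)

def sample_z(n):
    # Draw n samples z = (x, y) from a hypothetical distribution mu over Z.
    x = rng.normal(size=n)
    y = (x + 0.3 * rng.normal(size=n) > 0).astype(int)
    return x, y

def f(x):
    # A fixed deterministic rule R = f (a simple threshold).
    return (x > 0).astype(int)

def loss(yhat, y):
    # [0, 1]-bounded loss: the 0/1 (hard) loss.
    return (yhat != y).astype(float)

# Empirical error on a small dataset Dn versus a Monte Carlo estimate of L(f).
x_n, y_n = sample_z(30)
empirical_error = loss(f(x_n), y_n).mean()

x_big, y_big = sample_z(1_000_000)
generalization_error = loss(f(x_big), y_big).mean()  # approximates E_z[loss(f, z)]

print(empirical_error, generalization_error)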
Let us suppose that the probability distribution over Z is known and that we have
access to all the possible rules (probabilistic and deterministic). By all the possible
rules we mean all the possible ways of mapping X to Y. In this case it is possible to
build the Bayes’ rule, which is the best possible approximation of S with reference
to a loss function ℓ (see Fig. 3.3). The Bayes' rule B : X → Y is defined as

RBayes : arg inf L(R). (3.9)



Fig. 3.3 The Bayes’ rule, its approximation and the different sources of error

Unfortunately, in general it is not possible to have access to all the possible rules
but just to a finite set R of possible ones, therefore, in this case, we can define the
best approximation of the Bayes’ rule in R, which is R∗ , as

R∗ : arg inf R∈R L(R). (3.10)

The error due to the choice of R is called approximation error, since it is the
error due to the fact that RBayes ∉ R (see Fig. 3.3). Unfortunately, in general, also
the probability distribution over Z is unknown and consequently L(R) cannot be
computed. The only thing that we can do is to use the algorithm A , which basically
maps Dn into a rule R according to a heuristic, more or less strongly
theoretically grounded, which tries to find R∗ ∈ R based just on Dn . The result is a
rule R̂∗ ∈ R defined as

R̂∗ : A (Dn ). (3.11)

The rule R̂∗ is affected by the estimation error, with respect to R∗ , due to the fact
that the probability distribution over Z is unknown and just Dn is available. Conse-
quently, with respect to RBayes , the rule R̂∗ is affected by two sources of error: the
approximation and the estimation errors (see Fig. 3.3).
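Assuming, purely for clarity, that the infima in Eqs. (3.9) and (3.10) are attained, these two sources of error can be made explicit through the standard decomposition of the excess error of R̂∗ sketched below; the formulation is ours and is not taken verbatim from the text.

```latex
% Excess error of \widehat{R}^* with respect to the Bayes' rule,
% split into the two sources of error discussed above.
L(\widehat{R}^*) - L(R_{\mathrm{Bayes}})
  = \underbrace{\bigl[ L(\widehat{R}^*) - L(R^*) \bigr]}_{\text{estimation error}}
  + \underbrace{\bigl[ L(R^*) - L(R_{\mathrm{Bayes}}) \bigr]}_{\text{approximation error}}
```

The first term is controlled by how well A exploits Dn within R, while the second depends only on the choice of R.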
In this monograph we will deal with the problem of learning from empirical data
where the only information available during the learning phase is Dn . Consequently
it is possible to identify several problems that arise during any learning process: how
R must be designed, how A must be designed, how to choose between different sets
of rules and different algorithms in order to reduce the different sources of error, and how
the generalization error can be estimated based on Dn .
The issues related to designing R and A are outside the scope of this book. For
a deep treatment of the problem of algorithm design, implementation and related
issues one can refer to many books [8–12].
The problems that we will face in detail in this monograph are, instead, the
remaining issues related to any learning process: the MS and the EE phases.

Any algorithm, more or less explicitly, is characterized by a set of hyperparameters


which defines the set of rules from which the algorithm will choose the final one.
Consequently one of the most critical problems in learning is how to choose the
best configuration of this hyperparameters h in a set of possible configurations H.
The problem could also be generalized to the problem of choosing between different
algorithms. Basically one has to choose between different algorithms A ∈ A, each
one characterized by its configuration of hyperparameters h ∈ HA

AH = {Ah : A ∈ A, h ∈ HA } . (3.12)

Note that once the best algorithm Ah ∈ AH is chosen, the final rule can be selected
by applying the algorithm to the available data Dn . Consequently, the first problem
that we will face in this monograph is the one of choosing Ah ∈ AH based just on
the empirical observations (Dn ). In the learning process this phase is called MS or,
more generally, the performance tuning phase.
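As a concrete illustration, the following Python sketch selects Ah ∈ AH using a simple hold-out split of Dn, which is only one of the possible MS strategies. The synthetic data, the two candidate algorithms (logistic regression and a decision tree), their hyperparameter grids, the 70/30 split, and the use of scikit-learn are all assumptions made here for the example, not prescriptions from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# D_n: the only information available (here a synthetic dataset standing in for real data).
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# A_H = {A_h : A in A, h in H_A}: every algorithm paired with each of its hyperparameter configurations.
candidates = [("logistic", {"C": c}) for c in (0.01, 0.1, 1.0, 10.0)]
candidates += [("tree", {"max_depth": d}) for d in (2, 4, 8)]

def build(name, h):
    if name == "logistic":
        return LogisticRegression(C=h["C"], max_iter=1000)
    return DecisionTreeClassifier(max_depth=h["max_depth"], random_state=0)

# Split D_n into a learning part and a hold-out part used only to compare the candidates.
X_learn, X_hold, y_learn, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

def holdout_error(name, h):
    rule = build(name, h).fit(X_learn, y_learn)      # rule returned by A_h on the learning part
    return np.mean(rule.predict(X_hold) != y_hold)   # empirical error on the hold-out part

best = min(candidates, key=lambda c: holdout_error(*c))
print("selected A_h:", best)

# Once A_h is chosen, the final rule is obtained by applying it to the whole D_n.
final_rule = build(*best).fit(X, y)
```

Note how, after the selection, the chosen algorithm is re-applied to the whole Dn to produce the final rule, as described above.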
MS is tightly related to the second issue that we will face in this monograph,
which is how to estimate the generalization error of the learned rule R. This phase is
called EE or, more generally, the performance assessment phase. The link between the
MS and EE phases is that the purpose of any learning procedure is to find the best
possible approximation of the Bayes’ rule given Dn , and by best possible approxi-
mation we mean the rule with the smallest generalization error. In order to estimate
the generalization error of a rule R the state-of-the-art approach is to use probabilities
and concentration inequalities [13–30], which allow one to prove that

P{L(R) ≥ Δ} ≤ δ(Δ), (3.13)

where the confidence δ is a function of the accuracy Δ and vice versa. In other
words, we are able to guarantee, with high probability, that the generalization error
will not be larger than a particular quantity or, alternatively, that the probability of the
generalization error being large is small.
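As one concrete instance of Eq. (3.13), if the loss is [0, 1]-bounded and the rule R is evaluated on m fresh i.i.d. samples not used to choose it, Hoeffding's inequality gives P{L(R) ≥ L̂(R) + Δ} ≤ exp(−2mΔ²); solving exp(−2mΔ²) = δ for Δ yields the sketch below. The choice of Hoeffding's inequality and the numerical values are our illustrative assumptions; the references [13–30] cover many sharper alternatives.

```python
import numpy as np

def hoeffding_radius(m, delta):
    """Accuracy Delta such that P{L(R) >= Lhat(R) + Delta} <= delta
    for a [0, 1]-bounded loss and m i.i.d. samples not used to choose R."""
    return np.sqrt(np.log(1.0 / delta) / (2.0 * m))

# Example: empirical error of R measured on m = 1000 held-out samples.
m, empirical_error, delta = 1000, 0.12, 0.05
bound = empirical_error + hoeffding_radius(m, delta)
print(f"with probability at least {1 - delta:.2f}, L(R) <= {bound:.3f}")
```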
The MS phase is tightly related to the EE one since, if we find a way to accu-
rately estimate the generalization error of a rule, we can easily obtain a criterion to
select among different rules (i.e. different algorithms or different configurations of
the hyperparameters of the algorithm) by choosing the one which minimizes the esti-
mated generalization error. This approach, which is the gold standard, obviously
has some drawbacks (see Fig. 3.4): the sharper the estimation of the generalization
error, the higher the probability of selecting the best model (i.e. of performing a good MS
phase), while the looser the estimation, the lower the probability of performing
a good MS phase.
Note that, in this book, we will present only methods for MS and EE that rely solely
on Dn in order to be applied. None of the presented methods will require any additional
oracle or a priori knowledge in order to be adopted in practice.