
Machine Learning:

Lecture II

Michael Kagan
SLAC

CERN Academic Training Lectures


April 26-28, 2017
Lecture Topics 2

•  Recap of last time


–  What is Machine Learning
–  Linear Regression
–  Logistic Regression
–  Overfitting and Regularization
–  Training procedures and cross validation
–  Gradient descent

•  This Lecture
–  Neural Networks → Just an intro, more on this tomorrow!
–  Decision Trees and Ensemble Methods
–  Unsupervised Learning
•  Dimensionality reduction
•  Clustering
–  No Free Lunch and Practical Advice
Neural Networks 3
Reminder of Logistic Regression 4

•  Input output pairs {xi, yi}, with


–  xi ∈ Rm
–  yi ∈ {0,1}

•  Linear decision boundary h(x; w) = wT x

h(x) > 0
h(x) = 0
h(x) < 0

h(x)

[Bishop]
Reminder of Logistic Regression 5

•  Input output pairs {xi, yi}, with


–  xi ∈ Rm
–  yi ∈ {0,1}

•  Linear decision boundary h(x; w) = wT x

•  Distance from decision boundary is converted to class probability
   using the logistic sigmoid function:

   p(y = 1|x) = σ(h(x; w)) = 1 / (1 + e^{-w^T x})

   Logistic sigmoid: σ(z) = 1 / (1 + e^{-z})
Logistic Regression 6

p(y = 1|x) = σ(h(x; w)) = 1 / (1 + e^{-w^T x})
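As a concrete illustration (not from the slides), a minimal NumPy sketch of this prediction step; the weight vector w is assumed to be already fitted, e.g. by gradient descent on the cross-entropy loss:

import numpy as np

def sigmoid(z):
    # logistic sigmoid: maps the distance from the boundary to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # p(y=1|x) = sigmoid(w^T x) for each row of X
    return sigmoid(X @ w)

# toy usage with a hypothetical, already-learned weight vector
X = np.array([[0.5, -1.2], [2.0, 0.3]])
w = np.array([1.0, -0.5])
print(predict_proba(X, w))   # class-1 probabilities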
Adding non-linearity 7

•  What if we want a non-linear decision boundary?


Adding non-linearity 8

•  What if we want a non-linear decision boundary?


–  Choose basis functions, e.g.: φ(x) ~ {x², sin(x), log(x), …}

   p(y = 1|x) = 1 / (1 + e^{-w^T φ(x)})
Adding non-linearity 9

•  What if we want a non-linear decision boundary?


–  Choose basis functions, e.g.: φ(x) ~ {x², sin(x), log(x), …}

   p(y = 1|x) = 1 / (1 + e^{-w^T φ(x)})
•  What if we don’t know what basis functions we want?
Adding non-linearity 10

•  What if we want a non-linear decision boundary?


–  Choose basis functions, e.g.: φ(x) ~ {x², sin(x), log(x), …}

   p(y = 1|x) = 1 / (1 + e^{-w^T φ(x)})
•  What if we don’t know what basis functions we want?
•  Learn the basis functions directly from data
φ(x; u): Rm → Rd

–  Where u is a set of parameters for the transformation


Adding non-linearity 11

•  What if we want a non-linear decision boundary?


–  Choose basis functions, e.g.: φ(x) ~ {x², sin(x), log(x), …}

   p(y = 1|x) = 1 / (1 + e^{-w^T φ(x)})
•  What if we don’t know what basis functions we want?
•  Learn the basis functions directly from data
φ(x; u): Rm → Rd

–  Where u is a set of parameters for the transformation

–  Combines basis selection and learning


–  Several different approaches, focus here on neural networks
–  Complicates the optimization
Neural Networks 12

•  Define the basis functions j = {1…d}

φj(x; u) = σ(ujTx)
Neural Networks 13

•  Define the basis functions j = {1…d}

φj(x; u) = σ(ujTx)

•  Put all uj ∈ R1×m vectors into matrix U

   φ(x; U) = σ(Ux) = [σ(u1Tx), σ(u2Tx), …, σ(udTx)]T ∈ Rd

–  σ is a pointwise sigmoid acting on each vector element


Neural Networks 14

•  Define the basis functions j = {1…d}

φj(x; u) = σ(ujTx)

•  Put all uj ∈ R1×m vectors into matrix U

   φ(x; U) = σ(Ux) = [σ(u1Tx), σ(u2Tx), …, σ(udTx)]T ∈ Rd

–  σ is a pointwise sigmoid acting on each vector element

•  Full model becomes


h(x; w, U) = wTφ(x; U)
Feed Forward Neural Network 15

Hidden layer
Composed of neurons

φ(…) is often called the activation function

φ(x) = σ(Ux)
h(x) = wT φ(x)
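To make the model above concrete, a minimal NumPy sketch of the forward pass h(x) = wT σ(Ux); the weights U and w are random placeholders rather than trained values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, U, w):
    # hidden layer: phi(x) = sigmoid(U x), one activation per hidden neuron
    phi = sigmoid(U @ x)
    # output: h(x) = w^T phi(x)
    return w @ phi

rng = np.random.default_rng(0)
m, d = 3, 5                      # input and hidden dimensions
U = rng.normal(size=(d, m))      # hidden-layer weights (illustrative values)
w = rng.normal(size=d)           # output weights
x = rng.normal(size=m)
print(forward(x, U, w))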
Multi-layer Neural Network 16

U V

•  Multilayer NN
–  Each layer adapts basis based on previous layer
Universal approximation theorem 17

•  Feed-forward neural network with a single hidden


layer containing a finite number of neurons can
approximate continuous functions arbitrarily well on
a compact space of Rn

–  Only mild assumptions on non-linear activation function


needed. Sigmoid functions work, as do others
Universal approximation theorem 18

•  Feed-forward neural network with a single hidden


layer containing a finite number of neurons can
approximate continuous functions arbitrarily well on
a compact space of Rn

–  Only mild assumptions on non-linear activation function


needed. Sigmoid functions work, as do others

•  But no information on how many neurons needed, or


how much data!
Universal approximation theorem 19

•  Feed-forward neural network with a single hidden


layer containing a finite number of neurons can
approximate continuous functions arbitrarily well on
a compact space of Rn

–  Only mild assumptions on non-linear activation function


needed. Sigmoid functions work, as do others

•  But no information on how many neurons needed, or


how much data!

•  How to find the parameters, given a dataset, to


perform this approximation?
Neural Network Optimization Problem 20

•  Neural Network Model: h(x) = wT σ(Ux)

•  Classification: Cross-entropy loss function

   pi = p(yi = 1|xi) = σ(h(xi))

   L(w, U) = −Σi [ yi ln(pi) + (1 − yi) ln(1 − pi) ]
Neural Network Optimization Problem 21

•  Neural Network Model: h(x) = wT σ(Ux)

•  Classification: Cross-entropy loss function

   pi = p(yi = 1|xi) = σ(h(xi))

   L(w, U) = −Σi [ yi ln(pi) + (1 − yi) ln(1 − pi) ]

•  Regression: Square error loss function

   L(w, U) = ½ Σi (yi − h(xi))²
Neural Network Optimization Problem 22

•  Neural Network Model: h(x) = wT σ(Ux)

•  Classification: Cross-entropy loss function

   pi = p(yi = 1|xi) = σ(h(xi))

   L(w, U) = −Σi [ yi ln(pi) + (1 − yi) ln(1 − pi) ]

•  Regression: Square error loss function

   L(w, U) = ½ Σi (yi − h(xi))²

•  Minimize loss with respect to weights w, U


Gradient Descent 23

•  Minimize loss by repeated gradient steps


–  Compute gradient w.r.t. parameters: ∂L(w)/∂w

–  Update parameters: w′ ← w − η ∂L(w)/∂w
Gradient Descent 24

•  Minimize loss by repeated gradient steps


–  Compute gradient w.r.t. parameters: ∂L(w)/∂w

–  Update parameters: w′ ← w − η ∂L(w)/∂w

•  Now we need gradients w.r.t. w and U
•  Gradients will depend on loss and network architecture
•  Loss function is non-convex
   (many local minima / saddle points)
   –  Gradient descent may not find the global minimum
   –  Can be a major issue!
   –  Variants of stochastic gradient descent can be helpful!
Chain Rule 25

L(w, U) = −Σi [ yi ln(σ(h(xi))) + (1 − yi) ln(1 − σ(h(xi))) ]

•  Derivative of sigmoid: ∂σ(x)/∂x = σ(x)(1 − σ(x))

•  Chain rule to compute gradient w.r.t. w

   ∂L/∂w = (∂L/∂h)(∂h/∂w)
         = Σi [ −yi (1 − σ(h(xi))) σ(Uxi) + (1 − yi) σ(h(xi)) σ(Uxi) ]

•  Chain rule to compute gradient w.r.t. uj

   ∂L/∂uj = (∂L/∂h)(∂h/∂φ)(∂φ/∂uj)
          = Σi [ −yi (1 − σ(h(xi))) wj σ(ujTxi)(1 − σ(ujTxi)) xi
                 + (1 − yi) σ(h(xi)) wj σ(ujTxi)(1 − σ(ujTxi)) xi ]
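A rough NumPy sketch of these gradients for the single-hidden-layer model, assuming the conventions above (rows of X are examples, U is d×m); it uses the compact identity ∂L/∂h = σ(h) − y, which is equivalent to the two terms written on this slide:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, w, U):
    # forward pass
    phi = sigmoid(X @ U.T)            # (n, d) hidden activations
    h = phi @ w                       # (n,) network output
    p = sigmoid(h)                    # (n,) class probabilities
    eps = 1e-12
    loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    dh = p - y                        # dL/dh for each example
    grad_w = phi.T @ dh               # (d,)
    # dL/dU_jk = sum_i dh_i * w_j * phi_ij (1 - phi_ij) * x_ik
    grad_U = (dh[:, None] * phi * (1 - phi) * w).T @ X    # (d, m)
    return loss, grad_w, grad_U

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)); y = (X[:, 0] > 0).astype(float)
w, U = rng.normal(size=4), rng.normal(size=(4, 3))
L, gw, gU = loss_and_grads(X, y, w, U)
w, U = w - 0.01 * gw, U - 0.01 * gU   # one gradient-descent update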
Backpropagation 26

•  Loss function composed of layers of nonlinearity


L(φa(… φ1(x)))
Backpropagation 27

•  Loss function composed of layers of nonlinearity


L(φa(… φ1(x)))
•  Forward step (f-prop)
   –  Compute and save intermediate computations
      φa(… φ1(x))
Backpropagation 28

•  Loss function composed of layers of nonlinearity


L(φa(… φ1(x)))
•  Forward step (f-prop)
   –  Compute and save intermediate computations
      φa(… φ1(x))

•  Backward step (b-prop)

   ∂L/∂φja = Σj′ (∂φj′(a+1)/∂φja) (∂L/∂φj′(a+1))
Backpropagation 29

•  Loss function composed of layers of nonlinearity


L(φa(… φ1(x)))
•  Forward step (f-prop)
   –  Compute and save intermediate computations
      φa(… φ1(x))

•  Backward step (b-prop)

   ∂L/∂φja = Σj′ (∂φj′(a+1)/∂φja) (∂L/∂φj′(a+1))

•  Compute parameter gradients

   ∂L/∂wa = Σj (∂φja/∂wa) (∂L/∂φja)
Training 30

•  Repeat gradient updates of the weights to reduce the loss


–  Each iteration through dataset is called an epoch

•  Use validation set to examine for overtraining, and


determine when to stop training

[graphic from H. Larochelle]


Regularization 31

•  L2 regularization: add Ω(w) = ||w||2 to loss


–  Also called “weight decay”
–  Gaussian prior on weights, keep weights from getting too
large and saturating activation function

•  Regularization inside network, example: Dropout


–  Randomly remove nodes during training
–  Avoid co-adaptation of nodes
–  Essentially a large model averaging procedure

arXiv:1207.0580
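A minimal sketch (in NumPy, not tied to any particular framework) of how both ideas enter training: an L2 penalty λ||w||² added to the loss, and an inverted-dropout mask applied to the hidden activations; keep_prob and lam are illustrative hyperparameters:

import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(w, lam):
    # weight-decay term added to the loss: lam * ||w||^2
    return lam * np.sum(w ** 2)

def dropout(phi, keep_prob=0.5, training=True):
    # randomly zero hidden activations during training ("inverted" dropout:
    # rescale so the expected activation is unchanged at test time)
    if not training:
        return phi
    mask = rng.random(phi.shape) < keep_prob
    return phi * mask / keep_prob

# usage sketch: phi = dropout(sigmoid(U @ x)); loss = base_loss + l2_penalty(w, 1e-3)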
Activation Functions 32

•  Vanishing gradient problem
   –  Derivative of sigmoid: ∂σ(x)/∂x = σ(x)(1 − σ(x))
   –  Nearly 0 when x is far from 0!
   –  Gradient descent difficult!

•  Rectified Linear Unit (ReLU)
   –  ReLU(x) = max{0, x}
   –  Derivative is constant!
      ∂ReLU(x)/∂x = 1 when x > 0, 0 otherwise
   –  ReLU gradient doesn’t vanish
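A small numerical check of the two derivatives above (NumPy, illustrative values only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid_grad(z))   # ~[4.5e-5, 0.25, 4.5e-5]: vanishes far from 0
print(relu_grad(z))      # [0, 0, 1]: constant for positive inputs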
Neural Network Decision Boundaries 33
[Figures: decision boundaries for networks with one, two, three, four, five, twenty, and fifty neurons in the (x1, x2) plane]

2-class classification, 1-hidden-layer NN, L2 norm regularization
4-class classification, 2-hidden-layer NN, ReLU activations, L2 norm regularization

http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r
Deep Neural Networks 34

•  As data complexity grows, need exponentially large number of neurons in


a single-hidden-layer network to capture all the structure in the data
•  Deep neural networks have many hidden layers
–  Factorize the learning of structure in the data across many layers

•  Difficult to train, only recently possible with large datasets, fast computing
(GPU) and new training procedures / network structures (like dropout)
→ More next time
Neural Network Architectures 35

•  Structure of the networks,


and the node connectivity
can be adapted for problem
at hand
•  Convolutions: shared
weights of neurons, but each
neuron only takes subset of
inputs

[Bishop]  http://www.asimovinstitute.org/neural-network-zoo/
Neural Networks in HEP 36

Jets at the LHC | Neutrino identification

Example: NOνA
What do neural networks learn? 37

•  Can visualize weights: neutrino decay classification


arXiv:1604.01444

Image Y-view | Weights of first layer | Output of convolution

https://arxiv.org/abs/1511.05190

•  Find inputs that


most activate a
neuron:
–  Separating boosted
W-jets from quark/
gluon jets
Decision Tree Models 38
Decision Trees 39

•  Partition data based on a sequence of thresholds

•  In a given partition, estimate the class probability from Nm examples


   in partition m and Nk of the examples in the partition from class k:

   pmk = Nk / Nm
Single Decision Trees: Pros and Cons 40

•  Pros:
–  Simple to understand, can visualize a tree
–  Requires little data preparation, and can use continuous
and categorical inputs

•  Cons:
–  Can create complex models that overfit data
–  Can be unstable to small variations in data
–  Training a tree is an NP-complete problem
•  Hard to find a global optimum of all data partitionings
•  Have to use heuristics like greedy optimization where locally
optimal decisions are made

•  We will discuss the ways to overcome these Cons,


including early stopping of training, and ensembles
Greedy Training of a Decision Tree 41

•  Greedy Training: instead of optimizing all


splittings at the same time, optimize them one-by-
one, then move onto next splitting
Greedy Training of a Decision Tree 42

•  Greedy Training: instead of optimizing all


splittings at the same time, optimize them one-by-
one, then move onto next splitting
•  Given Nm examples in a node, for a candidate
splitting θ=(xj , tm) for feature xj and threshold tm
Greedy Training of a Decision Tree 43

•  Greedy Training: instead of optimizing all


splittings at the same time, optimize them one-by-
one, then move onto next splitting
•  Given Nm examples in a node, for a candidate
splitting θ=(xj , tm) for feature xj and threshold tm
•  If data partitioned into subsets Qleft and Qright ,
compute:
   G(Q, θ) = (nleft/Nm) H(Qleft(θ)) + (nright/Nm) H(Qright(θ))

–  Where H() is an impurity function


Greedy Training of a Decision Tree 44

•  Greedy Training: instead of optimizing all


splittings at the same time, optimize them one-by-
one, then move onto next splitting
•  Given Nm examples in a node, for a candidate
splitting θ=(xj , tm) for feature xj and threshold tm
•  If data partitioned into subsets Qleft and Qright ,
compute:
   G(Q, θ) = (nleft/Nm) H(Qleft(θ)) + (nright/Nm) H(Qright(θ))

–  Where H() is an impurity function

•  Choose splitting θ using: θ* = argminθ G(Q, θ)



Impurity Functions 45

•  Classification
   –  Proportion of class k in node m: pmk = Nk / Nm

   –  Gini: H(Xm) = Σk pmk (1 − pmk)

   –  Cross entropy: H(Xm) = −Σk pmk log(pmk)

   –  Misclassification: H(Xm) = 1 − maxk(pmk)

•  Regression
   –  Continuous target y, in region estimate: cm = (1/Nm) Σi∈Nm yi

   –  Square error: H(Xm) = (1/Nm) Σi∈Nm (yi − cm)²
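A minimal NumPy sketch of greedy splitting with the Gini impurity defined above; the exhaustive loop over thresholds is for clarity, not efficiency:

import numpy as np

def gini(y):
    # H(X_m) = sum_k p_mk (1 - p_mk) for the labels y in a node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    # greedy search: minimize G(Q, theta) over all (feature, threshold) pairs
    n, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            G = (len(left) * gini(left) + len(right) * gini(right)) / n
            if G < best[2]:
                best = (j, t, G)
    return best   # (feature index, threshold, impurity G)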
When to stop splitting? 46

•  In principle, can keep splitting until every event is


properly classified…
When to stop splitting? 47

•  In principle, can keep splitting until every event is


properly classified…
[Figure, Rogozhnikov: decision-tree partitioning of the training data in the Variable 1 vs Variable 2 plane]

•  Single decision trees can quickly overfit


•  Especially when increasing the depth of the tree
When to stop splitting? 48

•  In principle, can keep splitting until every event is


properly classified…

•  Can stop splitting early. Many criteria:


–  Fixed tree depth
–  Information gain is not large enough
–  Fix minimum samples needed in node
–  Fix minimum number of samples needed to split node

–  Combinations of these rules work as well


Mitigating Overfitting 49

[Rogozhnikov]
Ensemble Methods 50

•  Can we reduce the variance of a model without


increasing the bias?
Ensemble Methods 51

•  Can we reduce the variance of a model without


increasing the bias?

•  Yes! By training several slightly different models


and taking majority vote (classification) or
average (regression) prediction

–  Bias does not largely increase because the average


ensemble performance is equal to the average of its
members

–  Variance decreases because a spurious pattern picked


up by one model may not be picked up by the others
Ensemble Methods 52

Individual Models Average Model

Green = true function

[Bishop]

•  Combining several weak learners (only small correlation


with target value) with high variance can be extremely
powerful

•  Can be used with decision trees to overcome their


problems of overfitting!
Bagging and Boosting 53

•  Bootstrap Aggregating (Bagging):


–  Sample dataset D with replacement N-times, and train a
separate model on each derived training set
–  Classify example with majority vote, or compute average
   output from each tree as model output

   h(x) = (1/Ntrees) Σi=1..Ntrees hi(x)

•  Boosting:
   –  Train N models in sequence, giving more weight to
      examples not correctly classified by previous models
   –  Take weighted vote to classify examples

   h(x) = ( Σi=1..Ntrees αi hi(x) ) / ( Σi=1..Ntrees αi )

   –  Boosting algorithms include:
      AdaBoost, Gradient boost, XGBoost
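As a rough sketch of bagging (assuming scikit-learn is available; the tree depth and ensemble size are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))
    return trees

def bagged_predict_proba(trees, X):
    # h(x) = (1/N_trees) * sum_i h_i(x), averaging the per-tree class-1 probabilities
    return np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)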
Random Forest 54

•  One of the most commonly used algorithms in


industry is the Random Forest

–  Use bagging to select random example subset

–  Train a tree, but only use random subset of features


(√m features) at each split. This increases the variance of the
individual trees but decorrelates them, so averaging reduces the
variance of the ensemble
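A hedged usage sketch with scikit-learn's RandomForestClassifier; the dataset here is synthetic, and max_features="sqrt" gives the random √m-feature subset at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # number of bagged trees
    max_features="sqrt",     # random sqrt(m)-feature subset at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy on the held-out set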
Ensembles of Trees 55

•  Tree Ensembles
tend to work well
–  Relatively simple

–  Relatively easy to
train

–  Tend not to overfit


(especially random
forests)

–  Work with different


feature types:
continuous,
categorical, etc.

Random Forest [Rogozhnikov]


CMS h→γγ (8 TeV) – Boosted decision tree 56

Eur. Phys. J. C 74 (2014) 3076


Decision Tree Ensembles in HEP 57

https://arxiv.org/abs/1512.05955

•  Decision tree ensembles,


especially with boosting, are
used very widely in HEP!

JHEP 01 (2016) 064

JINST 10 P08010 2015


Unsupervised Learning 58

•  Learning without targets/labels,


find structure in data
Dimensionality Reduction 59

•  Find a low dimensional (less complex)


representation of the data with a mapping
Z=h(X)
Principal Components Analysis 60

•  Given data {xi}i=1…N can we find directions in
   feature space that explain most variation of the data?
Principal Components Analysis 61

•  Given data {xi}i=1…N can we find directions in
   feature space that explain most variation of the data?

•  Data covariance: S = (1/N) Σi=1..N (xi − x̄)(xi − x̄)T
Principal Components Analysis 62

•  Given data {xi}i=1…N can we find directions in
   feature space that explain most variation of the data?

•  Data covariance: S = (1/N) Σi=1..N (xi − x̄)(xi − x̄)T

•  Let u1 be the projected direction, we can solve:

   u1* = argmaxu1 [ u1TSu1 + λ(1 − u1Tu1) ]
         (variance of projected data + unit-length vector constraint)

   ⇒ Su1 = λu1
Principal Components Analysis 63

•  Given data {xi}i=1…N can we find directions in
   feature space that explain most variation of the data?

•  Data covariance: S = (1/N) Σi=1..N (xi − x̄)(xi − x̄)T

•  Let u1 be the projected direction, we can solve:

   u1* = argmaxu1 [ u1TSu1 + λ(1 − u1Tu1) ]
         (variance of projected data + unit-length vector constraint)

   ⇒ Su1 = λu1

•  Principal components are the eigenvectors of the data
   covariance matrix!
   –  Eigenvalues are the variance explained by that component
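A minimal NumPy sketch of PCA as the eigendecomposition of the data covariance matrix (the 2D toy data are illustrative):

import numpy as np

def pca(X, n_components=2):
    # center the data and form the covariance matrix S
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / len(X)
    # principal components = eigenvectors of S, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    explained_variance = eigvals[order[:n_components]]
    return Xc @ components, components, explained_variance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
Z, U, var = pca(X, n_components=1)   # project onto the first principal component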
PCA Example 64

[Ng]
PCA Example 65

First principal component: projections onto this axis have large variance


[Ng]
PCA Example 66

Second principal component: projections onto this axis have small variance


[Ng]
Fisher Discriminant 67

•  Suppose our {xi, yi}i=1…N is separated in two classes,


we want a projection to maximize the separation
between the two classes.
Fisher Discriminant 68

•  Suppose our {xi, yi}i=1…N is separated in two classes,


we want a projection to maximize the separation
between the two classes.
–  Want means (mi) of two classes (Ci) to be as far apart as
possible → large between-class variation
SB = (m2 − m1)(m2 − m1)T
Fisher Discriminant 69

•  Suppose our {xi, yi}i=1…N is separated in two classes,


we want a projection to maximize the separation
between the two classes.
–  Want means (mi) of two classes (Ci) to be as far apart as
possible → large between-class variation
SB = (m2 − m1)(m2 − m1)T

–  Want each class tightly clustered, as little overlap as
   possible → small within-class variation

   SW = Σi∈C1 (xi − m1)(xi − m1)T + Σi∈C2 (xi − m2)(xi − m2)T
Fisher Discriminant 70

•  Suppose our {xi, yi}i=1…N is separated in two classes,


we want a projection to maximize the separation
between the two classes.
–  Want means (mi) of two classes (Ci) to be as far apart as
possible → large between-class variation
SB = (m2 − m1)(m2 − m1)T

–  Want each class tightly clustered, as little overlap as
   possible → small within-class variation

   SW = Σi∈C1 (xi − m1)(xi − m1)T + Σi∈C2 (xi − m2)(xi − m2)T

•  Maximize Fisher criterion

   J(w) = (wTSBw) / (wTSWw)   ⇒   w ∝ SW−1(m2 − m1)
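A minimal NumPy sketch of the Fisher direction w ∝ SW⁻¹(m2 − m1); the two Gaussian blobs are illustrative toy data:

import numpy as np

def fisher_direction(X1, X2):
    # class means
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter S_W = sum over both classes of (x - m)(x - m)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # the Fisher criterion is maximized by w proportional to S_W^{-1} (m2 - m1)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], size=(200, 2))
X2 = rng.normal(loc=[2, 1], size=(200, 2))
w = fisher_direction(X1, X2)
projections = np.concatenate([X1, X2]) @ w   # 1D projected data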
Fisher Discriminant 71

[Bishop]
Comparing Techniques 72

Projected plane is perpendicular to the decision line
Fisher Discriminant 73

hXp://arxiv.org/abs/1407.5675
Average Boosted W jet
2.5

Plotted weights of Fisher Discriminant


10−1

10−2
2.0
10−3
2.5
1.5 10−4 QCD-like 0.6
Q1

10−5
1.0 10−6 W-like
2.0 0.4
10−7
0.5
10−8

10−9
0.2
0.0
0.0 0.5 1.0 1.5 2.0 2.5 Cell
Q2 Coefficient Q1 1.5 0.0

0.2
1.0
Average quark / gluon jet
2.5
0.4
10 −1

2.0 10−2
0.5 0.6
10 −3

1.5 10−4 0.8


Q1

10 −5

1.0 0.0
10−6
1.0
0.5 10−7 0.0 0.5 1.0 1.5 2.0 2.5 Cell

0.0
10−8
Q2 Coefficient
10−9
0.0 0.5 1.0 1.5 2.0 2.5 Cell
Coefficient
Q2
Clustering 74

•  Partition the data into groups D={D1 ∪ D2 … ∪ Dk}

•  What is a good clustering?


•  One where examples within a cluster are more “similar” than to
examples in other clusters

•  What does similar mean? Use distance metric, e.g.


d(x, x′) = sqrt( Σi (xi − x′i)² )
K-means 75

•  Data xi ∈ Rm which you want placed in K clusters

•  Associate each example to a cluster by minimizing


within-class variance
K-means 76

•  Data xi ∈ Rm which you want placed in K clusters

•  Associate each example to a cluster by minimizing


within-class variance
–  Give each cluster Sk a prototype µk∈ Rm where k=1…K
K-means 77

•  Data xi ∈ Rm which you want placed in K clusters

•  Associate each example to a cluster by minimizing


within-class variance
–  Give each cluster Sk a prototype µk∈ Rm where k=1…K

–  Assign each example to a cluster Sk


K-means 78

•  Data xi ∈ Rm which you want placed in K clusters

•  Associate each example to a cluster by minimizing


within-class variance
–  Give each cluster Sk a prototype µk∈ Rm where k=1…K

–  Assign each example to a cluster Sk

–  Find prototypes and assignments to minimize


   L(S, µ) = Σk=1..K Σi∈Sk (xi − µk)²

•  This is an NP-hard problem, with many local minima!


K-means algorithm 79

•  Initialize the µk at random (typically using K-means++ initialization)

•  Repeat until convergence:

   –  Assign each example to the closest prototype:  argmink∈{1…K} (xi − µk)²

   –  Update prototypes:  µk = (1/nk) Σi∈Sk xi

[Bishop]
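A minimal NumPy sketch of this loop (plain random initialization for brevity; K-means++ would be the better choice mentioned above):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # simple random initialization from the data points
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each example goes to its closest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: prototype = mean of the examples assigned to it
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu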
Hierarchical Agglomerative Clustering 80

•  Algorithm
–  Start with each example xi as its own cluster
–  Take pairwise distance between examples
–  Merge closest pair into a new cluster
–  Repeat until one cluster

•  Doesn’t require choice of number of clusters


•  Clusters can have arbitrary shape
•  Clusters have intrinsic hierarchy
•  No random initialization
•  What distance metric to use?
–  Here use Euclidean distance between cluster centroid
(average of examples in cluster)
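A short usage sketch with SciPy's hierarchical clustering, assuming centroid linkage as on this slide; the toy data and the choice of two final clusters are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])

# 'centroid' linkage: merge the pair of clusters with the closest centroids
Z = linkage(X, method="centroid")                  # (n-1, 4) merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters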
Hierarchical Agglomerative Clustering 81

[Parkes]
Hierarchical Agglomerative Clustering 82–86

[Figure sequence, Parkes: examples are merged pairwise into clusters — A and E into AE, C and D into CD, then B with AE into BAE, and finally BAE with CD into a single cluster BAECD]
Jet Algorithms 87

•  Sequential pairwise jet clustering algorithms


are hierarchical clustering, and are
a form of unsupervised learning
•  Compute distance between pseudojets i and j

•  Distance between pseudojet and beam

•  Find smallest distance between pseudojets dij or diB


–  Combine (sum 4-momentum) of two
pseudojets if dij smallest
–  If diB is smallest, remove pseudojet i,
call it a jet
–  Repeat until all pseudojets are jets
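A rough sketch of this loop. The slide's distance formulas are not reproduced above, so this assumes the standard generalized-kT measures dij = min(pTi^2p, pTj^2p) ΔRij²/R² and diB = pTi^2p (p = −1 anti-kT, 0 Cambridge/Aachen, +1 kT), and recombination is simplified to summing pT with pT-weighted (η, φ) rather than true 4-momentum addition:

import numpy as np

def cluster_jets(pseudojets, R=0.4, p=-1):
    # pseudojets: list of (pt, eta, phi); p = -1 anti-kT, 0 C/A, +1 kT (assumed measure)
    jets = []
    pj = [list(x) for x in pseudojets]
    while pj:
        n = len(pj)
        # beam distances d_iB = pt_i^(2p)
        diB = [pt ** (2 * p) for pt, _, _ in pj]
        best = ("beam", min(range(n), key=lambda i: diB[i]))
        dmin = diB[best[1]]
        # pairwise distances d_ij = min(pt_i^2p, pt_j^2p) * dR_ij^2 / R^2
        for i in range(n):
            for j in range(i + 1, n):
                dphi = (pj[i][2] - pj[j][2] + np.pi) % (2 * np.pi) - np.pi
                dR2 = (pj[i][1] - pj[j][1]) ** 2 + dphi ** 2
                dij = min(diB[i], diB[j]) * dR2 / R ** 2
                if dij < dmin:
                    dmin, best = dij, ("pair", i, j)
        if best[0] == "beam":
            jets.append(pj.pop(best[1]))          # remove pseudojet i, call it a jet
        else:
            i, j = best[1], best[2]
            pt = pj[i][0] + pj[j][0]              # crude recombination: sum pT,
            eta = (pj[i][0] * pj[i][1] + pj[j][0] * pj[j][1]) / pt   # pT-weighted eta/phi
            phi = (pj[i][0] * pj[i][2] + pj[j][0] * pj[j][2]) / pt
            pj = [pj[k] for k in range(n) if k not in (i, j)] + [[pt, eta, phi]]
    return jets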
Practical Advice 88
What To Use? So Many Choices 89

•  Once you know what you want to do…

WHAT algorithm should you use?


–  Linear model
–  Nearest Neighbors
–  (Deep?) Neural network
–  Decision tree ensemble
–  Support vector machine
–  Gaussian processes
–  … and so many more …
No Free Lunch - Wolpert (1996) 90

•  In the absence of prior knowledge, there is no a priori


distinction between algorithms, no algorithm that will
work best for every supervised learning problem
–  You cannot say algorithm X will be better without knowing
about the system

–  A model may work really well on one problem, and really


poorly on another

–  This is why data scientists have to try lots of algorithms!

•  But there are some empirical heuristics that have been


observed…
Practical Advice – Empirical Analysis 91

•  Test 179 classifiers (no deep neural networks) on 121 datasets


http://jmlr.csail.mit.edu/papers/volume15/delgado14a/delgado14a.pdf

–  The classifiers most likely to be the bests are the random forest (RF) versions,
the best of which (…) achieves 94.1% of the maximum accuracy
overcoming 90% in the 84.3% of the data sets

From Kaggle
•  For Structured data: “High level” features that have meaning
–  Winning algorithms have been lots of feature engineering + random
forests, or more recently XGBoost (also a decision tree based
algorithm)

•  Unstructured data: “Low level” features, no individual meaning


–  Winning algorithms have been deep learning based, Convolutional
NN for image classification, and Recurrent NN for text and speech
More general advice 92

•  You will likely need to try many algorithms…


–  Start with something simple!
–  Use more complex algorithms as needed
–  Use cross validation to check for overcomplexity / overtraining

•  Check the literature


–  If you can cast your (HEP) problem as something in the ML /
data science domain, there may be guidance on how to proceed

•  Hyperparameters can be hard to tune


–  Use cross validation to compare models with different
hyperparameter values!

•  Use a training / validation / testing split of your data


–  Don’t use training or validation set to determine final
performance
–  And use cross validation as well!
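A hedged scikit-learn sketch of this workflow (synthetic data; the hyperparameter grid is illustrative): cross-validation on the development set for model selection, and an untouched test set for the final number:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# hold out a test set that is never used for model selection
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# compare hyperparameter values with 5-fold cross validation on the dev set
for max_depth in (2, 5, 10, None):
    scores = cross_val_score(RandomForestClassifier(max_depth=max_depth, random_state=0),
                             X_dev, y_dev, cv=5)
    print(max_depth, scores.mean(), scores.std())

# the final, quoted performance comes from the untouched test set only
best = RandomForestClassifier(max_depth=5, random_state=0).fit(X_dev, y_dev)
print("test accuracy:", best.score(X_test, y_test))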
Debugging Learning Algorithms 93

•  Is my model working properly?


–  Where do I stand with respect to bias and variance?
–  Has my training converged?
–  Did I choose the right model / objective?
–  Where is the error in my algorithm coming from?

Section derived from [Ng]


Typical learning curve for high variance 94

Cross validation
Validation error and RMS

[Ng]

•  Performance is not reaching desired level

•  Error still decreasing with training set size
   –  suggests to use more data in training
•  Large gap between training and validation error
   –  Some gap is expected (inherent bias towards training set)
•  Better: Large cross-validation RMS, large performance variation in trainings
Typical learning curve for high bias 95

Cross validation
Validation error and RMS

[Ng]

•  Training error is unacceptably high


•  Small gap between training and validation error
•  Cross validation RMS is small
Potential Fixes 96

•  Fixes to try:
–  Get more training data Fixes high variance
–  Try smaller feature set size Fixes high variance
–  Try larger feature set size Fixes high bias
–  Try different features Fixes high bias

•  Did the training converge?


–  Run gradient descent a few more iterations Fixes optimization algorithm
•  or adjust learning rate
–  Try different optimization algorithm Fixes optimization algorithm

•  Is it the correct model / objective for the problem?


–  Try different regularization parameter value Fixes optimization objective
–  Try different model Fixes optimization objective

•  You will often need to come up with your own diagnostics to


understand what is happening to your algorithm [Ng]
Conclusions 97

•  Machine learning uses mathematical and statistical models


learned from data to characterize patterns and relations between
inputs, and use this for inference / prediction
•  Machine learning provides a powerful toolkit to analyze data
–  Linear methods can help greatly in understanding data

–  Complex models like NN and decision trees can model intricate patterns
•  Care needed to train them and ensure they don’t overfit

–  Unsupervised learning can provide powerful tools to understand data,
even when no labels are available

–  Choosing a model for a given problem is difficult, but there may be some
guidance in the literature
•  Keep in mind the bias-variance tradeoff when building an ML model

•  Deep learning is an exciting frontier and powerful paradigm in


ML research
–  We will hear more about it tomorrow!
Advertisements 98

•  Tomorrow’s lecture on deep learning and


computer vision from Jon Shlens from Google
Brain!

•  Data Science @ HEP workshop on machine


learning in high energy physics
–  May 8-12, 2017 at Fermilab
–  https://indico.fnal.gov/conferenceDisplay.py?ovw=True&confId=13497
Useful Python ML software 99

•  Anaconda / Conda → easy to setup python ML / scientific computing


environments
–  https://www.continuum.io/downloads
–  http://conda.pydata.org/docs/get-started.html

•  Integrating ROOT / PyROOT into conda


–  https://nlesc.gitbooks.io/cern-root-conda-recipes/content/index.html
–  https://conda.anaconda.org/NLeSC

•  Converting ROOT trees to python numpy arrays / panda dataframes


–  https://pypi.python.org/pypi/root_numpy/
–  https://github.com/ibab/root_pandas

•  Scikit-learn → general ML library


–  http://scikit-learn.org/stable/

•  Deep learning frameworks / auto-differentiation packages


–  https://www.tensorflow.org/
–  http://deeplearning.net/software/theano/

•  High level deep learning package built on top of Theano / TensorFlow


–  https://keras.io/
References 100

•  http://scikit-learn.org/
•  [Bishop] Pattern Recognition and Machine Learning, Bishop (2006)
•  [ESL] Elements of Statistical Learning (2nd Ed.) Hastie, Tibshirani & Friedman 2009
•  [Murray] Introduction to machine learning, Murray
–  http://videolectures.net/bootcamp2010_murray_iml/
•  [Ravikumar] What is Machine Learning, Ravikumar and Stone
–  http://www.cs.utexas.edu/sites/default/files/legacy_files/research/documents/MLSS-Intro.pdf
•  [Parkes] CS181, Parkes and Rush, Harvard University
–  http://cs181.fas.harvard.edu
•  [Ng] CS229, Ng, Stanford University
–  http://cs229.stanford.edu/
•  [Rogozhnikov] Machine learning in high energy physics, Alex Rogozhnikov
–  https://indico.cern.ch/event/497368/
101
Example 102

•  Classifying hand written digits


–  10-class classification
–  Right plot shows projection of 10-class output onto 2
dimensions
Error Analysis 103

•  Anti-spam classifier using logistic regression.


•  How much did each component of the system help?
•  Remove each component one at a time to see how it
breaks

Removing text parser


caused largest drop
in performance
Ensemble Methods 104

•  Combine many decision trees, use the ensemble for prediction


•  Averaging: D(x) = (1/Ntree) Σi=1..Ntree di(x)

   –  Random Forest, averaging combined with:

      •  Bagging: Only use a subset of events for each tree training
      •  Feature subsets: Only use a subset of features for each tree

•  Boosting (weighted voting): D(x) = Σi=1..Ntree αi di(x)

   –  Weights computed such that events misclassified in
      previous trees have higher weight in the current tree

–  Several boosting algorithms


•  AdaBoost
•  Gradient Boosting
•  XGBoost
Non-Linear Activations 105

•  The activation function in the NN must be a non-linear function


–  If all the activations were linear, the network would be linear:
f(X) = Wn( Wn-1 (… W1 X)) = UX, where U = Πi Wi

•  Linear functions can only correctly classify linearly separable data!


•  For complex datasets, need nonlinearities to properly learn data
structure

[Figure: a linear classifier vs a non-linear classifier separating classes H0 and H1 in the (x1, x2) plane]
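A quick NumPy check of the statement above: stacking purely linear layers collapses to the single linear map U = W3W2W1:

import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
X = rng.normal(size=(4, 10))

stacked = W3 @ (W2 @ (W1 @ X))     # three "linear layers" applied in sequence
collapsed = (W3 @ W2 @ W1) @ X     # the single equivalent linear map
print(np.allclose(stacked, collapsed))   # True: no extra expressive power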
Neural Networks and Local Minima 106

•  Large NN’s difficult to train… trapped in local minima?

•  Not in large neural networks https://arxiv.org/abs/1412.0233

   –  Most local minima are equivalent, and reasonable
–  Global minima may represent overtraining
–  Most bad (high error) critical points are saddle points (different than
small NN’s)
Weight Initializations and Training Procedures 107

•  Weights used to be set to some small
   initial value
–  Creates an almost linear classifier

•  Now initialize such that node outputs


are normally distributed
•  Pre-training with auto-encoder
–  Network reproduces the inputs
–  Hidden layer is a non-linear
dimensionality reduction
–  Learn important features of the input
–  Not as common anymore, except in
certain circumstances…

•  Adversarial training, invented 2014


–  Will see potential HEP applications later
ReLU Networks 108

http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf

•  Sparse propagation of activations and gradients in a network of rectifier


units. The input selects a subset of active neurons and computation is
linear in this subset.
•  Model is “linear-by-parts”, and can thus be seen as an exponential
number of linear models that share parameters
•  Non-linearity in model comes from path selection
Convolutions in 2D 109

Input image Convolved image

•  Scan the filters over the 2D image, producing the


convolved images
Max Pooling 110

•  Down-sample the input by taking MAX or


average over a region of inputs
–  Keep only the most useful information
Daya Bay 111
Daya Bay Neutrino Experiment 112
arXiv:1601.07621

•  Aim to reconstruct inverse β-decay interactions from


scintillation light recorded in 8x24 PMT’s
•  Study discrimination power using CNN’s
–  Supervised learning → observed excellent performance (97%
   accuracy)
–  Unsupervised learning: ML learns by itself what is interesting!

[Figure: convolutional autoencoder — inputs (8×24) → nonlinear encoding layers (using convolutions) → 10D encoding → nonlinear decoding layers (using deconvolutions) → reconstructed inputs; plus a 2D distance-preserving representation of the 10D encoding of events]
Jet-Images 113

Jet tagging using jet substructure 114

arXiv:1510.05821

•  Typical approach:
   Use physics-inspired variables to provide signal / background discrimination

•  Typical physics-inspired variables exploit differences in:
   •  Jet mass
   •  N-prong structure:
      o  1-prong (QCD)
      o  2-prong (W, Z, H)
      o  3-prong (top)
   •  Radiation pattern:
      o  Soft gluon emission
      o  Color flow

[Figure: typical event displays of a boosted W jet (two distinct lobes of energy) and a boosted QCD jet (invariant mass acquired through multiple splittings), clustered with the anti-kT algorithm with R = 0.6; the ratio τ2/τ1 measures the relative alignment of the jet energy along two candidate subjet directions]

Jet tagging using jet substructure 115

•  N-subjettiness (Thaler & Van Tilburg):

   τN = (1/d0) Σk pT,k min{ΔRk,axis-1, …, ΔRk,axis-N}

   –  Jets with τN ≈ 0 have all their radiation aligned with the candidate subjet
      directions and therefore have N (or fewer) subjets; jets with τN ≫ 0 have a
      large fraction of their energy away from the candidate subjet directions

[Figure: distributions of τ1, τ2 and τ2/τ1 for W jets and QCD jets in an invariant mass window 65 GeV < mj < 95 GeV and |η| < 1.3; by themselves the τN do not discriminate beyond the invariant mass cut]
Pre-processing and space-time symmetries 116

•  Pre-processing steps may not be Lorentz invariant

   m²J = Σi<j Ei Ej (1 − cos θij)

•  Translations in η are Lorentz boosts along the z-axis
   –  Do not preserve the pixel energies
   –  Use pT rather than E as pixel intensity

•  Jet mass is not invariant under image normalization

[Figure: jet mass distributions (Pythia 8, √s = 13 TeV, 240 < pT/GeV < 260, 65 < mass/GeV < 95, normalized to unity) after successive pre-processing steps: no pixelation, only pixelation, naive translation, pT-based translation, flip, π/2 rotation, and p²T image normalization]
Restricted phase space 117

•  Restrict the phase space to very narrow bins of mass and τ21
   (e.g. 79 < mass/GeV < 81 with τ21 in [0.19, 0.21], [0.39, 0.41], or [0.59, 0.61])

•  Improvement in discrimination from new, unique information learned by the
   network beyond m and τ21

[Figure: average W-jet (Pythia 8, W′ → WZ) and QCD-dijet images at √s = 13 TeV, 240 < pT/GeV < 260, in the three τ21 bins, and background rejection (1/background efficiency) versus signal efficiency for mass, τ21, ΔR, MaxOut, Convnet, and random guessing]
Deep correlation jet images 118

•  Spatial information indicative of the radiation pattern for W and QCD jets:
   where in the image the network is looking for discriminating features

[Figure: per-pixel Pearson correlation coefficient between the network output and the jet image, for W′ → WZ and QCD dijet samples (Pythia 8, √s = 13 TeV, 240 < pT/GeV < 260, 79 < mass/GeV < 81) in τ21 bins [0.19, 0.21], [0.39, 0.41], and [0.59, 0.61]]