Lecture II
Michael Kagan
SLAC
• This Lecture
– Neural Networks → Just an intro, more on this tomorrow!
– Decision Trees and Ensemble Methods
– Unsupervised Learning
• Dimensionality reduction
• Clustering
– No Free Lunch and Practical Advice
Neural Networks

Reminder of Logistic Regression

[Figure: linear decision boundary h(x) = 0 in feature space, with h(x) > 0 on one side and h(x) < 0 on the other; from Bishop]
Reminder of Logistic Regression

φ_j(x; u) = σ(u_j^T x)

Neural Networks

Hidden layer, composed of neurons:
φ_j(x; u) = σ(u_j^T x)
φ(x) = σ(Ux)
h(x) = w^T φ(x)
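The hidden-layer construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation; the dimensions and random weights are made up:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, U, w):
    # Hidden layer as an adaptive basis: phi(x) = sigma(U x)
    phi = sigmoid(U @ x)
    # Linear output on the learned basis: h(x) = w^T phi(x)
    return w @ phi

# Toy sizes (illustrative only): 3 inputs, 4 hidden neurons
rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))
w = rng.normal(size=4)
x = np.array([1.0, -2.0, 0.5])
h = forward(x, U, w)
```

The only difference from logistic regression is that the basis functions φ_j are themselves parametrized by U and learned, rather than fixed.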
Multi-layer Neural Network

• Multilayer NN
– Each layer adapts basis based on previous layer
Universal Approximation Theorem

A feed-forward network with a single hidden layer of non-constant, bounded, continuous activation units can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units.
L(w, U) = Σ_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]

Neural Network Optimization Problem

L(w, U) = Σ_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]
        = Σ_i [ y_i ln(σ(h(x_i))) + (1 - y_i) ln(1 - σ(h(x_i))) ]
• Derivative of sigmoid: ∂σ(x)/∂x = σ(x)(1 - σ(x))
• Compute parameter gradients with the chain rule: ∂L/∂w_a = Σ_j (∂a_j/∂w_a)(∂L/∂a_j)
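Both identities can be checked numerically. This sketch uses made-up values and a plain logistic output p = σ(w·φ) on a fixed feature vector, for which the chain rule collapses to the well-known gradient (y - p)·φ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Derivative identity: sigma'(x) = sigma(x) * (1 - sigma(x))
x = 0.3
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))

# Chain rule for the log-likelihood of one example:
# L = y ln p + (1 - y) ln(1 - p), with p = sigmoid(w . phi)
# dL/dw = (dL/da)(da/dw) = (y - p) * phi, where a = w . phi
def grad_w(w, phi, y):
    p = sigmoid(w @ phi)
    return (y - p) * phi  # compact result of the chain rule

rng = np.random.default_rng(1)
w = rng.normal(size=4)
phi = rng.normal(size=4)
y = 1.0
g = grad_w(w, phi, y)
```

Comparing `g` against a finite-difference gradient of L confirms the chain-rule computation.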
Training

arXiv:1207.0580 (dropout)
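The paper cited above introduces dropout: randomly zeroing hidden units during training so that neurons cannot co-adapt. A minimal sketch of one common "inverted dropout" formulation (an assumption; the paper's original formulation instead rescales at test time):

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    # Inverted dropout: zero each unit with probability p_drop during
    # training, and rescale the survivors by 1/(1 - p_drop) so the
    # expected activation is unchanged. At test time, do nothing.
    if not train or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(10000)   # toy layer of constant activations
d = dropout(a, 0.5, rng)
```

With p_drop = 0.5 each surviving unit is scaled to 2.0, so the mean activation stays near 1.0.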
Activation Functions

[Example: 2-class classification with a 1-hidden-layer NN of fifty neurons and L2 norm regularization]
http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
http://junma5.weebly.com/data-blog/build-your-own-neural-network-classifier-in-r
Deep Neural Networks

• Difficult to train; only recently possible with large datasets, fast computing (GPU), and new training procedures / network structures (like dropout)
→ More next time
Neural Network Architectures

[Bishop] http://www.asimovinstitute.org/neural-network-zoo/

Neural Networks in HEP

https://arxiv.org/abs/1511.05190
Decision Trees

• Pros:
– Simple to understand; can visualize a tree
– Requires little data preparation, and can use continuous and categorical inputs
• Cons:
– Can create complex models that overfit the data
– Can be unstable to small variations in the data
– Training a tree is an NP-complete problem
  • Hard to find a global optimum of all data partitionings
  • Have to use heuristics like greedy optimization, where locally optimal decisions are made
• Classification
– Proportion of class k in node m: p_mk = N_k / N_m
– Gini: H(X_m) = Σ_k p_mk (1 - p_mk)
– Cross entropy: H(X_m) = - Σ_k p_mk log(p_mk)
• Regression
– Continuous target y; in each region, estimate: c_m = (1/N_m) Σ_{i∈N_m} y_i
– Square error: H(X_m) = (1/N_m) Σ_{i∈N_m} (y_i - c_m)²
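The two classification impurities can be computed directly from the class proportions p_mk of a node. A small sketch (the helper name and toy labels are made up):

```python
import numpy as np

def node_impurities(labels, n_classes):
    # Class proportions in the node: p_mk = N_k / N_m
    counts = np.bincount(labels, minlength=n_classes)
    p = counts / counts.sum()
    # Gini: H = sum_k p_mk (1 - p_mk)
    gini = np.sum(p * (1 - p))
    # Cross entropy: H = -sum_k p_mk log(p_mk), skipping empty classes
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log(nz))
    return gini, entropy

# A pure node has zero impurity; a 50/50 node is maximally impure
g_pure, e_pure = node_impurities(np.array([0, 0, 0, 0]), 2)
g_mix, e_mix = node_impurities(np.array([0, 0, 1, 1]), 2)
```

Both criteria are zero for a pure node and maximal for a uniform class mix, which is why either works as a greedy splitting objective.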
When to stop splitting?

[Rogozhnikov]
Ensemble Methods

[Bishop]

• Tree ensembles tend to work well
– Relatively simple
– Relatively easy to train

https://arxiv.org/abs/1512.05955
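As a toy illustration of why tree ensembles work, here is bagging over decision stumps (depth-1 trees) with majority voting. This is a simplified stand-in for a random forest on made-up Gaussian data, not the method from the slides:

```python
import numpy as np

def fit_stump(X, y):
    # Exhaustively pick the best single-feature threshold (a depth-1 tree)
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            pred = (X[:, f] > t).astype(int)
            for cand in (pred, 1 - pred):
                acc = np.mean(cand == y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, cand is pred)
    return best[1], best[2], best[3]

def predict_stump(stump, X):
    f, t, keep = stump
    pred = (X[:, f] > t).astype(int)
    return pred if keep else 1 - pred

def bagged_predict(stumps, X):
    # Ensemble prediction: majority vote over all stumps
    votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)

# Toy data (hypothetical): two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(100, 2)), rng.normal(1, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Bagging: each stump trains on a bootstrap resample of the data
stumps = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    stumps.append(fit_stump(X[idx], y[idx]))
acc = np.mean(bagged_predict(stumps, X) == y)
```

Averaging many weak, decorrelated trees reduces the variance that makes a single tree unstable.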
Principal Components Analysis

S u_1 = λ_1 u_1

• Principal components are the eigenvectors of the data covariance matrix!
– Eigenvalues are the variance explained by that component
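This eigenvector statement translates directly into code: diagonalize the sample covariance matrix S and project onto its eigenvectors. A minimal NumPy sketch on made-up correlated data:

```python
import numpy as np

# Toy correlated 2D data (illustrative values only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)               # center the data
S = np.cov(Xc, rowvar=False)          # covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]     # sort by variance explained, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                      # coordinates in the principal basis
```

The columns of `eigvecs` are the principal components, `eigvals` are the variances they explain, and the projected coordinates in `Z` are uncorrelated.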
PCA Example

[Ng]

PCA Example

[Bishop]
Comparing Techniques

http://arxiv.org/abs/1407.5675

[Figure: average jet images in the (Q1 coefficient, Q2 coefficient) cell plane on a log color scale: average boosted W jet (with "W-like" and "QCD-like" regions indicated) and average quark / gluon jet]
Clustering

[Bishop]
Hierarchical Agglomerative Clustering

• Algorithm
– Start with each example x_i as its own cluster
– Take pairwise distance between examples
– Merge closest pair into a new cluster
– Repeat until one cluster remains

[Parkes]
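The algorithm above can be sketched directly. Single-linkage distance between clusters is an assumption here (the slides do not fix a linkage), and the helper name and toy points are made up:

```python
import numpy as np

def agglomerate(points):
    # Start with each example as its own cluster, then repeatedly merge
    # the closest pair (single linkage) until one cluster remains.
    # Returns the merge history as (cluster_a, cluster_b) pairs.
    clusters = [frozenset([i]) for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between closest members
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        history.append((clusters[a], clusters[b]))
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Two tight pairs of points, far apart: the pairs merge first
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
merges = agglomerate(pts)
```

The merge history is exactly the dendrogram: n - 1 merges for n points, from the tightest pairs up to the single root cluster.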
[Figure: step-by-step dendrogram construction over examples A-E, merging the closest pair at each step (e.g. C∪D, A∪E, then B with A∪E, then all) until one cluster remains; from Parkes]
Jet Algorithms
– "The classifiers most likely to be the best are the random forest (RF) versions, the best of which (…) achieves 94.1% of the maximum accuracy, overcoming 90% in 84.3% of the data sets"

From Kaggle
• For structured data: "high level" features that have meaning
– Winning algorithms have been lots of feature engineering + random forests, or more recently XGBoost (also a decision-tree-based algorithm)
Cross validation

Validation error and RMS

[Ng]
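A minimal k-fold cross-validation loop that reports the mean and RMS of the per-fold validation error. The helper name `kfold_mse` and the linear-regression toy data are made up for illustration:

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=5, seed=0):
    # Shuffle indices, split into k folds, hold each fold out once
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    errs = np.array(errs)
    # Mean validation error and its spread (RMS) across folds
    return errs.mean(), errs.std()

# Toy data: y = 2x plus small noise, fit by least squares
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

fit = lambda Xt, yt: np.linalg.lstsq(np.c_[np.ones(len(Xt)), Xt], yt, rcond=None)[0]
predict = lambda coef, Xv: np.c_[np.ones(len(Xv)), Xv] @ coef
mean_err, rms_err = kfold_mse(X, y, fit, predict)
```

The fold-to-fold spread (RMS) is what tells you whether a difference between two models' validation errors is meaningful.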
• Fixes to try:
– Get more training data → fixes high variance
– Try smaller feature set size → fixes high variance
– Try larger feature set size → fixes high bias
– Try different features → fixes high bias
– Complex models like NN and decision trees can model intricate patterns
• Care needed to train them and ensure they don’t overfit
– Unsupervised learning can provide powerful tools to understand data,
even when no labels are available
– Choosing a model for a given problem is difficult, but there may be some
guidance in the literature
• Keep in mind the bias-variance tradeoff when building an ML model
• http://scikit-learn.org/
• [Bishop] Pattern Recognition and Machine Learning, Bishop (2006)
• [ESL] Elements of Statistical Learning (2nd Ed.) Hastie, Tibshirani & Friedman 2009
• [Murray] Introduction to machine learning, Murray
– http://videolectures.net/bootcamp2010_murray_iml/
• [Ravikumar] What is Machine Learning, Ravikumar and Stone
– http://www.cs.utexas.edu/sites/default/files/legacy_files/research/documents/MLSS-Intro.pdf
• [Parkes] CS181, Parkes and Rush, Harvard University
– http://cs181.fas.harvard.edu
• [Ng] CS229, Ng, Stanford University
– http://cs229.stanford.edu/
• [Rogozhnikov] Machine learning in high energy physics, Alex Rogozhnikov
– https://indico.cern.ch/event/497368/
Example

[Figure: two panels in the (x1, x2) plane separating classes H0 and H1: a linear classifier (left) and a non-linear classifier (right)]
Neural Networks and Local Minima

http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf
[Figure: network architecture with inputs (8x24), nonlinear encoding layers (using convolutions), and a 10D encoding]

Jet-Images
• Jet mass
• N-prong structure:
  o 1-prong (QCD)
  o 2-prong (W, Z, H)
  o 3-prong (top)
• N-subjettiness: relative occurrence of τ2/τ1 for W jets vs. QCD jets [Van Tilburg]

[Figure 1: Left: schematic of the fully hadronic decay sequences in (a) W+W− and (c) dijet events. Whereas a W jet is typically composed of two distinct lobes of energy, a QCD jet acquires invariant mass through multiple splittings. Right: typical event displays for (b) W jets and (d) QCD jets with invariant mass near mW. The jets are clustered with the anti-kT jet algorithm using R = 0.6, with the dashed line giving the approximate boundary of the jet. The marker size for each calorimeter cell is proportional to the logarithm of the particle energies in the cell; cells are colored according to how the exclusive kT algorithm divides the cells into two candidate subjets. The open square indicates the total jet direction and the open circles indicate the subjet directions. The discriminating variable τ2/τ1 measures the relative alignment of the jet energy along the open circles compared to the open square.]
Pre-processing steps

• Pre-processing may not be Lorentz invariant: m_J² = Σ_{i<j} E_i E_j (1 - cos θ_ij)

[Figure: jet mass distributions, normalized to unity (Pythia 8, √s = 13 TeV, 240 < pT/GeV < 260, 65 < mass/GeV < 95), comparing: no pixelation, only pixelation, Pix+Translate (naive) (×0.75), Pix+Translate]
[Figure: 1/(background efficiency) vs. signal efficiency ROC curves comparing MaxOut, Convnet, mass, τ21, ΔR, and random classifiers, alongside average translated jet images in ([translated] pseudorapidity η, [translated] azimuthal angle φ) with pixel pT on a log color scale; Pythia 8 QCD dijets, √s = 13 TeV, 240 < pT/GeV < 260 GeV, 79 < mass/GeV < 81, in τ21 bins 0.19-0.21, 0.39-0.41, 0.59-0.61]

• Information beyond m, τ21
• Restrict the phase space to very small mass and τ21 bins: improvement in discrimination from new, unique information learned by the network
Deep correlation jet images

[Figure: per-pixel Pearson correlation of the network output with translated jet images in ([translated] pseudorapidity η, [translated] azimuthal angle φ), for Pythia 8 W' → WZ signal and QCD dijet samples at √s = 13 TeV, with 240 < pT/GeV < 260 GeV and 79 < mass/GeV < 81, in bins 0.19 < τ21 < 0.21, 0.39 < τ21 < 0.41, and 0.59 < τ21 < 0.61]