Lecture 9


Combining Models

Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Motivation 1
• Many models can be trained on the same data
• Typically none is strictly better than others
• Recall “no free lunch theorem”
• Can we “combine” predictions from multiple models?

• Yes, typically with significant reduction of error!

Motivation 2
• Combined prediction using Adaptive Basis Functions
f(x) = \sum_{m=1}^{M} w_m \phi_m(x; v_m)

• M basis functions, each with its own parameters v_m
• Weight / confidence w_m for each basis function
• All parameters, including M, trained using the data

• Another interpretation: automatically learning the best representation of the data for the task at hand

• Difference with mixture models?
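For concreteness, a minimal sketch of the combined prediction f(x) = \sum_m w_m \phi_m(x; v_m), assuming each basis function is a fitted model exposing a predict method (the interface below is an illustrative assumption, not part of the slides):

```python
import numpy as np

def combined_prediction(x, basis_models, weights):
    """f(x) = sum_m w_m * phi_m(x; v_m): weighted sum of M adaptive basis functions.

    basis_models: list of M fitted models, each exposing .predict(x) (assumed interface).
    weights:      array of M confidence weights w_m, learned together with the models.
    """
    return sum(w * phi.predict(x) for w, phi in zip(weights, basis_models))
```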


Examples of Model Combinations
• Also called Ensemble Learning

• Decision Trees
• Bagging
• Boosting
• Committee / Mixture of Experts
• Feed-forward neural nets / Multi-layer perceptrons
• …

Decision Trees
• Partition input space into cuboid regions
• Simple model for each region
• Classification: Single label; Regression: Constant real value
• Sequential process to choose model per instance
• Decision tree

Figure from Murphy

Learning Decision Trees
• Decision for each region
• Regression: Average of training data for the region
• Classification: Most likely label in the region

• Learning tree structure and splitting values


• Learning the optimal tree is intractable

• Greedy algorithm
• Find (node, dimension, value) with the largest reduction of "error"
• Regression error: residual sum of squares
• Classification: Misclassification error, entropy, …
• Stopping condition
• Preventing overfitting: Pruning using cross validation
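A minimal sketch of the greedy growing step for a regression tree, with residual sum of squares as the error; the depth limit and the absence of pruning are simplifying assumptions for illustration:

```python
import numpy as np

def rss(y):
    # Residual sum of squares around the region's mean prediction
    return np.sum((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y):
    # Greedy search over (dimension, value) for the largest reduction in RSS
    best = (None, None, rss(y))
    for d in range(X.shape[1]):
        for v in np.unique(X[:, d]):
            left, right = y[X[:, d] <= v], y[X[:, d] > v]
            if len(left) and len(right):
                err = rss(left) + rss(right)
                if err < best[2]:
                    best = (d, v, err)
    return best

def grow_tree(X, y, depth=0, max_depth=3):
    d, v, _ = best_split(X, y)
    if d is None or depth == max_depth:
        return y.mean()                      # leaf: average of training targets in the region
    mask = X[:, d] <= v
    return {"dim": d, "value": v,
            "left":  grow_tree(X[mask], y[mask], depth + 1, max_depth),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth)}
```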

Pros and Cons of Decision Trees
• Easily interpretable decision process
• Widely used in practice, e.g. medical diagnosis

• Not very good performance


• Restricted partition of space
• Restricted to choose one model per instance
• Unstable

Mixture of Supervised Models

f(x) = \sum_k \pi_k \phi_k(x, w)
[Figures from Murphy: a mixture of linear regression models and a mixture of logistic regression models]

• Training using EM
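A minimal EM sketch for a mixture of K linear regression models, assuming a fixed noise variance and NumPy arrays; K, the initialisation and the small ridge term are illustrative choices, not from the slides:

```python
import numpy as np

def em_mixture_linreg(X, y, K=2, iters=50, sigma2=1.0):
    """EM for f(x) = sum_k pi_k * N(y | x^T w_k, sigma2), with fixed noise variance."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K, D))          # one weight vector per component
    pi = np.full(K, 1.0 / K)             # mixing coefficients
    for _ in range(iters):
        # E-step: responsibilities of each component for each data point
        resid = y[:, None] - X @ W.T                       # N x K residuals
        logp = np.log(pi) - 0.5 * resid ** 2 / sigma2
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per component, then update mixing weights
        for k in range(K):
            r = R[:, k]
            W[k] = np.linalg.solve(X.T @ (r[:, None] * X) + 1e-8 * np.eye(D),
                                   X.T @ (r * y))
        pi = R.mean(axis=0)
    return W, pi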

Conditional Mixture of Supervised Models

• Mixture of experts
f(x) = \sum_k \pi_k(x)\, \phi_k(x, w)

Figure from Murphy
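A minimal sketch of the mixture-of-experts prediction, where the mixing weights now depend on the input through a softmax gating function; the linear form of both the gate and the experts is an assumption for illustration:

```python
import numpy as np

def moe_predict(x, expert_W, gate_V):
    """f(x) = sum_k pi_k(x) * phi_k(x, w_k), with softmax gating pi_k(x)."""
    gate_scores = gate_V @ x                         # one gating score per expert
    pi = np.exp(gate_scores - gate_scores.max())
    pi /= pi.sum()                                   # input-dependent mixing weights
    expert_preds = expert_W @ x                      # linear experts phi_k(x, w_k) = w_k^T x
    return pi @ expert_preds
```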

Bootstrap Aggregation / Bagging
• Individual models (e.g. decision trees) may have high
variance along with low bias

• Construct M bootstrap datasets


• Train separate copy of predictive model on each
• Average prediction over copies
f(x) = \frac{1}{M} \sum_{m=1}^{M} f_m(x)
• If the errors of the individual models are uncorrelated, the expected ensemble error is reduced by a factor of M
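A minimal bagging sketch, using scikit-learn decision trees as the base model (the choice of base learner and library is an assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, M=25, seed=0):
    """Average the predictions of M models, each trained on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    N = len(X_train)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)             # sample N points with replacement
        model = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)                    # f(x) = (1/M) sum_m f_m(x)
```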

Random Forests
• Training the same algorithm on bootstrap datasets still produces correlated errors
• Fix: for each tree, randomly choose (a) a subset of the variables and (b) a subset of the training data

• Good predictive accuracy


• Loss in interpretability
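For example, in scikit-learn the random subset of variables considered at each split is controlled by max_features (the hyperparameter values below are illustrative, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree sees a bootstrap sample of the data (bootstrap=True) and only a random
# subset of the variables at each split (max_features), which decorrelates the errors.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True)
# forest.fit(X_train, y_train); forest.predict(X_test)
```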

Boosting
• Combining weak learners: models only slightly better than random guessing
• E.g. Decision stumps

• Sequence of weighted datasets


• Weight of a data point in each iteration depends on the number of misclassifications in earlier iterations

• Specific weighting scheme depends on loss function


• Theoretical bounds on error

Example loss functions and algorithms
• Squared error
• Absolute error

• Squared loss
• 0-1 loss
• Exponential loss
• Log loss
• Hinge loss
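For concreteness, a sketch of these losses written as functions of the residual r = y − f(x) (regression) or the margin m = y·f(x) with y ∈ {−1, +1} (classification); the margin convention is the usual one and is not spelled out on the slide:

```python
import numpy as np

# Regression losses, as functions of the residual r = y - f(x)
squared_error  = lambda r: r ** 2
absolute_error = lambda r: np.abs(r)

# Classification losses, as functions of the margin m = y * f(x), with y in {-1, +1}
zero_one_loss    = lambda m: np.where(m < 0, 1.0, 0.0)
exponential_loss = lambda m: np.exp(-m)                  # used by AdaBoost
log_loss         = lambda m: np.log1p(np.exp(-m))
hinge_loss       = lambda m: np.maximum(0.0, 1.0 - m)    # used by SVMs
```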

Example: AdaBoost
• Binary classification problem + Exponential loss

1. Initialize data weights w_n^{(1)} = 1/N
2. Train classifier \phi_m(x) minimizing the weighted error J_m = \sum_n w_n^{(m)} I(\phi_m(x_n) \neq y_n)
3. Evaluate \epsilon_m = J_m / \sum_n w_n^{(m)} and \alpha_m = \ln\big((1-\epsilon_m)/\epsilon_m\big)
4. Update weights w_n^{(m+1)} = w_n^{(m)} \exp\big(\alpha_m I(\phi_m(x_n) \neq y_n)\big)
5. Predict f(x) = \mathrm{sign}\big(\sum_{m=1}^{M} \alpha_m \phi_m(x)\big)
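A minimal sketch of these steps with decision stumps as the weak learners; the use of scikit-learn stumps and labels in {−1, +1} are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=50):
    """AdaBoost for binary labels y in {-1, +1} with exponential loss."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # 1. initialize data weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # 2. train weak learner on weighted data
        miss = (stump.predict(X) != y)
        eps = np.sum(w[miss]) / np.sum(w)         # 3. weighted error ...
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)           #    ... and classifier weight
        w *= np.exp(alpha * miss)                 # 4. up-weight misclassified points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # 5. sign of the weighted vote of the weak learners
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```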

Neural networks: Multilayer Perceptrons

• Multiple layers of logistic regression models


• Parameters of each layer optimized by training

• Motivated by models of the brain


• Powerful learning model regardless

LR and R remembered …
• Linear models with fixed basis functions

y(x, w) = f\big(\sum_{j=1}^{M} w_j \phi_j(x)\big)
• Fixed basis functions
• Non-linear transformation

• Linear followed by non-linear transformation

\hat{y}_k(x; w, v) = h\Big(\sum_{j=1}^{M} w_{kj}\, g\Big(\sum_{i=1}^{D} v_{ji} x_i\Big)\Big)

Feed-forward network functions
• M linear combinations of input variables
a_j = \sum_{i=1}^{D} v_{ji} x_i
• Apply non-linear activation function
z_j = g(a_j)
• Linear combinations to get output activations
b_k = \sum_{j=1}^{M} w_{kj} z_j
• Apply output activation function to get outputs

y_k = h(b_k)
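A minimal NumPy sketch of this forward pass, assuming tanh hidden units and an identity output activation (both activation choices are illustrative):

```python
import numpy as np

def forward(x, V, W, g=np.tanh, h=lambda b: b):
    """Forward pass: input x (D,) -> hidden (M,) -> outputs (K,)."""
    a = V @ x          # a_j = sum_i v_ji x_i        hidden pre-activations
    z = g(a)           # z_j = g(a_j)                hidden activations
    b = W @ z          # b_k = sum_j w_kj z_j        output pre-activations
    y = h(b)           # y_k = h(b_k)                network outputs
    return y, (a, z, b)
```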
Network Representation

[Figure: two-layer feed-forward network. Inputs x_1, …, x_D feed hidden units a_j, z_j = g(a_j) through weights v_{ji}; hidden units feed outputs b_k, y_k = h(b_k) through weights w_{kj}. Easy to generalize to multiple layers.]

Power of feed-forward networks
• Universal approximators

A 2-layer network with linear outputs can uniformly approximate any continuous function to arbitrary accuracy, given a sufficient number of nodes in the hidden layer

• Why are >2 layers needed?

Training
• Formulate error function in terms of weights

E(w, v) = \sum_{n=1}^{N} \big\| \hat{y}(x_n; w, v) - y_n \big\|^2

• Optimize weights using gradient descent

(w, v)^{(t+1)} = (w, v)^{(t)} - \eta\, \nabla E\big((w, v)^{(t)}\big)

• Deriving the gradient looks complicated because of the feed-forward structure …

Error Backpropagation
• Full gradient: sequence of local computations and
propagations over the network
• Output layer: \frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial w_{kj}} = \delta^w_{nk}\, z_{nj}, where \delta^w_{nk} = \hat{y}_{nk} - y_{nk}
• Hidden layer: \frac{\partial E_n}{\partial v_{ji}} = \frac{\partial E_n}{\partial a_{nj}} \frac{\partial a_{nj}}{\partial v_{ji}} = \delta^v_{nj}\, x_{ni}, where \delta^v_{nj} = \sum_k \frac{\partial E_n}{\partial b_{nk}} \frac{\partial b_{nk}}{\partial a_{nj}} = \sum_k \delta^w_{nk}\, w_{kj}\, g'(a_{nj})
• Full gradient over the dataset: \frac{\partial E}{\partial w} = \sum_n \frac{\partial E_n}{\partial w}
[Figure: forward function evaluation x_i → a_j → z_j → b_k → ŷ_k, and backward error propagation \delta^w_{nk} → \delta^v_{nj}]
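A minimal NumPy sketch of one backpropagation step for this two-layer network, assuming tanh hidden units, identity outputs and a squared-error loss so that \delta^w_k = \hat{y}_k - y_k (the activation and loss choices are mine, consistent with but not fixed by the slides):

```python
import numpy as np

def backprop_step(x, y, V, W, eta=0.1):
    """One stochastic gradient step on E_n = ||y_hat(x) - y||^2 / 2."""
    # Forward pass (tanh hidden units, identity outputs)
    a = V @ x
    z = np.tanh(a)
    b = W @ z
    y_hat = b
    # Backward pass: output and hidden "errors"
    delta_w = y_hat - y                       # delta^w_k = y_hat_k - y_k
    delta_v = (W.T @ delta_w) * (1 - z ** 2)  # delta^v_j = sum_k delta^w_k w_kj g'(a_j)
    # Gradients are outer products of the errors with each layer's inputs
    grad_W = np.outer(delta_w, z)             # dE_n/dw_kj = delta^w_k z_j
    grad_V = np.outer(delta_v, x)             # dE_n/dv_ji = delta^v_j x_i
    return W - eta * grad_W, V - eta * grad_V
```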

Backpropagation Algorithm
1. Apply an input vector x_n and compute the derived variables a_j, z_j, b_k, \hat{y}_k (forward pass)
2. Compute \delta^w_{nk} at all output nodes
3. Back-propagate to compute \delta^v_{nj} at all hidden nodes
4. Compute the derivatives \partial E_n / \partial w_{kj} and \partial E_n / \partial v_{ji}
5. Batch: sum the derivatives over all input vectors

• Vanishing gradient problem

Neural Network Regularization
• Given such a large number of parameters,
preventing overfitting is vitally important

• Choosing the number of layers and the number of hidden nodes


• Controlling the weights
• Weight decay
• Early stopping
• Weight sharing
• Structural regularization
• Convolutional neural networks for invariances in image data
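As one concrete case, weight decay adds a quadratic penalty (\lambda/2)\|w\|^2 to the error, which appears as an extra \lambda w term in each gradient step (a sketch; \lambda is a tuning parameter, typically chosen by cross validation):

```python
def sgd_step_with_weight_decay(w, grad_E, eta=0.1, lam=1e-3):
    """Weight decay: E_reg(w) = E(w) + (lam/2)*||w||^2, so grad E_reg = grad E + lam*w."""
    return w - eta * (grad_E + lam * w)
```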

So… Which classifier is the best in practice?
• Extensive experimentation by Caruana et al. (2006)
• See a more recent study here
• Low dimensions (9-200):
  1. Boosted decision trees
  2. Random forests
  3. Bagged decision trees
  4. SVM
  5. Neural nets
  6. K nearest neighbors
  7. Boosted stumps
  8. Decision tree
  9. Logistic regression
  10. Naïve Bayes
• High dimensions (500-100K):
  1. HMC MLP
  2. Boosted MLP
  3. Bagged MLP
  4. Boosted trees
  5. Random forests
• Remember the “No Free Lunch” theorem …
