
Clustering Phenomena in Dropout

David Kewei Lin · Jensen Jinhui Wang · Phil Chen
linkewei@stanford.edu · wangjh97@stanford.edu · philhc@stanford.edu
CS 229 (Spring 2019) Final Project
Mentor: Anand Avati

Introduction

Dropout is a regularization technique introduced in Srivastava et al. (2014) [1].
• During training, nodes are disabled at random with a fixed probability 1 − p (typically 1 − p ≈ 0.2–0.5).
• During test time, each node's output is multiplied by p so that the expected outputs match up.

Clustering

Idea 1 – Dropout "clusters" neurons
• Result: high Rand index between consecutive weight matrices.

Idea 2 – Clustering can predict sampling variance
• Result: almost exact power relation between the size of the hidden layer and the sampling variance.

Bias-Variance Tradeoff

Comparison of model variances. Findings:
• Dropout decreases model variance.
• Sampling at test time further decreases model variance.

Why does sampling at test time decrease variance? The electron cloud model:
• Model family ↔ electron cloud
• Sampling ↔ superposition of states
• Inverting at test time ↔ collapsing wavefunction
On average, the electron cloud moves less than any one "component" of the cloud.
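The train/test protocol described in the Introduction can be sketched as follows. This is an illustrative NumPy implementation of the scheme from [1] (not the authors' code), with p as the keep probability:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, train):
    """Dropout as in Srivastava et al. (2014): at train time each unit
    is dropped with probability 1 - p; at test time activations are
    scaled by p so the expected output matches training time."""
    if train:
        mask = rng.random(x.shape) < p  # keep each unit with probability p
        return x * mask
    return x * p  # deterministic test-time scaling ("inverting at test time")

x = np.ones(10_000)
p = 0.7
train_mean = dropout_forward(x, p, train=True).mean()  # close to p in expectation
test_out = dropout_forward(x, p, train=False)          # exactly p everywhere
```

The test-time branch is the linearization assumption discussed in the Framework section; Monte Carlo integration would instead average several `train=True` passes.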

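The "high Rand index" finding compares cluster assignments of hidden units across consecutive training checkpoints. A minimal sketch of the Rand index itself [3], assuming cluster labels have already been computed at two checkpoints (the label arrays below are made up for illustration):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index (Rand, 1971): fraction of point pairs on which two
    clusterings agree (pair together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical cluster labels for six hidden units at two consecutive epochs:
epoch_t  = [0, 0, 1, 1, 2, 2]
epoch_t1 = [1, 1, 0, 0, 2, 2]  # same partition, clusters merely relabelled
print(rand_index(epoch_t, epoch_t1))  # → 1.0
```

Because the index only looks at pair co-membership, it is invariant to relabelling, which is what makes it suitable for comparing clusterings of consecutive weight matrices.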
Framework / Setup

• Interpretation as a Bayesian neural net:
  • the model is drawn from the Dropout distribution;
  • training is equivalent to variational inference for Bayesian neural networks (Gal and Ghahramani, 2015) [2];
  • the latent variables are the usual weights.
• The Dropout test-time protocol is a linearization assumption (scaling by the Dropout factor).
• Alternative: Monte Carlo integration (sampling).
• Testing setup: MNIST dataset with a basic 3-layer feedforward neural network.

Model Compression

Idea: if Dropout clusters neurons together, use the "centroids" of the weight matrix to compress the network.

Results: (shown graphically on the poster)

Advantages
• Retains original model features
• Easy computation
• Low variance

Disadvantages
• Slightly lower accuracy

Summary / Future Work

Summary
• Dropout clusters the weights of hidden layers.
• These clusters can predict trends for both sample variance and model variance.
• Using the centroids of these clusters, we present a model compression algorithm with strong results.

Future work
• A unified model to explain both sample and model variance through clustering.
• Considering both sets of weights simultaneously for model compression.
• Theoretically motivated hyperparameters for model compression.

References:
[1] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[2] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[3] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. doi:10.1080/01621459.1971.10482356.

Both graphics used in the "Introduction to Dropout" section are from [1].
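The centroid compression idea can be sketched as follows: cluster the rows of a hidden-layer weight matrix with a plain k-means and replace each row by its cluster centroid. This is a minimal NumPy version; the matrix shape and k = 8 are illustrative, not the values used in the poster's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Minimal k-means (Lloyd's algorithm) over the rows of X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Distance from every row to every centroid, then reassign.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # skip empty clusters
                centroids[c] = X[labels == c].mean(axis=0)
    return centroids, labels

def compress_weights(W, k):
    """Replace each row of W by its cluster centroid, so only k distinct
    rows (plus one index per row) need to be stored."""
    centroids, labels = kmeans(W, k)
    return centroids[labels], labels

W = rng.normal(size=(64, 32))          # a hypothetical hidden-layer weight matrix
W_hat, labels = compress_weights(W, k=8)
```

Storing only the k centroid rows plus one small index per row shrinks that layer's parameter count from 64 × 32 floats to 8 × 32 floats and 64 indices, which matches the poster's "easy computation" claim at the cost of some accuracy.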
