Dropout regularization clusters neurons together during training, producing high similarity between consecutive weight matrices. This clustering can be used to predict trends in both the sample variance observed during test-time sampling and the overall model variance. In particular, the sizes of the hidden layers follow an almost exact power relation to the sampling variance. By using the centroids of the neuron clusters as compressed weights, we demonstrate strong model compression results while retaining the original model's predictive features.
linkewei@stanford.edu, wangjh97@stanford.edu, philhc@stanford.edu
CS 229 (Spring 2019) Final Project. Mentor: Anand Avati
Introduction

Dropout is a regularization technique introduced in (Srivastava et al., 2014). During training, each node is disabled at random with a fixed probability 1 − p (with 1 − p ≈ 0.2–0.5). During test time, the node output is multiplied by p so that the expected outputs match up.

Clustering

Idea 1 – Dropout "clusters" neurons
Result: high Rand index between consecutive weight matrices.

Idea 2 – Clustering can predict sampling variance
Result: an almost exact power relation between the size of the hidden layer and the sampling variance.

Bias-Variance Tradeoff

Findings
• Dropout decreases model variance
• Sampling at test time further decreases model variance

Comparison of Model Variances
Why does sampling at test time decrease variance? An electron cloud analogy:
• Model family ↔ Electron cloud
• Sampling ↔ Superposition of states
• Inverting at test time ↔ Collapsing wavefunction
On average, the electron cloud moves less than any one "component" of the cloud.
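The training/test protocol above can be sketched in a few lines. This is a minimal NumPy illustration of standard dropout as described in the introduction, not the project's code; p is the keep probability:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8  # keep probability; the poster drops nodes with probability 1 - p

def dropout_train(h, p, rng):
    """Training time: disable each node independently with probability 1 - p."""
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    """Test time: multiply node outputs by p so expected outputs match training."""
    return h * p

h = np.ones(10_000)
# Averaged over the random masks, the training-time output matches the
# deterministic, scaled test-time output.
print(dropout_train(h, p, rng).mean())  # ≈ 0.8
print(dropout_test(h, p).mean())        # ≈ 0.8
```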
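Idea 1's "high Rand index between consecutive weight matrices" refers to the pairwise agreement measure of [3]. Below is a self-contained sketch of that measure; the two label vectors are hypothetical stand-ins for clusterings of the same layer's neurons at consecutive training checkpoints:

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Rand (1971) index: fraction of point pairs on which two clusterings agree
    (both place the pair in the same cluster, or both place it in different ones)."""
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i, j in combinations(range(len(a)), 2))
    return agree / (len(a) * (len(a) - 1) / 2)

# Toy stand-in for cluster labels of one layer's neurons at steps t and t+1:
labels_t  = np.array([0, 0, 1, 1, 2, 2])
labels_t1 = np.array([0, 0, 1, 1, 2, 0])  # one neuron changed cluster
print(rand_index(labels_t, labels_t1))
```

A Rand index near 1 between consecutive checkpoints is what the poster reports as evidence that the dropout-induced clusters are stable over training.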
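The finding that sampling at test time further decreases variance (the electron cloud moving less than any one of its components) can be illustrated numerically. This is a toy one-layer sketch with hypothetical weights, not the project's MNIST experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5
w = rng.standard_normal(100)  # hypothetical weights of a single hidden layer
x = np.ones(100)

def sampled_output(rng):
    """One stochastic forward pass: drop each weight with probability 1 - p."""
    mask = rng.random(w.shape) < p
    return (w * mask) @ x

# A single stochastic pass vs. the average of T = 20 passes (Monte Carlo
# integration over the dropout distribution):
single = np.array([sampled_output(rng) for _ in range(2000)])
averaged = np.array([np.mean([sampled_output(rng) for _ in range(20)])
                     for _ in range(2000)])

# The variance of the T-pass average shrinks by roughly a factor of T.
print(single.var(), averaged.var())
```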
Framework / Setup

• Interpretation as a Bayesian Neural Net
• Model drawn from the Dropout distribution
• Training is equivalent to variational inference for Bayesian neural networks (Gal and Ghahramani, 2015)
• Latent variables are the usual weights
• The Dropout test-time protocol is a linearization assumption (scaling by the Dropout factor)
• Alternative: Monte Carlo integration (sampling)

Testing setup
• MNIST dataset with a basic 3-layer feedforward neural network

Model Compression

Idea: if dropout clusters neurons together, use the "centroids" of the weight matrix to compress the network.

Advantages
• Retains original model features
• Easy computation
• Low variance

Disadvantages
• Slightly lower accuracy

Summary / Future Work

Summary
• Dropout clusters the weights of hidden layers
• These clusters can predict trends for both sample variance and model variance
• Using the centroids of these clusters, we presented a model compression algorithm with strong results

Future work
• A unified model to explain both sample and model variance through clustering
• Considering both sets of weights simultaneously for model compression
• Theoretically-motivated hyperparameters for model compression

References
[1] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[2] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
[3] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971. doi: 10.1080/01621459.1971.10482356

Both graphics used in the "Introduction to Dropout" section are from [1].
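The centroid-based compression idea can be sketched as follows. The poster does not specify the clustering algorithm, so plain k-means (Lloyd's algorithm) on the rows of a hidden layer's weight matrix is an assumption here, and the weight matrix is synthetic with planted cluster structure:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50, rng=rng):
    """Plain Lloyd's algorithm: returns centroids and per-row assignments."""
    C = X[rng.choice(len(X), k, replace=False)]  # fancy indexing copies X's rows
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # squared distances
        z = d.argmin(1)                                     # nearest centroid
        for j in range(k):
            if (z == j).any():
                C[j] = X[z == j].mean(0)                    # recompute centroid
    return C, z

# Hypothetical hidden-layer weight matrix: 64 neurons (rows), 32 inputs,
# generated around 8 planted centers so the clustering is well-defined.
centers = rng.standard_normal((8, 32))
W = centers[rng.integers(0, 8, 64)] + 0.01 * rng.standard_normal((64, 32))

C, z = kmeans(W, k=8)
W_compressed = C[z]      # every neuron's weights replaced by its cluster centroid
ratio = C.size / W.size  # store 8 distinct rows plus assignments instead of 64 rows
print(ratio, np.abs(W - W_compressed).max())
```

Storing only the centroids and the assignment vector gives the compression; the slight accuracy loss noted under Disadvantages corresponds to the residual between each row and its centroid.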