CS190.1x Week 3a
Regression
Goal: learn a mapping from observations (features) to continuous labels, given a training set (supervised learning).
Examples:
- Height, Gender, Weight → Shoe Size
- Audio features → Song year
- Processes, memory → Power consumption
- Historical financials → Future stock price
- Many more
Linear model: given features $\mathbf{x} = [x_1, x_2, x_3]$, predict
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$
Appending a bias feature, $\mathbf{x} = [1, x_1, x_2, x_3]$, this becomes
$\hat{y} = \sum_{i=0}^{3} w_i x_i = \mathbf{w}^\top \mathbf{x}$
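For concreteness, a minimal NumPy sketch of this prediction; the weights and feature values are made up purely for illustration:

```python
import numpy as np

# Hypothetical weights and features for the 3-feature model above.
w = np.array([2.0, 0.5, -1.0, 0.25])   # [w0, w1, w2, w3]
x = np.array([1.0, 3.0, 2.0, 4.0])     # [1, x1, x2, x3], bias feature prepended

y_hat = w @ x   # w0 + w1*x1 + w2*x2 + w3*x3 = 2 + 1.5 - 2 + 1 = 2.5
print(y_hat)
```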
1D Example
Goal: find the line of best fit
x coordinate: features; y coordinate: labels
$\hat{y} = w_0 + w_1 x$, where $w_0$ is the intercept/offset and $w_1$ is the slope.
Evaluating Predictions
We can measure closeness between the label and the prediction:
- Absolute loss: $|y - \hat{y}|$
- Squared loss: $(y - \hat{y})^2$
Linear assumption: $\hat{y} = \mathbf{w}^\top \mathbf{x}$. We use the squared loss: $(y - \hat{y})^2$.
Least squares objective:
$\min_{\mathbf{w}} \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$
In matrix notation, with $X \in \mathbb{R}^{n \times d}$, labels $\mathbf{y}$, and predictions $\hat{\mathbf{y}} = X\mathbf{w}$:
$\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2$
Equivalently: $\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2 = \min_{\mathbf{w}} \sum_{i=1}^{n} \left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$
To minimize, consider the 1D case: $f(w) = \sum_{i=1}^{n} \left(w x^{(i)} - y^{(i)}\right)^2$, and set the derivative to zero:
$\frac{df}{dw}(w) = 2 \sum_{i=1}^{n} x^{(i)} \left(w x^{(i)} - y^{(i)}\right) = 0$
$w \sum_{i} x^{(i)} x^{(i)} - \sum_{i} x^{(i)} y^{(i)} = 0 \;\Rightarrow\; w = \left( \sum_{i} x^{(i)} x^{(i)} \right)^{-1} \sum_{i} x^{(i)} y^{(i)}$
The same derivation in matrix form gives $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
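A minimal NumPy sketch of this closed-form solution on synthetic data; all names and values are illustrative, and solving the normal equations avoids forming an explicit inverse:

```python
import numpy as np

# Synthetic data: n points, d features, labels from a known weight vector.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to w_true
```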
Least Squares Regression: learn the mapping ($\mathbf{w}$) from features to labels that minimizes the residual sum of squares:
$\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2$
Closed-form solution: $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
Ridge Regression: trade off training error against model complexity by penalizing large weights:
$\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$
Closed-form solution: $\mathbf{w} = (X^\top X + \lambda I_d)^{-1} X^\top \mathbf{y}$
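The ridge solution differs from least squares only by the $\lambda I_d$ term; a minimal sketch, where `ridge_closed_form` and `lam` (the regularization parameter $\lambda$) are illustrative names rather than course code:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I_d) w = X^T y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```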
Millionsong Regression Pipeline
[Pipeline diagram, built up one stage at a time:
Obtain Raw Data → Split Data (full dataset → training set + test set) → Feature Extraction → Supervised Learning (fit a model on the training set) → Evaluation (accuracy on the test set) → Predict (prediction for a new entity)]
Quadratic features: given $\mathbf{x} = [x_1, x_2]$ and $\mathbf{z} = [z_1, z_2]$,
$\Phi(\mathbf{x}) = [x_1^2,\; x_1 x_2,\; x_2 x_1,\; x_2^2]$, $\quad \Phi(\mathbf{z}) = [z_1^2,\; z_1 z_2,\; z_2 z_1,\; z_2^2]$
More succinctly:
$\Phi(\mathbf{x}) = [x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2]$, $\quad \Phi(\mathbf{z}) = [z_1^2,\; \sqrt{2}\, z_1 z_2,\; z_2^2]$
$\Phi(\mathbf{x})^\top \Phi(\mathbf{z}) = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = (\mathbf{x}^\top \mathbf{z})^2$
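A quick numerical check of this identity; the example vectors are arbitrary:

```python
import numpy as np

def phi(v):
    # Succinct quadratic feature map: [v1^2, sqrt(2)*v1*v2, v2^2]
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lhs = phi(x) @ phi(z)   # x1^2 z1^2 + 2 x1 x2 z1 z2 + x2^2 z2^2
rhs = (x @ z) ** 2      # (x . z)^2
print(lhs, rhs)         # both 1.0
```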
Recall ridge regression: $\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$, with closed-form solution $\mathbf{w} = (X^\top X + \lambda I_d)^{-1} X^\top \mathbf{y}$. How should we choose the regularization parameter $\lambda$, which controls model complexity?
[Pipeline diagram, revisited: Split Data now produces training, validation, and test sets; the validation set is used to select hyperparameters.]
[Grid search illustration: candidate values of the regularization parameter ($\lambda$), e.g. $10^{-8}, 10^{-6}, 10^{-4}, 10^{-2}$, crossed with the values of a second hyperparameter.]
Evaluating Predictions
How can we compare labels and predictions for n validation points?
Mean squared error (MSE): $\frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$
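A sketch putting the last two slides together: grid search over $\lambda$, scored by MSE on a validation split. The data, split sizes, and helper names are placeholders, not course code:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Split: 60% train, 20% validation (test set held out, as in the pipeline).
X_train, X_val = X[:120], X[120:160]
y_train, y_val = y[:120], y[120:160]

# Grid search over the regularization parameter, scored by validation MSE.
best = min(
    (mse(y_val, X_val @ ridge_closed_form(X_train, y_train, lam)), lam)
    for lam in [1e-8, 1e-6, 1e-4, 1e-2]
)
print("best (val MSE, lambda):", best)
```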
Distributed ML:
Computation and Storage
Challenge: Scalability
Classic ML techniques are not always suitable for modern datasets
[Chart: projected growth of overall data, particle accelerator data, and DNA sequencer data vs. Moore's Law, 2010-2015. Data grows faster than Moore's Law. Source: IDC report, Kathy Yelick, LBNL]
Machine Learning + Data → Distributed Computing
$\min_{\mathbf{w}} \|X\mathbf{w} - \mathbf{y}\|_2^2$
Closed-form solution: $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
Computation: $O(nd^2 + d^3)$ operations
Storage Requirements
$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
Computation: $O(nd^2 + d^3)$ operations
Storage: $O(nd + d^2)$ floats (each value stored as an 8-byte float)
Storage bottlenecks:
- $X^\top X$ and its inverse: $O(d^2)$ floats
- $X$: $O(nd)$ floats
$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
Computation: $O(nd^2 + d^3)$ operations; storage: $O(nd + d^2)$ floats
Big n, small d: assume $O(d^3)$ computation and $O(d^2)$ storage are feasible on a single machine.
[Worked example: a matrix product computed two ways. Entry by entry via inner products, e.g. multiplying $\begin{pmatrix} 9 & 3 & 5 \\ 4 & 1 & 2 \end{pmatrix}$ by a matrix whose first column is $(1, 3, 2)^\top$ gives a first result column of $(9{\cdot}1 + 3{\cdot}3 + 5{\cdot}2,\; 4{\cdot}1 + 1{\cdot}3 + 2{\cdot}2)^\top = (28, 11)^\top$. Equivalently, the whole product is the sum of outer products of the left matrix's columns with the right matrix's rows; both views give the same result. The outer-product view is what lets the computation be distributed.]
Distribute via outer products:
$X^\top X = \sum_{i=1}^{n} \mathbf{x}^{(i)} \mathbf{x}^{(i)\top}$
Example: n = 6; 3 workers. Worker 1 stores x(1), x(5); worker 2 stores x(3), x(4); worker 3 stores x(2), x(6): $O(nd)$ distributed storage.
map: each worker computes the outer product $\mathbf{x}^{(i)} \mathbf{x}^{(i)\top}$ for each of its local points ($O(nd^2)$ distributed computation, $O(d^2)$ local storage).
reduce: sum the partial results and invert, $\left( \sum_{i} \mathbf{x}^{(i)} \mathbf{x}^{(i)\top} \right)^{-1}$ ($O(d^3)$ local computation, $O(d^2)$ local storage).
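A minimal single-machine sketch of this map/reduce pattern in plain Python with NumPy; the three lists stand in for the course's Spark workers and partitions, and all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3
points = [rng.normal(size=d) for _ in range(n)]   # x(1)..x(6)
labels = rng.normal(size=n)

# "Workers" hold partitions of the data: O(nd) distributed storage.
workers = [points[0:2], points[2:4], points[4:6]]

# map: each worker emits d x d outer products x x^T for its local points.
partials = [sum(np.outer(x, x) for x in part) for part in workers]

# reduce: sum the d x d partial results, then solve locally (O(d^3)).
# (X^T y is likewise a sum over points; computed centrally here for brevity.)
XtX = sum(partials)
Xty = sum(x * y for x, y in zip(points, labels))
w = np.linalg.solve(XtX, Xty)
print(w)
```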
Distributed ML:
Computation and Storage,
Part II
$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$
Computation: $O(nd^2 + d^3)$ operations; storage: $O(nd + d^2)$ floats
So far we assumed $O(d^3)$ computation and $O(d^2)$ storage were feasible on a single machine. With big n and big d, they no longer are:
- As before, storing $X$ and computing $X^\top X$ are bottlenecks
- Now, storing and operating on $X^\top X$ is also a bottleneck
- Can't easily distribute!
One idea: exploit structure in the data.
Sparsity, e.g. a length-7 vector with two non-zeros:
- dense: [1., 0., 0., 0., 0., 0., 3.]
- sparse: size = 7, indices = [0, 6], values = [1., 3.]
Low-rank structure: represent a large matrix via skinny $d \times r$ factors with $r \ll d$.
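A sketch of the dense vs. sparse representation using scipy.sparse, which stores the same (indices, values) pair shown above:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([1., 0., 0., 0., 0., 0., 3.])

# The sparse form stores only (size, indices, values):
# size = 7, indices = [0, 6], values = [1., 3.]
sparse = csr_matrix(dense)
print(sparse.shape, sparse.indices, sparse.data)   # (1, 7) [0 6] [1. 3.]
```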
Another idea: use an iterative first-order method (gradient descent) instead of the closed form. Same example (n = 6, 3 workers), per iteration:
map: each worker computes the gradient contribution $\left(\mathbf{w}^\top \mathbf{x}^{(i)} - y^{(i)}\right) \mathbf{x}^{(i)}$ for its local points ($O(nd)$ distributed storage, $O(nd)$ distributed computation, $O(d)$ local storage).
reduce: sum the d-dimensional contributions and update $\mathbf{w}$ ($O(d)$ local computation, $O(d)$ local storage).
This replaces the closed form's $O(nd^2)$ distributed computation and $O(d^3)$/$O(d^2)$ local costs with $O(nd)$ and $O(d)$ per iteration.
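A minimal single-machine emulation of this per-iteration map/reduce, assuming the first-order method is plain gradient descent on the least-squares objective; the step size, iteration count, and partitioning are placeholders:

```python
import numpy as np

# Synthetic data, partitioned across 3 "workers" as in the example above.
rng = np.random.default_rng(3)
n, d = 6, 3
points = [rng.normal(size=d) for _ in range(n)]
labels = rng.normal(size=n)
workers = [list(zip(points, labels))[i::3] for i in range(3)]

w = np.zeros(d)
alpha = 0.1          # placeholder step size
for _ in range(500):
    # map: each worker computes (w.x - y) x for its local points
    # (O(nd) distributed computation; each partial result is O(d)).
    grads = [sum((w @ x - y) * x for x, y in part) for part in workers]
    # reduce: sum the d-dimensional contributions and update w (O(d)).
    w = w - (alpha / n) * sum(grads)
print(w)  # approaches the least-squares solution
```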