Univariate Smoothing


Univariate Smoothing Overview

• Problem definition
• Interpolation
• Polynomial smoothing
• Cubic splines
• Basis splines
• Smoothing splines
• Bayes' rule
• Density estimation
• Kernel smoothing
• Local averaging
• Weighted least squares
• Local linear models
• Prediction error estimates

Problem Definition & Interpolation

• Smoothing Problem: Given a data set with a single input variable x, find the best function ĝ(x) that minimizes the prediction error on new inputs (probably not in the data set)
• Interpolation Problem: Same as the smoothing problem except the model is subject to the constraint ĝ(xi) = yi for every input-output pair (xi, yi) in the data set
  – Linear Interpolation: Use a line between each pair of points
  – Nearest Neighbor Interpolation: Find the nearest input in the data set and use the corresponding output as an approximate fit
  – Polynomial Interpolation: Fit a polynomial of order n − 1 to the n input-output data points: ĝ(x) = Σ_{i=1}^{n} wi x^(i−1)
  – Cubic Spline Interpolation: Fit a cubic polynomial with continuous second derivatives in between each pair of points (more on this later)

Interpolation versus Extrapolation

• Interpolation is technically defined only for inputs that are within the range of the data set: mini xi ≤ x ≤ maxi xi
• If an input is outside of this range, the model is said to be extrapolating
• A good model should do reasonable things for both cases
• Extrapolation is a much harder problem

Example 1: Linear Interpolation

[Figure: chirp data (red points) with a linear interpolation fit; Input x vs. Output y]
Example 1: MATLAB Code

% function [] = Interpolation();
close all;

N = 15;
rand('state',2);
x = rand(N,1);
y = sin(2*pi*2*x.^2) + 0.2*randn(N,1);
xt = (0:0.0001:1)';        % Test inputs

% ================================================
% Linear Interpolation
% ================================================
figure;
FigureSet(1,'LTX');
yh = interp1(x,y,xt,'linear');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Linear Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationLinear;

% ================================================
% Nearest Neighbor Interpolation
% ================================================
figure;
FigureSet(1,'LTX');
yh = interp1(x,y,xt,'nearest');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Nearest Neighbor Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationNearestNeighbor;

% ================================================
% Polynomial Interpolation
% ================================================
A = zeros(N,N);
for cnt = 1:size(A,2),
    A(:,cnt) = x.^(cnt-1);
end;
w = pinv(A)*y;
At = zeros(length(xt),N);
for cnt = 1:size(A,2),
    At(:,cnt) = xt.^(cnt-1);
end;
yh = At*w;

figure;
FigureSet(1,'LTX');
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Polynomial Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationPolynomial;

% ================================================
% Cubic Spline Interpolation
% ================================================
figure;
FigureSet(1,'LTX');
yh = spline(x,y,xt);
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Cubic Spline Interpolation');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationCubicSpline;

% ================================================
% Optimal Model (true function)
% ================================================
figure;
FigureSet(1,'LTX');
yt = sin(2*pi*2*xt.^2);
h = plot(xt,yt,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Optimal Model');
set(gca,'Box','Off');
grid on;
axis([0 1 -2 2]);
AxisSet(8);
print -depsc InterpolationOptimalModel;
Example 2: Nearest Neighbor Interpolation

[Figure: chirp data (red points) with a nearest neighbor interpolation fit; Input x vs. Output y]

Example 2: MATLAB Code

Same data set and test inputs as the linear interpolation example.

Example 3: Polynomial Interpolation

[Figure: chirp data (red points) with a polynomial interpolation fit; Input x vs. Output y]

Example 3: MATLAB Code

Same data set and test inputs as the linear interpolation example.
Example 4: Cubic Spline Interpolation

[Figure: chirp data (red points) with a cubic spline interpolation fit; Input x vs. Output y]

Example 4: MATLAB Code

Same data set and test inputs as the linear interpolation example.

Interpolation Comments

• There are an infinite number of functions that satisfy the interpolation constraint: ĝ(xi) = yi ∀i
• Of course, we would like to choose the model that minimizes the prediction error
• Given only data, there is no way to do this exactly
• Our data set only specifies what ĝ(x) should be at specific points
• What should it be in between these points?
• In practice, the method of interpolation is usually chosen by the user

Smoothing

• For the smoothing problem, even the constraint is relaxed: ĝ(xi) ≈ yi ∀i
• The data set can be merely suggesting what the model output should be approximately at some specified points
• We need another constraint or assumption about the relationship between x and y to have enough constraints to uniquely specify the model
Smoothing Assumptions and Statistical Model

  y = g(x) + ε

• Generally we assume that the data was generated from the statistical model above
• εi is a random variable with the following assumed properties
  – Zero mean: E[ε] = 0
  – εi and εj are independently distributed for i ≠ j
  – εi is identically distributed
• Two additional assumptions are usually made for the smoothing problem
  – g(x) is continuous
  – g(x) is smooth

Example 5: Interpolation Optimal Model

[Figure: chirp data (red points) with the true noise-free function; Input x vs. Output y]

Smoothing

  yi = g(xi) + εi

• When we add noise, we can drop the interpolation constraint ĝ(xi) = yi ∀i
• But we still want ĝ(·) to be consistent with (i.e. close to) the data: ĝ(xi) ≈ yi
• The methods we will discuss are biased in favor of models that are smooth
• This can also be framed as a bias-variance tradeoff

Bias-Variance Tradeoff

Recall that

  MSE(x) = E[(g(x) − ĝ(x))²]
         = (g(x) − E[ĝ(x)])² + E[(ĝ(x) − E[ĝ(x)])²]
         = Bias² + Variance

• Fundamental smoother tradeoff:
  – Smoothness of the estimate ĝ(x)
  – Fit to the data
Bias-Variance Tradeoff Continued

  MSE(x) = (g(x) − E[ĝ(x)])² + E[(ĝ(x) − E[ĝ(x)])²]

• Smooth models
  – Less sensitive to the data
  – Less variance
  – Potentially high bias since they don't fit the data well
• Flexible models
  – Sensitive to the data
  – In the most extreme case, they interpolate the data
  – High variance since they are sensitive to the data
  – Low bias

Example 6: Univariate Smoothing Data

[Figure: Motorcycle data set; Input x vs. Output y]

Example 6: Smoothing Problem MATLAB Code

function [] = SmoothingProblem();

A = load('MotorCycle.txt');
x = A(:,1);
y = A(:,2);

figure;
FigureSet(1,'LTX');
h = plot(x,y,'r.');
set(h,'MarkerSize',6);
xlabel('Input x');
ylabel('Output y');
title('Motorcycle Data Set');
set(gca,'Box','Off');
grid on;
ymin = min(y);
ymax = max(y);
yrng = ymax - ymin;
ymin = ymin - 0.05*yrng;
ymax = ymax + 0.05*yrng;
axis([min(x) max(x) ymin ymax]);
AxisSet(8);
print -depsc SmoothingProblem;
return;

% ================================================
% Linear
% ================================================
figure;
FigureSet(1,4.5,2.8);
A = [ones(N,1) x];
w = pinv(A)*y;
yh = [ones(size(xt)) xt]*w;
h = plot(xt,yh,'b',x,y,'r.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Chirp Linear Least Squares');
set(gca,'Box','Off');
grid on;
ymin = min(y);
ymax = max(y);
yrng = ymax - ymin;
ymin = ymin - 0.05*yrng;
ymax = ymax + 0.05*yrng;
axis([min(x) max(x) ymin ymax]);
AxisSet(8);
% print -depsc Test;
print -depsc LinearLeastSquares;
Polynomial Smoothing

• We can fit a polynomial ĝ(x) = Σ_{i=0}^{p−1} wi x^i to the data using the linear modeling methods
• Note that linear models are linear in the parameters wi
• They need not be linear in the inputs
• Alternatively, you can think of this as a linear model with p different inputs where the ith input is given by xi = x^i
• This model is smooth in the sense that all derivatives of ĝ(x) are continuous
• This is one measure of model smoothness
• In general, this is a terrible smoother
  – Terrible at extrapolation
  – The matrix inverse is often poorly conditioned and regularization is necessary
  – The user has to pick the order of the polynomial p − 1

Example 7: Polynomial Smoothing

[Figure: Motorcycle data with linear, quadratic, cubic, 4th order, and 5th order polynomial fits; Input x vs. Output y]
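A minimal sketch of polynomial smoothing by least squares, assuming the Motorcycle data layout used elsewhere in these notes. The full course script is only referenced on the next slide (Matlab/PolynomialSmoothing.m); everything below is illustrative and not that file. The loop fits model orders p = 2 through 6, i.e. the linear through 5th order fits shown in the figure above.

% Sketch: polynomial smoothing by least squares (illustrative, not PolynomialSmoothing.m)
A  = load('MotorCycle.txt');
x  = A(:,1);
y  = A(:,2);
xt = (-10:0.1:70)';               % Test inputs

figure;
hold on;
plot(x,y,'k.');
for p = 2:6,                       % Number of parameters p (polynomial order p-1)
    D  = zeros(length(x),p);       % Design matrix: ith column is x.^(i-1)
    Dt = zeros(length(xt),p);
    for i = 1:p,
        D(:,i)  = x.^(i-1);
        Dt(:,i) = xt.^(i-1);
    end;
    w = pinv(D)*y;                 % Least squares weights via the pseudoinverse
    plot(xt,Dt*w);                 % Smoothed estimate ghat(x) on the test grid
end;
hold off;
xlabel('Input x');
ylabel('Output y');
title('Motorcycle Data Polynomial Smoothing');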

Example 7: MATLAB Code

Matlab/PolynomialSmoothing.m

Cubic Splines

• Cubic splines are modeled after the properties of flexible rods ship designers used to use to draw smooth curves
• The rod would be rigidly constrained to go through specific points (interpolation)
• The rod smoothly bent from one point to the next
• The rod naturally minimized its bending energy (i.e. curvature)
• This can be approximated by a piecewise cubic polynomial
Cubic Splines Functional Form

  ĝ(x) = Σ_{i=0}^{3} wi(x) x^i

• Unlike polynomial regression, here the parameters wi(x) are also a function of x
• Consider a class of functions ĝ(x) that have the following properties
  – Continuous
  – Continuous 1st derivative
  – Continuous 2nd derivative
  – Interpolates the data: ĝ(xi) = yi

Cubic Splines Smoothness Definition

• Out of all the functions that meet the above criteria, consider those that also minimize the approximate "curvature" of ĝ(x)

  C ≡ ∫_{xmin}^{xmax} (d²ĝ(x)/dx²)² dx

• These are piecewise cubics and are called cubic splines
• In the sense of satisfying the criteria listed above and minimizing the curvature C, cubic splines are optimal
• Even with all of these constraints, ĝ(x) is not uniquely specified
• There are several cubic splines that meet the strict criteria and have the same curvature
• The most popular additional constraints are

  ĝ''(xmin) = 0    ĝ''(xmax) = 0

• These are called natural cubic splines

Cubic Spline Constraints

• Cubic splines are piecewise cubic
• This means ĝ(x) = Σ_{i=0}^{3} wi(x) x^i has different weights between each pair of points
• For the entire region between each pair of points, the weights are fixed
• Each polynomial is defined by 4 parameters wi(x)
• We have n + 1 regions where n is the number of points in the data set
• Thus, we need at least 4 × (n + 1) constraints for each region to uniquely specify the weights

Cubic Spline Constraints Continued

Let pk(x) be the polynomial between the points xk and xk+1. We need 4 × (n + 1) constraints to have the problem well defined.

  Property                     Expression                          Constraints
  Interpolation                ĝ(xi) = yi                          n
  Continuous                   pk(xk+1) = pk+1(xk+1)               n
  Continuous Derivative        p'k(xk+1) = p'k+1(xk+1)             n
  Continuous 2nd Derivative    p''k(xk+1) = p''k+1(xk+1)           n

Natural splines have 4 additional constraints

  p''0(x1) = 0    p'''0(x1) = 0
  p''n(xn) = 0    p'''n(xn) = 0
Basis Splines

• You could solve for the 4(n + 1) model coefficients by solving a set of 4(n + 1) linear equations
• This is cumbersome and very inefficient mathematically
• An easier way is to use basis functions
• Mathematically, each basis function is defined recursively

  bi,j(x) = ((x − kj)/(ki+j − kj)) bi−1,j(x) + ((ki+j+1 − x)/(ki+j+1 − kj+1)) bi−1,j+1(x)

• Basis splines also have the nice property that they sum to unity

  Σ_{j=1−i}^{n−1} bi,j(x) = 1    ∀x ∈ [k1, kn]

Basis Splines Continued

• The output of our model can then be written as

  ĝ(x) = Σ_{i=−2}^{n−1} wi b3,i(x)

• Numerically, this can be solved much more quickly (the order is proportional to n)
• Since the basis functions have finite support (i.e. finite span) the equivalent A matrix is banded
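As a concrete illustration of the recursion above, here is a minimal MATLAB sketch that evaluates cubic basis functions on a uniform knot grid by applying the recursion directly. The knot vector, degree-0 base case, and variable names are assumptions for illustration, not the course's BasisFunctions.m; the resulting curves are the kind of bumps shown in the Example 8 figures below.

% Sketch: evaluate B-spline basis functions via the recursion above (illustrative).
k  = 0:10;                           % Knot locations k_1,...,k_11 (uniform here)
xt = (k(1):0.01:k(end))';            % Evaluation grid
d  = 3;                              % Cubic basis functions

nb = length(k) - 1;                  % Number of degree-0 (indicator) functions
B  = zeros(length(xt),nb);           % B(:,j) holds b_{0,j}(x): 1 on [k_j, k_{j+1})
for j = 1:nb,
    B(:,j) = (xt >= k(j)) & (xt < k(j+1));
end;

for i = 1:d,                         % Raise the degree one step at a time
    nb = nb - 1;
    Bn = zeros(length(xt),nb);
    for j = 1:nb,
        a1 = (xt - k(j))    /(k(i+j)   - k(j));    % (x - k_j)/(k_{i+j} - k_j)
        a2 = (k(i+j+1) - xt)/(k(i+j+1) - k(j+1));  % (k_{i+j+1} - x)/(k_{i+j+1} - k_{j+1})
        Bn(:,j) = a1.*B(:,j) + a2.*B(:,j+1);
    end;
    B = Bn;
end;

plot(xt,B);                          % Each column is one cubic basis function b_{3,j}(x)
xlabel('Input x');
ylabel('Basis function value');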

Example 8: Basis Function 0

[Figure: basis function B0(x); Input x vs. Output y]

Example 8: Basis Function 1

[Figure: basis function B1(x); Input x vs. Output y]
Example 8: Basis Function 2

[Figure: basis function B2(x); Input x vs. Output y]

Example 8: Basis Function 3

[Figure: basis function B3(x); Input x vs. Output y]

Example 8: MATLAB Code

Matlab/BasisFunctions.m

Smoothing Splines

• For smoothing, we do not require ĝ(xi) = yi
• But we would like it to be close: ĝ(xi) ≈ yi
• How do we trade off smoothness (low variance) for a good fit to the data (low bias)?
• One way is to find the ĝ(x) that minimizes the following performance criterion:

  Eλ = Σ_{i=1}^{n} (yi − ĝ(xi))² + λ ∫_{−∞}^{+∞} (ĝ''(x))² dx

• Contrast to cubic splines in which we required the first term to be zero
• The second term is a roughness penalty
Smoothing Splines Continued

  Eλ = Σ_{i=1}^{n} (yi − ĝ(xi))² + λ ∫_{−∞}^{+∞} (ĝ''(x))² dx

• λ is a user-specified parameter that controls the tradeoff
• It turns out the optimal solution (in the sense of minimizing Eλ) is a smoothing spline
• A smoothing spline is identical to a cubic spline in form
  – There is a 3rd order polynomial between each pair of points
  – Same number of knots
  – Same number of different sets of polynomials
• Unlike the cubic spline, we now drop the constraint that ĝ(xi) = yi
• Instead, ĝ(xi) = ỹi for some set of ỹi

Smoothing Splines Comments

  Eλ = Σ_{i=1}^{n} (yi − ĝ(xi))² + λ ∫_{−∞}^{+∞} (ĝ''(x))² dx

• Smoothing splines are smooth in the same sense as cubic splines
  – If cubic, the second derivative is continuous
  – If quadratic, the first derivative is continuous
  – If linear, the function is continuous
• If cubic smoothing spline, then
  – As λ → ∞, ĝ(x) approaches a linear least squares fit to the data (i.e. ĝ''(x) → 0)
  – As λ → 0, ĝ(x) becomes an interpolating cubic spline
• This is implemented in MATLAB as csaps
• Instead of λ, it takes an equivalent parameter scaled between 0 (linear least squares fit) and 1 (cubic spline interpolation)
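A minimal csaps usage sketch, assuming the Curve Fitting Toolbox and the Motorcycle data used in the examples that follow. The smoothing parameter p here plays the role of the 0-to-1 parameter described above; the unique/averaging step mirrors what the Example 10 code does because the raw data contains repeated input values.

% Sketch: cubic smoothing spline with csaps (illustrative).
A  = load('MotorCycle.txt');
xr = A(:,1);
yr = A(:,2);
x  = unique(xr);                  % Average outputs at repeated inputs
y  = zeros(size(x));
for cnt = 1:length(x),
    y(cnt) = mean(yr(xr==x(cnt)));
end;
xt = (-10:0.2:70)';
p  = 0.01;                        % 0 -> linear least squares fit, 1 -> interpolating cubic spline
yh = csaps(x',y',p,xt')';         % Smoothed estimate on the test grid
plot(xt,yh,'b',x,y,'k.');
xlabel('Input x');
ylabel('Output y');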

Example 9: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression for α = 1.0, α = 0.5, and α = 0.0; Input x vs. Output y]

Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.0001; Input x vs. Output y]
Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.0010; Input x vs. Output y]

Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.0100; Input x vs. Output y]

Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.2000; Input x vs. Output y]

Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.5000; Input x vs. Output y]
Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.9000; Input x vs. Output y]

Example 10: Smoothing Spline

[Figure: Motorcycle data smoothing spline regression, α = 0.9900; Input x vs. Output y]

Example 10: MATLAB Code

function [] = SmoothingSplineEx();

close all;

A = load('MotorCycle.txt');
xr = A(:,1); % Raw values
yr = A(:,2); % Raw values

x = unique(xr);
y = zeros(size(x));
for cnt = 1:length(x),
    y(cnt) = mean(yr(xr==x(cnt)));
end;

N  = size(A,1);      % No. data set points
xt = (-10:0.2:70)';
NT = length(xt);     % No. test points
NS = 3;              % No. of different splines

yh = zeros(NT,NS);
yh(:,3) = csaps(x',y',0  ,xt')';
yh(:,2) = csaps(x',y',0.5,xt')';
yh(:,1) = csaps(x',y',1.0,xt')';
FigureSet(1,'LTX');
h = plot(xt,yh,x,y,'k.');
set(h,'MarkerSize',8);
set(h,'LineWidth',1.2);
xlabel('Input x');
ylabel('Output y');
title('Motorcycle Data Smoothing Spline Regression');
set(gca,'Box','Off');
grid on;
axis([-10 70 -150 90]);
AxisSet(8);
legend('\alpha = 1.0','\alpha = 0.5','\alpha = 0.0',4);
print -depsc SmoothingSplineEx;

L = [0.0001 0.001 0.01 0.2 0.5 0.9 0.99];
for cnt = 1:length(L),
    alpha = L(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = csaps(x',y',alpha,xt')';
    h = plot(xt,yh,x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Smoothing Spline Regression \\alpha=%6.4f',alpha);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);
    AxisSet(8);
    st = sprintf('print -depsc SmoothingSplineEx%04d;',round(alpha*10000));
    eval(st);
end;
Review of Bayes' Rule

• Bayes' rule says that two discrete-valued random variables A and B have the following relationship

  Pr{B|A} = Pr{A, B} / Pr{A} = Pr{A|B} Pr{B} / Pr{A}

• Recall that earlier we found that the ĝ(x) that minimizes the MSE is given by

  Ŷ = g*(x) = E[Y|X = x]

• For smoothing, we can use the continuous analog of Bayes' rule to estimate E[Y|X = x]

  f(y|X = x) = f(x, y) / f(x) = f(x|Y = y) f(y) / f(x)

Continuous Bayes' Rule

  f(y|X = x) = f(x, y) / f(x) = f(x|Y = y) f(y) / f(x)

• E[Y|X = x] is given by

  E[Y|X = x] = ∫_{−∞}^{+∞} y f(y|X = x) dy

• In order to estimate these equations we need a means of estimating the densities f(x) and f(x, y)
• A popular method of estimating a density is to add a series of "bumps" together
• The bumps are called kernels and should have the following property

  ∫_{−∞}^{+∞} bσ(u) du = 1

Density Estimation

• Then a kernel density estimator is simply expressed as

  f̂(x) = (1/n) Σ_{i=1}^{n} bσ(x − xi)

• The width of the kernel is specified by σ. Typically

  bσ(u) = (1/σ) b(u/σ)

  where it is easy to show that ∫ bσ(u) du = 1 for any value of σ
• Bumps shaped like a Gaussian are popular

  b(u) = (1/√(2π)) e^(−u²/2)

• Typically the bumps have even symmetry: b(u) = b(−u) = b(|u|)

Example 11: Density Estimation

[Figure: Motorcycle data kernel density estimate, w = 0.1; Input x vs. Density p(x)]
Example 11: Density Estimation

[Figure: Motorcycle data kernel density estimate, w = 0.2; Input x vs. Density p(x)]

Example 11: Density Estimation

[Figure: Motorcycle data kernel density estimate, w = 0.5; Input x vs. Density p(x)]

Example 11: Density Estimation

[Figure: Motorcycle data kernel density estimate, w = 1.0; Input x vs. Density p(x)]

Example 11: Density Estimation

[Figure: Motorcycle data kernel density estimate, w = 5.0; Input x vs. Density p(x)]
Example 11: MATLAB Code

function [] = DensityEx();

close all;

A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values

W = [0.1 0.2 0.5 1.0 5.0];

xt = (-10:0.05:70)';

for c1 = 1:length(W),
    w  = W(c1);
    bs = zeros(size(xt)); % Bump sum
    for c2 = 1:length(x),
        bs = bs + exp(-(xt-x(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
    end;
    bs = bs/length(x);

    figure;
    FigureSet(1,'LTX');
    h = plot(x,zeros(size(x)),'k.',xt,bs);
    set(h,'LineWidth',1.5);
    xlabel('Input x');
    ylabel('Density p(x)');
    st = sprintf('Motorcycle Data Density Estimation w=%5.1f',w);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 0 0.1]);
    AxisSet(8);
    st = sprintf('print -depsc DensityEx%02d;',round(w*10));
    eval(st);
end;

Density Estimation in Higher Dimensions

• Density estimation can be extended to higher dimensions in the obvious way

  f̂(x) = (1/n) Σ_{i=1}^{n} Π_{j=1}^{p} bσ(xj − xi,j)

  where xj is the jth element of the input vector x and xi,j is the jth element of the ith input vector in the data set
• Although you can use this for large values of p, it is not recommended
• The estimate becomes inaccurate very quickly as the number of dimensions grows
• For one or two dimensions this is a pretty good technique

Example 12: 2D Density Estimation

[Figure: Motorcycle data input-output density estimate, w = 0.05; Input x vs. Output y with color-coded density]
Example 12: 2D Density Estimation

[Figure: Motorcycle data input-output density estimate, w = 0.10; Input x vs. Output y with color-coded density]

Example 12: 2D Density Estimation

[Figure: Motorcycle data input-output density estimate, w = 0.20; Input x vs. Output y with color-coded density]

Example 12: 2D Density Estimation

[Figure: Motorcycle data input-output density estimate, w = 0.50; Input x vs. Output y with color-coded density]

Example 12: 2D Density Estimation

[Figure: Motorcycle data input-output density estimate, w = 1.00; Input x vs. Output y with color-coded density]
Example 12: MATLAB Code

function [] = DensityEx2D();

close all;

A  = load('MotorCycle.txt');
xr = A(:,1); % Raw values
yr = A(:,2); % Raw values
xm = mean(xr);
ym = mean(yr);
xs = std(xr);
ys = std(yr);

x = (xr-xm)/xs;
y = (yr-ym)/ys;

W = [0.05 0.1 0.2 0.5 1.0];

xst = -2.0:0.02:2.5; % X-test points
yst = -2.5:0.02:2.5; % Y-test points
[xmt,ymt] = meshgrid(xst,yst); % Grids of scaled test points

xt = xst*xs + xm; % Unscaled x-test values
yt = yst*ys + ym; % Unscaled y-test values

for c1 = 1:length(W),
    w  = W(c1);
    bs = zeros(size(xmt)); % Bump sum
    for c2 = 1:length(x),
        bx = exp(-(xmt-x(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
        by = exp(-(ymt-y(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
        bs = bs + bx.*by;
    end;
    bs = bs/length(x);

    figure;
    FigureSet(1,'LTX');
    h = imagesc(xt,yt,bs);
    hold on;
    h = plot(xr,yr,'k.',xr,yr,'w.');
    set(h(1),'MarkerSize',4);
    set(h(2),'MarkerSize',2);
    hold off;
    set(gca,'YDir','Normal');
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Input-Output Density Estimation w=%5.2f',w);
    title(st);
    set(gca,'Box','Off');
    colorbar;
    AxisSet(8);
    st = sprintf('print -depsc DensityEx2D%03d;',round(w*100));
    eval(st);
end;

Density Estimation and Scaling

• In higher dimensions it is important to scale each input to have the same variance
• The following example shows the same data set without scaling
• Notice the oval-shaped bumps

Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 0.50; Input x vs. Output y with color-coded density]
Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 1.00; Input x vs. Output y with color-coded density]

Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 5.00; Input x vs. Output y with color-coded density]

Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 10.00; Input x vs. Output y with color-coded density]

Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 20.00; Input x vs. Output y with color-coded density]
Example 13: 2D Density Estimation

[Figure: Motorcycle data density estimate without scaling, w = 50.00; Input x vs. Output y with color-coded density]

Example 13: MATLAB Code

function [] = DensityEx2Db();
% This is the same as DensityEx2D, except no scaling is used.
close all;

A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values

W = [0.5 1.0 2.0 5.0 10.0 20.0 50.0];

xt = 0:0.5:60;   % X-test points
yt = -150:90;    % Y-test points
[xmt,ymt] = meshgrid(xt,yt); % Grids of test points

for c1 = 1:length(W),
    w  = W(c1);
    bs = zeros(size(xmt)); % Bump sum
    for c2 = 1:length(x),
        bx = exp(-(xmt-x(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
        by = exp(-(ymt-y(c2)).^2/(2*w.^2))/sqrt(2*pi*w^2);
        bs = bs + bx.*by;
    end;
    bs = bs/length(x);
    figure;
    FigureSet(1,'LTX');
    h = imagesc(xt,yt,bs);
    hold on;
    h = plot(x,y,'k.',x,y,'w.');
    set(h(1),'MarkerSize',4);
    set(h(2),'MarkerSize',2);
    hold off;
    set(gca,'YDir','Normal');
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data No Scaling Density Estimation w=%6.2f',w);
    title(st);
    set(gca,'Box','Off');
    colorbar;
    AxisSet(8);
    st = sprintf('print -depsc DensityEx2Db%03d;',round(w*10));
    eval(st);
end;

Kernel Smoothing Derivation

The following equations compose the Nadaraya-Watson estimator of E[y|x]

  E[y|x] = ∫_{−∞}^{∞} y f(y|x) dy = ∫_{−∞}^{∞} y (f(x, y) / f(x)) dy = (∫_{−∞}^{∞} y f(x, y) dy) / f(x)

The two densities can be estimated as follows

  f̂(x, y) = (1/n) Σ_{i=1}^{n} bσ(|x − xi|) · bσ(|y − yi|)

  f̂(x) = (1/n) Σ_{i=1}^{n} bσ(|x − xi|)
Kernel Smoothing Derivation Continued (1)

  E[y|x] ≈ (∫_{−∞}^{∞} y f̂(x, y) dy) / f̂(x)

  f̂(x) E[y|x] ≈ ∫_{−∞}^{∞} y (1/n) Σ_{i=1}^{n} bσ(|x − xi|) · bσ(|y − yi|) dy

             = (1/n) Σ_{i=1}^{n} bσ(|x − xi|) ∫_{−∞}^{∞} y bσ(|y − yi|) dy

             = (1/n) Σ_{i=1}^{n} bσ(|x − xi|) ∫_{−∞}^{∞} (y − yi + yi) bσ(|y − yi|) dy

             = (1/n) Σ_{i=1}^{n} bσ(|x − xi|) × [ yi ∫_{−∞}^{∞} bσ(|y − yi|) dy + ∫_{−∞}^{∞} (y − yi) bσ(|y − yi|) dy ]

Kernel Smoothing Derivation Continued (2)

  f̂(x) E[y|x] ≈ (1/n) Σ_{i=1}^{n} bσ(|x − xi|) × [ yi + ∫_{−∞}^{∞} u bσ(|u|) du ]

             = (1/n) Σ_{i=1}^{n} yi bσ(|x − xi|)

since the kernel integrates to one and, by its even symmetry, ∫ u bσ(|u|) du = 0. Thus

  E[y|x] ≈ [ (1/n) Σ_{i=1}^{n} yi bσ(|x − xi|) ] / [ (1/n) Σ_{i=1}^{n} bσ(|x − xi|) ]
         = Σ_{i=1}^{n} yi bσ(|x − xi|) / Σ_{i=1}^{n} bσ(|x − xi|)

Kernel Smoothing Derivation Continued

Thus, by combining the equations on the previous slides we obtain

  E[y|x] ≈ ĝ(x) = Σ_{i=1}^{n} yi bσ(|x − xi|) / Σ_{i=1}^{n} bσ(|x − xi|)

• Popular kernels include
  – Epanechnikov: b(u) = c (1 − u²) p(u)
  – Biweight: b(u) = c (1 − u²)² p(u)
  – Triweight: b(u) = c (1 − u²)³ p(u)
  – Triangular: b(u) = c (1 − |u|) p(u)
  – Gaussian: b(u) = c e^(−u²)
  – Sinc: b(u) = c sinc(u)
• Here c is a constant chosen to meet the constraint ∫_{−∞}^{∞} b(u) du = 1
• p(u) is the unit pulse: p(u) = 1 for |u| ≤ 1 and 0 otherwise

Example 14: Kernels

[Figure: the six kernels above (Epanechnikov, Biweight, Triweight, Triangular, Gaussian, Sinc) plotted on a common axis]
Example 14: MATLAB Code

function [] = Kernels();

ST = 0.01;
x  = (-2.2:ST:2.2)';
u  = abs(x);
I  = (u<=1);

kep = (1-u.^2)   .*I;  % Epanechnikov
kbw = (1-u.^2).^2.*I;  % Biweight
ktw = (1-u.^2).^3.*I;  % Triweight
ktr = (1-u)      .*I;  % Triangular
kga = exp(-u.^2);      % Gaussian
ksn = sinc(u);         % Sinc

kep = kep/(sum(kep)*ST); % Normalize
kbw = kbw/(sum(kbw)*ST); % Normalize
ktw = ktw/(sum(ktw)*ST); % Normalize
ktr = ktr/(sum(ktr)*ST); % Normalize
kga = kga/(sum(kga)*ST); % Normalize
ksn = ksn/(sum(ksn)*ST); % Normalize

K = [kep kbw ktw ktr kga ksn];

L = {'Epanechnikov','Biweight','Triweight','Triangular','Gaussian','Sinc'};

FigureSet(1,4.5,2.8);
for cnt = 1:6,
    subplot(2,3,cnt);
    h = plot([-5 5],[0 0],'k:',x,K(:,cnt));
    set(h(2),'LineWidth',1.5);
    title(char(L(cnt)));
    box off;
    axis([min(x) max(x) -0.3 1.2]);
end;

AxisSet(8);
print -depsc Kernels;

Kernel Smoothing Comments

  E[y|x] ≈ ĝ(x) = Σ_{i=1}^{n} yi bσ(|x − xi|) / Σ_{i=1}^{n} bσ(|x − xi|)

• Kernel smoothing can be written as a weighted average

  ĝ(x) = Σ_{i=1}^{n} yi wi(x)    where    wi(x) = bσ(|x − xi|) / Σ_{j=1}^{n} bσ(|x − xj|)

• Note that by definition Σ_{i=1}^{n} wi(x) = 1
• If all the weights were equal, wi(x) = 1/n, then ĝ(x) = ȳ
• This occurs as σ → ∞

Kernel Smoothing Effect of Support

  E[y|x] ≈ ĝ(x) = Σ_{i=1}^{n} yi bσ(|x − xi|) / Σ_{i=1}^{n} bσ(|x − xi|)

As the width decreases (σ ↓) one of two things happens
• If b(u) has infinite support,
  – All of the equivalent weights become nearly equal to zero
  – The weight from the nearest neighbor dominates
  – Thus ĝ(x) does nearest neighbor interpolation as σ → 0
• If b(u) has finite support,
  – At some values of x all of the weights may be 0
  – If this happens, ĝ(x) at these points is not defined (depends on the implementation)
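The Kernel helper called by the Example 15 code later in these notes is not reproduced here. A minimal Nadaraya-Watson smoother along the lines described above, with a Gaussian kernel, might look like the following sketch. The function name, argument order, bandwidth convention, and handling of empty neighborhoods are assumptions for illustration, not the course's implementation.

% Sketch: Nadaraya-Watson kernel smoother with a Gaussian kernel (illustrative).
function [yh] = KernelSmooth(x,y,xt,w);

yh = zeros(size(xt));
for cnt = 1:length(xt),
    b = exp(-(x-xt(cnt)).^2/(2*w^2));  % Kernel weights b_sigma(|x - x_i|)
    if sum(b)>0,
        yh(cnt) = sum(b.*y)/sum(b);    % Weighted average of the outputs
    else
        yh(cnt) = NaN;                 % Undefined when all weights vanish (finite-support kernels)
    end;
end;

For example, yh = KernelSmooth(x,y,xt,2); on the Motorcycle data produces a curve in the same spirit as the Gaussian-kernel panels of Example 15, though the course's Kernel function may use a different bandwidth convention.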
Kernel Smoothing Bias-Variance Tradeoff

  E[y|x] ≈ ĝ(x) = Σ_{i=1}^{n} yi bσ(|x − xi|) / Σ_{i=1}^{n} bσ(|x − xi|)

• Thus, as with smoothing splines there is a single parameter that controls the tradeoff of smoothness (high bias) for the ability of the model to fit the data (high variance)
• Kernel smoothers have bounded outputs

  mini yi ≤ minx ĝ(x) ≤ ĝ(x) ≤ maxx ĝ(x) ≤ maxi yi

• In this sense, they are more stable than smoothing splines
• Recall smoothing splines diverge outside of the data range
• However, kernel smoothers are more likely to round off sharp edges, peaks, and troughs

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Epanechnikov kernel, w = 0.1; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Epanechnikov kernel, w = 1.0; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Epanechnikov kernel, w = 2.0; Input x vs. Output y]
Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Epanechnikov kernel, w = 5.0; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Epanechnikov kernel, w = 10.0; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Gaussian kernel, w = 0.1; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Gaussian kernel, w = 1.0; Input x vs. Output y]
Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Gaussian kernel, w = 2.0; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Gaussian kernel, w = 5.0; Input x vs. Output y]

Example 15: Kernel Smoothing

[Figure: Motorcycle data kernel smoothing, Gaussian kernel, w = 10.0; Input x vs. Output y]

Example 15: MATLAB Code

function [] = KernelSmoothingEx();

close all;

A = load('MotorCycle.txt');
x = A(:,1); % Raw values
y = A(:,2); % Raw values

xt = (-10:0.05:70)';

W = [0.1 1.0 2.0 3.0 5.0 10.0];

% Epanechnikov Kernel
for cnt = 1:length(W),
    w = W(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = Kernel(x,y,xt,w,2);
    h = plot(xt,yh,'b',x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Kernel Smoothing Epanechnikov Kernel w=%6.4f',w);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);
    AxisSet(8);
    st = sprintf('print -depsc EKernelSmoothingEx%03d;',round(w*10));
    eval(st);
end;

% Gaussian Kernel
for cnt = 1:length(W),
    w = W(cnt);
    figure;
    FigureSet(1,'LTX');
    yh = Kernel(x,y,xt,w,1);
    h = plot(xt,yh,'b',x,y,'k.');
    set(h,'MarkerSize',8);
    set(h,'LineWidth',1.2);
    xlabel('Input x');
    ylabel('Output y');
    st = sprintf('Motorcycle Data Kernel Smoothing Gaussian Kernel w=%6.4f',w);
    title(st);
    set(gca,'Box','Off');
    grid on;
    axis([-10 70 -150 90]);
    AxisSet(8);
    st = sprintf('print -depsc GKernelSmoothingEx%03d;',round(w*10));
    eval(st);
end;

Local Averaging

  ĝ(x) = Σ_{i=1}^{n} wi(x) yi

• We saw that kernel smoothers can be viewed as a weighted average
• Instead, we could take a local average of the k-nearest neighbors of x

  ĝ(x) = (1/k) Σ_{i=1}^{k} yc(i)

  where c(i) is the data set index of the ith nearest point
• For this type of model, k controls the smoothness

Local Averaging Concept

[Figure: Motorcycle data set with the k = 10 neighborhood of the query point q = 30 shaded and the local average drawn; Time (ms) vs. Head Acceleration (g)]

MATLAB Code

function [] = LocalAverageConcept();

D = load('Motorcycle.txt');
x = D(:,1);
y = D(:,2);

[x,is] = sort(x);
y = y(is);

q = 30; % Query point (test input)
k = 10; % No. of neighbors

d = (x-q).^2;
[ds,is] = sort(d);
xs = x(is);
ys = y(is);

xn = xs(1:k);
yn = ys(1:k);

[xsmin,imin] = min(xs(1:k));
[xsmax,imax] = max(xs(1:k));
imin = is(imin);
imax = is(imax);

xll = (x(imin) + x(imin-1))/2; % lower limit
xul = (x(imax) + x(imax+1))/2; % upper limit

rg   = max(y) - min(y);
ymin = min(y) - 0.1*rg;
ymax = max(y) + 0.1*rg;

xbox = [xll xul xul xll];
ybox = [ymin ymin ymax ymax];

yav = mean(yn)*[1 1];
xav = [xll xul];

A = [xn ones(k,1)];
b = yn;
v = pinv(A)*b;
xl1 = 0;
yl1 = [xl1 1]*v;
xl2 = 1.5;
yl2 = [xl2 1]*v;
xll = [xl1 xl2];
yll = [yl1 yl2];

figure;
FigureSet(1,'LTX');
h = patch(xbox,ybox,'g');
set(h,'FaceColor',.8*[1 1 1]);
set(h,'EdgeColor',.8*[1 1 1]);
hold on;
h = plot(x,y,'k.');
set(h,'MarkerSize',8);
h = plot(xav,yav,'r-');
set(h,'LineWidth',1.5);
% h = plot(xll,yll,'b:');
h = plot(q*[1 1],[ymin ymax],'b--');
set(h,'LineWidth',1.5);
hold off;
axis([min(x) max(x) ymin ymax]);
xlabel('Time (ms)');
ylabel('Head Acceleration (g)');
title('Motorcycle Data Set');
set(gca,'Layer','top');
set(gca,'Box','off');
AxisSet(8);
print -depsc LocalAverageConcept;

Local Averaging Discussion

  ĝ(x) = (1/k) Σ_{i=1}^{k} yc(i)

• Local averaging has a number of disadvantages
  – The data set must be stored in memory (this is essentially true for kernel smoothers and smoothing splines also)
  – The output ĝ(x) is discontinuous
  – Finding the k nearest neighbors can be computationally expensive

Example 16: Local Averaging

[Figure: Motorcycle data set with a local averaging fit, k = 2; Time (ms) vs. Head Acceleration (g)]
Example 16: Local Averaging

[Figure: Motorcycle data set with a local averaging fit, k = 5; Time (ms) vs. Head Acceleration (g)]

Example 16: Local Averaging

[Figure: Motorcycle data set with a local averaging fit, k = 10; Time (ms) vs. Head Acceleration (g)]

Example 16: Local Averaging

[Figure: Motorcycle data set with a local averaging fit, k = 20; Time (ms) vs. Head Acceleration (g)]

Example 16: Local Averaging

[Figure: Motorcycle data set with a local averaging fit, k = 50; Time (ms) vs. Head Acceleration (g)]
Example 16: MATLAB Code

function [] = LocalAverageFit();

close all;

D = load('Motorcycle.txt');
x = D(:,1);
y = D(:,2);

[x,is] = sort(x);
y = y(is);

xt = (-10:0.1:70)';
yh = zeros(size(xt));

K = [2 5 10 20 50];
for c = 1:length(K),
    k = K(c);

    for cnt = 1:length(xt),
        d = (x-xt(cnt)).^2;
        [ds,is] = sort(d);
        xs = x(is);
        ys = y(is);

        xn = xs(1:k);
        yn = ys(1:k);

        yh(cnt) = mean(yn);
    end;

    figure;
    FigureSet(1,'LTX');
    h = plot(x,y,'k.');
    set(h,'MarkerSize',8);
    hold on;
    h = stairs(xt,yh,'b');
    set(h,'LineWidth',1.2);
    hold off;
    axis([-10 70 -150 90]);
    xlabel('Time (ms)');
    ylabel('Head Acceleration (g)');
    st = sprintf('Motorcycle Data Set k=%d',k);
    title(st);
    set(gca,'Layer','top');
    set(gca,'Box','off');
    AxisSet(8);
    st = sprintf('print -depsc LocalAverageEx%02d;',k);
    eval(st);
end;

Weighted Local Averaging

  ĝ(x) = Σ_{i=1}^{k} bk(|x − xc(i)|) yc(i) / Σ_{i=1}^{k} bk(|x − xc(i)|) = Σ_{i=1}^{k} bi yc(i) / Σ_{i=1}^{k} bi

• Local averaging can be tweaked to produce a continuous ĝ(x)
• We simply take a weighted average where b(u) is a smoothly decreasing function of the distance
• We can use our familiar (non-negative) kernels to achieve this
• My favorite is the biweight function

  bi = (1 − di²/d²k+1)²

  where di = |x − xc(i)| is the distance between the input and the ith nearest neighbor

Example 17: Local Averaging Weighting Functions

[Figure: weighted averaging weighting functions (Epanechnikov, Biweight, Triweight, Triangular); Distance (u) vs. weighting function value]
Example 17: Weighted Averaging

[Figure: Motorcycle data set with a weighted local averaging fit, k = 2; Time (ms) vs. Head Acceleration (g)]

Example 17: Weighted Averaging

[Figure: Motorcycle data set with a weighted local averaging fit, k = 5; Time (ms) vs. Head Acceleration (g)]

Example 17: Weighted Averaging Example 17: Weighted Averaging

Motorcycle Data Set k=10 Motorcycle Data Set k=20

50 50
Head Acceleration (g)

Head Acceleration (g)


0 0

−50 −50

−100 −100

−150 −150
−10 0 10 20 30 40 50 60 70 −10 0 10 20 30 40 50 60 70
Time (ms) Time (ms)

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 119 J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 120
Example 17: Weighted Averaging

[Figure: Motorcycle data set with a weighted local averaging fit, k = 50; Time (ms) vs. Head Acceleration (g)]

Example 17: MATLAB Code

function [yt] = WeightedAverage(x,y,xt,k);

xarg = x;
yarg = y;
x = unique(xarg);
y = zeros(size(x));
for cnt = 1:length(x),
    y(cnt) = mean(yarg(xarg==x(cnt)));
end;

yt = zeros(length(xt),1);
[Np,Ni] = size(x);

for cnt = 1:length(xt),
    d = zeros(Np,1);
    for cnt2 = 1:Ni,
        d = d + (x(:,cnt2)-xt(cnt,cnt2)).^2;
    end;
    [ds,is] = sort(d);
    xs = x(is);
    ys = y(is);

    dn   = ds(1:k);
    dmax = ds(k+1);

    xn = xs(1:k);
    yn = ys(1:k);

    w = (1-(dn/dmax)).^2;
    yt(cnt) = sum(w.*yn)/sum(w);
end;

Weighted Local Averaging Comments

  ĝ(x) = Σ_{i=1}^{k} bk(|x − xc(i)|) yc(i) / Σ_{i=1}^{k} bk(|x − xc(i)|)

• Like kernel smoothers, weighted local averaging models are stable (bounded)

  min_{i=1,...,k} yc(i) ≤ ĝ(x) ≤ max_{i=1,...,k} yc(i)

• The key difference here is that the kernel width is determined by the distance to the (k + 1)th nearest neighbor
• This is advantageous
  – In regions of dense data, the equivalent kernel width shrinks
  – In regions of sparse data, the equivalent kernel width expands
Local Model Optimality

It can be shown that for fixed weighting functions, w(x), both kernel smoothers and weighted local averaging models minimize the weighted average squared error

  ASE ≡ (1/n) Σ_{i=1}^{n} (yi − ĝ(xi))² b(|x − xi|)

  dASE/dĝ(x) ∝ Σ_{i=1}^{n} (yi − ĝ(x)) b(|x − xi|) = 0

  0 = Σ_{i=1}^{n} yi b(|x − xi|) − Σ_{i=1}^{n} ĝ(x) b(|x − xi|)
    = Σ_{i=1}^{n} yi b(|x − xi|) − ĝ(x) Σ_{i=1}^{n} b(|x − xi|)

  ĝ(x) = Σ_{i=1}^{n} yi b(|x − xi|) / Σ_{i=1}^{n} b(|x − xi|)

Local Model Optimality Continued

  ASE ≡ (1/n) Σ_{i=1}^{n} (yi − ĝ(xi))² b(|x − xi|)

  ĝ*(x) = Σ_{i=1}^{n} yi b(|x − xi|) / Σ_{i=1}^{n} b(|x − xi|)

• Thus, we have an alternative derivation of kernel smoothers and weighted local averaging models
• They are the models that minimize the weighted ASE
• The only difference between kernel smoothers, local averaging models, and weighted local averaging models are the weighting functions, b(·)

Local Model Consistency

• Under general assumptions, it can be shown that smoothing splines, kernel smoothers, and local models are consistent
• Consistency means that as n → ∞, if the following conditions are satisfied
  – ∫ |bσ(u)| du < ∞
  – lim_{u→∞} u bσ(u) = 0
  – E[εi²] < ∞
  – As n → ∞, σ → 0 and nσ → ∞
  then at every point that is continuous for g(x) and f(x) with f(x) > 0,

  ĝ(x) → g(x)

  with probability 1.

Bias-Variance Tradeoff

  PE ≡ E[(y − ĝ(x))²] = σε² + (g(x) − E[ĝ(x)])² + E[(ĝ(x) − E[ĝ(x)])²]
                               (Bias²)             (Variance)

  where σε² is the irreducible noise variance, which does not depend on the smoother.

• For each, we discussed a bias-variance tradeoff
  – Less smooth ⇒ More variance and less bias
  – More smooth ⇒ Less variance and more bias
• Recall that the prediction error can be written as shown above
• The expectation is taken over the distribution of data sets used to construct ĝ(x)
• Conceptually, this can be plotted
Bias-Variance Tradeoff Continued

[Figure: conceptual plot of prediction error, bias, and variance versus model smoothness]

• Our goal is to minimize the prediction error
• How do we choose the best smoothing parameter?
• All of the methods we discussed had a single parameter that controlled smoothness
  – Smoothing splines had a smoothness penalty parameter λ
  – Kernel methods had the bump width σ
  – Local averaging models had the number of neighbors k
• How do we pick the best smoothness?
• We would like an accurate estimate of the prediction error

Model Selection

• How do we estimate the prediction error with only one data set?
• The ASE won't work: it monotonically decreases as the smoothness decreases
• All of our smoothers can be written as

  ĝ(x) = ŷ = H(x) y

  for a given input vector x
• This is very similar to the hat matrix of linear models, except now the H matrix is a function of x
• The equivalent degrees of freedom can be estimated by

  p ≈ trace(H Hᵀ)

Model Selection Continued

• The prediction error can then be estimated by

  PE ≈ r(p/n) × ASE

  where r(·) is a function that adjusts the ASE to be a more accurate estimate of the PE
• A number of different functions r(·) have been proposed
  – Final Prediction Error: r(u) = (1 + u)/(1 − u)
  – Schwartz' Criterion: r(u) = 1 + (loge n / 2) · u/(1 − u)
  – Generalized CVE: r(u) = 1/(1 − u)²
  – Shibata's Model Selector: r(u) = 1 + 2u
• In each case u = p/n

Resampling Techniques

• It is also possible to use resampling techniques
  – N-Fold Cross-Validation: Divide the data set into N different sets. Pick the first set as the test set and build the model using the remaining N − 1 sets of points. Calculate the ASE on the test set. Repeat for all of the sets and average all N estimates of the ASE.
  – Leave-one-out Cross-Validation: Same as above for N = n.
  – Bootstrap: Select n points from the data set with replacement and calculate the ASE.
• I personally prefer to use leave-one-out CVE
• A study I conducted last spring with weighted averaging models indicated that CVE and Generalized CVE were the best (for that type of model)
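As a concrete illustration of leave-one-out cross-validation for picking the smoothing parameter, here is a minimal MATLAB sketch that selects the number of neighbors k for a weighted averaging model by minimizing the leave-one-out CVE. The WeightedAverage call assumes the function from Example 17; the remaining names are illustrative, not the script that produced the figures below.

% Sketch: leave-one-out cross-validation to select k (illustrative).
D = load('Motorcycle.txt');
x = D(:,1);
y = D(:,2);
n = length(x);

K   = 2:30;                     % Candidate numbers of neighbors
cve = zeros(size(K));
for c = 1:length(K),
    k  = K(c);
    se = zeros(n,1);
    for i = 1:n,
        idx   = [1:i-1 i+1:n];                         % Leave out point i
        yh    = WeightedAverage(x(idx),y(idx),x(i),k); % Predict the held-out output
        se(i) = (y(i) - yh)^2;
    end;
    cve(c) = mean(se);          % Leave-one-out estimate of the prediction error
end;

[cvemin,imin] = min(cve);
kopt = K(imin);                 % Smoothing parameter that minimizes the CVE

plot(K,cve,'b.-');
xlabel('Number of Neighbors (k)');
ylabel('CVE');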
Example: Weighted Averaging CVE

[Figure: local averaging cross-validation error versus number of neighbors (k)]

Example: Weighted Averaging CVE

[Figure: Motorcycle data with the local averaging fit at kopt = 11; Time (ms) vs. Head Acceleration (g)]

Weighted Least Squares

  ASEb = (1/n) Σ_{i=1}^{n} bi (yi − ŷi)²
       = (y − Aw)ᵀ Bᵀ B (y − Aw)

  where B = diag(b1, b2, ..., bn)

• When we discussed linear models we found the weights that minimized the average squared error
• We can generalize this easily to find the best linear model that minimizes the weighted ASE

Weighted Least Squares Continued

  ASEb = (y − Aw)ᵀ Bᵀ B (y − Aw)

• This can be framed as a typical (unweighted) least squares problem if we add the following definitions

  Ab ≡ BA    yb ≡ By

• Then the ASEb can be written as

  ASEb = (yb − Ab w)ᵀ (yb − Ab w)

  which has the known optimal least squares solution

  w = (Abᵀ Ab)⁻¹ Abᵀ yb

• Now we can easily generalize kernel methods and local averaging models to create localized linear models
• We merely specify the weights so that points near the input have the most influence on the model output
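The LocalLinear helper used in the Example 18 code below is not reproduced in these notes. A minimal sketch of a local linear model built from the weighted least squares solution above, using the biweight weighting and the k nearest neighbors, might look like the following; the function name and interface are assumptions for illustration, not the course's implementation.

% Sketch: local linear model via weighted least squares (illustrative).
function [yt] = LocalLinearSketch(x,y,xt,k);

yt = zeros(length(xt),1);
for cnt = 1:length(xt),
    d = (x - xt(cnt)).^2;            % Squared distances to the query point
    [ds,is] = sort(d);
    xn = x(is(1:k));                 % k nearest inputs and outputs
    yn = y(is(1:k));
    dn = ds(1:k);
    dmax = ds(k+1);                  % Width set by the (k+1)th neighbor

    b = (1 - dn/dmax).^2;            % Biweight weights (dn and dmax are squared distances)
    B = diag(sqrt(b));               % So that B'*B applies the weights b_i

    A = [ones(k,1) xn];              % Local linear design matrix
    w = pinv(B*A)*(B*yn);            % Weighted least squares solution
    yt(cnt) = [1 xt(cnt)]*w;         % Evaluate the local line at the query point
end;

Calling yt = LocalLinearSketch(x,y,xt,10); on the sorted Motorcycle data would produce a curve in the same spirit as the k = 10 panel of Example 18, although the course's LocalLinear may differ in its weighting details.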
Example 18: Local Linear Model

[Figure: Motorcycle data set with a local linear model fit, k = 2; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model

[Figure: Motorcycle data set with a local linear model fit, k = 5; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model Example 18: Local Linear Model

Motorcycle Data Set k=10 Motorcycle Data Set k=20

50 50
Head Acceleration (g)

Head Acceleration (g)


0 0

−50 −50

−100 −100

−150 −150
−10 0 10 20 30 40 50 60 70 −10 0 10 20 30 40 50 60 70
Time (ms) Time (ms)

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 139 J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 140
Example 18: Local Linear Model

[Figure: Motorcycle data set with a local linear model fit, k = 30; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model

[Figure: Motorcycle data set with a local linear model fit, k = 50; Time (ms) vs. Head Acceleration (g)]

Example 18: Local Linear Model Example 18: MATLAB Code

Motorcycle Data Set k=93


function [] = LocalLinearEx ();

close all ;

50 D = load ( ’ Motorcycle.txt ’ );
x = D (: ,1);
y = D (: ,2);
Head Acceleration (g)

[x , is ] = sort ( x );
0 y = y ( is );

xt = ( -10:0 .05 :70) ’;

k = [2 5 10 20 30 50 93];
−50 for cnt = 1: length ( k ) ,
yh = LocalLinear (x ,y , xt , k ( cnt ));
figure ;
FigureSet (1 , ’ LTX ’ );
h = plot (x ,y , ’ k. ’ ,xt , yh , ’b ’ );
−100 set ( h (1) , ’ MarkerSize ’ ,8);
set ( h (2) , ’ LineWidth ’ ,1 .2 );
axis ([ -10 70 -150 90]);
xlabel ( ’ Time ( ms ) ’ );
ylabel ( ’ Head Acceleration ( g ) ’ );
−150
st = sprintf ( ’ Motorcycle Data Set k =% d ’ ,k ( cnt ));
−10 0 10 20 30 40 50 60 70 title ( st );
Time (ms) set ( gca , ’ Layer ’ , ’ top ’ );

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 143 J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 144
set ( gca , ’ Box ’ , ’ off ’ );
AxisSet (8);
Univariate Smoothing Summary
st = sprintf ( ’ print - depsc LocalLinearEx %02 d ; ’ ,k ( cnt ));
eval ( st ); • We discussed four methods of interpolation
– Linear Interpolation
end ;

– Nearest Neighbor Interpolation


– Polynomial Interpolation
– Cubic Spline Interpolation
• We discussed six methods of univariate smoothing
– Polynomial regression (generalization of linear models)
– Smoothing splines
– Kernel smoothing
– Local averaging
– Weighted local averaging
– Local linear models (weighted)
• Discussed one method of density estimation based on kernels

J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 145 J. McNames Portland State University ECE 4/557 Univariate Smoothing Ver. 1.25 146

Univariate Smoothing Summary Continued

• All of the smoothing methods had a single parameter that controls the smoothness of the model
• For each, we discussed a bias-variance tradeoff
  – Less smooth ⇒ More variance and less bias
  – More smooth ⇒ Less variance and more bias
• We discussed several methods of estimating the "true" prediction error of the model
  – Some of the methods were simple modifications of the ASE
  – Other methods were based on resampling (cross-validation & the bootstrap)
• Most can be generalized to the multivariate case
