
Advanced Regression

with
JMP PRO
German JMP User Meeting
Holzminden – June 22, 2017
Silvio Miccio
Overview
• Introduction
• Some Details on Parameter Estimation and Model Selection
• Generalized Linear Models
• Penalized Regression Models in JMP PRO
• Example:
• Analysis of Time to Event Data (Parametric Survival Models)
• Classification Model with Missing Informative Data
• Linear Mixed Models in JMP PRO
• Example:
• Nested Intercept
• Repeated Measures (Consumer Research)
Introduction
• Multiple Linear Regression (MLR) is one of the most
commonly used methods in Empirical Modeling
• MLR is highly efficient as long as all assumptions are met
• Observational data in particular often do not meet the
assumptions, resulting in problems with the estimation of
coefficients and with model selection, and thereby with model
validity
• Hence, Advanced Regression Methods, such as those available in JMP
PRO, have to be applied to benefit from the ease of
interpretation of regression methods
Linear Regression
y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \varepsilon_i

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + \beta_{11} x_{1i}^2 + \beta_{22} x_{2i}^2 + \varepsilon_i

(higher-order terms such as interactions and quadratics are still linear in the coefficients)

• yi = Response; i = 1, 2….n (n = number of observations)


• xji = Factor, Predictor; j = 1, 2….p (p = number of factors)
• β0 = Intercept
• βj = Coefficients
• εi = Error
Assumptions
1. Errors are normally distributed: \varepsilon_i \sim N(0, \sigma^2)
2. Errors are independent
E.g. no pattern in residuals over time or groups
3. Homoscedasticity
Variance is constant in the entire model space
4. Factors are not or only slightly correlated
Rule of thumb VIF < 3 or 5
5. Predictors are fixed factors, measured with almost “no” error.
Error is assumed to be completely on the side of the residuals
6. Response is a linear combination of coefficients and factors
Some Details on Linear
Regression
Parameter Estimation
Linear Regression
For generalization of the parametric model of a linear regression
y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi} + \varepsilon_i
it makes sense to change to matrix notation:
y = Xβ + ε
with

y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}  (n × 1 vector of responses)

X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}  (n × p matrix of the factors/variables)

\beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix}  (p × 1 vector of unknown constants)

\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}  (n × 1 vector of random errors, N(0, σ²))

Standard Least Square Estimate
L = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon'\varepsilon
L = (y - X\beta)'(y - X\beta)                         (AB)' = B'A'
L = y'y - \beta'X'y - y'X\beta + \beta'X'X\beta       \beta'X'y = y'X\beta
L = y'y - 2\beta'X'y + \beta'X'X\beta                 quadratic function of \beta
\partial L / \partial \beta = -2X'y + 2X'X\beta = 0
X'X\beta = X'y
\hat{\beta} = (X'X)^{-1}X'y
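A minimal sketch (not part of the original slides) of the closed-form estimate, using NumPy and made-up toy data:

```python
import numpy as np

# Design matrix with an intercept column and two coded factors (toy data)
X = np.array([[1.0, -1, -1],
              [1.0,  1, -1],
              [1.0, -1,  1],
              [1.0,  1,  1],
              [1.0,  0,  0]])
y = np.array([3.1, 5.2, 4.0, 6.3, 4.6])

# Normal equations: beta_hat = (X'X)^-1 X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```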
X Matrix (coded) – 2³ FF Design, 3 Center Points
• X matrix of a full factorial 2³ design with three center points
• 1st column: intercept
• 2nd – 4th columns: main effects
• 5th – 7th columns: interactions

        Int  X1  X2  X3  X1X2  X1X3  X2X3
         1   -1  -1  -1    1     1     1
         1    1  -1  -1   -1    -1     1
         1   -1   1  -1   -1     1    -1
         1    1   1  -1    1    -1    -1
  X =    1   -1  -1   1    1    -1    -1
         1    1  -1   1   -1     1    -1
         1   -1   1   1   -1    -1     1
         1    1   1   1    1     1     1
         1    0   0   0    0     0     0
         1    0   0   0    0     0     0
         1    0   0   0    0     0     0
X’X (Covariance Matrix)
(X’X)⁻¹ – Inverted Covariance Matrix

             1/11   0    0    0    0    0    0
               0   1/8   0    0    0    0    0
               0    0   1/8   0    0    0    0
  (X’X)⁻¹ =    0    0    0   1/8   0    0    0
               0    0    0    0   1/8   0    0
               0    0    0    0    0   1/8   0
               0    0    0    0    0    0   1/8
• The “degrees of freedom” for estimating the model coefficients show up on the
diagonal
• This is only true if all off-diagonal elements are 0 (all factors are independent of
each other)
• When off-diagonal elements are not zero, the factors are correlated
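As a quick check (not part of the original slides), the X'X matrix and its inverse for this design can be reproduced with NumPy; the run order below is arbitrary:

```python
import numpy as np
from itertools import product

# Coded 2^3 full factorial plus three center points
runs = np.array(list(product([-1, 1], repeat=3)) + [(0, 0, 0)] * 3, dtype=float)
x1, x2, x3 = runs.T
X = np.column_stack([np.ones(len(runs)), x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])

XtX = X.T @ X                    # diagonal: 11 for the intercept, 8 for the effects
XtX_inv = np.linalg.inv(XtX)     # diagonal: 1/11 and 1/8; off-diagonals 0 (orthogonal design)
print(np.round(np.diag(XtX_inv), 4))
```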
[Figure: Orthogonal vs. Multicollinearity]
Effects of Multicollinearity
• Singular matrix → no solution
• High variance in the coefficients
• High variance in the predictions
• Often high R-square, but (all) factors are insignificant
• Small changes in the data may have a big effect on the
coefficients (not robust)
• Best subset selection, e.g. via Stepwise Regression, may
become almost impossible
Some Details on Linear
Regression
Model Selection
Model Selection
• Overall goal in Empirical Modeling is to identify the model with
the lowest expected prediction error
Expected Prediction Error =
Irreducible error (inherent noise of the system) +
Squared Bias (depends on model selection) +
Variance (depends on model selection)
• This requires finding the model with optimum complexity (e.g.
number of factors, number of sub-models, functional form of
model terms, modeling method)
• Model Selection: “estimating the performance of different
models in order to choose the (approximate) best one”
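Written out as a formula (the standard decomposition, not shown on the slide), the expected prediction error at a new point x_0 is:

E\big[(y_0 - \hat{f}(x_0))^2\big] = \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}} + \underbrace{\big(E[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{squared bias}} + \underbrace{\mathrm{Var}\big[\hat{f}(x_0)\big]}_{\text{variance}}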
Bias-Variance Trade Off
• If model complexity is too low
the model is biased (important
features of the system not
captured by the model)
• If model complexity is too high
the model is fit too hard to the
data, which results in a poor
generalization of the prediction
(high prediction variance)
• Training error: variation in the data not explained by the model
• Test error: expected prediction error based on independent test data
• The challenge is to identify the model with the optimum trade-off between bias and variance
Methods for Model Selection
• When it is not possible to split the data into a training,
validation and test set (as is the case for designed
experiments or for small data sets), model selection
can be done via measures that approximate
the validation/test error, such as AIC, AICc and BIC
• Here the estimated value itself usually is not of direct
interest; it is the relative size that matters
• Alternative methods based on re-sampling (e.g. cross
validation) provide a direct estimate of the expected
prediction error (and can also be used for model assessment)
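As an illustration (a generic helper, not a JMP function), AIC, AICc and BIC can be computed from a model's log-likelihood, the number of estimated parameters k, and the sample size n:

```python
import numpy as np

def aic_aicc_bic(log_likelihood: float, k: int, n: int):
    """Information criteria; smaller is better, only relative differences matter."""
    aic = 2 * k - 2 * log_likelihood
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
    bic = k * np.log(n) - 2 * log_likelihood
    return aic, aicc, bic

# Example: compare two candidate models fitted to the same data (made-up numbers)
print(aic_aicc_bic(log_likelihood=-52.3, k=4, n=20))
print(aic_aicc_bic(log_likelihood=-50.9, k=7, n=20))
```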
Generalized Linear Models
Modeling discrete responses and non-normally distributed
errors
Generalized Linear Model (GLM)
• A GLM is a generalization of a linear model for non-normal
responses, where the error variance is a function of the mean
o Binomial - dichotomous data (yes/no, pass/fail)
o Poisson - count data
o Log Normal - data restricted to non-negative values (transformed data
normally distributed)
o and much more…
• Components of GLM
1. Random Component
2. Systematic Component
3. Link Function
Random Component
Identifies the distribution and variance of the response.
Usually derives from the exponential family of distributions,
but not restricted to it.

Exponential family density (canonical form):

f(y_i; \theta_i, \phi) = \exp\!\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\}

The parameters θ_i and φ are the location and scale parameters.
Systematic Component & Link Function
Systematic Component
Linear function of the factors, the so called Linear Predictor
where the predictors can be transformed (squared, log…)
\eta_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_p x_{pi}

Link Function
Specifies the link between random and systematic
components. It is an invertible function that defines the
relation between the response and the linear predictor
\eta_i = g(\mu_i)
\mu_i = g^{-1}(\eta_i)
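A minimal sketch of these three components outside JMP, here a Poisson GLM with its canonical log link fitted via statsmodels on made-up data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = rng.poisson(lam=np.exp(0.5 + 1.2 * x))            # counts from a log-linear model

X = sm.add_constant(x)                                # systematic component: eta = b0 + b1*x
model = sm.GLM(y, X, family=sm.families.Poisson())    # random component + log link
result = model.fit()
print(result.params)                                  # estimates of b0 and b1
```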
Common Variance and Link Functions
Comparison Standard Least Squares vs. GLM
Standard Least Square Regression:
  y = X\beta + \varepsilon
  \hat{\beta} = (X'X)^{-1} X'y

Iteratively Re-Weighted Least Squares (GLM):
  \mu = g^{-1}(\eta), \quad \eta = X\beta
  \hat{\beta} = (X'WX)^{-1} X'Wz
y is an n × 1 vector of responses
X is an n × p matrix of the factors variables
β is a p × 1 vector of unknown constants
ε is an n × 1 vector of random errors N (0,σ2)
X′ is the transpose of X
X′X is a p × p matrix of correlations between the factors
η is the linear predictor
g-1 is the inverse link function
W is a diagonal matrix of weights wi
z is a response vector with entries zi
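The IRWLS update \hat{\beta} = (X'WX)^{-1}X'Wz can be coded directly; a minimal NumPy sketch for a binomial GLM with logit link (not JMP's implementation):

```python
import numpy as np

def irwls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit a logistic GLM by iteratively re-weighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                          # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))         # inverse link g^-1(eta)
        w = mu * (1.0 - mu)                     # diagonal of the weight matrix W
        z = eta + (y - mu) / w                  # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```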
Generalized Linear Regression
Penalized Regression
Generalized Linear Regression (GLR)
GLR can be seen as an extension of GLM that, in addition, is able
to deal with:
• Multicollinearity, and to perform
• Model Selection (for p > n/2 as well as for p > n)
This is achieved by penalized regression methods, which
attempt to fit better models by shrinking the parameter
estimates
Although shrunken estimates are biased, the resulting
models are usually better, i.e. they have a lower prediction error
Ridge Regression
• Ridge Regression was developed as remedy for multicollinearity
• It attempts to minimize the penalized residual sum of squares

\hat{\beta}^{ridge} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2

• λ ≥ 0 is the regularization parameter controlling the amount of shrinkage
• \lambda\sum_{j=1}^{p}\beta_j^2 is called the L2 penalty, due to the squared term
Ridge Regression
• Parameters are estimated according to:
\hat{\beta}^{ridge} = (X'X + \lambda I)^{-1} X'y

where I is a p × p identity matrix (diagonal
elements are 1 and all off-diagonal elements are 0)
• When there is a multicollinearity problem the off diagonal
elements (covariances) of (X’X) are large compared to the
values of the diagonal (variances)
• By adding λI to the covariance matrix, the diagonal elements
increase
What is Ridge Regression Doing?
• Since \hat{\beta} = (X'X + \lambda I)^{-1}X'y, the diagonal elements of the inverted
matrix get smaller, which means the parameter
estimates for β are shrunken
• As λ gets larger the inverse gets smaller, meaning the
variance in β decreases (which is desired), but only up to a certain
point
• When λ gets too big the residual sum of squares increases,
because the coefficients are shrunken so much that they no longer
explain the response (bias)
• Hence there is an optimum for λ
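A small NumPy sketch (made-up, nearly collinear data) showing the closed-form ridge estimate and how the coefficients shrink as λ grows:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate (X'X + lambda*I)^-1 X'y; X without an intercept column."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two nearly collinear factors to mimic multicollinearity
rng = np.random.default_rng(7)
x1 = rng.normal(size=50)
x2 = x1 + 0.05 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(size=50)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(lam, np.round(ridge_estimate(X, y, lam), 3))   # coefficients shrink as lambda grows
```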
LASSO Regression
• Since Ridge Regression shrinks large coefficients, which are
potentially important, more than small coefficients, the LASSO was
developed
• LASSO shrinks all coefficients by the same amount, but in addition
shrinks “unimportant” factors to exactly zero, which means they are
removed from the model (→ Model Selection)
• Like Ridge Regression, LASSO minimizes the error sum of squares, but
using a different penalty

\hat{\beta}^{LASSO} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
LASSO Regression

• \lambda\sum_{j=1}^{p}|\beta_j| is called the L1 penalty, due to the “first power” in
the penalty term
• Parameter estimation for the LASSO is algorithmic, because
there is no closed form solution (the penalty is an absolute value
and cannot be differentiated at zero)
• The LASSO is designed for model selection, but does not handle
collinearity well
• Ridge is designed for multicollinearity, but does no model
selection
• Hence, a combination of both methods is desirable
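A minimal sketch of the LASSO's model selection behaviour using scikit-learn (made-up data; here alpha plays the role of the L1 tuning parameter):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))                    # 8 candidate factors, only 2 truly active
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

lasso = Lasso(alpha=0.3)
lasso.fit(StandardScaler().fit_transform(X), y)
print(np.round(lasso.coef_, 3))                 # unimportant factors are shrunk to exactly zero
```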
Elastic Nets

\hat{\beta}^{EN} = \arg\min_\beta \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^2

• Elastic Nets combine the L1 and the L2 penalty
• The L1 penalty controls the model selection
• The L2 penalty
o Enables p > n
o Keeps groups of highly correlated variables in the model (LASSO just picks one
variable from the group)
o Improves and smooths the parameter estimation
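A minimal scikit-learn sketch (made-up data); l1_ratio is the mixing weight between the two penalties, comparable to JMP's Elastic Net Alpha:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=60)   # a factor highly correlated with X0
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

# l1_ratio mixes the L1 and L2 penalties; alpha is the overall penalty strength
enet = ElasticNet(alpha=0.3, l1_ratio=0.9)
enet.fit(StandardScaler().fit_transform(X), y)
print(np.round(enet.coef_, 3))                   # a correlated group tends to stay in together
```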
Adaptive Methods
• Adaptive methods penalize “important” factors less than
“unimportant” by a weighted penalty
  
Adaptive L1 penalty = \lambda\sum_{j=1}^{p}\frac{|\beta_j|}{|\tilde{\beta}_j|}        Adaptive L2 penalty = \lambda\sum_{j=1}^{p}\frac{\beta_j^2}{\tilde{\beta}_j^2}

• \tilde{\beta}_j is the maximum likelihood estimate, if it exists: for normally
distributed data the least squares estimate, for non-normally
distributed data the ridge solution
• Adaptive models attempt to ensure the Oracle Properties:
o Identification of the true active factors
o Correct estimation of the parameters
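A hedged sketch of the adaptive LASSO idea using scikit-learn: a pilot (here ridge) fit provides the weights, and rescaling the columns turns the weighted penalty into an ordinary L1 penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 6))
y = 4 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=80)

# Step 1: pilot estimate (ridge) provides the weights |beta_tilde_j|
pilot = Ridge(alpha=1.0).fit(X, y)
w = np.abs(pilot.coef_)

# Step 2: rescale the columns by the weights, fit an ordinary lasso, then scale back;
# this is equivalent to minimizing the weighted (adaptive) L1 penalty
lasso = Lasso(alpha=0.2).fit(X * w, y)
beta_adaptive = lasso.coef_ * w
print(np.round(beta_adaptive, 3))
```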
Tuning Parameters (from JMP Help)
• LASSO and Ridge are determined by one tuning parameter (L1 or L2)
• Elastic Nets are determined by two tuning parameters (L1 and L2),
where the Elastic Net Alpha is the weight between the penalties
• The higher the tuning parameter, the higher the penalty (a value
of zero gives the Maximum Likelihood solution (MLE); no penalty)
o When the tuning parameter is too small, the model is likely to overfit
o When the tuning parameter is too big, there is bias in the model
• To obtain a solution, the tuning parameter is increased over a fine grid
• The optimum solution is the one with the best fit over the entire tuning parameter
grid
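A minimal sketch of this grid search with a random holdback (made-up data; note the later caveat that random holdback is not recommended for DOE data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Increase the tuning parameter over a grid and keep the value with the best validation fit
grid = np.logspace(-3, 3, 25)
errors = [mean_squared_error(y_va, Ridge(alpha=a).fit(X_tr, y_tr).predict(X_va)) for a in grid]
best_alpha = grid[int(np.argmin(errors))]
print(best_alpha)
```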
Tuning Parameters – Elastic Net Alpha
(from JMP Help)
• Determines the mix
between the L1 and L2
penalty
• Default value is 0.9,
meaning the coefficient on the L1
penalty is set to 0.9 and the
coefficient on the L2 penalty
to 0.1
• If Elastic Net Alpha is not
set, the algorithm
computes the Lasso, Elastic
Net, and Ridge fits, in that
order and keeps the “best”
solution
Model Tuning
• Try different Estimation Methods,
settings for Advanced Controls, and
Validation Methods to find the best model
• All models are displayed in the model
report and can be individually saved as
script and prediction formula
• Note: k-fold or random holdback
validation is not recommended for
DOE data
Data Set 1 - Parametric Survival Analysis
• 4 Factors (E = Equipment Set-Up, P = Process Setting, F1 & F2 Product
Formulation) have been investigated in a designed experiment
• Column Censor includes the censoring variable (0 = no censoring, 1 =
censoring)
• Response is the time the sample resists a force applied to it
• For feasibility reasons the measurement is stopped after a pre-defined
maximum test time. This leads to so called “right censoring”, because not
all samples fail within the maximum test time.
• Objective is to create a model for predicting the survival time of the
sample
• The data file “GLR Survival” contains the scripts for the parametric
survival model for JMP (does not allow for automated model selection)
and JMP PRO
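A hedged sketch of a comparable parametric (Weibull) survival regression outside JMP using the lifelines package; the file name and the response column name "Time" are assumptions, the remaining column names follow the description above:

```python
import pandas as pd
from lifelines import WeibullAFTFitter

# df is assumed to hold the experiment: factors E, P, F1, F2, the response Time,
# and Censor (0 = failed within test time, 1 = right-censored)
df = pd.read_csv("glr_survival.csv")            # hypothetical export of the JMP table
df["Event"] = 1 - df["Censor"]                  # lifelines expects 1 = event observed

aft = WeibullAFTFitter()
aft.fit(df[["Time", "Event", "E", "P", "F1", "F2"]], duration_col="Time", event_col="Event")
aft.print_summary()
```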
Data 2 - Credit Risk Scoring
• The data set is called Equity.jmp and taken from the JMP Sample Data Library located in the JMP Help
menu
• It is based on historical data gathered to determine whether a customer is a good or bad credit risk for a
home equity loan (watch out: missing data, they are set to Informative Missing in JMP PRO, because
they contain important information)
• Predictors:
• LOAN = how much was the loan
• MORTDUE = how much they need to pay on their mortgage
• VALUE = assessed valuation
• REASON = reason for loan
• JOB = broad job category
• YOJ = years on the job
• DEROG = number of derogatory reports
• DELINQ = number of delinquent trade lines
• CLAGE = age of oldest trade line
• NINQ = number of recent credit enquiries
• CLNO = number of trade lines
• DEBTINC = debt to income ratio
• Response is Credit Risk, predict good and bad credit risks
• Data file “Credit Risk” contains scripts for JMP PRO (GLR for model selection, informative missing,
validation column) and JMP (logistic regression, it is possible to do stepwise and manual informative
missing coding – see JMP home page for details)
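A hedged sketch of the informative-missing idea outside JMP: each predictor gets a missing-value indicator and the original column is imputed before fitting a logistic regression. The file name and the target column "BAD" are assumptions; the predictor names follow the slide:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("equity.csv")                  # hypothetical export of Equity.jmp
numeric = ["LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG", "DELINQ",
           "CLAGE", "NINQ", "CLNO", "DEBTINC"]

# Informative missing: keep the fact that a value was missing as its own indicator column
for col in numeric:
    df[col + "_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].mean())

X = pd.get_dummies(df[numeric + [c + "_missing" for c in numeric] + ["REASON", "JOB"]],
                   dummy_na=True)
y = (df["BAD"] == 1).astype(int)                # assumed target coding: 1 = bad credit risk
print(LogisticRegression(max_iter=1000).fit(X, y).score(X, y))
```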
Linear Mixed Models
G-Side and R-Side Random Effects
Fixed Factors
• Usually the factors, e.g. in a design of experiments,
are varied at fixed factor levels
• With fixed factors we can make statistical inferences
within the investigated model space, based on the
factor effects
• When the factor levels are randomly chosen from a
larger population of factor levels, the factor is said
to be a random factor
Random Factors
• Random factors allow drawing conclusions about the entire
population of possible factor levels
• The population of possible factor levels is considered to be
infinite
• Random effects models are of special interest for identifying
sources of variation, because they allow estimating variance
components
• Examples of random factors: machines, operators, panelists
• Random Effects also have to be considered for split plot designs,
correlated responses, spatial data, repeated measurements
Random Effects Model

y = X\beta + Z\gamma + \varepsilon
\gamma \sim N(0, G), \quad \varepsilon \sim N(0, R)
E[y] = X\beta ; \quad Var[y] = ZGZ' + R

• y is a vector of responses
• X is the regression matrix of the fixed effects
• β is a vector of unknown fixed-effect parameters
• Z is the regression matrix of the random effects
• γ is a vector of unknown random-effect parameters
• ε is a vector of random errors (not required to be independent or homogeneous)
• G is the variance-covariance matrix for the random effects
• R is the variance-covariance matrix for the model errors
• G-side effects are specified by the Z matrix (random effects)
• R-side effects are specified by the covariance structure (repeated structure)
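A minimal sketch of a G-side random effect (random intercept per machine) outside JMP, using statsmodels MixedLM on made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
machines = np.repeat(np.arange(8), 6)           # 8 machines drawn from a larger population
x = rng.normal(size=machines.size)
y = (1.0 + 0.8 * x
     + rng.normal(scale=0.5, size=8)[machines]  # random machine intercepts (G-side)
     + rng.normal(scale=0.3, size=machines.size))
df = pd.DataFrame({"y": y, "x": x, "machine": machines})

# Fixed effect for x, random intercept per machine (the Z matrix comes from the grouping)
model = smf.mixedlm("y ~ x", df, groups=df["machine"])
print(model.fit().summary())
```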
Repeated Covariance Structure Requirements
(taken from JMP Help)

For details regarding the different covariance structures, please see the JMP Help.
Strategies for Selecting Covariance Structure
• It is “always” possible to fit an unstructured covariance
structure, but this also means fitting the most complex
model (potential risk of overfitting)
• The best option is to use covariance structures that are expected
to make sense for the given context (see the JMP Help for details
regarding the structures)
• To identify the best covariance structure among competing
models, AICc and/or BIC can be used
• Check the structure of the residuals by plotting them or
using the variogram
Repeated Measures (from JMP Help)
• Repeated measures designs, also known as within-subject designs,
model changes in a response over time or space while allowing errors to
be correlated.
• Spatial data are measurements made in two or more dimensions,
typically latitude and longitude. Spatial measurements are often
correlated as a function of their spatial proximity.
• Correlated response data result from making several measurements on
the same experimental unit. For example, height, weight, and blood
pressure readings taken on individuals in a medical study, or hardness,
strength, and elasticity measured on a manufactured item, are likely to
be correlated. Although these measurements can be studied individually,
treating them as correlated responses can lead to useful insights.
Correlated Response (Example from JMP Help)
In this example, the effect of two layouts dealing with wafer production
is studied for a characteristic of interest. Each of 50 wafers is
partitioned into four quadrants and the characteristic is measured on
each of these quadrants. Data of this type are usually presented in a
format where each row contains all of the repeated measurements for
one of the units of interest. Data of this type are often analyzed using
separate models for each response. However, when repeated
measurements are taken on a single unit, it is likely that there is within-
unit correlation. Failure to account for this correlation can result in
poor decisions and predictions. You can use the Mixed Model
personality to account for and model the possible correlation.
P&G example cannot be shared. Use JMP data file “Wafer Stacked”
from the JMP sample data library (not attached). Can be analyzed in
JMP PRO only; R-Side Random Effect with spatial structure.
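A simplified sketch of how such within-unit correlation could be handled outside JMP with a random intercept per wafer (statsmodels MixedLM has no direct equivalent of JMP's R-side spatial structures); the file and column names are assumptions, not the actual Wafer Stacked variables:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed long ("stacked") format: one row per wafer x quadrant
df = pd.read_csv("wafer_stacked.csv")           # hypothetical export of the JMP sample table

# Fixed effects for layout and quadrant, random intercept per wafer to capture
# the within-wafer correlation (a compound-symmetry-like approximation)
model = smf.mixedlm("Response ~ Layout + Quadrant", df, groups=df["Wafer"])
print(model.fit().summary())
```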
