
ASSIGNMENT NO: 1

Submitted To: Dr. RAMPRRASADH GOARTHY

PGDM 2019-2021

Name SAP ID
Saurabh Pratap Singh 80203190169
Aditya Yadav 80203190184
Yogeshkumar Shankariya 80203190158
Question-1

AIC:

The Akaike information criterion (AIC) is an estimator of out-of-sample prediction error and thereby of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is a single number score that can be used to determine which of multiple models is most likely to be
the best model for a given dataset. It estimates models relatively, meaning that AIC scores are only
useful in comparison with other AIC scores for the same dataset. A lower AIC score is better.

AIC is most frequently used in situations where it is not easy to test a model’s performance on a held-out test set, as is done in standard machine learning practice (for example, with small datasets or time series). AIC is particularly valuable for time series, because the most informative observations in a time series are often the most recent ones, which would otherwise be locked away in the validation and test sets. As a result, training on all the data and using AIC for selection can improve on traditional train/validation/test model selection methods.

AIC works by evaluating the model’s fit on the training data and adding a penalty term for the complexity of the model (the same underlying idea as regularization). The aim is to find the lowest possible AIC, which indicates the best balance of model fit and generalizability, and thereby serves the eventual goal of maximizing fit on out-of-sample data.

AIC = 2k − 2 ln(L-hat)

AIC equation, where L-hat is the maximized value of the likelihood function and k is the number of estimated parameters.
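As a quick check of the formula, it can be reproduced in R; the snippet below is only an illustration and uses the built-in mtcars dataset, which is not part of this assignment.

fit <- lm(mpg ~ wt, data = mtcars)

k <- length(coef(fit)) + 1       # estimated parameters: intercept, slope, and the error variance

logL <- as.numeric(logLik(fit))  # maximized log-likelihood ln(L-hat)

2 * k - 2 * logL                 # AIC computed directly from the formula

AIC(fit)                         # R's built-in AIC, which gives the same value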


AIC is typically used when one does not have access to out-of-sample data and wants to decide between multiple different model types, or simply to save time.

AIC’s assumptions:

1. The same data are used across all models being compared

2. The same outcome variable is measured across all models

3. The sample size is effectively infinite (AIC is an asymptotic result, so it is most reliable with large samples)

AIC Interpretation:

Pick the model with the lowest score as the best.
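For illustration, a minimal comparison of two candidate models in R (again using the built-in mtcars data rather than assignment data) would be:

model_1 <- lm(mpg ~ wt, data = mtcars)

model_2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(model_1, model_2)  # returns one AIC per model; the lower score identifies the preferred model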

Pitfalls of AIC:

AIC only measures the relative quality of models, so every model tested could still fit the data poorly. As a result, other measures are needed to show that a model’s results meet an acceptable absolute standard.
BIC:

The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function, and it is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may
result in overfitting. The BIC resolves this problem by introducing a penalty term for the number of
parameters in the model. The penalty term is larger in BIC than in AIC.

BIC has been widely used for model identification in time series and linear regression. It can, however,
be applied quite widely to any set of maximum likelihood-based models.

Mathematical Expression:

Mathematically, BIC can be defined as

BIC = k ln(n) − 2 ln(L-hat)

where L-hat is the maximized value of the likelihood function, k is the number of estimated parameters, and n is the number of observations.

Application & Interpretation:

Competing models can be compared using their BIC values; the model with the lower BIC achieves the better trade-off between fit and complexity and is therefore preferred.

Though the two measures are derived from different perspectives, they are closely related: in terms of the formulas, the only difference is that BIC’s penalty term depends on the number of observations, whereas AIC’s does not.

BIC is typically higher than AIC (its penalty k ln(n) exceeds AIC’s 2k once n is 8 or more), but for both measures a lower value indicates a better model.
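Continuing the same illustrative mtcars example, both criteria can be computed side by side in R; the BIC scores come out higher because of the larger penalty, but both are read the same way:

model_1 <- lm(mpg ~ wt, data = mtcars)

model_2 <- lm(mpg ~ wt + hp, data = mtcars)

AIC(model_1, model_2)  # penalty of 2k per model

BIC(model_1, model_2)  # penalty of k*log(n) per model; still pick the model with the lowest value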

Quantile:
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into
continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
There is one fewer quantile than the number of groups created. Thus quartiles are the three cut points
that will divide a dataset into four equal-sized groups. Common quantiles have special names: for instance, quartiles (creating four groups) and deciles (creating ten groups). The groups created are termed halves, thirds, quarters, etc., though sometimes the terms for the quantile are used for the groups created, rather than for the cut points.
q-quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes. There are q − 1 of the q-quantiles, one for each integer k satisfying 0 < k < q. In some cases the value of a
quantile may not be uniquely determined, as can be the case for the median (2-quantile) of a uniform
probability distribution on a set of even size. Quantiles can also be applied to continuous distributions,
providing a way to generalize rank statistics to continuous variables (see percentile rank). When
the cumulative distribution function of a random variable is known, the q-quantiles are the application
of the quantile function (the inverse function of the cumulative distribution function) to the values {1/q,
2/q, …, (q − 1)/q}.
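As a small illustration, R’s quantile() function returns these cut points directly for a sample, and quantile (inverse CDF) functions such as qnorm() do the same for known distributions; the vector below is arbitrary example data.

x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)    # arbitrary example data

quantile(x, probs = c(0.25, 0.50, 0.75))  # the three cut points (quartiles) dividing the sample into four groups

qnorm(c(0.25, 0.50, 0.75))                # quartiles of a standard normal distribution via its inverse CDF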
A population regression function is a linear function which hypothesizes a theoretical relationship
between a dependent variable and a set of independent or explanatory variables at a population level.

PRF and SRF

A stochastic error term is also present in the regression model. The PRF is written in the following form:

Yi = B0 + B1Xi + ui

where Y is the dependent variable, X is the independent variable, u is the stochastic error term, B0 is the intercept and B1 is the slope coefficient.

It states how the population mean value of the dependent variable is related to one or more explanatory variables.

Sample Regression Function

It is the sample counterpart of the population regression function. Different samples will generate
different estimates because SRF is obtained for a given sample.

It is written as follows:

Yi = Bo + B1x1 + ei

These are the fitted values of the population estimators. For each value of x(hat) there is

a fitted value of y(hat). The residual error term is ei

In summary, the population regression function (PRF) is the locus of the conditional mean of the dependent variable Y for fixed values of the independent variable X, while the sample regression function (SRF) shows the estimated relation between the explanatory (independent) variable X and the dependent variable Y.
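A brief R sketch (using the built-in cars dataset purely as an example) makes the SRF concrete: lm() estimates the coefficients from the sample, and the fitted values and residuals correspond to the y(hat) and ei terms above.

srf <- lm(dist ~ speed, data = cars)  # estimate the SRF: dist = b0 + b1*speed + e

coef(srf)             # b0 and b1 estimated from this particular sample

head(fitted(srf))     # fitted values y(hat) for the first observations

head(residuals(srf))  # residuals ei = observed y minus fitted y(hat)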
Question-2

Refer DATT - Class 05 - Assignment - Gr 9.xlsx (Q-Q PLOT MANUALLY)

Question-3

Overcoming heteroskedasticity (R code, comments, and results are included below)

setwd("C:/Users/Home/Desktop/Trim 4/Advanced Multivariate/Workspace")

### 1. Importing Sales.csv into R Studio

SalesData <- read.csv("Sales.csv")

attach(SalesData)

SalesData[1] <- NULL #Remove Column1 which is not required for the analysis

### 2. HOMOSKEDASTICITY Check.

RegModel2 <- lm(data = SalesData, RD ~ SALES)

summary(RegModel2)

plot(RegModel2)

plot(RegModel2$residuals)

#Breusch-Pagan test.

# H0: There is constant variance or homoskedasticity in the residuals

# H1: There is no constant variance in the residuals (i.e. there is heteroskedasticity)


# If p-value is greater than 0.05 (95% confidence), then fail to reject H0

lmtest::bptest(RD ~ SALES, studentize = FALSE, data = SalesData)

#data: RD ~ SALES

#BP = 8.91, df = 1, p-value = 0.002836

# So, here the p-value is very small and less than 0.05. Hence, we reject the H0

# That means, we don't have homoskedasticity in the data, and hence we have heteroskedasticity

# Both the residual plot (informal method) and BP test (formal method) confirm that

# there is no homoskedasticity in the data.

###Removing heteroskedasticity

### logTransformation of Y

LogNew_RD <- log(RD)

SalesData$LogNew_RD <- LogNew_RD  # Append the LogNew_RD column to the SalesData data frame

View(SalesData)

### NewModel

Sales.Log <- lm(LogNew_RD ~ SALES, data = SalesData)

summary(Sales.Log)

lmtest::bptest(Sales.Log, studentize = FALSE)


# Breusch-Pagan test

# data: Sales.Log

# BP = 0.094878, df = 1, p-value = 0.7581

# Thus heteroskedasticity is removed in the new model

#RESULTS OF NEW MODEL

# Coefficients:
#                Estimate  Std. Error  t value  Pr(>|t|)
# (Intercept)  5.790e+00   4.029e-01   14.372   1.45e-10 ***
# SALES        1.470e-05   3.386e-06    4.342   0.000504 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 1.122 on 16 degrees of freedom

# Multiple R-squared: 0.541, Adjusted R-squared: 0.5123

# F-statistic: 18.86 on 1 and 16 DF, p-value: 0.000504

# Final Model

# log(RD) = 5.790 + 1.47e-05 * SALES

# RD = e^(5.790 + 1.47e-05 * SALES)   # Taking the antilog of both sides gives the final model
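As an optional follow-up sketch (not part of the original assignment output), predictions on the original RD scale can be recovered by exponentiating the log-scale fitted values, exactly as in the antilog step above:

log_pred <- predict(Sales.Log, newdata = SalesData)  # predicted log(RD) from the transformed model

head(exp(log_pred))                                  # back-transformed predictions of RD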

Question-4

OUTLIERS IN REGRESSION:

Outliers are data points that are far from other data points. In other words, they're unusual values in a
dataset. Outliers are problematic for many statistical analyses because they can cause tests to either
miss significant findings or distort real results. Outliers in regression are observations that fall far from
the cloud of points. These points are sometimes important because they can have a strong influence on
the least squares line. Often, however, they do not play a key role, and in such cases it may be advisable to remove them from the model. Outliers are one of those statistical issues that everyone knows about, but most
people aren’t sure how to deal with.  Most parametric statistics, like means, standard deviations, and
correlations, and every statistic based on these, are highly sensitive to outliers.

If there are outliers in the data, they should not be removed or ignored without a good reason. Proper
analysis of Model needs to be done before taking any decision of retaining or removing outliers.

HOW TO DETECT AND DEAL WITH OUTLIERS IN A REGRESSION MODEL:

Informal Method:

 Stem and leaf Plot of Residuals: A stem-and-leaf display or stem-and-leaf plot is a device for
presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing
the shape of a distribution. Stem-and-leaf displays are useful for displaying the relative density
and shape of the data, giving the reader a quick overview of the distribution. By drawing this
plot, we can easily highlight the outliers in our model.
 Box-plot of Residuals: This is a simple box-plot of all residuals where we have all quartiles range
and from it we can clearly infer the outliers in the model (an R sketch of both plots follows this list).
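Both informal plots can be produced in R with the base functions stem() and boxplot(); a minimal sketch, assuming a fitted model object such as RegModel2 from Question 3, is:

stem(residuals(RegModel2))     # stem-and-leaf display of the residuals

boxplot(residuals(RegModel2))  # box-plot of the residuals; points beyond the whiskers are potential outliers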

Formal Method:

 Leverage: Points that fall horizontally away from the centre of the cloud tend to pull harder on
the line, so we call them points with high leverage; such points can strongly influence the slope of the least squares line. Usually we say a point is influential if, had we fitted the line without it, the influential
point would have been unusually far from the least squares line.
 Cook’s Distance: Cook’s distance, Di, is used in Regression Analysis to find influential outliers in a
set of predictor variables. In other words, it is a way to identify points that unduly influence the regression model. There are rules of thumb that help in detecting such outliers: a common one is to investigate any point with a Cook’s distance greater than 4/n, where n is the number of observations; another is that a Cook’s distance greater than 1 indicates a possible outlier (see the R sketch after this list).
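A minimal R sketch of these formal checks, again assuming the fitted RegModel2 object and the SalesData frame from Question 3:

n <- nrow(SalesData)              # number of observations

hatvalues(RegModel2)              # leverage of each observation

cd <- cooks.distance(RegModel2)   # Cook's distance of each observation

which(cd > 4 / n)                 # points to investigate by the 4/n rule

which(cd > 1)                     # possible outliers by the "greater than 1" rule of thumb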

OVERCOMING OUTLIERS IN MODEL:

Good Judgement and Proper Analysis:

 If it is obvious that the outlier is due to incorrectly entered or measured data, you should drop
the outlier.
 If the outlier does not change the results but does affect assumptions, you may drop the outlier. 
 More commonly, the outlier affects both results and assumptions.  In this situation, it
is  not  legitimate to simply drop the outlier.  You may run the analysis both with and without it,
but you should state in at least a footnote the dropping of any such data points and how the
results changed.
 If the outlier creates a significant association, you should drop the outlier and should not report
any significance from your analysis.

Transformation:

 Try some transformation on x (independent variable) and rerun the regression model.
Try some transformation on y (dependent variable), rerun the regression model, and check if there is any improvement. It has been observed that log and square root transformations might reduce the number of outliers.
 Another option is to transform some variables and try to build a new model. There is a good
chance that this model might be better than the previous one (an illustrative R sketch follows this list).
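As an illustrative sketch of the last two points (hypothetical, reusing RegModel2 and SalesData from Question 3), one could refit the model on a transformed dependent variable and compare how many influential points each model flags:

log_model <- lm(log(RD) ~ SALES, data = SalesData)     # refit with a log-transformed dependent variable

sum(cooks.distance(RegModel2) > 4 / nrow(SalesData))   # points flagged by the 4/n rule in the original model

sum(cooks.distance(log_model) > 4 / nrow(SalesData))   # points flagged after the transformation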
