Session 6-15 - Unit II & III: Probability and Distribution, Classical Tests
2 ABA 04/10/2023
Some concepts in R for Module II - (2)
Calculating Probability for Discrete distributions
Discrete Distribution   R name    Parameters
Binomial                binom     size (n), prob (p)
Geometric               geom      prob
Hypergeometric          hyper     m, n, k
Negative Binomial       nbinom    size, prob (or mu)
Poisson                 pois      lambda (= mean)
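As a minimal sketch of how these names are used: R prefixes each name with d (probability mass), p (cumulative probability), q (quantile) or r (random draws). The parameter values below are illustrative assumptions, not taken from the slides.

```r
# Binomial: P(X = 3) and P(X <= 3) for 10 trials with success probability 0.5
p_binom_eq <- dbinom(3, size = 10, prob = 0.5)
p_binom_le <- pbinom(3, size = 10, prob = 0.5)

# Geometric: P(exactly 2 failures before the first success), prob = 0.2
p_geom <- dgeom(2, prob = 0.2)

# Hypergeometric: P(1 white ball in k = 3 draws) from m = 4 white, n = 6 black
p_hyper <- dhyper(1, m = 4, n = 6, k = 3)

# Negative binomial: P(exactly 2 failures before the 3rd success), prob = 0.5
p_nbinom <- dnbinom(2, size = 3, prob = 0.5)

# Poisson: P(X = 2) when the mean (lambda) is 3
p_pois <- dpois(2, lambda = 3)

print(c(binom_eq = p_binom_eq, binom_le = p_binom_le, geom = p_geom,
        hyper = p_hyper, nbinom = p_nbinom, pois = p_pois))
```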
General Statistics (1)
Null Hypothesis (Ho)
Nothing Happened, The mean was unchanged, The treatment has no effect, The model
did not improve
Alternate Hypothesis (Ha)
Something Happened, The mean rose, The treatment improved the patient’s health, The
model fit better
Assume Ho is TRUE
T-statistics
P-Value
Small (P < α) – Strong evidence against Ho, i.e. reject Ho
Not small (P >= α) – Retain Ho (fail to reject Ho)
Example: P < 0.05 – Reject Ho
α = (100-95)/100 = 5/100 = 0.05, i.e. a 95% confidence level
High Risk Applications
α = (100-99)/100 = 1/100 = 0.01, i.e. a 99% confidence level
α = (100-99.9)/100 = 0.1/100 = 0.001, i.e. a 99.9% confidence level
General Statistics (2)
Testing mean of the sample – (t-test – small sample, n<30)
You have a sample from a population. Given this sample, you want to
know whether the mean of the population could reasonably be “m”.
t.test makes inferences about a population mean from the
sample.
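A one-sample t-test along these lines might look as follows. This is a sketch only: the sample values, the hypothesised mean m = 5 and α = 0.05 are assumptions made for illustration.

```r
# Illustrative small sample (assumed values), n = 10 < 30
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.5, 5.7, 5.2, 5.4)
m <- 5  # hypothesised population mean (Ho: mean = m)

# One-sample, two-sided t-test
res <- t.test(x, mu = m)
print(res)

# The same t statistic computed by hand: (sample mean - m) / standard error
t_manual <- (mean(x) - m) / (sd(x) / sqrt(length(x)))

# Decision rule from the P-value slide
alpha <- 0.05
decision <- if (res$p.value < alpha) "Reject Ho" else "Retain Ho"
print(decision)
```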
General Statistics (3)
Testing for Normality
You want a statistical test to determine whether your data sample is
from a normally distributed population.
shapiro.test(x)
plot(x)
Shapiro-Wilk test:
Null hypothesis: the data are normally distributed
Alternative hypothesis: the data are not normally distributed
P < α: reject the null hypothesis; the population is likely not
normally distributed.
P >= α: a large p-value suggests the underlying population could be
normally distributed.
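To see both outcomes, the sketch below runs the test on two constructed vectors: one built from normal quantiles and one that is strongly right-skewed. Both vectors are assumptions made for illustration; with real data a normal Q-Q plot (qqnorm(x); qqline(x)) is a useful companion to the test.

```r
# A sample that follows the normal shape closely (normal quantiles)
x_norm <- qnorm(ppoints(50))

# A strongly right-skewed sample (exponentially growing values)
x_skew <- exp(seq(0, 5, length.out = 50))

res_norm <- shapiro.test(x_norm)
res_skew <- shapiro.test(x_skew)

alpha <- 0.05
# FALSE: Ho (normality) is retained for the normal-shaped sample
print(res_norm$p.value < alpha)
# TRUE: Ho (normality) is rejected for the skewed sample
print(res_skew$p.value < alpha)
```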
General Statistics (4)
Confidence Interval for a Median - Wilcoxon
The procedure for calculating a confidence interval for the mean is well defined and
widely known. The same is not true for the median. The Wilcoxon signed rank test is a
fairly standard procedure for this.
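In R this can be sketched with wilcox.test and conf.int = TRUE. The data values are assumed for illustration. Note that the interval R reports is for the pseudomedian, which coincides with the median when the distribution is symmetric.

```r
# Illustrative sample (assumed values, no ties)
x <- c(1.83, 0.50, 1.62, 2.48, 1.68, 1.88, 1.55, 3.06, 1.30)

# One-sample Wilcoxon signed rank procedure with a 95% confidence interval
res <- wilcox.test(x, conf.int = TRUE, conf.level = 0.95)

print(res$estimate)  # point estimate of the (pseudo)median
print(res$conf.int)  # lower and upper confidence limits
```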
Comparing the locations of two samples nonparametrically
You want to know: Is one population shifted to the left/right compared with the
other?
wilcox.test: tells us whether the central locations of the two populations are
significantly different or, equivalently, whether their relative frequencies are different.
Example: We randomly select a group of employees and ask each one to complete
the same task under 2 different circumstances (favourable and unfavourable
conditions). We measure their completion times and check whether they are significantly
different.
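Because each employee is measured under both conditions, the two samples are paired, so a sketch of this example would use paired = TRUE. The completion times below are hypothetical values invented for illustration.

```r
# Hypothetical completion times (minutes) for the same 8 employees
favourable   <- c(12.1, 10.4, 11.8,  9.7, 13.2, 10.9, 12.6, 11.3)
unfavourable <- c(14.0, 11.9, 12.5, 12.3, 15.3, 11.2, 14.9, 13.0)

# Paired Wilcoxon signed rank test on the within-employee differences
res <- wilcox.test(favourable, unfavourable, paired = TRUE)
print(res)

# Decision at the 5% significance level
alpha <- 0.05
decision <- if (res$p.value < alpha) "Reject Ho" else "Retain Ho"
print(decision)
```

Here every employee is slower under the unfavourable condition, so the signed rank statistic V is 0 and the test rejects Ho at the 5% level.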
Example – Wilcoxon Test
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(group = rep(c("Woman", "Man"), each = 9), weight =
c(women_weight, men_weight))
#Question : Is there any significant difference between women and men weights?
# Compute two-samples Wilcoxon test
res <- wilcox.test(weight ~ group, data = my_data, exact = FALSE)
print(res)
# INFERENCE: The p-value of the test is 0.02712, which is less than the significance level (0.05).
# We can conclude that men’s median weight is significantly different from women’s median weight.
# If you want to test whether the median men’s weight is less than the median women’s weight:
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "less")
# Or, if you want to test whether the median men’s weight is greater than the median women’s weight:
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "greater")
boxplot(men_weight,women_weight, xlab = "Gender", ylab="Weight", names=c("Men","Women"))
Example – Chi-Square Test - Car Data
The Cars93 data set in the "MASS" library describes 93 car models
on sale in the USA in 1993.
library("MASS")
str(Cars93)
The above result shows the dataset has many Factor variables which
can be considered as categorical variables.
For our model we will consider the variables "AirBags" and "Type".
Chi-Square Test
# Create a data frame from the main data set.
car.data <- data.frame(Cars93$AirBags, Cars93$Type)
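The slide's code stops after building the data frame. One way the test could continue is sketched below; the use of table() and chisq.test(), and the name car.table, are assumptions about the intended next step.

```r
library(MASS)  # provides the Cars93 data set

# Data frame with the two categorical variables, as on the slide
car.data <- data.frame(Cars93$AirBags, Cars93$Type)

# Cross-tabulate AirBags (rows) against Type (columns)
car.table <- table(car.data)
print(car.table)

# Chi-square test of independence between AirBags and Type.
# Some cells have small counts, so R may warn that the
# chi-squared approximation could be inaccurate.
res <- chisq.test(car.table)
print(res)
```

A p-value below the chosen α would indicate that air bag availability and car type are not independent.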
Unit III Sessions
Figure: SAMPLE SKETCHES (curves for b<0, b=0, 0<b<1, b=1 and b>1)
Linear Modeling – Example 1
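# Column names are as given on the slide; they must match the CSV header exactly (R is case-sensitive)
df1 <- read.csv(file.choose())
# Visualization
boxplot(df1[3:9])
boxplot(df1[11:13])
boxplot(df1$POC1)
# Simple Linear Regression, DV: Closing Price, IV: Opening Price
reg1 <- lm(close ~ OPEN, data = df1)
summary(reg1)
# IV: Low Price
reg2 <- lm(close ~ LOW, data = df1)
summary(reg2)
# IV: High Price
reg3 <- lm(close ~ HIGH, data = df1)
summary(reg3)
# IV: Last Traded Price (ltp)
reg4 <- lm(close ~ ltp, data = df1)
summary(reg4)
# IV: Volume Weighted Average Price (vwap)
reg5 <- lm(close ~ vwap, data = df1)
summary(reg5)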
Statistical Inferences
Intercept
Beta
P-value
*** - Statistically significant at the 0.1 percent level (p < 0.001)
** - Statistically significant at the 1 percent level (p < 0.01)
* - Statistically significant at the 5 percent level (p < 0.05)
. - Statistically significant at the 10 percent level (p < 0.1)
R2 Value
The higher the R2, the better the model fits (0.70 and above is good);
0.50 to 0.70 is acceptable (application specific).
Residual Error
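Each of these quantities can be read off the summary() object directly. The sketch below uses R's built-in cars data set (stopping distance vs. speed) as a stand-in, since the stock price file from the slides is not available here.

```r
# Simple linear regression on the built-in 'cars' data set (stand-in data)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(s)
print(coefs)

intercept <- coefs["(Intercept)", "Estimate"]  # Intercept
beta      <- coefs["speed", "Estimate"]        # Beta (slope)
p_beta    <- coefs["speed", "Pr(>|t|)"]        # P-value of the slope

r2  <- s$r.squared  # R2 value
rse <- s$sigma      # residual standard error

print(c(intercept = intercept, beta = beta, p_value = p_beta,
        r_squared = r2, residual_se = rse))
```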
Learning Objective
To demonstrate the concept of regression using R.
FINANCE DATA + REGRESSION + R
Software
17 G. Dhananjhay & V. Senthil 4/10/23
Basic idea of Regression
Francis Galton
Finance Background
We study CAPM (Capital Asset Pricing Model) for this exercise
Regression Model
Data
Recent FY daily equity data
Source: www.nseindia.com
Regression using R : one more example
Step by Step Approach for Practice
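Step 1. Download the tcsnifty50POC_1year data file
# Column names are as given on the slide; they must match the CSV header exactly (R is case-sensitive)
data <- read.csv(file.choose())
reg1 <- lm(tcsPOC ~ NIFTYPOC, data = data)
summary(reg1)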
Linear Modeling output
Confidence Interval & PLOT
confint(reg1,level=0.99)
confint(reg1,level=0.90)
VISUALIZATION 1
par(mfrow=c(2,2))
plot(reg1)
VISUALIZATION 2
par(mfrow=c(1,2))
# data$ is used instead of attach(); column names must match the CSV header exactly
hist(data$TCSPOC)
hist(data$NIFTY50POC)
Session 10
install.packages("datarium")
library(datarium)
MLR Model 1 output
Model 1 - Interpretation
1. In our example, the p-value of the F-statistic is < 2.2e-16, which is highly
significant. This means that at least one of the predictor variables is significantly related to
the outcome variable.
2. Changes in the youtube and facebook advertising budgets are
significantly associated with changes in sales, while changes in the newspaper budget are not
significantly associated with sales.
3. For example, for a fixed amount of youtube and newspaper advertising budget, spending an
additional 1 000 dollars on facebook advertising leads to an increase in sales of
approximately 0.1885*1000 = 189 sales units, on average.
4. The youtube coefficient suggests that for every 1 000 dollar increase in the youtube advertising
budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45
sales units, on average.
5. We found that newspaper is not significant in the multiple regression model. This means
that, for a fixed amount of youtube and facebook advertising budget, changes in the
newspaper advertising budget will not significantly affect sales units.
MLR Model 2 - Output
Model 2 - Interpretation
Finally, our model equation can be written as follows:
sales = 3.5 + 0.045*youtube + 0.187*facebook
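The equation can be turned into a small prediction helper. This is only a sketch: predict_sales is a hypothetical name, the coefficients are the rounded values from the equation above, and the inputs are assumed to be in the same units as the youtube and facebook columns of the data.

```r
# Hypothetical helper implementing: sales = 3.5 + 0.045*youtube + 0.187*facebook
predict_sales <- function(youtube, facebook) {
  3.5 + 0.045 * youtube + 0.187 * facebook
}

# Predicted sales for an illustrative budget (assumed values)
pred <- predict_sales(youtube = 100, facebook = 50)
print(pred)

# Effect of one extra unit of facebook budget, holding youtube fixed:
# the difference equals the facebook coefficient
effect_fb <- predict_sales(100, 51) - predict_sales(100, 50)
print(effect_fb)
```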
Model accuracy assessment
The overall quality of the model can be assessed by examining the R-squared (R2)
and Residual Standard Error (RSE).
R-squared:
In multiple linear regression, R represents the correlation coefficient between the
observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y.
For this reason, the value of R will always be non-negative and will range from zero to one.
R2 represents the proportion of variance in the outcome variable y that may be
predicted by knowing the values of the x variables. An R2 value close to 1 indicates
that the model explains a large portion of the variance in the outcome variable.
A problem with R2 is that it never decreases when more variables are added to
the model, even if those variables are only weakly associated with the response. A
solution is to adjust R2 by taking into account the number of predictor variables.
The “Adjusted R-squared” value in the summary output is a
correction for the number of x variables included in the prediction model.
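Both points can be checked numerically. The sketch below uses the built-in mtcars data set as a stand-in and adds a pure-noise predictor (an assumption made for illustration) to show that R2 does not go down while the adjusted value is penalised for the extra variable.

```r
# Baseline model on built-in data (stand-in for the marketing data)
m1 <- lm(mpg ~ wt, data = mtcars)

# R2 equals the squared correlation between observed and fitted values
r2_from_cor <- cor(mtcars$mpg, fitted(m1))^2
r2_m1 <- summary(m1)$r.squared

# Add a predictor that is pure random noise
set.seed(42)
mtcars2 <- transform(mtcars, noise = rnorm(nrow(mtcars)))
m2 <- lm(mpg ~ wt + noise, data = mtcars2)
r2_m2 <- summary(m2)$r.squared

# Compare plain and adjusted R2 for both models
print(c(r2_m1 = r2_m1, r2_m2 = r2_m2))
print(c(adj_m1 = summary(m1)$adj.r.squared,
        adj_m2 = summary(m2)$adj.r.squared))
```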
Residual Standard Error (RSE), or sigma:
The RSE estimate gives a measure of the error of prediction. The
lower the RSE, the more accurate the model.
The error rate can be estimated by dividing the RSE by the mean of the
outcome variable:
sigma(model1)/mean(marketing$sales)
## [1] 0.120
For Model 1, the RSE is 2.023,
corresponding to a 12% error rate.
sigma(model2)/mean(marketing$sales)
## [1] 0.119
For Model 2, the ratio is 0.119, again
corresponding to roughly a 12% error rate.
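To make the calculation concrete, the sketch below reproduces sigma() by hand and forms the same ratio on the built-in mtcars data set, used here as a stand-in for the marketing data.

```r
# Multiple regression on built-in data (stand-in for the marketing models)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# RSE as reported by R
rse <- sigma(fit)

# The same value by hand: sqrt(residual sum of squares / residual df)
rse_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# Error rate: RSE relative to the mean of the outcome variable
error_rate <- rse / mean(mtcars$mpg)

print(c(rse = rse, rse_manual = rse_manual, error_rate = error_rate))
```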