
Session 6-15 – Unit II & III

Probability and Distribution, Classical tests


Use R to perform one-sample tests: t-test, Wilcoxon signed-rank test
Use R to perform two-sample tests: t-test, Wilcoxon test, and paired t-test
Use R to perform one-way analysis of variance and Kruskal-Wallis test
Some concepts in R for Module II – (1)
Calculating Probability for Continuous distributions
Continuous Distribution   R name    Parameters
Beta                      beta      shape1, shape2
Cauchy                    cauchy    location, scale
Chi-Squared               chisq     df
Exponential               exp       rate
F                         f         df1, df2
Gamma                     gamma     shape, rate (or scale)
Log-Normal                lnorm     meanlog, sdlog
Logistic                  logis     location, scale
Normal                    norm      mean, sd
Student's t               t         df
Uniform                   unif      min, max
Weibull                   weibull   shape, scale
Wilcoxon rank sum         wilcox    m, n
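As a minimal sketch of how the R names in the table are used: base R prefixes each name with d (density), p (cumulative probability), q (quantile) or r (random draws). The particular values below (1.96, rate = 0.5 and so on) are illustrative choices, not taken from the slides.

```r
# P(Z <= 1.96) for a standard normal variable
p_upper <- pnorm(1.96, mean = 0, sd = 1)

# The value below which 97.5% of a standard normal distribution lies
q_975 <- qnorm(0.975, mean = 0, sd = 1)

# Density of the standard normal distribution at 0
d0 <- dnorm(0)

# P(X > 2) for an exponential distribution with rate = 0.5
p_exp <- pexp(2, rate = 0.5, lower.tail = FALSE)

# Five reproducible random draws from Uniform(0, 10)
set.seed(1)
u <- runif(5, min = 0, max = 10)

print(c(p_upper = p_upper, q_975 = q_975, d0 = d0, p_exp = p_exp))
print(u)
```

The same four prefixes work with every R name in the table, for example pchisq, qt or rweibull.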

2 ABA 04/10/2023
Some concepts in R for Module II - (2)
Calculating Probability for Discrete distributions
Discrete Distribution     R name    Parameters
Binomial                  binom     size, prob
Geometric                 geom      prob
Hypergeometric            hyper     m, n, k
Negative Binomial         nbinom    size, prob (or mu)
Poisson                   pois      lambda (= mean)
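A short sketch of probability calculations for discrete distributions, using the d and p prefixes of base R. The coin-toss and Poisson settings are illustrative assumptions, not examples from the slides.

```r
# X ~ Binomial(size = 10, prob = 0.5), e.g. heads in 10 fair coin tosses
p_eq3 <- dbinom(3, size = 10, prob = 0.5)   # P(X = 3)
p_le3 <- pbinom(3, size = 10, prob = 0.5)   # P(X <= 3)

# Y ~ Poisson(lambda = 2)
# P(Y >= 5) is the upper tail beyond 4, hence q = 4 with lower.tail = FALSE
p_ge5 <- ppois(4, lambda = 2, lower.tail = FALSE)

print(c(p_eq3 = p_eq3, p_le3 = p_le3, p_ge5 = p_ge5))
```

For a discrete variable, P(Y >= k) needs the upper tail starting at k - 1, which is an easy place to be off by one.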

General Statistics (1)
Null Hypothesis (H0)
- Nothing happened: the mean was unchanged, the treatment has no effect, the model did not improve
Alternative Hypothesis (Ha)
- Something happened: the mean rose, the treatment improved the patient’s health, the model fit better
Assume H0 is TRUE
t-statistic
P-value
- Small (P < α): strong evidence against H0, i.e. reject H0
- Not small (P >= α): retain H0 (fail to reject H0)
- Example: P < 0.05, reject H0
- α = (100-95)/100 = 5/100 = 0.05 corresponds to a 95% confidence level
- High-risk applications use a smaller α:
  α = (100-99)/100 = 1/100 = 0.01 corresponds to a 99% confidence level
  α = (100-99.9)/100 = 0.1/100 = 0.001 corresponds to a 99.9% confidence level
General Statistics (2)
Testing the mean of a sample (t-test; suited to small samples, n < 30)
You have a sample from a population. Given this sample, you want to
know whether the mean of the population could reasonably be “m”.
t.test makes inferences about a population mean from the sample.

t-test: ask whether the population mean could be 95


x<-rnorm(50,mean=100,sd=15)
t.test(x,mu=95) #p value is X.XXXXXX < 0.05
plot(x)
If P < α (small), it is unlikely (based on the sample data) that
95 could be the mean of the population, i.e. reject H0.
Run the above 3 lines multiple times and check the answer: x is a new
random sample on every run, so the p-value changes from run to run.
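The following is a reproducible sketch of the same one-sample t-test. The seed value and the variable names (res, t_manual, decision) are assumptions added for illustration; fixing the seed simply makes every run give the same sample.

```r
set.seed(42)                          # fix the random sample
x <- rnorm(50, mean = 100, sd = 15)

# H0: population mean = 95, Ha: population mean != 95
res <- t.test(x, mu = 95)

# The t-statistic computed by hand for comparison
t_manual <- (mean(x) - 95) / (sd(x) / sqrt(length(x)))

alpha <- 0.05
decision <- if (res$p.value < alpha) "reject H0" else "fail to reject H0"

print(res$statistic)   # t-statistic reported by t.test
print(t_manual)        # same value from the formula
print(res$p.value)
print(res$conf.int)    # 95% confidence interval for the mean
print(decision)
```

Removing the set.seed line restores the behaviour described above, where the answer varies between runs.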

General Statistics (3)
Testing for Normality
You want a statistical test to determine whether your data sample is
from a normally distributed population.
shapiro.test(x)
plot(x)
Shapiro-Wilk test:
Null hypothesis: the data are normally distributed
Alternative hypothesis: the data are not normally distributed
P < α: reject the null hypothesis; the population is likely not
normally distributed.
P >= α: a large p-value suggests the underlying population could be
normally distributed.
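As an illustrative sketch, the Shapiro-Wilk test can be compared on a normal sample and a clearly skewed one. Both samples are simulated here, and the names x_norm and x_skew are hypothetical.

```r
set.seed(1)
x_norm <- rnorm(100, mean = 50, sd = 10)   # drawn from a normal population
x_skew <- rexp(100, rate = 1)              # drawn from a skewed population

sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)

# Compare each p-value with alpha = 0.05
print(sw_norm$p.value)
print(sw_skew$p.value)
```

A normal Q-Q plot, for example qqnorm(x_skew) followed by qqline(x_skew), is a useful visual companion to the test.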

General Statistics (4)
Confidence Interval for a Median - Wilcoxon
- The procedure for calculating the confidence interval for a mean is well-defined and
widely known. The same is not true for the median. The Wilcoxon signed-rank test is
a standard procedure for this.
Comparing the locations of two samples nonparametrically
You want to know: is one population shifted to the left or right compared with the
other?

wilcox.test: tells us whether the central locations of the two populations are
significantly different, or equivalently whether their relative frequencies are different

Example: We randomly select a group of employees and ask each one to complete
the same task under 2 different circumstances (favourable and unfavourable
conditions). We measure their completion times and check whether they are
significantly different. Because each employee is measured twice, this is a paired comparison.
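A minimal sketch of the employee example as a paired Wilcoxon signed-rank test. The completion times are hypothetical numbers invented for illustration; conf.int = TRUE additionally asks wilcox.test for a confidence interval for the typical (pseudo-median) difference.

```r
# Hypothetical completion times (minutes) for the same 10 employees
favourable   <- c(12.1, 10.4, 14.8, 11.9, 13.2,  9.7, 15.3, 12.8, 11.1, 13.9)
unfavourable <- c(14.6, 11.3, 16.2, 14.8, 17.0, 11.9, 19.0, 17.0, 15.8, 19.1)

# Paired test: H0 is that the differences are centred on zero
res <- wilcox.test(unfavourable, favourable, paired = TRUE, conf.int = TRUE)

print(res$statistic)  # V: sum of ranks of the positive differences
print(res$p.value)
print(res$conf.int)   # interval for the pseudo-median of the differences
print(res$estimate)
```

In this made-up data every employee is slower under unfavourable conditions, so all ten differences are positive and the test rejects H0 at the 5% level.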

Example – Wilcox Test
# Data in two numeric vectors
women_weight <- c(38.9, 61.2, 73.3, 21.8, 63.4, 64.6, 48.4, 48.8, 48.5)
men_weight <- c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)
# Create a data frame
my_data <- data.frame(group = rep(c("Woman", "Man"), each = 9),
                      weight = c(women_weight, men_weight))
#Question : Is there any significant difference between women and men weights?
# Compute two-samples Wilcoxon test
res <- wilcox.test(weight ~ group, data = my_data, exact = FALSE)
print(res)

# INFERENCE: The p-value of the test is 0.02712, which is less than the significance level (0.05).
# We can conclude that men’s median weight is significantly different from women’s median weight.
# The group levels are ordered alphabetically (Man, Woman), so the alternative refers to Man relative to Woman.
# If you want to test whether the median men’s weight is less than the median women’s weight:
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "less")
# Or, if you want to test whether the median men’s weight is greater than the median women’s weight:
wilcox.test(weight ~ group, data = my_data, exact = FALSE, alternative = "greater")
boxplot(men_weight,women_weight, xlab = "Gender", ylab="Weight", names=c("Men","Women"))

Example – Chi-Square Test - Car Data
- The Cars93 data in the "MASS" library contains data on 93 car models
on sale in the USA in 1993.

library("MASS")
str(Cars93)

- The above result shows the dataset has many Factor variables, which
can be treated as categorical variables.

- For our model we will consider the variables "AirBags" and "Type".

- We aim to find out whether there is a significant association between
the type of car sold and the type of air bags it has.

- If an association is observed, we can estimate which types of cars can
sell better with what types of air bags.

Chi-Square Test
# Cross-tabulate the two categorical variables.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)

# Perform the Chi-Square test of independence.
print(chisq.test(car.data))

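The result shows a p-value of less than 0.05, which indicates a
statistically significant association between air bag type and car type.
The p-value measures the evidence for an association, not its strength.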
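One way to check how far the chi-square approximation can be trusted is to look at the expected cell counts, sketched below. The follow-up with simulate.p.value is an optional alternative that is not part of the original slides, and the seed and B values are arbitrary choices.

```r
library(MASS)

tab <- table(Cars93$AirBags, Cars93$Type)

# chisq.test warns when expected counts are small; suppress it here
# because the expected counts are inspected explicitly below
res <- suppressWarnings(chisq.test(tab))

print(round(res$expected, 1))          # expected counts under independence
any_small <- any(res$expected < 5)     # common rule of thumb: all cells >= 5
print(any_small)

# Alternative: Monte Carlo p-value that does not rely on the approximation
set.seed(123)
res_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(res_sim$p.value)
```

If any_small is TRUE, the simulated p-value (or merging sparse categories) is usually the safer basis for the conclusion.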

Unit III Sessions

Regression Analysis - Basics


Examples - Marketing Models
Linear Modeling - Examples
Regression -Base Model 1 & 2
SAMPLE SKETCHES
[Figure: two sample sketches of curves for b<0, b=0, 0<b<1, b=1 and b>1]
Linear Modeling – Example 1
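# Read the data set from a CSV file chosen interactively
df1 <- read.csv(file.choose())
# Visualization
boxplot(df1[3:9])
boxplot(df1[11:13])
boxplot(df1$POC1)
# Simple Linear Regression, DV: Closing Price, IV: Opening Price
# (column names are case-sensitive and must match the CSV header exactly)
reg1 <- lm(close ~ OPEN, data = df1)
summary(reg1)
# IV: Low Price
reg2 <- lm(close ~ LOW, data = df1)
summary(reg2)
# IV: High Price
reg3 <- lm(close ~ HIGH, data = df1)
summary(reg3)
# IV: ltp
reg4 <- lm(close ~ ltp, data = df1)
summary(reg4)
# IV: vwap
reg5 <- lm(close ~ vwap, data = df1)
summary(reg5)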

Statistical Inferences
Intercept
Beta
P-value
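Significance codes in R's summary output:
*** - statistically significant at the 0.1 percent level (p < 0.001)
** - statistically significant at the 1 percent level (p < 0.01)
* - statistically significant at the 5 percent level (p < 0.05)
. - statistically significant at the 10 percent level (p < 0.1)
R2 Value
The higher the R2, the better the model fits (0.70 and above is good);
an R2 of 0.50 to 0.70 is acceptable (application specific).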
Residual Error
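The items listed above can be read programmatically from a fitted model, as in this sketch. The data are simulated, and the true intercept (2) and slope (0.8) are hypothetical values chosen for illustration.

```r
set.seed(10)
x <- rnorm(100)
y <- 2 + 0.8 * x + rnorm(100, sd = 0.5)   # hypothetical true model

fit <- lm(y ~ x)
s   <- summary(fit)

coefs     <- s$coefficients                   # Estimate, Std. Error, t value, Pr(>|t|)
intercept <- coefs["(Intercept)", "Estimate"]
beta      <- coefs["x", "Estimate"]
p_beta    <- coefs["x", "Pr(>|t|)"]
r2        <- s$r.squared
rse       <- s$sigma                          # residual standard error

print(c(intercept = intercept, beta = beta, p_beta = p_beta, r2 = r2, rse = rse))
```

The printed summary(fit) shows the same numbers together with the significance stars.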

Learning Objective
To demonstrate the concept of regression using R.
FINANCE DATA + REGRESSION + R Software

17 G. Dhananjhay & V. Senthil 4/10/23
Basic idea of Regression



Use data to identify relationships among
variables and use these relationships to make
predictions.

Francis Galton

Finance Background
We study CAPM (Capital Asset Pricing Model) for this exercise

Market Represented by NIFTY-50 is independent variable (IV)

Stock represented by TCS is Dependent Variable (DV)

Regression Model

Shall we swap the variables ?


- Apply Common sense for variable selection

Data
Recent FY daily equity data

Daily Closing Prices of TCS, NIFTY50

Source : - www.nseindia.com

Returns are calculated in R and multiplied by 100 to express them as a percentage of change (POC)
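A minimal sketch of one common way to compute daily percentage returns from closing prices, assuming simple (not log) returns. The price vector is hypothetical, and the slides do not state which return definition was used.

```r
# Hypothetical daily closing prices
close_price <- c(100, 102, 101, 105)

# Simple return in percent: (P_t - P_{t-1}) / P_{t-1} * 100
poc <- diff(close_price) / head(close_price, -1) * 100

print(round(poc, 2))
```

The result has one element fewer than the price series, because the first day has no previous price to compare with.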

Regression using R : one more example
Step by Step Approach for Practice
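Step 1. Download the tcsnifty50POC_1year data file

Step 2. File Menu -> New Script

data <- read.csv(file.choose())

# The data frame is called data; column names are case-sensitive
# and must match the CSV header exactly
reg1 <- lm(tcsPOC ~ NIFTYPOC, data = data)

summary(reg1)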

Linear Modeling output

Confidence Interval & PLOT

confint(reg1,level=0.99)
confint(reg1,level=0.90)
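To illustrate how the confidence level affects the interval, here is a sketch on simulated market and stock returns. The data, the true beta of 0.9, and the names market and stock are all hypothetical stand-ins for the NIFTY-50 and TCS series.

```r
set.seed(7)
market <- rnorm(250, mean = 0.05, sd = 1)               # simulated market returns (%)
stock  <- 0.02 + 0.9 * market + rnorm(250, sd = 0.8)    # simulated stock returns (%)

fit <- lm(stock ~ market)

ci90 <- confint(fit, level = 0.90)
ci99 <- confint(fit, level = 0.99)

# Width of the interval for the slope (beta) at each level
width90 <- unname(ci90["market", 2] - ci90["market", 1])
width99 <- unname(ci99["market", 2] - ci99["market", 1])

print(ci90)
print(ci99)
print(c(width90 = width90, width99 = width99))
```

A higher confidence level always gives a wider interval, so the 99% interval contains the 90% one.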

VISUALIZATION 1
par(mfrow=c(2,2))
plot(reg1)

VISUALIZATION 2
par(mfrow=c(1,2))
# Column names are case-sensitive and must match the CSV header exactly
hist(data$TCSPOC)
hist(data$NIFTY50POC)

What is your inference? Who is doing well? Why?


Try it Exercise – Assignment 1
Download Your Company & Sector
Regression using POC (Minimum 3 SLR required)
Plot all POCs
Assignment content
1. Data set (% of change, cleaned)
2. LM Summary output
3. Linear Regression Equation(s)
4. Plot & Boxplot output
5. Inferences

Session 10

MLR - Marketing Example


Sales and Advertisement
- We’ll use the marketing data set [datarium package], which
contains the impact of the amount of money spent on three
advertising media (youtube, facebook and newspaper) on sales.

- sales = b0 + b1*youtube + b2*facebook + b3*newspaper

install.packages("datarium")
library(datarium)

data("marketing", package = "datarium")

head(marketing, 4)
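The model equation above could be fitted as in the sketch below. To keep the example self-contained it uses simulated data in a data frame called ads rather than the datarium data, and the true coefficients are hypothetical; with the real data one would pass data = marketing instead.

```r
set.seed(2023)
n <- 200

# Simulated advertising budgets (hypothetical ranges)
ads <- data.frame(
  youtube   = runif(n, 0, 350),
  facebook  = runif(n, 0, 60),
  newspaper = runif(n, 0, 130)
)

# Hypothetical true model: newspaper has no effect on sales
ads$sales <- 3.5 + 0.045 * ads$youtube + 0.19 * ads$facebook + rnorm(n, sd = 2)

# Model 1: all three predictors
model1 <- lm(sales ~ youtube + facebook + newspaper, data = ads)
print(summary(model1)$coefficients)

# Model 2: drop the newspaper predictor
model2 <- update(model1, . ~ . - newspaper)
print(coef(model2))
```

The update call refits the model with one term removed, which is a convenient way to move from a full model to a reduced one.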

MLR Model 1 output

Model 1 - Interpretation
1. In our example, it can be seen that p-value of the F-statistic is < 2.2e-16, which is highly
significant. This means that, at least, one of the predictor variables is significantly related to
the outcome variable.

2. It can be seen that changes in the youtube and facebook advertising budgets are
significantly associated with changes in sales, while changes in the newspaper budget
are not.

3. For example, for a fixed amount of youtube and newspaper advertising budget, spending an
additional 1 000 dollars on facebook advertising is associated with an increase in sales of
approximately 0.1885*1000 = 189 sale units, on average.

4. The youtube coefficient suggests that for every 1 000 dollars increase in youtube advertising
budget, holding all other predictors constant, we can expect an increase of 0.045*1000 = 45
sales units, on average.

5. We found that newspaper is not significant in the multiple regression model. This means
that, for a fixed amount of youtube and facebook advertising budget, changes in the
newspaper advertising budget will not significantly affect sales units.
MLR Model 2 - Output

Model 2 - Interpretation
Finally, our model equation can be written as follows:
sales = 3.5 + 0.045*youtube + 0.187*facebook

Model accuracy assessment
The overall quality of the model can be assessed by examining the R-squared (R2)
and Residual Standard Error (RSE).

R-squared:
In multiple linear regression, R is the correlation coefficient between the
observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y,
and R2 is its square. For this reason, the value of R will always be positive and will
range from zero to one.
R2 represents the proportion of variance, in the outcome variable y, that may be
predicted by knowing the value of the x variables. An R2 value close to 1 indicates
that the model explains a large portion of the variance in the outcome variable.
A problem with the R2 is that it never decreases when more variables are added to
the model, even if those variables are only weakly associated with the response. A
solution is to adjust the R2 by taking into account the number of predictor variables.
The “Adjusted R Square” value in the summary output is a
correction for the number of x variables included in the prediction model.
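These relationships can be checked numerically, as in the sketch below. The data are simulated, the variable noise is a deliberately irrelevant predictor, and the adjusted R2 is computed with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where p is the number of predictors.

```r
set.seed(5)
n     <- 120
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)                      # unrelated to y by construction
y     <- 1 + 2 * x1 - x2 + rnorm(n)    # hypothetical true model

fit_small <- lm(y ~ x1 + x2)
fit_big   <- lm(y ~ x1 + x2 + noise)

# R2 equals the squared correlation between observed and fitted values
r2     <- summary(fit_small)$r.squared
r2_cor <- cor(y, fitted(fit_small))^2

# Adjusted R2 by hand, with p = 2 predictors
p          <- 2
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, r2_cor = r2_cor))
print(c(adj_manual = adj_manual, adj_summary = summary(fit_small)$adj.r.squared))

# Adding an irrelevant predictor cannot lower R2, but adjusted R2 may drop
print(c(r2_big = summary(fit_big)$r.squared,
        adj_big = summary(fit_big)$adj.r.squared))
```

Comparing the adjusted values of the two models is a fairer guide than comparing the raw R2 values.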

Residual Standard Error (RSE), or sigma:
- The RSE estimate gives a measure of error of prediction. The
lower the RSE, the more accurate the model.
- The error rate can be estimated by dividing the RSE by the mean
of the outcome variable:
sigma(model1)/mean(marketing$sales)
## [1] 0.120
- In our multiple regression example, the RSE is 2.023,
corresponding to a 12% error rate.
sigma(model2)/mean(marketing$sales)
## [1] 0.119
- For model 2 the error rate is again about 12%.
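As a sketch of what sigma() returns, the RSE can be reproduced from the residuals: it is the square root of the residual sum of squares divided by the residual degrees of freedom. The data below are simulated and the coefficients are hypothetical.

```r
set.seed(11)
x <- runif(150, 0, 100)
y <- 10 + 0.3 * x + rnorm(150, sd = 2)   # hypothetical true model

fit <- lm(y ~ x)

rse        <- sigma(fit)
rse_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# Error rate: RSE relative to the mean of the outcome variable
error_rate <- rse / mean(y)

print(c(rse = rse, rse_manual = rse_manual, error_rate = error_rate))
```

Expressing the RSE relative to the mean outcome makes it comparable across outcomes measured on different scales.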
