
Introduction to Data Science
Module 2
Week 2
Review of Descriptive and Inferential Statistics
Data Processing and Visualization with R
Module Objectives
At the end of this module, students must be able to:
1. Differentiate the two areas of statistics: descriptive and
inferential;
2. Perform simple linear regression in Excel and in R along with
pertinent visual output; and
3. Perform multiple linear regression in Excel and in R along with
pertinent visual output.
Statistics Refresher

DESCRIPTIVE STATISTICS
• Collection
• Organization
• Presentation

INFERENTIAL STATISTICS
• Draw conclusions for a larger group/data
• Determine relationships
• Make predictions
Statistics Refresher

STATISTICS
• DESCRIPTIVE
• Probability
• INFERENTIAL
  - Estimation: Point, Interval
  - Hypothesis Testing
The Process of Statistics

POPULATION → SAMPLE (Sampling Theory)
SAMPLE → STATISTIC (Descriptive Statistics)
STATISTIC → PARAMETER (Inferential Statistics)
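The flow from population to sample to statistic and back to the parameter can be sketched in a few lines of Python (used here only for illustration; the module itself works in Excel and R). The population below is simulated, and the interval uses a rough normal approximation; all names and numbers are assumptions, not data from the module.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in cm (not module data)
population = [random.gauss(165, 8) for _ in range(10_000)]

# PARAMETER: a summary of the whole population (usually unknown in practice)
mu = statistics.mean(population)

# Sampling: draw a simple random sample without replacement
sample = random.sample(population, 50)

# STATISTIC: the same summary, computed from the sample only (descriptive step)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

# Inferential step: a rough 95% interval for mu using only the sample
half_width = 1.96 * s / len(sample) ** 0.5
interval = (x_bar - half_width, x_bar + half_width)

print(f"parameter mu     = {mu:.2f}")
print(f"statistic x_bar  = {x_bar:.2f}")
print(f"approx. interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```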
Stat Refresher: Regression Analysis

Regression Analysis:
• Statistical technique used most frequently to analyze the
relationship between two or more variables.
• At least two variables need to be continuous.
• Deals with the way one variable tends to change as one or
more other variables change.
Example

• Input the data
• Create a scatter plot
• Add a trend line
When to use regression?
Regression analysis is used to describe the relationship between:
• A single response variable Y; and
• One or more predictor variables: $X_1, X_2, \dots, X_p$
p = 1 : Simple regression
p > 1 : Multiple regression
Examples:
• how sales (Y) vary with advertising expenditures (X)
• how quantity demanded (Y) varies with prices (X)
• the relationship between corporate profit (Y) and R&D spending (X)
The Variables
Response Variable
- The response variable Y must be a continuous variable.

Predictor Variables
- The predictors $X_1, X_2, \dots, X_p$ can be continuous, discrete, or
categorical variables.
Initial EDA
Prior to any regression modelling, the data should always be
inspected for:
• Data-entry errors
• Missing values
• Outliers
• Unusual (e.g., asymmetric) distributions
• Changes in variability
• Clustering
• Non-linear bivariate relationships
• Unexpected patterns
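Two of the checks above, missing values and outliers, can be sketched in plain Python. The records are made up for illustration (one missing weight and one likely data-entry error), and the 1.5 × IQR rule is just one common convention for flagging outliers, not a rule prescribed by the module.

```python
import statistics

# Hypothetical records (height in cm, weight in kg); None marks a missing value
rows = [
    (150, 52.0), (155, 55.5), (160, None), (165, 61.0),
    (170, 64.5), (175, 68.0), (180, 720.0), (185, 75.5),
]

# Missing values: indices of rows that contain at least one None
missing = [i for i, row in enumerate(rows) if any(v is None for v in row)]

# Keep only complete rows for the remaining checks
complete = [row for row in rows if all(v is not None for v in row)]
weights = [w for _, w in complete]

# Outliers: flag weights outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, _, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [row for row in complete if not low <= row[1] <= high]

print("rows with missing values:", missing)
print("possible outliers:", outliers)
```

A flagged point is a prompt to inspect the record, not an automatic reason to delete it.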
Simple Linear Regression

The Variables
X : explanatory variable (horizontal axis)
Y : response variable (vertical axis)
After data collection, we have pairs of observations:
$(X_1, Y_1), \dots, (X_n, Y_n)$
Sample Data 1
Variables: X (Height), Y (Weight)

• We want to be able to describe weight as a linear function of height.
Sample Data 1
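$\text{Weight} \approx \beta_0 + \beta_1 \cdot \text{Height}$
The regression of variable Y on variable X is given by:
$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \dots, n$
where:
• Random Error : $\epsilon_i \sim N(0, \sigma^2)$, independent and identically distributed
• Linear Function : $\beta_0 + \beta_1 X_i = E(Y \mid X = X_i)$
Unknown Parameters:
• $\beta_0$ (intercept) : point at which the line intercepts the y-axis (the expected value of Y when X = 0);
• $\beta_1$ (slope) : expected change in Y per unit increase in X.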
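To make the model concrete, the sketch below simulates weights from it in plain Python. The parameter values and the height grid are illustrative assumptions chosen for readability; they are not estimates from Sample Data 1.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumed, illustrative parameter values (NOT estimated from the module's data)
beta0, beta1, sigma = -100.0, 1.0, 5.0

# X_1, ..., X_n: hypothetical heights in cm, treated as fixed
heights = [150 + 2 * i for i in range(20)]


def conditional_mean(x):
    """Linear function of the model: E(Y | X = x) = beta0 + beta1 * x."""
    return beta0 + beta1 * x


# Y_i = beta0 + beta1 * X_i + eps_i, with eps_i drawn independently from N(0, sigma^2)
weights = [conditional_mean(x) + random.gauss(0, sigma) for x in heights]

for x, y in list(zip(heights, weights))[:3]:
    print(f"height={x}  E(Y|X)={conditional_mean(x):.1f}  simulated weight={y:.1f}")
```

Each simulated weight scatters around the line, which is why the estimation step that follows works with fitted values and residuals rather than exact equality.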
Estimation of Unknown Parameters I
We want to find the equation of the line that “best” fits the data. This
means finding $b_0$ and $b_1$ such that the fitted values of $y_i$, given by:
$\hat{y}_i = b_0 + b_1 x_i$
are as “close” as possible to the observed values $y_i$.
Residuals
The difference between the observed value $y_i$ and the fitted value
$\hat{y}_i$ is called the residual and is given by:
$e_i = y_i - \hat{y}_i$
Estimation of Unknown Parameters II
Least Squares Method
A usual way of calculating $b_0$ and $b_1$ is based on the minimization
of the sum of the squared residuals, or residual sum of squares
(RSS):
$RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$
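For simple linear regression, minimizing the RSS has the well-known closed-form solution b1 = Sxy / Sxx and b0 = ȳ − b1·x̄. The following is a minimal Python sketch of that calculation. The five (height, weight) pairs are made up for illustration and are not Sample Data 1; `fit_slr` and `rss` are hypothetical helper names.

```python
# Hypothetical (height in cm, weight in kg) pairs, for illustration only
x = [150, 160, 170, 180, 190]
y = [52, 58, 67, 74, 79]


def fit_slr(x, y):
    """Return (b0, b1) that minimize the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy: sum of cross-deviations; Sxx: sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def rss(b0, b1, x, y):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_slr(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, RSS = {rss(b0, b1, x, y):.2f}")
```

Any other line, for example one with a slightly different slope, gives a larger RSS on the same data, which is what “best” means under this criterion.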
Sample Data 1
Multiple Regression
From SLR to MLR
• It is not often the case that the dependent variable is explained by
exactly one variable.
• We use multiple regression to attempt to predict the dependent
variable using more than one independent variable.
• Multiple regressions can be linear or nonlinear. We use
Multiple Linear Regression for explanation, prediction, and
inference.
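With more than one predictor, the least-squares coefficients can be obtained by solving the normal equations (XᵀX)b = Xᵀy. The sketch below does this in plain Python with a small Gaussian-elimination helper. The data are hypothetical and generated exactly from y = 1 + 2·x1 + 3·x2, so the recovered coefficients can be checked by eye; `solve` and `fit_mlr` are illustrative names, and statistical software would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(rows, y):
    """Return least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical predictors (x1, x2) and a response built from known coefficients
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coef = fit_mlr(rows, y)
print([round(c, 6) for c in coef])
```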
Sample Data 1
Example: Advertising
1. Perform SLR on each predictor variable.
2. Interpret the results.
3. Perform MLR.
4. Interpret the results.
Predicting Values in MLR
1. What are the predicted sales when TV = 115, Radio = 45,
Newspaper = 41?

2. What are the predicted sales when TV = 195, Radio = 62,
Newspaper = 155?
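Once the MLR coefficients are available, a prediction is the intercept plus each coefficient times its predictor value. The sketch below shows only the mechanics: the coefficient values are placeholders, NOT the fitted Advertising coefficients, and must be replaced with the output of your own fitted model before the numbers mean anything. `predict_sales` is a hypothetical helper name.

```python
# PLACEHOLDER coefficients for illustration only; replace with the values
# from your own fitted model. These are NOT results for the Advertising data.
placeholder_coef = {"intercept": 1.0, "TV": 0.1, "Radio": 0.1, "Newspaper": 0.1}


def predict_sales(coef, **predictors):
    """Return intercept + sum of coefficient * value over the given predictors."""
    return coef["intercept"] + sum(coef[name] * value for name, value in predictors.items())


# The two scenarios from the questions above
scenario_1 = predict_sales(placeholder_coef, TV=115, Radio=45, Newspaper=41)
scenario_2 = predict_sales(placeholder_coef, TV=195, Radio=62, Newspaper=155)

print(f"scenario 1 (placeholder coefficients): {scenario_1:.2f}")
print(f"scenario 2 (placeholder coefficients): {scenario_2:.2f}")
```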
