
What Is Statistical Modeling?

Statistical modeling is like a formal depiction of a theory. It is
typically described as the mathematical relationship between random and
non-random variables.
Statistical modeling helps you differentiate between reasonable and
dubious conclusions based on quantitative evidence.
Analyses and predictions made with sound statistical methods are far
more trustworthy than informal judgments.
A statistician can help investigators avoid various analytical traps
along the way.

Statistical modeling techniques


Data gathering is the foundation of statistical modeling. The data may
come from the cloud, spreadsheets, databases, or other sources. There
are two categories of statistical modeling methods used in data
analysis. These are:

1. Supervised learning
In the supervised learning model, the algorithm uses a labeled data set
for learning, with an answer key the algorithm uses to determine
accuracy as it trains on the data. Supervised learning techniques in
statistical modeling include:

Regression model: A predictive model designed to analyze the
relationship between independent and dependent variables. The most
common regression models are logistic, polynomial, and linear. These
models are used to estimate relationships between variables, for
forecasting, and for modeling.
Classification model: An algorithm that analyzes and classifies a large
and complex set of data points. Common models include decision trees,
Naive Bayes, nearest neighbors, random forests, and neural network
models. (A brief sketch of both model types follows below.)
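
As a rough illustration of both supervised techniques, here is a minimal R
sketch using R's built-in datasets; the cars and iris data, the variable
names, and the rpart package are illustrative assumptions, not part of the
text above.

# Regression model: predict stopping distance from speed (built-in 'cars' data).
reg_model <- lm(dist ~ speed, data = cars)
summary(reg_model)                                    # slope, intercept, p values
predict(reg_model, newdata = data.frame(speed = 15))  # predicted distance at speed 15

# Classification model: predict species from flower measurements (built-in 'iris' data).
library(rpart)                                        # decision trees; assumed to be installed
tree_model <- rpart(Species ~ ., data = iris, method = "class")
predict(tree_model, newdata = iris[1, 1:4], type = "class")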

2. Unsupervised learning
In the unsupervised learning model, the algorithm is given unlabeled
data and attempts to extract features and determine patterns
independently. Clustering algorithms and association rules are examples
of unsupervised learning. Here are two examples:

K-means clustering: The algorithm partitions the data points into a
specified number (k) of groupings based on similarities.

Reinforcement learning: Often treated as a third category in its own
right rather than strictly unsupervised learning, this technique involves
training the algorithm over many attempts, often using deep learning,
rewarding moves that result in favorable outcomes and penalizing actions
that produce undesired effects.
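
Of these, k-means clustering is straightforward to try directly in R; the
following is a minimal sketch on the built-in iris measurements, with the
choice of data and of k = 3 purely illustrative.

# K-means sketch: partition the iris measurements into k = 3 clusters.
set.seed(42)                            # k-means starts from random centers
features <- scale(iris[, 1:4])          # standardize so no variable dominates
km <- kmeans(features, centers = 3, nstart = 25)
km$size                                 # number of points in each cluster
table(km$cluster, iris$Species)         # compare clusters with the known species labels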

Machine learning vs. statistical modeling


Statistics and machine learning (ML) differ primarily in their purposes.
ML models are built to make accurate predictions without explicit
programming, while statistical models are built to explain the
relationships between variables.

However, some statistical models make less accurate predictions because
they cannot capture complex relationships in the data, even though they
can still be used to predict. ML predictions are often more accurate, but
they are also more challenging to understand and explain.

In statistical models, probabilistic models for the data are specified,
and the effects of predictor variables are identified and interpreted. A
statistical model establishes the scale, magnitude, and significance of
relationships between variables. Models based on machine learning are
more empirical.

Reasons for learning statistical modeling


a) Even though data scientists are usually responsible for developing
algorithms and models, analysts may also use statistical models in
their work from time to time. As a result, analysts seeking to excel
should gain a solid grasp of the factors that contribute to the
success of these models.
b) Companies and organizations are leveraging statistical modeling to
make predictions based on data to keep pace with the explosive
growth of machine learning and artificial intelligence. The following
are some benefits of understanding statistical modeling.

Choosing models that meet your needs


A data analyst needs a comprehensive understanding of all the
statistical models available. You should identify which model is most
appropriate for your data and which model best addresses the question
at hand.

Improved data preparation for analysis


Raw data is rarely ready for analysis. Data must be clean before
conducting accurate and viable research. The cleanup process usually
involves organizing the collected information and removing "bad or
incomplete data" from the sample.

To build a good statistical model, you need to explore and understand
the data. If the data is not good enough, you can't draw any meaningful
inferences. Knowing how different statistical models work and how they
leverage data will enable you to determine what data is most relevant to
the questions you are trying to answer.
Enhanced communication skills
Most organizations require data analysts to present their findings to
two different audiences. The first, the business team, is not interested
in the details of your analysis but wants to know the main conclusions.
The second group is interested in the granular details; these people want
a summary of your broad findings and an explanation of how you reached
them.
An understanding of statistical modeling can help you communicate
effectively with both audiences. You will generate better data
visualizations and share complex ideas with non-analysts. You will create
and explain those more granular details when necessary with a deeper
understanding of how these models work on the backend.

Job opportunities
You'll find that statistical data analysis skills are in demand for data
science positions, particularly those that involve machine learning.
Interviewers may ask you to solve some typical statistics problems.
With a proper background in statistics and math, it is possible to
optimize linear regression models and understand how decision trees
calculate impurity at each node. These are some of the top reasons
machine learning needs statistics. Taking online courses on statistics can
get you started.

Linear Regression Models and the Least Squares Line of Best Fit
Imagine you have some points, and want to have a line that best fits
them like this:

Temp Sales
12 200
14 200
16 300
18 400
20 400
22 500
23 550
25 600

We can place the line "by eye": try to have the line as close as possible
to all points, and a similar number of points above and below the line.

But for better accuracy let's see how to calculate the line using Least
Squares Regression.

The Line
Our aim is to calculate the values m (slope) and b (y-intercept) in
the equation of a line :
y = mx + b
Where:
 y = how far up
 x = how far along
 m = Slope or Gradient (how steep the line is)
 b = the Y Intercept (where the line crosses the Y axis)
Steps
To find the line of best fit for N points:
Step 1: For each (x,y) point calculate x² and xy
Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ
means "sum up")
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
(N is the number of points.)
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
Step 5: Assemble the equation of a line
y = mx + b
Done!
Example
Let's have an example to see how to do it!

Example: Sam found how many hours of sunshine vs how many ice
creams were sold at the shop from Monday to Friday:

"x" Hours of Sunshine    "y" Ice Creams Sold
2                        4
3                        5
5                        7
7                        10
9                        15
Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b

Step 1: For each (x,y) calculate x² and xy:

x    y    x²    xy
2    4    4     8
3    5    9     15
5    7    25    35
7    10   49    70
9    15   81    135

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):

x    y    x²    xy
2    4    4     8
3    5    9     15
5    7    25    35
7    10   49    70
9    15   81    135
Σx: 26   Σy: 41   Σx²: 168   Σxy: 263
Also N (number of data values) = 5
Step 3: Calculate Slope m:
m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)
  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)
  = (1315 − 1066) / (840 − 676)
  = 249 / 164 = 1.5183...
Step 4: Calculate Intercept b:
b = (Σy − m Σx) / N
  = (41 − 1.5183 × 26) / 5
  = 0.3049...
Step 5: Assemble the equation of a line:
y = mx + b
y = 1.518x + 0.305
Let's see how it works out:
x y y = 1.518x + 0.305 error
2 4 3.34 −0.66
3 5 4.86 −0.14
5 7 7.89 0.89
7 10 10.93 0.93
9 15 13.97 −1.03
Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:

Nice fit!

Sam hears the weather forecast which says "we expect 8 hours of sun
tomorrow", so he uses the above equation to estimate that he will sell
y = 1.518 x 8 + 0.305 = 12.45 Ice Creams
Sam makes fresh waffle cone mixture for 14 ice creams just in case.
Yum.
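
If you prefer to let software do the arithmetic, the following R sketch
applies exactly the formulas from the steps above to Sam's data and then
repeats the prediction for 8 hours of sunshine.

# Least squares "by hand" in R, using Sam's sunshine / ice-cream data.
x <- c(2, 3, 5, 7, 9)     # hours of sunshine
y <- c(4, 5, 7, 10, 15)   # ice creams sold
N <- length(x)

m <- (N * sum(x * y) - sum(x) * sum(y)) / (N * sum(x^2) - sum(x)^2)
b <- (sum(y) - m * sum(x)) / N
c(slope = m, intercept = b)   # approximately 1.518 and 0.305

m * 8 + b                     # predicted sales for 8 hours of sun, about 12.45

coef(lm(y ~ x))               # R's built-in lm() gives the same coefficients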
How does it work?
It works by making the total of the square of the errors as small as
possible (that is why it is called "least squares"):

The straight line minimizes the sum of squared errors


So, when we square each of those errors and add them all up, the total
is as small as possible.
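In symbols, the quantity being minimized is the sum of squared errors for
the line y = mx + b (the same m and b as above):

SSE(m, b) = Σ (y − (mx + b))²   (summed over all N points)

Setting the partial derivatives of SSE with respect to m and b to zero and
solving gives exactly the formulas for m and b used in the steps above.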
You can imagine (but not accurately) each data point connected to a
straight bar by springs.
Regression models
They are used to describe relationships between variables by fitting a
line to the observed data. Regression allows you to estimate how a
dependent variable changes as the independent variable(s) change.
Multiple linear regression is used to estimate the relationship
between two or more independent variables and one dependent
variable, unlike simple linear regression, which uses a single
independent variable.
You can use multiple linear regression when you want to know:
1. How strong the relationship is between two or more independent
variables and one dependent variable (e.g. how rainfall,
temperature, and amount of fertilizer added affect crop growth).
2. The value of the dependent variable at a certain value of the
independent variables (e.g. the expected yield of a crop at certain
levels of rainfall, temperature, and fertilizer addition).

Multiple linear regression example


You are a public health researcher interested in social factors that
influence heart disease. You survey 500 towns and gather data on the
percentage of people in each town who smoke, the percentage of people
in each town who bike to work, and the percentage of people in each
town who have heart disease.
Because you have two independent variables and one dependent
variable, and all your variables are quantitative, you can use multiple
linear regression to analyze the relationship between them.

Assumptions of multiple linear regression


Multiple linear regression makes all of the same assumptions as simple
linear regression:
1. Homogeneity of variance (homoscedasticity): the size of the
error in our prediction doesn’t change significantly across the
values of the independent variable.
2. Independence of observations: the observations in the dataset
were collected using statistically valid sampling methods, and
there are no hidden relationships among variables.
In multiple linear regression, it is possible that some of the
independent variables are actually correlated with one another, so it is
important to check these before developing the regression model. If
two independent variables are too highly correlated (r² > ~0.6), then
only one of them should be used in the regression model (a quick check
is sketched after this list).
3. Normality: The data follows a normal distribution.
4. Linearity: the line of best fit through the data points is a
straight line, rather than a curve or some sort of grouping factor.
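
A quick way to run the correlation check in practice is sketched below; the
heart.data set and its biking and smoking variables are borrowed from the R
example later in this document, so treat the exact names as an assumption.

# Check whether two candidate independent variables are too highly correlated.
cor(heart.data$biking, heart.data$smoking)     # correlation r; near 0 is reassuring
cor(heart.data$biking, heart.data$smoking)^2   # r², to compare against the ~0.6 rule of thumb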

Multiple regression models


Most introductions to regression discuss the simple case of two
variables measured on continuous scales, where the aim is to
investigate the influence of one variable on another. It is useful to
begin with the familiar simple regression before discussing multiple
regression.
Regression analysis helps us to answer questions like:
 Does the amount Healthtex spends per month on training its
sales force affect its monthly sales?
 Is the number of square feet in a home related to the cost of
renting the home?
 In a study of fuel efficiency, is there a relationship between
miles per gallon and the weight of a car?
 Does the number of hours that students study for an exam
influence the exam score?
In regression analysis we use the independent variable (X) to estimate
the dependent variable (Y). The relationship between the variables is
linear and both variables must be at least of interval scale. The least
squares criterion is used to determine the equation. LEAST SQUARES
PRINCIPLE: the regression equation is determined by minimizing the sum
of the squares of the vertical distances between the actual Y values
and the predicted values of Y. The general simple regression equation
is of the form:
𝑌̂ = 𝛽0 + 𝛽1 𝑋
Where 𝑌̂ (read 𝑌 hat) is the estimated value of the 𝑌 variable for a
selected 𝑋 value.
𝛽0 is the 𝑌 intercept. It is the estimated value of 𝑌 when 𝑋 = 0.
Another way to put it is: 𝛽0 is the estimated value of 𝑌 where the
regression line crosses the 𝑌 𝑎𝑥𝑖𝑠 when 𝑋 is zero.
𝛽1 is the slope of the line, or the average change in 𝑌̂ for each change
of one unit (either increase or decrease) in the independent variable
𝑋. 𝑋 is any value of the independent variable that is selected.

Assumptions Underlying Linear Regression


 For each value of X, there is a group of Y values
 Y values are normally distributed. The means of these normal
distributions of Y values all lie on the straight line of
regression.
 The standard deviations of these normal distributions are
equal.
 The Y values are statistically independent. This means that in
the selection of a sample, the Y values chosen for a particular
X value do not depend on the Y values for any other X values.

Figure 1: Regression Equation


Illustration
Suppose we are interested in describing the decline with age of
forced expiratory volume in one second (FEV1) in non-smokers and
that data on both variables has been gathered from a cross-
sectional sample of a population. A statistical analysis might begin
with a scatter plot of the data (see fig 2).

Figure 2: Relationship between FEV1 and age in 160 male non-smokers
Then a model of the relationship in the population would be proposed,
where the model is specified in form of an equation. The choice of the
model form should ideally be dictated by subject matter knowledge,
biological plausibility, and the data. Suppose a linear relationship is
proposed; then the model would have the general form:

𝐹𝐸𝑉1 = 𝛽0 + 𝛽1 ∙ 𝐴𝑔𝑒 + 𝜀    (Model 1)

The three unknown quantities in this model, 𝛽0, 𝛽1, and 𝜀, would then be
estimated or quantified in the analysis. The model ignoring 𝜀 (by
setting it equal to zero) is a description of the relationship between
age and the mean FEV1 among people of a given age. The term 𝜀 is a
random component assumed to vary from person to person. Inclusion
of this term in the model allows for the fact that people of the same
age are not all the same: their individual FEV1 values will vary about
the mean for that age. Random variation is unpredictable but, overall,
it can be described by a statistical distribution. With continuous
variables such as FEV1, the random component is often assumed to
have a Normal distribution with a mean of zero. Table 7 shows a
typical software output from fitting model 1 to the data. It includes
95% confidence intervals (CI) for 𝛽0 and 𝛽1 and 𝑝 values from
significance tests. In each test, the null hypothesis is that the true
value of the coefficient is zero. If 𝛽1 was zero, then age would have
no effect on FEV1. Here, the test and the 95% (CI) strongly suggest
that 𝛽1 is negative.

Estimation of model 1 coefficients from data in fig 2: typical software output

FEV1       Coefficient   Std error   t statistic   Probability   95% CI
Age        −0.0301       0.0032      −9.52         <0.001        −0.0363 to −0.0238
Constant   5.5803        0.1440      38.75         <0.001        5.2960 to 5.8647

Root mean square error, that is, SD(𝜀) = 0.464 litres.

The data in fig 2 give estimates of 5.58 litres for 𝛽0, −0.03
litres/year for 𝛽1, and 0.46 litres for SD(𝜀). Therefore the “fitted”
model is: FEV1 = 5.58 − 0.03 × age + 𝜀.
Model 1 is an example of a linear model: it assumes that mean FEV1
declines by a fixed amount (estimated as 30 ml) for every year of age.
It is important to realise that linearity was assumed, not proven: the
statistical analysis merely estimates the coefficients of an assumed
model. We could have proposed a more complicated model equation,
for example, quadratic or exponential, and then estimated its
coefficients. The process of estimation does not tell us which model
form, if any, is right. However, there are a range of post-estimation,
regression “diagnostic methods” to help with this task, for example,
“analysis of residuals” and “leverage” statistics, which highlight
discrepancies between the data and the assumed model form.
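
As a practical sketch of fitting and checking a model like model 1 in R
(assuming a data frame named fev with columns FEV1 and Age, since the data
behind fig 2 are not reproduced here):

# Fit model 1 and inspect the usual regression diagnostics.
model1 <- lm(FEV1 ~ Age, data = fev)
summary(model1)     # coefficients, standard errors, t statistics, p values
confint(model1)     # 95% confidence intervals for the intercept and slope
plot(model1)        # residual and leverage plots for checking the assumed model form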

Activity

Suppose we want to assess the association between BMI and systolic blood
pressure using data from an exam attended by a total of n=3,539 participants, where
their mean systolic blood pressure was 127.3 with a standard deviation of 19.0. The mean
BMI in the sample was 28.2 with a standard deviation of 5.3. A simple linear regression
analysis reveals the following:
Independent Variable   Regression Coefficient   t-statistic   P-value
Intercept              108.28                   62.61         0.0001
BMI                    0.67                     11.06         0.0001

Fit a simple linear regression and interpret the coefficients.

Solution

The simple linear regression model is:

𝑌̂ = 108.28 + 0.67(𝐵𝑀𝐼)

Where 𝑌̂ is the predicted (or expected) systolic blood pressure. The regression coefficient
associated with BMI is 0.67 suggesting that each one unit increase in BMI is associated with
a 0.67 unit increase in systolic blood pressure. The association between BMI and systolic
blood pressure is also statistically significant (p=0.0001).
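
To make the interpretation concrete, the fitted equation can be evaluated
at chosen BMI values; the BMIs below are arbitrary examples (28.2 is the
sample mean).

# Predicted systolic blood pressure at example BMI values.
bmi <- c(25, 28.2, 30)
108.28 + 0.67 * bmi   # roughly 125.0, 127.2, and 128.4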

Multiple Regression Analysis


If we increase the number of explanatory variables, we obtain the general
multiple regression with k independent variables, given by:

𝑌̂ = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + ⋯ + 𝛽𝑘 𝑋𝑘

Figure 3: Regression Plane for a 2-Independent Variable Linear Regression Equation
From our previous example, other factors besides age are known to
affect FEV1, for example, height and number of cigarettes smoked per
day. Regression models can be easily extended to include these and any
other determinants of lung function. Model 2 includes height and
cigarettes. It assumes that each has a linear relationship with FEV1 and
assumes that the joint effect of the three factors together is the sum
of their separate effects:
𝐹𝐸𝑉1 = 𝛽0 + 𝛽1 𝑎𝑔𝑒 + 𝛽2 ℎ𝑒𝑖𝑔ℎ𝑡 + 𝛽3 𝑐𝑖𝑔𝑎𝑟𝑒𝑡𝑡𝑒𝑠 + 𝜀    (Model 2)
A standard statistical analysis based on this model and data would
produce estimates of 𝛽0 , 𝛽1 , 𝛽2 , 𝛽3 and SD( 𝜀 ), as well as 95% CIs and
“null hypothesis” tests for each coefficient.
Activity

Suppose we now want to assess whether age (a continuous
variable, measured in years), male gender (yes/no), and treatment for
hypertension (yes/no) are potential confounders of the association
between BMI and systolic blood pressure, and if so, appropriately
account for these using multiple linear regression analysis. For analytic
purposes, treatment for hypertension is coded as 1=yes and 0=no.
purposes, treatment for hypertension is coded as 1=yes and 0=no.
Gender is coded as 1=male and 0=female. A multiple regression analysis
reveals the following:
Independent Variable         Regression Coefficient   t-statistic   P-value
Intercept                    68.15                    26.33         0.0001
BMI                          0.58                     10.30         0.0001
Age                          0.65                     20.22         0.0001
Male gender                  0.94                     1.58          0.1133
Treatment for hypertension   6.44                     9.74          0.0001

Discuss these results.


The multiple regression model in this case will be 𝑌̂= 68.15 + 0.58
(BMI) + 0.65 (Age) + 0.94 (Male gender) + 6.44 (Treatment for
hypertension).
Notice that the association between BMI and systolic blood pressure
is smaller (0.58 versus 0.67, seen in the previous example) after
adjustment for age, gender and treatment for hypertension. BMI
remains statistically significantly associated with systolic blood
pressure (p=0.0001), but the magnitude of the association is lower
after adjustment. The regression coefficient decreases by 13%.
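
As an illustration, the fitted multiple regression equation can be
evaluated for a particular (made-up) profile:

# Predicted systolic blood pressure for an illustrative person:
# BMI 28, age 60, male, treated for hypertension.
68.15 + 0.58 * 28 + 0.65 * 60 + 0.94 * 1 + 6.44 * 1   # about 130.8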
Multiple linear regression formula
The formula for a multiple linear regression is:
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑘 𝑋𝑘 + 𝜀
 𝑌 = the predicted value of the dependent variable
 𝛽0 = the y-intercept (value of y when all other parameters are set
to 0)
 𝛽1 𝑋1 = the regression coefficient (𝛽1) of the first independent
variable (𝑋1) (a.k.a. the effect that increasing the value of the
independent variable has on the predicted y value)
 … = do the same for however many independent variables you are
testing
 𝛽𝑘 𝑋𝑘 = the regression coefficient of the last independent
variable
 𝜀 = model error (a.k.a. how much variation there is in our estimate
of 𝑌)
To find the best-fit line for each independent variable, multiple linear
regression calculates three things:
 The regression coefficients that lead to the smallest overall
model error.
 The t statistic of the overall model.
 The associated p value (how likely it is that the t statistic would
have occurred by chance if the null hypothesis of no relationship
between the independent and dependent variables was true).
It then calculates the t statistic and p value for each regression
coefficient in the model.

Multiple linear regression in R


While it is possible to do multiple linear regression by hand, it is much
more commonly done via statistical software. We are going to use R for
our examples because it is free, powerful, and widely available.
Download the sample dataset to try it yourself.
Load the heart.data dataset into your R environment and run the
following code:
R code for multiple linear regression
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
This code takes the data set heart.data and calculates the effect that
the independent variables biking and smoking have on the dependent
variable heart disease using the equation for the linear model: lm().
Learn more by following the full step-by-step guide to linear regression
in R.
Interpreting the results
To view the results of the model, you can use the summary() function:
summary(heart.disease.lm)
This function takes the most important parameters from the linear
model and puts them into a table; its main components are described below.
The summary first prints out the formula (‘Call’), then the model
residuals (‘Residuals’). If the residuals are roughly centered around
zero with similar spread on either side, as these are (median 0.03,
and min and max around −2 and 2), then the model probably fits the
assumption of homoscedasticity.
Next are the regression coefficients of the model (‘Coefficients’). Row
1 of the coefficients table is labeled (Intercept) – this is the y-
intercept of the regression equation. It’s helpful to know the
estimated intercept in order to plug it into the regression equation and
predict values of the dependent variable:
heart disease = 15 + (-0.2*biking) + (0.178*smoking) ± e
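
For example, with these rounded coefficients you can compute a rough
prediction directly; the biking and smoking values below are arbitrary, and
the predict() call uses the fitted model object from earlier, which is the
more reliable route.

# Rough prediction at 5% biking and 10% smoking, from the rounded coefficients.
15 + (-0.2 * 5) + (0.178 * 10)   # about 15.8 percent heart disease

# The same idea using the fitted model object:
predict(heart.disease.lm, newdata = data.frame(biking = 5, smoking = 10))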
The most important things to note in this output table are the next
two tables – the estimates for the independent variables.
The Estimate column is the estimated effect, also called
the regression coefficient. The estimates in the table tell
us that for every one percent increase in biking to work there is an
associated 0.2 percent decrease in heart disease, and that for every
one percent increase in smoking there is an associated 0.178 percent
increase in heart disease.
The Std.error column displays the standard error of the estimate.
This number shows how much variation there is around the estimates
of the regression coefficient.
The t value column displays the test statistic. Unless otherwise
specified, the test statistic used in linear regression is the t value
from a two-sided t test. The larger the test statistic, the less likely it
is that the results occurred by chance.
The Pr( > | t | ) column shows the p value. This shows how likely the
calculated t value would have occurred by chance if the null hypothesis
of no effect of the parameter were true.
Because these values are so low (p < 0.001 in both cases), we can reject
the null hypothesis and conclude that both biking to work and smoking
likely influence rates of heart disease.
Presenting the results
When reporting your results, include the estimated effect (i.e. the
regression coefficient), the standard error of the estimate, and
the p value. You should also interpret your numbers to make it clear to
your readers what the regression coefficient means.
In our survey of 500 towns, we found significant relationships between
the frequency of biking to work and the frequency of heart disease
and the frequency of smoking and frequency of heart disease (p < 0.001
for each). Specifically, we found a 0.2% decrease (± 0.0014) in the
frequency of heart disease for every 1% increase in biking, and a
0.178% increase (± 0.0035) in the frequency of heart disease for every
1% increase in smoking.

Visualizing the results in a graph


It can also be helpful to include a graph with your results. Multiple
linear regression is somewhat more complicated than simple linear
regression, because there are more parameters than will fit on a two-
dimensional plot.
However, there are ways to display your results that include the
effects of multiple independent variables on the dependent variable,
even though only one independent variable can actually be plotted on
the x-axis.
Here, we have calculated the predicted values of the dependent
variable (heart disease) across the full range of observed values for
the percentage of people biking to work.
To include the effect of smoking on the dependent variable, we
calculated these predicted values while holding smoking constant at the
minimum, mean, and maximum observed rates of smoking.
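
One way to produce such a plot in R is sketched below; it reuses the
heart.data set and the fitted heart.disease.lm model from earlier, and the
use of ggplot2 is an illustrative choice rather than something prescribed
by the text.

# Predicted heart disease across the observed range of biking, holding
# smoking at its minimum, mean, and maximum observed values.
library(ggplot2)

plot_data <- expand.grid(
  biking  = seq(min(heart.data$biking), max(heart.data$biking), length.out = 100),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking))
)
plot_data$predicted <- predict(heart.disease.lm, newdata = plot_data)

ggplot(heart.data, aes(x = biking, y = heart.disease)) +
  geom_point(alpha = 0.3) +
  geom_line(data = plot_data,
            aes(y = predicted, group = smoking, colour = factor(round(smoking, 1)))) +
  labs(x = "% biking to work", y = "% with heart disease", colour = "% smoking")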
