Regression Analysis (1722021)

What is Regression Analysis?

Regression analysis is a set of statistical methods used for the estimation of
relationships between a dependent variable and one or more independent
variables. It can be used to assess the strength of the relationship between
variables and to model or predict the future relationship between them.
An independent variable is an input, assumption, or driver that is changed in
order to assess its impact on a dependent variable (the outcome). In other
words, the independent variable is the input and the dependent variable is the
output. In financial modeling and analysis, an analyst typically varies such
inputs when performing sensitivity analysis.
Regression analysis includes several variations, such as simple linear, multiple
linear, and nonlinear. The most common models are simple linear and multiple
linear. Nonlinear regression analysis is commonly used for more complicated
data sets in which the dependent and independent variables show a nonlinear
relationship.
Regression analysis has numerous applications in various disciplines,
including finance, engineering, management, and sports (VAR).
Regression Analysis – Linear model assumptions
Linear regression analysis is based on six fundamental assumptions:
1. The relationship between the dependent and independent variables is linear,
and is described by a slope and an intercept.
2. The independent variable is not random.
3. The expected value (mean) of the residual (error) is zero.
4. The variance of the residual (error) is constant across all observations.
5. The residuals (errors) are not correlated across observations.
6. The residual (error) values follow a normal distribution.
Regression Analysis – Simple linear regression
Simple linear regression is a model that assesses the relationship between a
dependent variable and one independent variable. The simple linear model is
expressed using the following equation:
Y = a + bX + ϵ
Where:
Y – dependent variable
X – independent (explanatory) variable
a – intercept
b – slope
ϵ – residual (error)
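As a minimal sketch of how the intercept a and slope b can be estimated, the Python snippet below fits Y = a + bX by ordinary least squares. The function name fit_simple_linear and the data are assumptions made for illustration, and the data are chosen to lie exactly on Y = 1 + 2X so that every residual is zero.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of Y = a + b*X."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X (not taken from the article).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_simple_linear(xs, ys)
# The residual is the part of each observation the line does not explain.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print("intercept:", a, "slope:", b)
```

With real data the residuals would not be zero; the fitted line is the one that makes their squared sum as small as possible.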
Regression Analysis – Multiple linear regression
Multiple linear regression analysis is essentially similar to the simple linear
model, with the exception that multiple independent variables are used in the
model. The mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – dependent variable
X1, X2, X3 – independent (explanatory) variables
a – intercept
b, c, d – slopes
ϵ – residual (error)
Multiple linear regression follows the same conditions as the simple linear
model. However, since there are several independent variables in multiple linear
analysis, there is one additional condition for the model:
• Non-collinearity: Independent variables should show a minimum of
correlation with each other. If the independent variables are highly
correlated with each other, it will be difficult to assess the true relationships
between the dependent and independent variables.
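The sketch below shows one way to estimate the coefficients of Y = a + bX1 + cX2 and to check the non-collinearity condition by looking at the correlation between the independent variables. It solves the least-squares normal equations with a small hand-written solver so that it needs no external libraries. All function names and the data are assumptions made for illustration; in practice a statistics package would be used.

```python
from math import sqrt


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_multiple_linear(X, y):
    """Least-squares coefficients [a, b, c, ...] for Y = a + b*X1 + c*X2 + ..."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 column gives the intercept
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


# Hypothetical, noise-free data generated from Y = 1 + 2*X1 + 3*X2.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

coeffs = fit_multiple_linear(X, y)
r_x1_x2 = pearson([r[0] for r in X], [r[1] for r in X])
print("coefficients:", coeffs)
print("correlation between X1 and X2:", r_x1_x2)
```

A correlation between X1 and X2 that is close to 1 or -1 is the warning sign described above: the individual slopes become hard to separate once the data contain noise.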
Regression analysis in finance
Regression analysis has several applications in finance. For example, the
statistical method is fundamental to the Capital Asset Pricing Model (CAPM),
a model that describes the relationship between the expected return and the
risk of a security. The CAPM formula states that the return of a security equals
the risk-free return plus a risk premium (Rs = Rfr + Rp), where the risk premium
is based on the beta of that security. Essentially, the CAPM equation is a model
that determines the relationship between the expected return of an asset and
the market risk premium.
The analysis is also used to forecast the returns of securities based on different
factors, or to forecast the performance of a business.
1. Beta and CAPM
In finance, regression analysis is used to calculate the beta of a stock. The beta (β) of an
investment security (i.e. a stock) is a measurement of its volatility of returns
relative to the entire market. It is used as a measure of risk and is an integral
part of the Capital Asset Pricing Model (CAPM). A company with a higher beta
has greater risk and also greater expected returns.
Beta can be calculated in Excel using the SLOPE function. The SLOPE function is
categorized under Excel Statistical functions. It returns the slope of the linear
regression line through the data points in known_y's and known_x's:
= SLOPE(known_y's, known_x's)
To calculate beta, use the stock's returns as known_y's and the market's returns
as known_x's.
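To make the beta calculation concrete, here is a small Python sketch of the same slope computation that a spreadsheet SLOPE call performs, followed by the standard CAPM expected-return formula. The function names, the return series, and the risk-free and market rates are all assumptions made for illustration; the stock returns are constructed so that the true beta is 1.5.

```python
def slope(known_ys, known_xs):
    """Slope of the least-squares line through (x, y) pairs."""
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    return sxy / sxx


def capm_expected_return(risk_free, beta, market_return):
    """Risk-free return plus beta times the market risk premium."""
    return risk_free + beta * (market_return - risk_free)


# Hypothetical periodic returns; the stock is built to have a beta of 1.5.
market_returns = [0.01, -0.02, 0.03, 0.015, -0.01]
stock_returns = [0.002 + 1.5 * m for m in market_returns]

# Beta: stock returns are the y values, market returns are the x values.
beta = slope(stock_returns, market_returns)

# Hypothetical inputs: 2% risk-free rate, 8% expected market return.
expected = capm_expected_return(0.02, beta, 0.08)
print("beta:", beta, "expected return:", expected)
```

The argument order matters: swapping the two series gives the slope of the market on the stock, which is not beta.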

2. Forecasting Revenues and Expenses
Financial forecasting is the process of estimating or predicting how a business
will perform in the future. The most common type of financial forecast is an
income statement; however, in a complete financial model all three statements
are forecast. When forecasting financial statements, it may be useful to do a
multiple regression analysis to determine how changes in certain assumptions or
drivers of the business will impact revenue or expenses in the future. For
example, there may be a very high correlation between the number of salespeople
employed by a company, the number of stores they operate, and the revenue the
business generates.
Excel's FORECAST function can be used for this kind of analysis, for example to
calculate a company’s revenue based on the number of ads it runs. The FORECAST
function is categorized under Excel Statistical functions. It calculates or
predicts a future value using existing values.
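A linear forecast of this kind fits a straight line to the known values and reads the prediction off that line. The Python sketch below illustrates the idea; the function name forecast, its argument order, and the ads and revenue figures are assumptions made for illustration and are not taken from the original example.

```python
def forecast(x, known_ys, known_xs):
    """Predict y at x from the least-squares line through the known points."""
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(known_xs, known_ys))
    sxx = sum((xi - x_bar) ** 2 for xi in known_xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a + b * x


# Hypothetical history: revenue is exactly 50 + 10 * ads in every period.
ads = [10, 20, 30, 40]
revenue = [150, 250, 350, 450]

predicted_revenue = forecast(50, revenue, ads)
print("predicted revenue for 50 ads:", predicted_revenue)
```

Predicting for 50 ads extrapolates beyond the observed range of 10 to 40, so the result is only as good as the assumption that the same linear relationship continues to hold.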
© 2015 to 2019 CFI Education Inc.
Statistics How To
Statistics for the rest of us!
Regression Analysis: Step by Step Articles, Videos, Simple Definitions
A simple linear regression plot for amount of rainfall.
Regression analysis is used in stats to find trends in data. For example, you might guess that there’s a
connection between how much you eat and how much you weigh; regression analysis can help you quantify
that. Regression analysis will provide you with an equation for a graph so that you can make predictions about
your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll
weigh in ten years’ time if you continue to put on weight at the same rate. It will also give you a slew of statistics
(including a p-value and a correlation coefficient) to tell you how accurate your model is. Most elementary stats
courses cover very basic techniques, like making scatter plots and performing linear regression. However, you
may come across more advanced techniques like multiple regression.
Contents:
1. Introduction to Regression Analysis
2. Multiple Regression Analysis
3. Overfitting and how to avoid it
4. Related articles
Technology:
1. Regression in Minitab
Regression Analysis: An Introduction

In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of it. For example,
global warming may be reducing average snowfall in your town and you are asked to predict how much snow
you think will fall this year. Looking at the following table you might guess somewhere around 10-20 inches.
That’s a good guess, but you could make a better guess, by using regression.
[Image: table of snowfall by year]
Essentially, regression is the “best guess” at using a set of data to make some kind of prediction. It’s fitting a set
of points to a graph. There’s a whole host of tools that can run regression for you, including Excel, which I used
here to help make sense of that snowfall data:
[Image: snowfall data plotted by year with a fitted regression line]
Just by looking at the regression line running down through the data, you can fine tune your best guess a bit.
You can see that the original guess (20 inches or so) was way off. For 2015, it looks like the line will be
somewhere between 5 and 10 inches! That might be “good enough”, but regression also gives you a useful
equation, which for this chart is:
y = -2.2923x + 4624.4.
What that means is you can plug in an x value (the year) and get a pretty good estimate of snowfall for any year.
For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches for that
year.
Best of all, you can use the equation to make predictions. For example, how much snow will fall in 2017?
y = -2.2923(2017) + 4624.4 = 0.8 inches.
Regression also gives you an R squared value, which for this graph is 0.702. This number tells you how good
your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a perfect model. As you
can probably see, 0.7 is a fairly decent model so you can be fairly confident in your weather prediction!
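The arithmetic above is easy to check in code. This short Python sketch plugs years into the chart's equation y = -2.2923x + 4624.4; the function name predicted_snowfall is an assumption made for illustration. It also computes the year at which the line reaches zero, which shows where this straight-line model stops making physical sense.

```python
SLOPE = -2.2923       # inches of snowfall per year, from the chart's equation
INTERCEPT = 4624.4


def predicted_snowfall(year):
    """Snowfall in inches predicted by y = -2.2923x + 4624.4."""
    return SLOPE * year + INTERCEPT


for year in (2005, 2015, 2017):
    print(year, round(predicted_snowfall(year), 4))

# Year at which the fitted line crosses zero inches of snowfall.
zero_year = -INTERCEPT / SLOPE
print("line reaches zero around:", zero_year)
```

Beyond the zero crossing the equation predicts negative snowfall, so predictions far outside the range of the original data should be treated with caution.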
Multiple Regression Analysis

Multiple regression analysis is used to see if there is a statistically significant relationship between sets of
variables. It’s used to find trends in those sets of data.
Multiple regression analysis is almost the same as simple linear regression. The only difference between simple
linear regression and multiple regression is in the number of predictors (“x” variables) used in the regression.
• Simple regression analysis uses a single “x” variable for each dependent “y” variable. For example: (x1, Y1).
• Multiple regression uses multiple “x” variables for each dependent “y” variable. For example: ((x1)1, (x2)1, (x3)1, Y1).
In one-variable linear regression, you would input one dependent variable (e.g. “profit”) against one independent
variable (e.g. “sales”). But you might be interested in how different types of sales affect the regression. You
could set your X1 as one type of sales, your X2 as another type of sales and so on.
When to Use Multiple Regression Analysis.

Ordinary linear regression usually isn’t enough to take into account all of the real-life factors that have an effect
on an outcome. For example, the following graph plots a single variable (number of doctors) against another
variable (life-expectancy of women).
[Graph: number of doctors plotted against life expectancy of women]
Image: Columbia University
From this graph it might appear there is a relationship between life-expectancy of women and the number of
doctors in the population. In fact, that’s probably true and you could say it’s a simple fix: put more doctors into
the population to increase life expectancy. But the reality is you would have to look at other factors like the
possibility that doctors in rural areas might have less education or experience. Or perhaps they have a lack of
access to medical facilities like trauma centers.
The addition of those extra factors would cause you to add additional independent variables to your regression
analysis and create a multiple regression analysis model.
Multiple Regression Analysis Output.

Regression analysis is almost always performed in software, like Excel or SPSS. The output differs according to how
many variables you have but it’s essentially the same type of output you would find in a simple linear
regression. There’s just more of it:
• Simple regression: Y = b0 + b1 x.
• Multiple regression: Y = b0 + b1 x1 + b2 x2 + … + bn xn.
The output would include a summary, similar to a summary for simple linear regression, that includes:
• R (the multiple correlation coefficient),
• R squared (the coefficient of determination),
• Adjusted R squared,
• The standard error of the estimate.
These statistics help you figure out how well a regression model fits the data. The ANOVA table in the output
would give you the p-value and F-statistic.
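The summary statistics listed above can all be computed from the observed values, the fitted values, and the number of predictors. The Python sketch below shows the usual formulas, assuming the fitted values come from a least-squares fit that includes an intercept. The function name fit_summary and the data are assumptions made for illustration; the fitted values shown are those of the least-squares line y = 2.2 + 0.6x through x = 1 to 5.

```python
from math import sqrt


def fit_summary(y, y_hat, n_predictors):
    """R, R squared, adjusted R squared and standard error of the estimate."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    df_resid = n - n_predictors - 1
    r_squared = 1 - ss_res / ss_tot
    return {
        "R": sqrt(r_squared),
        "R squared": r_squared,
        "adjusted R squared": 1 - (1 - r_squared) * (n - 1) / df_resid,
        "standard error": sqrt(ss_res / df_resid),
    }


# Hypothetical observations and their least-squares fitted values.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

summary = fit_summary(y, y_hat, n_predictors=1)
for name, value in summary.items():
    print(name, round(value, 4))
```

Adjusted R squared is always below R squared for an imperfect fit, and the gap widens as predictors are added without a matching improvement in fit.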
Minimum Sample size

“The answer to the sample size question appears to depend in part on the objectives
of the researcher, the research questions that are being addressed, and the type of
model being utilized. Although there are several research articles and textbooks giving
recommendations for minimum sample sizes for multiple regression, few agree
on how large is large enough and not many address the prediction side of MLR.” ~ Gregory T. Knofczynski
If you’re concerned with finding accurate values for the squared multiple correlation coefficient, minimizing the
shrinkage of the squared multiple correlation coefficient, or have another specific goal, Gregory Knofczynski’s
paper is a worthwhile read and comes with lots of references for further study. That said, many people just want
to run MLR to get a general idea of trends and they don’t need very specific estimates. If that’s the case, you can
use a rule of thumb. It’s widely stated in the literature that you should have more than 100 items in your
sample. While this is sometimes adequate, you’ll be on the safer side if you have at least 200 observations or,
better yet, more than 400.
Overfitting in Regression
Overfitting can lead to a poor model for your data.
Overfitting is where your model is too complex for your data. It typically happens when your sample size is too
small for the number of terms in the model. If you put enough predictor variables in your regression model, you
will nearly always get a model that looks significant.
While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test
samples or the overall population. The model’s p-values, R-squared and regression coefficients can all be
misleading. Basically, you’re asking too much from a small set of data.
How to Avoid Overfitting

In linear modeling (including multiple regression), you should have at least 10-15 observations for each term
you are trying to estimate. Any fewer than that, and you run the risk of overfitting your model.
“Terms” include:
• Interaction effects,
• Polynomial expressions (for modeling curved lines),
• Predictor variables.
While this rule of thumb is generally accepted, Green (1991) takes this a step further and suggests that the
minimum sample size for any regression should be 50, with an additional 8 observations per term. For example,
if you have one interaction term and three predictor variables (four terms in total), you’ll need around 40-60
items in your sample to avoid overfitting by the 10-15 rule, or 50 + 4(8) = 82 items according to Green (Green’s rule).
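Both rules of thumb are simple enough to write down as functions. The Python sketch below does so for a hypothetical model with five terms; the function names and the five-term example are assumptions made for illustration, and the rules themselves are only rough guides.

```python
def sample_range_10_to_15(n_terms):
    """Minimum sample-size range under the 10-15 observations per term rule."""
    return 10 * n_terms, 15 * n_terms


def sample_size_green(n_terms):
    """Green's rule as described above: 50 plus 8 observations per term."""
    return 50 + 8 * n_terms


# Hypothetical model: 3 predictors, 1 interaction, 1 polynomial term.
n_terms = 5
low, high = sample_range_10_to_15(n_terms)
green = sample_size_green(n_terms)
print("10-15 rule:", low, "to", high, "observations")
print("Green's rule:", green, "observations")
```

For small models Green's rule is the stricter of the two because of its fixed base of 50 observations.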
Exceptions
There are exceptions to the “10-15” rule of thumb. They include:
1. When there is multicollinearity in your data, or if the effect size is small. If that’s the case, you’ll need
more observations per term (although there is, unfortunately, no rule of thumb for how many more to add!).
2. You may be able to get away with 10 or fewer observations per predictor if you are using
logistic regression or survival models, as long as you don’t have extreme event probabilities, small effect
sizes, or predictor variables with truncated ranges (Peduzzi et al.).
How to Detect and Avoid Overfitting

The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can’t do
that, the second option is to reduce the number of predictors in your model — either by combining or
eliminating them. Factor Analysis is one method you can use to identify related predictors that might be
candidates for combining (effects of interacting variables and their weak/strong impacts in models).
1. Cross-Validation
Use cross validation to detect overfitting: this partitions your data, generalizes your model, and chooses the
model which works best. One form of cross-validation is predicted R-squared. Most good statistical software
will include this statistic, which is calculated by:
 Removing one observation at a time from your data,

 Estimating the regression equation for each iteration,
 Using the regression equation to predict the removed observation.
Cross validation isn’t a magic cure for small data sets though, and sometimes a clear model isn’t identified even
with an adequate sample size.
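The three steps behind predicted R-squared can be sketched in a few lines of Python for the simple linear case. Each observation is removed in turn, the line is refitted on the rest, and the removed point is predicted; the squared prediction errors are then compared with the total variation. The function names and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def r_squared(xs, ys):
    """Ordinary R squared of the line fitted to all observations."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def predicted_r_squared(xs, ys):
    """Leave-one-out predicted R squared."""
    y_bar = sum(ys) / len(ys)
    press = 0.0  # sum of squared leave-one-out prediction errors
    for i in range(len(xs)):
        # Remove observation i and refit on the remaining data.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Predict the removed observation and record the squared error.
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - press / ss_tot


# Hypothetical small data set.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

ordinary = r_squared(xs, ys)
predicted = predicted_r_squared(xs, ys)
print("R squared:", ordinary, "predicted R squared:", predicted)
```

A predicted R-squared that is much lower than the ordinary R-squared suggests the model fits the sample better than it would fit new data, which is the signature of overfitting.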
2. Shrinkage & Resampling

Shrinkage and resampling techniques can help you to find out how well your model might
fit a new sample.
3. Automated Methods
Automated stepwise regression shouldn’t be used as an overfitting solution for small data sets. According to
Babyak (2004),
“The problems with automated selection conducted in this very typical manner are so
numerous that it would be hard to catalogue all of them [in a journal article].”
Babyak also recommends avoiding univariate pretesting or screening (a “variation of automated selection in
disguise”), dichotomizing continuous variables — which can dramatically increase Type I errors, or multiple
testing of confounding variables (although this may be ok if used judiciously).
References
1. Babyak, M.A. (2004). “What you see may not be what you get: a brief, nontechnical introduction to
overfitting in regression-type models.” Psychosomatic Medicine 66(3):411–21.
2. Green, S.B. (1991). “How many subjects does it take to do a regression analysis?” Multivariate
Behavioral Research 26:499–510.
3. Peduzzi, P.N., et al. (1995). “The importance of events per independent variable in multivariable
analysis, II: accuracy and precision of regression estimates.” Journal of Clinical Epidemiology 48:1503–10.
4. Peduzzi, P.N., et al. (1996). “A simulation study of the number of events per variable in logistic
regression analysis.” Journal of Clinical Epidemiology 49:1373–9.
More articles
1. How to Construct a Scatter Plot.
2. How to Calculate Pearson’s Correlation Coefficients.
3. How to Compute a Linear Regression Test Value.
4. Chow Test for Split Data Sets
5. Forward Selection
6. What is Kriging?
7. How to Find a Regression Slope Intercept.
8. How to Find a Linear Regression Slope.
9. How to Find the Standard Error of Regression Slope.
10. Validity Coefficient: What it is and how to find it.
11. Quadratic Regression.
12. Stepwise Regression
13. Unstandardized Coefficient
14. Weak Instruments
Definitions
1. ANCOVA.
2. Assumptions and Conditions for Regression.
3. Betas / Standardized Coefficients.
4. What is a Beta Weight?
5. Bilinear Regression
6. The Breusch-Pagan-Godfrey Test
7. Cook’s Distance.
8. What is a Covariate?
9. Cox Regression.
10. Detrend Data.
11. Exogeneity.
12. Gauss-Newton Algorithm.
13. What is the General Linear Model?
14. What is the Generalized Linear Model?
15. What is the Hausman Test?
16. What is Homoscedasticity?
17. Influential Data.
18. What is an Instrumental Variable?
19. Lack of Fit
20. Lasso Regression.
21. Levenberg–Marquardt Algorithm
22. What is the Line of best fit?
23. What is Logistic Regression?
24. What is the Mahalanobis distance?
25. Model Misspecification.
26. Multinomial Logistic Regression.
27. What is Nonlinear Regression?
28. Ordered Logit / Ordered Logistic Regression
29. What is Ordinary Least Squares Regression?
30. Overfitting.
31. Parsimonious Models.
32. What is Pearson’s Correlation Coefficient?
33. Poisson Regression.
34. Probit Model.
35. What is a Prediction Interval?
36. What is Regularization?
37. Regularized Least Squares.
38. Regularized Regression
39. What are Relative Weights?
40. What are Residual Plots?
41. Reverse Causality.
42. Ridge Regression
43. Root Mean Square Error.
44. Semiparametric models
45. Simultaneity Bias.
46. Simultaneous Equations Model.
47. What is Spurious Correlation?
48. Structural Equations Model
49. What are Tolerance Intervals?
50. Trend Analysis
51. Tuning Parameter
52. What is Weighted Least Squares Regression?
53. Y Hat explained.
Regression in Minitab
Regression is fitting data to a line (Minitab can also perform other types of regression, like quadratic
regression). When you find regression in Minitab, you’ll get a scatter plot of your data along with the line of
best fit, plus Minitab will provide you with:
1. Standard error of the regression, S (how much the data points typically deviate from the fitted line).
2. R squared: a value between 0 and 1 which tells you how well your data points fit the model.
3. Adjusted R squared (adjusts R squared for the number of predictors in the model).
Regression in Minitab takes just a couple of clicks from the toolbar and is accessed through the Stat menu.
Example question: Find regression in Minitab for the following set of data points that compare calories
consumed per day to weight:
Calories consumed daily (Weight in lb): 2800 (140), 2810 (143), 2805 (144), 2705 (145), 3000 (155), 2500
(130), 2400 (121), 2100 (100), 2000 (99), 2350 (120), 2400 (121), 3000 (155).
Step 1: Type your data into two columns in Minitab.
Step 2: Click “Stat,” then click “Regression” and then click “Fitted Line Plot.”
Regression in Minitab selection.
Step 3: Click a variable name for the dependent value in the left-hand window. For this sample question, we
want to know if calories consumed affects weight, so calories is the independent variable (X) and weight is the
dependent variable (Y). Click “Weight” and then click “Select.”
Step 4: Repeat Step 3 for the independent X variable, calories.
Selecting variables for Minitab regression.
Step 5: Click “OK.” Minitab will create a regression line graph in a separate window.
Step 6: Read the results. As well as creating a regression graph, Minitab will give you values for S, R-sq and
R-sq(adj) in the top right corner of the fitted line plot window.
S = standard error of the regression.
R-Sq = Coefficient of Determination.
R-Sq(adj) = Adjusted Coefficient of Determination (Adjusted R Squared).
That’s it!
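For readers without Minitab, the same quantities can be computed directly. The Python sketch below fits a line to the twelve data points from the example question, treating weight as the response and calories as the predictor (since the question asks whether calories affect weight), and then computes S, R-Sq and R-Sq(adj) using the standard formulas for a one-predictor model. The variable names are assumptions made for illustration, and the output is not claimed to match Minitab's display format.

```python
from math import sqrt

# Data from the example question: calories consumed daily and weight in lb.
calories = [2800, 2810, 2805, 2705, 3000, 2500, 2400, 2100, 2000, 2350, 2400, 3000]
weight = [140, 143, 144, 145, 155, 130, 121, 100, 99, 120, 121, 155]

n = len(calories)
x_bar = sum(calories) / n
y_bar = sum(weight) / n

sxx = sum((x - x_bar) ** 2 for x in calories)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(calories, weight))

slope = sxy / sxx                  # change in weight per extra calorie
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in calories]

ss_res = sum((y - f) ** 2 for y, f in zip(weight, fitted))
ss_tot = sum((y - y_bar) ** 2 for y in weight)

s = sqrt(ss_res / (n - 2))                        # standard error of the regression
r_sq = 1 - ss_res / ss_tot                        # coefficient of determination
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)     # adjusted for one predictor

print("weight =", round(intercept, 3), "+", round(slope, 5), "* calories")
print("S =", round(s, 3), " R-Sq =", round(r_sq, 4), " R-Sq(adj) =", round(r_sq_adj, 4))
```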
Copyright ©2019 Statistics How To
Scatter Plot / Scatter Chart: Definition, Examples, Excel/TI-83/TI-89/SPSS
Contents:
 What is a Scatter Plot?

 Scatter Graphs and Correlation
 What is a 3D Scatter Plot?
 What is a Bubble Chart?
 How to Make a Scatter Plot:
1. By hand
2. Excel
3. Matlab
4. Minitab
5. SPSS
6. TI-89
7. TI-83
What is a Scatter Plot?

Scatter plots (also called scatter graphs) are similar to line graphs. A line graph uses a line on an X-Y
axis to plot a continuous function, while a scatter plot uses dots to represent individual pieces of data.
In statistics, these plots are useful to see if two variables are related to each other. For example, a
scatter chart can suggest a linear relationship (i.e. a straight line).
Scatter plot suggesting a linear relationship.
Scatter plots are also called scatter graphs, scatter charts, scatter diagrams and scattergrams.
Correlation in Scatter Plots
The relationship between variables is called correlation. Correlation is just another word for
“relationship.” For example, how much you weigh is related (correlated) to how much you eat. There
are two types of correlation: positive and negative. If the data points trend upward from low x and y
values to high x and y values, the variables are positively correlated, as in the graph above. If the
points start at high y-values and fall to low y-values as x increases, the variables are negatively
correlated.
You can think of positive correlation as something that produces a positive result. For example, the
more you exercise, the better your cardiovascular health. “Positive” doesn’t necessarily mean “good”!
More smoking leads to more chance of cancer and the more you drive, the more likely you are to be
in a car accident.
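One way to make the positive/negative distinction concrete is to look at the sign of the summed co-movement
of the two variables around their means (the numerator of the covariance). The Python sketch below is an
illustration only: the function name and the exercise data are made up for this example and do not come from
a real study.

```python
from statistics import mean


def correlation_direction(xs, ys):
    """Classify the direction of a relationship from the sign of the co-movement sum."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Positive terms: both values above or both below their means.
    co_movement = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if co_movement > 0:
        return "positive"
    if co_movement < 0:
        return "negative"
    return "none"


# Made-up illustration data: more exercise, lower resting heart rate.
hours_exercised = [1, 2, 3, 4, 5]
resting_heart_rate = [80, 76, 73, 69, 66]

print(correlation_direction(hours_exercised, resting_heart_rate))  # negative
```

Note that the direction says nothing about whether the outcome is "good" or "bad"; it only says whether the
two variables move together or in opposite directions.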
3D Scatter Plot
A 3D scatter plot is a scatter plot with three axes. For example, the following 3D scatter plot shows
student scores in three subjects: Reading (y-axis), Writing (x-axis) and Math (z-axis).
Student A scored 100 in Writing and Math and 90 in reading, and student B scored 50 in writing, 30 in
reading and 15 in math. 3D plots are fairly easy to make for a few points, but once you start to get into
larger sets of data, you’ll want to use technology. Unfortunately, Excel doesn’t have an option to
create this chart. Statistical programs commonly available through colleges and universities (like SAS)
can create them. There are quite a few free options available, but I recommend:
 Plotly is an easy way to create a 3D chart online.

 Gnuplot: downloadable program. Easy to use compared to other programs.
 R: Also a download. Has a fairly steep learning curve, but handles most statistical
computations. If you want a general stats package (as opposed to one that will just create charts),
this is the best option.
What is a Bubble Chart?
Bubble plot showing Medicare amounts per service/specialty. Image: CMS.gov.
A bubble chart is a way to show how variables relate to each other. It is similar to a scatter chart, only
instead of dots there are different sized bubbles.
Bubble charts are a good choice if your data has 3 series/characteristics with an associated value; in
other words, you need:
 a category with values for your x-axis,

 a category with values for your y-axis, and
 a category with values for sizing your bubbles.
They are often used for financial purposes and for use with quadrants.
Types of Bubble Chart

In its most basic form, larger bubbles indicate larger values. The placement of the bubble on the
x-axis and y-axis gives you information about what the bubble represents. This chart shows length of
investment (x-axis), price at time of purchase (y-axis) and the relative size of the investment today.

Color coded bubble plots use color to sort the bubbles into categories. For example, I might want to
sort my investment chart into stocks, bonds, and mutual funds:
A cartogram is a bubble plot of a map, where the x-axis and y-axis are longitude and latitude. The
size of the bubble could indicate population, number of oil rigs, natural weather events, or some other
type of geographical data.

The charts are sometimes referred to by dimensions:
 Two-dimensional charts have x-values and y-values only. They are equivalent to a scatter
plot.
 Three-dimensional charts have the x-y axes and bubble size.
 Four-dimensional charts have x-y axes, bubble size and color.
How to Make a Scatter plot: Overview

A simple scatter plot.
A scatter plot gives you a visual idea of what is happening with your data. Scatter plots are similar to
line graphs. The main difference is that a line graph has a continuous line while a scatter plot has a
series of dots. Scatter plots in statistics create the foundation for simple linear regression, where we
take scatter plots and try to create a usable model using functions. In fact, all simple linear regression
is doing is trying to draw the line that best fits those dots.
Make a Scatter plot by Hand

There are just three steps to creating a scatter plot by hand.
Make a Scatter plot: Steps

Example question: create a scatter plot for the following data:
x y
3 25
4.1 25
5 30
6 29
6.1 42
6.3 46
Step 1: Draw a graph. Label the x- and y-axes. Choose a range that includes the maximums and
minimums from the given data. For example, our x-values go from 3 to 6.3, so a range from 3 to 7
would be appropriate.
Step 2: Draw the first point on the graph. Our first point is (3,25).
Step 3: Draw the remaining points on the graph.

That’s it!
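The three steps above can also be sketched in a few lines of Python. This is only an illustration of the
bookkeeping (choosing axis ranges and listing the points to draw); the helper name axis_range is an
assumption made for this example, and rounding outward to whole numbers is just one reasonable way to pick
a range that covers the data.

```python
import math


def axis_range(values):
    """Round the minimum down and the maximum up so the axis covers every value."""
    return math.floor(min(values)), math.ceil(max(values))


# Data from the example question above.
x = [3, 4.1, 5, 6, 6.1, 6.3]
y = [25, 25, 30, 29, 42, 46]

# Step 1: choose axis ranges that include the minimums and maximums.
print(axis_range(x))  # (3, 7)
print(axis_range(y))  # (25, 46)

# Steps 2 and 3: pair the values into (x, y) points, then plot them one by one.
points = list(zip(x, y))
print(points[0])   # (3, 25), the first point to draw
print(points[1:])  # the remaining points
```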
How to Construct a Scatter Plot in Excel

In this section, I’ll cover how to make a scatter plot in Excel plus some advanced options like
formatting your chart, adding labels, and adding a trendline (the linear regression equation).
Steps
Step 1: Type your data into two columns (scroll down to the second example for some screenshots).
Step 2: Click “Insert,” then click “Scatter.”
Step 3: Choose a type of plot. For example, click the first icon (scatter with only markers).
Formatting
Delete the Legend.
Step 1: Right click on the legend, then press “delete.”
Clean up the White Space

Sometimes your markers will be clustered at the top or bottom right of the graph. Here’s how to get rid
of that white space by formatting the horizontal and vertical axes.
Step 1: Click the “Layout” tab, then click “Axes.”
Step 2: Click “Primary Horizontal,” then click “More Primary Horizontal Options.”
Step 3: Click the “Fixed Value” radio button and then type in a value for where you want your
horizontal axis to start. Click “Close.”
Step 4: Repeat Steps 1 to 3, choosing “Vertical” instead of horizontal.
Adding Chart Labels
Excel usually adds labels you don’t want, or leaves out axis labels you do want. To delete unwanted
labels, you can click and delete. Here’s how to add a label:
Step 1: Click the “Layout” tab.
Step 2: Click “Axis Titles” and then click “Primary Horizontal Axis Title.”
Step 3: Choose a position. For example, you may want the title below the axis.
Step 4: Click the text and type in your new label.
Step 5: Repeat Steps 1 to 4, choosing “vertical” for the vertical axis.
Tip: If you don’t like the vertical arrangement of the axis title, right click, then choose “Format Axis Title.”
Click “Alignment” and then pick a text direction (e.g. horizontal).
Adding a Trendline
Step 1: Click the “Layout” tab.
Step 2: Click “Trendline” and then click “More trendline Options.”
Step 3: Click the “Show equation on chart” box and then click “Close.”
Example 2: Create a scatter plot in Microsoft Excel plotting the following data from a study
investigating the relationship between height and weight of pre-diabetic patients:
Height (inches): 72, 71, 70, 67, 65, 64, 64, 63, 62, 60
Weight (lb): 180, 178, 190, 150, 145, 132, 170, 120, 143, 98
Step 1: Type your data into a spreadsheet. For the scatter plot to work correctly, your data must be
entered into two columns. The example below shows data entered for height (column A) and weight
(column B).
Step 2: Highlight your data. To highlight your data, left click at the top left of your data and then drag
the mouse to the bottom right.

Step 3: Click the “Insert” button on the ribbon, then click “Scatter,” then click “Scatter with only
markers.” Microsoft Excel will create a scatter plot from your data and display the graph next to your
data in the spreadsheet.

Tip: If you want to change the data (and therefore your graph), there’s no need to redo the whole
procedure. When you type new data into either column, Microsoft Excel will automatically calculate
the change and instantly display the new graph.
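A linear trendline is an ordinary least-squares line, so the equation Excel displays for the Example 2 data
can be cross-checked outside the spreadsheet. The Python sketch below is a minimal illustration under that
assumption; the function name least_squares_line is invented for this example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Example 2 data: height (inches) and weight (lb).
height = [72, 71, 70, 67, 65, 64, 64, 63, 62, 60]
weight = [180, 178, 190, 150, 145, 132, 170, 120, 143, 98]

slope, intercept = least_squares_line(height, weight)
print(f"weight = {slope:.2f} * height + ({intercept:.2f})")
```

The slope is positive, which matches the upward drift of the points in the scatter plot: taller patients in
this sample tend to weigh more.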
MATLAB Instructions
Use the scatter(X,Y,S,C) function.
 Vectors X and Y must be the same size.
 S is the area of each marker (in points squared). S can be a vector or a scalar. If scalar, all
markers will be the same size.
 C is the marker color.
Scatter plot in Minitab

Step 1: Enter your data into two columns. One column should be the x-variable (the independent
variable) and the second column should be the y-variable (the dependent variable). Make sure you
put a header for your data in the first row in each column — it will make the creation of the scatter plot
easier in Step 4 and Step 5.
Step 2: Click “Graph” on the toolbar and then click “Scatter plot.”
Step 3: Click “Simple” Scatter plot. In most cases, this is the option you’ll use for scatter plots in
elementary statistics. You can choose one of the others (such as the scatter plot with lines), but you’ll
rarely need to use them.
Step 4: Click your y-variable name in the left window, then click “Select” to move that y-variable into
the y-variable box.
Step 5: Click your x-variable name in the left window, then click “Select” to move that x-variable into
the x-variable box.
Step 6: Click “OK” to create the scatter plot in Minitab. The graph will appear in a separate window.
Tip: If you want to change the ticks (the spacing for the x-axis or y-axis), double-click one of the
numbers to open the Edit Scale box, where you can change a variety of options for your scatter plot,
including ticks.
How to Make an SPSS Scatter Plot

IBM SPSS Statistics has several different options for scatter plots: Simple Scatter, Matrix Scatter,
Simple Dot, Overlay Scatter and 3D Scatter. Which type of scatter plot you choose depends mostly
upon how many variables you want to plot:
 A Simple Scatter Plot plots one variable against another.

 A Matrix Scatter Plot plots all possible combinations of two or more numeric variables against
each other.
 A Simple Dot Plot plots one categorical variable or one continuous variable.
 An Overlay Scatterplot plots two or more pairs of variables.
 3D Scatterplots are 3-Dimensional plots of three numeric variables.
How to Make an SPSS Scatter Plot with the Legacy Dialog menu
Step 1: Click “Graphs,” then mouse over “Legacy Dialogs” then click “Scatter/Dot”.

Step 2: Choose a type of Scatter Plot. For this example, click “Simple Scatter.”
Step 3: Click the “Define” button to open the “Simple Scatterplot” window.
Step 4: Click on the variable you want to display on the Y-axis and then click the arrow to the left
of the “Y-Axis” selection box.
Step 5: Click on the variable you want to display on the X-axis and then click the arrow to the left
of the “X-Axis” selection box. Click “OK” to produce the scatterplot.

That’s it!
Tip: You don’t have to select a variable to label cases by, but if you do, its value labels are used as
point labels for the scatter plot. If you don’t select a variable to label cases by, outliers and extremes
can be labeled with case numbers.
Scatter Plot on the TI-89: Overview

Making a scatter plot on the TI-89 involves three phases: Accessing the data matrix editor, inputting
your X and Y values and then graphing the data.
Scatter Plot on the TI-89: Steps:


Example problem: make a scatter plot for the following data: (1,6), (2,8), (3,9), (4,11), and (5,14).
Accessing the Data Matrix Editor

Step 1: Press APPS, then scroll to the “Data/Matrix” editor, press ENTER and then select “new.”
Step 2: Scroll down to “Variable” and type in desired name. For example, type “scatterone”. Note: you
don’t have to press the ALPHA key to access the alpha keypad. Just type!
Step 3: Press ENTER ENTER.
Inputting X and Y Values

Step 1: Enter your X values under the “c1” column. Press ENTER after each entry.
For our list, you would need to press:
1 ENTER
2 ENTER
3 ENTER
4 ENTER
5 ENTER
Step 2: Enter your Y values under the “c2” column (use the arrow keys to scroll to the top of the
column). Press ENTER after each entry.
For our list, you would need to press:
6 ENTER
8 ENTER
9 ENTER
11 ENTER
14 ENTER
Graphing the Data

Step 1: Press F2 for Plot Setup.
Step 2: Press F1.
Step 3: Select “scatter” next to “plot type”
Step 4: Select “box” next to “mark type”
Step 5: Scroll to the “x” box and then press ALPHA ) 1 to enter “c1”.
Step 6: Scroll to the “y” box and then press ALPHA ) 2 to enter “c2”.
Step 7: Press ENTER ENTER.
Step 8: Press the diamond key F3 to view your scatter plot.
Step 9: Press F2 and then press 9 so that the scatter plot will be drawn in the correct window for the
data.
That’s it!
TI 83 Scatter Plot

TI 83 Scatter Plot: Overview

Making a scatter plot on a TI-83 graphing calculator is a breeze with the easy to use LIST menu. In
order to graph a TI 83 scatter plot, you’ll need a set of bivariate data. Bivariate data is data that you
can plot on an XY axis: you’ll need a list of “x” values (for example, weight) and a list of “y” values (for
example, height). The XY values can be in two separate lists, or they can be written as XY
coordinates (x,y). Once you have those, it’s as easy as typing the lists into the calculator, and
choosing your graph.
TI 83 Scatter Plot: Steps

Sample problem: Create a TI 83 scatter plot for the following coordinates (2, 3), (4, 4), (6, 9), (8, 11),
and (10, 12).
Step 1: Press STAT, then press ENTER to enter the lists screen. If you already have data in L1 or L2,
clear the data: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2.
Step 2: Enter your x-variables, one at a time. Follow each number by pressing the ENTER key. For
our list, you would enter:
2 ENTER
4 ENTER
6 ENTER
8 ENTER
10 ENTER
Step 3: Use the arrow keys to scroll across to the next column, L2.
Step 4: Enter your y-variables, one at a time. Follow each number by pressing the enter key. For our
list, you would enter:
3 ENTER
4 ENTER
9 ENTER
11 ENTER
12 ENTER
Step 5: Press 2nd, then press STATPLOT (the Y= key).
Step 6: Press ENTER to enter StatPlots for Plot1.
Step 7: Press ENTER to turn Plot1 “ON.”
Step 8: Arrow down to the next line (“Type”) and highlight the scatter plot (the first image). Press
ENTER.
Step 9: Arrow down to “Xlist.” If “L1” isn’t showing, press 2nd and 1. Arrow down to “Ylist.” If “L2” isn’t
showing, press 2nd and 2.
Step 10: Press ZOOM then 9. This should bring up a scatter plot on your screen.
Tip: Hit TRACE and press the right and left arrow buttons to move from point to point, displaying the
XY values for those points.
That’s how to make a TI 83 Scatter Plot!
Correlation Coefficient: Simple Definition, Formula, Easy Steps
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables.
There are several types of correlation coefficient: Pearson’s correlation (also called Pearson’s R) is a
correlation coefficient commonly used in linear regression. If you’re starting out in statistics, you’ll probably
learn about Pearson’s R first. In fact, when anyone refers to the correlation coefficient, they are usually talking
about Pearson’s.
Contents:
1. What is a correlation coefficient?

2. What is Pearson Correlation? How to Calculate:
 By hand
 TI 83
 Excel
 SPSS
 Minitab
 What do the results mean?
3. Cramer’s V Correlation
4. Where did the Correlation Coefficient Come From?
5. Correlation Coefficient Hypothesis Test.
6. More Articles / Correlation Coefficients
Correlation Coefficient Formula: Definition


Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return
a value between -1 and 1, where:
 1 indicates a perfect positive relationship.
 -1 indicates a perfect negative relationship.
 A result of zero indicates no linear relationship at all.
Graphs showing a correlation of -1, 0 and +1
Meaning
 A correlation coefficient of 1 means that for every increase in one variable, there is an increase of a
fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot
length.
 A correlation coefficient of -1 means that for every increase in one variable, there is a decrease of a
fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect
correlation with speed.
 Zero means that an increase in one variable is not associated with any consistent increase or decrease
in the other. The two have no linear relationship.
The absolute value of the correlation coefficient gives us the relationship strength. The larger the number, the
stronger the relationship. For example, |-.75| = .75, which has a stronger relationship than .65.
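The comparison by absolute value can be written out as a short Python sketch. The function name stronger is
an assumption made for this illustration.

```python
def stronger(r1, r2):
    """Return whichever correlation coefficient indicates the stronger relationship."""
    # Strength is judged by distance from zero; the sign only gives the direction.
    return r1 if abs(r1) >= abs(r2) else r2


print(abs(-0.75))             # 0.75
print(stronger(-0.75, 0.65))  # -0.75
```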
Types of correlation coefficient formulas.

There are several types of correlation coefficient formulas.
One of the most commonly used formulas in stats is Pearson’s correlation coefficient formula. If you’re taking a
basic stats class, this is the one you’ll probably use:
\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
\]
Pearson correlation coefficient
Two other formulas are commonly used: the sample correlation coefficient and the population correlation
coefficient.
Sample correlation coefficient

\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
s_x and s_y are the sample standard deviations, and s_xy is the sample covariance.
Population correlation coefficient

\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
The population correlation coefficient uses σx and σy as the population standard deviations, and σxy as the
population covariance.
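As a rough illustration of the sample version, the Python sketch below divides the sample covariance by the
product of the two sample standard deviations. The function names are assumptions made for this example, and
the data are the sample coordinates from the TI 83 scatter plot section earlier in this article.

```python
from statistics import mean, stdev


def sample_covariance(xs, ys):
    """Sample covariance s_xy, using the n - 1 denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def sample_correlation(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))


# Coordinates (2, 3), (4, 4), (6, 9), (8, 11), (10, 12).
x = [2, 4, 6, 8, 10]
y = [3, 4, 9, 11, 12]

r = sample_correlation(x, y)
print(round(r, 4))
```

The population version has the same shape; it swaps in the population covariance and population standard
deviations (n in the denominators instead of n - 1), and the ratio comes out the same for a given data set.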
What is Pearson Correlation?

Correlation between sets of data is a measure of how well they are related. The most common measure of
correlation in stats is the Pearson Correlation. The full name is the Pearson Product Moment Correlation
(PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question, Can
I draw a line graph to represent the data? Two letters are used to represent the Pearson correlation: Greek letter
rho (ρ) for a population and the letter “r” for a sample.
Potential problems with Pearson correlation.

The PPMC is not able to tell the difference between dependent variables and independent variables. For
example, if you are trying to find the correlation between a high calorie diet and diabetes, you might find a high
correlation of .8. However, you could also get the same result with the variables switched around. In other
words, you could say that diabetes causes a high calorie diet. That obviously makes no sense. Therefore, as a
researcher you have to be aware of the data you are plugging in. In addition, the PPMC will not give you any
information about the slope of the line; it only tells you whether there is a relationship.
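Both limitations can be seen numerically. The Python sketch below is an illustration, with a hypothetical
helper pearson_r and the example data from the TI-89 section earlier in this article: swapping the variables
leaves r unchanged, and so does multiplying every y-value by 10, even though that makes the fitted line ten
times steeper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [6, 8, 9, 11, 14]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                       # variables switched around
r_steeper = pearson_r(x, [10 * v for v in y])  # same pattern, ten times the slope

print(r_xy, r_yx, r_steeper)
```

All three values agree (up to floating-point rounding), which is why r alone cannot tell you which variable
depends on which, or how steep the line is.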
Real Life Example
Pearson correlation is used in thousands of real life situations. For example, scientists in China wanted to know
if there was a relationship between how weedy rice populations are different genetically. The goal was to find
out the evolutionary potential of the rice. Pearson’s correlation between the two groups was analyzed. It showed
a positive Pearson Product Moment correlation of between 0.783 and 0.895 for weedy rice populations. This
figure is quite high, which suggested a fairly strong relationship.
If you’re interested in seeing more examples of PPMC, you can find several studies on the National Institutes of
Health’s Open website, which shows results of studies as varied as breast cyst imaging and the role that
carbohydrates play in weight loss.
How to Find Pearson’s Correlation Coefficients

By Hand

Sample question: Find the value of the correlation coefficient from the following table:
Subject   Age (x)   Glucose Level (y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99                  4257
2         21        65                  1365
3         25        79                  1975
4         42        75                  3150
5         57        87                  4959
6         59        81                  4779
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99                  4257    1849
2         21        65                  1365    441
3         25        79                  1975    625
4         42        75                  3150    1764
5         57        87                  4959    3249
6         59        81                  4779    3481
Step 4: Take the square of the numbers in the y column, and put the result in the y² column.
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom of the column. The Greek
letter sigma (Σ) is a short way of saying “sum of.”
Subject   Age (x)   Glucose Level (y)   xy      x²      y²
1         43        99                  4257    1849    9801
2         21        65                  1365    441     4225
3         25        79                  1975    625     6241
4         42        75                  3150    1764    5625
5         57        87                  4959    3249    7569
6         59        81                  4779    3481    6561
Σ         247       486                 20485   11409   40022
Step 6: Use the following correlation coefficient formula.
\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
\]
The answer is: 2868 / 5413.27 = 0.529809
From our table:
 Σx = 247
 Σy = 486
 Σxy = 20,485
 Σx2 = 11,409
 Σy2 = 40,022
 n is the sample size, in our case = 6
The correlation coefficient =
\[
r = \frac{6(20{,}485) - (247 \times 486)}{\sqrt{\left[6(11{,}409) - 247^2\right] \times \left[6(40{,}022) - 486^2\right]}}
  = \frac{2868}{5413.27} = 0.5298
\]
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which means the
variables have a moderate positive correlation.
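The hand calculation above can be double-checked with a few lines of Python. This is a sketch that mirrors
the table columns and the sums formula from Step 6; the variable names are chosen for this illustration.

```python
from math import sqrt

# Data from the sample question: age (x) and glucose level (y).
ages = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
n = len(ages)

# Steps 2 to 5: build the xy, x^2 and y^2 columns and add everything up.
sum_x = sum(ages)
sum_y = sum(glucose)
sum_xy = sum(x * y for x, y in zip(ages, glucose))
sum_x2 = sum(x ** 2 for x in ages)
sum_y2 = sum(y ** 2 for y in glucose)
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 247 486 20485 11409 40022

# Step 6: plug the sums into the correlation coefficient formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(numerator)    # 2868
print(round(r, 4))  # 0.5298
```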
Correlation Formula: TI 83
If you’re taking AP Statistics, you won’t actually have to work the correlation formula by hand. You’ll use your
graphing calculator. Here’s how to find r on a TI-83.
Step 1: Type your data into a list and make a scatter plot to ensure your variables are roughly correlated. In other
words, look for a straight line. Not sure how to do this? See: TI 83 Scatter plot.
Step 2: Press the STAT button.
Step 3: Scroll right to the CALC menu.
Step 4: Scroll down to 4:LinReg(ax+b), then press ENTER. The output will show “r” at the very bottom of the
list.
Tip: If you don’t see r, turn Diagnostic ON, then perform the steps again.
How to Compute the Pearson Correlation Coefficient

Excel 2007

Step 1: Type your data into two columns in Excel. For example, type your “x” data into column A and your
“y” data into column B.
Step 2: Select any empty cell.
Step 3: Click the function button on the ribbon.
Step 4: Type “correlation” into the ‘Search for a function’ box.
Step 5: Click “Go.” CORREL will be highlighted.
Step 6: Click “OK.”
Step 7: Type the location of your data into the “Array 1” and “Array 2” boxes. For this example, type
“A2:A10” into the Array 1 box and then type “B2:B10” into the Array 2 box.
Step 8: Click “OK.” The result will appear in the cell you selected in Step 2. For this particular data set, the
correlation coefficient (r) is -0.1316.
Caution: The results for this test can be misleading unless you have made a scatter plot first to ensure your data
roughly fits a straight line. Excel 2007 will always return a correlation coefficient, even if the relationship
in your data is something other than linear (e.g. exponential).
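To see why the caution matters, here is a small sketch (in Python rather than Excel, with made-up data) showing that the Pearson formula happily returns a number for data that is perfectly related but not linear. The data sets and the helper name `pearson_r` are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y depends exactly on x, but the relationship is exponential.
xs_exp = list(range(1, 11))
ys_exp = [math.exp(x) for x in xs_exp]
r_exponential = pearson_r(xs_exp, ys_exp)

# Hypothetical data: y = x squared on a symmetric range (a U-shaped curve).
xs_quad = [-3, -2, -1, 0, 1, 2, 3]
ys_quad = [x ** 2 for x in xs_quad]
r_quadratic = pearson_r(xs_quad, ys_quad)

print(r_exponential)  # noticeably below 1 despite an exact relationship
print(r_quadratic)    # zero, even though y is fully determined by x
```

A scatter plot would reveal the curve in both cases; the coefficient alone would not.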
That’s it!
Correlation Coefficient SPSS: Overview.


Step 1: Click “Analyze,” then click “Correlate,” then click “Bivariate.” The Bivariate Correlations window
will appear.
Step 2: Click one of the variables in the left-hand window of the Bivariate Correlations pop-up window. Then
click the center arrow to move the variable to the “Variables:” window. Repeat this for a second variable.
Step 3: Click the “Pearson” check box if it isn’t already checked. Then click either the “one-tailed” or
“two-tailed” test radio button. If you aren’t sure if your test is one-tailed or two-tailed, see: Is it a
one-tailed test or two-tailed test?
Step 4: Click “OK” and read the results. Each box in the output gives you a correlation between two
variables. For example, the PPMC for Number of older siblings and GPA is -.098, which means practically no
correlation. You can find this information in two places in the output. Why? This cross-referencing of columns
and rows is very useful when you are comparing PPMCs for dozens of variables.
Tip #1: It’s always a good idea to make an SPSS scatter plot of your data set before you perform this test.
That’s because SPSS will always give you some kind of answer and will assume that the data is linearly related.
If you have data that might be better suited to another correlation (for example, exponentially related data) then
SPSS will still run Pearson’s for you and you might get misleading results.
Tip #2: Click on the “Options” button in the Bivariate Correlations window if you want to include descriptive
statistics like the mean and standard deviation.
Minitab
Watch this video on how to calculate the correlation coefficient in Minitab, or read the steps in the article
below:

The Minitab correlation coefficient will return a value for r from -1 to 1.
Sample question: Find the Minitab correlation coefficient based on age vs. glucose level from the following
table from a pre-diabetic study of 6 participants:
Subject   Age (x)   Glucose Level (y)
1         43        99
2         21        65
3         25        79
4         42        75
5         57        87
6         59        81
Step 1: Type your data into a Minitab worksheet. I entered this sample data into three columns.
Data entered into three columns in a Minitab worksheet.
Step 2: Click “Stat”, then click “Basic Statistics” and then click “Correlation.”
“Correlation” is selected from the “Stats > Basic Statistics” menu.
Step 3: Click a variable name in the left window and then click the “Select” button to move the variable
name to the Variable box. For this sample question, click “Age,” then click “Select,” then click “Glucose Level”
then click “Select” to transfer both variables to the Variable window.
Step 4: (Optional) Check the “P-Value” box if you want to display a P-Value for r.
Step 5: Click “OK”. The Minitab correlation coefficient will be displayed in the Session Window. If you don’t
see the results, click “Window” and then click “Tile.” The Session window should appear.
Results from the Minitab correlation.
For this dataset:

Value of r: 0.530
P-Value: 0.280
That’s it!
Tip: Give your columns meaningful names (in the first row of the column, right under C1, C2 etc.). That way,
when it comes to choosing variable names in Step 3, you’ll easily see what it is you are trying to choose. This
becomes especially important when you have dozens of columns of variables in a data sheet!
Meaning of the Linear Correlation Coefficient.

Pearson’s Correlation Coefficient is a linear correlation coefficient that returns a value between -1 and +1. A
value of -1 means there is a perfect negative linear correlation and +1 means that there is a perfect positive
linear correlation. A 0 means that there is no linear correlation (this is also called zero correlation).
This can initially be a little hard to wrap your head around (who likes to deal with negative numbers?). The
Political Science Department at Quinnipiac University posted this useful list of the meaning of Pearson’s
Correlation coefficients. They note that these are “crude estimates” for interpreting strengths of correlations
using Pearson’s Correlation:
Table x: Criteria for passing judgement on estimated PPMCs
r value          Interpretation
+.70 or higher   Very strong positive relationship
+.40 to +.69     Strong positive relationship
+.30 to +.39     Moderate positive relationship
+.20 to +.29     Weak positive relationship
+.01 to +.19     No or negligible relationship
0                No relationship [zero correlation]
-.01 to -.19     No or negligible relationship
-.20 to -.29     Weak negative relationship
-.30 to -.39     Moderate negative relationship
-.40 to -.69     Strong negative relationship
-.70 or lower    Very strong negative relationship
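The table can be turned into a small lookup helper. The sketch below is one possible reading of it in Python; the function name `describe_pearson_r` is hypothetical, and values that fall in the gaps between the printed bands (for example .695) are assigned by simple "greater than or equal to" thresholds, which is an assumption rather than something the table specifies.

```python
def describe_pearson_r(r):
    """Map a Pearson r to the crude verbal labels from the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "No relationship [zero correlation]"

    magnitude = abs(r)
    direction = "positive" if r > 0 else "negative"

    # Thresholds follow the lower edge of each band in the table.
    if magnitude >= 0.70:
        strength = "Very strong"
    elif magnitude >= 0.40:
        strength = "Strong"
    elif magnitude >= 0.30:
        strength = "Moderate"
    elif magnitude >= 0.20:
        strength = "Weak"
    else:
        return "No or negligible relationship"
    return f"{strength} {direction} relationship"


print(describe_pearson_r(-0.098))  # No or negligible relationship
print(describe_pearson_r(0.75))    # Very strong positive relationship
```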
It may be helpful to see graphically what these correlations look like:
Graphs showing a correlation of -1 (a negative correlation), 0 and +1 (a positive correlation)
The images show that a strong negative correlation means that the graph has a downward slope from left to
right: as the x-values increase, the y-values get smaller. A strong positive correlation means that the graph has
an upward slope from left to right: as the x-values increase, the y-values get larger.
Cramer’s V Correlation
Cramer’s V Correlation is similar to the Pearson Correlation coefficient. While the Pearson correlation is used to
test the strength of linear relationships, Cramer’s V is used to calculate correlation in tables with more than 2 x 2
columns and rows. Cramer’s V correlation varies between 0 and 1. A value close to 0 means that there is very
little association between the variables. A Cramer’s V of close to 1 indicates a very strong association.
Cramer’s V      Interpretation
.25 or higher   Very strong relationship
.15 to .25      Strong relationship
.11 to .15      Moderate relationship
.06 to .10      Weak relationship
.01 to .05      No or negligible relationship
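The article does not give the formula for Cramer’s V, so the following is only a sketch based on the standard textbook definition, V = √(χ² / (n × min(rows – 1, columns – 1))), where χ² is the chi-square statistic of the contingency table. The function name `cramers_v` and both example tables are made up for illustration.

```python
import math


def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    n_rows = len(table)
    n_cols = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    n = sum(row_totals)

    # Chi-square statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Hypothetical 3 x 3 tables: one with no association, one with perfect association.
no_association = [[10, 20, 30], [20, 40, 60], [30, 60, 90]]
perfect_association = [[20, 0, 0], [0, 20, 0], [0, 0, 20]]

v_none = cramers_v(no_association)
v_perfect = cramers_v(perfect_association)
print(v_none, v_perfect)
```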
Back to Top.
Where did the Correlation Coefficient Come From?

A correlation coefficient gives you an idea of how well data fits a line or curve. Pearson wasn’t the original
inventor of the term correlation but his use of it became one of the most popular ways to measure correlation.
Brief History
Francis Galton (who was also involved with the development of the interquartile range) was the first person to
measure correlation, originally termed “co-relation,” which actually makes sense considering you’re studying
the relationship between a couple of different variables. In Co-Relations and Their Measurement, he said “The
statures of kinsmen are co-related variables; thus, the stature of the father is correlated to that of the adult son
and so on; but the index of co-relation … is different in the different cases.” It’s worth noting though that Galton
mentioned in his paper that he had borrowed the term from biology, where “Co-relation and correlation of
structure” was being used but until the time of his paper it hadn’t been properly defined.
In 1892, British statistician Francis Ysidro Edgeworth published a paper called “Correlated Averages,”
Philosophical Magazine, 5th Series, 34, 190-204 where he used the term “Coefficient of Correlation.” It wasn’t
until 1896 that British mathematician Karl Pearson used “Coefficient of Correlation” in two papers:
Contributions to the Mathematical Theory of Evolution and Mathematical Contributions to the Theory of
Evolution. III. Regression, Heredity and Panmixia. It was the second paper that introduced the Pearson product-
moment correlation formula for estimating correlation.
The Pearson Product-Moment Correlation equation.
Correlation Coefficient Hypothesis Test
If you can read a table, you can test the significance of a correlation coefficient. Note that correlations
should only be calculated for an entire range of data. If you restrict the range, r will be weakened.
Sample problem: test the significance of the correlation coefficient r = 0.565 using the critical values for PPMC
table. Test at α = 0.01 for a sample size of 9.
Step 1: Subtract two from the sample size to get df, degrees of freedom.
9 – 2 = 7
Step 2: Look the values up in the PPMC Table. With df = 7 and α = 0.01, the table value is 0.798.
Step 3: Draw a graph, so you can more easily see the relationship.
r = 0.565 does not fall into the “reject” region (above 0.798), so there isn’t enough evidence to state that a
significant linear relationship exists in the data.
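The table lookup in Step 2 can be reproduced from a t-table, because the critical value of r is related to the critical t value by r = t / √(t² + df). The sketch below assumes a two-tailed test and takes t = 3.499 (df = 7, α = 0.01) from a standard t-table; that t value and the function name `critical_r` are assumptions, not part of the original article.

```python
import math


def critical_r(t_critical, df):
    """Convert a critical t value into the matching critical value of r."""
    return t_critical / math.sqrt(t_critical ** 2 + df)


n = 9                 # sample size from the sample problem
df = n - 2            # degrees of freedom
t_crit = 3.499        # assumed: two-tailed t critical value for df = 7, alpha = 0.01
r_crit = critical_r(t_crit, df)

r = 0.565             # correlation coefficient being tested
reject_null = abs(r) > r_crit

print(round(r_crit, 3))  # 0.798
print(reject_null)       # False
```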
Related Articles / More Correlation Coefficients

Other similar formulas you might come across that involve correlation (click for article):
 Concordance Correlation coefficient.

 Intraclass Correlation.
 Kendall’s Tau.
 Moran’s I.
 Partial Correlation.
 Phi Coefficient.
 Point Biserial Correlation.
 Polychoric Correlation.
 Spearman Rank Correlation.
 Tetrachoric Correlation.
 Zero-Order Correlation.
------------------------------------------------------------------------------
Linear Regression: Simple Steps, Video. Find Equation,
Coefficient, Slope
Contents:
What is Simple Linear Regression?
How to Find a Linear Regression Equation:
1. How to Find a Linear Regression Equation by Hand.

2. Find a Linear Regression Equation in Excel.
3. TI83 Linear Regression.
4. TI 89 Linear Regression
Finding related items:
1. How to Find the Regression Coefficient.

2. Find the Linear Regression Slope.
3. Find a Linear Regression Test Value.
Leverage:
1. Leverage in Linear Regression.

If you’re just beginning to learn about regression analysis, simple linear regression is the first type of
regression you’ll come across in a stats class.
Linear regression is one of the most widely used statistical techniques; it is a way to model a relationship
between two variables. The result is a linear regression equation that can be used to make predictions about data.
Most software packages and calculators can calculate linear regression. For example:
 TI-83.
 Excel.
You can also find a linear regression by hand.
Before you try your calculations, you should always make a scatter plot to see if your data roughly fits a line.
Why? Because regression will always give you an equation, and it may not make any sense if your data follows a
curve (for example, an exponential one).
Etymology
“Linear” means line. The word Regression came from a 19th-Century Scientist, Sir Francis Galton, who coined
the term “regression toward mediocrity” (in modern language, that’s regression toward the mean). He used the
term to describe the phenomenon of how nature tends to dampen excess physical traits from generation to
generation (like extreme height).
Why use Linear Relationships?

Linear relationships, i.e. lines, are easier to work with, and many phenomena are approximately linearly related.
If variables aren’t linearly related, then some math can often transform that relationship into a linear one, so
that it’s easier for the researcher (i.e. you) to understand.
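One common example of such a transformation is taking logarithms of exponentially growing data. The sketch below uses made-up data that follows y = 2·e^(0.5x) exactly; after taking the natural log of y, an ordinary straight-line fit recovers the growth rate as the slope. The data, the constants, and the helper name `fit_line` are all assumptions for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical exponential data: y = 2 * exp(0.5 * x).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# ln(y) = ln(2) + 0.5 * x is a straight line in x.
log_ys = [math.log(y) for y in ys]
intercept, slope = fit_line(xs, log_ys)

print(round(slope, 3))                # 0.5
print(round(math.exp(intercept), 3))  # 2.0
```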

You’re probably familiar with plotting line graphs with one X axis and one Y axis. The X variable is sometimes
called the independent variable and the Y variable is called the dependent variable. Simple linear regression
plots one independent variable X against one dependent variable Y. Technically, in regression analysis, the
independent variable is usually called the predictor variable and the dependent variable is called the criterion
variable. However, many people just call them the independent and dependent variables. More advanced
regression techniques (like multiple regression) use multiple independent variables.
Regression analysis can result in linear or nonlinear graphs. A linear regression is where the relationships
between your variables can be described with a straight line. Non-linear regressions produce curved lines.(**)
Simple linear regression for the amount of rainfall per year.

Regression analysis is almost always performed by a computer program, as the equations are extremely time-
consuming to perform by hand.
**As this is an introductory article, I kept it simple. But there’s actually an important technical difference
between linear and nonlinear, that will become more important if you continue studying regression. For details,
see the article on nonlinear regression.

Overview
Regression analysis is used to find equations that fit data. Once we have the regression equation, we can use the
model to make predictions. One type of regression analysis is linear analysis. When a correlation coefficient
shows that data is likely to be able to predict future outcomes and a scatter plot of the data appears to form a
straight line, you can use simple linear regression to find a predictive function. If you recall from elementary
algebra, the equation for a line is y = mx + b. This article shows you how to take data, calculate linear
regression, and find the equation y’ = a + bx. Note: If you’re taking AP statistics, you may see the equation
written as y’ = b0 + b1x, which is the same thing (you’re just using the names b0 and b1 instead of a and b).
Watch the video or read the steps below to find a linear regression equation by hand. Scroll to the bottom of the
page if you would prefer to use Excel:

The Linear Regression Equation
Linear regression is a way to model the relationship between two variables. You might also recognize the
equation as the slope formula. The equation has the form Y= a + bX, where Y is the dependent variable (that’s
the variable that goes on the Y axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope
of the line and a is the y-intercept.
The first step in finding a linear regression equation is to determine if there is a relationship between the two
variables. This is often a judgment call for the researcher. You’ll also need a list of your data in x-y format (i.e.
two columns of data—independent and dependent variables).
Warnings:
1. Just because two variables are related, it does not mean that one causes the other. For example,
although there is a relationship between high GRE scores and better performance in grad school, it doesn’t
mean that high GRE scores cause good grad school performance.
2. If you try to find a linear regression equation for a set of data (especially through an
automated program like Excel or a TI-83), you will find one, but it does not necessarily mean the equation
is a good fit for your data. One safeguard is to make a scatter plot first, to see if the data roughly fits a
line before you try to find a linear regression equation.
How to Find a Linear Regression Equation: Steps

Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you
were finding the Pearson’s Correlation Coefficient.
Subject   Age x   Glucose Level y   xy      x²      y²
1         43      99                4257    1849    9801
2         21      65                1365    441     4225
3         25      79                1975    625     6241
4         42      75                3150    1764    5625
5         57      87                4959    3249    7569
6         59      81                4779    3481    6561
Σ         247     486               20485   11409   40022
From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in
our case).
Step 2: Use the following equations to find a and b.
$$a = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - (\sum x)^2}$$
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
a = 65.1416
b = .385225
Find a:
 ((486 × 11,409) – (247 × 20,485)) / (6(11,409) – 247²)
 = 484,979 / 7,445
 = 65.14
Find b:
 (6(20,485) – (247 × 486)) / (6(11,409) – 247²)
 = (122,910 – 120,042) / (68,454 – 61,009)
 = 2,868 / 7,445
 = .385225
Step 3: Insert the values into the equation.

y’ = a + bx
y’ = 65.14 + .385225x
That’s how to find a linear regression equation by hand!
* Note that this example has a low correlation coefficient, and therefore wouldn’t be too good at predicting
anything.
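The arithmetic in Steps 2 and 3 can be reproduced in a few lines. This is a minimal Python sketch that plugs the column totals from the table into the a and b formulas; the function name `regression_from_sums` and the example age of 50 are illustrative assumptions.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y' = a + b*x from the summary totals of the data."""
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b


# Totals from the table: n = 6, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409.
a, b = regression_from_sums(6, 247, 486, 20485, 11409)
print(round(a, 4), round(b, 6))  # 65.1416 0.385225

# Using the fitted equation y' = a + b*x for a hypothetical age of 50.
predicted_glucose = a + b * 50
print(round(predicted_glucose, 1))
```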
Find a Linear Regression Equation in Excel


Linear Regression Equation Microsoft Excel: Steps

Step 1: Install the Data Analysis Toolpak, if it isn’t already installed. For instructions on how to load the Data
Analysis Toolpak, click here.
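Step 2: Type your data into two columns in Excel. For example, type your “x” data into column A and your
“y” data into column B. Do not leave any blank cells between your entries.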
Step 3: Click the “Data Analysis” tab on the Excel toolbar.
Step 4: Click “regression” in the pop up window and then click “OK.”
The Data Analysis pop up window has many options, including linear regression.
Step 5: Select your input Y range. You can do this two ways: either select the data in the worksheet or type the
location of your data into the “Input Y Range box.” For example, if your Y data is in A2 through A10 then type
“A2:A10” into the Input Y Range box.
Step 6: Select your input X range by selecting the data in the worksheet or typing the location of your data into
the “Input X Range box.”
Step 7: Select the location where you want your output range to go by selecting a blank area in the worksheet
or typing the location of where you want your data to go in the “Output Range” box.
Step 8: Click “OK”. Excel will calculate the linear regression and populate your worksheet with the results.
Tip: The linear regression equation information is given in the last output set (the coefficients column). The first
entry in the “Intercept” row is “a” (the y-intercept) and the first entry in the “X Variable 1” row is “b” (the slope).
TI83 Linear Regression

Two linear regression lines.
TI 83 Linear Regression: Overview

Linear regression is tedious and prone to errors when done by hand, but you can perform linear regression in
the time it takes you to input a few variables into a list. Linear regression will only give you a reasonable result
if your data looks like a line on a scatter plot, so before you find the equation for a linear regression line you
may want to view the data on a scatter plot first. See this article for how to make a scatter plot on the TI 83.
TI 83 Linear Regression: Steps

Sample problem: Find a linear regression equation (of the form y = ax + b) for x-values of 1, 2, 3, 4, 5 and y-
values of 3, 9, 27, 64, and 102.
Step 1: Press STAT, then press ENTER to enter the lists screen. If you already have data in L1 or L2, clear the
data: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2.
Step 2: Enter your x-variables, one at a time. Follow each number by pressing the ENTER key. For our list, you
would enter:
1 ENTER
2 ENTER
3 ENTER
4 ENTER
5 ENTER
Step 3: Scroll across to the next column, L2, using the arrow keys at the top right of the keypad.
Step 4: Enter your y-variables, one at a time. Follow each number by pressing the enter key. For our list, you
would enter:
3 ENTER
9 ENTER
27 ENTER
64 ENTER
102 ENTER
Step 5: Press the STAT button, then use the scroll key to highlight “CALC.”
Step 6: Press 4 to choose “LinReg(ax+b)”. Press ENTER and then ENTER again. The TI 83 will return the
variables needed for the equation. Just insert the given variables (a, b) into the equation for linear regression
(y=ax+b). For the above data, this is y = 25.3x – 34.9.
That’s how to perform TI 83 Linear Regression!
How to Find a Linear Regression Slope: Overview

Remember from algebra that the slope is the “m” in the formula y = mx + b.
In the linear regression formula, the slope is the b in the equation y’ = a + bx.
They are basically the same thing. So if you’re asked to find the linear regression slope, all you need to do is
find b in the same way that you would find m.
Calculating linear regression by hand is tricky, to say the least. There’s a lot of summation (that’s the Σ symbol,
which means to add up). The basic steps are below, or you can watch the video at the beginning of this
article. The video goes into a lot more detail about how to do summation. Finding the equation will also give
you the slope. If you don’t want to find the slope by hand (or if you want to check your work), you can also use
Excel.
How to Find Linear Regression Slope: Steps
Step 1: Find the following data from the information given: Σx, Σy, Σxy, Σx², Σy². If you don’t remember how
to get those variables from data, see this article on how to find a Pearson’s correlation coefficient. Follow the
steps there to create a table and find Σx, Σy, Σxy, Σx², and Σy².
Step 2: Insert the data into the b formula (there is no need to find a).
$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$
If formulas scare you, you can find more comprehensive instructions on how to work the formula here: How to
Find a Linear Regression Equation: Overview.
How to Find the Regression Coefficient
A regression coefficient is the same thing as the slope of the line of the regression equation. The equation for
the regression coefficient that you’ll find on the AP Statistics test is:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
In this equation, $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x.
You could find the regression coefficient by hand (as outlined in the section at the top of this page).
However, you won’t have to calculate the regression coefficient by hand in the AP test — you’ll use your TI-83
calculator. Why? Calculating linear regression by hand is very time consuming (allow yourself about 30 minutes
to do the calculations and check them) and because of the huge number of calculations you have to make you’re
very likely to make mathematical errors. When you find a linear regression equation on the TI83, you get the
regression coefficient as part of the answer.
Sample problem: Find the regression coefficient for the following set of data:
x: 1, 2, 3, 4, 5.
y: 3, 9, 27, 64, 102.
Step 1: Press STAT, then press ENTER to enter LISTS. You may need to clear data if you already have
numbers in L1 or L2. To clear the data: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2
if you need to.
Step 2: Enter your x-data into a list. Press the ENTER key after each entry.
1 ENTER
2 ENTER
3 ENTER
4 ENTER
5 ENTER
Step 3: Scroll across to the next column, L2 using the arrow keys at the top right of the keypad.
Step 4: Enter the y-data:

3 ENTER
9 ENTER
27 ENTER
64 ENTER
102 ENTER
Step 5: Press the STAT button, then scroll to highlight “CALC.” Press ENTER.
Step 6: Press 4 to choose “LinReg(ax+b)”. Press ENTER. The TI 83 will return the values needed for the
linear regression equation. The value you’re looking for, the regression coefficient, is a (the slope in
y = ax + b), which is 25.3 for this set of data.
That’s it!
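If you want to check the calculator’s output, the regression coefficient formula (deviations from the means) can be applied directly to the sample data. This is a minimal Python sketch; the function name `regression_coefficient` is an illustrative choice.

```python
def regression_coefficient(xs, ys):
    """Slope of the least-squares line, using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Sample problem data.
xs = [1, 2, 3, 4, 5]
ys = [3, 9, 27, 64, 102]

slope = regression_coefficient(xs, ys)
# The fitted line passes through the point of means, which gives the intercept.
intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)

print(round(slope, 1), round(intercept, 1))  # 25.3 -34.9
```

The result matches the TI-83 equation y = 25.3x – 34.9 from the earlier sample problem.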
Linear Regression Test Value
Linear regression test values are used in simple linear regression exactly the same way as test values (like the z-
score or T statistic) are used in hypothesis testing. Instead of working with the z-table you’ll be working with a
t-distribution table. The linear regression test value is compared to the test statistic to help you support or reject
a null hypothesis.
Linear Regression Test Value: Steps

Sample question: Given a set of data with sample size 8 and r = 0.454, find the linear regression test value.
Note: r is the correlation coefficient.
Step 1: Find r, the correlation coefficient, unless it has already been given to you in the question. In this case, r
is given (r = 0.454). Not sure how to find r? See: Correlation Coefficient for steps on how to find r.
Step 2: Use the following formula to compute the test value (n is the sample size):
$$T = r\sqrt{\frac{n - 2}{1 - r^2}}$$
How to solve the formula:

1. Replace the variables with your numbers:
T = .454√((8 – 2)/(1 – .454²))
2. Subtract 2 from n:
8 – 2 = 6
3. Square r:
.454 × .454 = .206116
4. Subtract step (3) from 1:
1 – .206116 = .793884
5. Divide step (2) by step (4):
6 / .793884 = 7.557779
6. Take the square root of step (5):
√7.557779 = 2.74914154
7. Multiply r by step (6):
.454 × 2.74914154 = 1.24811026
The Linear Regression Test value, T = 1.24811026
That’s it!
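The same steps collapse into a one-line function. This is a small Python sketch of the test value formula applied to the sample question (r = 0.454, n = 8); the function name `regression_test_value` is an illustrative choice.

```python
import math


def regression_test_value(r, n):
    """Test value T = r * sqrt((n - 2) / (1 - r^2)) for a correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


t_value = regression_test_value(0.454, 8)
print(round(t_value, 4))  # 1.2481
```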
Finding the test statistic

The linear regression test value isn’t much use unless you have something to compare it to. Compare your value
to the test statistic. The test statistic is also a t-score (t) defined by the following equation:
t = slope of the sample regression line / standard error of the slope.
See: How to find a linear regression slope / How to find the standard error of the slope (TI-83).
You can find a worked example of calculating the linear regression test value (with an alpha level) here:
Correlation Coefficients.
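As a rough illustration of the ratio t = slope / standard error of the slope, the sketch below fits a simple linear regression by hand. The data set and the function name slope_t_score are invented for this example, and the standard error is computed with the usual textbook formula √(SSE/(n – 2)) / √Sxx, which is an assumption about what the linked pages describe.

```python
import math

def slope_t_score(xs, ys):
    """Return (slope, standard error of the slope, t) for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Error sum of squares around the fitted line
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return slope, std_err, slope / std_err

# Hypothetical data, for illustration only
slope, std_err, t = slope_t_score([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"slope = {slope:.3f}, SE = {std_err:.3f}, t = {t:.3f}")
```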
Leverage in Linear Regression

Data points that have leverage have the potential to move a linear regression line. They tend to be outliers. An
outlier is a point that is either an extremely high or extremely low value.
Influential Points
If the parameter estimates (sample standard deviation, variance etc.) change significantly when an outlier is
removed, that data point is called an influential observation.
The more a data point’s x-value differs from the mean of the other x-values, the more leverage it has. The more
leverage a point has, the higher the probability that the point will be influential (i.e. it could change the parameter estimates).
Leverage in Linear Regression: How it Affects Graphs
In linear regression, the influential point (outlier) will try to pull the linear regression line toward itself. The
graph below shows what happens to a linear regression line when outlier A is included:
Two linear regression lines. The influential point A is included in the upper line but not in the lower line.
Outliers with extreme X values (values that aren’t within the range of the other data points) have more leverage
in linear regression than points with less extreme x values. In other words, extreme x-value outliers will move
the line more than less extreme values.
The following graph shows a data point outside of the range of the other values. The values range from 0 to
about 70,000. This one point has an x-value of about 80,000 which is outside the range. It affects the regression
line a lot more than the point in the first image above, which was inside the range of the other values.
A high-leverage outlier. The point has moved the graph more because it is outside the range of the other values.
In general, outliers that have values close to the mean of x will have less leverage than outliers towards the edges
of the range. Outliers with values of x outside of the range will have more leverage. Values that are extreme on
the y-axis (compared to the other values) will have more influence than values closer to the other y-values.
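To make the idea of leverage concrete, the sketch below computes a leverage score for each x-value in a simple linear regression. It uses the standard textbook formula h = 1/n + (x – mean)² / Σ(x – mean)², which is not stated in the text above; the function name and the x-values are hypothetical.

```python
def leverages(xs):
    """Leverage of each point in a simple linear regression (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

# Hypothetical x-values (in thousands); the last one lies outside the range of the rest
x_values = [5, 15, 25, 35, 45, 55, 65, 80]
h = leverages(x_values)
for x, h_i in zip(x_values, h):
    print(f"x = {x:>2}: leverage = {h_i:.3f}")
```

Points near the mean of x get the smallest scores and the point furthest from the mean gets the largest, which matches the description above.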
------------------------------------------------------------------------------
Chow Test: Definition & Examples

What is a Chow Test

The Chow test tells you if the regression coefficients are different for split data sets. Basically, it tests whether
one regression line or two separate regression lines best fit a split set of data.
Split Data Sets and the Chow Test

Sometimes your data will have a break point or structural point (a period of significant or violent change),
splitting a data set into two parts. For example:
• Donations given to an organization before and after a natural disaster.
• Stock market prices before and after Black Friday.
• House prices before and after a significant interest change.
• Asset prices before and after civil war.
The dataset on the left has a single regression line. The set on the right has a break point in the middle and two
regression lines.
If the two parts can be represented by one single regression line, we say that the regression can be “pooled.”
Let’s say your linear regression analysis of two parts of a data set (shown on the right) resulted in the following
two linear regression equations:
• First part of the data: yt = X1*b1 + μ1
• Second part of the data: yt = X2*b2 + μ2
The Chow test would tell you whether the coefficients b1 = b2 and μ1 = μ2. If they are equal, the data set can be
represented with a single regression line.
Running the Test

The null hypothesis for the test is that there is no break point (i.e. that the data set can be represented with a
single regression line).
1. Run a regression for the entire data set (the “pooled regression”). Collect the error Sum of Squares
data.
2. Run separate regressions on each half of the data set. Collect the Error Sum of Squares data for the two
regressions.
3. Calculate the Chow F statistic using the SSE from each subsample. The formula is:
F = [(RSSp – (RSS1 + RSS2)) / k] / [(RSS1 + RSS2) / (N1 + N2 – 2k)]
where:
• RSSp = residual sum of squares for the pooled (combined) regression line.
• RSS1 = residual sum of squares for the regression line before the break.
• RSS2 = residual sum of squares for the regression line after the break.
• k = number of estimated parameters in each regression (including the intercept).
• N1, N2 = number of observations in each of the two groups.
4. Find the F-critical value from the F-table.
5. Reject the null hypothesis if your calculated F-value falls into the rejection region (i.e. if the calculated
F-value is greater than the F-critical value).
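The steps above can be sketched in Python for the simplest case, one predictor plus an intercept (so k = 2 parameters per regression). This is an illustration under stated assumptions: the helper names, the data, and the break point are all invented, and the statistic follows the standard Chow formula F = [(RSSp – (RSS1 + RSS2)) / k] / [(RSS1 + RSS2) / (N1 + N2 – 2k)].

```python
def sse_simple(xs, ys):
    """Error sum of squares for a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def chow_f(x1, y1, x2, y2, k=2):
    """Chow F statistic for two sub-samples; k = parameters per regression."""
    rss_p = sse_simple(x1 + x2, y1 + y2)   # Step 1: pooled regression
    rss_1 = sse_simple(x1, y1)             # Step 2: regression before the break
    rss_2 = sse_simple(x2, y2)             #         regression after the break
    n1, n2 = len(x1), len(x2)
    return ((rss_p - (rss_1 + rss_2)) / k) / ((rss_1 + rss_2) / (n1 + n2 - 2 * k))

# Hypothetical data with an obvious break between x = 5 and x = 6
x_before, y_before = [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
x_after, y_after = [6, 7, 8, 9, 10], [20.2, 20.9, 22.1, 22.8, 24.1]
f_stat = chow_f(x_before, y_before, x_after, y_after)
print(f"Chow F = {f_stat:.1f} with (2, 6) degrees of freedom")
```

The result would then be compared with the F-critical value for k and N1 + N2 – 2k degrees of freedom (Steps 4 and 5), which this sketch leaves to the F-table.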
Reference:
Chow, G.C. (1960), “Tests of Equality between Sets of Coefficients in Two Linear Regressions,” Econometrica,
28, 591-605.
------------------------------------------------------------------------------
By Stephanie | October 11, 2016 | Statistics How To |
Forward Selection: Definition

Forward selection is a type of stepwise regression which begins with an empty model and adds in variables one
by one. In each forward step, you add the one variable that gives the single best improvement to your model.
It is one of two commonly used methods of stepwise regression; the other is backward elimination, which is
almost the opposite: you start with a model that includes every possible variable and eliminate the
extraneous variables one by one.
General Method Behind Forward Selection

Forward selection typically begins with only an intercept. One tests the various variables that may be relevant,
and the “best” variable (where “best” is determined by some predetermined criterion) is added to the model.
As long as the model continues to improve (per that same criterion), we continue the process, adding in one variable at a
time and testing at each step. Once the model no longer improves with adding more variables, the process stops.
The criteria used to determine which variable goes in, and when, vary. You could be attempting to find the
lowest score under cross validation, the lowest p-value, or any of a number of other tests or measures of
accuracy.
Stepwise regression tends toward over-fitting, which happens when we put in more variables than is
actually good for the model. An over-fitted model typically shows a very close, neat fit of the data used in the regression, but
it will be far off from additional data points and not good for interpolation. Therefore, it is usually good to have
strict criteria for adding in any variables.
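The general method can be sketched as a short loop. This is only an illustration: the stopping rule used here (stop when the best remaining variable raises R² by less than min_r2_gain) is one assumed criterion among the many mentioned above, and the function names, threshold, and data are all hypothetical. The least-squares fit solves the normal equations directly and does not handle perfectly collinear variables.

```python
def ols_sse(columns, y):
    """Error sum of squares for OLS of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Augmented normal equations (X'X | X'y)
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(p):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [u - f * v for u, v in zip(aug[r], aug[c])]
    beta = [aug[i][p] / aug[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def forward_select(candidates, y, min_r2_gain=0.01):
    """Add one variable at a time while R-squared improves by at least min_r2_gain."""
    total = ols_sse([], y)          # intercept-only model (total sum of squares)
    selected, current = [], total
    remaining = list(candidates)
    while remaining and total > 0:
        trial = {name: ols_sse([candidates[s] for s in selected + [name]], y)
                 for name in remaining}
        best = min(trial, key=trial.get)
        if (current - trial[best]) / total < min_r2_gain:
            break                   # no candidate improves the model enough
        selected.append(best)
        remaining.remove(best)
        current = trial[best]
    return selected

# Hypothetical data: y depends on x1 and x2 but not on x3
candidates = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6],
    "x3": [2, 7, 1, 8, 2, 8, 1, 8],
}
y = [11.1, 6.8, 18.1, 11.0, 24.9, 39.2, 19.9, 34.0]
selected = forward_select(candidates, y)
print(selected)
```

A stricter threshold keeps fewer variables, which is one way to guard against the over-fitting described above.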
------------------------------------------------------------------------------
By Stephanie | September 19, 2017 | Statistics How To |
Kriging: Definition, Limitations
Kriging is a type of regression that gives a least squares estimate of data (Remy et al., 2011). It uses z-scores to
generate an estimated surface model from the spatial description of a scattered set of data points. It originated in
mining geology, and is now an important part of the geostatistics toolbox. It also has applications in computer
engineering, remote sensing, and environmental science.
One strong point of this type of interpolation is that it not only generates an interpolated spatial model, it also
generates an estimate of the uncertainty of each point in that model.
Unlike linear regression or inverse distance weighted interpolation, kriging interpolation is based primarily on
empirical observations, the observed sample data points, rather than on a pre-assumed model.
The interpolation gives more weight to sample points near a location than to those farther away and, in order to
reduce sampling bias, weighs clusters less heavily than single points. The value of each point is calculated in
such a way as to minimize the expected error for that particular point.
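The weighting idea can be sketched for one-dimensional data. This is a rough illustration under stated assumptions: it uses ordinary kriging with an exponential covariance model whose sill and length parameters are invented, and the sample points and function names are hypothetical. Real applications estimate the covariance (or variogram) from the data first.

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def covariance(h, sill=1.0, length=2.0):
    """Assumed exponential covariance model; sill and length are made-up values."""
    return sill * math.exp(-abs(h) / length)

def ordinary_kriging(xs, zs, x0):
    """Return (estimate, kriging variance) at x0 from 1-D samples (xs, zs)."""
    n = len(xs)
    # Sample-to-sample covariances, bordered by ones so the weights sum to 1
    A = [[covariance(xs[i] - xs[j]) for j in range(n)] + [1.0] for i in range(n)]
    A.append([1.0] * n + [0.0])
    b = [covariance(x - x0) for x in xs] + [1.0]
    *weights, lagrange = solve(A, b)
    estimate = sum(w * z for w, z in zip(weights, zs))
    variance = covariance(0) - sum(w * c for w, c in zip(weights, b[:n])) - lagrange
    return estimate, variance

# Hypothetical sample locations and values
xs, zs = [0.0, 1.0, 3.0, 6.0], [1.0, 2.0, 0.5, 1.5]
for x0 in (1.0, 2.0, 4.5):
    est, var = ordinary_kriging(xs, zs, x0)
    print(f"x0 = {x0}: estimate = {est:.3f}, variance = {var:.3f}")
```

At a sampled location the estimate reproduces the observed value with essentially zero variance, and the variance grows as the target moves away from the samples, which is the uncertainty estimate mentioned above.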
Example
The above graph is an example of one-dimensional data interpolation by kriging.
• The gray areas are the normally distributed confidence intervals.
• The red line represents the kriging interpolation, which runs along the means of the normally
distributed confidence intervals.
• Squares show the original data points.
• The dashed curve shows a smooth spline. However, this departs significantly from the expected
intermediate values given by the means.
Limitations of Kriging Interpolation

Kriging assumes that the space being studied is stationary; that is to say, that the joint probability distribution
doesn’t change throughout the study space.
It also assumes a property called isotropy; that there is uniformity in every direction.
If these conditions are difficult to fulfill, the method becomes problematic. However, in universal kriging the
stationary requirement is relaxed.
The accuracy of your model will be limited if the data aren’t spatially correlated, if they’re limited in spread, or if
the number of data points is small.
References
1. ArcMap Documentation. Retrieved from http://desktop.arcgis.com/en/arcmap/10.3/tools/3d-analyst-
toolbox/how-kriging-works.htm on June 24, 2018
2. Clark, I. What is Kri-ging Anyway? Retrieved from http://www.kriging.com/whatiskri-ging.html on
June 24, 2018
3. GIS Geography. The Prediction Is Strong in this One. (How to Interpolate With Geostatistics)
Retrieved from https://gisgeography.com/kriging-interpolation-prediction/ on June 24, 2018
4. Population Health Method. Retrieved from
https://www.mailman.columbia.edu/research/population-health-methods/kri-ging on June 24, 2018.
5. Remy et al. (2011) Applied Geostatistics with SGeMS: A User’s Guide. Cambridge University Press.
------------------------------------------------------------------------------
By Stephanie | July 30, 2018 | Statistics How To |
Discrete statistics
What is an Interquartile Range?

The interquartile range is a measure of where the “middle fifty” is in a data set. Where a range is a measure of
where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values
lie. That’s why it’s preferred over many other summary measures (e.g. the average or median) when reporting
things like school performance or SAT scores.
The interquartile range formula is the first quartile subtracted from the third quartile:
IQR = Q3 – Q1.
Contents (click to skip to the page section):

Solving by hand:
1. Solve the formula by hand (odd set of numbers).

2. What if I have an even set of numbers?
3. Find an interquartile range for an odd set of numbers: Second Method
4. Box Plot interquartile range: How to find it
Using Technology:
1. Interquartile Range in Minitab

2. Interquartile Range in Excel
3. Interquartile Range in SPSS
4. Interquartile Range on the TI83
5. Q1, Q3 and the IQR on the TI89
General info:
1. What is an Interquartile range?

2. What is the Interquartile Range Formula?
3. IQR as a Test for Normal Distribution
4. What is an Interquartile Range used for?
5. History of the Interquartile Range.
Solve the formula by hand.


Steps:
• Step 1: Arrange the numbers in ascending order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
• Step 2: Find the median of the set of numbers. Here the median is 9.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
• Step 3: Place parentheses around the numbers above and below the median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
• Step 4: Find Q1 and Q3.
Think of Q1 as the median of the lower half of the data and think of Q3 as the median of the upper half of the
data.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
• Step 5: Subtract Q1 from Q3 to find the interquartile range (IQR).
18 – 5 = 13.
Example 2:
Sample question: Find the IQR for the following data set: 3, 5, 7, 8, 9, 11, 15, 16, 20, 21.
• Step 1: Put the numbers in order.
3, 5, 7, 8, 9, 11, 15, 16, 20, 21.
• Step 2: Make a mark in the center of the data:
3, 5, 7, 8, 9, | 11, 15, 16, 20, 21.
• Step 3: Place parentheses around the numbers above and below the mark you made in Step 2. This
makes Q1 and Q3 easier to spot.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21).
• Step 4: Find Q1 and Q3.
Q1 is the median (the middle) of the lower half of the data, and Q3 is the median (the middle) of the upper
half of the data.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21). Q1 = 7 and Q3 = 16.
• Step 5: Subtract Q1 from Q3.
16 – 7 = 9.
This is your IQR.
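Both hand calculations above follow the same recipe, which can be sketched in Python. The function name iqr_by_halves is made up for illustration; it splits the sorted data into a lower and an upper half (leaving the median out when the count is odd) and takes the median of each half.

```python
from statistics import median

def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method shown above."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower = data[:half]    # values below the median (or below the center mark)
    upper = data[-half:]   # values above the median (or above the center mark)
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

odd_example = iqr_by_halves([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])
even_example = iqr_by_halves([3, 5, 7, 8, 9, 11, 15, 16, 20, 21])
print(odd_example)   # (5, 18, 13)
print(even_example)  # (7, 16, 9)
```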
Find an interquartile range for an odd set of numbers: Alternate Method
As you may already know, nothing is “set in stone” in statistics: when some statisticians find an interquartile
range for a set with an odd number of values, they include the median in both halves. For example, in the following set
of numbers: 1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27 some statisticians would break it into two halves, including the median
(9) in both halves:
(1,2,5,6,7,9),(9,12,15,18,19,27)
This leads to two halves with an even set of numbers, so you can follow the steps above to find the IQR.
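A minimal sketch of this alternate method, assuming the only change is that the median is counted in both halves when the set has an odd number of values (the function name is hypothetical):

```python
from statistics import median

def iqr_median_in_both_halves(values):
    """Return (Q1, Q3, IQR), including the median in both halves for odd-sized sets."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2            # for odd n this takes in the median itself
    lower, upper = data[:half], data[n - half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

result = iqr_median_in_both_halves([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27])
print(result)  # (5.5, 16.5, 11.0)
```

For this data set the alternate method gives an IQR of 11 rather than 13, so it is worth stating which method was used when reporting a result.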
Box Plot interquartile range: How to find it

[Figure: box plot for the sample question]
Sample question: Find the interquartile range for the above box plot.
• Step 1: Find Q1. Q1 is represented by the left hand edge of the “box” (at the point where the whisker
stops).
In the above graph, Q1 is approximately at 2.6. (A complete explanation of Q1 is here: The five number
summary.)
• Step 2: Find Q3.
Q3 is represented on a boxplot by the right hand edge of the “box”.
Q3 is approximately 12 in this graph.
• Step 3: Subtract the number you found in step 1 from the number you found in step 2.
This will give you the interquartile range. 12 – 2.6 = 9.4.
That’s it!
Interquartile Range in Minitab
Read on for step-by-step directions.
Interquartile Range in Minitab: Steps

Sample question: Find an interquartile range in Minitab for the Grade Point Average (GPA) in the following
data set:
Grade Point Average (GPA): 1(3.2), 1(3.1), 2(3.5), 2(2.0), 3(1.9), 3(4.0), 3(3.9), 4(3.8), 4(2.9), 5(3.9), 5(3.2),
5(3.3), 6(3.4), 6(2.6), 6(2.5), 7(2.0), 7(1.5), 8(4.0), 8(2.0).
Step 1: Type your data into a Minitab worksheet. Enter your data into one or two columns.
Step 2: Click “Stat,” then click “Basic Statistics,” then click “Display Descriptive Statistics” to open the
Descriptive Statistics menu.
Step 3: Click a variable name in the left window and then click the “Select” button to transfer the variable
name to the right-hand window.
Step 4: Click the “Statistics” button.
Step 5: Check “Interquartile Range.”
Step 6: Click the “OK” button (a new window will open with the result). The IQR for the GPA in this
particular data set is 1.8.
That’s it!
Tip: If you don’t see descriptive statistics show in a window, click “Window” on the toolbar, then click “Tile.”
Click the Session window (this is where descriptive statistics appear) and then scroll up to see your results.
Interquartile Range in Excel 2007
How to Find an Interquartile Range in Excel 2007
Follow the steps below to find an interquartile range in Excel 2007:
Steps:
Step 1: Enter your data into a single Excel column on a worksheet. For example, type your data in cells A2 to
A10. Don’t leave any gaps in your data.
Step 2: Click a blank cell (for example, click cell B2) and then type =QUARTILE(A2:A10,1). You’ll need to
replace A2:A10 with the actual cell range of your data set. For example, if you typed your data into B2 to B50,
the formula is =QUARTILE(B2:B50,1). The “1” in this Excel formula (A2:A10,1) represents the first quartile
(i.e. the point lying at 25% of the data set).
Step 3: Click a second blank cell (for example, click cell B3) and then type =QUARTILE(A2:A10,3). Replace
A2:A10 with the actual cell range of your data set. The “3” in this Excel formula (A2:A10,3) represents the third
quartile (i.e. the point lying at 75% of the data set).
Step 4: Click a third blank cell (for example, click cell B4) and then type =B3-B2. If your quartile functions
from Steps 2 and 3 are in different locations, change the cell references.
Step 5: Press the “Enter” key. Excel will return the IQR in the cell you clicked in Step 4.
That’s it!
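For readers without Excel, the same three steps can be sketched with Python's standard library. The data values below are a hypothetical stand-in for cells A2 to A10, and the "inclusive" method is used on the assumption that it interpolates the same way as Excel's QUARTILE function; results from the hand methods earlier may differ slightly because they use a different quartile rule.

```python
from statistics import quantiles

# Hypothetical values standing in for cells A2:A10
data = [3, 5, 7, 8, 9, 11, 15, 16, 20]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```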
How to Find an Interquartile Range in SPSS

Like most technology, SPSS has several ways that you can calculate the IQR. However, if you click on the
most intuitive way you would expect to find it (“Descriptive Statistics > Frequencies”), the surprise is that it
won’t list the IQR (although it will list the first, second and third quartiles). You could take this route and then
subtract the first quartile from the third to get the IQR. However, the easiest way to find the interquartile range
in SPSS is by using the “Explore” command. If you have already typed data into your worksheet, skip to Step 3.

Steps
Step 1: Open a new data file in SPSS. Click “File,” mouse over “New” and then click “Data.”
Step 2: Type your data into columns in the worksheet. You can use as many columns as you need, but don’t
leave blank rows or spaces between your data. See: How to Enter Data into SPSS.
Step 3: Click “Analyze,” then mouse over “Descriptive Statistics.” Click “Explore” to open the “Explore”
dialog box.
Step 4: Click the variable name (that’s just a fancy name for the column heading), then click the top arrow to
move the variable into the “Dependent list” box.
The “Explore” variables dialog box.
Step 5: Click “OK.” The interquartile range is listed in the Descriptives box.
Tip: This example has only one list typed into the data sheet, but you may have several to choose from
depending on how you entered your data. Make sure you select the right variable (column names) before
proceeding. If you want more memorable variable names, change the column title by clicking the “variable
view” button at the very bottom left of the worksheet. Type in your new variable name and then return to data
view by clicking the “data view” button.
Imagine all the data in a set as points on a number line. For example, if you have 3, 7 and 28 in your set of data,
imagine them as points on a number line that is centered on 0 but stretches both infinitely below zero and
infinitely above zero. Once plotted on that number line, the smallest data point and the biggest data point in the
set of data create the boundaries of an interval of space on the number line that contains all data points in the set.
The interquartile range (IQR) is the length of the middle 50% of that interval of space.
The interquartile range is the middle 50% of a data set. Box and whiskers image by Jhguch at en.wikipedia
If you want to know what the IQR is in formal terms, the IQR is calculated as the difference between the third
or upper quartile and the first or lower quartile. Quartile is a term used to describe how to divide the set of
data into four equal portions (think quarter).
IQR Example
If you have a set containing the data points 1, 3, 5, 7, 8, 10, 11 and 13, the first quartile is 4, the second quartile
is 7.5 and the third quartile is 10.5. Draw these points on a number line and you’ll see that those three numbers
divide the number line in quarters from 1 to 13. As such, the IQR of that data set is 6.5, calculated as 10.5
minus 4. The first and third quartiles are also sometimes called the 25th and 75th percentiles because those are
the equivalent figures when the data set is divided into percents rather than quarters.
Interquartile Range using the TI83


While you can use the nifty online interquartile range calculator on this website, that might not be an option in a
quiz or test. Most instructors allow the use of a TI-83 on tests, and it’s even one of the few calculators allowed
in the AP Statistics exam. Finding the TI 83 interquartile range involves nothing more than entering your data
list and pushing a couple of buttons.
Sample problem: Find the TI 83 interquartile range for the heights of the top 10 buildings in the world (as of
2009). The heights, (in feet) are: 2717, 2063, 2001, 1815, 1516, 1503, 1482, 1377, 1312, 1272.
Steps
Step 1: Enter the above data into a list on the TI 83 calculator. Press the STAT button and then press ENTER.
Enter the first number (2717), and then press ENTER. Continue entering numbers, pressing ENTER after each
entry.
Step 3: Press the right arrow button (the arrow keys are located at the top right of the keypad) to select “Calc.”
Step 4: Press ENTER to highlight “1-Var Stats.”
Step 5: Press ENTER again to bring up a list of stats.

Step 6: Scroll down the list with the arrow keys to find Q1 and Q3. Write those numbers down. You could copy
and paste the numbers but unfortunately, Texas Instruments doesn’t make this easy:
1. Use the arrow keys to place the cursor at the beginning of the
text that you want to highlight.
2. Using the TI Keyboard, press and hold down the Shift key, and then use the arrow keys to highlight the
text.
3. Release the Shift key and arrow key.
The copy and paste menu should appear, enabling you to copy and paste the data. You would have to do this
twice (returning to the HOME screen each time), so it’s much faster just to write the numbers down.
Step 7: Subtract Q1 from Q3 to find the IQR (624 feet for this set of numbers).
That’s it!
How to Find Q1, Q3 and the Interquartile Range TI 89

Sample problem: Find Q1, Q3, and the IQR for the following list of numbers: 1, 9, 2, 3, 7, 8, 9, 2.
Step 1: Press APPS. Scroll to Stats/List Editor (use the arrow keys on the keypad to scroll). Press ENTER. If
you don’t have the stats/list editor you can download it here.
Step 2: Clear the list editor of data: press F1 8.
Step 3: Press ALPHA 9 ALPHA 1 ENTER. This names your list “IQ.”
Step 4: Enter your numbers, one at a time. Follow each entry by pressing the ENTER key. For our group of
numbers, enter
1,9,2,3,7,8,9,2
Step 5: Press F4, then ENTER (for the 1-var stats screen).
Step 6: Tell the calculator you want stats for the list called “IQ” by entering ALPHA 9 ALPHA 1 into the
“List:” box. The calculator should automatically put the cursor there for you. Press ENTER twice.
Step 7: Read the results. Q1 is listed as Q1X (in our example, Q1X=2). Q3 is listed as Q3X (Q3X=8.5). To find
the IQR, subtract Q1 from Q3 on the Home screen. The IQR is 8.5-2=6.5.
That’s it!
What is The Interquartile Range Formula?

The IQR formula is:
IQR = Q3 – Q1
Where Q3 is the upper quartile and Q1 is the lower quartile.
IQR as a test for normal distribution
Use the interquartile range formula with the mean and standard deviation to test whether or not a population has
a normal distribution. The formulas to determine whether or not a population is normally distributed are:
Q1 = (σ z1) + X
Q3 = (σ z3) + X
Where Q1 is the first quartile, Q3 is the third quartile, σ is the standard deviation, z is the standard score (“z-
score“) and X is the mean. In order to tell whether a population is normally distributed, solve both equations and
then compare the results. If there is a significant difference between the results and the first or third quartiles,
then the population is not normally distributed.
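The comparison can be sketched as follows. The text does not give values for the two z-scores, so this example assumes they are the standard normal scores at the 25th and 75th percentiles (roughly -0.6745 and +0.6745); the function name and the mean of 50 with standard deviation 10 are invented for illustration.

```python
from statistics import NormalDist

def normal_quartiles(mean, sd):
    """Quartiles a normal distribution with this mean and sd would have."""
    z1 = NormalDist().inv_cdf(0.25)   # about -0.6745
    z3 = NormalDist().inv_cdf(0.75)   # about +0.6745
    return sd * z1 + mean, sd * z3 + mean

expected_q1, expected_q3 = normal_quartiles(mean=50, sd=10)
print(f"Expected Q1 = {expected_q1:.2f}, expected Q3 = {expected_q3:.2f}")
```

If the quartiles observed in the data are far from these expected values, that suggests the data are not normally distributed; how large a difference counts as significant is left open here, as it is in the text.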
What is an Interquartile Range Used For?
The IQR is used to measure how spread out the data points in a set are from the mean of the data set. The higher
the IQR, the more spread out the data points; in contrast, the smaller the IQR, the more bunched up the data
points are around the mean. The IQR is one of many measurements used to measure how spread out the
data points in a data set are. It is best used with other measurements such as the median and total range to build
a complete picture of a data set’s tendency to cluster around its mean.
Where Does the term Interquartile Range Come From?

Who invented the term “Interquartile Range?” In order to find that out, we have to go back to the 19th century.
History
British physician Sir Donald MacAlister used the terms lower quartile and higher quartile in the 1879
publication, the Law of the Geometric Mean. Proc. R. Soc. XXIX, p. 374: “As these two measures, with the
mean, divide the curve of facility into four equal parts, I propose to call them the ‘higher quartile’ and the ‘lower
quartile’ respectively.”
Although a physician by trade, he was gifted with mathematics and achieved the highest score in the final
mathematics exams at Cambridge University in 1877. He spoke nineteen languages including English, Czech
and Swedish.
Macalister’s paper, the Law of the Geometric Mean was actually in response to a question put forward by
Francis Galton (inventor of the Galton board). However, it wasn’t until 1882 that Galton (“Report of the
Anthropometric Committee”) used the upper quartile and lower quartile values and the term “interquartile
range” — defined as twice the probable error. Galton wasn’t just a statistician — he was also an anthropologist,
geographer, proto-genetecist and psychometrician who produced more than 340 books. He also coined the
statistical terms “correlation” and “regression toward the mean.”
------------------------------------------------------------------------------
Interquartile Range (IQR): What it is and How to Find it

The interquartile range is a measure of where the “middle fifty” is in a data set. Where a range is a measure of
where the beginning and end are in a set, an interquartile range is a measure of where the bulk of the values
lie. That’s why it’s preferred over many other summary measures (e.g. the average or median) when reporting
things like school performance or SAT scores.
The interquartile range formula is the first quartile subtracted from the third quartile:
IQR = Q3 – Q1.
Contents (click to skip to the page section):

Solving by hand:
1. Solve the formula by hand (odd set of numbers).

2. What if I have an even set of numbers?
3. Find an interquartile range for an odd set of numbers: Second Method
4. Box Plot interquartile range: How to find it
Using Technology:
1. Interquartile Range in Minitab

2. Interquartile Range in Excel
3. Interquartile Range in SPSS
4. Interquartile Range on the TI83
5. Q1, Q3 and the IQR on the TI89
General info:
1. What is an Interquartile range?

2. What is the Interquartile Range Formula?
3. IQR as a Test for Normal Distribution
4. What is an Interquartile Range used for?
5. History of the Interquartile Range.
Solve the formula by hand.

Steps:
Step 1: Put the numbers in order.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 2: Find the median.
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
Step 3: Place parentheses around the numbers above and below the median.
Not necessary statistically, but it makes Q1 and Q3 easier to spot.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
Step 4: Find Q1 and Q3.
Think of Q1 as a median in the lower half of the data and think of Q3 as a median for the upper half of
data.
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27). Q1 = 5 and Q3 = 18.
Step 5: Subtract Q1 from Q3 to find the interquartile range.
18 – 5 = 13.
What if I Have an Even Set of Numbers?

Sample question: Find the IQR for the following data set: 3, 5, 7, 8, 9, 11, 15, 16, 20, 21.
Step 1: Put the numbers in order.
3, 5, 7, 8, 9, 11, 15, 16, 20, 21.
Step 2: Make a mark in the center of the data:
3, 5, 7, 8, 9, | 11, 15, 16, 20, 21.
Step 3: Place parentheses around the numbers above and below the mark you made in Step 2. This
makes Q1 and Q3 easier to spot.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21).
Step 4: Find Q1 and Q3.
Q1 is the median (the middle) of the lower half of the data, and Q3 is the median (the middle) of the upper
half of the data.
(3, 5, 7, 8, 9), | (11, 15, 16, 20, 21). Q1 = 7 and Q3 = 16.
Step 5: Subtract Q1 from Q3.
16 – 7 = 9.
This is your IQR.
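The by-hand steps for odd and even sets can also be sketched in a few lines of Python. This is only an illustration of the "median of each half" method described above; the function name iqr_by_halves is made up for this example and is not part of any library.

```python
from statistics import median

def iqr_by_halves(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method.

    For an odd-sized set the median itself is left out of both halves.
    """
    data = sorted(values)             # Step 1: put the numbers in order
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2             # Step 2: locate the middle of the data
    lower = data[:half]               # Step 3: values below the median/mark
    upper = data[-half:]              #         values above the median/mark
    q1, q3 = median(lower), median(upper)   # Step 4: median of each half
    return q1, q3, q3 - q1            # Step 5: subtract Q1 from Q3

print(iqr_by_halves([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))  # (5, 18, 13)
print(iqr_by_halves([3, 5, 7, 8, 9, 11, 15, 16, 20, 21]))     # (7, 16, 9)
```

Both results match the worked examples: an IQR of 13 for the odd set and 9 for the even set.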
Find an interquartile range for an odd set of numbers: Alternate Method
As you may already know, nothing is “set in stone” in statistics: when some statisticians find an interquartile
range for an odd set of numbers, they include the median in both quartiles. For example, in the following set
of numbers: 1,2,5,6,7,9,12,15,18,19,27 some statisticians would break it into two halves, including the median
(9) in both halves:
(1,2,5,6,7,9),(9,12,15,18,19,27)
This leads to two halves with an even set of numbers, so you can follow the steps above to find the IQR.
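As a rough sketch of this alternate method, the Python below keeps the median in both halves when the set has an odd number of values. The function name iqr_median_in_both_halves is hypothetical and chosen only for this illustration.

```python
from statistics import median

def iqr_median_in_both_halves(values):
    """Return (Q1, Q3, IQR), keeping the median in both halves for odd-sized sets."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2        # for odd n, this makes the median land in both halves
    lower = data[:half]        # (1, 2, 5, 6, 7, 9) for the example set
    upper = data[n - half:]    # (9, 12, 15, 18, 19, 27) for the example set
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

print(iqr_median_in_both_halves([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))  # (5.5, 16.5, 11.0)
```

For this data set the alternate method gives Q1 = 5.5, Q3 = 16.5 and an IQR of 11, compared with 13 when the median is left out of both halves.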
Box Plot Interquartile Range: How to Find It
[Image: box plot interquartile range]
Sample question: Find the interquartile range for the above box plot.
Step 1: Find Q1. Q1 is represented by the left hand edge of the “box” (at the point where the whisker
stops).
[Image: finding Q1 on the boxplot graph]
In the above graph, Q1 is approximately at 2.6. (A complete explanation of Q1 is here: The five number
summary.)
Step 2: Find Q3. Q3 is represented on a boxplot by the right hand edge of the “box”.
[Image: finding Q3 on the boxplot]
Q3 is approximately 12 in this graph.
Step 3: Subtract the number you found in Step 1 from the number you found in Step 2.
This will give you the interquartile range. 12 – 2.6 = 9.4.
That’s it!
Interquartile Range in Minitab
Interquartile Range in Minitab: Steps

Sample question: Find an interquartile range in Minitab for the Grade Point Average (GPA) in the following
data set:
Grade Point Average (GPA): 1(3.2), 1(3.1), 2(3.5), 2(2.0), 3(1.9), 3(4.0), 3(3.9), 4(3.8), 4(2.9), 5(3.9), 5(3.2),
5(3.3), 6(3.4), 6(2.6), 6(2.5), 7(2.0), 7(1.5), 8(4.0), 8(2.0).
Step 1: Type your data into a Minitab worksheet. Enter your data into one or two columns.
Step 2: Click “Stat,” then click “Basic Statistics,” then click “Display Descriptive Statistics” to open the
Descriptive Statistics menu.
Step 3: Click a variable name in the left window and then click the “Select” button to transfer the variable
name to the right-hand window.
Step 4: Click the “Statistics” button.
Step 5: Check “Interquartile Range.”
Step 6: Click the “OK” button (a new window will open with the result). The IQR for the GPA in this
particular data set is 1.8.
That’s it!
Tip: If you don’t see descriptive statistics show in a window, click “Window” on the toolbar, then click “Tile.”
Click the Session window (this is where descriptive statistics appear) and then scroll up to see your results.
Interquartile Range in Excel 2007
How to Find an Interquartile Range in Excel 2007
Follow the steps below to find an interquartile range in Excel 2007:
Steps:
Step 1: Enter your data into a single Excel column on a worksheet. For example, type your data in cells A2 to
A10. Don’t leave any gaps in your data.
Step 2: Click a blank cell (for example, click cell B2) and then type =QUARTILE(A2:A10,1). You’ll need to
replace A2:A10 with the actual cell range of your data set. For example, if you typed your data into B2 to B50,
the equation is =QUARTILE(B2:B50,1). The “1” in this Excel formula (A2:A10,1) represents the first quartile
(i.e. the point lying at 25% of the data set).
Step 3: Click a second blank cell (for example, click cell B3) and then type =QUARTILE(A2:A10,3). Replace
A2:A10 with the actual cell range of your data set. The “3” in this Excel formula (A2:A10,3) represents the third
quartile (i.e. the point lying at 75% of the data set).
Step 4: Click a third blank cell (for example, click cell B4) and then type =B3-B2. If your quartile functions
from Step 2 and 3 are in different locations, change the cell references.
Step 5: Press the “Enter” key. Excel will return the IQR in the cell you clicked in Step 4.
That’s it!
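Excel’s QUARTILE function interpolates between data points, so its answer can differ from the by-hand “median of each half” method. As a hedged illustration, the Python sketch below uses statistics.quantiles with method="inclusive", which (as far as I know) follows the same interpolation convention as QUARTILE. The data set is the even-sized example from earlier in this article.

```python
from statistics import quantiles

data = [3, 5, 7, 8, 9, 11, 15, 16, 20, 21]

# n=4 splits the data into quarters and returns the three cut points
# (Q1, median, Q3); "inclusive" interpolates linearly between data points.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 7.25 15.75 8.5
```

Here the interpolated quartiles give an IQR of 8.5 rather than the 9 found by hand, which is another reminder that quartile conventions vary between tools.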
How to Find an Interquartile Range in SPSS

Like most technology, SPSS has several ways that you can calculate the IQR. However, if you click on the
most intuitive way you would expect to find it (“Descriptive Statistics > Frequencies”), the surprise is that it
won’t list the IQR (although it will list the first, second and third quartiles). You could take this route and then
subtract the first quartile from the third to get the IQR. However, the easiest way to find the interquartile range
in SPSS is by using the “Explore” command. If you have already typed data into your worksheet, skip to Step 3.
Steps
Step 1: Open a new data file in SPSS. Click “File,” mouse over “New” and then click “Data.”
Step 2: Type your data into columns in the worksheet. You can use as many columns as you need, but don’t
leave blank rows or spaces between your data. See: How to Enter Data into SPSS.
Step 3: Click “Analyze,” then mouse over “Descriptive Statistics.” Click “Explore” to open the “Explore”
dialog box.
Step 4: Click the variable name (that’s just a fancy name for the column heading), then click the top arrow to
move the variable into the “Dependent list” box.
[Image: The “Explore” variables dialog box.]
Step 5: Click “OK.” The interquartile range is listed in the Descriptives box.
Tip: This example has only one list typed into the data sheet, but you may have several to choose from
depending on how you entered your data. Make sure you select the right variable (column names) before
proceeding. If you want more memorable variable names, change the column title by clicking the “variable
view” button at the very bottom left of the worksheet. Type in your new variable name and then return to data
view by clicking the “data view” button.
What is an Interquartile Range?
Imagine all the data in a set as points on a number line. For example, if you have 3, 7 and 28 in your set of data,
imagine them as points on a number line that is centered on 0 but stretches both infinitely below zero and
infinitely above zero. Once plotted on that number line, the smallest data point and the biggest data point in the
set of data create the boundaries of an interval of space on the number line that contains all data points in the set.
The interquartile range (IQR) is the length of the part of that interval that contains the middle 50% of the data
points.
The interquartile range is the middle 50% of a data set. Box and whiskers image by Jhguch at en.wikipedia
If you want to know what the IQR is in formal terms, the IQR is calculated as: The difference between the third
or upper quartile and the first or lower quartile. Quartile is a term used to describe how to divide the set of
data into four equal portions (think quarter).
IQR Example
If you have a set containing the data points 1, 3, 5, 7, 8, 10, 11 and 13, the first quartile is 4, the second quartile
is 7.5 and the third quartile is 10.5. Draw these points on a number line and you’ll see that those three numbers
divide the number line in quarters from 1 to 13. As such, the IQR of that data set is 6.5, calculated as 10.5
minus 4. The first and third quartiles are also sometimes called the 25th and 75th percentiles because those are
the equivalent figures when the data set is divided into percents rather than quarters.
Interquartile Range using the TI83

While you can use the nifty online interquartile range calculator on this website, that might not be an option in a
quiz or test. Most instructors allow the use of a TI-83 on tests, and it’s even one of the few calculators allowed
in the AP Statistics exam. Finding the TI 83 interquartile range involves nothing more than entering your data
list and pushing a couple of buttons.
Sample problem: Find the TI 83 interquartile range for the heights of the top 10 buildings in the world (as of
2009). The heights, (in feet) are: 2717, 2063, 2001, 1815, 1516, 1503, 1482, 1377, 1312, 1272.
Steps
Step 1: Enter the above data into a list on the TI 83 calculator. Press the STAT button and then press ENTER.
Enter the first number (2717), and then press ENTER. Continue entering numbers, pressing ENTER after each
entry.
Step 2: Press the STAT button.
Step 3: Press the right arrow button (the arrow keys are located at the top right of the keypad) to select “Calc.”
Step 4: Press ENTER to highlight “1-Var Stats.”
Step 5: Press ENTER again to bring up a list of stats.

Step 6: Scroll down the list with the arrow keys to find Q1 and Q3. Write those numbers down. You could copy
and paste the numbers but unfortunately, Texas Instruments doesn’t make this easy:
1. Use the arrow keys to place the cursor at the beginning of the text that you want to highlight.
2. Using the TI Keyboard, press and hold down the Shift key, and then use the arrow keys to highlight the
text.
3. Release the Shift key and arrow key.
The copy and paste menu should appear, enabling you to copy and paste the data. You would have to do this
twice (returning to the HOME screen each time), so it’s much faster just to write the numbers down.
Step 7: Subtract Q1 from Q3 to find the IQR (624 feet for this set of numbers).
That’s it!
How to Find Q1, Q3 and the Interquartile Range on the TI 89
Sample problem: Find Q1, Q3, and the IQR for the following list of numbers: 1, 9, 2, 3, 7, 8, 9, 2.
Step 1: Press APPS. Scroll to Stats/List Editor (use the arrow keys on the keypad to scroll). Press ENTER. If
you don’t have the stats/list editor you can download it here.
Step 2: Clear the list editor of data: press F1 8.
Step 3: Press ALPHA 9 ALPHA 1 ENTER. This names your list “IQ.”
Step 4: Enter your numbers, one at a time. Follow each entry by pressing the ENTER key. For our group of
numbers, enter
1,9,2,3,7,8,9,2
Step 5: Press F4, then ENTER (for the 1-var stats screen).
Step 6: Tell the calculator you want stats for the list called “IQ” by entering ALPHA 9 ALPHA 1 into the
“List:” box. The calculator should automatically put the cursor there for you. Press ENTER twice.
Step 7: Read the results. Q1 is listed as Q1X (in our example, Q1X=2). Q3 is listed as Q3X (Q3X=8.5). To find
the IQR, subtract Q1 from Q3 on the Home screen. The IQR is 8.5-2=6.5.
That’s it!
What is The Interquartile Range Formula?

The IQR formula is:
IQR = Q3 – Q1
Where Q3 is the upper quartile and Q1 is the lower quartile.
IQR as a test for normal distribution
Use the interquartile range formula with the mean and standard deviation to test whether or not a population has
a normal distribution. The formulas used to check whether a population is normally distributed are:
Q1 = (σ × z1) + X
Q3 = (σ × z3) + X
Where Q1 is the first quartile, Q3 is the third quartile, σ is the standard deviation, z1 and z3 are the standard
scores (“z-scores”) of the first and third quartiles (about –0.6745 and +0.6745 for a normal distribution) and X is
the mean. In order to tell whether a population is normally distributed, solve both equations and then compare the
results with the quartiles found from the data. If there is a significant difference between the results and the
actual first or third quartiles, then the population is not normally distributed.
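The comparison described above can be sketched in Python as follows. This is a minimal illustration, not a formal normality test: the function name expected_quartiles and the sample data are made up for the example, and it assumes the z-scores are those of the 25th and 75th percentiles of a standard normal distribution.

```python
from statistics import NormalDist, mean, pstdev, quantiles

def expected_quartiles(values):
    """Quartiles that a normal distribution with this mean and SD would have."""
    x_bar = mean(values)
    sigma = pstdev(values)               # population standard deviation
    z1 = NormalDist().inv_cdf(0.25)      # about -0.6745
    z3 = NormalDist().inv_cdf(0.75)      # about +0.6745
    return sigma * z1 + x_bar, sigma * z3 + x_bar

# Hypothetical data, used only to show the comparison.
data = [61, 64, 66, 67, 68, 69, 70, 71, 73, 75, 78]

expected_q1, expected_q3 = expected_quartiles(data)
actual_q1, _, actual_q3 = quantiles(data, n=4, method="inclusive")

print("Q1 expected vs actual:", round(expected_q1, 2), actual_q1)
print("Q3 expected vs actual:", round(expected_q3, 2), actual_q3)
```

How large a gap counts as a “significant difference” is a judgment call that this sketch does not settle; it only lines up the two pairs of numbers for comparison.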
What is an Interquartile Range Used For?
The IQR is used to measure how spread out the data points in a set are. The higher the IQR, the more spread out
the middle half of the data; in contrast, the smaller the IQR, the more bunched up the data points are around the
median. The IQR is one of many measurements used to measure how spread out the data points in a data set are.
It is best used with other measurements such as the median and total range to build a complete picture of a data
set’s tendency to cluster around its center.
Where Does the term Interquartile Range Come From?

Who invented the term “Interquartile Range?” In order to find that out, we have to go back to the 19th century.
History
British physician Sir Donald MacAlister used the terms lower quartile and higher quartile in the 1879
publication, the Law of the Geometric Mean. Proc. R. Soc. XXIX, p. 374: “As these two measures, with the
mean, divide the curve of facility into four equal parts, I propose to call them the ‘higher quartile’ and the ‘lower
quartile’ respectively.”
Although a physician by trade, he was gifted in mathematics and achieved the highest score in the final
mathematics exams at Cambridge University in 1877. He spoke nineteen languages including English, Czech
and Swedish.
MacAlister’s paper, the Law of the Geometric Mean, was actually a response to a question put forward by
Francis Galton (inventor of the Galton board). However, it wasn’t until 1882 that Galton (“Report of the
Anthropometric Committee”) used the upper quartile and lower quartile values and the term “interquartile
range,” defined as twice the probable error. Galton wasn’t just a statistician: he was also an anthropologist,
geographer, proto-geneticist and psychometrician who produced more than 340 papers and books. He also coined
the statistical terms “correlation” and “regression toward the mean.”
------------------------------------------------------------------------------
Discrete vs Continuous variables: How to Tell the Difference
In an introductory stats class, one of the first things you’ll learn is the difference between discrete vs continuous
variables. In a nutshell, discrete variables are points plotted on a chart and a continuous variable can be plotted
as a line.
Discrete vs Continuous variables: Definitions.

What is a Discrete Variable?
Discrete variables are countable in a finite amount of time. For example, you can count the change in your
pocket. You can count the money in your bank account. You could also count the amount of money in
everyone’s bank accounts. It might take you a long time to count that last item, but the point is—it’s still
countable.
[Image: Discrete variables on a scatter plot.]
What is a Continuous Variable?

Continuous Variables would (literally) take forever to count. In fact, you would get to “forever” and never
finish counting them. For example, take age. You can’t count “age”. Why not? Because it would literally take
forever. For example, you could be:
25 years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds, 8 nanoseconds, 99 picoseconds…and so on.
[Image: Time is a continuous variable.]

You could turn age into a discrete variable and then you could count it. For example:
 A person’s age in years.

 A baby’s age in months.
Take a look at this article on orders of magnitude of time and you’ll see why time or age just isn’t countable.
Try counting your age in Planctoseconds (good luck…see you at the end of time!).
Discrete vs Continuous variables: Steps

Step 1: Figure out how long it would take you to sit down and count out the possible values of your variable.
For example, if your variable is “Temperature in Arizona,” how long would it take you to write every possible
temperature? It would take you literally forever:
50°, 50.1°, 50.11°, 50.111°, 50.1111°, …
If you start counting now and never, ever, ever finish (i.e. the numbers go on and on until infinity), you have
what’s called a continuous variable.
If your variable is “Number of Planets around a star,” then you can count all of the numbers out (there can’t be
an infinite number of planets). That is a discrete variable.
Step 2: Think about “hidden” numbers that you haven’t considered. For example: is time a discrete or
continuous variable? You might think it’s continuous (after all, time goes on forever, right?) but if we’re
thinking about numbers on a wristwatch (or a stop watch), those numbers are limited by the numbers or number
of decimal places that a manufacturer has decided to put into the watch. It’s unlikely that you’ll be given an
ambiguous question like this in your elementary stats class but it’s worth thinking about!
[Image: This graph of -4/5x+3 has continuous variables; it could go on forever…]
------------------------------------------------------------------------------
By Stephanie | March 22, 2018 | Statistics How To
Range of a Set of Data in Math and Statistics
Contents (click to skip to that section):
1. Definition
2. How to Find a Range
3. When it Might be Misleading
4. Rule of Thumb
5. Range in Excel
6. Origins / History
Definition of a Range (Statistics)

In statistics, the range is a measure of spread: it’s the difference between the highest value and the lowest value
in a data set.
Note: In some areas of math, the range can also mean the entire range of numbers — for example, the range of
cell phone prices might be $40 to $550. In calculus, the range is defined differently. It is all of the output values
of a function. See: How to Find the Domain and Range of a Function.
How to Find a Range in Statistics


The same two steps are used whether you are dealing with positive numbers, negative numbers, or time (e.g.
seconds or minutes).
How to Find a Range

Example question 1: What is the range for the following set of numbers? 10, 99, 87, 45, 67, 43, 45, 33, 21, 7,
65, 98?
Step 1: Sort the numbers in order, from smallest to largest:

7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99
Step 2: Subtract the smallest number in the set from the largest number in the set :
99 – 7 = 92
The range is 92
That’s it!
Example question 2: What is the range of these integers?

14, -12, 7, 0, -5, -8, 17, -11, 19
-12, -11, -8, -5, 0, 7, 14, 17, 19
19 – -12 = 19 + 12 = 31
The range is 31.
That’s it!
Example question 3: What is the range of the following times?

2.7 hrs, 8.3 hrs, 3.5 hrs, 5.1 hrs, 4.9 hrs
2.7, 3.5, 4.9, 5.1, 8.3
8.3 hr – 2.7 hr = 5.6 hr
The range is 5.6 hr.
That’s how to find a range!
Another Example.
Problem: You take 7 statistics tests over the course of a semester. You score 94, 88, 73, 84, 91, 87, and 79.
What is the range of your scores?
Solution:
Step 1: Order your scores from smallest to largest:
73, 79, 84, 87, 88, 91, 94.
Step 2: Subtract the smallest number from the highest = 94 – 73 = 21.
Answer: 21.
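The same two steps can be written as a short Python function. This is only a sketch; the name data_range is chosen for this example, and sorting is skipped because max and min find the largest and smallest values directly.

```python
def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

print(data_range([10, 99, 87, 45, 67, 43, 45, 33, 21, 7, 65, 98]))  # 92
print(data_range([14, -12, 7, 0, -5, -8, 17, -11, 19]))             # 31
print(round(data_range([2.7, 8.3, 3.5, 5.1, 4.9]), 1))              # 5.6
print(data_range([94, 88, 73, 84, 91, 87, 79]))                     # 21
```

The times example is rounded to one decimal place because decimal fractions such as 8.3 and 2.7 are not stored exactly in floating point.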
When it Might be Misleading

The range only uses the smallest and the largest number in a set; the rest of the values are ignored. That could
lead to a misleading result. Take the above test scores. Let’s say you had the flu one test day and scored a 10.
Assuming your highest score on another test was 94, then:
94 – 10 = 84!
That’s not a good reflection of your overall test performance at all.
The score of 10 in the example above is what we call an outlier. It’s an extremely high or low value that can
throw off stats. That’s why other measures of spread are sometimes preferred, like the interquartile range.
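To see how much a single outlier moves the range, the sketch below compares the range with the interquartile range (a measure of spread built from the middle of the data) for the test scores, with and without the flu-day score of 10. The helper names are made up for this example, and the IQR uses the “median of each half” method.

```python
from statistics import median

def data_range(values):
    return max(values) - min(values)

def iqr(values):
    data = sorted(values)
    half = len(data) // 2
    # Median of the upper half minus median of the lower half
    return median(data[-half:]) - median(data[:half])

scores = [94, 88, 73, 84, 91, 87, 79]
with_flu = scores + [10]          # add the flu-day score

print(data_range(scores), data_range(with_flu))  # 21 84
print(iqr(scores), iqr(with_flu))                # 12 13.5
```

The outlier quadruples the range (21 to 84) but barely changes the IQR (12 to 13.5).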
Rule of Thumb
The rule of thumb says that the range is about four times the standard deviation (Range = 4*SD). The standard
deviation is another measure of spread in statistics. It tells you how your data is clustered around the mean.
What the rule of thumb tells you in most cases is that the bulk of the data can be found pretty close to the mean
(within a couple of standard deviations); the result is that those erroneous “outliers” should have very little
effect on your final statistic.
Procedure for finding a standard deviation using the rule of thumb:

Step 1: Find the range.
Step 2: Divide Step 1 by four.
The rule of thumb doesn’t work that well for small data sets. And it doesn’t work at all if you don’t have data
that fits a normal distribution. That’s why you’ll rarely see it used in statistics. See: Range rule of thumb.
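As a quick check of how rough the rule of thumb can be on a small data set, this sketch applies the two steps to the seven test scores used earlier and compares the estimate with the sample standard deviation from Python’s statistics module.

```python
from statistics import stdev

scores = [94, 88, 73, 84, 91, 87, 79]

estimate = (max(scores) - min(scores)) / 4   # Step 1: find the range; Step 2: divide by four
actual = stdev(scores)                       # sample standard deviation

print(estimate)          # 5.25
print(round(actual, 1))  # 7.2
```

With only seven scores the estimate (5.25) is noticeably below the calculated standard deviation (about 7.2), which is consistent with the warning about small data sets.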
Range in Excel 2013-2016


To find a range in Excel, you have two options. The first is to use the MAX and MIN functions to find the largest
and smallest numbers in a data set and then subtract the two. For example, if you had a data set in cells A1
to A10, you’d need three formulas in three blank cells. Assuming you put these formulas into cells B1:B3, they
would be:
B1 = MAX(A1:A10)
B2 = MIN(A1:A10)
B3 =(B1-B2)
A much easier way is to use Data Analysis, where in just a couple of clicks (with no entering formulas) you can
display a variety of summary statistics, including the range (How to load the Data Analysis Toolpak).
Range in Excel: Data Analysis Steps

1st: Click the “Data” tab and then click “Data Analysis.”
2nd: Click “Descriptive Statistics” and then click “OK.”
3rd: Click the Input Range box and then type the location for your data. For example, if you typed your data into
cells A1 to A10, type “A1:A10” into that box.
4th: Click the radio button for Rows or Columns, depending on how your data is laid out.
5th: Click the “Labels in first row” box if your data has column headers.
6th: Click the “Summary statistics” check box.
7th: Select a location for your output. For example, click the “New Worksheet” radio button.
8th: Click “OK.”

Origins
The origin of the word “Range” in mathematics is unknown, but a few early uses of the word as it’s used in
statistics can be found as far back as 1848, in H. Lloyd, “On Certain Questions Connected with the Reduction of
Magnetical and Meteorological Observations,” Proceedings of the Royal Irish Academy, 4, 180-183 (David,
1995). Following this, the word was later used in a book on Calculus in 1865: The Differential Calculus by John
Spare mentions: “…in respect to the range of values which the function and its variable may sustain, and to their
mutual dependence” [University of Michigan Digital Library]. Although technically not statistics, the range in
calculus has practically the same meaning (the spread from the smallest value to the largest).
------------------------------------------------------------------------------

Fake Statistics: Real or Not? (With Examples)

Sometimes you think you can trust results from a survey, but it isn’t always easy to spot fake statistics. Do you
believe an egg company when it tells you 50% of people in a taste test preferred a certain brand of eggs? How
about if a survey of U.S. Marines showed support for massive military pay increases? Sometimes it isn’t enough
to just accept the given data. Dig a little deeper and you might uncover the truth.
“There are lies, damned lies, and statistics.” ~ Mark Twain
Why Do Fake Statistics Exist?
This misleading billboard has a fake statistic (Image: Manchester Evening News).
There are three main reasons why fake stats exist:
1. Deliberately misleading: To bolster a dubious claim, people might:

 Link to made up research.
 Show a fancy looking graph created with made up numbers.
 Link to an article published in a professional journal. But when the link is clicked, it actually
downloads a non-published pdf from a public access site.
2. Poor understanding: As anyone who has ever taken a stats class will tell you, stats is hard. Trying
to decipher even the simplest stat is fraught with pitfalls. The billboard above leads you to believe that
80% of dentists recommend Colgate over any other brand, but the dentists in that particular poll actually
recommended several brands (see: misleading graphs for a few more examples).
3. Ignorance: Some people share stories without fact checking them. For example, Donald Trump
famously shared a bogus graphic showing fake crime stats. The claims were widely debunked by
FactCheck.org, The Washington Post, and others; the Post article linked to FBI data which shows just the
opposite.
The source of this particular set of fake statistics—the “Crime Statistics Bureau”— doesn’t exist.
Questions to Ask
Who Paid for the Survey?
Take a close look at who paid for the survey. If you read that 90% of people lost 20 pounds in a month on a
certain “miracle” diet, look at who paid. If it was the company that owns that “miracle” product, then the
study has a conflict of interest: someone stands to make money from the results of the trial or survey. You may
have seen those soda ads where “90% of people prefer the taste of product X.” But if the manufacturer paid for
that survey, you probably can’t trust the results.
Are the Opinions Biased?

Take a look to see if the statistics came from a voluntary survey, where people can choose to be included or not.
For example, your professor might send you an email with an invitation to comment on what you think of a new
book. These types of samples are biased toward people who have strong opinions (often negative ones). In other
words, students are more likely to respond to the above survey if they hate the book. The students who like it
will be less likely to respond.
Is Causation Proved?
Look for the faulty conclusion that one variable causes another in the survey. For example, you might read that
unemployment causes an increase in corn production; corn products (like high fructose corn syrup) are cheap
and therefore people are more likely to buy cheap foods when unemployed. But there may be many other factors
causing an increase in production—including an increase in government subsidies. Just because one factor is
seemingly connected to another (correlation), that doesn’t necessarily imply that one caused the other. More
info: see Correlation vs. Causation.
*Studies with positive results are more likely to make it into journals, like those listed on PubMed.
Is the Publication Biased?

If you read a Tweet from Donald Trump, it likely has a right-leaning slant. On the other hand, Barack Obama’s
website has a strong pro-Democrat bias. That may be common knowledge, but many biases aren’t as clear. How do you
spot a biased news story? Look no further than Allsides.com, which keeps tabs on news sources and their
biases. For example, CNS News tends to the far right, while Buzzfeed is somewhat leftist.
In particular, watch out for misleading percentages. Unemployment may have “slowed by 50%,” but if the
unemployment rate was previously 100,000 new unemployment claims per month, that still means 50,000
people are joining the unemployed ranks every month.
*In academic and professional writing, beware of publication bias. Specifically, journals are more likely to
report positive results (for example, a drug trial that had a positive outcome) rather than a drug trial that failed.
Just because a source publishes a positive result doesn’t mean that there aren’t others out there that reported the
opposite.
Is the Sample Representative?

Make sure the sample isn’t too limited in scope. It’s unlikely you can generalize about student achievement
in the U.S. by studying a single inner-city school in Brooklyn. And it’s unlikely you can make generalizations
about American voting behavior by standing outside a polling booth in Mar-a-Lago, Florida. Just as inner-city
schools don’t behave like every other school, a rich neighborhood can’t be used to generalize about the voting
population. Also, make sure the sample size is large enough. If your voting precinct contains 1 million voters,
it’s unlikely you’ll get any good results from surveying 20 people.
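To get a feel for why 20 people is too few, here is a rough sketch of the usual margin-of-error formula for a proportion. The formula is not part of this article; it is used here under stated assumptions: a simple random sample, a 95% confidence level (z = 1.96), a population much larger than the sample, and the worst-case proportion p = 0.5. The function name is made up for the example.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample from a large population;
    p=0.5 is the worst case and z=1.96 corresponds to 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)


for n in (20, 1000):
    print(f"n = {n}: about +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under those assumptions, a survey of 20 voters is only good to within roughly 22 percentage points either way, while 1,000 voters brings that down to about 3 points.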
Are the Numbers Too Good to be True?

Beware of precise numbers. If a survey reports that 3,150,023 households in the U.S. are dog owners, you
might believe that figure. However, it’s practically impossible that anyone would have seriously surveyed all of
the households in the U.S. It’s much more likely they surveyed a sample and that 3,150,023 is an estimate. To
avoid misleading readers, it should have been reported as “an estimated 3 million.”
Question Everything!
There are many other examples of fake statistics. Newspapers sometimes print erroneous figures, drug
companies print fake test results, governments present fake statistics in their favor. The golden rule is: question
every statistic that you read!
References
Manchester Evening News. Retrieved June 4, 2016 from:
http://www.manchestereveningnews.co.uk/news/greater-manchester-news/kick-in-the-teeth-over-toothpaste-ads-979028.
------------------------------------------------------------------------------
Bayes’ Theorem Problems, Definition and Examples

What is Bayes’ Theorem?

Bayes’ theorem is a way to figure out conditional probability. Conditional probability is the probability of an
event happening, given that it has some relationship to one or more other events. For example, your probability
of getting a parking space is connected to the time of day you park, where you park, and what conventions are
going on at any time. Bayes’ theorem is slightly more nuanced; in a nutshell, it gives you the actual probability
of an event given information about tests.
 “Events” are different from “tests.” For example, there is a test (assess/check/troubleshoot) for liver
disease, but that’s separate from the event (occurrence/effect/confirmation) of actually having liver
disease.
 Tests are flawed: just because you have a positive test does not mean you actually have the disease.
Many tests have a high false positive rate. For rare events, false positives tend to make up a larger share
of all positive results than they do for more common events. We’re not just talking about medical tests here.
For example, spam filtering can have high false positive rates. Bayes’ theorem takes the test results and
calculates the real probability that the event has occurred, given the test result (a back-calculation).
The Formula

Bayes’ Theorem (also known as Bayes’ rule) is a deceptively simple formula used to calculate conditional
probability. The Theorem was named after English mathematician Thomas Bayes (1701-1761). The formal
definition for the rule is:
P(A|B) = P(B|A) * P(A) / P(B)
A = Event; B = Test
In most cases, you can’t just plug numbers into an equation; you have to figure out what your “tests” and
“events” are first. For two events, A and B, Bayes’ theorem allows you to figure out P(A|B) (the probability that
event A happened, given that test B was positive) from P(B|A) (the probability that test B was positive, given
that event A happened). It can be a little tricky to wrap your head around, as technically you’re working
backwards; you may have to switch your tests and events around, which can get confusing. An example should
clarify what I mean by “switch the tests and events around.”
Bayes’ Theorem Example #1

You might be interested in finding out a patient’s probability of having liver disease (event/outcome) if they are
an alcoholic (test). “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.
 A (event): could mean the event “Patient has liver disease.” Past data tells you that 10% of patients
entering your clinic have liver disease. P(A) = 0.10.
 B (test): could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients
are alcoholics. P(B) = 0.05.
 You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is
your P(B|A): the probability that a patient is alcoholic, given that they have liver disease, is 7%.
Bayes’ theorem tells you:

P(A|B) = (0.07 * 0.1)/0.05 = 0.14
In other words, if the patient is an alcoholic, their chance of having liver disease is 0.14 (14%). This is a large
increase from the 10% suggested by past data. But it’s still unlikely that any particular patient has liver disease.
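If it helps to see the arithmetic as code, the calculation in Example #1 can be sketched like this. The function name and argument names are made up for the illustration; the numbers are the ones given in the example.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < p_b <= 1:
        raise ValueError("P(B) must be greater than 0 and at most 1")
    return p_b_given_a * p_a / p_b


# Example #1: A = patient has liver disease, B = patient is an alcoholic.
p_liver_given_alcoholic = bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05)
print(round(p_liver_given_alcoholic, 2))  # 0.14
```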
More Bayes’ Theorem Examples

Bayes’ Theorem Problems Example #2
Another way to look at the theorem is to say that one event follows another. Above I said “tests” and “events”,
but it’s also legitimate to think of it as the “first event” that leads to the “second event.” There’s no one right
way to do this: use the terminology that makes most sense to you.
In a particular pain clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of the
clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all the people
prescribed pain pills, 8% are addicts. If a patient is an addict, what is the probability that they will be prescribed
pain pills?
Step 1: Figure out what your event “A” is from the question. That information is in the italicized part of this
particular question. The event that happens first (A) is being prescribed pain pills. That’s given as 10%.
Step 2: Figure out what your event “B” is from the question. That information is also in the italicized part of
this particular question. Event B is being an addict. That’s given as 5%.
Step 3: Figure out the probability of event B (Step 2) given event A (Step 1). In other words, find what
P(B|A) is. We want to know “Given that people are prescribed pain pills, what’s the probability they are an
addict?” That is given in the question as 8%, or 0.08.
Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.
P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16
The probability of an addict being prescribed pain pills is 0.16 (16%).
Example #3: the Medical Test

A slightly more complicated example involves a medical test (in this case, a genetic test):
There are several forms of Bayes’ Theorem out there, and they are all equivalent (they are just written in
slightly different ways). In this next equation, “X” is used in place of “B.” In addition, you’ll see some changes
in the denominator. The proof of why we can rearrange the equation like this is beyond the scope of this article
(otherwise it would be 5,000 words instead of 2,000!). However, if you come across a question involving
medical tests, you’ll likely be using this alternative formula to find the answer:
P(A|X) = P(X|A) * P(A) / (P(X|A) * P(A) + P(X|~A) * P(~A))

1% of people have a certain genetic defect: P(A) = 0.01, so P(~A) = 0.99.

90% of tests for the gene detect the defect (true positives): P(X|A) = 0.9.
9.6% of the tests are false positives (the patient lacks the gene but the test still comes back positive):
P(X|~A) = 0.096.
If a person gets a positive test result, what is the probability that they actually have the genetic defect, P(A|X)?
The first step in solving Bayes’ theorem problems is to assign letters to events:
 A = having the faulty gene. The chance of that was given in the question as 1%. That also means the
probability of not having the gene (~A) is 99%.
 X = A positive test result.
So:
1. P(A|X) = Probability of having the gene given a positive test result.

2. P(X|A) = Chance of a positive test result given that the person actually has the gene. That was given in
the question as 90%.
3. P(X|~A) = Chance of a positive test if the person doesn’t have the gene. That was given in the question
as 9.6%.
Now we have all of the information we need to put into the equation:
P(A|X) = (.9 * .01) / (.9 * .01 + .096 * .99) = 0.0865 (8.65%).
The probability of having the faulty gene, given a positive test result, is 8.65%.
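The same calculation can be sketched in Python. This is just an illustration of the medical-test form of the equation; the function and parameter names are made up, and the inputs are the three numbers from the question.

```python
def posterior_from_test(prior, true_positive_rate, false_positive_rate):
    """Return P(A|X) for a positive result X.

    prior               -- P(A), the chance of having the condition
    true_positive_rate  -- P(X|A)
    false_positive_rate -- P(X|~A)
    """
    true_pos = true_positive_rate * prior          # P(X|A) * P(A)
    false_pos = false_positive_rate * (1 - prior)  # P(X|~A) * P(~A)
    return true_pos / (true_pos + false_pos)


# Example #3: 1% have the defect, 90% true positives, 9.6% false positives.
print(round(posterior_from_test(0.01, 0.90, 0.096), 4))  # 0.0865
```

Writing the denominator as "true positives plus false positives" makes it easier to see why a rare condition gives a low answer even with a fairly accurate test.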
Bayes’ Theorem Problems #4: A Test for Cancer

I wrote about how challenging physicians find probability and statistics in my post on reading mammogram
results wrong. Note: It’s not surprising that physicians are way off with their interpretation of results, given that
some tricky probabilities are at play. Here’s a second example of how Bayes’ Theorem works. I’ve used similar
numbers, but the question is worded differently to give you another opportunity to wrap your mind around how
you decide which is event A and which is event X.
Q. Given the following statistics, what is the probability that a woman has cancer if she has a positive
mammogram result?
 One percent of women over 50 yrs of age have breast cancer.

 Ninety percent of women who have breast cancer test positive on mammograms.
 Eight percent of women will have false positives.
Step 1: Assign events to A or X. You want to know what a woman’s probability of having cancer is, given a
positive mammogram. For this problem, actually having cancer is A and a positive test result is X.
Step 2: List out the parts of the equation (this makes it easier to work the actual equation):
P(A)=0.01
P(~A)=0.99
P(X|A)=0.9
P(X|~A)=0.08
Step 3: Insert the parts into the equation and solve. Note that as this is a medical test, we’re using the form of the
equation from example #3:
(0.9 * 0.01) / ((0.9 * 0.01) + (0.08 * 0.99)) = 0.102, or about 0.10.
The probability of a woman having cancer, given a positive test result, is about 10%.
Remember when I said earlier that there are many equivalent ways to write Bayes’ Theorem? Here
is another equation that you can use to figure out the above problem. You’ll get exactly the same result:
P(B|A) = P(B ∩ A) / (P(B ∩ A) + P(Bc ∩ A))
Note that in this form B is the event (having cancer) and A is the positive test result. The main difference
with this form of the equation is that it uses the probability terms intersection (∩) and complement (the
superscript c, written here as Bc). Think of it as shorthand: it’s the same equation, written in a different way.
In order to find the probabilities on the right side of this equation, use the multiplication rule:
P(B ∩ A) = P(B) * P(A|B)
The two sides of the equation are equivalent, and P(B) * P(A|B) is what we were using when we solved the
numerator in the problem above.
P(B) * P(A|B) = 0.01 * 0.9 = 0.009
For the denominator, we have P(Bc ∩ A) as part of the equation. This can be (equivalently) rewritten as
P(Bc) * P(A|Bc). This gives us:
P(Bc) * P(A|Bc) = 0.99 * 0.08 = 0.0792.
Inserting those two solutions into the formula, we get:

0.009 / (0.009 + 0.0792) = 0.102, or about 10%.
Bayes’ Theorem Problems: Another Way to Look at It.
Bayes’ theorem problems can be figured out without using the equation (although using the equation is probably
simpler). But if you can’t wrap your head around why the equation works (or what it’s doing), here’s the
non-equation solution for the same problem as in Example #3 (the genetic test problem) above.
Step 1: Find the probability of a true positive on the test. That equals people who actually have the defect (1%) *
true positive results (90%) = .009.
Step 2: Find the probability of a false positive on the test. That equals people who don’t have the defect (99%) *
false positive results (9.6%) = .09504.
Step 3: Figure out the probability of getting a positive result on the test. That equals the chance of a true positive
(Step 1) plus a false positive (Step 2) = .009 + .09504 = .10404.
Step 4: Find the probability of actually having the gene, given a positive result. Divide the chance of having a
real, positive result (Step 1) by the chance of getting any kind of positive result (Step 3) = .009/.10404 = 0.0865
(8.65%).
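Another way to picture the four steps is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 100,000 people (that number is made up purely so the counts come out as whole numbers); the percentages are the ones from the genetic test problem.

```python
def positive_test_counts(population=100_000):
    """Count true and false positives in a hypothetical population."""
    with_defect = population // 100                # 1% have the defect
    without_defect = population - with_defect      # the other 99%
    true_positives = with_defect * 90 // 100       # 90% of those test positive
    false_positives = without_defect * 96 // 1000  # 9.6% false positives
    return true_positives, false_positives


tp, fp = positive_test_counts()
share_with_defect = tp / (tp + fp)  # Step 4: true positives / all positives
print(tp, fp, round(share_with_defect, 4))  # 900 9504 0.0865
```

Out of 10,404 people who test positive, only 900 actually have the defect, which is where the 8.65% comes from.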
Other forms of Bayes’ Theorem

Bayes’ Theorem has several forms. You probably won’t encounter any of these other forms in an elementary
stats class. The different forms can be used for different purposes. For example, one version uses what Rudolf
Carnap called the “probability ratio.” The probability ratio rule states that the probability of any event H (like
a patient having liver disease) must be multiplied by the factor PR(H,E) = P_E(H)/P(H) to give the event’s
probability conditional on E (written P_E(H)). The Odds Ratio Rule is very similar to the probability ratio rule,
but it works with odds, and its factor divides a test’s true positive rate by its false positive rate (the
likelihood ratio). The formal definition of the Odds Ratio rule is OR(H,E) = P_H(E)/P_~H(E).
Bayesian Spam Filtering

Although Bayes’ Theorem is used extensively in the medical sciences, there are other applications. For example,
it’s used to filter spam. The event in this case is that the message is spam. The test for spam is that the message
contains some flagged words (like “viagra” or “you have won”). Here’s the equation set up (from Wikipedia),
read as “The probability a message is spam given that it contains certain flagged words”:
P(spam | words) = P(words | spam) * P(spam) / P(words)
The actual equations used for spam filtering are a little more complex; they contain more flags than just content.
For example, the timing of the message, or how often the filter has seen the same content before, are two other
spam tests.
------------------------------------------------------------------------------
By Stephanie | February 5, 2014 | Statistics How To |
Find a Five-Number Summary in Statistics: Easy Steps

Contents:
1. Find a Five-Number Summary by Hand

2. TI 89 Instructions
3. SPSS Instructions
4. 5 number summary in Excel (new window)
How to find a five-number summary in statistics:

Overview

The five number summary includes 5 items:
 The minimum.
 Q1 (the first quartile, or the 25% mark).
 The median.
 Q3 (the third quartile, or the 75% mark).
 The maximum.
The five number summary gives you a rough idea about what your data set looks like. For example, you’ll have
your lowest value (the minimum) and the highest value (the maximum). Although it’s useful in itself, the main
reason you’ll want to find a five-number summary is to find more useful statistics, like the interquartile range,
sometimes called the middle fifty.
This how-to article will guide you through how to find a five-number summary. Follow the steps below:

How to Find a Five-Number Summary: Steps

 Step 1: Put your numbers in ascending order (from smallest to largest). For this example data set, the
order is:
1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27.
 Step 2: Find the minimum and maximum for your data set. Now that your numbers are in order, these
should be easy to spot.
In the example in Step 1, the minimum (the smallest number) is 1 and the maximum (the largest number) is
27.
 Step 3: Find the median. The median is the middle number; in this example it is 9. If you aren’t sure how
to find the median, see: How to find the mean mode and median.
 Step 4: Place parentheses around the numbers above and below the median.
(This is not technically necessary, but it makes Q1 and Q3 easier to find.)
(1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27).
 Step 5: Find Q1 and Q3. Q1 can be thought of as the median of the lower half of the data, and Q3 can be
thought of as the median of the upper half of the data. Here, Q1 = 5 and Q3 = 18.
 Step 6: Write down the summary found in the above steps.
minimum = 1, Q1 = 5, median = 9, Q3 = 18, and maximum = 27.
That’s it!
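The steps above can be sketched in Python as follows. Keep in mind that software packages use several different quartile conventions; this sketch follows the by-hand method described here, where the median is left out of both halves when there is an odd number of values. Splitting the data evenly when the count is even is an assumption, and the function name is made up for the example.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    values = sorted(data)              # Step 1: ascending order
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = values[:half]              # numbers below the median
    # For an odd count, skip the median itself; for an even count, split evenly.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


print(five_number_summary([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))
# (1, 5, 9, 18, 27)
```

The interquartile range mentioned earlier is then just Q3 minus Q1, which is 18 - 5 = 13 for this data set.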
When the Summary doesn’t exist

Sometimes, it’s impossible to find a five-number summary. In order for the five numbers to exist, your data set
must meet these two requirements:
 Your data must be univariate. In other words, the data must be a single variable. For example, this list
of weights is one variable: 120, 100, 130, 145. If you have a list of ages and you want to compare the ages
to weights, it becomes bivariate data (two variables). For example: age 1 (25 pounds), 5 (60 pounds), 15
(129 pounds). The matching pairs make it impossible to find a five number summary.
 Your data must be ordinal, interval, or ratio.
Box and whisker chart

A box and whiskers chart is a visual representation of the summary.

Box Plot / Find a Five-Number Summary on the TI 89
A left skewed boxplot, showing a long left whisker. Image: SHU.EDU
When you create a box and whiskers chart on the TI-89, the TI-89 will automatically calculate the five number
summary for you.
Sample problem: Create a box and whiskers chart and find the five number summary for the following data:
200, 350, 300, 350, and 400.
Step 1: Create a new folder called “Box.” From the HOME screen, press F4 and scroll down to NewFold
(option B). Press ENTER.
Step 2: Press 2nd ALPHA, then press the (, – and X keys to spell B O X, and press ENTER.
Step 3: Press APPS, then scroll down to Stats/List Editor. Press ENTER twice.
Step 4: Press the down arrow key to get to the first line of the list. Enter your data into list1. Follow each entry
with a comma: 200, 350, 300, 350, 400.
Step 5: Press F2 then 1 to enter Plot Setup.
Step 6: Press F1, right arrow, and 5 to select mod box plot.
Step 7: Arrow down to Mark and select box.
Step 8: Arrow down and enter B O X (using the alphanumeric keypad) in the x. Press ENTER.
Step 9: Read the boxplot. Press F3 and use the left and right cursors to find Min(200), Q1(250), Med(350),
Q3(375), and Max(400).
That’s it!
Tip: if you want to change the folder back to MAIN, press MODES, scroll down to Current Folder. Press right
key, then press 1 ENTER.
Tip: If you get the error message undefined variable, it can be a frustrating process to try and solve the problem.
Clearing the memory *may* help, but an easier way to get the box plot to graph is to enter the data into “list 1”
in the List Editor and then type “list 1” as your “x” when defining the box plot.
Find a Five Number Summary in SPSS

Calculating the five number summary is pretty straightforward if you have a small data set, but for larger data
sets, which you will typically work with in SPSS, the task can be impossibly tedious. That’s where
software like SPSS comes in handy: tasks that would sometimes take hours by hand can be calculated in a
fraction of a second. The SPSS five number summary is calculated with the “Frequencies” tool.

Step 1: Open a new data sheet and type your data into a column (or several columns). To open a new data
sheet, click “File” in the toolbar, then click “new” and then click “Data.” Make sure you type your data without
spaces (in other words, don’t leave empty rows).
Step 2: Click “Analyze,” then click “Descriptive Statistics” and then click “Frequencies” to open the
Frequencies dialog box.
The five number summary in SPSS is calculated through the Frequency menu.
Step 3: Click a variable name (or several if you have entered your data into multiple columns) and then click
the central arrow to move them to the Variable(s) list box. Note that SPSS uses the term “Variables,” but all it
really means is the column header name. You can change this name by clicking the “Variable View” tab at
the bottom of the sheet.
Step 4: Click “Statistics” to open the Statistics dialog box.

Step 5: Check “Quartiles,” “median,” “minimum” and “maximum” and then click “Continue.”
Step 6: Click “OK”. The SPSS five number summary is calculated and the results are returned in a new
window.
Note: SPSS lists the first quartile (Q1) as the 25th percentile in the results window, and the third quartile (Q3) is
listed as the 75th percentile.
------------------------------------------------------------------------------
Contents:
1. About
2. List of Types
3. Different Sampling Methods: How to Tell the Difference
4. What is Sampling Error?
5. More Articles
About Samples
Samples are parts of a population.
For example, you might have a list of information on 100 people (your “sample”) out of 10,000 people (the
“population”). You can use that list to make some assumptions about the entire population’s behavior.
However, it’s not that simple. When you do stats, your sample size has to be ideal—not too large or too small.
Then once you’ve decided on a sample size, you must use a sound technique to collect the sample from the
population:
• Probability sampling uses randomization to select sample members. You know the probability of
each potential member’s inclusion in the sample (for example, 1/100). However, the odds don’t have to
be equal: some members might have a 1/100 chance of being chosen, while others might have 1/50.
• Non-probability sampling uses non-random techniques (i.e. the judgment of the researcher). You
can’t calculate the odds of any particular item, person or thing being included in your sample.
Types:
Common Types
The most common techniques you’ll likely meet in elementary statistics or AP statistics include taking a sample
with and without replacement. Specific techniques include:
• Bernoulli samples use independent Bernoulli trials on population elements. Each trial decides whether
its element becomes part of the sample. All population elements have an equal chance of being included,
so the sample size follows a binomial distribution.
Poisson samples (less common): an independent Bernoulli trial decides whether each population element
makes it into the sample, but the inclusion probability can differ from element to element.
• Cluster samples divide the population into groups (clusters). Then a random sample of clusters is
chosen. It’s used when researchers don’t know the individuals in a population but do know the population
subsets or groups.
• In systematic sampling, you select sample elements from an ordered frame. A sampling frame is just a
list of participants that you want to get a sample from. For example, in the equal-probability method,
choose a starting element from the list and then choose every kth element, where k = N/n. Small “n”
denotes the sample size and capital “N” equals the size of the population.
• Simple random sampling (SRS): select items completely at random, so that each element has the same
probability of being chosen as any other element, and each subset of k elements has the same probability
of being chosen as any other subset of k elements.
• In stratified sampling, sample each subpopulation independently. First, divide the population into
homogeneous (very similar) subgroups before getting the sample. Each population member belongs
to only one group. Then apply simple random or systematic sampling within each group to choose the sample.
Stratified randomization: a sub-type of stratified sampling used in clinical trials. First, divide patients
into strata, then randomize with permuted block randomization.
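Three of the techniques above (simple random, systematic and stratified sampling) can be sketched in a few lines of Python. This is only an illustration: the function names, the frame of 100 numbered units and the two strata are made up for the example, and the systematic version assumes the frame size is at least the sample size.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS without replacement: every subset of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th element, with k = N // n."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw an independent simple random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)              # fixed seed only so the sketch is repeatable
population = list(range(1, 101))    # hypothetical sampling frame, N = 100

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strata = {"low": population[:50], "high": population[50:]}  # hypothetical strata
strat = stratified_sample(strata, 5, rng)
```

With N = 100 and n = 10 the systematic sample uses k = 10, so consecutive selected elements are always 10 positions apart.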
Less Common Types

You’ll rarely (if ever) come across these techniques in a basic stats class. However, you’ll come across them in
the “real world”:
• Acceptance-rejection sampling: a way to sample from a distribution that is hard to sample from directly,
using a similar, more convenient distribution.
• Accidental sampling (also known as grab, convenience or opportunity sampling): draw a sample from
a convenient, readily available population. It doesn’t give a representative sample of the population but
can be useful for pilot testing.
• Adaptive sampling (also called response-adaptive designs): adapt your selection criteria as the
experiment progresses, based on preliminary results as they come in.
• Bootstrap sample: a resample drawn from a larger original sample with bootstrapping. Bootstrapping is
a type of resampling where you draw large numbers of smaller samples of the same size, with
replacement, from a single original sample.
• The Demon algorithm (physics) samples members of a microcanonical ensemble (used to represent
the possible states of a mechanical system which has an exactly specified total energy) with a given
energy. The “demon” represents a degree of freedom in the system which stores and provides energy.
• Critical case samples: with this method, you carefully choose cases to maximize the information you
can get from a handful of samples.
• Discrepant case sampling: you choose cases that appear to contradict your findings.
• Distance sampling: a widely used technique that estimates the density or abundance of animal
populations.
• The experience sampling method samples experiences (rather than individuals or members). In this
method, study participants stop at certain times and make notes of their experiences as they experience
them.
• Haphazard sampling: a researcher chooses items haphazardly, trying to simulate randomness.
However, the result may not be random at all and can be tainted by selection bias.
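Bootstrapping from the list above can be sketched as follows. The data values are hypothetical, and the sketch assumes the common convention that each resample has the same size as the original sample; it is an illustration rather than a full bootstrap analysis.

```python
import random
import statistics

def bootstrap_means(sample, n_resamples, rng):
    """Resample with replacement and record the mean of each resample."""
    size = len(sample)  # assumption: each resample is as large as the original
    return [statistics.mean(rng.choices(sample, k=size))
            for _ in range(n_resamples)]

rng = random.Random(1)                         # fixed seed for repeatability
original = [12, 15, 9, 20, 14, 11, 17, 13]     # hypothetical original sample
means = bootstrap_means(original, 1000, rng)

# The spread of the resampled means estimates the standard error of the mean.
standard_error = statistics.stdev(means)
```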
Additional Uncommon Types

You’ll probably not come across these in a basic stats class.
• Inverse sampling: based on negative binomial sampling. Take samples until a specified number of
successes have happened.
• Importance sampling: a method to model rare events.
• The Kish grid: a way to select members of a household for interviews that uses a random number
table for the selections.
• Latin hypercube: used to construct computer experiments. It generates samples of plausible
collections of values for parameters in a multidimensional distribution.
• Line-intercept sampling: a method where you include an element in a sample from a particular
region if a certain line segment intersects the element.
• Use maximum variation samples when you want to include extremes (like rich/poor or young/old).
A related technique: extreme case sampling.
• Multistage sampling: one of a variety of cluster sampling techniques where you choose random
elements from a cluster (instead of every member in the cluster).
• Quota sampling: a way to select survey participants. It’s similar to stratified sampling but researchers
choose members of a group based on judgment. For example, people closest to the researcher might be
chosen for ease of access.
• Respondent-driven sampling: a chain-referral sampling method where participants recommend other
people they know.
• A sequential sample doesn’t have a set size; take items one (or a few) at a time until you have enough
for your research. It’s commonly used in ecology.
• Snowball samples: existing study participants recruit future study participants from people they
know.
• Square root biased sample: a way to choose people for additional screenings at airports. A combination
of SRS and profiling.
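Inverse sampling from the list above can be sketched as a loop that keeps drawing until a target number of successes has been reached. The success probability, the target and the function name are illustrative assumptions, not values from the article.

```python
import random

def inverse_sample(p_success, successes_needed, rng):
    """Draw until `successes_needed` successes occur; return the number of draws."""
    draws = 0
    successes = 0
    while successes < successes_needed:
        draws += 1
        if rng.random() < p_success:   # one Bernoulli trial per draw
            successes += 1
    return draws

rng = random.Random(2)                 # fixed seed for repeatability
draws_taken = inverse_sample(p_success=0.2, successes_needed=5, rng=rng)
```

The number of draws is random (it follows a negative binomial pattern), but it can never be smaller than the number of successes required.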
Different Sampling Methods: How to Tell the Difference
You’ll come across many terms in statistics that define different sampling methods: simple random sampling,
systematic sampling, stratified random sampling and cluster sampling. Telling these sampling methods apart
can be a challenge.
Different Sampling Methods: How to Tell the Difference: Steps
Step 1: Find out if the study sampled from individuals (for example, picked from a pool of people). You’ll find
simple random sampling in a school lottery, where individual names are picked out of a hat. But a more
“systematic” way of choosing people can be found in “systematic sampling,” where every nth individual is
chosen from a population. For example, every 100th customer at a certain store might receive a “doorbuster”
gift.
Step 2: Find out if the study picked groups of participants. For large numbers of people (like the number of
potential draftees in the Vietnam war), it’s much simpler to pick people by groups (a form of cluster sampling).
In the case of the draft, draftees were chosen by birth date, “simplifying” the procedure.
Step 3: Determine if your study contained data from more than one carefully defined group (“strata” or
“clusters”). Some examples of strata could be: Democrats and Republicans, Renters and Homeowners, Country
Folk vs. City Dwellers, Jacksonville Jaguars fans and San Francisco 49ers fans. If there are two or more very
distinct, clear groups, you have a stratified sample or a “cluster sample.”
• If you have data about the individuals in the groups, that’s a stratified sample. To perform
stratified sampling, you could randomly sample each stratum independently.
• If you only have data about the groups themselves (you may only know the location of the individuals),
then that’s a cluster sample.
Step 4: Find out if the sample was easy to get. Convenience samples are like convenience stores: why go out of
your way to get samples, when you can nip out to the corner store? A classic example of convenience sampling
is standing at a shopping mall, asking passersby for their opinion.
What is Sampling Error?

Sampling error happens when you take a sample from the population rather than using the entire population. In
other words, it’s the difference between the statistic you measure and the parameter you would find if you took
a census of the entire population.
If you were to survey the entire population (like the US Census), there would be no sampling error. It’s nearly
impossible to calculate the exact sampling error, because the true population parameter is usually unknown.
However, when you take samples at random, you can estimate the error; that estimate is called the margin of error.
For example, suppose you wanted to figure out what percentage of people in a sample of a thousand were under 18,
and you came up with the figure 19.357%. If the actual percentage equals 19.300%, the difference
(19.357 – 19.300) of 0.057 percentage points, or about 0.3% of the actual value, is the sampling error. If you
continued to take samples of 1,000 people, you’d probably get slightly different statistics (19.1%, 18.9%,
19.5% etc.), but they would all be around the same figure. This is one of the reasons that
you’ll often see sample sizes of 1,000 or 1,500 in surveys: they produce a very acceptable margin of error of
about 3%.
Formula: a rough formula for the margin of error (at about 95% confidence) is 1/√n, where n is the size of the
sample. For example, a random sample of 1,000 has about a 1/√1000 = 3.2% margin of error.
Sampling error can be reduced but not eliminated; it is considered an acceptable tradeoff for not having to
measure the entire population. In general, the larger the sample, the smaller the margin of error. There is a
notable exception: if you use cluster sampling, this may increase the error because of the similarities between
cluster members. A carefully designed experiment or survey can also reduce error.
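The 1/√n rule of thumb is easy to check numerically. The sketch below is only the rough approximation quoted above, not an exact confidence-interval calculation, and the function name is made up for the example.

```python
import math

def rough_margin_of_error(n):
    """Rule-of-thumb margin of error for a simple random sample of size n."""
    return 1 / math.sqrt(n)

for n in (100, 1000, 1500):
    print(n, f"{rough_margin_of_error(n):.1%}")
# prints:
# 100 10.0%
# 1000 3.2%
# 1500 2.6%
```

Quadrupling the sample size only halves the margin of error, which is why survey sizes much larger than 1,000 to 1,500 bring diminishing returns.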
Another Type of Error

Non-sampling error is another reason why there can be a difference between the sample and the
population. It is caused by poor data collection methods (like faulty instruments or inaccurate data recording),
selection bias, non-response bias (where individuals don’t want to or can’t respond to a survey), or other
mistakes in collecting the data. Increasing the sample size will not reduce these errors. The key is to avoid
making the errors in the first place with a well-planned design for the survey or experiment.
More Articles
1. Area sampling and area frames.
2. What is the Large Enough Sample Condition?
3. What is a Sample?
4. How to Find a Sample Size in Statistics.
5. What is the 10% Condition?
6. What is Direct Sampling?
7. Double sampling.
8. What is Efficiency?
9. Latin Hypercube Sampling.
10. What is an Effective Sample Size?
11. Finite Population Correction Factor.
12. What is Markov Chain Monte Carlo?
13. Resampling techniques.
14. What is a Typical Case?
15. How to Use Slovin’s Formula.
16. Sample Distributions.
17. What is the Samp. Distribution of the Sample Proportion?
18. Sampling Design
19. Sampling Unit
20. What is Sampling variability?
21. Total Population Sampling
------------------------------------------------------------------------------
Standard Deviation: Simple Definition, Step by Step Video
Contents: Standard Deviation (click to skip to section):
Basics:
1. Standard Deviation Definition

2. How to Find the Sample Standard Deviation by Hand
More advanced topics:
1. Standard Deviation for a Binomial

2. Discrete Random Variable Standard Deviation
3. Standard Deviation for a Frequency Distribution
Using Technology:
1. Find the Standard Deviation in Minitab
2. Find the Standard Deviation in SPSS
Related articles:
1. Absolute standard deviation
Definition
Standard deviation is a measure of dispersion in statistics. “Dispersion” tells you how much your data is
spread out. Specifically, it shows you how much your data is spread out around the mean or average. For
example, are all your scores close to the average? Or are lots of scores way above (or way below) the average
score?
What Does it Look Like on a Graph?

The bell curve (what statisticians call a “normal distribution”) is commonly seen in statistics as a tool to
understand standard deviation.
The following graph of a normal distribution represents a great deal of data in real life. The mean, or average, is
represented by the Greek letter μ, in the center. Each segment (colored in dark blue to light blue) represents one
standard deviation away from the mean. For example, 2σ means two standard deviations from the mean.
[Figure: standard deviation diagram]
Real Life Example

A normal distribution curve can represent hundreds of situations in real life. Have you ever noticed in class that
most students get Cs while a few get As or Fs? That can be modeled with a bell curve. People’s weights,
heights, nutrition habits and exercise regimens can also be modeled with graphs similar to this one. That
knowledge enables companies, schools and governments to make predictions about future behavior. For
behaviors that fit this type of bell curve (like performance on the SAT), you’ll be able to predict that 34.1 + 34.1
= 68.2% of students will score close to the average score, within one standard deviation of the mean.
How to Find the Sample Standard Deviation by Hand
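As a minimal sketch of the by-hand calculation, assuming the usual sample formula s = √( Σ (x – x̄)² / (n – 1) ) and a small made-up data set (the numbers are hypothetical, not from the article):

```python
import math

def sample_sd_by_hand(data):
    """Sample standard deviation, step by step (divides by n - 1)."""
    n = len(data)
    mean = sum(data) / n                              # Step 1: find the mean
    squared_devs = [(x - mean) ** 2 for x in data]    # Step 2: squared deviations
    variance = sum(squared_devs) / (n - 1)            # Step 3: divide by n - 1
    return math.sqrt(variance)                        # Step 4: take the square root

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data; mean is 5
s = sample_sd_by_hand(scores)       # sqrt(32 / 7), roughly 2.14
```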

Standard Deviation for a Binomial

TI 83 Standard Deviation For a Binomial

The TI 83 doesn’t have a built-in function to find the standard deviation for a binomial. You have to enter the
equation manually:
σ = √(n * p * q), where q = 1 – p
Example problem: Find standard deviation for a binomial distribution with n = 5 and p = 0.12.
Step 1: Subtract p from 1 to find q.

1 – .12 ENTER
=.88
Step 2: Multiply n times p times q.

5 * .12 * .88 ENTER
=.528
Step 3: Find the square root of the answer from Step 2.

√.528 = .727 (rounded to 3 decimal places).
Standard Deviation For a Binomial: By Hand

A binomial distribution is one of the simplest types of distributions in statistics. It’s a type of distribution where
each trial ends in either success or failure: for example, winning the lottery or not winning the lottery. A coin toss
can also be a binomial experiment. You can find the standard deviation for a binomial distribution in two ways:
1. With a formula
2. With a probability distribution table (scroll down for the steps)
The formula to find the standard deviation for a binomial distribution is:
σ = √(n * p * q), where q = 1 – p

Example question:
Find the standard deviation for the following binomial distribution: flip a coin 1000 times to see how many
heads you get.
Step 1: Identify n and p from the question. Here, n is the number of trials (given as 1000) and p is the probability
of success, which is .5 (you have a 50% chance of getting a heads in any coin flip).
At this point you can insert those numbers into the formula and solve. If formulas aren’t your forte, follow these
additional steps:
Step 2: Multiply n by p:
1000 * .5 = 500.
Step 3: Subtract “p” from 1:

1 – .5 = .5.
Step 4: Multiply Step 2 by Step 3: 500 * .5 = 250.
Step 5: Take the square root of Step 4:

√ 250 = 15.81.
That’s it!
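Both worked examples (the coin flips here and the n = 5, p = 0.12 problem from the TI 83 section) follow the same three operations, which can be sketched as a short function. The function name is made up for the example.

```python
import math

def binomial_sd(n, p):
    """Standard deviation of a binomial distribution: sqrt(n * p * q)."""
    q = 1 - p                      # probability of failure
    return math.sqrt(n * p * q)

coin_flips = binomial_sd(1000, 0.5)   # about 15.81
ti83_example = binomial_sd(5, 0.12)   # about 0.727
```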
Standard Deviation of Discrete Random Variables
With discrete random variables, sometimes you’re given a probability distribution table instead of “p” and “n”.
As long as you have a table you can calculate the standard deviation of discrete random variables with this
formula:
σ = √( Σ (x – μ)² * P(x) )
Example question: Find the standard deviation of the discrete random variable shown in the following table,
which represents flipping three coins:
x      0      1      2      3
P(x)   0.125  0.375  0.375  0.125
Step 1: Find the mean (this is also called the expected value) by multiplying the probabilities by x in each
column and adding them all up:
μ = (0 * 0.125) + (1 * 0.375) + (2 * 0.375) + (3 * 0.125) = 1.5
Step 2: Work the inner part of the above equation, without the square root:
((0 – 1.5)² * 0.125) +
((1 – 1.5)² * 0.375) +
((2 – 1.5)² * 0.375) +
((3 – 1.5)² * 0.125)
= 0.75
Step 3: Take the square root of Step 2:

σ = √ 0.75 = 0.8660254.
That’s it!
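The three steps can be sketched in Python using the values and probabilities from the worked example (0 to 3 with probabilities 0.125, 0.375, 0.375, 0.125). The function name is made up for the example.

```python
import math

def discrete_sd(values, probs):
    """Mean and standard deviation of a discrete random variable."""
    mu = sum(x * p for x, p in zip(values, probs))                 # Step 1: expected value
    variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))  # Step 2: inner sum
    return mu, math.sqrt(variance)                                 # Step 3: square root

mu, sigma = discrete_sd([0, 1, 2, 3], [0.125, 0.375, 0.375, 0.125])
# mu is 1.5 and sigma is about 0.8660254
```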
Standard Deviation for a Frequency Distribution

The formula to find the standard deviation for a frequency distribution is:
[Formula: standard deviation for a frequency distribution]
Where:
• μ is the mean for the frequency distribution,
• f is the individual frequency counts,
• x is the value associated with the frequencies.
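A hedged sketch of one common version of this calculation follows. It assumes the population form σ = √( Σ f (x – μ)² / Σ f ), which may differ from the article's formula (a sample version would divide by Σ f – 1 instead); the values and frequencies are hypothetical.

```python
import math

def frequency_sd(values, freqs):
    """Mean and population-style SD for a frequency table (assumed form)."""
    n = sum(freqs)                                             # total number of observations
    mu = sum(f * x for x, f in zip(values, freqs)) / n         # weighted mean
    variance = sum(f * (x - mu) ** 2 for x, f in zip(values, freqs)) / n
    return mu, math.sqrt(variance)

# Hypothetical table: the value 1 occurs once, 2 occurs twice, 3 occurs once.
mu, sigma = frequency_sd([1, 2, 3], [1, 2, 1])
```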
How to find the Standard Deviation in Minitab

Example question: Find the standard deviation in Minitab for the following data: 102, 104, 105, 110, 112, 116,
124, 124, 125, 240, 245, 254, 258, 259, 265, 265, 278, 289, 298, 311, 321, 321, 324, 354
Step 1: Type your data into a single column in a Minitab worksheet.
Step 2: Click “Stat”, then click “Basic Statistics,” then click “Descriptive Statistics.”
Step 3: Select the variables you want to find the standard deviation for and then click “Select” to move the
variable names to the right window.
Step 5: Check the “Standard deviation” box and then click “OK” twice. The standard deviation will be
displayed in a new window.
That’s it!
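If Minitab isn't available, the same data can be checked with Python's standard library. This sketch assumes the sample standard deviation (n – 1 in the denominator) is the one wanted, which is the usual convention for descriptive statistics output; the population version is shown for comparison.

```python
import statistics

data = [102, 104, 105, 110, 112, 116, 124, 124, 125, 240, 245, 254,
        258, 259, 265, 265, 278, 289, 298, 311, 321, 321, 324, 354]

sample_sd = statistics.stdev(data)        # divides by n - 1
population_sd = statistics.pstdev(data)   # divides by n
print(round(sample_sd, 2), round(population_sd, 2))
```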
How to find the Standard Deviation in SPSS

The tool to calculate standard deviation in SPSS is found in the “Analyze > Descriptive Statistics” section of
the toolbar. You can also use the “Frequencies” option in the same menu. The steps below use the first option
(“Descriptives”) only.
If you have already typed in your data into a worksheet, skip to Step 3.
Step 1: Open a new worksheet to type in data. Once SPSS opens, select the “type in data” radio button to the
right of the “What would you like to do” dialog box.
Step 2: Type your data into the worksheet. You can use as many columns as you like to enter data, but don’t
leave any blank rows between your data.
Step 3: Click “Analyze” on the toolbar and then mouse over “Descriptive Statistics.” Click “Descriptives”
to open the variables dialog box.
Step 4: Select the variables you want to find descriptive statistics for. SPSS needs to know where the data is
that you want to calculate the standard deviation for. The system will populate the left box with possibilities
(columns of data that you entered) but you will need to select which variables you want to include and transfer
those lists to the right box. To transfer the lists, click the center arrow to move those variables from the left box
to the right box.
Step 5: Check the “Standard Deviation” box, then click “OK”. The answer will show to the right of the window,
in the last column headed “std deviation.”
------------------------------------------------------------------------------
How to Find a Coefficient of Variation

How to Find a Coefficient of Variation: Contents:
1. What is the Coefficient of Variation?

2. How to Find the Coefficient of Variation
What is the Coefficient of Variation?
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation to
the mean (average). For example, the expression “The standard deviation is 15% of the mean” is a CV.
The CV is particularly useful when you want to compare results from two different surveys or tests that have
different measures or values, for example two tests that have different scoring mechanisms. If sample A has
a CV of 12% and sample B has a CV of 25%, you would say that sample
B has more variation, relative to its mean.
Formula
The formula for the coefficient of variation is:
Coefficient of Variation = (Standard Deviation / Mean) * 100.

In symbols: CV = (SD / x̄) * 100.
Multiplying the coefficient by 100 is an optional step to get a percentage, as opposed to a decimal.
Coefficient of Variation Example

A researcher is comparing two multiple-choice tests with different conditions. In the first test, a typical multiple-
choice test is administered. In the second test, alternative choices (i.e. incorrect answers) are randomly assigned
to test takers. The results from the two tests are:
        Regular Test    Randomized Answers
Mean    59.9            44.8
SD      10.2            12.7
Trying to compare the two test results is challenging. Comparing standard deviations doesn’t really work,
because the means are also different. Calculation using the formula CV=(SD/Mean)*100 helps to make sense of
the data:
        Regular Test    Randomized Answers
Mean    59.9            44.8
SD      10.2            12.7
CV      17.03           28.35
Looking at the standard deviations of 10.2 and 12.7, you might think that the tests have similar results.
However, when you adjust for the difference in the means, the difference in variability is clearer:
Regular test: CV = 17.03%
Randomized answers: CV = 28.35%
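The two CV values can be reproduced with a one-line function, using the means and standard deviations from the example. The function name is made up for the sketch.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    return sd / mean * 100

regular = coefficient_of_variation(10.2, 59.9)      # about 17.03
randomized = coefficient_of_variation(12.7, 44.8)   # about 28.35
```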
The coefficient of variation can also be used to compare variability between different measures. For example,
you can compare IQ scores to scores on the Woodcock-Johnson III Tests of Cognitive Abilities.
Note: The Coefficient of Variation should only be used to compare positive data on a ratio scale. The CV has
little or no meaning for measurements on an interval scale. Examples of interval scales include temperatures in
Celsius or Fahrenheit, while the Kelvin scale is a ratio scale that starts at zero and cannot, by definition, take on
a negative value (0 degrees Kelvin is the absence of heat).
How to Find a Coefficient of Variation: Overview.
Use the following formulas to calculate the CV by hand for a population or a sample.
Population: CV = (σ / μ) * 100
Sample: CV = (s / x̄) * 100
σ is the standard deviation for a population; “s” plays the same role for a sample.
μ is the mean for the population; x̄ (“x-bar”) plays the same role for a sample.
In other words, to find the coefficient of variation, divide the standard deviation by the mean and
multiply by 100.
How to find a coefficient of variation in Excel.
You can calculate the coefficient of variation in Excel using the formulas for standard deviation and mean. For a
given column of data (e.g. A1:A10), you could enter “=STDEV(A1:A10)/AVERAGE(A1:A10)” and then multiply by 100.
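The same spreadsheet calculation can be sketched in Python for raw data. Like Excel's STDEV, `statistics.stdev` computes the sample standard deviation (n – 1 in the denominator); the ten values standing in for A1:A10 are hypothetical.

```python
import statistics

def cv_percent(data):
    """CV as a percentage, using the sample standard deviation."""
    return statistics.stdev(data) / statistics.mean(data) * 100

column = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]   # hypothetical values for A1:A10
cv = cv_percent(column)
```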
How to Find a Coefficient of Variation by Hand: Steps.
Sample question: Two versions of a test are given to students. One test has pre-set answers and a second test
has randomized answers. Find the coefficient of variation.
        Pre-set Answers    Randomized Answers
Mean    50.1               45.8
SD      11.2               12.9
Step 1: Divide the standard deviation by the mean for the first sample:
11.2 / 50.1 = 0.22355
Step 2: Multiply Step 1 by 100:

0.22355 * 100 = 22.355%
Step 3: Divide the standard deviation by the mean for the second sample:
12.9 / 45.8 = 0.28166
Step 4: Multiply Step 3 by 100:

0.28166 * 100 = 28.166%
That’s it! Now you can compare the two results directly.
------------------------------------------------------------------------------

Outliers: Finding Them in Data, Formula, Examples. Easy Steps and Video
Outliers are stragglers — extremely high or extremely low values — in a data set that can throw off your stats.
For example, if you were measuring children’s nose length, your average value might be thrown off if Pinocchio
was in the class.
Contents (Click to skip to the section):
1. What is an Outlier?
2. How to Find Outliers with the Interquartile Range.
3. How to Find Outliers with the Tukey Method and more advanced methods.
What is an outlier?

An outlier is a piece of data that is an abnormal distance from other points. In other words, it’s data that lies
outside the other values in the set. If you had Pinocchio in a class of children, the length of his nose compared
to the other children would be an outlier.
In this set of random numbers, 1 and 201 are outliers:
1, 99, 100, 101, 103, 109, 110, 201
“1” is an extremely low value and “201” is an extremely high value.
Outliers aren’t always that obvious. Let’s say you received the following paychecks last month:
$225, $250, $25, $235.
Your average paycheck is $183.75. But that small paycheck ($25) might be because you went on vacation, so a
weekly paycheck average of $183.75 isn’t a true reflection of how much you earned. Your average is actually
closer to $237 if you take the outlier ($25) out of the set.
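The effect of the outlier on the average can be checked directly with the four paychecks from the example:

```python
import statistics

paychecks = [225, 250, 25, 235]

mean_all = statistics.mean(paychecks)                                    # 183.75
mean_without_outlier = statistics.mean([p for p in paychecks if p != 25])  # about 236.67
```

Dropping the single $25 paycheck moves the average by more than $50, which is why outliers deserve a second look before you summarize data with a mean.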
Of course, trying to find outliers isn’t always that simple. Your data set may look like this:
70, 10, 32, 19, 22, 29, 36, 14, 49, 3.
You could take a guess that 3 might be an outlier and perhaps 70. But you’d be wrong: 70 is the only outlier in
this data set.
A box and whiskers chart (boxplot) often shows outliers:
[Figure: The outlier on this boxplot is outside of the box and whiskers.]
However, you may not have access to a box and whiskers chart. And even if you do, some boxplots may not
show outliers. For example, this chart has whiskers that reach out to include outliers:
Box and whiskers chart that includes outliers in the whiskers.
Therefore, don’t rely on finding outliers from a box and whiskers chart. That said, box and whiskers charts can
be a useful tool to display them after you have calculated what your outliers actually are. The most effective
way to find all of your outliers is by using the interquartile range (IQR). The IQR contains the middle bulk of
your data, so outliers can be easily found once you know the IQR.
How to Find Outliers Using the Interquartile Range (IQR)
An outlier is defined as any data point that lies more than 1.5 IQRs below the first quartile (Q1) or above the
third quartile (Q3) in a data set.
High = (Q3) + 1.5 IQR
Low = (Q1) – 1.5 IQR

Sample Question: Find the outliers for the following data set: 3, 10, 14, 22, 19, 29, 70, 49, 36, 32.
Step 1: Find the IQR, Q1 (25th percentile) and Q3 (75th percentile). Use our online interquartile range
calculator to find the IQR or, if you want to calculate it by hand, follow the steps in this article: Interquartile
Range in Statistics: How to find it.
IQR = 22
Q1 = 14
Q3 = 36
IQR, Q1 and Q3 found using the online calculator (see link in this step).
Step 2: Multiply the IQR you found in Step 1 by 1.5:

IQR * 1.5 = 22 * 1.5 = 33.
Step 3: Add the amount you found in Step 2 to Q3 from Step 1:

33 + 36 = 69.
This is your upper limit. Set this number aside for a moment.
Step 4: Subtract the amount you found in Step 2 from Q1 from Step 1:
14 – 33 = -19.
This is your lower limit. Set this number aside for a moment.
Step 5: Put the numbers from your data set in order:

3, 10, 14, 19, 22, 29, 32, 36, 49, 70
Step 6: Insert your low and high values into your data set, in order:
-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, 70
Step 7: Highlight any number below or above the numbers you inserted in Step 6:
-19, 3, 10, 14, 19, 22, 29, 32, 36, 49, 69, [70]
The only value outside the limits is 70, so 70 is the outlier.
That’s it!
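The steps above can be sketched in Python. This is a minimal illustration, not the calculator the article links to: the function name `iqr_outliers` is made up for this example, and it assumes the "median of each half" way of finding Q1 and Q3, which reproduces the Q1 = 14 and Q3 = 36 used above. Other quartile conventions can give slightly different limits.

```python
from statistics import median

def iqr_outliers(values, k=1.5):
    """Return (q1, q3, low, high, outliers) using median-of-halves quartiles.

    Hypothetical helper for illustration; needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in data if v < low or v > high]
    return q1, q3, low, high, outliers

sample = [3, 10, 14, 22, 19, 29, 70, 49, 36, 32]
q1, q3, low, high, outliers = iqr_outliers(sample)
print("Q1 =", q1, "Q3 =", q3)
print("Lower limit =", low, "Upper limit =", high)
print("Outliers:", outliers)
```

For an odd number of values, this sketch leaves the median itself out of both halves.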
How to Find Outliers with the Tukey Method
Frequency chart with boxplot at the top. The outliers are shown as dots outside the range of the whiskers.
The Tukey method for finding outliers uses the interquartile range to filter out very large or very small numbers.
It’s practically the same as the procedure above, but you might see the formulas written slightly differently and
the terminology is a little different as well. For example, the Tukey method uses the concept of “fences”.
The formulas are:
Lower fence = Q1 – 1.5(Q3 – Q1) = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(Q3 – Q1) = Q3 + 1.5(IQR)
Where:
Q1 = first quartile
Q3 = third quartile
IQR = Interquartile range
These equations give you two values, or “fences”. You can think of them as a fence that cordons off the outliers
from all of the values that are contained in the bulk of the data.
Sample question: Use Tukey’s method to find outliers for the following set of data: 1,2,5,6,7,9,12,15,18,19,38.
Step 1: Find the Interquartile range:
1. Find the median: 1,2,5,6,7,9,12,15,18,19,38. The median is 9.

2. Place parentheses around the numbers above and below the median — it makes Q1 and Q3 easier to
find.
(1,2,5,6,7),9,(12,15,18,19,38)
3. Find Q1 and Q3. Q1 can be thought of as a median in the lower half of the data. Q3 can be thought of
as a median for the upper half of data.
(1,2,5,6,7), 9, ( 12,15,18,19,38). Q1=5 and Q3=18.
4. Subtract Q1 from Q3. 18-5=13.
Step 2: Calculate 1.5 * IQR:

1.5 * IQR = 1.5 * 13 = 19.5
Step 3: Subtract from Q1 to get your lower fence:

5 – 19.5 = -14.5
Step 4: Add to Q3 to get your upper fence:

18 + 19.5 = 37.5.
Step 5: Add your fences to your data to identify outliers:

(-14.5) 1,2,5,6,7,9,12,15,18,19,(37.5),38.
Anything outside of the fences is an outlier. For this data set, 38 is the only outlier.
That’s how to find outliers with the Tukey method!
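The fence calculation can also be checked with Python's standard library. The sketch below is only an illustration: `tukey_fences` is a made-up helper name, and it relies on `statistics.quantiles` with its default "exclusive" method, which for this particular data set happens to give the same Q1 = 5 and Q3 = 18 as the hand calculation. That agreement is not guaranteed for other data sets, because quartile conventions differ.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence); hypothetical helper for illustration."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 38]
lower, upper = tukey_fences(data)
outliers = [v for v in data if v < lower or v > upper]

print("Lower fence:", lower)
print("Upper fence:", upper)
print("Outliers:", outliers)
```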
How to Find Outliers with Advanced Methods

1. Generalized ESD
2. Grubbs’ test.
3. Dixon’s Q Test.
4. Modified Thompson Tau Test
5. Peirce’s Criterion
Reference: John Tukey, Exploratory Data Analysis, Addison-Wesley, 1977, pp. 43-44.
------------------------------------------------------------------------------
Quantitative Variables (Numeric Variables) in Statistics:
Definition, Examples

Contents:
1. Definition of a quantitative variable

2. What is the Quantitative Data Condition?
1. Definition
Graph of categorical variables on the y-axis and quantitative/numerical data on the x-axis. Credit: Thupper|
Wikimedia Commons
Two types of variables are used in statistics: Quantitative (also called measurement or numerical variables) and
categorical (also called qualitative):
• Quantitative variables are numerical variables: counts, percents, or numbers.
• Categorical variables are descriptions of groups or things, like “breeds of dog” or “voting
preference”.
Examples of Quantitative Variables / Numeric Variables:
• High school Grade Point Average (e.g. 4.0, 3.2, 2.1).
• Number of pets owned (e.g. 1, 2, 4).
• Bank account balance (e.g. $100, $987, -$42).
• Number of stars in a galaxy (e.g. 100, 2301, 1 trillion).
• Average number of lottery tickets sold (e.g. 25, 2,789, 2 million).
• How many cousins you have (e.g. 0, 12, 22).
• The amount in your paycheck (e.g. $200, $1,457, $2,222).
General rule of thumb: if you can add it, it’s quantitative. For example, a G.P.A. of 3.3 and a G.P.A. of 4.0
can be added together (3.3 + 4.0 = 7.3), so that means it’s quantitative. On the other hand, grades of A, B, or C
can’t be added together unless you convert them to numbers, so A, B, and C, are not quantitative.
Examples of Categorical Variables:

• Class in college (e.g. freshman, sophomore, junior, senior).
• Party affiliation (e.g. Republican, Democrat, Independent).
• Type of pet owned (e.g. dog, cat, rodent, fish).
• Favorite author (e.g. Stephen King, James Patterson, Charles Dickens).
• Preferred airline (e.g. Southwest, Virgin, Qantas).
• Hair color (e.g. blond, brunette, black).
• Your race (e.g. Asian, Latino, black).
• Types of hats (e.g. sombrero, beanie, fedora).
As a general rule, if you can’t add something, then it’s categorical. For example, you can’t add cat + dog, or
Republican + Democrat.

What is a Quantitative Data Condition?

When you graph or plot statistical data, make sure you have quantitative data of known units. If you don’t
have known units, then you won’t be able to graph it. For example, the first list above states that “G.P.A.” is
quantitative data. However, you won’t be able to graph G.P.A. versus another variable (say, race or sex) unless
you actually have a unit, like 3.1 or 2.9. This sounds obvious, but with more complex data you should always
check the quantitative data condition for missing or nonsensical information before you start a graph.
Histograms, boxplots and scatter plots all require that you have quantitative (numerical) data. If you try to graph
categorical data with a histogram, boxplot or scatter plot, you’ll run into the same type of problem as if you try
to graph numerical data with pie charts: your graphs won’t make any sense. The following scatter plot
illustrates this point. I made a scatter plot in Microsoft Excel of categorical data (names) along with their ages.
Excel didn’t recognize the categorical data and assigned numbers instead. The scatter plot is meaningless;
no one will know that “1”, “2”, “3”, “4” and “5” refer to names, and even if they do, the graph will be a mess if
you have 100 names:
Scatter plot of names versus ages made in Excel.
A workaround to this problem could be to assign numbers to names (e.g. John = 1, Jan = 2…), and include a key
on the graph. However in this particular example, a scatter plot really isn’t the best choice for a graph— choose
the bar graph instead. A bar graph allows you to plot categories on one axis, so the quantitative data condition
doesn’t have to be met for one axis.
------------------------------------------------------------------------------
By Stephanie | August 19, 2013 | Statistics How To |
Linear Regression: Simple Steps, Video. Find Equation, Coefficient, Slope
Contents:
1. How to Find a Linear Regression Equation by Hand.

2. Find a Linear Regression Equation in Excel.
3. TI83 Linear Regression.
4. TI 89 Linear Regression
Finding related items:
1. How to Find the Regression Coefficient.

2. Find the Linear Regression Slope.
3. Find a Linear Regression Test Value.
Leverage:
1. Leverage in Linear Regression.

If you’re just beginning to learn about regression analysis, simple linear regression is the first type of
regression you’ll come across in a stats class.
Linear regression is one of the most widely used statistical techniques; it is a way to model a relationship
between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.
Most software packages and calculators can calculate linear regression. For example:
• TI-83.
• Excel.
You can also find a linear regression by hand.
Before you try your calculations, you should always make a scatter plot to see if your data roughly fits a line.
Why? Because regression will always give you an equation, and it may not make any sense if your data follows a
curve (an exponential one, for example) rather than a line.
Etymology
“Linear” means line. The word Regression came from a 19th-Century Scientist, Sir Francis Galton, who coined
the term “regression toward mediocrity” (in modern language, that’s regression toward the mean). He used the
term to describe the phenomenon of how nature tends to dampen excess physical traits from generation to
generation (like extreme height).
Why use Linear Relationships?

Linear relationships, i.e. lines, are easier to work with, and many phenomena are at least approximately linearly
related. If variables aren’t linearly related, then some math can often transform that relationship into a linear
one, so that it’s easier for the researcher (i.e. you) to understand.

You’re probably familiar with plotting line graphs with one X axis and one Y axis. The X variable is sometimes
called the independent variable and the Y variable is called the dependent variable. Simple linear regression
plots one independent variable X against one dependent variable Y. Technically, in regression analysis, the
independent variable is usually called the predictor variable and the dependent variable is called the criterion
variable. However, many people just call them the independent and dependent variables. More advanced
regression techniques (like multiple regression) use multiple independent variables.
Regression analysis can result in linear or nonlinear graphs. A linear regression is where the relationships
between your variables can be described with a straight line. Non-linear regressions produce curved lines.(**)
Simple linear regression for the amount of rainfall per year.
Regression analysis is almost always performed by a computer program, as the equations are extremely time-
consuming to perform by hand.
**As this is an introductory article, I kept it simple. But there’s actually an important technical difference
between linear and nonlinear, that will become more important if you continue studying regression. For details,
see the article on nonlinear regression.

Overview
Regression analysis is used to find equations that fit data. Once we have the regression equation, we can use the
model to make predictions. One type of regression analysis is linear analysis. When a correlation coefficient
shows that data is likely to be able to predict future outcomes and a scatter plot of the data appears to form a
straight line, you can use simple linear regression to find a predictive function. If you recall from elementary
algebra, the equation for a line is y = mx + b. This article shows you how to take data, calculate linear
regression, and find the equation y’ = a + bx. Note: If you’re taking AP statistics, you may see the equation
written as b0 + b1x, which is the same thing (you’re just using the variables b0 and b1 instead of a and b).
Read the steps below to find a linear regression equation by hand. If you would prefer to use Excel, see the Excel
section later in this article.

The Linear Regression Equation

Linear regression is a way to model the relationship between two variables. You might also recognize the
equation as the slope formula. The equation has the form Y= a + bX, where Y is the dependent variable (that’s
the variable that goes on the Y axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope
of the line and a is the y-intercept.
The first step in finding a linear regression equation is to determine if there is a relationship between the two
variables. This is often a judgment call for the researcher. You’ll also need a list of your data in x-y format (i.e.
two columns of data—independent and dependent variables).
Warnings:
1. Just because two variables are related, it does not mean that one causes the other. For example,
although there is a relationship between high GRE scores and better performance in grad school, it doesn’t
mean that high GRE scores cause good grad school performance.
2. If you attempt to try and find a linear regression equation for a set of data (especially through an
automated program like Excel or a TI-83), you will find one, but it does not necessarily mean the equation
is a good fit for your data. One technique is to make a scatter plot first, to see if the data roughly fits a line
before you try to find a linear regression equation.
How to Find a Linear Regression Equation: Steps

Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you
were finding the Pearson’s Correlation Coefficient.
Row   x     y     xy      x²      y²
1     43    99    4257    1849    9801
2     21    65    1365    441     4225
3     25    79    1975    625     6241
4     42    75    3150    1764    5625
5     57    87    4959    3249    7569
6     59    81    4779    3481    6561
Σ     247   486   20485   11409   40022
From the above table, Σx = 247, Σy = 486, Σxy = 20485, Σx² = 11409, Σy² = 40022. n is the sample size (6, in
our case).
Step 2: Use the following equations to find a and b.

a = (Σy × Σx² – Σx × Σxy) / (n × Σx² – (Σx)²)
b = (n × Σxy – Σx × Σy) / (n × Σx² – (Σx)²)

For this data, a = 65.1416 and b = .385225. The working is:
Find a:
• ((486 × 11,409) – (247 × 20,485)) / (6 × 11,409 – 247²)
• 484,979 / 7,445
• = 65.14
Find b:
• (6 × 20,485 – 247 × 486) / (6 × 11,409 – 247²)
• (122,910 – 120,042) / (68,454 – 61,009)
• 2,868 / 7,445
• = .385225
Step 3: Insert the values into the equation.

y’ = a + bx
y’ = 65.14 + .385225x
That’s how to find a linear regression equation by hand!
* Note that this example has a low correlation coefficient, and therefore wouldn’t be too good at predicting
anything.
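To double-check the hand calculation, the summation formulas for a and b can be written out in a few lines of Python. This is a sketch for verification only; the function name `fit_line` is an assumption made for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) for y' = a + b*x using the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2          # shared denominator
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

# The x and y columns from the table in Step 1.
xs = [43, 21, 25, 42, 57, 59]
ys = [99, 65, 79, 75, 87, 81]

a, b = fit_line(xs, ys)
print(f"y' = {a:.4f} + {b:.6f}x")
```

Running it on the six data pairs from the table should agree with the values of a and b found by hand, up to rounding.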
Find a Linear Regression Equation in Excel


Linear Regression Equation Microsoft Excel: Steps
Step 1: Install the Data Analysis Toolpak, if it isn’t already installed.
Step 2: Type your data into two columns: your “x” data into column A and your “y” data into column B. Do not
leave any blank cells between your entries.
Step 3: Click the “Data” tab on the Excel ribbon, then click “Data Analysis”.
Step 4: Click “regression” in the pop up window and then click “OK.”
The Data Analysis pop up window has many options, including linear regression.
Step 5: Select your input Y range. You can do this two ways: either select the data in the worksheet or type the
location of your data into the “Input Y Range” box. For example, if your Y data is in B2 through B10 then type
“B2:B10” into the Input Y Range box.
Step 6: Select your input X range by selecting the data in the worksheet or typing the location of your data into
the “Input X Range box.”
Step 7: Select the location where you want your output range to go by selecting a blank area in the worksheet
or typing the location of where you want your data to go in the “Output Range” box.
Step 8: Click “OK”. Excel will calculate the linear regression and populate your worksheet with the results.
Tip: The linear regression equation information is given in the last output set (the coefficients column). The
entry in the “Intercept” row is “a” (the y-intercept) and the entry in the “X Variable 1” row is “b” (the slope).
TI83 Linear Regression
TI 83 Linear Regression: Overview

Linear regression is tedious and prone to errors when done by hand, but you can perform linear regression in
the time it takes you to input a few variables into a list. Linear regression will only give you a reasonable result
if your data looks like a line on a scatter plot, so before you find the equation for a linear regression line you
may want to view the data on a scatter plot first. See this article for how to make a scatter plot on the TI 83.
TI 83 Linear Regression: Steps

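Sample problem: Find a linear regression equation (of the form y = ax + b) for x-values of 1, 2, 3, 4, 5 and
y-values of 3, 9, 27, 64, and 102.
Step 1: Press STAT, then press ENTER to enter the lists. If you already have numbers in L1 or L2, clear them
first: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2 if you need to.
Step 2: Enter your x-variables, one at a time. Follow each number by pressing the ENTER key. For our list, you
would enter: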
1 ENTER
2 ENTER
3 ENTER
4 ENTER
5 ENTER
Step 3: Scroll across to the next column, L2, using the arrow keys at the top right of the keypad.
Step 4: Enter your y-variables, one at a time. Follow each number by pressing the ENTER key. For our list, you
would enter:
3 ENTER
9 ENTER
27 ENTER
64 ENTER
102 ENTER
Step 5: Press the STAT button, then scroll to highlight “CALC.”
Step 6: Press 4 to choose “LinReg(ax+b)”. Press ENTER and then ENTER again. The TI 83 will return the
variables needed for the equation. Just insert the given variables (a, b) into the equation for linear regression
(y=ax+b). For the above data, this is y = 25.3x – 34.9.
That’s how to perform TI 83 Linear Regression!
How to Find a Linear Regression Slope: Overview

Remember from algebra that the slope is the “m” in the formula y = mx + b.
In the linear regression formula, the slope is the b in the equation y’ = a + bx.
They are basically the same thing. So if you’re asked to find linear regression slope, all you need to do is find b
in the same way that you would find m.
Calculating linear regression by hand is tricky, to say the least. There’s a lot of summation (that’s the Σ symbol,
which means to add up). The basic steps are below, or you can watch the video at the beginning of this
article. The video goes into a lot more detail about how to do summation. Finding the equation will also give
you the slope. If you don’t want to find the slope by hand (or if you want to check your work), you can also use
Excel.
How to Find Linear Regression Slope: Steps

Step 1: Find the following data from the information given: Σx, Σy, Σxy, Σx², Σy². If you don’t remember how
to get those variables from data, see this article on how to find a Pearson’s correlation coefficient. Follow the
steps there to create a table and find Σx, Σy, Σxy, Σx², and Σy².
Step 2: Insert the data into the b formula (there is no need to find a).

b = (n × Σxy – Σx × Σy) / (n × Σx² – (Σx)²)
If formulas scare you, you can find more comprehensive instructions on how to work the formula here: How to
Find a Linear Regression Equation: Overview.
How to Find the Regression Coefficient

A regression coefficient is the same thing as the slope of the line of the regression equation. The equation for
the regression coefficient that you’ll find on the AP Statistics test is: B1 = b1 = Σ [ (xi – x̄)(yi – ȳ) ] / Σ [ (xi – x̄)² ].
“ȳ” in this equation is the mean of y and “x̄” is the mean of x.
You could find the regression coefficient by hand (as outlined in the section at the top of this page).
However, you won’t have to calculate the regression coefficient by hand in the AP test — you’ll use your TI-83
calculator. Why? Calculating linear regression by hand is very time consuming (allow yourself about 30 minutes
to do the calculations and check them) and because of the huge number of calculations you have to make you’re
very likely to make mathematical errors. When you find a linear regression equation on the TI83, you get the
regression coefficient as part of the answer.
Sample problem: Find the regression coefficient for the following set of data:
x: 1, 2, 3, 4, 5.
y: 3, 9, 27, 64, 102.
Step 1: Press STAT, then press ENTER to enter LISTS. You may need to clear data if you already have
numbers in L1 or L2. To clear the data: move the cursor onto L1, press CLEAR and then ENTER. Repeat for L2
if you need to.
Step 2: Enter your x-data into a list. Press the ENTER key after each entry.
1 ENTER
2 ENTER
3 ENTER
4 ENTER
5 ENTER
Step 3: Scroll across to the next column, L2 using the arrow keys at the top right of the keypad.
Step 4: Enter the y-data:

3 ENTER
9 ENTER
27 ENTER
64 ENTER
102 ENTER
Step 5: Press the STAT button, then scroll to highlight “CALC.”
Step 6: Press 4 to choose “LinReg(ax+b)”. Press ENTER. The TI 83 will return the variables needed for the
linear regression equation. Because the calculator uses the form y = ax + b, the value you’re looking for (the
regression coefficient) is a, which is 25.3 for this set of data.
That’s it!
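The AP formula for the regression coefficient, b1 = Σ[(xi – x̄)(yi – ȳ)] / Σ[(xi – x̄)²], can be applied directly to the sample data as a check on the calculator. The sketch below is illustrative; `regression_coefficient` is a name chosen for this example.

```python
from statistics import mean

def regression_coefficient(xs, ys):
    """Slope b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

x = [1, 2, 3, 4, 5]
y = [3, 9, 27, 64, 102]

b1 = regression_coefficient(x, y)
print(f"Regression coefficient: {b1:.1f}")
```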
Linear Regression Test Value
Linear regression test values are used in simple linear regression exactly the same way as test values (like the z-
score or T statistic) are used in hypothesis testing. Instead of working with the z-table you’ll be working with a
t-distribution table. The linear regression test value is compared to the test statistic to help you support or reject
a null hypothesis.
Linear Regression Test Value: Steps

Sample question: Given a set of data with sample size 8 and r = 0.454, find the linear regression test value.
Note: r is the correlation coefficient.
Step 1: Find r, the correlation coefficient, unless it has already been given to you in the question. In this case, r
is given (r = .454). Not sure how to find r? See: Correlation Coefficient for steps on how to find r.
Step 2: Use the following formula to compute the test value (n is the sample size):
T = r √((n – 2) / (1 – r²))
How to solve the formula:

1. Replace the variables with your numbers:
T = .454 √((8 – 2) / (1 – [.454]²))
2. Subtract 2 from n:
8 – 2 = 6
3. Square r:
.454 × .454 = .206116
4. Subtract step (3) from 1:
1 – .206116 = .793884
5. Divide step (2) by step (4):
6 / .793884 = 7.557779
6. Take the square root of step (5):
√7.557779 = 2.74914154
7. Multiply r by step (6):
.454 × 2.74914154 = 1.24811026
The Linear Regression Test value, T = 1.24811026
That’s it!
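The same arithmetic, T = r √((n – 2) / (1 – r²)), takes one line in Python. The function name below is an assumption made for this sketch.

```python
import math

def regression_test_value(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)) for correlation r and sample size n."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t_value = regression_test_value(r=0.454, n=8)
print(f"T = {t_value:.4f}")
```

Doing the calculation in one step also avoids the small rounding differences that creep in when each intermediate result is written down by hand.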
Finding the test statistic

The linear regression test value isn’t much use unless you have something to compare it to. Compare your value
to the test statistic. The test statistic is also a t-score (t) defined by the following equation:
t = slope of the sample regression line / standard error of the slope.
See: How to find a linear regression slope / How to find the standard error of the slope (TI-83).
You can find a worked example of calculating the linear regression test value (with an alpha level) here:
Correlation Coefficients.
Leverage in Linear Regression
Data points that have leverage have the potential to move a linear regression line. They tend to be outliers. An
outlier is a point that is either an extremely high or extremely low value.
Influential Points
If the parameter estimates (sample standard deviation, variance etc.) change significantly when an outlier is
removed, that data point is called an influential observation.
The more a data point differs from the mean of the other x-values, the more leverage it has. The more leverage a
point has, the higher the probability that point will be influential (i.e. it could change the parameter estimates).
Leverage in Linear Regression: How it Affects Graphs
In linear regression, the influential point (outlier) will try to pull the linear regression line toward itself. The
graph below shows what happens to a linear regression line when outlier A is included:
Two linear regression lines. The influential point A is included in the upper line but not in the lower line.
Outliers with extreme X values (values that aren’t within the range of the other data points) have more leverage
in linear regression than points with less extreme x values. In other words, extreme x-value outliers will move
the line more than less extreme values.
The following graph shows a data point outside of the range of the other values. The values range from 0 to
about 70,000. This one point has an x-value of about 80,000 which is outside the range. It affects the regression
line a lot more than the point in the first image above, which was inside the range of the other values.
A high-leverage outlier. The point has moved the graph more because it is outside the range of the other values.
In general, outliers that have values close to the mean of x will have less leverage than outliers towards the edges
of the range. Outliers with values of x outside of the range will have more leverage. Values that are extreme on
the y-axis (compared to the other values) will have more influence than values closer to the other y-values.
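Leverage can be put into numbers. The article doesn't give a formula, so the sketch below brings in the standard leverage formula for simple linear regression, h = 1/n + (x – x̄)² / Σ(x – x̄)², as an outside assumption. The x-values are hypothetical and chosen only so that one point sits outside the range of the others.

```python
def leverages(xs):
    """Leverage of each point in simple linear regression:
    h_i = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

# Hypothetical x-values: five points between 10 and 50, plus one far out at 80.
xs = [10, 20, 30, 40, 50, 80]
hs = leverages(xs)

for x, h in zip(xs, hs):
    print(f"x = {x:>2}  leverage = {h:.3f}")
```

The point furthest from the mean of x gets the largest leverage value, and points near the mean get the smallest, which matches the description above.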
------------------------------------------------------------------------------
Standard Error of Regression Slope

Standard Error of Regression Slope: Overview

Standard errors for regression are measures of how spread out your y values are around the regression line. The
standard error of the estimate, s, represents the average distance that your observed values deviate from the
regression line. The smaller the “s” value, the closer your values are to the regression line. The standard error
of the regression slope, sb1, is s divided by the square root of Σ(xi – x̄)²; it measures how precisely the slope
has been estimated.
Standard error of regression slope is a term you’re likely to come across in AP Statistics. In fact, you’ll find
the formula on the AP statistics formulas list given to you on the day of the exam.
Standard Error of Regression Slope Formula

SE of regression slope = sb1 = sqrt [ Σ(yi – ŷi)² / (n – 2) ] / sqrt [ Σ(xi – x̄)² ].
The equation looks a little ugly, but the secret is you won’t need to work the formula by hand on the test. Even if
you think you know how to use the formula, it’s so time-consuming to work that you’ll waste about 20-30
minutes on one question if you try to do the calculations by hand! The TI-83 calculator is allowed in the test and
it can help you find the standard error of regression slope.
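Away from the exam, the formula is easy to evaluate in code. The following is a minimal sketch in plain Python; the five data points are hypothetical and the function name is an assumption made for illustration.

```python
import math

def slope_standard_error(xs, ys):
    """Return (slope b1, residual standard error s, standard error of b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # least-squares slope
    b0 = mean_y - b1 * mean_x           # least-squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))        # numerator of the formula
    return b1, s, s / math.sqrt(sxx)    # divide by sqrt of sum (x_i - mean)^2

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, s, se_b1 = slope_standard_error(xs, ys)
print(f"slope = {b1:.3f}, s = {s:.3f}, SE of slope = {se_b1:.3f}")
```

Note that s and the SE of the slope are different numbers: s is the spread of the residuals, and the SE of the slope is s divided by the square root of Σ(x_i - x̄)^2.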
Note: The TI-83 doesn’t find the SE of the regression slope directly; the “s” reported on the output is the SE of
the residuals, not the SE of the regression slope. However, you can use the output to find it with a simple
division.
Step 1: Enter your data into lists L1 and L2. (If you don’t know how to enter data into a list, see: TI-83 Scatter
Plot.)
Step 2: Press STAT, scroll right to TESTS and then select E:LinRegTTest
Step 3: Type in the name of your lists into the Xlist and Ylist. For example, type L1 and L2 if you entered your
data into list L1 and list L2 in Step 1.
Step 4: Select the sign from your alternate hypothesis. For example, select (≠ 0) and then press ENTER.
Step 5: Highlight Calculate and then press ENTER.
Step 6: Find the “t” value and the “b” value. You may need to scroll down with the arrow keys to see the result.
For example, let’s say your t value was -2.51 and your b value was -0.067.
Step 7: Divide b by t. For this example, -0.067 / -2.51 = 0.027.
The standard error of regression slope for this example is 0.027.
That’s it!
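The division works because the t statistic reported by a slope test of this kind is the slope divided by its standard error (t = b / SE), so SE = b / t. A quick check of the arithmetic, using the b and t values from Step 6:

```python
b = -0.067  # slope "b" read from the calculator output in Step 6
t = -2.51   # test statistic "t" from the same output

se_slope = b / t  # rearranged from t = b / SE
print(round(se_slope, 3))  # 0.027
```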
Reference:
Duane Hinders. 5 Steps to AP Statistics, 2014-2015 Edition.
------------------------------------------------------------------------------

By Stephanie | November 11, 2013 | Statistics How To |
Validity Coefficient: Definition and How to Find it

You may find it useful to read about validity first. See:
• Internal validity
• Construct validity
Validity Coefficient: Definition

Validity tells you how useful your experimental results are; a validity coefficient is a gauge of how strong (or
weak) that “usefulness” factor is. For example, let’s say your research shows that a student with a high GPA
should perform well on the SAT and in college. A validity coefficient can tell you more about the strength of
that relationship between test results and your criterion variables.
Example (for testing concurrent validity): you want to design an instrument that measures “success in
college.” You design a scale called the SUCCESS scale which measures how well students will do in their first
year of college based on GPA, social skills, extra-curricular interests and other criteria. The score ranges from
0 to 10, with career counselors grading students on a 5-point item for each set of criteria. As a criterion, you
have a second set of college advisers grade the students at the end of their first year. You correlate your
SUCCESS rankings with the rankings obtained from the college advisers. This gives you a validity coefficient.
In practice, validity coefficients usually fall between zero and .50, where values near 0 indicate weak validity
and .50 indicates moderate validity. Although the possible range of the validity coefficient is the same as for
other correlation coefficients (0 to 1 in absolute value), validity coefficients tend not to be that strong; this
means that other tests are usually required. It’s not unusual for validity coefficients to max out at around .30.
For the above example, this low correlation means that some students with high GPAs may not perform well on
standardized tests or in college.
How to find Validity Coefficients

The validity coefficient is just another type of correlation coefficient. Therefore, you can use any statistical
software to find validity correlation.
You can use the Excel CORREL function to find correlation coefficients:
1. Type your data into a worksheet. Your independent variables should be in one column and your
dependent variables should be in a second column.
2. Click the function button on the toolbar (fx).
3. Type “Correl” to find the Correl function. Click on “Correl.”
4. Type the cell locations of your independent variables into the array 1 box. For example, A1:A30.
5. Type the cell locations of your dependent variables into the array 2 box. For example, B1:B30.
6. Click “OK.”
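The same calculation can be sketched outside Excel. The example below computes a Pearson correlation coefficient in plain Python, which is the quantity CORREL returns; the two lists of ratings are hypothetical stand-ins for the SUCCESS scores and the advisers' ratings, not data from the example above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings for ten students (illustration only).
success_scores = [6, 4, 8, 5, 7, 3, 9, 5, 6, 7]
adviser_ratings = [5, 5, 7, 4, 8, 4, 6, 6, 5, 7]

validity_coefficient = pearson_r(success_scores, adviser_ratings)
print(f"validity coefficient = {validity_coefficient:.2f}")
```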
Directions for finding validity correlations in other software programs are covered in:
• Correlation Coefficient SPSS
• Minitab Correlation Coefficient
Reference:
Neil J. Salkind. Tests & Measurement for People Who (Think They) Hate Tests & Measurement
------------------------------------------------------------------------------
Quadratic Regression: Simple Definition, TI-Calculator Instructions
Contents:
1. What is Quadratic Regression?
2. The Quadratic Equation
3. R-Squared
4. Find the Equation with a Calculator
5. Find by Hand
What is Quadratic Regression?
Data points that suggest quadratic regression would be a good fit.
Quadratic regression is finding the best fit equation for a set of data shaped like a parabola.
The first step in regression is to make a scatter plot. If your scatter plot is in a “U” shape, either concave up
(like the letter U) or concave down (∩), you’re probably looking at some type of quadratic equation as the best
fit for your data. A quadratic doesn’t have to be a full “U” shape; you can have part of it (say, a quarter or 3/4).
Quadratic regression is an extension of simple linear regression. While linear regression can be performed
with as few as two points (i.e. enough points to draw a straight line), quadratic regression comes with the
disadvantage that it requires more data points to be certain your data falls into the “U” shape. It can technically
be performed with three data points that fit a “V” shape, but more points are desirable. As more data points are
required, it’s also more costly than simple linear regression (Leeuwen, 2010).
Quadratic Regression Equation
Quadratic regression is a way to model a relationship between two sets of variables. The result is a regression
equation that can be used to make predictions about the data. The equation has the form:
y = ax^2 + bx + c,
where a ≠ 0.
What is R-Squared in a Quadratic Regression?
R-squared (the coefficient of determination, or R^2) tells you how much of the variation in y is explained by the
x-variables. The range is 0 to 1, where 0 means 0% of the variation is explained and 1 means 100% is explained.
It is used to analyze how differences in one variable can be explained by a difference in a second variable. For
example, when a woman gets pregnant has a direct relation to when she gives birth, so R-squared would be close
to 100%. On the other hand, R-squared would be practically zero for when a woman gets pregnant and when she
throws a retirement party for a parent.
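As a rough illustration, R-squared can be computed from observed and predicted values using the standard definition R^2 = 1 - SS_res / SS_tot. The data and the fitted curve y = x^2 below are hypothetical, chosen only to show the calculation.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical data that roughly follow the parabola y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [4.1, 0.9, 0.2, 1.1, 3.8]
predicted = [x ** 2 for x in xs]  # predictions from the assumed quadratic

r2 = r_squared(ys, predicted)
print(f"R-squared = {r2:.3f}")
```

A value close to 1 here means the assumed parabola accounts for nearly all of the variation in the y-values.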
Find the Equation with a Calculator

Contents:
1. TI-83 Instructions
2. TI-89 Instructions
TI-83 Instructions
Step 1: Press STAT and then press ENTER to open the list editor.
Step 2: Enter your x-variables, one at a time, into the L1 column. Press the ENTER key after each entry.
Step 3: Use the arrow keys to scroll across to L2 (the next column to the right).
Step 4: Enter your y-variables, one at a time. Press ENTER after each number.
Step 5: Press STAT.
Step 6: Arrow right to CALC and then arrow down to QuadReg. Press ENTER.
Step 7: Type in the following parameters: L1, L2, Y1. Here are the steps to do that:
1. Press [2nd] and then 1.
2. Press the comma key.
3. Press [2nd] and then 2.
4. Press the comma key.
5. Press VARS, right arrow to Y-VARS and press ENTER.
6. Choose Y1 and press ENTER.
Step 8: Press ENTER to calculate the regression.
Tip: Press GRAPH to graph the parabola. From there, you can determine if the equation is a good fit for the
data.
TI-89 Instructions
Sample Problem: Perform a quadratic regression on the TI-89 for the following data set:
x: 1, 2, 3, 4, 5, 6, 7, 8, 9
y: 32.5, 35.9, 37.3, 37.9, 36.4, 32.7, 32.4, 29.5, 28.5
Step 1: Press APPS and then use the cursor keys to scroll to the Data/Matrix Editor. Press ENTER.
Step 2: Select 1 for “Current.”
Step 3: Type your x-values into the c1 list and then type your y-values into the c2 list.
Step 4: Press F5 for Calc. A new screen will appear.
Step 5: Type your x-values into column c1 and your y-values into column c2.
Step 6: Move your cursor to the Calculation Type box, press the right-cursor key and select
“9:QuadReg.”
Step 7: Type the location of your x-data into the “x” box. For example, if your x-values are in list c1 then type
“c1.”
Step 8: Type the location of your y-data into the “y” box. For example, if your y-values are in list c2 then type
“c2.”
Step 9: Move the cursor to the Store RegEQ line and then press the right cursor key. Move the cursor to y1(x)
and then press ENTER. A window will pop up with the data for the quadratic regression equation
y = ax^2 + bx + c. The quadratic regression equation will also appear in the y1= line of the Y= screen.
This particular quadratic regression equation is -.34632 * x^2 + 2.62653 * x + 31.51190.
Find by Hand
In order to find the quadratic regression by hand, you have to solve the following system of equations:
Σy = aΣx^2 + bΣx + cn
Σxy = aΣx^3 + bΣx^2 + cΣx
Σx^2y = aΣx^4 + bΣx^3 + cΣx^2
This set of equations is sometimes called normal equations. If you are not familiar with the summation sign (Σ),
the steps below should make it clear, but if you’re still unsure you may want to read the summation notation
article for more explanation.
Sample question:
Find the quadratic equation for the following set of data (this is every other data point from the sample
calculator problem above, so the solution should be very close to -.34632 * x^2 + 2.62653 * x + 31.51190):
x: 1, 3, 5, 7, 9
y: 32.5, 37.3, 36.4, 32.4, 28.5
Step 1: Make a table (I used Excel so that the calculations would be easier). Input your x-values in the first
column and your y-values in the second column:
Step 2: Add 5 more columns labeled x^2, x^3, x^4, xy, and x^2y:
Step 3: Calculate each column. For example, the x^2 column is simply the squares of the first column; the last
column is the third column multiplied by the second column (the y-values):
Step 4: Sum the columns. As you might be able to tell, this is where Excel really helps out with the calculations:
Step 5: Use the blue row (the summations) to fill in the blanks. All you’re doing is transferring the numbers to
the normal equation (n is the number of items in the set, which is 5 in our example):
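As a worked instance of this step, the column sums for the five sample points come out to Σx = 25, Σx^2 = 165, Σx^3 = 1225, Σx^4 = 9669, Σy = 167.1, Σxy = 809.7 and Σx^2y = 5174.3 (recomputed here directly from the data). Substituting them, with n = 5, gives the system:

```latex
\begin{aligned}
167.1  &= 165a + 25b + 5c \\
809.7  &= 1225a + 165b + 25c \\
5174.3 &= 9669a + 1225b + 165c
\end{aligned}
```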
Step 6: Solve the system of equations (for example, with an online calculator):
a = -0.3660714
b = 3.015714
c = 30.42179
Step 7: Insert the values from Step 6 into the quadratic equation (I’m rounding to 3 decimal places):
y = ax^2 + bx + c
y = -0.366x^2 + 3.016x + 30.422
As we expected, that’s very close to the TI-89 solution for all 9 points
(-0.346x^2 + 2.627x + 31.512)
That’s it!
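The by-hand steps above can also be sketched in plain Python: build the column sums, set up the three normal equations, and solve them by Gauss-Jordan elimination. The function name and the choice of elimination method are assumptions made for illustration; any linear solver would do.

```python
def quadratic_fit(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))
    # Augmented matrix of the normal equations; unknowns are a, b, c.
    m = [
        [sx2, sx, n, sy],
        [sx3, sx2, sx, sxy],
        [sx4, sx3, sx2, sx2y],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(3):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return m[0][3], m[1][3], m[2][3]

# The five points from the sample question.
xs = [1, 3, 5, 7, 9]
ys = [32.5, 37.3, 36.4, 32.4, 28.5]
a, b, c = quadratic_fit(xs, ys)
print(f"a = {a:.7f}, b = {b:.6f}, c = {c:.5f}")
```

The printed coefficients should agree with the values found in Step 6.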
References
Leeuwen, J. et al. (2010). SOFSEM 2010: Theory and Practice of Computer Science: 36th Conference on
Current Trends in Theory and Practice of Computer Science, Špindleruv Mlýn, Czech Republic, January 23-29,
2010. Proceedings
------------------------------------------------------------------------------
By Stephanie | May 24, 2016 | Statistics How To |
Stepwise Regression
Stepwise regression is a way to build a model by adding or removing predictor variables, usually via a series of
F-tests or t-tests. The variables to be added or removed are chosen based on the test statistics of the estimated
coefficients. While the technique does have its benefits, it requires skill on the part of the researcher so should
be performed by people who are very familiar with statistical testing. In essence, unlike most regression models,
the models created with stepwise regression should be taken with a grain of salt; they require a keen eye to
detect whether they make sense or not.
How Stepwise Regression Works

The two ways that software will perform stepwise regression are:
• Start the test with all available predictor variables (the “Backward” method), deleting one variable
at a time as the regression model progresses. Use this method if you have a modest number of predictor
variables and you want to eliminate a few. At each step, the variable with the lowest “F-to-remove”
statistic is deleted from the model. The “F-to-remove” statistic is calculated as follows:
1. A t-statistic is calculated for the estimated coefficient of each variable in the model.
2. The t-statistic is squared, creating the “F-to-remove” statistic.
• Start the test with no predictor variables (the “Forward” method), adding one at a time as the
regression model progresses. If you have a large set of predictor variables, use this method. The “F-to-add”
statistic is created using the same steps above, except the system will calculate the statistic for each
variable not in the model. The variable with the highest “F-to-add” statistic is added to the model.
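A single step of the Backward method can be sketched as follows. This is a minimal illustration in plain Python, not how any particular statistics package implements it: it fits ordinary least squares with an intercept, squares each coefficient's t-statistic to get its “F-to-remove”, and reports the variable with the lowest value. The data, the variable names x1 and x2, and the helper functions are all hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def f_to_remove(columns, y):
    """Fit y on an intercept plus the given columns; return {name: t squared}."""
    names = list(columns)
    n, k = len(y), len(names) + 1
    X = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(beta[a] * X[i][a] for a in range(k)) for i in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    # t = beta / sqrt(s2 * inv[j][j]), so t squared = beta^2 / (s2 * inv[j][j])
    return {name: beta[j + 1] ** 2 / (s2 * inv[j + 1][j + 1])
            for j, name in enumerate(names)}

# Hypothetical data: y depends strongly on x1 and hardly at all on x2.
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [5, 3, 8, 1, 9, 2, 7, 4],
}
y = [3.2, 4.9, 6.7, 9.1, 11.1, 12.8, 15.3, 16.9]

stats = f_to_remove(columns, y)
weakest = min(stats, key=stats.get)
print(stats)
print("candidate for removal:", weakest)
```

A full backward procedure would repeat this step, refitting after each removal, until every remaining F-to-remove value exceeds a chosen threshold.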
Advantages and Disadvantages

Advantages of stepwise regression include:
• The ability to manage large numbers of potential predictor variables, fine-tuning the model to choose
the best predictor variables from the available options.
• It’s faster than other automatic model-selection methods.
• Watching the order in which variables are removed or added can provide valuable information about
the quality of the predictor variables.
Although stepwise regression is popular, many statisticians agree that it’s riddled with
problems and should not be used. Some issues include:
• Stepwise regression often has many potential predictor variables but too little data to estimate
coefficients meaningfully. Adding more data does not help much, if at all.
• If two predictor variables in the model are highly correlated, only one may make it into the model.
• R-squared values are usually too high.
• Adjusted R-squared values might be high, and then dip sharply as the model progresses. If this happens,
identify the variables that were added or removed at that point and adjust the model.
• F and chi-square test statistics listed next to output variables don’t have those distributions.
• Confidence intervals for effects and predicted values are too narrow.
• P-values are given that do not have the correct meaning.
• Regression coefficients are biased, and coefficients for the remaining variables are too high.
• Collinearity is usually a major issue. Excessive collinearity may cause the program to dump predictor
variables into the model.
• Some variables (especially dummy variables) may be removed from the model even though they are deemed
important to include. These can be manually added back in.
------------------------------------------------------------------------------
Unstandardized Coefficient
Unstandardized coefficients are ‘raw’ coefficients produced by regression analysis when the analysis is
performed on original, unstandardized variables. Unlike standardized coefficients, which are normalized unitless
coefficients, an unstandardized coefficient has units and a ‘real life’ scale.
An unstandardized coefficient represents the amount of change in a dependent variable Y due to a change of 1
unit of independent variable X.
Use of Unstandardized Coefficients in Regression

Unstandardized coefficients are usually intuitive to interpret and understand. Since they represent the relation
between raw data, they can be used directly in calculations and analysis. They can also be used to make
comparisons within the regression equation when just one measurement scale is in use. If several measurement
scales are in use, standardized coefficients are preferred for comparison (see below).
Weak Point of Unstandardized Coefficients

Unstandardized coefficients are less useful for direct comparison when the measurement scales of the
independent variables are different. In these cases a larger number may still point to a smaller effect, and to
pinpoint the effect size of variables, you may want to standardize your coefficients first.
For instance, in an analysis where you regress IQ scores on both years in college and income level, your
variables will be on completely different scales and so the unstandardized coefficients (one in IQ/$ and one in
IQ/years) can’t be compared with each other. To find out which effect is larger, you would want to
standardize the coefficients first, which means they would both be in terms of standard deviations and so
easily compared with each other.
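The conversion can be sketched in a few lines. A standardized coefficient is the unstandardized one multiplied by the ratio of standard deviations, b * s_x / s_y. The data below are hypothetical, and the example uses a single predictor measured on two scales (years and months) to show that the unstandardized slope depends on the unit while the standardized one does not.

```python
import statistics

def slope(xs, ys):
    """Unstandardized least-squares slope of y on x (units of y per unit of x)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def standardized(b, xs, ys):
    """Convert an unstandardized slope to standard-deviation units."""
    return b * statistics.stdev(xs) / statistics.stdev(ys)

# Hypothetical data: years in college and a test score.
years = [0, 1, 2, 3, 4, 5, 6, 8]
score = [98, 101, 99, 104, 103, 108, 107, 112]
months = [12 * y for y in years]  # same predictor on a different scale

b_years = slope(years, score)
b_months = slope(months, score)
beta_years = standardized(b_years, years, score)
beta_months = standardized(b_months, months, score)

print(f"unstandardized: {b_years:.3f} per year, {b_months:.3f} per month")
print(f"standardized:   {beta_years:.3f} and {beta_months:.3f}")
```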
References
Wuensch, Karl. Regression Coefficients: Unstandardized versus Standardized. Retrieved from
http://core.ecu.edu/psyc/wuenschk/MV/multReg/Standardized_Regression_Coefficients.docx on July 19, 2018.
Janda, Kenneth. Linear Regression. Lecture Notes from Elementary Statistics for Political Research, 310.
Retrieved from http://janda.org/c10/Lectures/topic04/L25-Modeling.htm on July 19, 2018
Tindall, David. Some Notes on Statistical Interpretation. Sociology 502 Lecture Notes. Retrieved from
http://faculty.arts.ubc.ca/tindall/soci502/overheads+slides/Statistics_I/Notes_on_Interpretation.pdf on July 19,
2018.
------------------------------------------------------------------------------
By Stephanie | April 24, 2019 | Statistics How To |
Assumptions and Conditions for Regression

Assumptions and Conditions for Regression.

Regression can be a very useful tool for finding patterns in data sets. However, your data can’t always be fit to a
regression line. Most software, like SPSS and Excel, will always give you the best regression line it can find
even if the regression line doesn’t make sense. It’s up to you to figure out beforehand if your data makes sense
for regression analysis. How do you do that? By considering the following assumptions and conditions for
regression before you run the test:
1. The Quantitative Data Condition.
2. The Straight Enough Condition (or “linearity”).
3. The Outlier Condition.
4. Independence of Errors
5. Homoscedasticity
6. Normality of Error Distribution
The Quantitative Data Condition / Quantitative Variables Condition.
You can only perform regression on quantitative variables. In other words, if your data isn’t a set of numbers,
regression isn’t a good method for finding a trend. Check that your variables have actual units and that they are
measuring something that makes sense.
In order to find out if your data meets the quantitative data condition, you have to make sure you have
quantitative data (numerical data) and not qualitative data. Qualitative data is data that fits into categories
(that’s why it’s also called categorical data). See: Quantitative or Qualitative: How to Classify Variables.
Categorical Variables that Masquerade as

Quantitative.
<img
content/uploads/2013/01/Four_of_a_Kind_3263015699-300x200.jpg" alt="quantitative variables condition "
width="300" height="200" class="alignleft size-medium wp-image-3331"
srcset="https://www.statisticshowto.datasciencecentral.com/wp-
content/uploads/2013/01/Four_of_a_Kind_3263015699-300x200.jpg 300w,
content/uploads/2013/01/Four_of_a_Kind_3263015699.jpg 800w" sizes="(max-width: 300px) 100vw,
300px" />Sometimes in statistics you can assign numbers to categorical variables in order to force them to
become quantitative (so you can perform computations). For example, a deck of cards is made up of quantitative
variables (the numbers on the cards) and categorical variables (the suits: hearts, diamonds, spades, clubs). You
can give the suits numbers in order to make them numeric:
 Hearts = 1
 Diamonds = 2
 Spades = 3
 Clubs = 4
However, giving numbers to categorical data does not turn them into quantitative variables; they are still
categorical variables, just ones that have been assigned numbers. Therefore, you can’t perform regression on
these coded values as if they were quantities, because they do not meet the quantitative variables condition.
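One way to see why numeric codes don’t make a variable quantitative is to fit a line to the same data under two different codings. The sketch below is only an illustration: the `scores` values are made up, the second coding is an arbitrary relabeling, and NumPy’s `polyfit` stands in for whatever regression tool you use.

```python
import numpy as np

# Hypothetical data: one made-up score per card, listed by suit.
suits = ["Hearts", "Diamonds", "Spades", "Clubs",
         "Hearts", "Diamonds", "Spades", "Clubs"]
scores = np.array([5.0, 9.0, 2.0, 7.0, 6.0, 8.0, 3.0, 6.0])

# Coding from the text, and an equally arbitrary relabeling (assumption).
coding_a = {"Hearts": 1, "Diamonds": 2, "Spades": 3, "Clubs": 4}
coding_b = {"Spades": 1, "Hearts": 2, "Clubs": 3, "Diamonds": 4}

def slope_for(coding):
    """Fit score = intercept + slope * code and return the slope."""
    x = np.array([coding[s] for s in suits], dtype=float)
    slope, _intercept = np.polyfit(x, scores, 1)
    return slope

slope_a = slope_for(coding_a)
slope_b = slope_for(coding_b)
print(f"slope with coding A: {slope_a:.2f}")
print(f"slope with coding B: {slope_b:.2f}")
```

The data never changes, yet the fitted slope depends entirely on which numbers you happened to give the suits. That is a sign the "trend" comes from the labels rather than from the data.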
The Straight Enough Condition (Assumption of Linearity).
(Linear Regression only). Regression lines will be very misleading if your data isn’t approximately linear. The
best way to check this condition is to make a scatter plot of your data. If the data looks like it can roughly fit a
line, you can perform regression. For other types of regression (like exponential regression), eyeball the scatter
plot to make sure it roughly follows the shape of whatever regression you are performing.
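If you want a rough numeric companion to eyeballing the scatter plot, you can fit a line and look at the signs of the residuals. The sketch below is an assumption-laden illustration, not a formal test: the data are made up, and counting sign runs is just one simple way to spot a curve. Residuals from curved data tend to stay on one side of the line for long stretches, so they form only a few runs.

```python
import numpy as np

x = np.arange(0, 11, dtype=float)

# Hypothetical data: a roughly straight relationship and a clearly curved one.
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.1, 0.3, -0.3, 0.2, -0.2])
y_linear = 2.0 * x + 1.0 + noise
y_curved = x ** 2

def residuals(x, y):
    """Residuals of y around the least-squares straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

def sign_runs(res):
    """Count runs of consecutive residuals that share the same sign."""
    signs = np.sign(res)
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

runs_linear = sign_runs(residuals(x, y_linear))
runs_curved = sign_runs(residuals(x, y_curved))
print("sign runs, roughly linear data:", runs_linear)
print("sign runs, curved data:", runs_curved)
```

Many short runs suggest the points scatter around the line; a handful of long runs suggests the line is cutting through a curve. The scatter plot remains the primary check.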
The Outlier Condition.
[Image: Assumptions and Conditions for Regression]
One outlier can dramatically affect your regression line.
Outliers can have a dramatic effect on regression lines and the correlation coefficient you get when you run
regression analysis. If you do have an outlier in your data, it’s a good idea to run regression analysis twice: Once
with the outlier and once without.
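Running the analysis twice can be sketched as follows. The data are made up, the last point is treated as the outlier by assumption (you would identify it from your own scatter plot), and NumPy’s `polyfit` and `corrcoef` are used here only as convenient stand-ins for your regression software.

```python
import numpy as np

# Hypothetical data: y is close to 2x, except for the last point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 40.0])

def fit(x, y):
    """Return slope, intercept and correlation coefficient for a straight-line fit."""
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, intercept, r

# Run 1: all points. Run 2: the suspected outlier (last point) dropped.
slope_all, intercept_all, r_all = fit(x, y)
slope_trim, intercept_trim, r_trim = fit(x[:-1], y[:-1])

print(f"with outlier:    slope={slope_all:.2f}, r={r_all:.3f}")
print(f"without outlier: slope={slope_trim:.2f}, r={r_trim:.3f}")
```

Comparing the two runs shows how much of the slope and the correlation coefficient is driven by that single point.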
Independence of Errors
If your points are following a clear pattern, it might indicate that the errors are influencing each other. The errors
are the deviations of an observed value from the true function value. The following image shows two linear
regression lines; on the left, the points are scattered randomly. On the right, the points are clearly influencing
each other.
[Image: Independence of Errors]
If your errors aren’t random (independent of each other), linear regression isn’t appropriate, as your predictions won’t be accurate.
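One simple way to put a number on "errors influencing each other" is the lag-1 autocorrelation of the residuals: the correlation between each residual and the one before it. This check is not part of the original article; it is a minimal sketch on simulated data, and it assumes your observations have a natural order (here, increasing x).

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Correlation between each residual and the previous one."""
    r = residuals - residuals.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r * r))

def residuals_of(x, y):
    """Residuals of y around the least-squares straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)

# Simulated data: same line, once with random errors, once with a wave pattern.
y_random = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)
y_pattern = 3.0 + 2.0 * x + np.sin(x)

ac_random = lag1_autocorrelation(residuals_of(x, y_random))
ac_pattern = lag1_autocorrelation(residuals_of(x, y_pattern))
print(f"lag-1 autocorrelation, random errors:    {ac_random:.2f}")
print(f"lag-1 autocorrelation, patterned errors: {ac_pattern:.2f}")
```

A value near 0 is what you would expect from randomly scattered errors; a value near 1 means neighboring errors move together, as in the right-hand picture.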
Homoscedasticity
Homoscedasticity means the errors have roughly the same spread all along the regression line: you basically want
your points to look like a tube instead of a cone. With heteroscedasticity, as with a lack of independence, you see
a trend in the errors, but this time the trend is in their size: the errors get larger or smaller along the line (as
opposed to clearly influencing each other). In the picture below, the left graph shows a linear
regression line where the errors are getting larger. The shape is cone-like.
[Image: Heteroscedasticity]
Running linear regression on data that shows heteroscedasticity will give you poor results.
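A rough way to check for the tube versus the cone is to compare the spread of the residuals in the lower half of the x-values with the spread in the upper half. The sketch below uses simulated data and a simple ratio of standard deviations. The halfway split and the ratio are assumptions chosen for illustration, not a formal test.

```python
import numpy as np

def spread_ratio(x, y):
    """Std. dev. of residuals for the upper half of x divided by the lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    res = y - (slope * x + intercept)
    order = np.argsort(x)
    half = len(x) // 2
    lower, upper = res[order[:half]], res[order[half:]]
    return float(upper.std() / lower.std())

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 400)
noise = rng.normal(0.0, 1.0, size=x.size)

# Simulated data: constant error size (tube) vs. error size growing with x (cone).
y_tube = 2.0 * x + noise
y_cone = 2.0 * x + x * noise

ratio_tube = spread_ratio(x, y_tube)
ratio_cone = spread_ratio(x, y_cone)
print(f"spread ratio, tube: {ratio_tube:.2f}")
print(f"spread ratio, cone: {ratio_cone:.2f}")
```

A ratio close to 1 is consistent with a tube; a ratio well above (or below) 1 points to a cone.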
Normality of Error Distribution
At any point in your x-values, the data points should be normally distributed around the regression line. Your
values should be fairly close to the line, evenly distributed with only a few outliers. The following image shows
data that is fairly normally distributed on the left. The data on the right is either clustered tightly around the line
or far from the line. Linear regression should not be run when the errors are not normally distributed.
[Image: Normality of Error Distribution]
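As a quick sanity check on normality, you can standardize the residuals and see what share falls within one and two standard deviations; for normally distributed errors, roughly 68% and 95% are expected. The sketch below uses simulated data. The "clustered or far" errors are an assumed mixture chosen to mimic the right-hand picture, and this rule of thumb is not a replacement for a proper normality test or plot.

```python
import numpy as np

def share_within(residuals, k):
    """Share of standardized residuals within k standard deviations of the mean."""
    z = (residuals - residuals.mean()) / residuals.std()
    return float(np.mean(np.abs(z) <= k))

def residuals_of(x, y):
    """Residuals of y around the least-squares straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

rng = np.random.default_rng(1)
n = 2000
x = np.linspace(0.0, 10.0, n)

# Simulated errors: normal, and a mix of tiny errors with a few very large ones.
normal_errors = rng.normal(0.0, 1.0, size=n)
is_far = rng.random(n) < 0.1
mixed_errors = np.where(is_far, rng.normal(0.0, 5.0, size=n),
                        rng.normal(0.0, 0.1, size=n))

res_normal = residuals_of(x, 1.0 + 2.0 * x + normal_errors)
res_mixed = residuals_of(x, 1.0 + 2.0 * x + mixed_errors)

normal_1sd = share_within(res_normal, 1)
normal_2sd = share_within(res_normal, 2)
mixed_1sd = share_within(res_mixed, 1)
print(f"normal errors: {normal_1sd:.0%} within 1 SD, {normal_2sd:.0%} within 2 SD")
print(f"mixed errors:  {mixed_1sd:.0%} within 1 SD")
```

Shares far from 68% and 95% suggest the points are bunching up around the line or straying far from it more than a normal distribution would allow.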
------------------------------------------------------------------------------
By Stephanie | February 10, 2014 | Statistics How To |