Chapter 15
Objective:
- Learn how to apply statistical methods in order to predict one variable from other variables.
- Distinguish between the ability to predict one variable from others and the issue of whether
changes in one variable are caused by changes in others.
Regression Lines
A regression line is a straight line that describes how a response variable y changes as an explanatory
variable x changes. We often use a regression line to predict the value of y for a given value of x.
Ex.1
Ex.2
Regression Equations
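The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical
distances of the data points from the line as small as possible.
y = a + bx, where a is the intercept (the predicted y when x = 0) and b is the slope (the predicted
change in y for each one-unit increase in x).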
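As a rough illustration of the definition above, the following Python sketch computes the intercept a and slope b of the least-squares line using the standard formulas b = Sxy / Sxx and a = mean(y) - b * mean(x). The function names and the five data points are invented for this sketch; they are not taken from the chapter's examples.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of
    squared vertical distances from the data points to the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy: how x and y vary together; Sxx: how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = least_squares_line(xs, ys)
prediction = a + b * 3.5  # predicted y for x = 3.5, inside the observed x-range
print(f"y = {a:.2f} + {b:.2f}x, prediction at x = 3.5: {prediction:.2f}")
```

Any other line through these points, for example one with a slightly different slope, gives a larger value of sum_squared_errors than the fitted line does.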
Ex.3
Understanding Prediction
E.g., as part of the college admissions process, a college might use SAT Math and Verbal scores and high
school grades in math, English, and science (five explanatory variables) to predict first-year college grades.
- All statistical methods of predicting a response share some basic properties of least-squares regression
lines.
- Prediction is based on fitting some “model” to a set of data. In 15.1 and 15.2 our model is a
straight line that we draw through the points in a scatterplot; other prediction methods use
more elaborate models.
- Prediction works best when the model fits the data closely. In 15.1 the data closely follow a
line; in 15.2 they do not. Prediction is more trustworthy in 15.1 because the pattern is
stronger. When the pattern is hard to see, or when there are many variables and no strong pattern,
prediction may be inaccurate.
- Prediction outside the range of the available data is risky. Such prediction is referred to as
extrapolation
o (making predictions based on data that you do not have).
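One way to keep the extrapolation warning in view is to have a prediction helper report whether the requested x lies inside the range of the data the line was fitted to. The sketch below is only an illustration: the intercept, slope, and observed range are hypothetical numbers, and the helper name predict is made up for this example.

```python
def predict(a, b, x, x_min, x_max):
    """Predict y = a + b*x and report whether x lies inside the observed
    range [x_min, x_max]. A prediction with inside == False is an
    extrapolation and should be treated with caution."""
    inside = x_min <= x <= x_max
    return a + b * x, inside


# Hypothetical fitted line and observed x-range, invented for this sketch.
a, b = 0.09, 1.97
x_min, x_max = 1.0, 5.0

y_in, inside_in = predict(a, b, 3.5, x_min, x_max)     # within the data
y_out, inside_out = predict(a, b, 50.0, x_min, x_max)  # far outside the data

for x, y, inside in [(3.5, y_in, inside_in), (50.0, y_out, inside_out)]:
    label = "within observed range" if inside else "EXTRAPOLATION"
    print(f"x = {x}: predicted y = {y:.2f} ({label})")
```

The arithmetic is the same in both cases; the flag only records that the second prediction rests on data that were never observed.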
Ex.4
Correlation and Regression
The square of the correlation, r², is the proportion of the variation in the values of y that is explained by
the least-squares regression of y on x.
The usefulness of the regression line for prediction depends on the strength of the association.
Ex.5
In reporting regression, it is usual to give r² as a measure of how successful the regression was in
explaining the response.
- When you see a correlation, square it to get a better feel for the strength of the association. Perfect
correlation means the points lie exactly on a line.
- E.g., r = 1 gives r² = 1, and all the variation in one variable is accounted for by the straight-line
relationship with the other variable.
- E.g., r = -0.7 or 0.7 gives r² = 0.49, and about half the variation is accounted for by the straight-line
relationship. (On the r² scale, a correlation of ±0.7 is about halfway between 0 and ±1.)
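The claim that r² is the proportion of variation in y explained by the regression can be checked numerically. The sketch below computes r directly, then separately computes 1 - (variation left around the fitted line) / (total variation in y), and compares the two. The data and function names are invented for illustration.

```python
import math


def correlation(xs, ys):
    """Correlation r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_explained(xs, ys):
    """Proportion of the variation in y explained by the least-squares
    line of y on x: 1 - (squared error around the line) / (total variation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total


# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

r = correlation(xs, ys)
r_flipped = correlation(xs, [-y for y in ys])  # same strength, opposite sign
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, explained = {fraction_explained(xs, ys):.3f}")
print(f"flipped r = {r_flipped:.3f}, r^2 = {r_flipped ** 2:.3f}")
```

Flipping the sign of y flips the sign of r but leaves r² unchanged, which is why r = -0.7 and r = 0.7 describe equally strong straight-line relationships.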
The Question of Causation
1. A strong relationship between two variables does not always mean that changes in one variable
cause changes in the other.
2. The relationship between two variables is often influenced by other variables lurking in the
background.
3. The best evidence for causation comes from randomized comparative experiments.
Ex.6
Ex.7
Ex.8
4. The observed relationship between two variables may be due to direct causation, common
response, or confounding. Two or more of these factors may be present together.
5. An observed relationship can, however, be used for prediction without worrying about
causation as long as the patterns found in past data continue to hold true.
Evidence for causation when an experiment is not possible (e.g., smoking and lung cancer):
- The association is strong; the association between smoking and lung cancer is very strong.
- The association is consistent; many studies of different kinds of people in many countries link
smoking to lung cancer. (This reduces the chance that a lurking variable explains the association.)
- Higher doses are associated with stronger responses; people who smoke more get lung cancer more
often.
- The alleged cause precedes the effect in time; lung cancer develops after years of smoking, and the
number of men dying from lung cancer rose as smoking became more common, with a lag of about 30
years.
- The alleged cause is plausible; experiments with animals show that tars from cigarette smoke do
cause cancer.
Identify common myths concerning chance behavior, such as misinterpretations of the law of averages
and the notion of the “hot hand” in sports.
Chance behavior is unpredictable in the short run but has a regular and predictable distribution of
outcomes in the long run.
- E.g., when tossing a coin, the result cannot be predicted in advance because it will vary from toss
to toss, but there is still a regular pattern in the results.
Ex.1
- Random: in statistics, a description of events that are unpredictable in the short run but that
exhibit a kind of order that emerges only in the long run.
- Haphazard: lacking any principle of organization.
We call a phenomenon random if individual outcomes are uncertain but there is, nonetheless, a regular
distribution of outcomes in a large number of repetitions.
The probability of any outcome of a random phenomenon is a number between 0 and 1 that describes
the proportion of times the outcome would occur in a very long series of repetitions.
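The long-run idea of probability can be illustrated by simulation. The sketch below tosses a simulated fair coin and records the proportion of heads at a few checkpoints. The fair coin (probability 0.5), the seed, and the checkpoint sizes are all assumptions chosen for this example; a pseudorandom generator only stands in for a real coin.

```python
import random


def running_proportion_of_heads(n_tosses, checkpoints, seed=0):
    """Simulate n_tosses of a fair coin and return a dict mapping each
    checkpoint (a number of tosses) to the proportion of heads so far."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wanted = set(checkpoints)
    heads = 0
    proportions = {}
    for toss in range(1, n_tosses + 1):
        if rng.random() < 0.5:  # assumed fair coin
            heads += 1
        if toss in wanted:
            proportions[toss] = heads / toss
    return proportions


checkpoints = [10, 100, 1_000, 10_000, 100_000]
proportions = running_proportion_of_heads(100_000, checkpoints)
for n in checkpoints:
    print(f"after {n:>7} tosses: proportion of heads = {proportions[n]:.4f}")
```

With few tosses the proportion can sit well away from 0.5; as the number of tosses grows it tends to settle close to 0.5, which is the sense in which the probability describes a very long series of repetitions.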
Ex.2
The myth of short-run regularity: The idea of probability is that randomness is regular in the long run.
- Our intuition also says that random phenomena should be regular in the short run; when they are not,
we look for some explanation other than chance variation.
Ex.4
Ex.5
Ex.6
Ex.7
The myth of the law of averages: if you have been winning too much, you are due to lose. “The law of
averages says that you must lose now so that the wins and losses will balance out.”
Ex.8
- No. When trials are independent, knowing the outcome of one trial does not change the probabilities
for the outcomes of any other trials.
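A simulation can make the point about the “law of averages” concrete. The sketch below generates a long sequence of simulated fair coin tosses, finds every toss that immediately follows a run of three heads, and computes the proportion of those follow-up tosses that are heads. The streak length, seed, and number of tosses are arbitrary choices for this illustration.

```python
import random


def heads_rate_after_streak(n_tosses, streak_len=3, seed=1):
    """Proportion of heads among tosses that come right after a run of
    streak_len consecutive heads, in a simulated fair-coin sequence."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tosses = [rng.random() < 0.5 for _ in range(n_tosses)]  # True = heads
    followups = [
        tosses[i]
        for i in range(streak_len, n_tosses)
        if all(tosses[i - streak_len:i])  # previous streak_len tosses all heads
    ]
    return sum(followups) / len(followups)


rate = heads_rate_after_streak(200_000)
print(f"proportion of heads right after three heads in a row: {rate:.3f}")
```

If a run of heads made tails “due”, this proportion would fall clearly below 0.5. In the simulation it stays close to 0.5, consistent with independent tosses having no memory.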
Personal Probabilities
Probability is based on data about many repetitions of the same random phenomenon
Describe random phenomena by listing possible outcomes and their associated probabilities
Understand the relation between the odds of an event and the probability of an event
Probability Models
A probability model for a random phenomenon describes all the possible outcomes and says how to
assign probabilities to any collection of outcomes. We sometimes call a collection of outcomes an
event.
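A small sketch of a probability model, assuming a fair six-sided die as the random phenomenon (the die, the event names, and the helper function are illustrative choices, not taken from the chapter). The model lists every outcome with its probability, and the probability of an event is found by adding the probabilities of the outcomes it contains, which is the usual rule when the list of outcomes is finite.

```python
from fractions import Fraction

# Assumed model: a fair six-sided die, each face with probability 1/6.
model = {face: Fraction(1, 6) for face in range(1, 7)}


def probability(event, model):
    """Probability of an event (a collection of outcomes): the sum of the
    probabilities of the individual outcomes in the event."""
    return sum(model[outcome] for outcome in event)


even = {2, 4, 6}
at_least_five = {5, 6}

p_even = probability(even, model)
p_at_least_five = probability(at_least_five, model)
p_all = probability(model.keys(), model)  # all outcomes together

print(f"P(even) = {p_even}, P(at least 5) = {p_at_least_five}, P(all) = {p_all}")
```

Using Fraction keeps the arithmetic exact, so the probabilities of all outcomes add up to exactly 1 rather than to a rounded decimal.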
Probability Rules