
BA 216 Lecture 1 Notes

Stats Overview
● Statistics is a key part of the general process of investigation:
● 1. Identify a question or problem.

Statistics asks three questions:


● How best can we collect data?
● How should it be analyzed?
● And what can we infer from the analysis?

Motivation for the course, through case study


● We’ll start with a classic challenge in statistics: evaluating the
efficacy of a medical treatment.
● Research Question: Does the use of Wingspan Stents reduce the
risk of stroke?

● The researchers who asked this question collected data on 451
at-risk patients. Each volunteer patient was randomly assigned to
one of two groups:

○ Treatment group. Patients in the treatment group received a
stent and medical management. The medical management
included medications, management of risk factors, and help
in lifestyle modification.

○ Control group. Patients in the control group received the
same medical management as the treatment group but did
not receive stents.
● Researchers randomly assigned 224 patients to the treatment group and 227 to
the control group

● The results of 5 patients are summarized in Table 1.1

● Considering data from each patient individually would be a long, cumbersome
path towards answering the original research question. Instead, a statistical
analysis allows us to consider all of the data at once (see next slide).

● Table 1.2 summarizes the raw data in a more helpful way.

● We can compute summary statistics from the table. A summary statistic is a
single number summarizing a large amount of data.

Example Summary Statistics for Stent Example:


● Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%
● Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%

Surprising result: the stroke rate in the treatment group was 8 percentage points higher than in the control group!
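
The arithmetic behind these summary statistics can be reproduced in a few lines of Python. This is only a minimal sketch: the counts come from the stent example above, while the variable names are illustrative and not part of the lecture.

```python
# Counts taken from the stent example above.
stroke_treatment, n_treatment = 45, 224
stroke_control, n_control = 28, 227

# A proportion is the number of strokes divided by the group size.
p_treatment = stroke_treatment / n_treatment
p_control = stroke_control / n_control

# Difference in proportions (treatment minus control).
diff = p_treatment - p_control

print(f"Treatment group: {p_treatment:.2f}")
print(f"Control group:   {p_control:.2f}")
print(f"Difference:      {diff:.2f}")
```

Rounded to two decimals, this prints 0.20, 0.12, and 0.08, matching the percentages above.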


Why is it interesting that patients receiving stents appeared to have more
strokes?
● This is contrary to what doctors expected, which was that stents would reduce
the rate of strokes.
● It leads to a statistical question: do the data show a “real” difference due to the
treatment?

What is a real difference? When can we be confident in results?


● Is the 8% difference small enough that it’s conceivable the difference happened
due to random chance? Or, should doctors be concerned with this particular kind
of stent?

● Suppose you flip a coin 100 times. While the chance a coin lands heads in any
given coin flip is 50%, we probably won't observe exactly 50 heads due to natural
variation.
○ Would you be surprised if you flip a coin 100 times and get 55 heads?
○ What if you flip it 100 times and get 80 heads?
○ What if you flip it 5 times and get 4 heads?
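
One way to put numbers on these three questions is to compute, for a fair coin, the chance of getting at least that many heads. The sketch below assumes independent flips with a 50% chance of heads and uses the binomial formula; the function name `prob_at_least` is an illustrative choice, not something from the lecture.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more heads in n independent flips,
    where each flip lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# The three scenarios from the questions above: (heads, flips).
for heads, flips in [(55, 100), (80, 100), (4, 5)]:
    prob = prob_at_least(heads, flips)
    print(f"P(at least {heads} heads in {flips} flips) = {prob:.2e}")
```

Under these assumptions, 4 or more heads in 5 flips has probability 6/32 = 0.1875, and 55 or more heads in 100 flips is about as likely (roughly 0.18), so neither is very surprising. Getting 80 or more heads in 100 flips is extremely unlikely for a fair coin.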

In this course, we’ll learn how to judge with more confidence whether a result is “real” for
the whole population, not just the sample.

● The larger the difference we observe (for a particular sample size), the less
believable it is that the difference is due to chance.

● So what we are really asking is the following: is the difference so large that we
should reject the notion that it was due to chance?

● We haven't yet covered statistical tools to fully address this question – that’s what
this whole course is going to be about!
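
As a preview only, here is one possible way to explore the "due to chance" question by simulation. The lecture does not prescribe a method at this point, so treat this as a sketch under stated assumptions: "chance alone" is modeled by shuffling the 73 observed strokes (45 + 28) across the 451 patients and re-splitting them into groups of 224 and 227. The function name, the number of simulations, and the seed are all arbitrary illustrative choices.

```python
import random


def simulate_chance_differences(n_sims=2000, seed=1):
    """Shuffle the stroke outcomes across patients and record the
    difference in stroke proportions (treatment minus control) each time."""
    rng = random.Random(seed)
    # 1 = stroke, 0 = no stroke; 73 strokes among 451 patients in total.
    outcomes = [1] * 73 + [0] * (451 - 73)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(outcomes)
        treatment = outcomes[:224]  # first 224 patients act as "treatment"
        control = outcomes[224:]    # remaining 227 act as "control"
        diffs.append(sum(treatment) / 224 - sum(control) / 227)
    return diffs


observed = 45 / 224 - 28 / 227
diffs = simulate_chance_differences()

# Share of shuffles whose difference is at least as large as the observed one.
share = sum(d >= observed for d in diffs) / len(diffs)
print(f"Observed difference: {observed:.3f}")
print(f"Share of shuffles at least that large: {share:.3f}")
```

If only a small share of the shuffles produce a difference as large as the observed one, chance alone becomes a less believable explanation.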
Introduction to data structure
● A survey was conducted on students in an introductory statistics course. Below
are a few of the questions on the survey, and the variables in which the
responses were stored:

Introduction to data matrix, observations, and variables


● The format shown below is extremely common in data analysis and statistics
– called a data matrix.
● Each row in the table represents a single student in the course (sometimes called
an “observation”).
● The columns represent characteristics, called variables, for each of the students.
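
A data matrix can be sketched in Python as a list of rows, one dictionary per student. The variable names below are the ones used in the survey exercise later in these notes; the response values and category labels are invented purely for illustration and are not real survey data.

```python
# HYPOTHETICAL responses: every value below is made up for illustration.
data_matrix = [
    {"gender": "female", "sleep": 7.5, "bedtime": "10pm-12am", "countries": 3, "dread": 2},
    {"gender": "male", "sleep": 6.0, "bedtime": "after 2am", "countries": 0, "dread": 4},
    {"gender": "female", "sleep": 8.0, "bedtime": "8pm-10pm", "countries": 7, "dread": 1},
]

# Each row is one observation (one student).
n_observations = len(data_matrix)

# Each key is one variable (one column of the data matrix).
variables = list(data_matrix[0].keys())

# Pulling out a single column gives all values of one variable.
sleep_column = [row["sleep"] for row in data_matrix]

print(f"{n_observations} observations, variables: {variables}")
print(f"sleep column: {sleep_column}")
```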

Data Matrix:
Variables come in different types
● While deceptively simple, these concepts are extremely important for choosing
the right analytical techniques when trying to answer a research or business
question!

Note*: Discrete is something you count individually whereas continuous is something
you measure

Numerical vs. Categorical variables:


● Numerical variables take number values that can be counted or measured (like
height, age, distance) and can be either continuous or discrete (explained below).
● Categorical variables, in contrast, are “bins” or “types” (like gender
categories).
Types of variables

Numerical discrete vs Numerical continuous variables:


● Numerical discrete variables are counts of individual items or values (e.g.
number of students, or species, or days that an event occurred, etc.)
● Numerical continuous variables are measurements that can take any value
within a range (age, volume, height).
● Hint: you count discrete variables, but measure continuous variables.
Types of variables

Categorical binary variables (aka dichotomous variables): Yes/no outcomes.
● Heads/tails in a coin flip
● Win/lose in a football game

Regular categorical variables (also called nominal variables): Groups with no rank or order between them.
● Species names
● Colors
● Brands

Categorical ordinal variables: Groups that are ranked in a specific order.
● Finishing place in a race
● Rating scale responses in a survey*

Note*: Binary can also be two options with opposite answers

Ex:
● High or low?
● Light or dark?
● More or less?

Note*: Nominal is generally classified as “types”

Ex:
● What is your gender?
Types of variables (Exercise)

Gender:
Sleep:
Bedtime:
Countries:
Dread:
Solution:

Gender: categorical, nominal
Sleep: numerical, continuous
Bedtime: categorical, ordinal
Countries: numerical, discrete
Dread: categorical, ordinal
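
The solution above can also be recorded as a small lookup table, which makes it easy to group the variables by type. This is just an illustrative sketch; the dictionary layout and the names `variable_types` and `by_kind` are assumptions, not part of the exercise.

```python
from collections import defaultdict

# Variable name -> (main type, subtype), copied from the solution above.
variable_types = {
    "gender": ("categorical", "nominal"),
    "sleep": ("numerical", "continuous"),
    "bedtime": ("categorical", "ordinal"),
    "countries": ("numerical", "discrete"),
    "dread": ("categorical", "ordinal"),
}

# Group the variable names by their main type.
by_kind = defaultdict(list)
for name, (kind, subtype) in variable_types.items():
    by_kind[kind].append(name)

for kind, names in by_kind.items():
    print(f"{kind}: {', '.join(names)}")
```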
More ways to describe variables: relationships between them

Note*: The explanatory variable might affect the response variable.

● Explanatory variables are usually graphed on the x-axis, and response variables
are usually graphed on the y-axis.
● We name these variables based on the hypothesized relationship between the
two variables, that is, on our best guess as to what might be affecting
what. BUT:
● Labeling variables as explanatory and response does not guarantee the
relationship between the two is actually causal, even if there is an association (a
“correlation”) identified between the two variables.
● We use these labels only to keep track of which variable we suspect affects the
other.
● When we suspect that two variables show some kind of connection with one
another, they are called associated variables.
○ Note*: Associated variables can also be called dependent variables and
vice-versa.
○ Note*: No pair of variables is both associated and independent.

Types of variables

Note*: Pay attention to the trend of the data, DO NOT jump to conclusions quickly

Solution:
Ans: B
Reason: We can’t assume that one variable causes a change in the other, because we do
not know the full story behind the data trend. We can only describe what the data
show, so we MUST report the trend itself (positive, negative, or no correlation)
INSTEAD of making any kind of inference or assumption.

Seeing evidence of a correlation between two variables DOES NOT NECESSARILY
MEAN that there is a causal relationship.

● When we suspect that two variables show some kind of connection with one
another, they are called associated variables.
● We can even say that two variables appear to be positively associated, or
negatively associated.
● But, without MUCH more rigorous work, we cannot confidently say that one
variable is causing the changes in the second variable.
● Correlation does not (always) equal causation!
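
To make "positively associated" and "negatively associated" concrete, the sketch below computes a correlation coefficient for two made-up pairs of variables. The lecture does not name a specific measure, so the use of the Pearson correlation coefficient here is an assumption, and all data values and variable names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL data, invented for illustration only.
hours_studied = [1, 2, 3, 4, 5]
hours_of_tv = [5, 4, 4, 2, 1]
exam_score = [52, 60, 61, 70, 78]

r_pos = pearson_r(hours_studied, exam_score)  # scores rise with study time
r_neg = pearson_r(hours_of_tv, exam_score)    # scores fall with TV time

print(f"studied vs score: r = {r_pos:.2f}")
print(f"tv vs score:      r = {r_neg:.2f}")
```

A positive value signals a positive association and a negative value a negative one. Neither value says anything about whether one variable causes the other.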

Note*: When a statistics problem states a predetermined number of something,
DO NOT classify it as a discrete variable; instead, it should be classified as a
corresponding level.

Corresponding level: a number or measurement of something that is already determined
and stated in the problem (NO NEED TO COUNT OR MEASURE THESE, AND IT IS NOT A
VARIABLE)

Example Question:
Exercise 1 - Fisher's irises (3 points): Sir Ronald Aylmer Fisher was an English
statistician, evolutionary biologist, and geneticist who worked on a data set that
contained sepal length and width, and petal length and width from three species of iris
flowers (setosa, versicolor and virginica). There were 50 flowers from each species in
the data set.

Because the number of flowers for each species (50) is already determined in this
problem, it is not a variable. The corresponding levels are therefore:
- Setosa
- Versicolor
- Virginica
Note*: DO NOT list a variable as numerical if the question measures it in ranks
rather than in numbers.

Example:
In a study of the relationship between socioeconomic status and unethical behavior, 129
University of California undergraduates at Berkeley were asked to identify themselves
as having low or high social-class by comparing themselves to others with the most
(and least) money, most (and least) education, and most (and least) respected jobs.
They were also presented with a jar of individually wrapped candies and informed that
the candies were for children in a nearby laboratory, but that they could take "some" if
they wanted. After completing some unrelated tasks, participants reported the number
of candies they had taken. The study found that students who were identified as
upper-class took more candy than others.

Try this problem yourself!


Answer:
List of Variables:
- Social Class (low to high) - categorical - ordinal
- Money (most to least) - categorical - ordinal
- Education (most to least) - categorical - ordinal
- Respected jobs - categorical - ordinal
- Number of candies taken - numerical - discrete
