Statistics Notes Exam 1

§ Statistics is the science of collecting, organizing, summarizing,
and analyzing information to draw conclusions or answer
questions. In addition, statistics is about providing a measure of
confidence in any conclusions.

§ Population The entire group of individuals to be studied.

§ Individual A person or object that is a member of the population
being studied.

§ Sample A subset of the population that is being studied.

§ Statistic A numerical summary of a sample.

§ Descriptive statistics consists of organizing and summarizing data.

Descriptive statistics describe data through numerical summaries,
tables, and graphs.

§ Inferential statistics uses methods that take a result from a sample,
extend it to the population, and measure the reliability of the
result.

§ Parameter A numerical summary of a population.

§ Variables The characteristics of the individuals within the
population.

Types of variables.

§ Qualitative variables Allow for classification of individuals based
on some attribute or characteristic. (categorical variables)
§ Quantitative variables Provide numerical measures of individuals.
The values of a quantitative variable can be added or subtracted
and provide meaningful results.
§ Lurking variable An explanatory variable that was not considered
in a study, but that affects the value of the response variable in the
study. In addition, lurking variables are typically related to
explanatory variables considered in the study.
§ Explanatory variable A variable whose value is thought to impact
the value of a response variable.
§ Response variable The variable of interest in the outcome of a
study.
The type of variable dictates the method that can be used to analyze
the data.

§ Discrete variable A quantitative variable that has either a finite
number of possible values or a countable number of possible
values. The term countable means that the values result from
counting, such as 0, 1, 2, 3, and so on. A discrete variable cannot
take on every possible value between any two possible values.

§ Continuous variable A quantitative variable that has an infinite
number of possible values that are not countable. A continuous
variable may take on every possible value between any two
values.

§ Data The list of observed values for a variable. The information
that we use to draw a conclusion or make a decision. Data are
individual pieces of information that describe characteristics of an
individual.

§ Quantitative data Observations corresponding to a quantitative
variable.

§ Qualitative data Observations corresponding to a qualitative
variable.

§ Discrete data Observations corresponding to a discrete variable.

§ Continuous data Observations corresponding to a continuous
variable.

The levels of measurement progress from simplest to most
complicated (nominal, ordinal, interval, ratio).

Nominal level of measurement - data that consist of names, labels,
or categories only; they cannot be meaningfully ordered.
example: eye color of students in statistics class, jersey numbers of
SLCC basketball team members
§ short version: "categories only"
Ordinal level of measurement - data that can be arranged in some
meaningful order, but differences are meaningless.
example: letter grades of students in a statistics class
§ short version: "categories with order"
Interval level of measurement - differences are meaningful, but no
natural zero starting point (where none of the quantity is present).
example: birth years of students in a statistics class (the year zero
does not mean no time; it is just a point on the scale)
§ short version: "differences, but no natural zero"
Ratio level of measurement - all of the above, plus a natural zero,
therefore ratios are meaningful.
example: incomes of professional statisticians
§ short version: "differences and a natural zero"
§ Observational study A study that measures the value of the
response variable without attempting to influence the value of
either the response or explanatory variables. That is, in an
observational study, the researcher observes the behavior of
individuals without trying to influence the outcome of the study,
and simply records the values of the explanatory and response
variables.
§ Designed experiment If a researcher assigns the individuals in a
study to a certain group, intentionally changes the value of an
explanatory variable, and then records the value of the response
variable for each group, the study is a designed experiment.
§ Confounding in a study occurs when the effects of two or more
explanatory variables are not separated. Therefore, any relation
that may exist between an explanatory variable and the response
variable may be due to some other variable not accounted for in
the study.
§ Retrospective study A study that requires individuals to look back
in time or requires the researcher to look at existing records.
§ Prospective study A study in which the data are collected over
time.
§ Census A list of individuals in a population along with certain
characteristics of each individual.
§ Random sampling the process of using chance to select
individuals from a population to be included in the sample.
§ Simple Random Sample A sample of size n from a population of
size N is obtained through simple random sampling if every
possible sample of size n has an equally likely chance of occurring. The
sample is then called a simple random sample.
§ Frame A list of all the individuals within the population.
§ Sample without replacement an individual who is selected is
removed from the population and cannot be chosen again.
§ Sample with replacement A selected individual is placed back into
the population and could be chosen a second time.
§ Seed An initial point for a random number generator to start
creating numbers.
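
The sampling terms above can be sketched with Python's standard random module. This is only an illustration: the frame of 30 ID numbers, the sample size of 5, and the seed value 2024 are all made-up assumptions.

```python
import random

# Hypothetical frame: ID numbers 1..30 standing in for the individuals in a population.
frame = list(range(1, 31))

# The seed (2024 is an arbitrary choice) fixes the generator's starting point.
rng = random.Random(2024)

# Sampling without replacement: no individual can be chosen twice.
srs = rng.sample(frame, k=5)

# Sampling with replacement: the same individual may appear more than once.
with_replacement = rng.choices(frame, k=5)

# Starting again from the same seed reproduces the same simple random sample.
srs_again = random.Random(2024).sample(frame, k=5)

print(srs, with_replacement, srs_again == srs)
```

Because every subset of size 5 is equally likely under sample(), the result is a simple random sample of the frame.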
§ Stratified sample A sample obtained by dividing the population
into nonoverlapping groups called strata and then obtaining a
simple random sample from each stratum. The individuals within
each stratum should be homogeneous (similar) in some way.
§ Systematic sample A sample obtained by selecting every kth
individual from the population. The first individual selected
corresponds to a number between 1 and k.

§ Cluster sample A sample obtained by selecting all individuals
within a randomly selected collection or group of individuals.
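
A rough sketch of how stratified, systematic, and cluster sampling differ, under made-up assumptions: the population of 12 students, the function names, and the seed are all hypothetical. For brevity the same grouping (class year) plays the role of both strata and clusters; in practice they would usually be chosen differently.

```python
import random

rng = random.Random(1)  # arbitrary seed

# Hypothetical population: 12 students labeled by (class_year, id).
population = [(year, i) for year in ("first", "second", "third") for i in range(1, 5)]

def stratified_sample(pop, per_stratum):
    """Simple random sample of size per_stratum from each stratum (here, class year)."""
    strata = {}
    for year, i in pop:
        strata.setdefault(year, []).append((year, i))
    return [unit for group in strata.values() for unit in rng.sample(group, per_stratum)]

def systematic_sample(pop, k):
    """Every kth individual, starting at a random position among the first k."""
    start = rng.randrange(k)  # 0-based index, i.e. individual number 1..k
    return pop[start::k]

def cluster_sample(pop, n_clusters):
    """All individuals in n_clusters randomly selected clusters (here, class years)."""
    clusters = sorted({year for year, _ in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in pop if unit[0] in chosen]

strat = stratified_sample(population, 2)  # 2 students from each of the 3 years
syst = systematic_sample(population, 4)   # every 4th student
clus = cluster_sample(population, 1)      # everyone in one randomly chosen year
```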

§ Cross-sectional studies are observational studies that collect
information about individuals at a specific point in time or over a
very short period of time.

§ Case-control studies are observational studies that
are retrospective, meaning that they require individuals to look
back in time or require the researcher to look at existing records.

§ Superior observational study Neither type of study is always
superior to the other. Both have advantages and disadvantages that
depend on the situation.
§ A confounding variable is an explanatory variable that was
considered in a study whose effect cannot be distinguished from a
second explanatory variable in the study.
§ Convenience sample A sample in which the individuals are easily
obtained and not based on randomness.
§ Self-selected survey A survey in which the individuals themselves
decide to participate in the survey.
§ Voluntary response Another phrase used to describe a self-selected
survey.
§ Bias When the results of the sample are not representative of the
population.
§ Sampling bias The technique used to obtain the sample's
individuals tends to favor one part of the population over
another.
§ Undercoverage occurs in a sample when the proportion of one
segment of the population is lower in a sample than it is in the
population.
§ Nonresponse bias Nonresponse bias exists when individuals
selected to be in the sample who do not respond to the survey have
different opinions from those who do.
§ Response bias Response bias exists when the answers on a survey
do not reflect the true feelings of the respondent.
§ Open question an open question allows the respondent to choose
his or her response.
§ Closed question A closed question requires the respondent to
choose from a list of predetermined responses.
§ Nonsampling errors Nonsampling errors result from
undercoverage, nonresponse bias, response bias, or data-entry error.
Such errors could also be present in a census.
§ Sampling error Sampling error results from using a sample to
estimate information about a population. This type of error occurs
because a sample gives incomplete information about a
population.
§ Factor Another word to describe an explanatory variable in a
designed experiment.
§ Experiment A controlled study conducted to determine the effect
that varying one or more explanatory variables, or factors, has on a
response variable.
§ Treatment Any combination of the values of the factors.
§ Experimental unit A person, object, or some other well-defined
item upon which a treatment is applied.
§ Subject The experimental unit is often referred to as a subject
when he or she is a person.
§ Control group A control group serves as a baseline treatment
that can be used for comparison with other treatments.
§ Placebo An innocuous medication, such as a sugar tablet, that
looks, tastes, and smells like the experimental medication.
§ Blinding refers to nondisclosure of the treatment an experimental
unit is receiving.
§ Single-blind experiments the experimental unit (or subject) does
not know which treatment he or she is receiving.
§ Double-blind experiments Neither the experimental unit nor the
researcher in contact with the experimental unit knows which
treatment the experimental unit is receiving.

§ Design To design an experiment means to describe the overall
plan in conducting the experiment.

§ Replication occurs when each treatment is applied to more than
one experimental unit.

§ Completely randomized design A design in which each
experimental unit is randomly assigned to a treatment.
§ Matched pairs design An experimental design in which the
experimental units are paired up. The pairs are selected so that
they are related in some way (that is, the same person before and
after a treatment, twins, husband and wife, same geographical
location, and so on). There are only two levels of treatment in a
matched-pairs design.
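
A minimal sketch of random assignment in a completely randomized design. The eight unit names, the two treatment labels, and the seed are hypothetical, and dealing shuffled units out in turn is just one simple way to randomize.

```python
import random

rng = random.Random(7)  # arbitrary seed

# Hypothetical experimental units and treatments.
units = [f"unit{i}" for i in range(1, 9)]
treatments = ["placebo", "drug"]

# Completely randomized design: shuffle the units, then deal them out to treatments in turn.
shuffled = units[:]
rng.shuffle(shuffled)
assignment = {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

# Replication: each treatment is applied to more than one experimental unit.
counts = {t: list(assignment.values()).count(t) for t in treatments}
print(counts)  # {'placebo': 4, 'drug': 4}
```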

Pareto chart
A bar graph whose bars are drawn in decreasing order of frequency or
relative frequency.

Bell-shaped distribution
The highest frequency occurs in the middle and frequencies tail off to the
left and right of the middle.

Skewed right
The tail to the right of the peak is longer than the tail to the left of the
peak.

Skewed left
The tail to the left of the peak is longer than the tail to the right of the
peak.

Relative frequency
The proportion (or percent) of observations within a category, found
using the formula
relative frequency = frequency / sum of all frequencies
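
A short sketch of the relative frequency calculation (frequency divided by the sum of all frequencies), using made-up eye-color data for ten students.

```python
from collections import Counter

# Hypothetical qualitative data: eye colors of ten students.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "hazel", "blue", "brown"]

frequency = Counter(eye_colors)
total = sum(frequency.values())

# Relative frequency = frequency / sum of all frequencies.
relative_frequency = {color: count / total for color, count in frequency.items()}

# most_common() lists categories in decreasing order of frequency,
# which is the order in which a Pareto chart draws its bars.
pareto_order = [color for color, _ in frequency.most_common()]

print(relative_frequency["brown"], relative_frequency["blue"])  # 0.5 0.3
```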

Bar graph
A graph that is constructed by labeling each category of data on either
the horizontal or vertical axis and the frequency or relative frequency of
the category on the other axis. Rectangles of equal width are drawn for
each category. The height of each rectangle represents the category's
frequency or relative frequency.

Pie chart
A circle divided into sectors, where each sector represents a category of
data. The area of each sector is proportional to the frequency of the
category.

Histogram
Constructed by drawing rectangles for each class of data. The height of
each rectangle is the frequency or relative frequency of the class. The
width of each rectangle is the same and the rectangles touch each other.

Stem-and-leaf plot
A graphical representation of quantitative data in which the data itself is
used to create the graph. One advantage of the stem-and-leaf plot over
frequency distributions and histograms is that the raw data can be
retrieved from the stem-and-leaf plot. So, from a stem-and-leaf plot we
can determine the maximum observation.

Dot plot
Drawn by placing each observation horizontally in increasing order and
placing a dot above the observation each time it is observed.
Population mean = μ

Sample mean = x̄

Mean
Computed by adding all of the values of the variable in the data set and
dividing by the number of observations.

Median
The value that lies in the middle of the data when arranged in ascending
order.

Resistant
A numerical summary of data is said to be resistant if extreme values
relative to the data do not affect its value substantially.

Mode
The most frequent observation of the variable that occurs in the data set.
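
To make the mean, median, mode, and the idea of resistance concrete, here is a small sketch with Python's statistics module. The data values are invented for illustration.

```python
import statistics

# Hypothetical quantitative data.
data = [2, 3, 3, 5, 7]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7) / 5 = 4
median = statistics.median(data)  # middle value of the sorted data = 3
mode = statistics.mode(data)      # most frequent observation = 3

# Add one extreme observation and compare the two measures of center.
with_outlier = data + [100]
mean_out = statistics.mean(with_outlier)      # 120 / 6 = 20
median_out = statistics.median(with_outlier)  # (3 + 5) / 2 = 4

print(mean, median, mode, mean_out, median_out)
```

The mean jumps from 4 to 20 while the median only moves from 3 to 4, which is what it means for the median to be resistant and the mean not.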

Dispersion The degree to which the data are spread out.

Range
Denoted R. The difference between the largest and smallest data value.

Deviation about the Mean

For the ith observation, the deviation about the mean is xi − μ for a
population and xi − x̄ for a sample.

Population standard deviation

σ. The square root of the sum of squared deviations about the population
mean divided by the number of observations in the population, N. That
is, it is the square root of the mean of the squared deviations about the
population mean.

Sample standard deviation

s. The square root of the sum of squared deviations about the sample
mean divided by n - 1, where n is the sample size.
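
The two definitions differ only in the divisor (N versus n - 1). A brief sketch on made-up data, checked against the standard library's pstdev and stdev functions:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]  # these sum to 32

# Treating the data as a whole population: divide by N.
sigma = math.sqrt(sum(squared_deviations) / len(data))  # sqrt(32 / 8) = 2.0

# Treating the data as a sample: divide by n - 1.
s = math.sqrt(sum(squared_deviations) / (len(data) - 1))  # sqrt(32 / 7)

print(sigma, statistics.pstdev(data))
print(s, statistics.stdev(data))
```

Dividing by n - 1 makes s slightly larger than the value obtained by dividing by n.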

Biased
A statistic is said to be biased whenever it consistently underestimates
a parameter.

Empirical Rule

If a distribution is roughly bell shaped, then
(a) Approximately 68% of the data will lie within 1 standard deviation
of the mean.
(b) Approximately 95% of the data will lie within 2 standard deviations
of the mean.
(c) Approximately 99.7% of the data will lie within 3 standard
deviations of the mean.
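
As a worked illustration of the Empirical Rule, the sketch below computes the three intervals for an assumed bell-shaped distribution. The mean of 100 and standard deviation of 15 are made-up values.

```python
# Hypothetical bell-shaped distribution: mean 100, standard deviation 15.
mean, sd = 100, 15

# Intervals within 1, 2, and 3 standard deviations of the mean, paired with
# the approximate share of the data the Empirical Rule places in each.
empirical_rule = {
    k: ((mean - k * sd, mean + k * sd), share)
    for k, share in [(1, 0.68), (2, 0.95), (3, 0.997)]
}

for k, (interval, share) in empirical_rule.items():
    print(f"within {k} sd: {interval} holds about {share:.1%} of the data")
```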


z-score
Represents the distance that a data value is from the mean in terms of the
number of standard deviations.
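
A minimal sketch of the z-score calculation, (value − mean) / standard deviation. The function name and the example numbers are assumptions for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical values: mean 100, standard deviation 15.
z_above = z_score(130, 100, 15)  # (130 - 100) / 15 = 2.0
z_below = z_score(85, 100, 15)   # (85 - 100) / 15 = -1.0

print(z_above, z_below)
```

A positive z-score means the value is above the mean; a negative z-score means it is below.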


Quartiles
Divide data sets into fourths, or four equal parts.

Interquartile range
The range of the middle 50% of the observations of a data set. That is,
the IQR is the difference between the first and third quartiles and is
found using the formula: IQR = Q3 - Q1.

Describe the distribution

To describe the shape (skewed left, right, symmetric), its center (mean or
median), and its spread (standard deviation or interquartile range).

Five-number summary
Consists of the smallest observation, the first quartile (Q1), the median,
the third quartile (Q3), and the largest observation, written in order from
smallest to largest.
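
A sketch of the five-number summary and IQR on made-up data. It assumes one common hand method, taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data; calculators and software may use slightly different quartile conventions.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles taken as medians of the two halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]            # observations below the median position
    upper = xs[len(xs) - half:]  # observations above the median position
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

data = [7, 1, 3, 9, 5, 11, 13]  # hypothetical data
minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # IQR = Q3 - Q1

print(minimum, q1, median, q3, maximum, iqr)  # 1 3 7 11 13 8
```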


Boxplot
Graphical summary of quantitative data used to identify the shape of a
distribution and outliers.

A frequency polygon is a graph that uses points, connected by line
segments, to represent the frequencies for the classes. It is constructed by
plotting a point above each class midpoint (the sum of consecutive
lower class limits divided by 2) on a horizontal axis at a height equal to
the frequency of the class.

A cumulative frequency distribution displays the aggregate frequency
of the category. In other words, it displays the total number of
observations less than or equal to the upper class limit of the class.
A cumulative relative frequency distribution displays the proportion
(or percentage) of observations less than or equal to the upper class limit
of the class.

For example, the cumulative frequency for the second class is the sum of
the frequencies of classes 1 and 2; the cumulative frequency for the third
class is the sum of the frequencies of classes 1, 2, and 3; and so on.
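
The running-total idea can be sketched in a few lines. The class labels and frequencies below are invented for illustration.

```python
from itertools import accumulate

# Hypothetical frequency distribution.
classes = ["0-9", "10-19", "20-29", "30-39"]
frequencies = [4, 7, 6, 3]

# Cumulative frequency: running total of the class frequencies.
cumulative = list(accumulate(frequencies))  # [4, 11, 17, 20]

# Cumulative relative frequency: running total divided by the number of observations.
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]  # [0.2, 0.55, 0.85, 1.0]

for label, cf, crf in zip(classes, cumulative, cumulative_relative):
    print(label, cf, crf)
```

The last cumulative frequency always equals the total number of observations, so the last cumulative relative frequency is 1.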
An ogive (read as “oh jive”) is a graph that represents the cumulative
frequency or cumulative relative frequency for the class. It is constructed
by plotting points whose x-coordinates are the upper class limits and
whose y-coordinates are the cumulative frequencies or cumulative
relative frequencies of the class. Then line segments are drawn
connecting consecutive points. An additional line segment is drawn
connecting the first point to the horizontal axis at a location representing
the upper limit of the class that would precede the first class (if it
existed).

A relative frequency ogive is drawn by plotting points whose
x-coordinates are the upper class limit of each class and whose
y-coordinates are the cumulative relative frequencies of each class. Then
connect the points with line segments. Also, an additional line segment is
drawn connecting the first point to the horizontal axis at a location
representing the upper limit of the class that would precede the first class
(if it existed).

A time-series plot is obtained by plotting the time in which a variable is
measured on the horizontal axis and the corresponding value of the
variable on the vertical axis. Line segments are then drawn connecting
the points.

The arithmetic mean of a variable is computed by adding all the values
of the variable in the data set and dividing by the number of
observations.

The population arithmetic mean, μ (pronounced "mew"), is
a parameter that is computed using data from all the individuals in a
population.

The sample arithmetic mean, x̄ (pronounced "x-bar"), is
a statistic that is computed using data from individuals in a sample.

A numerical summary of data is said to be resistant if observations that
are extreme (very large or small) relative to the data do not affect its
value substantially. When an observation that is much larger than the
rest of the data is added to a data set, the value of the mean will
increase.

Relation among the Mean, Median, and Distribution Shape

Distribution Shape   Mean versus Median                       Preferred measure of central tendency
Skewed left          Mean substantially smaller than median   Median
Symmetric            Mean roughly equal to median             Mean
Skewed right         Mean substantially larger than median    Median

Measures of Central Tendency

Mean
  Computation: population mean μ = Σxi / N; sample mean x̄ = Σxi / n
  Interpretation: center of gravity
  When to use: when data are quantitative and the frequency
  distribution is roughly symmetric
Median
  Computation: arrange the data in ascending order and find the
  observation in the middle
  Interpretation: divides the bottom 50% of the data from the top 50%
  When to use: when the data are quantitative and the frequency
  distribution is skewed left or skewed right
Mode
  Computation: tally data to determine the most frequent observation
  Interpretation: most frequent observation
  When to use: when the most frequent observation is the desired
  measure of central tendency or the data are qualitative

Key vocabulary

The population is the entire group to be studied.

A sample is a subset of the population being studied.
An individual is a person or an object that is a member of the
population being studied.
Descriptive statistics consists of organizing and summarizing data. It
describes data through numerical summaries, tables, and graphs,
without making any generalizations about the population.
Ex: 32 of 40 students would return the money. 32/40 = 0.8, or 80%.
As a conclusion here we say that 80% of the students surveyed would
return the money.
Inferential statistics (level of confidence) uses methods that take a
result from a sample, extend it to the population, and measure the
reliability of the result.
As a conclusion here we say we are 95% confident that between 76%
and 84% of all students would return the money.
A parameter is a number that summarizes data for an entire
population.
Variables are the characteristics of the individuals within the
population. If variables did not vary, they would be constants, and
statistical inference would not be necessary.

The process/steps of statistics.

• Identify the problem to be solved. In this step, it's
important for the researcher to clearly lay out the question or
questions he or she wants answered. In addition, it's vital that
the researcher clearly specify the population to which the
study applies.

• Collect data. This step is extremely important in the
statistical process because if the data are not collected
appropriately, the results of the study are meaningless.
Gaining access to an entire population is often difficult and
expensive, so the researcher typically looks at a subset
of the population called a sample.

• Describe the data. When we describe the data, we gain
a sense of what the data are telling us, and we also get a
good idea of the type of inferential methods that can be used
on the sample data.

• Perform inference. Inference means taking the
results from your sample and generalizing them to the
population that you're studying. In addition, you always
report a measure of the reliability of your results.

The information we want to learn about the individuals must

be created.

Vocabulary and Concepts in Context

Concept: The characteristics of the individuals in a study are variables.
In context: Recently, my son and I planted a tomato plant in our
backyard. We were interested in studying the weights (the variable) of
each tomato (the individuals).

Concept: Variables vary. This means that a variable can take on
different values.
In context: My son noted that the tomatoes had different weights even
though they came from the same plant. He discovered that a variable
(weight) varies.

Concept: If variables did not vary, then they would be constants and
inferential statistics would not be necessary.
In context: If each tomato had the same weight, then knowing the
weight of one tomato would be enough information to determine the
weights of all tomatoes.

Concept: One goal of research is to learn the causes of variability.
In context: By researching the variable, we hope to learn to grow plants
that yield the best tomatoes. That is, why do some tomatoes weigh more
than others?

Concept: Variables can be classified into two groups: qualitative and
quantitative. Qualitative, or categorical, variables allow for the
classification of individuals based on some attribute or characteristic.
Quantitative variables provide numerical measures of individuals. The
values of a quantitative variable can be added or subtracted and provide
meaningful results.
In context: The variable "weight of a tomato" is quantitative because
the values of the variable (the weights) provide numerical measures of
the individuals (the tomatoes). If you subtract the weight of one small
tomato from the weight of a larger one, it would provide a meaningful
number telling how much heavier the large tomato was, compared with
the smaller one.

Quantitative variables can be further classified as
either discrete or continuous.

A discrete variable is a quantitative variable that has either a finite
number of possible values or a countable number of possible values. A
discrete variable cannot take on every possible value between any two
possible values. The term countable means that the values result from
counting, such as 0, 1, 2, 3, and so on.

A continuous variable is a quantitative variable that has an infinite
number of possible values that are not countable. A continuous variable
may take on every possible value between any two values. Continuous
variables typically result from measurement. Continuous variables are
often rounded. If a certain make of car gets 24 miles per gallon (mpg) of
gasoline, its miles per gallon must be greater than or equal to 23.5 and
less than 24.5, or 23.5 ≤ mpg < 24.5.

Figure 1 illustrates the relationship among qualitative, quantitative,
discrete, and continuous variables.

The difference between discrete and continuous variables.

Typically, a variable is discrete if its value results from counting, while a
variable is continuous if its value is measured. Here, we determine
whether each quantitative variable is discrete or continuous.

• The number of heads obtained after flipping a coin five times.
This is discrete: there are a finite number of values that this
variable can take on. There can be 0, 1, 2, 3, 4, or 5 heads.

• The number of cars that arrive at a McDonald's drive-through
between 12:00 PM and 1:00 PM. This is discrete because we count
the number of cars that arrive.

• The distance a 2011 Toyota Prius can travel in city driving
conditions with a full tank of gas. This is continuous
because we measure distance (in miles, feet, et cetera).

The type of variable (qualitative, discrete, or continuous) dictates the
method that can be used to analyze the data.

Vocabulary and Examples

The list of observed values for a variable is data.
Example: Gender is a variable; the observations male and female are
data.

Qualitative data are observations corresponding to a qualitative
variable.
Example: Gender is a qualitative variable; male and female are
qualitative data.

Quantitative data are observations corresponding to a quantitative
variable.
Example: Income is a quantitative variable; $32,012 or $57,839 are
quantitative data.

Discrete data are observations corresponding to a discrete variable.
Example: Attendance at a play is a discrete variable; 8431 attendees or
2984 attendees are discrete data.

Continuous data are observations corresponding to a continuous
variable.
Example: Time between calls to a call center is a continuous variable;
32 seconds or 21 seconds (between calls) are continuous data.

There are four levels of measurement. When assigning a level of
measurement to a variable, always choose the highest level of
measurement.

A variable is at the nominal level of measurement if the values of the
variable name, label, or categorize. In addition, the naming scheme does
not allow for the values of the variable to be arranged in a ranked or
specific order.

A variable is at the ordinal level of measurement if it has the properties
of the nominal level of measurement. However, the naming scheme
allows for the values of the variable to be arranged in a ranked or
specific order.

A variable is at the interval level of measurement if it has the
properties of the ordinal level of measurement and the differences in the
values of the variable have meaning. A value of zero does not mean the
absence of the quantity. Arithmetic operations such as addition and
subtraction can be performed on the values of the variable.

A variable is at the ratio level of measurement if it has the properties of
the interval level of measurement and the ratios of the values of the
variable have meaning. A value of zero means the absence of the
quantity. Arithmetic operations such as multiplication and division can
be performed on the values of the variable.

In studies we are working with a sample, not the entire population. We
aim to have our sample accurately represent our population. If our
sample does not represent the population, it has bias.
Remember, the goal of sampling is to collect information about a
population through a sample.

• Sampling bias/undercoverage: means that the technique used
to obtain the sample's individuals tends to favor one part of the
population over another.

• Nonresponse bias: exists when individuals selected to be in the
sample who do not respond to the survey have different opinions
from those who do. Nonresponse can occur because individuals
selected do not wish to respond or the interviewer was unable to
contact them. This can be controlled using callbacks. Another
method to reduce nonresponse is using rewards, such as cash
payments for completing a questionnaire, or incentives, such as a
cover letter that states that the responses to the questionnaire will
determine future policy.

• Response bias: exists when the answers on a survey do not
reflect the true feelings of the respondent. It can occur in a number
of ways:

  • Interviewer error: Do not be quick to trust surveys conducted by
  poorly trained interviewers. Do not trust survey results if the
  sponsor has a vested interest in the results of the survey.
  • Wording error: The way questions are asked can lead to bias in a
  survey, so questions must be asked in a balanced form. Avoid
  being vague.
  • Ordering of questions or words
  • Data-entry error


1. A survey mailed to residents of a town has a response rate of less
than 2%: nonresponse bias
2. A survey that pertains to feelings about federal income tax does not
include high-income earners: sampling bias
3. A survey of college students in which they are asked to disclose the
number of hours they study: response bias
4. A survey of taxpayers conducted by tax revenue collection agents to
identify sources of fraudulent deductions: response bias

What do we mean when we say sampling results in "incomplete
information"? We mean that the individuals in the sample cannot reveal
all the information about the population.

Nonsampling error is the error that results from the process of obtaining
the data. Undercoverage, nonresponse bias, response bias, and data-entry
errors are all types of nonsampling errors. Sampling error is the error
that results because a sample is being used to estimate information about
a population. This type of error occurs because a sample gives
incomplete information about a population.
N.B: Since closed questions limit the possible responses, they are easier
to analyze. Open questions are harder to analyze due to the variety of
answers and the chance of misinterpreting an answer.

What does it mean when a part of the population is under-represented?

When a part of the population is proportionally smaller in a sample
than in its population, this part of the population has been
under-represented. This could be caused by many different types of bias,
or even by random chance.

It is rare for frames to be accurate because frames are
obtained periodically, whereas populations are constantly changing.
For example, a frame that consists of all of the students in a school
would be inaccurate as soon as any student leaves the school, or any new
student joins the school.
Graphs for qualitative data: bar graph, Pareto chart, horizontal bar
graph, side-by-side bar graph, pie chart.

Class width ≈ (largest data value − smallest data value) /
number of classes
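
A small sketch of the class width calculation on made-up data. Rounding the result up to a whole number is an assumed convention here, used so that the classes are sure to cover all the data; the function name is hypothetical.

```python
import math

def class_width(data, number_of_classes):
    """(largest - smallest) / number of classes, rounded up to a whole number."""
    return math.ceil((max(data) - min(data)) / number_of_classes)

data = [12, 47, 23, 35, 18, 41, 29]  # hypothetical data
width = class_width(data, 5)         # (47 - 12) / 5 = 7

print(width)
```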

The sum of the deviations about the mean always equals zero
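
One way to see why, sketched for a sample of size n with mean x̄ (the same argument works for a population with μ and N):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

This is also why the deviations are squared before averaging when computing a standard deviation: without squaring, the positive and negative deviations cancel.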

The standard deviation describes how far, on average, each
observation is from the typical value. A larger standard deviation means
that observations are more distant from the typical value and are
therefore more dispersed.

The variance of a variable is the square of the standard deviation.

The population variance is σ², and the sample variance is s².

The z-score represents the number of standard deviations an
observation is from the mean.

