Statistics Lecture Notes by IIUCian Teacher

Unit I - Descriptive Statistics
Introduction: Descriptive and Inferential Statistics
Statistics is the science of collecting, organizing, and analyzing data.
Data (singular: datum) refer to facts or pieces of information. Some examples of data are:
Example Data
The ages of the students in your statistics class: 19, 21, 18, 18, 34, 30, 25, 26, 24, 24, 19,
18, 21, 49, 27
The genders of the last eight people to walk by: male, male, female, male, female,
female, female, male
The IQ (Intelligence Quotient measurement) of five randomly selected individuals: 109,
89, 129, 101, 104
Most types of statistics are either Descriptive or Inferential.
Descriptive Statistics
Descriptive Statistics consists of organizing and summarizing data.
Inferential Statistics
Inferential Statistics consists of using data you’ve collected to form conclusions.
Here's a sample question: Let’s say there are 20 statistics classes at your university, and
you’ve collected the ages of all the students in one class.
Ages of students in your statistics class: 19, 21, 18, 18, 34, 30, 25, 26, 24, 24, 19, 18, 21,
49, 27
A descriptive question that could be asked about this data is "What is the most common
age of student in your statistics class?" The answer in this case would be 18. An
inferential question that could be asked about this data is "Are the ages of the students in
this classroom similar to what you would expect in a normal statistics class at this
university?"
In statistics, we deal with populations and samples.
Created by Syed Zahidur Rashid, Lecturer, Dept. of Electronic and Telecommunication Engineering
Population
The population is the entire group you are interested in studying.
Sample
A sample is a subset of the population. That is to say, it is a select group of information
taken from a population.
Let’s say you want to find the average GPA of a student at your university. Your
university has 20,000 students, and you randomly select 100 students and ask them their
GPAs. Your population is the group you’re interested in studying (the 20,000 students),
and your sample is a small group (a subset) you’ve taken from the population.
Figure 1.
Sampling Methods
How do we select from the population what goes into our sample? Before we answer this
question, we must first decide how we're going to represent different sample and
population sizes.
Representing Size
Mathematically, a capital N is used to represent the size of a population.
A lowercase n is used to represent the size of a sample.
Let’s say you want to find the average GPA of a student at your university. Your
university has 20,000 students, and you select 100 students and ask them their GPAs.
What are N and n in this example?
N, the size of your population, is 20,000
n, the size of your sample, is 100
Samples are drawn from populations through several different sampling methods:
Simple Random Sampling
Every member of the population(N) has an equal chance of being selected for your
sample(n).
This is arguably the best sampling method, as your sample is very likely to be
representative of your population. However, it is often impractical, because it requires a
way to reach every member of the population.
Stratified Sampling
With this method, the population(N) is split into non-overlapping groups ("strata"), then
simple random sampling is done on each group to form a sample(n).
One example of this would be splitting a population of students into men and women,
then sampling from each of the two groups. This may allow us to collect the same
amount of information as simple random sampling, but use fewer people.
Systematic Sampling
In this method, every kth individual from the population(N) is placed in the sample(n), where k is a fixed interval.
For example, if you add every 7th individual to walk out of a supermarket to your sample,
you are performing systematic sampling.
Convenience Sampling
Easily obtained individuals from the population(N) are placed in the sample(n).
Simply put, in this type of sampling you pick the easiest way of getting your sample. A
common form is voluntary response sampling, in which individuals choose whether to be
part of the sample. This can be a problem, because there may be a
difference between people who choose to participate and people who don’t.
While there are other sampling methods not mentioned in this article, these are the most
commonly used.
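The first three methods above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered students, the two strata, and the function names are all made up for the example, and a fixed seed is used so the run is repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance of being selected."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random sampling inside each non-overlapping group."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def systematic_sample(population, k):
    """Take every k-th individual (the k-th, 2k-th, 3k-th, ...)."""
    return population[k - 1::k]

rng = random.Random(0)            # fixed seed (an assumption for repeatability)
students = list(range(1, 101))    # hypothetical population, N = 100
strata = {"group_a": students[:50], "group_b": students[50:]}

srs = simple_random_sample(students, 10, rng)
strat = stratified_sample(strata, 5, rng)
syst = systematic_sample(students, 7)

print(len(srs), len(strat), syst)
```

Convenience sampling is left out on purpose, since it depends on who happens to be easy to reach rather than on a rule.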
Types of Variables
A variable is a property that can take on many values.
"Age" is a variable. It can take on many different values, such as 18, 49, 72, and so on.
"Gender" is a variable. It can take on two different values, either male or female.
"Place" (in a race) is another variable. It can take on values such as 1st place, 2nd place,
3rd place, and so on.
There are two kinds of variables: Quantitative Variables and Qualitative/Categorical
Variables:
Quantitative Variable
A quantitative variable is measured numerically. With measurements of quantitative
variables you can do things like add and subtract, and multiply and divide, and get a
meaningful result. In the previous example, "Age" was a quantitative variable.
Qualitative/Categorical Variables
These allow for classification based on some characteristic. With measurements of
qualitative/categorical variables you cannot do things like add and subtract, and multiply
and divide, and get a meaningful result. In the previous example, "Gender" was a
qualitative/categorical variable. Gender was categorized as either male or female.
There are two further kinds of quantitative variables:
Discrete Variable
A discrete variable is a quantitative variable with a finite or countable number of values. For example,
imagine you rolled a six-sided die four times and measured how many times you rolled
an even number. What are your possible outcomes? {0, 1, 2, 3, 4}
Continuous Variable
A continuous variable is a quantitative variable that can take any value within a range, so it has an infinite number of possible values. Take
temperature for example. Temperature can take on an infinite number of values, such as
80 degrees, or 80.01 degrees, or 80.0050592359 degrees. In the previous example we
were limited to a finite number of values (you couldn’t roll 1.5 even numbers), which is
what made it discrete.
Independent and Dependent Variables
Independent Variable
An independent variable is any variable that is being manipulated.
Dependent Variable
A dependent variable is any variable that is being measured.
Imagine that researchers want to test the effectiveness of a new weight loss medication.
They split participants into three groups: one group gets a 0mg dosage (control), one
group gets a 50mg dosage, and the last group gets a 100mg dosage. After six months, the
participants’ weights are measured.
What are the independent and dependent variables in this experiment?
The independent variable would be dosage, because dosage is being manipulated.
The dependent variable would be weight, because weight is being measured.
Variable Measurement Scales
There are four different data types of measured variables:
Nominal
Nominal data (also known as qualitative/categorical data) is data that is split into
categories.
For example: what kind of data would you collect for the variable "Color"? You would
end up with information such as "red", "green", "blue", and so on. This qualitative
information is called nominal data.
Ordinal
Ordinal data is data where order matters, but distance between values does not.
For example: imagine three people in a race. One finishes in 1st place, one in 2nd place,
and the last in 3rd place. This data can be placed in order, but we can’t necessarily
measure the distance between values (maybe 1st place finished four seconds ahead of 2nd
place, and 2nd place finished nineteen seconds ahead of 3rd place).
Interval
Interval data is data where order matters, and distances between values are equal and
meaningful, and a natural zero is not present.
For example: temperature (in Fahrenheit or Celsius) is interval data. The difference
between 10 degrees and 20 degrees is 10 degrees. The difference between 80 degrees and
90 degrees is 10 degrees. The scale at any given point is constant, while a measurement
of 0 degrees does not reflect a true "lack of temperature".
Ratio
Ratio data is data where order matters, distances between values are equal and
meaningful, and a natural, meaningful zero is present.
For example: mass is ratio data. The difference between 140 grams and 155 grams is 15
grams. The difference between 280 grams and 295 grams is 15 grams. The scale at any
given point is constant, and a measurement of 0 reflects a complete lack of mass.
Why does it matter? Different types of data allow for different types of data analysis.
Frequency Distributions and Cumulative Frequency Distributions
Frequency Distribution
A frequency distribution lists all possible outcomes of an event, and the number of times
each outcome occurs.
Imagine that you reached into a bag of candy 16 times and pulled out the following colors
of candy:
Red, Green, Green, Green, Blue, Blue, Red, Blue, Green, Green, Red, Red, Blue, Green,
Red, Red
Record your results in a frequency distribution table.
A cumulative frequency table for the data would look like this:
Figure 1.
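As a rough check, the frequency and cumulative frequency columns for the candy data can be tallied with Python's standard library. The order Red, Green, Blue is an assumption made for the example; color is nominal data, so any order would do.

```python
from collections import Counter
from itertools import accumulate

candies = ["Red", "Green", "Green", "Green", "Blue", "Blue", "Red", "Blue",
           "Green", "Green", "Red", "Red", "Blue", "Green", "Red", "Red"]

freq = Counter(candies)                    # how many times each color occurs
colors = ["Red", "Green", "Blue"]          # display order (arbitrary choice)
counts = [freq[c] for c in colors]
cumulative = list(accumulate(counts))      # running total of the counts

print("Color  Freq  Cumulative")
for color, f, cf in zip(colors, counts, cumulative):
    print(f"{color:<6} {f:>4}  {cf:>10}")
```

The last cumulative value should equal the total number of draws, 16.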
Bar Graphs and Pie Charts
Bar Graph
A bar graph lists each category on the horizontal axis and the number of occurrences for
each category on the vertical axis.
Imagine that you reached into a bag of candy 16 times and pulled out the following colors
of candy:
Red, Green, Green, Green, Blue, Blue, Red, Blue, Green, Green, Red, Red, Blue, Green,
Red, Red
How could we display this information as a bar graph? First, create a frequency table.
Then, use that information to create the bar graph.
Figure 1.
In a bar graph, the bars don't touch. If the bars touched, this would be a histogram.
Pie Chart
A pie chart is a circle divided into sectors, where each sector represents a category of data
that is proportional to the total amount of data collected.
Proportion
A proportion is a part considered in relation to its whole.
For example, imagine a pizza with 8 slices. Your friend takes and eats 3 slices. In this
case, 3/8 is the proportion of the pizza your friend took. This can also be represented as a
decimal. 3 / 8 = 0.375
Use the information from the "bag of candy" example to construct a pie chart for the three
different colors.
Figure 2.
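To size the sectors, each color's count is turned into a proportion of the 16 draws and then into an angle out of 360 degrees. A small sketch, using the counts tallied from the candy list above:

```python
counts = {"Red": 6, "Green": 6, "Blue": 4}   # tallied from the 16 draws
total = sum(counts.values())

# Proportion = part / whole; sector angle = proportion of a full circle.
proportions = {color: f / total for color, f in counts.items()}
angles = {color: p * 360 for color, p in proportions.items()}

for color in counts:
    print(f"{color:<6} {proportions[color]:.3f} {angles[color]:.0f} degrees")
```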
Histograms and Stem & Leaf Plots
Histogram
A histogram is a bar graph that lists each measured category on the horizontal axis and
the number of occurrences for each category on the vertical axis. The rectangles for each
bar touch one another.
Discrete Histogram
Discrete histograms are created when dealing with discrete values on the horizontal axis.
For example, say you ask 20 people to sample a new flavor of candy, then indicate how
much they liked it on a scale from 1 to 5 (with 5 being "very tasty" and 1 being "tastes
horrible"). You obtain the following results:
3, 1, 3, 2, 2, 1, 2, 3, 2, 2, 5, 5, 2, 3, 3, 4, 4, 1, 1, 3
To create a discrete histogram, first create a frequency table. Then, create a discrete
histogram.
Figure 1.
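The frequency table behind this discrete histogram can be tallied directly, and a rough text version of the histogram printed from it. This is only a sketch; a real plot would normally be drawn with a charting tool.

```python
from collections import Counter

ratings = [3, 1, 3, 2, 2, 1, 2, 3, 2, 2, 5, 5, 2, 3, 3, 4, 4, 1, 1, 3]
freq = Counter(ratings)

# One row per rating from 1 to 5; the bar length is the frequency.
for value in range(1, 6):
    print(f"{value} | {'#' * freq[value]} ({freq[value]})")
```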
Continuous Histogram
Continuous histograms are created when dealing with continuous values on the
horizontal axis.
For example, say you light 200 matches and record how many seconds each match burns
for until it goes out. Below is a table of your (hypothetical) results:
Figure 2.
Because there are no distinct categories, we must first create classes.
Class
A class is an interval of many values.
For example, 1-10 is a class that covers all numbers from 1 to 10.
Lower Class Limit
The lower class limit is the smallest value within each class.
Upper Class Limit
The upper class limit is the largest value within each class.
Class Width
The class width is the difference between consecutive lower class limits.
How many classes should we create for the data in the previous table? Usually, it's good
to create about 8 to 10 classes.
Here's what the frequency table and continuous histogram of our data might look like:
Figure 3.
Stem-and-Leaf Plot
A stem-and-leaf plot is another graphical representation of data, this time using stems and
leaves.
Imagine I’ve recorded rainfall (in inches) for the last 20 days, as shown in the table
below:
Figure 4.
A stem-and-leaf plot for the above data would look like this:
Figure 5.
The numbers on the left are the stems, while the numbers on the right are the leaves.
Arithmetic Mean for Samples and Populations
Arithmetic Mean
The arithmetic mean is a single value meant to "sum up" a data set.
Let’s say you have this data set:
1, 1, 2, 2, 2, 3, 3, 4, 5, 5
What single value best represents this data?
To calculate the mean, first add up all values, then divide by the total number of values
you have.
Figure 1.
There are two types of arithmetic mean: population mean, and sample mean. They're both
calculated the same way, but they're written differently:
Figure 2.
Remember that N represents the size of a population, and n represents the size of a
sample.
Central Tendency: Mean, Median, and Mode
Central Tendency
Central tendency refers to the measure used to determine the center of a distribution of
data. It is used to find a single score that is most representative of an entire data set.
1, 1, 2, 2, 2, 3, 3, 4, 5, 5
If we could pick a single value to represent the above sample data set, what ways could
we do it?
Mean
To find the mean, add up all values, then divide by the total number of values you have.
Figure 1.
Our mean is 2.8.
Median
To find the median, first put all the values in order (this has been done already). Next,
find out what value lies in the middle.
Figure 2.
What happens when there are two values in the middle? Find the median by calculating
the mean of the two values. Here, the median is (2 + 3) / 2 = 2.5
Mode
The mode is simply the most frequently occurring value.
For our data set, 2 is the mode because it occurs the most frequently.
If two values occur the most often, the distribution is said to be bi-modal. If more than
two values occur the most often, the distribution is said to be multi-modal.
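The three measures for our data set can be checked with Python's built-in statistics module. multimode is used for the mode because it returns every most-frequent value, which also covers the bi-modal and multi-modal cases.

```python
import statistics

data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5]

mean = statistics.mean(data)         # sum of values / number of values
median = statistics.median(data)     # mean of the two middle values here
modes = statistics.multimode(data)   # list of all most-frequent values

print(mean, median, modes)
```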
What should you use?
The mean will be used for almost all occasions. However, outliers can sometimes
interfere with usage of the mean.
Outlier
An outlier is a value that is very different from the other data in your data set. This can
skew your results.
In situations with many outliers, the mean is not a good measure of central tendency. The
median or mode should be used instead, depending on the type of information you’re
dealing with.
Variance and Standard Deviation of a Population
Dispersion
Dispersion refers to how spread out a data set is about the mean.
Variance and Standard Deviation are two measures of dispersion within a data set. Below
are the definitional formulas for finding both:
Figure 1.
Using the definitional formula calculate variance for the data set 1, 2, 2, 3, 4, 5:
Figure 2.
Here's what's happening here: first, we're finding out how much each individual number
deviates from the mean.
We are then squaring all of those values (called "deviations"), and adding them together.
We take the sum of the squared deviations and divide by the total number of scores to get a
variance of 1.81.
To get the standard deviation of this data set, all we need to do is take the square root of
1.81. After doing so, we find the standard deviation to be 1.35.
Using the definitional formula can take a long time, so we usually use a shorter formula
called the computational formula:
Figure 3.
In this problem, N is the size of our data set(6). The other values are calculated like this:
Figure 4.
After plugging in all the values, we again find a variance of 1.81, and a standard
deviation of 1.35.
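Both routes to the population variance can be checked numerically. The computational form used here, (sum of x squared minus (sum of x) squared over N) divided by N, is the standard one and is assumed to match the formula in the figure. Note that the square root of the unrounded variance is about 1.344; the 1.35 quoted above comes from taking the square root of the already rounded 1.81.

```python
import math

data = [1, 2, 2, 3, 4, 5]
N = len(data)
mu = sum(data) / N

# Definitional formula: average of the squared deviations from the mean.
var_def = sum((x - mu) ** 2 for x in data) / N

# Computational formula (assumed form): uses only sum(x) and sum(x^2).
var_comp = (sum(x * x for x in data) - sum(data) ** 2 / N) / N

sd = math.sqrt(var_def)
print(round(var_def, 2), round(var_comp, 2), round(sd, 3))
```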
Variance and Standard Deviation of a Sample
Dispersion
Dispersion refers to how spread out a data set is about the mean.
Variance and Standard Deviation are two measures of dispersion within a data set. Below
are the definitional formulas for finding both:
Figure 1.
Using the definitional formula calculate variance for the data set 1, 2, 2, 3, 4, 5:
Figure 2.
Here's what's happening here: first, we're finding out how much each individual number
deviates from the mean.
We are then squaring all of those values (called "deviations"), and adding them together.
We take the sum of the squared deviations and divide by the total number of scores minus 1 to get
a variance of 2.17.
To get the standard deviation of this data set, all we need to do is take the square root of
2.17. After doing so, we find the standard deviation to be 1.47.
Using the definitional formula can take a long time, so we usually use a shorter formula
called the computational formula:
Figure 3.
In this problem, n is the size of our sample (6). The other values are calculated like this:
Figure 4.
After plugging in all the values, we again find a variance of 2.17, and a standard
deviation of 1.47.
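The only change from the population case is the divisor, n - 1 instead of N. A short check, comparing a by-hand calculation with the statistics module (whose variance and stdev functions use the n - 1 divisor):

```python
import statistics

sample = [1, 2, 2, 3, 4, 5]
n = len(sample)
x_bar = sum(sample) / n

# Sum of squared deviations divided by n - 1.
var_manual = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

var_lib = statistics.variance(sample)   # sample variance
sd_lib = statistics.stdev(sample)       # sample standard deviation

print(round(var_manual, 2), round(var_lib, 2), round(sd_lib, 2))
```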
Percentiles and Quartiles
For the following data set:
1, 2, 3, 4, 5
What percent of the numbers are even?
Figure 1.
40% of the values are even.
Percentile
A percentile is a value below which a certain percentage of observations lie.
For the following data set:
2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12
What is the percentile ranking of "10"?
Figure 2.
Now, what value exists at the percentile ranking of 25%?
Figure 3.
Here, it is saying that the 25th percentile exists somewhere between the 5th and 6th values.
There is no "5.25th" value, so I take the average of the 5th and 6th values to find what
value exists at the 25th percentile. 5 and 5 average to give us an answer of 5.
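Here is a sketch of both calculations. Two assumptions are made because the figures are not reproduced here: the percentile ranking is taken as the percentage of values strictly below the score (following the definition above), and the position of a percentile is taken as (n + 1) times the percentage, which is what gives the "5.25th" position mentioned in the text. Other textbooks use slightly different rules.

```python
data = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

def percentile_rank(values, x):
    """Percentage of values strictly below x (assumed definition)."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

def value_at_percentile(sorted_values, p):
    """Value at the p-th percentile using the (n + 1) * p position rule.

    When the position is not a whole number, average the two
    neighbouring values, as done in the text. Positions outside
    1..n are not handled in this sketch.
    """
    position = (len(sorted_values) + 1) * p / 100   # 1-based position
    lower = int(position)
    if position == lower:
        return sorted_values[lower - 1]
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2

print(percentile_rank(data, 10))
print(value_at_percentile(data, 25))
```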
Quartile
Quartiles divide data sets into quarters. There are three quartiles: The 1st Quartile is
located at the 25th percentile, the 2nd Quartile is located at the 50th percentile, and the
3rd Quartile is located at the 75th percentile.
Figure 4.
The Five Number Summary, Interquartile Range(IQR), and Boxplots
Five Number Summary
The Five Number Summary is a method for summarizing a distribution of data. The five
numbers are the minimum, the first quartile(Q1) value, the median, the third quartile(Q3)
value, and the maximum.
Give the five number summary for the following data set:
1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27
The first thing you might notice about this data set is the number 27. This is very
different from the rest of the data. It is an outlier and must be removed. When it comes to
outliers, we remove everything that isn't between a lower fence and an upper fence:
Figure 1.
Here, we first find the First Quartile(Q1) and the Third Quartile(Q3) values. We then use
those two values to find the Interquartile Range(IQR). Finally, we can use those values to
find the lower and upper fences. Plugging in the values, we find a lower fence of -3, and
an upper fence of 13. We now remove the 27 from the original data set, because it falls
outside of this range. Our new data set is:
1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9
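The fence calculation can be sketched as follows. It assumes the common rule of 1.5 times the IQR on each side, with Q1 = 3 and Q3 = 7; those values reproduce the fences of -3 and 13 quoted above.

```python
data = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 27]

q1, q3 = 3, 7                      # quartiles assumed from the text's fences
iqr = q3 - q1                      # interquartile range

lower_fence = q1 - 1.5 * iqr       # assumed 1.5 * IQR rule
upper_fence = q3 + 1.5 * iqr

kept = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)
```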
Now, we should easily be able to find our five number summary:
1. Minimum - 1
2. First Quartile(Q1) - 3
3. Median - 5
4. Third Quartile(Q3) - 7
5. Maximum - 9
Boxplot
A boxplot is a visual representation of a five number summary.
Here is a boxplot of our five number summary:
Figure 2.
The Effects of Outliers
Outlier
An outlier is a value that is very different from the other data in your data set. This can
skew your results.
Let's examine what can happen to a data set with outliers. For the sample data set:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4
We find the following mean, median, mode, and standard deviation:
Mean = 2.58
Median = 2.5
Mode = 2
Standard Deviation = 1.08
If we replace the last value with an outlier of 400:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400
The new values of our statistics are:
Mean = 35.58
Median = 2.5
Mode = 2
Standard Deviation = 114.77
As you can see, having outliers often has a significant effect on your mean and standard
deviation. Because of this, we must take steps to remove outliers from our data sets.
Skewness
Skewness
Skewness refers to how a distribution "leans". It is important to recognize skewness
because it has strong implications in hypothesis testing.
Let's examine three different kinds of skew:
Figure 1.
This symmetrical distribution has no skew. The mean exists perfectly at the center.
Figure 2.
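This distribution is skewed to the right. Its long tail extends to the right, and the mean is
pulled to the right of the center, toward the tail.
Figure 3.
This distribution is skewed to the left. Its long tail extends to the left, and the mean is
pulled to the left of the center, toward the tail.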
The Normal Curve and Empirical Rule
The distributions of many continuous random variables approximately follow the shape of the
normal curve.
Figure 1.
On the normal curve, mean, median, and mode all exist at the center.
Figure 2.
The curve changes concavity at its inflection points. These points mark the distance of
one standard deviation from the mean on either side.
Figure 3.
The Empirical Rule
The empirical rule states that, for data that follow the normal curve, approximately:
68% of all values fall within 1 standard deviation of the mean.
95% of all values fall within 2 standard deviations of the mean.
99.7% of all values fall within 3 standard deviations of the mean.
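These three percentages are rounded areas under the normal curve. As a quick check, the area within k standard deviations of the mean equals erf(k / sqrt(2)), which Python's math module can evaluate:

```python
import math

def area_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```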
Z-Scores (part one)
z-Scores
z-Scores are standardized values that can be used to compare scores in different
distributions.
Take this example: For the past two years, Joe has been in a bowling league.
First Year Stats:
League Average = 181
Standard Deviation = 12
Joe’s Score in Final Game = 187
Second Year Stats:
League Average = 182
Standard Deviation = 5
Joe’s Score in Final Game = 185
Compared to the rest of the league, in which year was Joe’s score in the final game
better?
Figure 1.
We can calculate a z-score for each year:
Figure 2.
We can then plot the z-scores and compare their placement on the distribution. From the
graphs below you can see that compared to the rest of the league, Joe had a better score in
his second year.
Figure 3.
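Joe's two z-scores follow from the usual formula, z = (score - mean) / standard deviation. The helper name z_score below is just for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

z_year1 = z_score(187, 181, 12)   # first year
z_year2 = z_score(185, 182, 5)    # second year

print(z_year1, z_year2)
```

The second-year z-score is the larger of the two, which is why the second-year game counts as the better performance relative to the league.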
Z-Scores (part two)
Let’s say we have a random variable, x, distributed as such:
Figure 1.
You can see from the distribution that x has a mean of 4, and a standard deviation of 1. I
want to know: What percentage of scores fall above 4.25?
Figure 2.
Or basically, what falls in this blue area. Remember from the empirical rule that we know
what probabilities are associated with different areas of the normal distribution. We can
use this information to find the percentage of scores above 4.25. First, we need to
calculate a z-score:
Figure 3.
With our z-score of 0.25, we now head to the z-table to find the area associated with it.
The first thing you'll notice about the z table is that it's asking for the "area in body".
Figure 4.
When you split a distribution into two parts, the smaller portion is the tail, while the
larger portion is the body. We're trying to look up a tail, but this table only gives us body.
To find the correct answer, we must first find the area in the body, then subtract it from 1.
That will give us the area in the tail.
Looking up the area in the body for z = 0.25, we find a proportion of 0.5987. To find our
final answer, we just have to make one quick change:
Area in Body = 0.5987
Area in Tail = 1.00-0.5987 = 0.4013
So, about 40% of scores fall above 4.25.
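If no z-table is at hand, the same lookup can be done with NormalDist from Python's statistics module (available in Python 3.8 and later). Its cdf method gives the area to the left of a value, which is the "area in body" for a positive z-score.

```python
from statistics import NormalDist

mean, sd = 4, 1
z = (4.25 - mean) / sd               # z-score of 4.25

area_body = NormalDist().cdf(z)      # area to the left of z (standard normal)
area_tail = 1 - area_body            # area above 4.25

print(round(z, 2), round(area_body, 4), round(area_tail, 4))
```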
Extra Z-Score Problems
Here are some extra problems involving z-scores:
In the United States, the average IQ is 100, with a standard deviation of 15. What
percentage of the population would you expect to have an IQ lower than 85?
Figure 1.
Basically, we're trying to find what area corresponds to the blue tail shown above. First,
we calculate the z-score.
Figure 2.
We then look up this z-score in our z-table. After doing so, we find the area in the body
is .8413. We subtract that value from 1 to find the area in the tail.
Figure 3.
Our answer is .1587. About 16% of the population has an IQ score lower than 85.
What if the question was like this: In the United States, the average IQ is 100, with a
standard deviation of 15. What percentage of the population would you expect to have an
IQ between 90 and 120?
Figure 4.
We're trying to find what area corresponds to the blue area shown above. First, we
calculate both z-scores.
Figure 5.
In order to find the area between those two z-scores, we must first look up each z-score in
the z-table. We find that the area in the body for 0.66 is 0.7454, and the
area in the body for 1.33 is 0.9082.
Figure 6.
The area below z = -0.66 is a tail, so it equals 1 - 0.7454 = 0.2546. Subtracting this from
the area below z = 1.33 gives 0.9082 - 0.2546 = 0.6536, so about 65.36% of the population
has an IQ score between 90 and 120.
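Both IQ problems can be cross-checked with NormalDist from Python's statistics module. Because the code does not round the z-scores to two decimals before the lookup, the second answer comes out near 65.6% rather than the table-based 65.36%; the small gap is only rounding.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

p_below_85 = iq.cdf(85)                  # area to the left of 85
p_between = iq.cdf(120) - iq.cdf(90)     # area between 90 and 120

print(round(p_below_85, 4), round(p_between, 4))
```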
Unit II – Probability
The Basics of Probability
If I flip a coin, what’s the probability of the coin landing on tails?
Figure 1.
Sample Space
A sample space is a collection of all possible outcomes.
The coin can land on either heads or tails. There is a 1 / 2 = .50 (50%) chance of the coin
landing on tails.
Probabilities will always be between 0 (0%) and 1.00 (100%).
Impossible Events
An event with a probability of 0 is impossible (it will never occur).
Certain Events
An event with a probability of 1 is certain (it will always occur).
Addition Rule (Probability "or")
Mutually Exclusive
Two events are mutually exclusive if they cannot occur at the same time.
When flipping a coin, the two events (heads and tails) are mutually exclusive.
Figure 1.
What is the probability of flipping a coin and getting heads or tails?
Figure 2.
There is a 100% chance the coin will land on either heads or tails.
When picking randomly from a deck of cards, the two events heart and king are not
mutually exclusive.
Figure 3.
When picking randomly from a deck of cards, what is the probability of choosing a card
that is a heart or a king?
Figure 4.
The probability of drawing a card that is a heart or a king (with the king of hearts
counted only once) is 13/52 + 4/52 - 1/52 = 16/52, or approximately 31%.
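The addition rule for events that are not mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B), can be checked by listing all 52 cards and counting. The rank and suit labels below are just illustrative names.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))        # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event when every card is equally likely."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

p_heart = prob(lambda c: c[1] == "hearts")
p_king = prob(lambda c: c[0] == "K")
p_both = prob(lambda c: c[0] == "K" and c[1] == "hearts")

p_heart_or_king = p_heart + p_king - p_both               # addition rule
p_direct = prob(lambda c: c[0] == "K" or c[1] == "hearts")  # direct count

print(p_heart_or_king, round(float(p_heart_or_king), 4))
```

Subtracting P(A and B) matters because the king of hearts would otherwise be counted twice.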
Multiplication Rule (Probability "and")
Independent Events
Two events are independent if they do not affect one another.
For example: rolling a five and then rolling a three with a normal six-sided die. These
events are independent because rolling a five does not change the probability of rolling a
three (it is still 1/6). The same is true the other way around.
What is the probability of rolling a 5 and then a 3 with a normal six-sided die? To answer
this, we have the Multiplication Rule for Independent Events:
Figure 1.
There is a 1 in 36 chance of rolling a 5, and then rolling a 3.
Dependent Events
Two events are dependent if they do affect one another.
For example: drawing a king and then drawing a queen from a deck of cards, without
putting the king back. These events are dependent because drawing a king changes the
probability of drawing a queen. Without the king in the deck the probability of drawing a
queen changes from 4/52 to 4/51.
What is the probability of drawing a king and then drawing a queen from a deck of cards?
To answer this, we have the General Multiplication Rule for Dependent/Conditional
Events:
Figure 2.
There is roughly a 0.6% chance of drawing a king, and then drawing a queen without
replacement from a deck of cards.
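Both multiplication rules come down to multiplying fractions. With exact fractions the two answers above can be reproduced as follows; for the dependent case the second factor is the probability of a queen given that a king has already been removed.

```python
from fractions import Fraction

# Independent events: P(A and B) = P(A) * P(B)
p_five_then_three = Fraction(1, 6) * Fraction(1, 6)

# Dependent events: P(A and B) = P(A) * P(B given A)
p_king = Fraction(4, 52)
p_queen_given_king = Fraction(4, 51)     # 51 cards remain, 4 are queens
p_king_then_queen = p_king * p_queen_given_king

print(p_five_then_three, p_king_then_queen, round(float(p_king_then_queen), 4))
```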
Permutations
Imagine you’re visiting a zoo with six animals, and I ask you to record the first three
animals you see. The six animals are:
Tiger, Lion, Monkey, Zebra, Walrus, Snake
How many different ways could you run into three animals?
Figure 1.
There are 6 animals you could record in the first spot. After that, there are only 5 animals
left to see for the second spot. And after that, there are only 4 animals left to see for the
final spot. When we multiply those three numbers, we find that there are 120 different
ways you could run into three different animals. This is a permutation.
When I say that there are 120 different ways you could run into three different animals,
I'm saying that order matters. For example, there are six different ways to run into the
same three animals:
Tiger, Lion, Monkey
Tiger, Monkey, Lion
Lion, Monkey, Tiger
Lion, Tiger, Monkey
Monkey, Lion, Tiger
Monkey, Tiger, Lion
The permutations formula for this data would look something like this:
Figure 2.
But wait, what do the exclamation points(!) mean? Those mark factorials. Here are two
examples of factorials:
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
4! = 4 * 3 * 2 * 1 = 24
In a factorial, you take the initial number and multiply it by every whole number below it,
down to one. So, our final answer for the first problem would look like this:
Figure 3.
As you can see, using the equation we still get an answer of 120.
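The count of 120 can be confirmed in two ways: by listing every ordered selection of three animals, and by the standard permutations formula n! / (n - r)! with n = 6 and r = 3 (assumed to be the formula shown in the figure).

```python
import math
from itertools import permutations

animals = ["Tiger", "Lion", "Monkey", "Zebra", "Walrus", "Snake"]

orderings = list(permutations(animals, 3))               # every ordered triple
by_formula = math.factorial(6) // math.factorial(6 - 3)  # 6! / 3!

print(len(orderings), by_formula, math.perm(6, 3))
```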
Combinations
Imagine you’re visiting a zoo with six animals, and I ask you to record the first three
animals you see. The six animals are:
Tiger, Lion, Monkey, Zebra, Walrus, Snake
How many different combinations of three animals could you run into?
Figure 1.
There are 6 animals you could record in the first spot. After that, there are only 5 animals
left to see for the second spot. And after that, there are only 4 animals left to see for the
final spot. When we multiply those three numbers, we find that there are 120 different
ways you could run into three different animals. Remember from the last lecture that this
is a permutation.
When I say that there are 120 different ways you could run into three different animals,
I'm saying that order matters. For example, there are six different ways to run into the
same three animals:
Tiger, Lion, Monkey
Tiger, Monkey, Lion
Lion, Monkey, Tiger
Lion, Tiger, Monkey
Monkey, Lion, Tiger
Monkey, Tiger, Lion
In a combination, order does not matter. All six of those answers found above are really
the same thing. So, there are 120 / 6 = 20 different combinations of animals that we could
run into.
The combinations formula for our data would look something like this:
Figure 2.
But wait, what do the exclamation points(!) mean? Those mark factorials. Here are two
examples of factorials:
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
4! = 4 * 3 * 2 * 1 = 24
In a factorial, you take the initial number and multiply it by every whole number below it,
down to one. So, our final answer for the first problem would look like this:
Figure 3.
As you can see, using the equation we still get an answer of 20.
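The same check works for combinations. Listing the unordered groups of three gives 20, which agrees with the standard formula n! / (r! (n - r)!) for n = 6 and r = 3 (assumed to be the formula shown in the figure), and with 120 / 6 from the reasoning above.

```python
import math
from itertools import combinations

animals = ["Tiger", "Lion", "Monkey", "Zebra", "Walrus", "Snake"]

groups = list(combinations(animals, 3))   # order inside a group is ignored
by_formula = math.factorial(6) // (math.factorial(3) * math.factorial(6 - 3))

print(len(groups), by_formula, math.comb(6, 3))
```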
Discrete and Continuous Random Variables
Random Variable
A random variable is a variable which has its value determined by a probability
experiment.
If you flip a coin once, how many tails could you come up with? Let's create a new
random variable called "T". "T" represents the number of tails possible from our
probability experiment. After flipping a coin once (a probability experiment), T's value
will be either 1 or 0. T is a random variable.
Discrete Random Variable
A discrete random variable is a random variable which has a finite or countable number of values.
Let’s say you flip a coin six times. How many tails could you come up with?
Figure 1.
There are a finite number of possible values. Values such as "1.5" or "2.5923" don’t
make sense for this type of problem.
Continuous Random Variable
A continuous random variable is a random variable that can take any value within a range, so it
has an infinite number of possible values.
Let’s say you measure the speed (in miles per hour) of the first car to drive by your house.
What kind of values could you obtain?
Figure 2.
Maybe the car is going 25mph, or 50mph, or 62.00252mph. The variable (speed) can take
on an infinite number of values.
Discrete Probability Distributions
Probability Distribution
A probability distribution displays the probabilities associated with all possible outcomes
of an event.
Here's a probability distribution for a coin flip:
Figure 1.
What would the probability distribution look like for one roll of a six-sided die?
Figure 2.
Probability Histograms
Probability Distribution
A probability distribution displays the probabilities associated with all possible outcomes
of an event.
Here's a probability distribution for one roll of a six-sided die:
Figure 1.
As you can see, every outcome has an equal chance of occurring.
Probability Histogram
A probability histogram is a histogram with possible values on the x axis, and
probabilities on the y axis.
Here's a made-up probability distribution(left) with its probability histogram(right):
Figure 2.
Mean and Expected Value of Discrete Random Variables
Let's calculate the mean of a discrete random variable:
Below is the probability distribution for a golfer on a par 3 hole, where x = Number of
Strokes to Complete the Hole
Figure 1.
The mean can be calculated by multiplying each "x" by its corresponding "P(x)", then
adding the resulting values together:
Figure 2.
Here, the mean is 2.65.
The mean we just calculated of 2.65 is an expected value. If we were to take a large
enough sample of this golfer's performance on par 3 holes, we would expect the sample
mean to approach 2.65.
This is a short example of the Law of Large Numbers.
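The calculation can be sketched in a few lines of Python. The table in the figure is not reproduced in these notes, so the probabilities below are an assumption: a hypothetical distribution chosen only because its mean works out to 2.65.

```python
# Hypothetical distribution (NOT taken from the figure): strokes -> probability.
strokes = {1: 0.05, 2: 0.45, 3: 0.30, 4: 0.20}

def expected_value(dist):
    """Multiply each x by its P(x), then add the products together."""
    return sum(x * p for x, p in dist.items())

mean = expected_value(strokes)
print(round(mean, 2))
```

Any other table of x and P(x) values can be passed to expected_value in the same way, as long as its probabilities add up to 1.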
Variance and Standard Deviation of Discrete Random Variables
To calculate the variance of a discrete random variable, we must first calculate the mean.
Here is the mean we calculated from the example in the previous lecture:
Figure 1.
Now, we can move on to the variance formula:
Figure 2.
To find the first part of the equation, we first square every "x". Then, we multiply each
squared "x" by "P(x)". Last, we add together all resulting values.
Figure 3.
We find the first part of the equation to be 7.75. Now, we can plug in the rest of the
values to get our answers:
Figure 4.
The variance is 0.73, while the standard deviation is 0.85.
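Here is a minimal sketch of the same steps in Python. It assumes the variance formula has the form (sum of each squared x times P(x)) minus the squared mean, which matches the steps described above. The distribution itself is hypothetical, chosen only so that its mean is 2.65 and the first part of the equation is 7.75.

```python
import math

# Hypothetical distribution (NOT taken from the figure): strokes -> probability.
strokes = {1: 0.05, 2: 0.45, 3: 0.30, 4: 0.20}

# Step 1: the mean, sum of x * P(x).
mean = sum(x * p for x, p in strokes.items())

# Step 2: the first part of the equation, sum of x^2 * P(x).
mean_of_squares = sum(x ** 2 * p for x, p in strokes.items())

# Step 3: subtract the squared mean to get the variance.
variance = mean_of_squares - mean ** 2

# Step 4: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean_of_squares, 2), round(variance, 2), round(std_dev, 2))
```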
The Law of Large Numbers
The Law of Large Numbers

According to the law of large numbers, as a probability experiment is repeated more and
more times, the observed value (usually a mean) will tend to approach the expected value.
Imagine a probability experiment where a coin is flipped, and the number of heads is
measured:
Figure 1.
As more probability experiments are performed, the actual value will approach the
expected value of 0.50. Below are 10 simulated coin flips. As you can see from the line
graph on the right, the actual value is approaching the expected value.
Figure 2.
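To see this behavior for more than 10 flips, here is a small simulation sketch in Python. It assumes a fair coin, and the seed value is arbitrary (it only makes the run repeatable). The function name running_proportion_of_heads is made up for this illustration.

```python
import random

def running_proportion_of_heads(num_flips, seed=0):
    """Simulate fair coin flips and record the proportion of heads after each flip."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    heads = 0
    proportions = []
    for flip in range(1, num_flips + 1):
        if rng.random() < 0.5:  # treat values below 0.5 as heads
            heads += 1
        proportions.append(heads / flip)
    return proportions

proportions = running_proportion_of_heads(100_000)

# The observed proportion should drift toward the expected value of 0.50.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(proportions[n - 1], 4))
```

With only a few flips the proportion can be far from 0.50; it is the long run that settles near the expected value.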
Binomial Distribution
Binomial Experiment
An experiment is a binomial experiment if:
1. It is repeated a fixed number of times.
2. The trials are independent.
3. Trials have two mutually exclusive outcomes, either success or failure.
4. The probability of success is the same for all trials.
Try this example: In a recent survey, it was found that 85% of households in the United
States have High-Speed Internet. If you take a sample of 18 households, what is the
probability that exactly 15 will have High-Speed Internet?
First, let's check if it meets the four conditions:
1. Is this experiment being repeated a fixed number of times?
YES
2. Are the trials independent?
YES, discovering that one home has High-Speed Internet will not affect the probability of
other homes having High-Speed Internet.
3. Are there two mutually exclusive outcomes?
YES, a home either has High-Speed Internet (success) or doesn't have High-Speed
Internet (failure). These events cannot occur together so they are mutually exclusive.
4. Is the probability of success the same for all trials?
YES, the probability of success for each trial is 85%.
Now, let's work through the equation. To use this equation, you should already have a
pretty good idea about what a combination is.
Figure 1.
By following the above steps, you should find that the probability of exactly 15
households having High-Speed Internet is approximately 0.241.
What if I changed the example around a little? In a recent survey, it was found that 85%
of households in the United States have High-Speed Internet. If you take a sample of 18
households, what is the probability that at least 15 will have High-Speed Internet?
It says "at least" 15, so that means we have to calculate the probabilities for 15, 16, 17,
and 18 homes, then add everything together:
Figure 2.
By following these steps, you should find that the probability of at least 15 households
having High-Speed Internet is approximately 0.720.
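Both calculations can be checked with a short Python sketch. It uses the standard binomial formula, C(n, k) * p^k * (1 - p)^(n - k), which is assumed to be the equation shown in the figure. Because the code does not round intermediate values, its results can differ from a hand calculation in the third decimal place.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 18, 0.85  # 18 households, 85% chance each has High-Speed Internet

# Exactly 15 households.
exactly_15 = binomial_pmf(15, n, p)

# At least 15 households: add the probabilities for 15, 16, 17, and 18.
at_least_15 = sum(binomial_pmf(k, n, p) for k in range(15, n + 1))

print(round(exactly_15, 3), round(at_least_15, 3))
```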
Mean and Standard Deviation of Binomial Random Variables
Let's use the data from the last lecture: In a recent survey, it was found that 85% of
households in the United States have High-Speed Internet. If you take a sample of 18
households, what is the probability that exactly 15 will have High-Speed Internet?
Here are the equations for the mean and standard deviation of a binomial random variable:
Figure 1.
We can now easily plug in the number of trials and the probability of success to come up
with our answers:
Figure 2.
The mean is 15.3, and the standard deviation is 1.515.
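As a quick check, here is the same arithmetic in Python. It assumes the equations in the figure are the usual ones for a binomial random variable: mean = n * p and standard deviation = sqrt(n * p * (1 - p)).

```python
import math

n = 18    # number of trials (households sampled)
p = 0.85  # probability of success on each trial

mean = n * p
std_dev = math.sqrt(n * p * (1 - p))

print(round(mean, 1), round(std_dev, 3))
```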
Poisson Distribution/Process
Poisson Distribution
The Poisson probability distribution is used when computing the probability of a certain
number of successes within a specified interval.
An experiment follows the Poisson process if:
1. The probability of two or more successes in a small enough interval is 0%.
2. The probability of a success is the same for any two intervals which share the same
length.
3. Successes are independent of successes in other intervals.
Here's an example: At a theme park, there is a roller coaster that sends an average of three
cars through its circuit every minute between 6pm and 7pm. A random variable, X,
represents the number of roller coaster cars to pass through the circuit between 6pm and
6:10pm.
First, let's check if X is a Poisson random variable:
Is the probability of two successes in a small enough interval 0%?
YES, in a small enough interval (say, 1 second) it would be impossible for two successes
(cars through the circuit) to occur.
Are the probabilities of success equal for any two intervals of equal length?
YES, for any two intervals of equal length (say, 6pm-6:15pm and 6:30pm-6:45pm), the
average rate remains 3 cars per minute.
Are successes independent of successes in other intervals?
YES, a success in one interval is independent of a success in any other interval.
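Because the roller coaster averages 3 cars per minute, the expected number of cars in the
10-minute interval is 3 x 10 = 30.
What is the probability that exactly 35 cars will pass through the circuit between 6pm and
6:10pm?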
Figure 1.
The probability that 35 cars will pass through the circuit between 6pm and 6:10pm is
0.045.
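The calculation can be sketched in Python as follows. It assumes the figure uses the standard Poisson formula, P(x) = e^(-lambda) * lambda^x / x!, where lambda is the average number of successes in the interval (here 3 cars per minute times 10 minutes).

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k successes when the interval averages lam successes."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

rate_per_minute = 3
minutes = 10
lam = rate_per_minute * minutes  # expected number of cars between 6pm and 6:10pm

p_35 = poisson_pmf(35, lam)
print(round(p_35, 3))
```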
Mean and Standard Deviation of Poisson Random Variables
Here's my previous example: At a theme park, there is a roller coaster that sends an
average of three cars through its circuit every minute between 6pm and 7pm. A random
variable, X, represents the number of roller coaster cars to pass through the circuit
between 6pm and 6:10pm. What is the probability that 35 cars will pass through the
circuit between 6pm and 6:10pm?
We can use this information to calculate the mean and standard deviation of the Poisson
random variable, as shown below:
Figure 1.
The mean of this variable is 30, while the standard deviation is 5.477.
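Here is a short sketch of that arithmetic, assuming the figure uses the usual Poisson results: the mean equals lambda and the standard deviation equals the square root of lambda. As a rough numerical check, the code also computes the mean and variance directly from the probabilities. The cutoff of 200 cars is an arbitrary choice, far enough out that the leftover probability is negligible.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability, computed through logarithms to avoid overflow for large k."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

lam = 3 * 10  # 3 cars per minute over a 10-minute interval

mean = lam
std_dev = math.sqrt(lam)
print(mean, round(std_dev, 3))

# Numerical check: weight each count by its probability.
counts = range(0, 200)
numeric_mean = sum(k * poisson_pmf(k, lam) for k in counts)
numeric_variance = sum((k - numeric_mean) ** 2 * poisson_pmf(k, lam) for k in counts)
print(round(numeric_mean, 3), round(numeric_variance, 3))
```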
Unit III - Correlation & Regression
Coordinate (Cartesian) Planes
Below is a Coordinate Plane:
Figure 1.
This two-dimensional coordinate plane displays points on two dimensions: x, and y. Any
point on this plane will have an x value, and a y value. The point in the very center of the
plane is (0, 0).
Figure 2.
A red point has been plotted on the plane above. Where is the point located? Points are
described in the format (x, y). So, this point is located at (5, 4).
Quadrants
Figure 1.
Above is a coordinate plane with the point (5, 4) plotted on it. What quadrant is this point
in?
Figure 2.
Coordinate planes have four quadrants, as shown above. From this, we can conclude that
the point is in Quadrant I.
Scatter Plots
Scatter Plots
Scatter plots are a method of graphically displaying bivariate data.
Example bivariate data is displayed below. Our two variables are age and yearly income.
Figure 1.
What is the relationship between age and yearly income?
The relationship can be plotted on a scatter plot. Below, age is on the x-axis, while yearly
income is on the y-axis:
Figure 2.
Just by looking at the graph, it would appear that as age increases, yearly income increases.
It is a fairly good bet that our variables are related.
Pearson’s r Correlation
In the previous lecture on scatter plots, we made a scatter plot for some sample bivariate
data and concluded that the two variables were probably related.
Figure 1.
We can use this data to calculate Pearson's r.
Pearson’s r
Pearson’s r measures the strength of the linear relationship between two variables.
Pearson’s r is always between -1 and 1.
Here is a perfect positive relationship. r is equal to 1.0:
Figure 2.
Here is a perfect negative relationship. r is equal to -1.0:
Figure 3.
Here is an example of data that has no relationship. r is somewhere close to 0.0:
Figure 4.
Pearson's r is calculated with the following equation:
Figure 5.
Plugging in the values from our original example with ages and yearly incomes, we can
calculate the following r:
Figure 6.
This r is almost 1.0, so we can conclude that x (Age) and y (Yearly Income) have a strong
positive relationship. As one increases, the other tends to increase as well.
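For readers who want to try the calculation themselves, here is a minimal Python sketch. It uses the deviation form of the formula (sum of products of deviations divided by the square root of the product of the two sums of squared deviations), which may be laid out differently from the equation in the figure but gives the same r. The ages and incomes below are made-up values, not the data from the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations around each variable's mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data (NOT the data set from the figure).
ages = [22, 25, 31, 38, 44, 52]
incomes = [21000, 26000, 33000, 41000, 46000, 58000]

r = pearson_r(ages, incomes)
print(round(r, 3))
```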
Hypothesis Testing with Pearson's r
Just like with other tests such as the z-test or ANOVA, we can conduct hypothesis testing
using Pearson’s r.
To test if age and income are related, researchers collected the ages and yearly incomes
of 10 individuals, shown below. Using alpha = 0.05, are they related?
Figure 1.
Steps for Hypothesis Testing with Pearson's r
1. Define Null and Alternative Hypotheses
2. State Alpha
3. Calculate Degrees of Freedom
4. State Decision Rule
5. Calculate Test Statistic
6. State Results
7. State Conclusion
1. Define Null and Alternative Hypotheses
Figure 2.
2. State Alpha
alpha = 0.05
3. Calculate Degrees of Freedom
Where n is the number of subjects you have:
df = n - 2 = 10 - 2 = 8
4. State Decision Rule
Using our alpha level and degrees of freedom, we look up a critical value in the r-Table.
We find a critical r of 0.632.
If the absolute value of r is greater than 0.632, reject the null hypothesis.
5. Calculate Test Statistic
We calculate r using the same method as we did in the previous lecture:
Figure 3.
6. State Results
r = 0.99
Reject the null hypothesis.
7. State Conclusion
There is a significant relationship between age and yearly income, r(8) = 0.99, p < 0.05.
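The decision rule can also be sketched in Python. Critical values of r are linked to the t distribution by r = t / sqrt(t^2 + df), so the sketch rebuilds the critical r from a t critical value. The value 2.306 is the two-tailed t critical value for alpha = 0.05 and df = 8 from a standard t table; treating the test as two-tailed is an assumption here, since the hypotheses are only shown in the figure.

```python
import math

def t_from_r(r, df):
    """Convert a correlation to its t statistic with df = n - 2."""
    return r * math.sqrt(df) / math.sqrt(1 - r ** 2)

def critical_r(t_crit, df):
    """Convert a t critical value to the matching critical r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

n = 10
df = n - 2
r = 0.99        # test statistic from the example
T_CRIT = 2.306  # two-tailed t critical value, alpha = 0.05, df = 8 (from a t table)

r_crit = critical_r(T_CRIT, df)
t_stat = t_from_r(r, df)
reject_null = abs(r) > r_crit

print(round(r_crit, 3), round(t_stat, 2), reject_null)
```

Comparing r with the critical r leads to the same decision as comparing the t statistic with the t critical value.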
The Spearman Correlation
In the last lectures I talked about Pearson’s r, which measures the relationship between
two continuous (interval or ratio scale) variables.
Spearman Correlation
The Spearman correlation is used when:
1. Measuring the relationship between two ordinal variables.
2. Measuring the relationship between two variables that are related in a consistently
increasing or decreasing way, but not linearly.
Below is an example of some data that is related in a non-linear fashion. For this, we
would use the Spearman correlation:
Figure 1.
Let's calculate the Spearman correlation for the following data set:
Figure 2.
To calculate the Spearman correlation, we must first rank the scores:
Figure 3.
We then calculate the correlation using these new ranks:
Figure 4.
We find an r of -1.00, meaning that the ranks of our data have a perfect negative
relationship. As x increases, y decreases. As x decreases, y increases.
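The rank-then-correlate procedure can be sketched in Python as follows. The data set is made up (it is not the one in the figures), and tied scores are given the average of their ranks, which is a common convention but an assumption here. The helper names average_ranks and spearman are also made up for this illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson's r from deviations around each variable's mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

def spearman(xs, ys):
    """Spearman correlation: Pearson's r calculated on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y always falls as x rises, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [150, 60, 20, 8, 1]

print(spearman(x, y))
```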
Linear Regression
In a previous lecture on Pearson's r, we found two sets of data to be highly correlated:
Figure 1.
If we know that two variables are strongly correlated, we can use one variable to predict
the other using the following equations:
Figure 2.
Here, we first calculate beta1 and beta0 and place them in the top equation. Then, if we
plug an x into the equation, we can predict what our y value will be.
The stronger your correlation (that is, the closer r is to -1 or 1), the more accurate your
prediction will be.
First, we solve for beta1:
Figure 3.
We then use beta1's value to solve for beta0:
Figure 4.
Now, putting those values into the original equation, we have our completed regression
equation:
Figure 5.
Predict the yearly income of someone who is 33 years old.
Figure 6.
We would expect someone who is 33 years old to make approximately $36,963 a year.
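Here is a minimal Python sketch of the same procedure. It assumes the figures use the usual least-squares equations: beta1 is the sum of products of deviations divided by the sum of squared x deviations, beta0 is the mean of y minus beta1 times the mean of x, and the prediction is beta0 + beta1 * x. The ages and incomes are made-up values, so the prediction will not match the figure in the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept (beta0) and slope (beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

# Hypothetical data (NOT the data set from the figure).
ages = [22, 25, 31, 38, 44, 52]
incomes = [21000, 26000, 33000, 41000, 46000, 58000]

beta0, beta1 = fit_line(ages, incomes)

# Predict the yearly income of someone who is 33 years old.
predicted_income = beta0 + beta1 * 33
print(round(beta0, 2), round(beta1, 2), round(predicted_income, 2))
```

Predictions are most trustworthy for x values inside the range of the data used to fit the line.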
Correlation vs. Causation
Causation
Causation means that one variable causes a change in another variable.
Correlation
To say that two variables are correlated is to say that they share some kind of
relationship.
In order to infer causation, a true experiment must be performed where subjects are
randomly assigned to different conditions.
Here's an example of a true experiment where causation can be inferred:
Researchers want to test a new anti-anxiety medication. They randomly assign participants
to three conditions (0mg, 50mg, and 100mg), then ask them to rate their anxiety level on a
scale of 1-10. Are there any differences between the three conditions using alpha = 0.05?
Figure 1.
This is a true experiment because participants are randomly being assigned to different
conditions. Any differences between the three groups should only be due to the effects of
dosage.
Here's an example of correlational data:
Figure 2.
Here, we see that students who spend more time studying for tests tend to score better
than students who spend less time studying. However, because this is not a true
experiment, we cannot conclude that studying causes better test scores. Perhaps the high
scoring students in this sample were just better test takers.
