
CHAPTER 3-

THE NATURE OF STATISTICS

Copyright: CNE Mathematics in the Modern World


What we will do in this Chapter.
• We will study the nature of statistics.
• We will differentiate populations from samples.
• We will identify the levels of measurement of variables.
• We will study the different methods of presenting statistical data.
• We will calculate the different measures of central tendency.
• We will calculate z-scores and measures of relative standing.
• We will calculate the different measures of variation.
• We will determine if two variables are linearly correlated.
• We will determine if a "line of best fit" can model the relationship between two variables.
• We will solve problems involving the normal distribution.
The Nature of Statistics

Essential Questions
• What is the role of Statistics in our daily lives?
• When and how do we use technology to aid in solving statistical problems?
• What does it mean to properly analyze and describe data using graphical and numerical summaries?
• Why are data collected and analyzed?
• How do people use data to influence others?
• How can predictions be made based on data?
What is Statistics?

• Statistics is a word with two meanings. Most


people are aware of the mundane definition of
statistics as that branch of Mathematics that
involves collecting, organizing, summarizing, and
presenting data, such as baseball statistics or
statistics the government collects during a census.
• The larger definition of statistics is a discipline
concerned with the analysis of data and decision
making based upon the data. It can also be used to
spot trends or isolate causes.
• Statistics is based upon a solid edifice of
mathematical theorems proven through unassailable
laws of logic.
Descriptive and Inferential Statistics

• Descriptive Statistics is that branch that


deals with the description of data collected.

• Inferential Statistics deals with


examining the relationships between
variables within a sample and then making
generalizations or predictions about how
those variables will relate to a larger
population.
Population and Sample

• If a measurement is gathered for every experimental unit in the entire collection, the resulting data set constitutes the population of interest.
• Any smaller subset of measurements is a sample.
Variables : Characteristics of a sample
• Quantitative Variable – The
variable is numerical, so
operations such as adding and
averaging make sense.
• Qualitative Variable – The
variable describes an individual
through grouping or
categorization.
Examples of Quantitative Variables
• High School Grade Point Average
• Number of pets owned
• Bank account balance
• Number of stars in a solar system
• Average number of lottery tickets
sold
• How many cousins you have
• Distance travelled by migratory
birds
Examples of Qualitative / Categorical
Variables
• Class in college (freshman,
sophomore, junior, senior)
• Types of pets owned (dogs, cats,
birds, fishes)
• Favorite authors
• Preferred airline
• Hair color
• Race
• Type of hats
Levels of Measurements
• At the nominal (categorical) level of measurement, the values of the variable are used only to classify the data. At this level, words, letters, and alphanumeric symbols can be used as labels.
• The ordinal level of measurement depicts some ordered relationship among the observations on the variable.
• The interval level of measurement not only classifies and orders the measurements, but also specifies that the distances between adjacent points on the scale are equal along the whole scale, from low to high.
• At the ratio level of measurement, the observations, in addition to having equal intervals, have a true zero point, so ratios of values are meaningful.
Frequency Distribution Table
• Frequency tells you how often something
happened. The frequency of an observation
tells you the number of times the observation
occurs in the data.
• For example:
Bar Graph
Pie Graph
Time Series – Line Graph

[Figure: a line graph showing the Total Exports of Goods of the Philippines from the 1st quarter of 1998 to the 1st quarter of 2018. Horizontal axis: year, 1998 to 2018. Vertical axis: total exports of goods, 0 to 900,000.]
Now try these!

1. A newspaper website contains a poll asking people their opinion on a recent news article. What is the population?
2. To determine the average length of milkfish in a fishpond, researchers catch 50 fish and measure them. What are the sample and the population in this study?
3. Classify each measurement as qualitative or quantitative.
a. Eye color of a group of people
b. Daily high temperature of a city over several weeks
c. Annual income
d. Zip codes
e. Vital statistics of the Ms. Barangay contestants
Now try these!

A political scientist surveys 34 of the 250 representatives in the Legislature. Of them, 24 said they were supporting a new education bill, 8 said they were not supporting the bill, and 2 were undecided.
a. What is the population of this survey?
b. What is the size of the population?
c. What is the size of the sample?
d. Give the sample statistic for the proportion of representatives surveyed who said they were supporting the education bill.
e. Based on this sample, how many of the 250 representatives might be expected to support the education bill?
Graphical Presentation

• When do you use the bar graph,


pie chart, and line graph?
• How do you decide which graph
is most appropriate for a given
data set?
Now try these!

1. A botanist went to a forest to study plants and their flowers.

Color Frequency
Blue 15
Green 25
Red 30
White 18

Yellow 12

(a) What is the most appropriate graphical


display that can be used for these data?
(b) Use technology to construct the graph.
Now try these!

2. In a survey, adults were asked whether they are personally


worried about a variety of environmental concerns. The
numbers (out of 1740 surveyed) who indicated that they are
worried “a great deal” about some selected concerns are
summarized below.
Environmental Concern Frequency
Pollution of Drinking Water 596
Contamination of Soil and Water by Toxic Waste 324
Air Pollution 472
Global Warming 348

(a) What is the most appropriate graphical


display that can be used for these data?
(b) Use technology to construct the graph.
Measures of Central Tendency

• Arithmetic Mean
• The mean of a set of measurements is the sum of the measurements divided by the number of data points. If there are $n$ data points in a data set, and the data points are represented as $x_1, x_2, \ldots, x_n$, then the mean may be computed as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$
Example: Arithmetic Mean

• A cyclist recorded the number of miles per day he cycled for five days. The recordings were as follows: 13 10 12 10 11. What is the average number of miles he cycled per day?
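The computation for this example can be sketched in Python. The variable names are illustrative and are not part of the original slides.

```python
from statistics import mean

# Miles cycled per day over five days (data from the example above)
miles = [13, 10, 12, 10, 11]

# Mean = sum of the data points divided by the number of data points
mean_miles = sum(miles) / len(miles)

print("Sum of miles:", sum(miles))
print("Number of days:", len(miles))
print("Mean miles per day:", mean_miles)

# The standard-library helper computes the same quantity
print("statistics.mean:", mean(miles))
```

The sum of the five recordings is 56, so the mean is 56 / 5 = 11.2 miles per day.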
Measures of Central Tendency

• Median
• The median of a set of data arranged according to size (ascending or descending) is the value of the middle data point if the number of data points is odd, and the mean of the two middle data points if the number of data points is even.
Example: Median

• A student borrowed 7 Statistics books from the library. The numbers of pages of the books are as follows: 231 423 521 139 347 400 345. What is the median of this set of numbers?

• Arranging the numbers in ascending order yields 139 231 345 347 400 423 521. The middle (fourth) value is 347, so the median is 347.
Measures of Central Tendency

• Mode
• The mode is simply the value in a data
set that occurs with the highest
frequency and more than once.
• It is possible that in a set of data, there
is no mode, or more than 1 mode
Example: Mode

• A Math teacher recorded the following


special quiz scores (out of possible 10
points) for the 12 students who attended
a training and missed the quiz. The
scores were 7, 4, 4, 7, 2, 9, 10, 6, 7, 3, 8,
5.

• The mode is 7 since 7 appears three


times.
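The median and mode examples above can be checked with Python's standard `statistics` module. This is a minimal sketch with illustrative variable names. Note that `multimode` returns every value tied for the highest frequency, so for data in which each value occurs only once it returns all the values, whereas the definition on the slide would say there is no mode.

```python
from statistics import median, multimode

# Number of pages of the 7 borrowed books (median example)
pages = [231, 423, 521, 139, 347, 400, 345]

# Quiz scores of the 12 students (mode example)
scores = [7, 4, 4, 7, 2, 9, 10, 6, 7, 3, 8, 5]

# median() sorts the data internally and returns the middle value
median_pages = median(pages)

# multimode() returns a list of the most frequent value(s)
modes = multimode(scores)

print("Sorted pages:", sorted(pages))
print("Median:", median_pages)
print("Mode(s):", modes)
```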
Example: Mean (Frequency Distribution Table)

• Use the following frequency distribution


table to find the mean.

Values Frequency
20 2
29 4
30 4
39 3
44 2
Example: Mean (Frequency Distribution Table)

• Use the following frequency distribution


table to find the mean.

Values, x   Frequency, f   fx
20 2 40
29 4 116
30 4 120
39 3 117
44 2 88
Total 15 481

• The mean is $\bar{x} = \dfrac{\sum fx}{\sum f} = \dfrac{481}{15} \approx 32.07$.
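The frequency-table computation can be sketched in Python as a weighted sum. The list names `values` and `freqs` are illustrative.

```python
# Values and their frequencies from the table above
values = [20, 29, 30, 39, 44]
freqs = [2, 4, 4, 3, 2]

# Multiply each value by its frequency (the fx column) and add them up
fx = [x * f for x, f in zip(values, freqs)]
total_fx = sum(fx)
total_f = sum(freqs)

# Mean of a frequency distribution = (sum of fx) / (sum of f)
mean_x = total_fx / total_f

print("fx column:", fx)
print("Sum of f:", total_f)
print("Sum of fx:", total_fx)
print("Mean:", round(mean_x, 2))
```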
Now try these!
• A teacher records scores on a 20-point quiz
for the 30 students in his class. The scores are:
19 20 18 18 17 18 19 17 20 18
20 16 20 15 17 12 18 19 18 19
17 20 18 16 15 18 20 5 0 0
(a) Construct a frequency distribution
table.
(b) Find the mean, median, and
mode.
Now try these!

• An NBA player's scores in the 2018 NBA Finals are as follows:
35 46 26 44 27 42 15
(a) Find the mean.
(b) Find the median.
(c) Find the mode.
Now try these!
• In the 2018 Winter Olympics, the following
were the top 20 total number of medals earned
by the countries which participated.
39 31 29 23 20 17 17
15 15 14 14 13 10 9
7 6 5 3 3 2
(a) Find the mean.
(b) Find the median.
(c) Find the mode.
Now try these!
• A jogger kept records of the number of miles he
ran per week during the past year. The frequency
distribution below summarized the records. Find
the mean, median, and mode of the number of
miles per week that the jogger ran.
Miles Per Week Number of Weeks
1 5
2 4
3 10
4 8
5 10
6 7
7 4
Now try these!

• Consider the following frequency


distribution.
x frequency
20 2
29 4
30 8
35 5
42 3

Find the mean.


Measures of Variation

• Range - This is the simplest, but not a very useful, measure of dispersion. It is simply the difference between the highest and lowest observations in a set of data. Since it considers only the two extreme values in a data set, it does not give a real picture of the variation.

• Interquartile Range - The Interquartile Range (IQR) is the difference between the third and first quartiles. One half of the distribution lies within this range. It consists of the middle 50% of the observations, in that it cuts off the lower 25% and the upper 25% of the data points.
Measures of Variation

• The Variance and Standard Deviation

• The standard deviation is by far the most generally useful measure of variation. It is simply the square root of the variance. As measures of dispersion, the variance and the standard deviation measure the tendency of individual observations to deviate from the mean. They are measures of how spread out a distribution is.

• The variance is the mean of the squared deviations from the mean. That is, we find the amount by which each observation deviates from the mean, square those deviations, and then find the average of those squared deviations.
Measures of Variation

• Coefficient of Variation

• The Coefficient of Variation (C.V.) is used when comparing two or more sets of variables, especially when the units of measurement are different. This measure is much safer to use when two or more distributions with significantly different means (averages) are compared on the basis of the standard deviation. The formula to use is

$$C.V. = \frac{s}{\bar{x}} \times 100\%$$

where $s$ is the standard deviation and $\bar{x}$ is the mean.
• When this formula is used, the units cancel and the result is expressed as a percentage.
Example: Measures of Variation

• A sample of 10 students was asked by the


teacher to record the number of hours each
spent studying for a given exam from the
time the exam was announced in class.
The following data values were the
recorded number of hours: 12, 15, 8, 9, 14,
8, 17, 14, 8, 15. Find the range, standard
deviation, and variance.
Example: Measures of Variation

• The range is R = 17 – 8 = 9.
• The variance and the standard deviation are computed from the deviations from the mean, $\bar{x} = 120/10 = 12$:

Hours, x   (x − x̄)   (x − x̄)²
12 0 0
15 3 9
8 -4 16
9 -3 9
14 2 4
8 -4 16
17 5 25
14 2 4
8 -4 16
15 3 9
Total 120 0 108
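The table above can be reproduced in Python. Because the slides describe the variance as an average of squared deviations but also call the data a sample, this sketch reports both conventions: dividing by n − 1 (sample) and dividing by n (population). Which convention the original solution used is an assumption left open here.

```python
from statistics import mean, variance, stdev, pvariance, pstdev

# Hours spent studying by the 10 students
hours = [12, 15, 8, 9, 14, 8, 17, 14, 8, 15]

# Range = highest observation minus lowest observation
data_range = max(hours) - min(hours)

# Sum of squared deviations from the mean (last column of the table)
x_bar = mean(hours)
sum_sq_dev = sum((x - x_bar) ** 2 for x in hours)

# Sample convention: divide by n - 1
sample_var = variance(hours)
sample_sd = stdev(hours)

# Population convention: divide by n
pop_var = pvariance(hours)
pop_sd = pstdev(hours)

print("Range:", data_range)
print("Mean:", x_bar)
print("Sum of squared deviations:", sum_sq_dev)
print("Sample variance and SD:", sample_var, round(sample_sd, 2))
print("Population variance and SD:", pop_var, round(pop_sd, 2))
```

With the sum of squared deviations equal to 108, the sample variance is 108 / 9 = 12 and the population variance is 108 / 10 = 10.8.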
Example: Measures of Variation

• According to the weather office, the


average amount of rainfall in October
was 13.8 inches with a standard
deviation of 2.25 inches. During the
same month, the average wind speed
was 8.0 miles/hour with a standard
deviation of 1.2 miles/hour. Which of
the two variables, amount of rainfall
or wind speed, is more variable?
Example: Measures of Variation

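• Calculating the coefficient of variation for the two variables yields
• Amount of Rainfall: $C.V. = \dfrac{2.25}{13.8} \times 100\% \approx 16.30\%$
• Wind Speed: $C.V. = \dfrac{1.2}{8.0} \times 100\% = 15.00\%$

Therefore, the amount of rainfall is more variable than the wind speed relative to its mean.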
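The comparison above can be sketched in Python. The function name `coefficient_of_variation` is a hypothetical helper introduced only for illustration.

```python
def coefficient_of_variation(mean_value, std_dev):
    """Return the coefficient of variation as a percentage."""
    return std_dev / mean_value * 100


# Rainfall: mean 13.8 inches, standard deviation 2.25 inches
cv_rain = coefficient_of_variation(13.8, 2.25)

# Wind speed: mean 8.0 miles/hour, standard deviation 1.2 miles/hour
cv_wind = coefficient_of_variation(8.0, 1.2)

print("C.V. of rainfall (%):", round(cv_rain, 2))
print("C.V. of wind speed (%):", round(cv_wind, 2))
print("More variable:", "rainfall" if cv_rain > cv_wind else "wind speed")
```

Because the units cancel in each ratio, inches and miles per hour can be compared directly on the percentage scale.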
Now try these!
• A lady recorded the ages of her 15 students in
an Essay Writing class:
18 21 25 21 23
23 21 31 19 24
20 21 24 18 20
(a) Find the range.
(b) Find the variance and the
standard deviation.
Now try this!
• A consumer surveyed 10 different gasoline
stations in Metro Manila in May 2018. The
results are as follows:
55.75 54.85 51.89 54.75 55.23
55.75 56.17 53.25 53.98 55.64
Find the variance and the standard deviation
of the gasoline prices.
Now try this!
• The scores of an NBA player in the 2018
NBA finals were as follows:
35 46 26 44 27 42 15
Find the variance and the
standard deviation of the scores.
Now try this!
• The following frequency distribution table shows
the weights of 36 students in a certain Math class of
Mr. Valdriz. Find the variance and standard
deviation of the weights.

Weights, lbs (x) Frequency f

100 3
115 5
120 10
125 9
130 7
135 2
Now try this!
• The Internet was used to make a survey of the price per quart of synthetic motor oil for a high-performance go-cart. The sample data, converted to pesos per quart, are summarized in the following table. Determine the variance and the standard deviation of the price.

Price per quart, Pesos (x)   Number of websites
372.20 1
399.50 3
424.50 6
449.50 7
474.50 2
499.50 1
Now try these!
• LeBron James’ scores in the 2018 NBA Finals were as
follows:
35 46 26 44 27 42 15
Stephen Curry’s scores in the
same NBA Finals were as follows:
27 29 22 28 35 16 18
(a) Find the variance and the
standard deviation of the scores.
(b) Find the coefficient of variation for
each player. Who is more
consistent?
Now try this!
• According to the weather office, from 2000 to 2015, the average summer temperature in the capital was 26.36°C with a standard deviation of 1.48°C. The average precipitation in the same period was 0.48 m with a standard deviation of 0.13 m.
Which has a greater spread relative to its mean?
Measures of Relative Standing (Position)

• When we seek an answer to the question "How high or low is a data value relative to the others?", we want standardized measures that will work for practically all populations involving quantitative data.
• The common measures of position are
percentiles, quartiles, and standard scores (z-
scores).
• Measures of relative standing can be used to
compare values from different data sets, or to
compare values within the same data set.
Measures of Relative Standing (Position)

• Percentiles
• Assume that the elements in a data set are rank
ordered from the smallest to the largest. The values that
divide a rank-ordered set of elements into 100 equal parts
are called percentiles.
• An element having a percentile rank of Pi would have
a greater value than i percent of all the elements in the
set. Thus, the observation at the 50th percentile would be
denoted P50, and it would be greater than 50 percent of
the observations in the set. An observation at the 50th
percentile would correspond to the median value in the
set.
Measures of Relative Standing (Position)

• Quartiles
• Quartiles divide a rank-ordered data set into
four equal parts. The values that divide each
part are called the first, second, and third
quartiles; and they are denoted by Q1, Q2, and
Q3, respectively.
• Note the relationship between quartiles and
percentiles. Q1 corresponds to P25,
Q2 corresponds to P50, Q3 corresponds to P75.
Q2 is the median value in the set.
Measures of Relative Standing (Position)
• Standard Scores (z-Scores)
• A standard score (z-score) indicates how many standard deviations an element is from the mean.
• A standard score can be calculated using the following formula:

$$z = \frac{X - \mu}{\sigma}$$

where z is the z-score, X is the value of the element, μ is the mean of the population, and σ is the standard deviation.
• Here is how to interpret z-scores.
• A z-score less than 0 represents an element less than the mean.
• A z-score greater than 0 represents an element greater than the mean.
• A z-score equal to 0 represents an element equal to the mean.
• A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
• A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.
Example: Measures of Relative Standing
(Position)
• A national achievement test is administered
annually to Grade 6 pupils before they graduate
from elementary. The test has a mean score of
100 and a standard deviation of 15. If Zion’s z-
score is 1.20, what was his score in the test?
• Solving the z-score formula for Zion's test score (X), we get

X = (z)(σ) + μ
X = (1.20)(15) + 100
X = 18 + 100 = 118
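The conversion between a raw score and a z-score can be sketched in Python. The function names `z_score` and `raw_score` are hypothetical helpers introduced for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def raw_score(z, mu, sigma):
    """Invert the z-score formula: X = z * sigma + mu."""
    return z * sigma + mu


# National achievement test: mean 100, standard deviation 15
mu, sigma = 100, 15

# Zion's z-score is 1.20; recover his test score
zion_score = raw_score(1.20, mu, sigma)
print("Zion's score:", zion_score)

# Converting back recovers the original z-score
print("z-score check:", z_score(zion_score, mu, sigma))
```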
Outliers

• An outlier is an observation that lies


an abnormal distance from other values in
a random sample from a population.

• It is one or more data points that are


too large or too small compared to the
rest of the data points in the data set.
Linear Correlation

• The purpose of a LINEAR CORRELATION ANALYSIS is to determine whether there is a linear relationship between two variables.
• We may find that:
(1) there is a positive
correlation,
(2) there is a negative
correlation, or
(3) there is no correlation.
Positive Correlation Scatter Diagram

[Figure 1: Relationship between height and trunk diameter in Eastern White Pines. Scatter diagram; horizontal axis: Height (m).]
Negative Correlation Scatter Diagram

[Figure 2: Relationship between incidence of an apple parasite and fruit harvest. Scatter diagram; horizontal axis: Codling moths trapped per acre.]
No Correlation Scatter Diagram

[Figure 3: Relationship between density of pillbugs and red clover. Scatter diagram; horizontal axis: Pillbugs/Square Meter.]
Correlation Coefficient

• One of the measures of the degree of linear correlation between two variables is called the coefficient of correlation, denoted by $r$.
• A correlation coefficient is a numerical measure of the linear relationship between two variables.
• Correlation coefficients can lie between -1 and 1 inclusive. That is, $-1 \le r \le 1$.
Correlation Coefficient

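The Pearson Product Moment Correlation coefficient is

$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$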
Example: Correlation Coefficient

• For the following data on heights and weights of


14 UAAP Basketball players, determine if heights
and weights are correlated.
Height, X   Weight, Y   X²   Y²   XY
77 230 5929 52900 17710
76 225 5776 50625 17100
77 241 5929 58081 18557
72 209 5184 43681 15048
76 225 5776 50625 17100
77 235 5929 55225 18095
77 228 5929 51984 17556
74 214 5476 45796 15836
74 240 5476 57600 17760
75 233 5625 54289 17475
74 225 5476 50625 16650
76 220 5776 48400 16720
74 222 5476 49284 16428
76 225 5776 50625 17100
Total: 1055 3172 79533 719740 239135
Example: Correlation Coefficient

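$$r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

$$r = \frac{14(239{,}135) - (1{,}055)(3{,}172)}{\sqrt{\left[14(79{,}533) - (1{,}055)^2\right]\left[14(719{,}740) - (3{,}172)^2\right]}} = \frac{1{,}430}{\sqrt{(437)(14{,}776)}} \approx 0.56$$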

There is a positive correlation


between heights and weights of the 14
players.
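The correlation coefficient for this example can be reproduced in Python directly from the column sums. This is a minimal sketch; the variable names are illustrative.

```python
from math import sqrt

# (height, weight) pairs for the 14 players, from the table above
data = [
    (77, 230), (76, 225), (77, 241), (72, 209), (76, 225),
    (77, 235), (77, 228), (74, 214), (74, 240), (75, 233),
    (74, 225), (76, 220), (74, 222), (76, 225),
]

n = len(data)
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_y2 = sum(y * y for _, y in data)
sum_xy = sum(x * y for x, y in data)

# Pearson product moment correlation coefficient
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print("Sums:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print("r =", round(r, 4))
```

The column sums match the totals row of the table, and the resulting value of r is positive, in agreement with the conclusion above.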
Linear Regression

• Linear regression attempts to model the relationship


between two variables by fitting a linear equation to the
observed data.

• One variable is considered to be an independent variable,


and the other is considered to be a dependent variable.

• For example, a modeler might want to relate the weights


of individuals to their heights using a linear regression
model.

• A linear regression line has an equation of the form y = mx + b, where x is the independent variable and y is the dependent variable. The slope of the line is m, and b is the y-intercept (the value of y when x = 0).
Example: Linear Regression

• For the following data on heights and weights of


14 UAAP Basketball players, determine if heights
and weights are correlated. Find a regression
equation that will predict weight from height.
Height, X   Weight, Y   X²   Y²   XY
77 230 5929 52900 17710
76 225 5776 50625 17100
77 241 5929 58081 18557
72 209 5184 43681 15048
76 225 5776 50625 17100
77 235 5929 55225 18095
77 228 5929 51984 17556
74 214 5476 45796 15836
74 240 5476 57600 17760
75 233 5625 54289 17475
74 225 5476 50625 16650
76 220 5776 48400 16720
74 222 5476 49284 16428
76 225 5776 50625 17100
Total: 1055 3172 79533 719740 239135
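A regression equation for this example can be sketched in Python from the column totals. The slides do not state how the slope and intercept are obtained, so this sketch assumes the usual least-squares formulas, m = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) and b = ȳ − m·x̄. The function name `predict_weight` is hypothetical.

```python
# Column totals from the table above
n = 14
sum_x, sum_y = 1055, 3172
sum_x2, sum_xy = 79533, 239135

# Least-squares slope and intercept (assumed method, see lead-in)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - m * sum_x / n


def predict_weight(height):
    """Predicted weight for a given height using y = m*x + b."""
    return m * height + b


print("Slope m:", round(m, 4))
print("Intercept b:", round(b, 4))
print("Predicted weight at height 75:", round(predict_weight(75), 1))
```

The fitted line always passes through the point of means (x̄, ȳ), which gives a quick check on the arithmetic. Predictions are only meaningful for heights near the observed range of 72 to 77.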
Normal Distribution

• Data can be "distributed" (spread out) in different ways. It can be spread out more on the left, or more on the right, or it can be all jumbled up.


Normal Distribution

• But there are many cases where the data tends to


be around a central value with no bias left or
right, and it gets close to a "Normal Distribution"
like this:

It is often called a "Bell Curve" because it looks like a bell.
Normal Distribution

• Many things closely follow a Normal


Distribution. For example:
• heights of people
• size of things produced by machines
• errors in measurements
• blood pressure
• marks on a test
Normal Distribution

• Properties of a normal distribution


• The mean, median, and mode are all equal.
• The curve is symmetric about the center (i.e. around the mean, μ).
• Exactly half of the values are to the left of center and exactly half of the values are to the right.
• The total area under the curve is 1.
Normal Distribution

• In a normal curve, the empirical rule tells us approximately what percentage of the data falls within a certain number of standard deviations from the mean:
• About 68% of the data falls within one standard deviation of the mean.
• About 95% of the data falls within two standard deviations of the mean.
• About 99.7% of the data falls within three standard deviations of the mean.
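The percentages in the empirical rule can be checked numerically with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a minimal sketch; the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {area_within(k):.4f}")
```

The computed areas are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated as an approximation.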
Normal Distribution

• To find the probability that a normal random variable $x$ lies in the interval from $a$ to $b$, we need to find the area under the normal curve between the points $a$ and $b$.
• We use a table of areas under a normal curve, like this one.
The Standard Normal Random Variable

• A normal random variable $x$ is standardized by expressing its value as the number of standard deviations ($\sigma$) it lies to the left or right of the mean $\mu$.
• The standardized normal random variable, $z$, is defined as $z = \dfrac{x - \mu}{\sigma}$, or equivalently $x = \mu + z\sigma$.
• From the formula for $z$ we can draw these conclusions:
• When $x$ is less than the mean $\mu$, the value of $z$ is negative.
• When $x$ is more than the mean $\mu$, the value of $z$ is positive.
• When $x = \mu$, the value of $z$ is 0.
Finding Areas Under a Normal Curve

• The probability distribution for $z$ is called the standardized normal distribution because its mean is 0 and its standard deviation is 1.
• Values of z on the left side of the mean are negative, while values of z on the right side are positive.
• The area under the standard normal curve to the left of a specified value of $z$, say $z_0$, is the probability $P(z \le z_0)$.
• The table of areas under a normal curve contains both positive and negative values of z. The left-hand column gives the value of z to the first decimal place (tenths); the second decimal place of z, corresponding to hundredths, is given across the top row.
Example: Finding Areas Under a Normal Curve
• Find $P(z \ge 1.45)$.
This probability corresponds to the area to the right of a point z = 1.45 standard deviations to the right of the mean. The required area is shown below.

Using the table, proceed down the left-hand column of the table to z = 1.4 and across the top of the table to the column marked 0.05. The intersection of this row and column gives the area 0.9265, which is $P(z \le 1.45)$. Since the total area under the curve is 1, we find $P(z \ge 1.45) = 1 - 0.9265 = 0.0735$.
Example: Finding Areas Under a Normal Curve
• Find $P(-2.43 \le z \le 1.84)$.
This probability corresponds to the area between the points z = -2.43 standard deviations to the left of the mean and z = 1.84 standard deviations to the right of the mean. The required area is shown below.

The area to the left of z = -2.43 is 0.0075 and the area to the left of z = 1.84 is 0.9671. To find the required area, subtract the two areas, giving
0.9671 - 0.0075 = 0.9596.
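The table lookups in the two examples above can be reproduced with `statistics.NormalDist` from the Python standard library, whose `cdf` method returns the area to the left of a given z. This is a sketch using software in place of the printed table; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z_dist = NormalDist(mu=0, sigma=1)

# Area to the right of z = 1.45 is 1 minus the area to its left
left_of_145 = z_dist.cdf(1.45)
right_of_145 = 1 - left_of_145

# Area between z = -2.43 and z = 1.84 is the difference of two left areas
between = z_dist.cdf(1.84) - z_dist.cdf(-2.43)

print("Area to the left of 1.45:", round(left_of_145, 4))
print("Area to the right of 1.45:", round(right_of_145, 4))
print("Area between -2.43 and 1.84:", round(between, 4))
```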
Example: Applications

• Suppose that in a bowling tournament, the


scores among all bowlers are normally
distributed with mean with a standard deviation

(a) What proportions of the players


scored less than 175 points?
(b) What proportions of the players
scored more than 200 points?
(c) What proportions of the players
scored between 180 and 210
points?
Example: Applications

Given: Pop. Mean=390 sec.; Pop. Std. Dev.=148 sec.


(a)If a single video is selected at random, what is
the probability that the running time of the
video exceeds 6 minutes?

(b)If a single video is selected at random, what is


the probability that the running time of the
video is less than 6.2 minutes?
Example: Applications

(c) If a single video is selected at


random, what is the probability that
the running time of the video is
between 6.8 minutes and 10.2
minutes?
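The three video questions can be set up in Python as follows. This sketch assumes that running times are normally distributed with the given population mean (390 seconds) and standard deviation (148 seconds); that assumption is implied by the placement of the example but is not stated in the surviving text. Times given in minutes are converted to seconds first.

```python
from statistics import NormalDist

# Assumed model: running time ~ Normal(mean 390 s, standard deviation 148 s)
running_time = NormalDist(mu=390, sigma=148)


def to_seconds(minutes):
    """Convert minutes to seconds so units match the given parameters."""
    return minutes * 60


# (a) P(running time exceeds 6 minutes)
p_a = 1 - running_time.cdf(to_seconds(6))

# (b) P(running time is less than 6.2 minutes)
p_b = running_time.cdf(to_seconds(6.2))

# (c) P(running time is between 6.8 and 10.2 minutes)
p_c = running_time.cdf(to_seconds(10.2)) - running_time.cdf(to_seconds(6.8))

print("(a)", round(p_a, 4))
print("(b)", round(p_b, 4))
print("(c)", round(p_c, 4))
```

Answers obtained from a printed z table may differ slightly in the last decimal place, because the table rounds z to two decimal places.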
Now try these!

1. Trees in a certain forest have heights


that are normally distributed with
mean 112 inches and a standard
deviation inches.
(a) What proportions of trees are
more than 120 inches?
(b) What proportions of these trees
are less than 100 inches?
(c) What is the probability that a
randomly chosen tree is
between 90 and 100 inches
tall?
Now try these!

2. The weights of male basketball players on a certain collegiate


basketball league are normally distributed with a mean of 180
pounds and a standard deviation of 26 pounds. If a player is
randomly selected:

(a) what is the probability that the player weighs less than 225
pounds?

(b) what is the probability that the player weighs more than 225
pounds?

(c) what is the probability that the randomly chosen player


weighs between 180 and 225 pounds?
