Topic 5 Data Management (Statistics)
Introduction and Terminologies
Statistics is a branch of Mathematics dealing with the collection,
presentation, analysis, and interpretation of data.
History of Statistics
The term statistics came from the Latin phrase “ratio status”
which means study of practical politics or the statesman’s
art.
In the middle of the 18th century, the term statistik (a term due
to Achenwall) was used, a German term defined as “the
political science of several countries.”
Based on the results, it was concluded that the new milk formulation is
effective in improving the psychomotor development of infants.
Inferential Statistics
[Diagram relating a Larger Set (N units/observations), a Smaller Set (n units/observations), and Inferences and Generalizations]
Definition of Some Basic Statistical Terms
Qualitative data
▪ Describes the quality or character of something
Quantitative data
▪ Describes the amount or number of something
a. Discrete
-countable
b. Continuous
-measurable (measured using a continuous scale
such as kilos, cms, grams)
Levels of Measurement
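Mean ($\mu$ or $\bar{X}$)
F1: $\bar{X} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \dfrac{\sum_{i=1}^{n} x_i}{N}$
F2: $\bar{X} = \dfrac{f_1 X_1 + f_2 X_2 + f_3 X_3 + \dots + f_n X_n}{f_1 + f_2 + f_3 + \dots + f_n} = \dfrac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$ or $\dfrac{\sum_{i=1}^{n} f_i X_i}{N}$
Where $f_1 + f_2 + f_3 + \dots + f_n = \sum f_i = N$
F3: $\bar{X} = \dfrac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$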
Examples
Solution:
$\bar{X} = \dfrac{5 + 9 + 10 + 11 + 15}{5} = 10$
2. The salaries of 6 employees were P11,000, P9,000, P12,500,
P20,000, P15,000, and P24,000. What is the average salary?
Solution:
$\bar{X} = \dfrac{11{,}000 + 9{,}000 + 12{,}500 + 20{,}000 + 15{,}000 + 24{,}000}{6} = 15{,}250$
Examples
3. Cloe has test scores in her Mathematics
class of 84, 92, 78, 82, and 90. Find the mean
of her test scores.
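The three formulas for the mean can be checked with a short Python sketch. The function names `mean`, `frequency_mean`, and `weighted_mean` are illustrative choices, not part of the lesson; the data come from the three examples in this section.

```python
def mean(values):
    """F1: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def frequency_mean(values, freqs):
    """F2: each value X_i is counted f_i times; N is the sum of the frequencies."""
    total_n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, values)) / total_n


def weighted_mean(values, weights):
    """F3: sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


example_1 = mean([5, 9, 10, 11, 15])
salaries = mean([11_000, 9_000, 12_500, 20_000, 15_000, 24_000])
cloe = mean([84, 92, 78, 82, 90])
print(example_1, salaries, cloe)
```

F2 and F3 have the same shape: a frequency is simply a weight that counts how many times a value occurs.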
• To compute, arrange the scores from lowest to highest. If N is odd, the median is the value
of the middle term, the $\frac{N + 1}{2}$th term. If N is even, there are 2 middle terms: the $\frac{N}{2}$th term and the next
term. The median is the average of these values.
Examples:
$\tilde{X} = 10$
$n = 6$: $\tilde{X} = \dfrac{X_{n/2} + X_{(n+2)/2}}{2} = \dfrac{X_{6/2} + X_{(6+2)/2}}{2} = \dfrac{X_3 + X_4}{2} = \dfrac{6 + 8}{2} = 7$
$\tilde{X} = 7$
The heights, in inches, of the 5 male faculty members of
the Mathematics Department of the CIT – University are 63
in., 68 in., 60 in., 70 in., and 65 in. Find the median of the
heights.
Solution:
Arranging the heights from lowest to highest, we have
60 63 65 68 70.
Since n = 5 (odd), the median is the middle value, 65.
Example 3: Find the median for the data in the
following lists. 5, 18, 12, 4, 21, 16.
Solution:
Arranging the values from lowest to highest, we have
4 5 12 16 18 21.
Since n = 6 (even), the median is the arithmetic
mean of the 2 middle values, 12 and 16.
Median = $\dfrac{12 + 16}{2}$
Median = 14
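The odd and even cases of the median rule can be sketched in Python as follows. The function name `median` is an illustrative choice; the two data sets are the faculty heights and the list from Example 3.

```python
def median(values):
    """Rank the values, then take the middle term (odd n) or the
    average of the two middle terms (even n)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th term sits at 0-based index n // 2.
        return ranked[mid]
    # Even n: average the (n/2)-th term and the next term.
    return (ranked[mid - 1] + ranked[mid]) / 2


heights = median([63, 68, 60, 70, 65])
example_3 = median([5, 18, 12, 4, 21, 16])
print(heights, example_3)
```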
Mode
The mode of a set of data is the value which occurs most often or with
the greatest frequency.
Not all sets of data have modes. In some cases all the
numbers occur with equal frequency, hence the set has no mode or it is
non-modal. On the other hand, some sets of data may have several values
that occur with the same greatest frequency. In this case, there is more than one
mode (bi-modal, tri-modal, etc.).
Example 4: During the fire last March in Subangdako, Mandaue City, the first
10 donors who extended monetary help to the victims gave P 200, P 500,
P 400, P 200, P 300, P 100, P 200, P 300, P 1,000, P 800. Find the mode of
the donations.
Mode
1. 2 5 2 3 5 2 1
X̂ = 2
2. 1 2 3 3 2 1 4
X̂ = 1, 2 , and 3
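A small Python sketch of the mode rule, covering the non-modal and multi-modal cases described above. The function name `modes` and the choice to return an empty list for a non-modal set are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.
    An empty list means the set is non-modal (all values tie)."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


list_1 = modes([2, 5, 2, 3, 5, 2, 1])
list_2 = modes([1, 2, 3, 3, 2, 1, 4])
donations = modes([200, 500, 400, 200, 300, 100, 200, 300, 1000, 800])
print(list_1, list_2, donations)
```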
Solution:
Range = P 44,200 – P8,200
Range = P 36,000
While the range is the most easily calculated measure of variability, the
range is dependent entirely on 2 extreme values and is insensitive to what is
happening in between. Thus, it is considered as the least satisfactory measure
of dispersion.
The Variance
Variance is the measure of variability that indicates
dispersion around the mean. It makes use of the
individual amount by which each data value deviates from
the mean. These deviations, represented by $(x - \bar{x})$, are
positive when the data value is greater than the mean
and negative when the data value is less than the
mean. The sum of all the deviations $(x - \bar{x})$ is 0 for every set
of data and thus cannot be used in computing
variance; instead, we make use of the squares of the
deviations.
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
Example 6: During the Bb. Pilipinas Beauty Pageant last April 18, 2018, the
scores of the 6 judges for the winning candidate were 92, 94, 88, 97, 85, and
90. Compute the variance of the scores.
Solution:
$\mu = \dfrac{92 + 94 + 88 + 97 + 85 + 90}{6}$
$\mu = 91$
$\sum (x - \mu)^2 = (92 - 91)^2 + (94 - 91)^2 + (88 - 91)^2 + (97 - 91)^2 + (85 - 91)^2 + (90 - 91)^2$
$\sum (x - \mu)^2 = 92$
$\sigma^2 = \dfrac{\sum (x - \mu)^2}{n}$
$\sigma^2 = \dfrac{92}{6}$
$\sigma^2 = 15.33$
Example 7: The following are samples taken for the volume content of a
500 – ml pack orange juice by Sunkist Orange.
501 ml, 498 ml, 505 ml, 492 ml, 500 ml.
Solution:
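$\bar{x} = \dfrac{501 + 498 + 505 + 492 + 500}{5}$
$\bar{x} = 499.2$
$\sum (x - \bar{x})^2 = (501 - 499.2)^2 + (498 - 499.2)^2 + (505 - 499.2)^2 + (492 - 499.2)^2 + (500 - 499.2)^2$
$\sum (x - \bar{x})^2 = 90.8$
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
$s^2 = \dfrac{90.8}{5 - 1}$
$s^2 = 22.7$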
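Example 6 divides the sum of squared deviations by n (the population formula), while Example 7 divides by n − 1 (the sample formula). The sketch below shows both side by side; the function names are illustrative, not part of the lesson.

```python
def population_variance(values):
    """sigma^2: squared deviations from the mean, divided by n."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """s^2: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


judge_scores = [92, 94, 88, 97, 85, 90]   # Example 6
juice_volumes = [501, 498, 505, 492, 500]  # Example 7
sigma_sq = population_variance(judge_scores)
s_sq = sample_variance(juice_volumes)
print(round(sigma_sq, 2), round(s_sq, 2))
```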
The Standard Deviation
Standard Deviation is the square root of the variance. Typically, standard
deviation is represented by $\sigma$ for a population and $s$ for a sample, and it is defined in
the same manner as variance.
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{n}}$
If $x_1, x_2, x_3, \dots, x_n$ is a sample of $n$ numbers with a mean of $\bar{x}$, then
the standard deviation of the sample is
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Example 8: The table below shows the measurements, in liters, for 2 samples
of soft drinks bottled by companies X and Y.
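$\bar{x} = 1$
$\sum (x - \bar{x})^2 = \dots + (1 - 1)^2 + (1.01 - 1)^2$
$\sum (x - \bar{x})^2 = 0.0084$
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
$s = \sqrt{\dfrac{0.0084}{5 - 1}}$
$s = 0.0458$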
2. Sample Y: $\bar{x} = \dfrac{1.06 + 1.12 + 0.90 + 0.91 + 1.01}{5}$
$\bar{x} = 1$
$\sum (x - \bar{x})^2 = (1.06 - 1)^2 + (1.12 - 1)^2 + (0.90 - 1)^2 + (0.91 - 1)^2 + (1.01 - 1)^2$
$\sum (x - \bar{x})^2 = 0.0362$
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
$s = \sqrt{\dfrac{0.0362}{5 - 1}}$
$s = 0.0951$
3. Since the standard deviation of Company X is
smaller, the soft drinks bottled by
Company X are more consistent with regard to volume
content than those bottled by Company Y.
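The comparison in Example 8 can be sketched with Python's standard `statistics` module, whose `stdev` function uses the sample formula (division by n − 1). Only the sample Y measurements appear in the solution, so the value for sample X below is the reported result 0.0458 rather than a recomputation; the variable names are illustrative.

```python
import statistics

# Sample Y measurements, in liters, as listed in the solution.
sample_y = [1.06, 1.12, 0.90, 0.91, 1.01]
s_y = statistics.stdev(sample_y)  # sample standard deviation (n - 1)

# Reported result for sample X; its raw measurements are not listed here.
s_x_reported = 0.0458

# The smaller standard deviation marks the more consistent company.
more_consistent = "X" if s_x_reported < s_y else "Y"
print(round(s_y, 4), more_consistent)
```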
3. 50% - 30% = 20% of the drivers have salaries between
P 105,782 and P 172,840.
A teacher gives a 20-point test to 10 students. The
scores are shown below. Find the percentile rank of a
score of 15.
10, 20, 3, 5, 6, 8, 18, 12, 15, 2
Solution:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
5, 6, 6, 8, 20, 15, 14, 16, 18, 11, 13, 14, 16, 9, 8, 10.
Solution:
1. P20 is at $\frac{20}{100}(16) = 3.2$, which is not a whole number, so take the next
higher whole number as the position. P20 is the 4th score.
P20 = 8. This implies that 20% of the set of scores is below 8 or
80% is above 8.
2. P50 is at $\frac{50}{100}(16) = 8$, which is a whole number, so take the position
halfway between the 8th and 9th scores.
$P_{50} = \dfrac{8\text{th} + 9\text{th}}{2} = \dfrac{11 + 13}{2} = 12$.
1. Q1 is at $\frac{1}{4}(16) = 4$, which is a whole number, so take the position
between the 4th and 5th scores.
$Q_1 = \dfrac{4\text{th} + 5\text{th}}{2} = \dfrac{8 + 8}{2} = 8$
2. D4 is at $\frac{4}{10}(16) = 6.4$, which is not a whole number, so take the
next higher whole number as the position. D4 is the 7th score.
D4 = 10
3. D9 is at $\frac{9}{10}(16) = 14.4$, which is not a whole number, so take the
15th score.
D9 = 18
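The position rule used in these solutions (average two neighbors when the position is a whole number, otherwise round the position up) can be sketched as one Python function. The name `value_at` and its `k`/`parts` parameters are hypothetical; `parts` is 100 for percentiles, 10 for deciles, and 4 for quartiles. Other textbooks and software use different interpolation rules, so results may differ from library functions.

```python
def value_at(data, k, parts):
    """Value at the k-th of `parts` equal divisions of the ranked data,
    following the position rule of the worked solutions."""
    if not 0 < k < parts:
        raise ValueError("k must lie strictly between 0 and parts")
    ranked = sorted(data)
    n = len(ranked)
    if (k * n) % parts == 0:
        # Whole-number position: average that score and the next one.
        pos = k * n // parts
        return (ranked[pos - 1] + ranked[pos]) / 2
    # Otherwise take the next higher whole number as the position.
    pos = (k * n + parts - 1) // parts
    return ranked[pos - 1]


scores = [5, 6, 6, 8, 20, 15, 14, 16, 18, 11, 13, 14, 16, 9, 8, 10]
p20 = value_at(scores, 20, 100)
p50 = value_at(scores, 50, 100)
q1 = value_at(scores, 1, 4)
d4 = value_at(scores, 4, 10)
d9 = value_at(scores, 9, 10)
print(p20, p50, q1, d4, d9)
```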
Quartile Ranking
Quartiles are values that divide a set of data into 4 equal
parts, denoted by Q1, Q2, and Q3, such that 25% of the
data lie below Q1, 50% of the data lie below Q2, and 75%
of the data lie below Q3. Q1 is called the first quartile, Q2
is called the second quartile and it is the median of the
set, and Q3 is called the third quartile.
A method using medians can easily be employed in
finding quartiles. It has the following steps.
1. Form a ranked list of the data.
2. Find the median of the ranked data. This is the second
quartile, Q2.
3. The first quartile, Q1 is the median of the values less
than Q2.
4. The third quartile, Q3 is the median of the values
greater than Q2.
The following are the results of testing a sample of 9 batteries for the battery
life, in hours, of Ever Glow Battery.
Solution:
Rank the data: 5.3 5.6 6.2 6.4 6.8 7.1 7.2 8.3 9.3
Median ( Q2 )
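The median method for quartiles can be sketched in Python using the ranked battery data. The function names are illustrative. As an assumption, the lower and upper halves are taken by position (excluding the middle value when n is odd), which matches "values less than Q2" and "values greater than Q2" when there are no ties at the median.

```python
def median_of_ranked(ranked):
    """Median of a list that is already in ascending order."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(data):
    """Q1, Q2, Q3 by the median method (needs at least 2 values)."""
    ranked = sorted(data)
    n = len(ranked)
    q2 = median_of_ranked(ranked)
    lower_half = ranked[: n // 2]        # values below the median position
    upper_half = ranked[(n + 1) // 2:]   # values above the median position
    return median_of_ranked(lower_half), q2, median_of_ranked(upper_half)


battery_life = [5.3, 5.6, 6.2, 6.4, 6.8, 7.1, 7.2, 8.3, 9.3]
q1, q2, q3 = quartiles(battery_life)
print(round(q1, 2), q2, round(q3, 2))
```

For the 9 batteries, Q2 is the 5th ranked value, Q1 is the median of the first four values, and Q3 is the median of the last four.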
The z – score
One problem that may arise in statistics is comparing 2
observations from 2 different populations. Let’s say for
example, the supervisor of a construction project
evaluates the speed of his 2 workers for possible salary
increase. The first worker is laying hollow blocks, and he
can lay 120 hollow blocks per day. The second worker is
setting tiles and he can set 60 tiles in one day. Which
one is the faster worker? Of course we cannot decide
unless we have the basis for comparison.
Worker 1: $z_1 = \dfrac{x - \mu}{\sigma} = \dfrac{120 - 100}{5} = 4$ (4 standard deviations above the mean)
Worker 2: $z_2 = \dfrac{x - \mu}{\sigma} = \dfrac{60 - 50}{2} = 5$ (5 standard deviations above the mean)
Solution:
Mathematics: $z_1 = \dfrac{x - \mu}{\sigma}$
$z_1 = \dfrac{82 - 68}{8}$
$z_1 = 1.75$
Chemistry: $z_2 = \dfrac{x - \mu}{\sigma}$
$z_2 = \dfrac{89 - 80}{6}$
$z_2 = 1.5$
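Comparing observations from different populations reduces to comparing their z-scores. A minimal Python sketch, with `z_score` as an illustrative function name and the numbers taken from the two worked comparisons above:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


worker_1 = z_score(120, 100, 5)   # hollow blocks per day
worker_2 = z_score(60, 50, 2)     # tiles per day
math_z = z_score(82, 68, 8)
chem_z = z_score(89, 80, 6)

# The larger z-score marks the better performance relative to its own group.
print(worker_1, worker_2, math_z, chem_z)
```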
Solution:
32 43 48 55 62 69 74 83
38 43 49 59 63 72 76 84
38 45 49 59 63 72 77 85
40 45 51 62 64 72 79 85
40 46 54 62 65 74 83 93
Stem and leaf display of data - a device that is
useful for representing relatively small quantitative
data sets.
(Using the set of scores above.)
STEM LEAF
3 2,8,8
4 0,0,3,3,5,5,6,8,9,9
5 1,4,5,9,9
6 2,2,2,3,3,4,5,9
7 2,2,2,4,4,6,7,9
8 3,3,4,5,5
9 3
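A stem and leaf display like the one above can be produced with a few lines of Python. This is a sketch that assumes two-digit, non-negative whole-number scores, so the stem is the tens digit and the leaf is the units digit; the function name is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to its ranked leaves (units digits)."""
    rows = defaultdict(list)
    for v in sorted(values):
        rows[v // 10].append(v % 10)
    return dict(rows)


scores = [
    32, 43, 48, 55, 62, 69, 74, 83,
    38, 43, 49, 59, 63, 72, 76, 84,
    38, 45, 49, 59, 63, 72, 77, 85,
    40, 45, 51, 62, 64, 72, 79, 85,
    40, 46, 54, 62, 65, 74, 83, 93,
]
display = stem_and_leaf(scores)
for stem, leaves in display.items():
    print(stem, ",".join(str(leaf) for leaf in leaves))
```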
The Frequency Distribution Table (FDT)
48 79 83 84 62 62 43 72
45 46 59 93 64 59 32 54
83 55 45 76 72 40 51 72
65 49 62 85 74 40 74 49
69 38 85 77 63 38 43 63
Number of classes: K = 1 + 3.3221 log 40
= 6.3222
≈ 7 (rounded up to the next whole number)
Range: R = HS – LS = 93 – 32 = 61
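The first steps of building the frequency distribution table can be sketched in Python. The class width is not stated in the text; as an assumption it is taken as R / K rounded up, which gives 9 and agrees with the class boundaries 31.5, 40.5, ... shown with the histogram. Starting the first class at the lowest score is likewise an assumption.

```python
import math

scores = [
    48, 79, 83, 84, 62, 62, 43, 72,
    45, 46, 59, 93, 64, 59, 32, 54,
    83, 55, 45, 76, 72, 40, 51, 72,
    65, 49, 62, 85, 74, 40, 74, 49,
    69, 38, 85, 77, 63, 38, 43, 63,
]

n = len(scores)
k = math.ceil(1 + 3.3221 * math.log10(n))   # number of classes, rounded up
r = max(scores) - min(scores)               # range = HS - LS
width = math.ceil(r / k)                    # assumed class width

# Build (lower limit, upper limit, frequency) for each class.
table = []
for i in range(k):
    low = min(scores) + i * width
    high = low + width - 1
    freq = sum(low <= s <= high for s in scores)
    table.append((low, high, freq))

for low, high, freq in table:
    print(f"{low}-{high}: {freq}")
```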
[Figure axis: class marks 27, 36, 45, 54, 63, 72, 81, 90, 99]
Histogram
➢ Histogram is a chart in which the rectangular bars are constructed at the boundaries of
each class.
[Figure: histogram; the x-axis shows the class boundaries 31.5, 40.5, 49.5, 58.5, 67.5, 76.5, 85.5, 94.5]
For the < ogive, the x-axis values are the upper class boundaries and
the y-axis values are the < cumulative frequencies.
For the > ogive, the x-axis values are the lower class boundaries and
the y-axis values are the > cumulative frequencies.
Sample:SK =
Population: SK =
NORMAL DISTRIBUTION
Properties of a normal curve:
1. It is symmetrical about the mean.
2. The mean is equal to the median, which is also equal to
the mode.
3. The tails or ends are asymptotic relative to the horizontal
line.
4. The total area under the normal curve is equal to 1 or
100%.
5. The normal curve area may be subdivided into at least
three standard scores each to the left and to the right of the
vertical axis.
[Figure: normal curve; the area to the left of the mean is 0.5 and the area to the right is 0.5]
THE STANDARD NORMAL
RANDOM VARIABLE
A normal random variable x is standardized by expressing its
value as the number of standard deviations ($\sigma$) it lies to the
left or right of its mean ($\mu$). The standardized normal random
variable (z) is defined as
$z = \dfrac{x - \mu}{\sigma}$, or equivalently $x = \mu + z\sigma$.
$z = \dfrac{x - \text{Mean}}{s}$
Note:
1. When x is less than the mean, the value of z is negative.
2. When x is greater than the mean, the value of z is positive.
3. When x = mean, the value of z = 0
Examples
1. The mean and the standard deviation on an
examination are 70 and 10 respectively. Find
the scores in standard units of the students
receiving the marks
a) 65 b) 70 c) 87
2. Referring to the preceding problem, find the
marks corresponding to the standard scores
a) -1 b) 0.5 c) 1.25 d) –1.75
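Both directions of the conversion, marks to standard units and standard scores back to marks, can be sketched in Python. The function names `to_standard` and `from_standard` are illustrative; the mean 70 and standard deviation 10 come from the two problems above.

```python
def to_standard(x, mean, std_dev):
    """z = (x - mean) / standard deviation."""
    return (x - mean) / std_dev


def from_standard(z, mean, std_dev):
    """x = mean + z * standard deviation (the inverse conversion)."""
    return mean + z * std_dev


MEAN, STD_DEV = 70, 10
standard_units = [to_standard(x, MEAN, STD_DEV) for x in (65, 70, 87)]
marks = [from_standard(z, MEAN, STD_DEV) for z in (-1, 0.5, 1.25, -1.75)]
print(standard_units, marks)
```

A mark below the mean gives a negative z, a mark equal to the mean gives z = 0, and a mark above the mean gives a positive z, in line with the notes above.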
Example
Intelligence Quotient ( IQ ) scores are
distributed normally with mean of 100 and
standard deviation of 15.
1. What percentage of the population has
an IQ score below 85?
2. What percentage of the population has
an IQ score between 85 and 115?
3. What percentage of the population has
an IQ score above 120?
Solution:
[Figure: standard normal distribution with z = -1 and z = 0 marked]
Step 2: Set calculator mode to STAT.
Press: MODE ---- then 3: STAT ---- then AC
$z_2 = \dfrac{x - \mu}{\sigma} = \dfrac{115 - 100}{15} = 1$
This means that 115 is one standard deviation above the mean.
[Figure: standard normal distribution with z = -1, 0, and 1 marked]
Following the same procedure in pressing the
calculator, we get
P = P( 1 ) – P( - 1 ) = 0.68268
$z = \dfrac{x - \mu}{\sigma} = \dfrac{120 - 100}{15} = 1.33$
[Figure: standard normal distribution with z = 0 and z = 1.33 marked]
Since the total area under the normal curve is 1, to get the area to the right,
we simply subtract the area to the left of z = 1.33 from 1. We have
P = 1 – P( 1.33 ) = 0.09176
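The calculator results for the IQ example can be cross-checked with Python's standard `statistics.NormalDist` class (available in Python 3.8 and later). This is offered as an alternative check, not as the lesson's procedure. Note that the worked solution rounds z to 1.33 before finding the area, so the unrounded answer for part 3 differs slightly in the fourth decimal place.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

below_85 = iq.cdf(85)                      # area to the left of z = -1
between_85_115 = iq.cdf(115) - iq.cdf(85)  # area between z = -1 and z = 1
above_120 = 1 - iq.cdf(120)                # area to the right, z not rounded

# Same tail area with z rounded to 1.33, as in the worked solution.
above_z_1_33 = 1 - NormalDist().cdf(1.33)

print(round(below_85, 5), round(between_85_115, 5))
print(round(above_z_1_33, 5), round(above_120, 5))
```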
Solution:
Let X be the area under the standard normal curve.
z = 0.56
X = 1 – P(z < 0.56)
= 1 – 0.7123
= 0.2877
b) to the right of z = -0.75
X = 1 – P(z < -0.75)
= 1 – 0.2266
= 0.7734
4. Suppose the temperature last May was normally
distributed with mean 30°C and standard deviation
5.33°C. Find the probability that the temperature is
between 34.2°C and 36.45°C.
fx-991ES PLUS
*freq on: SHIFT – MODE – scroll down – 4 (STAT)
*MODE – 3 – 1 – [input the data] – AC – SHIFT – 1 – 4 – x̄ – sx – σx
Old model
*MODE – SD – X1 – M+ – X2 – M+ – … – Xn – M+ – SHIFT – S-Var – x̄ – sx – σx
Linear Regression and Correlation
Many decisions are based on the
relationship between 2
variables. For instance, a person’s blood
pressure may vary inversely with the
amount of hypertension medicine he takes.
A company’s market share may increase
directly with the advertising cost.
Correlation is a statistical method used to determine
whether a relationship between variables exists.
Regression is a statistical method used to describe
the nature of the relationship between variables,
that is, positive or negative, linear or nonlinear.
A scatter plot is a graph of the ordered pairs (x, y) of
numbers consisting of the independent variable x
and the dependent variable y.
Scatter Diagram
Consider the study made by a retail merchant to determine the relation
between monthly advertising expenditure and sales.
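$\hat{y} - \bar{y} = m(x - \bar{x})$ or $\hat{y} = \bar{y} + m(x - \bar{x})$
where: $\bar{x}$ = mean of the variable x
$\bar{y}$ = mean of the variable y
m = slope of the line, $m = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2}$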
The symbol $\hat{y}$ (pronounced "y-hat") is
used in place of y in the least-squares line
to distinguish it from the y-values in the
given ordered pairs.
Determine the equation of the least – squares line for the sales and advertising
relationship above.
Solution:
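$\sum x^2 = 53.68$
[Figure axis label: Advertising cost (in P 1,000's)]
$m = \dfrac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2} = \dfrac{1{,}529.8 - 8(2.5)(70.25)}{53.68 - 8(2.5)^2} = 33.91$
The equation is $\hat{y} = \bar{y} + m(x - \bar{x})$.
$\hat{y} = 70.25 + 33.91(x - 2.5)$
$\hat{y} = 33.91x - 14.525$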
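The slope and the resulting line can be reproduced from the summary values used in the solution. The sketch below is in Python; the function name and parameter names are illustrative, and the inputs (n = 8, mean of x = 2.5, mean of y = 70.25, sum of xy = 1,529.8, sum of x squared = 53.68) are the figures that appear in the worked computation. The intercept differs from -14.525 in the third decimal place because the solution rounds m to 33.91 before substituting.

```python
def least_squares_line(n, x_bar, y_bar, sum_xy, sum_x2):
    """Slope m and intercept b of y-hat = m*x + b from summary values."""
    m = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # y-hat = y_bar + m(x - x_bar) rearranges to m*x + (y_bar - m*x_bar).
    b = y_bar - m * x_bar
    return m, b


m, b = least_squares_line(n=8, x_bar=2.5, y_bar=70.25, sum_xy=1529.8, sum_x2=53.68)
print(round(m, 2), round(b, 2))


def predict(x):
    """Estimated y for a given x using the fitted line."""
    return m * x + b
```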
$\sum y = 36 + 44 + 48 + 71 + 78 + 90 + 95 + 100 = 562$
$\sum y^2 = 36^2 + 44^2 + 48^2 + 71^2 + 78^2 + 90^2 + 95^2 + 100^2 = 43{,}786$
$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$
$r = \dfrac{8(1{,}529.8) - (20)(562)}{\sqrt{\left[8(53.68) - (20)^2\right]\left[8(43{,}786) - (562)^2\right]}}$
$r = 0.9915$
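The correlation coefficient can be checked from the same summary sums with a short Python sketch. The function name `pearson_r` and its parameter names are illustrative; the inputs are the sums substituted in the solution above.

```python
import math


def pearson_r(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient r computed from summary sums."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(n=8, sum_x=20, sum_y=562, sum_xy=1529.8, sum_x2=53.68, sum_y2=43786)
print(round(r, 4))
```

A value of r this close to 1 indicates a strong positive linear relationship between advertising cost and sales.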
Activity
A real estate company conducts a survey of 15 of its agents. The table
below shows the number of minutes spent with each customer and the
number of sales in a month.
X 20 21 25 26 26 25 22 18 23 20 27 29 30 30 33
Y 8 10 12 15 11 10 10 9 11 11 12 14 14 15 18