
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES

COLLEGE OF ACCOUNTANCY AND FINANCE


BS ACCOUNTANCY
STA. MESA, MANILA

CHAPTER 4
STATISTICS

MATHEMATICS IN THE MODERN WORLD


GROUP 1

BSA 1-11

Antonio, Josefiel
Esguerra, Regina Dianne
Gaje, Michelle
Lubrico, Rica
Pitablo, Divine Grace

Prof Cynthia Equiza

4.3 MEASURES OF RELATIVE POSITION


PERCENTILES

A percentile is a comparison score between a particular score and the scores
of the rest of a group. It shows the percentage of scores that a particular score
surpassed.
Percentiles are commonly used to report scores in tests, like the SAT, GRE and
LSAT. For example, the 70th percentile on the 2013 GRE was 156. That means if you
scored 156 on the exam, your score was better than 70 percent of test takers.

• The 25th percentile is also called the first quartile.
• The 50th percentile is generally the median.
• The 75th percentile is also called the third quartile.
• The difference between the third and first quartiles is the interquartile range.

The data arranged in ascending or descending order can be divided into 100
equal parts by 99 values. These values are called percentiles and denoted by
P1,P2,...,P99. There is a slight difference in the calculation of percentiles in the case of
grouped and ungrouped data. Therefore, calculation for both types of data is explained
with the help of problems.

CALCULATION OF PERCENTILES FOR UNGROUPED DATA

Formula:

Px = value of the [x(n + 1)/100]th item of the ordered data, for x = 1, 2, ..., 99

Problem:

The data given below shows the marks obtained by the MBA students in statistics.

36,28,47,40,44,44,39,33,33,32,48,34,38,27,40,37,41,42,38,48,43,35,37,37

Find: P25, P50, and P75

Solution:
Arrange the data in ascending order

27,28,32,33,33,34,35,36,37,37,37,38,38,39,40,40,41,42,43,44,44,47,48,48

25th Percentile (P25)

P25 = value of the [25(24 + 1)/100]th = 6.25th item
= 34 + 0.25(35 − 34) = 34.25

50th Percentile (P50)

P50 = value of the [50(24 + 1)/100]th = 12.5th item
= 38 + 0.5(38 − 38) = 38

75th Percentile (P75)

P75 = value of the [75(24 + 1)/100]th = 18.75th item
= 42 + 0.75(43 − 42) = 42.75
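The calculations above can be checked in Python. This is a minimal sketch of the x(n + 1)/100 position method with linear interpolation between neighbouring sorted items; the function name percentile is ours, not from the text.

```python
def percentile(data, p):
    """P_p: the value at position p*(n+1)/100 of the sorted data,
    interpolating linearly between the two neighbouring items."""
    xs = sorted(data)
    n = len(xs)
    pos = p * (n + 1) / 100        # 1-based position in the ordered data
    lo = int(pos)                  # index of the item just below the position
    frac = pos - lo
    if lo < 1:                     # position before the first item
        return xs[0]
    if lo >= n:                    # position after the last item
        return xs[-1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

marks = [36, 28, 47, 40, 44, 44, 39, 33, 33, 32, 48, 34,
         38, 27, 40, 37, 41, 42, 38, 48, 43, 35, 37, 37]
print(percentile(marks, 25), percentile(marks, 50), percentile(marks, 75))
```

This reproduces the positions 6.25, 12.5, and 18.75 computed above.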

CALCULATION OF PERCENTILES FOR FREQUENCY DISTRIBUTION

In the case of frequency distribution, percentiles can be calculated by using the formula:

Px = l + (h/f)(xn/100 − C.F.)

where l is the lower class boundary of the class containing Px, h is the class width, f is
the frequency of that class, n is the total frequency, and C.F. is the cumulative frequency
of the preceding class.

Problem:
From the following frequency table of expenses incurred by employees of an
organization, calculate P25, P40, and P75.

Table 1.

Table 2.

Solution:

25th Percentile (P25)

40th Percentile (P40)


75th Percentile (P75)

QUARTILES
In statistics, quartiles, a type of quantile, are the three points that divide a sorted data
set into four equal groups (by count of numbers), each representing a fourth of the
sampled population. 

Quartiles in statistics are values that divide your data into quarters. However,
quartiles aren’t shaped like pizza slices; instead they divide your data into four
segments according to where the numbers fall on the number line. The quarters that
divide a data set into quartiles are:

• Minimum, or (rarely) "0th percentile": the smallest number in the group.
• 1st quartile, Q1, or 25th percentile: the number that separates the lowest 25% of
the group from the highest 75% of the group.
• Median, or 50th percentile: the number in the middle of the group, when
arranged from smallest to largest.
• 3rd quartile, Q3, or 75th percentile, also called the upper quartile (QU): the number
that separates the lowest 75% of the group from the highest 25% of the group.
• Maximum, or (rarely) "100th percentile": the largest number in the group.

A set of numbers (-3,-2,-1,0,1,2,3) divided into four quartiles.

As quartiles divide numbers up according to where their position is on the


number line, you have to put the numbers in order before you can figure out where the
quartiles are.

The quartiles are closely related to the histogram of a data set. Since area equals
the proportion of values in a histogram, the quartiles split the histogram into four
approximately equal areas.

CALCULATING QUARTILES
Formulas for Quartiles:

Q1 = value of the [(n + 1)/4]th item

Q2 = value of the [(n + 1)/2]th item

Q3 = value of the [3(n + 1)/4]th item

Example:
Following is the data of marks obtained by 20 students in a test of statistics;

53 74 82 42 39 20 81 68 58 28
67 54 93 70 30 55 36 37 29 61
Table 3.

In order to apply the formulas, we need to arrange the above data into ascending order, i.e. in
the form of an array.

20 28 29 30 36 37 39 42 53 54
55 58 61 67 68 70 74 81 82 93
Table 4.
Here, n = 20

Find: Q1

Solution:

Q1 = value of the [(20 + 1)/4]th = 5.25th item
= 36 + 0.25(37 − 36) = 36.25

Find: Q2

Solution:

Q2 = value of the [(20 + 1)/2]th = 10.5th item
= 54 + 0.5(55 − 54) = 54.5

Find: Q3

Solution:

Q3 = value of the [3(20 + 1)/4]th = 15.75th item
= 68 + 0.75(70 − 68) = 69.5
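The quartile positions can be verified the same way. This is a minimal sketch (the helper names quartiles and at are ours) using the (n + 1)/4, (n + 1)/2, and 3(n + 1)/4 positions with linear interpolation:

```python
def quartiles(data):
    """Q1, Q2, Q3 by the (n+1)/4, (n+1)/2, 3(n+1)/4 position method,
    interpolating between neighbouring sorted values."""
    xs = sorted(data)
    n = len(xs)

    def at(pos):
        lo = int(pos)              # item just below the position
        frac = pos - lo
        if lo >= n:
            return xs[-1]
        return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

    return at((n + 1) / 4), at((n + 1) / 2), at(3 * (n + 1) / 4)

marks = [53, 74, 82, 42, 39, 20, 81, 68, 58, 28,
         67, 54, 93, 70, 30, 55, 36, 37, 29, 61]
print(quartiles(marks))
```

Interpolating at positions 5.25, 10.5, and 15.75 gives Q1 = 36.25, Q2 = 54.5, and Q3 = 69.5 for this data set.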

QUARTILES FOR GROUPED DATA:

The quartiles may be determined from grouped data in the same way as the
median except that in place of n/2 we will use n/4. For calculating quartiles from
grouped data we will form cumulative frequency column. Quartiles for grouped data will
be calculated from the following formulae;

Q1 = l + (h/f)(n/4 − C.F.)

Q3 = l + (h/f)(3n/4 − C.F.)

Q2 = Median.

Where:

l = lower class boundary of the class containing the Q1 or Q3, i.e. the class
corresponding to the cumulative frequency in which n/4 or 3n/4 lies

h = class interval size of the class containing Q1 or Q3.

f = frequency of the class containing Q1 or Q3.

n = number of values, or the total frequency.

C.F = cumulative frequency of the class preceding the class containing Q1 or Q3.

EXAMPLE:

We will calculate the quartiles from the frequency distribution for the weight of 120
students as given in the table:

Weight (lb) Frequency (f) Class Boundaries Cumulative Frequency


110 – 119 1 109.5 – 119.5 1
120 – 129 4 119.5 – 129.5 5
130 – 139 17 129.5 – 139.5 22
140 – 149 28 139.5 – 149.5 50
150 – 159 25 149.5 – 159.5 75
160 – 169 18 159.5 – 169.5 93
170 – 179 13 169.5 – 179.5 106
180 – 189 6 179.5 – 189.5 112
190 – 199 5 189.5 – 199.5 117
200 – 209 2 199.5 – 209.5 119
210 – 219 1 209.5 – 219.5 120
∑f = n = 120
Table 5.

i. The first quartile Q1 is the value of the (n/4)th = (120/4)th = 30th item from the lower
end. From the table, we see that the cumulative frequency of the third class is 22 and
that of the fourth class is 50. Thus Q1 lies in the fourth class, i.e. 140 – 149.

Q1 = 139.5 + (10/28)(30 − 22) = 139.5 + 2.86 = 142.36

ii. The third quartile Q3 is the value of the (3n/4)th = 90th item from the lower end. The
cumulative frequency of the fifth class is 75 and that of the sixth class is 93. Thus, Q3
lies in the sixth class, i.e. 160 – 169.

Q3 = 159.5 + (10/18)(90 − 75) = 159.5 + 8.33 = 167.83

Conclusion
From Q1 and Q3 we conclude that 25% of the students weigh 142.36 pounds or
less and 75% of the students weigh 167.83 pounds or less.
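The grouped-data computation can be sketched in Python as follows; the function name grouped_quartile is ours, and classes are given as (lower boundary, upper boundary, frequency) triples taken from Table 5.

```python
def grouped_quartile(classes, k):
    """k-th quartile (k = 1, 2, 3) from grouped data:
    Q_k = l + (h/f) * (k*n/4 - CF), where l is the lower boundary of the
    class holding the (k*n/4)-th item, h its width, f its frequency,
    and CF the cumulative frequency of the classes before it."""
    n = sum(f for _, _, f in classes)
    target = k * n / 4
    cf = 0
    for lower, upper, f in classes:
        if cf + f >= target:       # the quartile lies in this class
            return lower + (upper - lower) / f * (target - cf)
        cf += f

# (lower boundary, upper boundary, frequency) for the 120 student weights
weights = [(109.5, 119.5, 1), (119.5, 129.5, 4), (129.5, 139.5, 17),
           (139.5, 149.5, 28), (149.5, 159.5, 25), (159.5, 169.5, 18),
           (169.5, 179.5, 13), (179.5, 189.5, 6), (189.5, 199.5, 5),
           (199.5, 209.5, 2), (209.5, 219.5, 1)]
print(round(grouped_quartile(weights, 1), 2), round(grouped_quartile(weights, 3), 2))
```

This reproduces the 142.36 and 167.83 pounds quoted in the conclusion.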

Why do we need quartiles in statistics?


The main reason is to perform further calculations, like the interquartile range,
which is a measure of how the data is spread out around the median.
DECILES
In descriptive statistics, a decile is used to categorize large data sets from
highest to lowest values, or vice versa. Like the quartile and the percentile, a decile is a
form of a quantile which divides a set of observations into samples that are easier to
analyze and measure. While quartiles are three data points that divide an observation
into four equal groups or quarters, a decile consists of nine data points that divide a
data set into ten equal parts. 

Deciles are similar to quartiles. But while quartiles sort data into four quarters,
deciles sort data into ten equal parts: The 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th,
90th and 100th percentiles.

The values which divide an array into ten equal parts are called deciles. The first,
second, ..., ninth deciles are denoted by D1, D2, ..., D9, respectively. The fifth decile (D5)
corresponds to the median. The second, fourth, sixth and eighth deciles, which collectively
divide the data into five equal parts, are called quintiles.

Formula for Deciles

Dx = value of the [x(n + 1)/10]th item of the ordered data

Example:

We will calculate second decile from the following array of data.

20 28 29 30 36 37 39 42 53 54
55 58 61 67 68 70 74 81 82 93
Table 6.
Find: D2
Dx = value of the [x(n + 1)/10]th item

D2 = value of the [2(20 + 1)/10]th = 4.2th item
= 30 + 0.2(36 − 30) = 31.2

WHY ARE DECILE RANKS USED INSTEAD OF PERCENTILES OR QUARTILES?


Decile rankings are just another way to categorize data. Which system you use is
usually a judgment call. For example, if you wanted to display class rankings on a pie
chart, using deciles would make more sense than percentiles. That's because a pie
chart with 10 categories would be much easier to read than a pie chart with 99
categories.
EXAMPLE:
The histogram below shows the distribution of marks in a test (out of 60) that was
attempted by 600 students. Each student's mark is represented by a square in the
histogram.

The nine deciles split the students into 10 groups of 60.

The first decile is 17.5 so the weakest tenth of the students in the class had a mark
below this. This decile therefore summarises the performance of the weakest students.

Students with marks below 17.5 are said to be in decile 1. Those with marks between
17.5 and 26.5 are in decile 2, and so on, up to students with marks higher than 54.5
who are in decile 10.

EXAMPLE:

Calculate 2nd and 8th deciles of following ordered data 13, 13,13, 20, 26, 27, 31, 34,
34, 34, 35, 35, 36, 37, 38, 41, 41, 41, 45, 47, 47, 47, 50, 51, 53, 54, 56, 62, 67, 82.

Solution:

Dx = value of the [x(n + 1)/10]th item

D2 = value of the [2(30 + 1)/10]th = 6.2th item
= 27 + 0.2(31 − 27) = 27.8

D8 = value of the [8(30 + 1)/10]th = 24.8th item
= 51 + 0.8(53 − 51) = 52.6
DECILE FOR GROUPED DATA

Decile for grouped data can be calculated from the formula:

Dₖ = Lᵢ + {[(k·N)/10 − Fᵢ₋₁]/fᵢ}·aᵢ

Where:

Lᵢ = lower boundary of the decile class.

N = sum of the absolute frequencies.

Fᵢ₋₁ = cumulative frequency immediately below the decile class.

fᵢ = absolute frequency of the decile class.

aᵢ = width of the decile class.

Note: The classes need not all have the same width; each decile uses the width of its
own class.

EXAMPLE:

Calculate the D₁ and D₃ for the following table.

SCORES    fᵢ    Fᵢ
40-50     8     8
50-60     12    20
60-70     14    34
70-80     10    44
80-90     6     50
90-100    16    66
100-110   4     70
Total     70

Solution:

Calculation for the first decile

D₁ = L₁ + {[(k·N)/10 − F₁₋₁]/f₁}·a₁

= 40 + {[(1·70)/10 − 0]/8}·10

= 40 + (7/8)·10

= 40 + 8.75

= 48.75

Calculation for 3rd decile

D₃ = L₃ + {[(k·N)/10 − F₃₋₁]/f₃}·a₃

= 60 + {[(3·70)/10 − 20]/14}·10

= 60 + [(21 − 20)/14]·10

= 60 + 10/14

= 60 + 0.71

= 60.71
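The same steps can be automated. A minimal sketch, where the name grouped_decile is ours and classes are given as (lower limit, upper limit, frequency) triples from the table above:

```python
def grouped_decile(classes, k):
    """k-th decile (k = 1..9) from grouped data:
    D_k = L + ((k*N/10 - F) / f) * a, where L is the lower limit of the
    class holding the (k*N/10)-th item, f its frequency, a its width,
    and F the cumulative frequency of the classes before it."""
    n = sum(f for _, _, f in classes)
    target = k * n / 10
    cf = 0
    for lower, upper, f in classes:
        if cf + f >= target:       # the decile lies in this class
            return lower + (target - cf) / f * (upper - lower)
        cf += f

scores = [(40, 50, 8), (50, 60, 12), (60, 70, 14), (70, 80, 10),
          (80, 90, 6), (90, 100, 16), (100, 110, 4)]
print(round(grouped_decile(scores, 1), 2), round(grouped_decile(scores, 3), 2))
```

This reproduces D₁ = 48.75 and D₃ = 60.71 from the worked example.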

4.4 NORMAL DISTRIBUTIONS


FREQUENCY DISTRIBUTION AND HISTOGRAM

A frequency distribution table is a table that shows how often a data point or a
group of data points appears in a given data set.

A. MAKING A FREQUENCY DISTRIBUTION

To create a histogram, you first must make a quantitative frequency distribution.


The following list of steps allows you to
construct a perfect quantitative frequency distribution every time.

STEP 1: Divide the numbers over which the data ranges into intervals of equal length.

STEP 2: Count how many data points fall into each interval.

STEP 3: If there are many values, it is sometimes useful to go through all the data
points in order and make a tally mark in the interval into which each point falls.

STEP 4: All the tally marks can be counted to see how many data points fall into each
interval. The "tally system" ensures that no points will be missed.

EXAMPLE:
The following is a list of prices (in dollars) of birthday cards found in various
drug stores:
1.45 2.20 0.75 1.23 1.25
1.25 3.09 1.99 2.00 0.78
1.32 2.25 3.15 3.85 0.52
0.99 1.38 1.75 1.22 1.75

Make a frequency distribution table for this data.

1. We omit the units (dollars) while calculating.


2. The values go from 0.52 to 3.85, which is roughly 0.50 to 4.00.
3. We can divide this into 7 intervals of equal length: 0.50 - 0.99, 1.00 - 1.49, 1.50 -
1.99, 2.00 - 2.49, 2.50 - 2.99, 3.00 - 3.49, and 3.50 - 3.99.
4. Then we can count the number of data points which fall into each interval (for example,
4 points fall into the first interval: 0.75, 0.78, 0.52, and 0.99) and make a frequency
distribution table: 

Intervals (in dollars)   Frequency
0.50 - 0.99              4
1.00 - 1.49              7
1.50 - 1.99              3
2.00 - 2.49              3
2.50 - 2.99              0
3.00 - 3.49              2
3.50 - 3.99              1
Total                    20
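The tally of Steps 1 to 4 can be reproduced in Python. A minimal sketch; the interval edges below correspond to the seven equal-length intervals chosen above, written as half-open ranges:

```python
# Frequency distribution for the birthday-card prices.
prices = [1.45, 2.20, 0.75, 1.23, 1.25, 1.25, 3.09, 1.99, 2.00, 0.78,
          1.32, 2.25, 3.15, 3.85, 0.52, 0.99, 1.38, 1.75, 1.22, 1.75]

# Seven intervals of equal length covering 0.50 - 3.99.
edges = [0.50, 1.00, 1.50, 2.00, 2.50, 3.00, 3.50, 4.00]
counts = [0] * (len(edges) - 1)
for p in prices:
    for i in range(len(counts)):
        if edges[i] <= p < edges[i + 1]:   # count the point in its interval
            counts[i] += 1
            break
print(counts)    # one frequency per interval, matching the table
```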

Making a Histogram Using a Frequency Distribution Table

A histogram is a bar graph which shows frequency distribution.


To make a histogram, follow these steps:

STEP 1: On the vertical axis, place frequencies. Label this axis "Frequency".

STEP 2: On the horizontal axis, place the lower value of each interval. Label this axis
with the type of data shown (price of birthday cards, etc.)

STEP 3: Draw a bar extending from the lower value of each interval to the lower value
of the next interval. The height of each bar should be equal to the frequency of its
corresponding interval.

EXAMPLE:

Make a histogram showing the frequency distribution of the price of birthday cards. 
NORMAL DISTRIBUTIONS AND THE EMPIRICAL RULE

The normal distribution is the most widely known and used of all distributions.
Because the normal distribution approximates many natural phenomena so well, it has
developed into a standard of reference for many probability problems.

• The bell curve which represents a normal distribution of data shows what
standard deviation represents.
• The normal distribution depends on the two parameters μ and σ, its mean and
standard deviation, respectively.
• The graph of a normal distribution is called the normal curve.
• The normal distribution is often referred to as the Gaussian distribution, in honor
of Carl Friedrich Gauss (1777-1855).

Areas under the Normal Curve


• Transform each observation to its equivalent standardized score or z-score, using
the formula z = (x − μ)/σ.
• If the area or probability is known, we can solve for the x-value using the formula
x = σz + μ.

Properties of a normal curve


• The graph of a normal distribution is a bell-shaped symmetrical distribution.
• Here, mean = median = mode.
• The curve is asymptotic to the x-axis in both directions.
• The total area under the curve is 1, or 100%.
• The curve extends indefinitely in both directions.

On the normal curve, the mean, median, and mode all exist at the center.

The graph changes direction at inflection points. These inflection points mark the
distance of one standard deviation from the mean.

THE EMPIRICAL RULE

The empirical rule states that:

68% of all values fall within 1 standard deviation of the mean.

95% of all values fall within 2 standard deviations of the mean.

99.7% of all values fall within 3 standard deviations of the mean.

EXAMPLES:

Problem 1: The ages of all students in a class are normally distributed. 95 percent of
the data falls between the ages 15.6 and 18.4. Find the mean and standard deviation of
the given data.

Solution: 
From the 68-95-99.7 rule, we know that in a normal distribution 95 percent of the data
falls within 2 standard deviations of the mean.

Mean of the data = (15.6 + 18.4)/2 = 17

From the mean, 17, to one end, 18.4, there are two standard deviations.

Standard deviation = (18.4 − 17)/2 = 0.7

Problem 2: If mean of a given data for a random value is 81.1 and standard deviation is
4.7, then find the probability of getting a value more than 83.

Solution: 
Standard deviation, σ = 4.7

Mean, μ = 81.1

Expected value, X = 83

Z-score: z = (X − μ)/σ

z = (83 − 81.1)/4.7 ≈ 0.40

Looking up z = 0.40 in the z-table gives an area of 0.6554 to the left of z.

Hence, the probability of getting a value more than 83 is approximately
1 − 0.6554 = 0.3446, or about 0.34.
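The table lookup can be cross-checked with Python's math.erf, which gives the standard normal CDF without a table; any small difference from the table answer comes from rounding z to two decimals before the lookup.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, x = 81.1, 4.7, 83
z = (x - mu) / sigma          # standardize the observation
p_more = 1 - normal_cdf(z)    # upper-tail probability P(X > 83)
print(round(z, 2), round(p_more, 4))
```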
THE STANDARD NORMAL DISTRIBUTION

Z-score Formulas:

z = (x − μ)/σ (population)   or   z = (x − x̅)/s (sample)

A standard normal distribution is a normal distribution with mean 0 and standard


deviation 1.

From the empirical rule we know that for a variable with the standard normal
distribution, 68% of the observations fall between -1 and 1 (within 1 standard deviation
of the mean of 0), 95% fall between -2 and 2 (within 2 standard deviations of the mean)
and 99.7% fall between -3 and 3 (within 3 standard deviations of the mean).

A z-table shows you the percent of the population:

• between 0 and z,
• less than z, or
• greater than z,

depending on the version of the table. Values are typically given only to the nearest
0.01%.


QUESTION: 

What is the relative frequency of observations below 1.18?

That is, find the relative frequency of the event Z < 1.18. (Here small z is 1.18.)

Step 1

Sketch the curve. Identify, on the measurement (horizontal) axis, the indicated range
of values: the event Z < 1.18 is the shaded region.

Step 2

The relative frequency of the event is equal to the area under the curve over the region
describing the event. The shaded area is the relative frequency of the event Z < 1.18;
it appears to be approximately 85%-90%.

Step 3

Use the standard normal table to find this area.

For a standard normal variable, the relative frequency of observations falling
below 1.18 is 0.8810. (Also, for any normal distribution, 0.8810 or 88.1% of the
observations fall below the value 1.18 standard deviations above the mean.)

ANSWER: 
0.8810 or 88.10%.

THE TABLE
STANDARD NORMAL DISTRIBUTION: Table Values Represent AREA to the LEFT
of the Z score.

Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
-3.9 .00005 .00005 .00004 .00004 .00004 .00004 .00004 .00004 .00003 .00003
-3.8 .00007 .00007 .00007 .00006 .00006 .00006 .00006 .00005 .00005 .00005
-3.7 .00011 .00010 .00010 .00010 .00009 .00009 .00008 .00008 .00008 .00008
-3.6 .00016 .00015 .00015 .00014 .00014 .00013 .00013 .00012 .00012 .00011
-3.5 .00023 .00022 .00022 .00021 .00020 .00019 .00019 .00018 .00017 .00017
-3.4 .00034 .00032 .00031 .00030 .00029 .00028 .00027 .00026 .00025 .00024
-3.3 .00048 .00047 .00045 .00043 .00042 .00040 .00039 .00038 .00036 .00035
-3.2 .00069 .00066 .00064 .00062 .00060 .00058 .00056 .00054 .00052 .00050
-3.1 .00097 .00094 .00090 .00087 .00084 .00082 .00079 .00076 .00074 .00071
-3.0 .00135 .00131 .00126 .00122 .00118 .00114 .00111 .00107 .00104 .00100
-2.9 .00187 .00181 .00175 .00169 .00164 .00159 .00154 .00149 .00144 .00139
-2.8 .00256 .00248 .00240 .00233 .00226 .00219 .00212 .00205 .00199 .00193
-2.7 .00347 .00336 .00326 .00317 .00307 .00298 .00289 .00280 .00272 .00264
-2.6 .00466 .00453 .00440 .00427 .00415 .00402 .00391 .00379 .00368 .00357
-2.5 .00621 .00604 .00587 .00570 .00554 .00539 .00523 .00508 .00494 .00480
-2.4 .00820 .00798 .00776 .00755 .00734 .00714 .00695 .00676 .00657 .00639
-2.3 .01072 .01044 .01017 .00990 .00964 .00939 .00914 .00889 .00866 .00842
-2.2 .01390 .01355 .01321 .01287 .01255 .01222 .01191 .01160 .01130 .01101
-2.1 .01786 .01743 .01700 .01659 .01618 .01578 .01539 .01500 .01463 .01426
-2.0 .02275 .02222 .02169 .02118 .02068 .02018 .01970 .01923 .01876 .01831
-1.9 .02872 .02807 .02743 .02680 .02619 .02559 .02500 .02442 .02385 .02330
-1.8 .03593 .03515 .03438 .03362 .03288 .03216 .03144 .03074 .03005 .02938
-1.7 .04457 .04363 .04272 .04182 .04093 .04006 .03920 .03836 .03754 .03673
-1.6 .05480 .05370 .05262 .05155 .05050 .04947 .04846 .04746 .04648 .04551
-1.5 .06681 .06552 .06426 .06301 .06178 .06057 .05938 .05821 .05705 .05592
-1.4 .08076 .07927 .07780 .07636 .07493 .07353 .07215 .07078 .06944 .06811
-1.3 .09680 .09510 .09342 .09176 .09012 .08851 .08691 .08534 .08379 .08226
-1.2 .11507 .11314 .11123 .10935 .10749 .10565 .10383 .10204 .10027 .09853
-1.1 .13567 .13350 .13136 .12924 .12714 .12507 .12302 .12100 .11900 .11702
-1.0 .15866 .15625 .15386 .15151 .14917 .14686 .14457 .14231 .14007 .13786
-0.9 .18406 .18141 .17879 .17619 .17361 .17106 .16853 .16602 .16354 .16109
-0.8 .21186 .20897 .20611 .20327 .20045 .19766 .19489 .19215 .18943 .18673
-0.7 .24196 .23885 .23576 .23270 .22965 .22663 .22363 .22065 .21770 .21476
-0.6 .27425 .27093 .26763 .26435 .26109 .25785 .25463 .25143 .24825 .24510
-0.5 .30854 .30503 .30153 .29806 .29460 .29116 .28774 .28434 .28096 .27760
-0.4 .34458 .34090 .33724 .33360 .32997 .32636 .32276 .31918 .31561 .31207
-0.3 .38209 .37828 .37448 .37070 .36693 .36317 .35942 .35569 .35197 .34827
-0.2 .42074 .41683 .41294 .40905 .40517 .40129 .39743 .39358 .38974 .38591
-0.1 .46017 .45620 .45224 .44828 .44433 .44038 .43644 .43251 .42858 .42465
-0.0 .50000 .49601 .49202 .48803 .48405 .48006 .47608 .47210 .46812 .46414
STANDARD NORMAL DISTRIBUTION: Table Values Represent AREA to the LEFT
of the Z score.

Z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .50000 .50399 .50798 .51197 .51595 .51994 .52392 .52790 .53188 .53586
0.1 .53983 .54380 .54776 .55172 .55567 .55962 .56356 .56749 .57142 .57535
0.2 .57926 .58317 .58706 .59095 .59483 .59871 .60257 .60642 .61026 .61409
0.3 .61791 .62172 .62552 .62930 .63307 .63683 .64058 .64431 .64803 .65173
0.4 .65542 .65910 .66276 .66640 .67003 .67364 .67724 .68082 .68439 .68793
0.5 .69146 .69497 .69847 .70194 .70540 .70884 .71226 .71566 .71904 .72240
0.6 .72575 .72907 .73237 .73565 .73891 .74215 .74537 .74857 .75175 .75490
0.7 .75804 .76115 .76424 .76730 .77035 .77337 .77637 .77935 .78230 .78524
0.8 .78814 .79103 .79389 .79673 .79955 .80234 .80511 .80785 .81057 .81327
0.9 .81594 .81859 .82121 .82381 .82639 .82894 .83147 .83398 .83646 .83891
1.0 .84134 .84375 .84614 .84849 .85083 .85314 .85543 .85769 .85993 .86214
1.1 .86433 .86650 .86864 .87076 .87286 .87493 .87698 .87900 .88100 .88298
1.2 .88493 .88686 .88877 .89065 .89251 .89435 .89617 .89796 .89973 .90147
1.3 .90320 .90490 .90658 .90824 .90988 .91149 .91309 .91466 .91621 .91774
1.4 .91924 .92073 .92220 .92364 .92507 .92647 .92785 .92922 .93056 .93189
1.5 .93319 .93448 .93574 .93699 .93822 .93943 .94062 .94179 .94295 .94408
1.6 .94520 .94630 .94738 .94845 .94950 .95053 .95154 .95254 .95352 .95449
1.7 .95543 .95637 .95728 .95818 .95907 .95994 .96080 .96164 .96246 .96327
1.8 .96407 .96485 .96562 .96638 .96712 .96784 .96856 .96926 .96995 .97062
1.9 .97128 .97193 .97257 .97320 .97381 .97441 .97500 .97558 .97615 .97670
2.0 .97725 .97778 .97831 .97882 .97932 .97982 .98030 .98077 .98124 .98169
2.1 .98214 .98257 .98300 .98341 .98382 .98422 .98461 .98500 .98537 .98574
2.2 .98610 .98645 .98679 .98713 .98745 .98778 .98809 .98840 .98870 .98899
2.3 .98928 .98956 .98983 .99010 .99036 .99061 .99086 .99111 .99134 .99158
2.4 .99180 .99202 .99224 .99245 .99266 .99286 .99305 .99324 .99343 .99361
2.5 .99379 .99396 .99413 .99430 .99446 .99461 .99477 .99492 .99506 .99520
2.6 .99534 .99547 .99560 .99573 .99585 .99598 .99609 .99621 .99632 .99643
2.7 .99653 .99664 .99674 .99683 .99693 .99702 .99711 .99720 .99728 .99736
2.8 .99744 .99752 .99760 .99767 .99774 .99781 .99788 .99795 .99801 .99807
2.9 .99813 .99819 .99825 .99831 .99836 .99841 .99846 .99851 .99856 .99861
3.0 .99865 .99869 .99874 .99878 .99882 .99886 .99889 .99893 .99896 .99900
3.1 .99903 .99906 .99910 .99913 .99916 .99918 .99921 .99924 .99926 .99929
3.2 .99931 .99934 .99936 .99938 .99940 .99942 .99944 .99946 .99948 .99950
3.3 .99952 .99953 .99955 .99957 .99958 .99960 .99961 .99962 .99964 .99965
3.4 .99966 .99968 .99969 .99970 .99971 .99972 .99973 .99974 .99975 .99976
3.5 .99977 .99978 .99978 .99979 .99980 .99981 .99981 .99982 .99983 .99983
3.6 .99984 .99985 .99985 .99986 .99986 .99987 .99987 .99988 .99988 .99989
3.7 .99989 .99990 .99990 .99990 .99991 .99991 .99992 .99992 .99992 .99992
3.8 .99993 .99993 .99993 .99994 .99994 .99994 .99994 .99995 .99995 .99995
3.9 .99995 .99995 .99996 .99996 .99996 .99996 .99996 .99996 .99997 .99997
EXAMPLE 1:

Percent of Population Between 0 and 0.45

In the table, the row for 0.4 and the column .05 give .67364, the area to the left of
0.45. Subtracting the area of .50000 that lies below 0, the area between 0 and 0.45 is
.67364 − .50000 = .17364.

And 0.1736 is 17.36%

So 17.36% of the population are between 0 and 0.45 Standard Deviations from the
Mean.

EXAMPLE 2:

Percent of Population Z Between -1 and 2

From −1 to 0 is the same as from 0 to +1:

At the row for 1.0, column .00, the table gives .84134, so the area from 0 to +1 is
.84134 − .50000 = 0.3413

The area from 0 to +2 is .97725 − .50000 = 0.4772

Add the two to get the total between -1 and 2: 0.3413 + 0.4772 = 0.8185

And 0.8185 is 81.85%
So 81.85% of the population are between -1 and +2 Standard Deviations from the
Mean.
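The same combination of areas can be cross-checked numerically with math.erf (normal_cdf is our own helper, not part of the text); the sum of the two rounded table entries may differ from the direct computation in the fourth decimal.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area between z = -1 and z = +2, as in Example 2.
p = normal_cdf(2) - normal_cdf(-1)
print(round(p, 4))
```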

4.5 LINEAR REGRESSION AND CORRELATION

LINEAR REGRESSION

Simple linear regression is a statistical method that allows us to summarize and


study relationships between two continuous (quantitative) variables:

• One variable, denoted x, is regarded as the predictor, explanatory, or independent
variable.
• The other variable, denoted y, is regarded as the response, outcome, or dependent
variable.

Simple linear regression gets its adjective "simple" because it concerns the
study of only one predictor variable. In contrast, multiple linear regression gets its
adjective "multiple" because it concerns the study of two or more predictor variables.

Scatterplot

A scatterplot can be a helpful tool in determining the strength of the


relationship between two variables. If there appears to be no association
between the proposed explanatory and dependent variables (i.e., the scatterplot
does not indicate any increasing or decreasing trends), then fitting a linear
regression model to the data probably will not provide a useful model. A valuable
numerical measure of association between two variables is the correlation
coefficient, which is a value between -1 and 1 indicating the strength of the
association of the observed data for the two variables.

A linear regression line has an equation of the form Y = a + bX, where:

• X = the explanatory variable;

• Y = the dependent variable;

• b = the slope of the line (the regression coefficient); and 

• a = the constant/intercept (the value of Y when X = 0).

Three major uses for regression analysis are:


• Determining the strength of predictors;
• Forecasting an effect; and
• Trend forecasting.

Least-Squares Line

Bivariate data are given as ordered pairs. The least-squares regression line, or least-
squares line, for a set of bivariate data is the line that minimizes the sum of the squares
of the vertical deviations from each data point to the line. The equation of the least-
squares line for the n ordered pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn) is ŷ = ax + b, where

a = [nΣxy − (Σx)(Σy)] / [n(Σx²) − (Σx)²]   and   b = ȳ − a x̄

Example

The following data was gathered for five production runs of ABC Company. Determine
the cost function using the least squares method.

Batch Units Total Cost


(x) (y)
1 680 $29,800
2 820 $34,000
3 570 $27,500
4 660 $29,000
5 750 $31,900

Solution:

Batch Units Total Cost xy x²


(x) (y)
1 680 29,800 20,264,000 462,400
2 820 34,000 27,880,000 672,400
3 570 27,500 15,675,000 324,900
4 660 29,000 19,140,000 435,600
5 750 31,900 23,925,000 562,500
∑ 3,480 152,200 106,884,000 2,457,800

Substituting the computed values in the formula, we can compute for a.

a = [nΣxy − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
  = [5(106,884,000) − (3,480)(152,200)] / [5(2,457,800) − (3,480)²]
  = 4,764,000 / 178,600

a = 26.6741 ≈ $26.67 per unit

Total fixed cost (b) can then be computed by substituting the computed a. 

b = (Σy − aΣx) / n
  = [152,200 − (26.67)(3,480)] / 5

b = $11,877.68

The cost function for this particular set using the method of least squares is: 

y = b + ax

y = $11,877.68 + $26.67x.
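The summations can be reproduced in Python. A minimal sketch (variable names are ours); note that carrying the unrounded slope through gives an intercept of about $11,874.80, slightly different from the $11,877.68 obtained above with the slope rounded to $26.67.

```python
# Least-squares slope (a) and intercept (b) for the ABC Company data.
units = [680, 820, 570, 660, 750]              # x
costs = [29800, 34000, 27500, 29000, 31900]    # y

n = len(units)
sx, sy = sum(units), sum(costs)
sxy = sum(x * y for x, y in zip(units, costs))
sxx = sum(x * x for x in units)

a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # variable cost per unit
b = (sy - a * sx) / n                          # fixed cost
print(round(a, 4), round(b, 2))
```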

There are several types of linear regression analyses available to researchers.

• Simple linear regression: 1 dependent variable (interval or ratio), 1 independent
variable (interval or ratio or dichotomous)
• Multiple linear regression: 1 dependent variable (interval or ratio), 2+ independent
variables (interval or ratio or dichotomous)
• Logistic regression: 1 dependent variable (dichotomous), 2+ independent variable(s)
(interval or ratio or dichotomous)
• Ordinal regression: 1 dependent variable (ordinal), 1+ independent variable(s)
(nominal or dichotomous)
• Multinomial regression: 1 dependent variable (nominal), 1+ independent variable(s)
(interval or ratio or dichotomous)

 
EXAMPLE:

Table 1.

X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25

Figure 1. A scatter plot of the example data.

Linear regression consists of finding the best-fitting straight line through the points.
The best-fitting line is called a regression line. The black diagonal line in Figure 2 is the
regression line and consists of the predicted score on Y for each possible value of X.
The vertical lines from the points to the regression line represent the errors of
prediction. As you can see, the red point is very near the regression line; its error of
prediction is small. By contrast, the yellow point is much higher than the regression line
and therefore its error of prediction is large.
Figure 2. A scatter plot of the example data. The black line consists of the predictions,
the points are the actual data, and the vertical lines between the points and the black
line represent errors of prediction.

The error of prediction for a point is the value of the point minus the predicted value
(the value on the line). Table 2 shows the predicted values (Y') and the errors of
prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called
Y') of 1.21. Therefore, its error of prediction is -0.21.

Table 2. Example data.

X Y Y' Y-Y' (Y-Y')²


1.00 1.00 1.210 -0.210 0.044
2.00 2.00 1.635 0.365 0.133
3.00 1.30 2.060 -0.760 0.578
4.00 3.75 2.485 1.265 1.600
5.00 2.25 2.910 -0.660 0.436

You may have noticed that we did not specify what is meant by "best-fitting line."
By far, the most commonly-used criterion for the best-fitting line is the line that
minimizes the sum of the squared errors of prediction. That is the criterion that was
used to find the line in Figure 2. The last column in Table 2 shows the squared errors of
prediction. The sum of the squared errors of prediction shown in Table 2 is lower than it
would be for any other regression line.

The formula for a regression line is

Y' = A + bX

where Y' is the predicted score, b is the slope of the line, and A is the Y intercept. The
equation for the line in Figure 2 is

Y' = 0.785 + 0.425X

For X = 1,

Y' = 0.785 + (0.425)(1) = 1.21.


For X = 2,

Y' = 0.785 + (0.425)(2) = 1.64.

COMPUTING THE REGRESSION LINE


In the age of computers, the regression line is typically computed with statistical
software. However, the calculations are relatively easy, and are given here for anyone
who is interested. The calculations are based on the statistics shown in Table 3. MX is
the mean of X, MY is the mean of Y, sX is the standard deviation of X, sY is the standard
deviation of Y, and r is the correlation between X and Y.

Table 3. Statistics for computing the regression line.
MX MY sX sY r
3 2.06 1.581 1.072 0.627

The slope (b) can be calculated as follows:

b = r sY/sX

and the intercept (A) can be calculated as:

A = MY - bMX.

For these data,

b = (0.627)(1.072)/1.581 = 0.425

A = 2.06 - (0.425)(3) = 0.785
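These calculations can be reproduced from the raw data of Table 1 (variable names are ours), recovering the Table 3 statistics and then the slope and intercept:

```python
from math import sqrt

# Raw data from Table 1.
X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sx = sqrt(sum((x - mx) ** 2 for x in X) / (n - 1))   # sample std dev of X
sy = sqrt(sum((y - my) ** 2 for y in Y) / (n - 1))   # sample std dev of Y
r = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / ((n - 1) * sx * sy)

b = r * sy / sx      # slope
A = my - b * mx      # intercept
print(round(b, 3), round(A, 3))
```

This yields b = 0.425 and A = 0.785, the coefficients of the line Y' = 0.785 + 0.425X.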

Note that the calculations have all been shown in terms of sample statistics
rather than population parameters. The formulas are the same; simply use the
parameter values for means, standard deviations, and the correlation.

STANDARDIZED VARIABLES
The regression equation is simpler if variables are standardized so that their
means are equal to 0 and standard deviations are equal to 1, for then b = r and A = 0.
This makes the regression line:

ZY' = (r)(ZX)

where ZY' is the predicted standard score for Y, r is the correlation, and ZX is the
standardized score for X. Note that the slope of the regression equation for
standardized variables is r.
A REAL EXAMPLE
The case study "SAT and College GPA" contains high school and university
grades for 105 computer science majors at a local state school. We now consider how
we could predict a student's university GPA if we knew his or her high school GPA.

Figure 3 shows a scatter plot of University GPA as a function of High School GPA. You
can see from the figure that there is a strong positive relationship. The correlation is
0.78. The regression equation is

University GPA' = (0.675)(High School GPA) + 1.097

Therefore, a student with a high school GPA of 3 would be predicted to have a


university GPA of

University GPA' = (0.675)(3) + 1.097 = 3.12.

Figure 3. University GPA as


a function of High School
GPA.

LINEAR CORRELATION COEFFICIENT


Correlation coefficients measure the strength of association between two
variables. The most common correlation coefficient, called the Pearson product-
moment correlation coefficient, measures the strength of the linear
association between variables measured on an interval or ratio scale.

For the n ordered pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn), the linear correlation
coefficient r is given by:

r = [nΣxy − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

HOW TO INTERPRET A CORRELATION COEFFICIENT

The sign and the absolute value of a correlation coefficient describe the direction
and the magnitude of the relationship between two variables.

 The value of a correlation coefficient ranges between -1 and 1.


 The greater the absolute value of the Pearson product-moment correlation
coefficient, the stronger the linear relationship.
 The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
 The weakest linear relationship is indicated by a correlation coefficient equal to 0.
 A positive correlation means that if one variable gets bigger, the other variable
tends to get bigger.
 A negative correlation means that if one variable gets bigger, the other variable
tends to get smaller.
 A result of zero indicates no linear relationship; the variables may still be related
in a nonlinear way.

SCATTERPLOTS AND CORRELATION COEFFICIENTS

The scatter plots below show how different patterns of data produce different degrees of
correlation.
[Scatter plots omitted. The six panels showed: maximum positive correlation
(r = 1.0), strong positive correlation (r = 0.80), zero correlation (r = 0), maximum
negative correlation (r = -1.0), moderate negative correlation (r = -0.43), and
strong correlation with an outlier (r = 0.71).]
Several points are evident from the scatter plots.

 When the slope of the line in the plot is negative, the correlation is negative; and
vice versa.
 The strongest correlations (r = 1.0 and r = -1.0 ) occur when data points
fall exactly on a straight line.
 The correlation becomes weaker as the data points become more scattered.
 If the data points fall in a random pattern, the correlation is close to zero.
 Correlation is affected by outliers. Compare the first scatterplot with the last
scatterplot. The single outlier in the last plot greatly reduces the correlation (from
1.00 to 0.71).
 A correlation coefficient of 1 means that for every positive increase in one
variable, there is a positive increase of a fixed proportion in the other. For
example, shoe sizes go up in (almost) perfect correlation with foot length.
 A correlation coefficient of -1 means that for every positive increase in one
variable, there is a decrease of a fixed proportion in the other. For example, the
amount of gas in a tank decreases in (almost) perfect correlation with the
distance driven.
 Zero means that an increase in one variable is not accompanied by any
consistent increase or decrease in the other. The two just aren’t linearly related.

The absolute value of the correlation coefficient gives us the relationship strength. The
larger the number, the stronger the relationship. For example, |-.75| = .75, which has a
stronger relationship than .65.

TYPES OF CORRELATION COEFFICIENT FORMULAS

Product-moment correlation coefficient

The correlation r between two variables is:

r = ∑xy / √[ (∑x²)(∑y²) ]

where:

 Σ is the summation symbol;

 x = xi − x̄, where xi is the x value for observation i;

 x̄ is the mean of the x values;

 y = yi − ȳ, where yi is the y value for observation i; and

 ȳ is the mean of the y values.

Population correlation coefficient

The correlation ρ between two variables is:

ρ = Σ [ (Xi − μX)(Yi − μY) ] / ( N · σX · σY )

where:

 N is the number of observations in the population;

 Σ is the summation symbol;

 Xi is the X value for observation i;

 μX is the population mean for variable X;

 Yi is the Y value for observation i;

 μY is the population mean for variable Y;

 σx is the population standard deviation of X; and

 σy is the population standard deviation of Y.

Sample correlation coefficient

The correlation r between two variables is:

r = [ 1 / (n - 1) ] * Σ { [ (xi - x̄) / sx ] * [ (yi - ȳ) / sy ] }


where:

 n is the number of observations in the sample;

 Σ is the summation symbol;

 xi is the x value for observation i;

 x̄ is the sample mean of x;

 yi is the y value for observation i;

 ȳ is the sample mean of y;

 sx is the sample standard deviation of x; and

 sy is the sample standard deviation of y.
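The sample formula above translates line by line into Python. This is a sketch using only the standard library; the function name is ours:

```python
import statistics

def sample_correlation(xs, ys):
    """Sample r = [1/(n-1)] * Σ [(xi - x̄)/sx] * [(yi - ȳ)/sy]."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    # statistics.stdev computes the SAMPLE standard deviation (n - 1 divisor)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - x_bar) / sx) * ((y - y_bar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Perfectly linear data gives r = 1 (up to floating-point rounding)
print(sample_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
```

Note that dividing by n - 1 matches the use of sample standard deviations; the population version divides by N and uses σx and σy instead.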

EXAMPLE:

X Y
7 9
7 11
8 7
11 11
12 9
12 11
13 9

The linear correlation coefficient measures the relationship between the paired values in
a sample.

Sum up the values of the first column of data x.
∑x=7+7+8+11+12+12+13

Simplify the expression.
∑x=70
Sum up the values of the second column of data y.
∑y=9+11+7+11+9+11+9

Simplify the expression.
∑y=67

Sum up the values of x⋅y

∑xy=7⋅9+7⋅11+8⋅7+11⋅11+12⋅9+12⋅11+13⋅9

Simplify the expression.
∑xy=674

Sum up the values of x².

∑x²=(7)²+(7)²+(8)²+(11)²+(12)²+(12)²+(13)²

Simplify the expression.
∑x²=740

Sum up the values of y².

∑y²=(9)²+(11)²+(7)²+(11)²+(9)²+(11)²+(9)²

Simplify the expression.
∑y²=655

Fill in the computed values, with n = 7:

r = [ 7(674) − (70)(67) ] / √{ [ 7(740) − (70)² ][ 7(655) − (67)² ] }

Simplify the expression.
r = 28 / √(280 · 96) ≈ 0.17078251
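The whole worked example can be checked with a few lines of Python using the shortcut formula:

```python
import math

x = [7, 7, 8, 11, 12, 12, 13]
y = [9, 11, 7, 11, 9, 11, 9]
n = len(x)

sum_x = sum(x)                              # 70
sum_y = sum(y)                              # 67
sum_xy = sum(a * b for a, b in zip(x, y))   # 674
sum_x2 = sum(a * a for a in x)              # 740
sum_y2 = sum(b * b for b in y)              # 655

# r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²][n∑y² − (∑y)²]}
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 8))  # prints 0.17078251
```

A value this close to zero indicates only a very weak positive linear relationship between the paired values.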
REFERENCES

4.3 Measures of Relative position

 Percentiles
https://www.varsitytutors.com/hotmath/hotmath_help/topics/percentile
http://www.statisticshowto.com/probability-and-statistics/percentiles-rank-range/
http://mba-lectures.com/statistics/descriptive-statistics/245/percentiles.html

 Quartiles
http://www.statisticshowto.com/what-are-quartiles/
https://magoosh.com/statistics/quartiles-used-statistics/
https://explorable.com/quartile
https://www.hackmath.net/en/calculator/quartile-q1-q3

 Deciles
http://www-ist.massey.ac.nz/dstirlin/cast/cast/scentre/centre5.html
https://www.investopedia.com/terms/d/decile.asp
http://www.statisticshowto.com/decile/
https://econtutorials.com/blog/quartiles-deciles-and-percentiles/
https://www.onlinemath4all.com/deciles.html
4.4 Normal Distributions

 Frequency Distribution and Histogram
http://www.sparknotes.com/math/algebra1/graphingdata/section2/

 Normal Distributions and the Empirical Rule
http://statisticslectures.com/topics/normalcurve/#lecture
https://www.mathsisfun.com/data/standard-deviation-formulas.html
http://www.probabilityformula.org/normal-distribution-examples.html

 The Standard Normal Distribution
http://www.oswego.edu/~srp/stats/z.htm
https://www.mathsisfun.com/data/standard-normal-distribution-table.html

4.5 Linear Regression and Correlation

 Linear Regression
http://www.statisticssolutions.com/what-is-linear-regression/
http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm
http://onlinestatbook.com/2/regression/intro.html

 Linear Correlation Coefficient
http://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/
https://stattrek.com/statistics/correlation.aspx
http://www.oswego.edu/~srp/stats/z.htm
https://www.stat.ncsu.edu/people/osborne/courses/xiamen/normal-table.pdf
https://www.mathway.com/examples/statistics/correlation-and-regression/finding-the-linear-correlation-coefficient?id=328
