Describing, Exploring, and Comparing Data

APPLIED STATISTICS
Submitted to : Dr. IMELDA E. CUATEL
GRADUATE SCHOOL, UNIVERSITY OF LUZON, Sunday 8:00-12:30
DESCRIBING, EXPLORING, AND COMPARING DATA

Prepared by : SAIFULDEEN SINAN
Introduction to Statistics
What is Statistics?
Statistics is a set of procedures and rules for reducing large masses of data to manageable proportions and for allowing us to draw conclusions from those data.
Statistics is a branch of mathematics that deals with the effective management and analysis of data.
What can Stats do?

- Allow us to draw conclusions from the data
- Make data more manageable
- Allow us to do both objectively and quantitatively
Why Statistics?
- To develop an appreciation for variability and how it affects products and processes.
- To build an appreciation for the advantages and limitations of informed observation and experimentation.
- To determine how to analyze data from designed experiments in order to build knowledge and continuously improve.
Grouped Frequency Distributions

A frequency distribution is a table used to organize data. The left column (called classes or groups) lists numerical intervals of the variable being studied. The right column lists the frequencies, or number of observations, for each class.
Grouped frequency distributions can be used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.
Construction of a Frequency Distribution

1. Find the highest and lowest values.
2. Find the range.
3. Select the number of classes desired.
4. Find the width by dividing the range by the number of classes and rounding up.
5. Select a starting point (usually the lowest value); add the width repeatedly to get the lower class limits.
6. Find the upper class limits.
7. Find the class boundaries.
8. Tally the data, find the frequencies, and find the cumulative frequencies.
Example
In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.
10 22 11 13 5
8 13 9 12 11
6 17 18 15 16
14 19 14 15 11
Answer
Step 1: Find the highest and lowest values: H = 22 and L = 5.
Step 2: Find the range: R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes desired. In this case it is 6.
Step 4: Find the class width by dividing the range by the number of classes. Width = 17/6 = 2.83. This value is rounded up to 3.
Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.
Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
Step 8: Tally the data, write the numerical values for the tallies in the frequency column, and find the cumulative frequencies.

Note: In the class boundaries column, the dash (-) means "to".

Class Limits   Class Boundaries   Frequency   Cumulative Frequency
5 to 7         4.5 - 7.5          2           2
8 to 10        7.5 - 10.5         3           5
11 to 13       10.5 - 13.5        6           11
14 to 16       13.5 - 16.5        5           16
17 to 19       16.5 - 19.5        3           19
20 to 22       19.5 - 22.5        1           20
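The eight construction steps can be sketched in Python as a quick check of the table. This is only an illustrative sketch: the function name `grouped_frequency` is made up here, and it assumes whole-number data whose range is not an exact multiple of the class count (otherwise the highest value could fall outside the last class and one more class would be needed).

```python
import math

# Number of cigarettes smoked per day by the 20 patients in the example
data = [10, 22, 11, 13, 5, 8, 13, 9, 12, 11,
        6, 17, 18, 15, 16, 14, 19, 14, 15, 11]


def grouped_frequency(values, num_classes):
    """Return rows of (lower limit, upper limit, lower boundary,
    upper boundary, frequency, cumulative frequency)."""
    low, high = min(values), max(values)                # Step 1
    value_range = high - low                            # Step 2
    width = math.ceil(value_range / num_classes)        # Step 4: 17/6 = 2.83 -> 3
    rows = []
    cumulative = 0
    for i in range(num_classes):
        lower = low + i * width                         # Step 5: lower class limits
        upper = lower + width - 1                       # Step 6: upper class limits
        freq = sum(lower <= v <= upper for v in values)  # Step 8: tally
        cumulative += freq
        # Step 7: boundaries are the limits moved out by 0.5
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows


table = grouped_frequency(data, 6)
for lower, upper, lb, ub, freq, cum in table:
    print(f"{lower:>2} to {upper:<2}  {lb:>4} - {ub:<4}  {freq:>2}  {cum:>2}")
```

The printed rows should match the frequency and cumulative frequency columns of the table above.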
Histogram
What is a histogram?
It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies."
A histogram is like a bar chart, but there are some important differences:
- It can only be used to show continuous data.
- It can only be used to show numerical data.
- The data is always grouped.
So:
- The width of a bar represents an interval of a quantitative variable x, such as age, rather than a category.
- The height of each bar indicates frequency (when all bars have the same width).
How is a Real Histogram Made?
Example: consider the set below.
{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}
A graph that shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as below.
Bin = the class size (width of the rectangles) in a histogram
SOLUTION

Data Range   Frequency
0-10         1
10-20        3
20-30        6
30-40        4
40-50        2

Note: Changing the size of the bin changes the appearance of the graph.
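As a rough illustration of binning, the sketch below counts how many values fall in each range and prints a text histogram. The helper `bin_counts` is a hypothetical name, and the rule that a value sitting exactly on a bin edge goes into the upper bin is an assumption; the slides do not say how edge values are handled (none occur in this data set).

```python
data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]


def bin_counts(values, width, start=0):
    """Count values per bin [lower, lower + width).
    Assumption: a value on a bin edge belongs to the upper bin."""
    counts = {}
    for v in values:
        lower = start + ((v - start) // width) * width
        key = (lower, lower + width)
        counts[key] = counts.get(key, 0) + 1
    return dict(sorted(counts.items()))


hist_10 = bin_counts(data, 10)   # bin width used in the example
hist_20 = bin_counts(data, 20)   # a wider bin gives a different picture

for (lo, hi), n in hist_10.items():
    print(f"{lo:>2}-{hi:<2} {'#' * n} ({n})")
```

Comparing `hist_10` with `hist_20` shows the point of the note: the same data grouped with a different bin width produces a differently shaped graph.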
Histogram shapes
Box plot
A box plot (also referred to as a box and whisker diagram) is a diagram showing statistical distribution.
A box plot summarizes data using the median, upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.
We need 5 numbers, called the 5-number summary:
1. minimum value
2. Q1
3. median
4. Q3
5. maximum value
Construction of BOX PLOT

MPG of 4-cylinder cars:
28 30 24 38 31 32 25 32 34 28 42 44 33 30 31 37 38 44 44 29 39 29 32 29
To make a box plot, first arrange the data in order from least to greatest:
24 25 28 28 29 29 29 30 30 31 31 32 32 32 33 34 37 38 38 39 42 44 44 44
Then find the median of the data. With 24 values, it is the average of the 12th and 13th values: (32 + 32)/2 = 32. The median divides the data in half.
Lower half: 24 25 28 28 29 29 29 30 30 31 31 32
Upper half: 32 32 33 34 37 38 38 39 42 44 44 44
Find the median of the upper half. This is called the high median, upper quartile, or quartile 3: Q3 = 38.
Find the median of the lower half. This is called the low median, lower quartile, or quartile 1: Q1 = 29.
Next, find the lowest value, 24, and the highest value, 44.
Let's organize all 5 pieces of data together so we can see them:
Lower extreme = 24
Lower quartile (Q1) = 29
Median (Q2) = 32
Upper quartile (Q3) = 38
Upper extreme (Q4) = 44
Next, make a number line that will best display the 5 pieces of data (24, 29, 32, 38, 44):
20 24 28 32 36 40 44
Place a dot above the number line for the lower extreme and one for the upper extreme. Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.
Enclose the vertical slashes in a box. Draw a line from the right end of the box to the upper extreme and one from the left end of the box to the lower extreme, forming the whiskers.
All graphs must have a title that clearly represents what the graph is showing.
[Figure: box plot titled "Miles per Gallon of 4-cylinder Cars"; horizontal axis: Miles per gallon (mpg), marked from 20 to 44]
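The five numbers behind the box plot can be checked with a short Python sketch that follows the same "median of each half" procedure. The function name `five_number_summary` is illustrative. For an odd number of values the sketch leaves the middle value out of both halves, which is an assumption, since the example only covers an even count.

```python
from statistics import median

# MPG of 4-cylinder cars (the 24 values from the example)
mpg = [28, 30, 24, 38, 31, 32, 25, 32, 34, 28, 42, 44,
       33, 30, 31, 37, 38, 44, 44, 29, 39, 29, 32, 29]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the
    median-of-halves method shown in the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: with an odd count, the middle value is excluded from both halves
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


summary = five_number_summary(mpg)
print(summary)
```

Statistical packages often compute quartiles by interpolation instead, so their Q1 and Q3 can differ slightly from the values this method gives.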
OGIVE
An ogive, sometimes called a cumulative line graph, is a line that connects points showing the cumulative frequency (or cumulative percentage) of observations below the upper boundary of each class in a cumulative frequency distribution.
How to Construct Ogives?
1. Make a frequency table showing class boundaries and cumulative frequencies.
2. For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.
3. Place a dot on the horizontal axis at the lower class boundary of the first class.
4. Connect the dots.
Example
Draw the x and y axes, then plot the points.
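To make the plotting step concrete, the sketch below computes the points an ogive would connect, reusing the class boundaries and frequencies from the cigarette example earlier in the deck. The function name `ogive_points` is hypothetical, and only the coordinates are computed; drawing them is left to any plotting tool.

```python
from itertools import accumulate

# Class boundaries and frequencies from the cigarette frequency distribution
first_lower_boundary = 4.5
upper_boundaries = [7.5, 10.5, 13.5, 16.5, 19.5, 22.5]
frequencies = [2, 3, 6, 5, 3, 1]


def ogive_points(first_lower, uppers, freqs):
    """Return (x, y) points: a starting dot on the horizontal axis,
    then one dot per class at (upper boundary, cumulative frequency)."""
    cumulative = list(accumulate(freqs))
    return [(first_lower, 0)] + list(zip(uppers, cumulative))


points = ogive_points(first_lower_boundary, upper_boundaries, frequencies)

# Same points with cumulative percentage on the vertical axis
total = points[-1][1]
percent_points = [(x, 100 * y / total) for x, y in points]

for x, y in points:
    print(f"({x}, {y})")
```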
Pie Chart
Pie graph: a pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
How to make a Pie Chart?
1. Organize your information.
2. Add all the data together to reach a sum.
3. Find the central angle of each piece: (value / sum) x 360 degrees.
4. Use a mathematical compass to draw a circle.
5. Draw a radius.
6. Draw each section division.
7. Color each segment.
Example
A family's weekly expenditure on its house mortgage, food and fuel is as follows:
Draw a pie chart to display the information.
Solution :
We can find what percentage of the total expenditure each item equals. Percentage of weekly expenditure on:
To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
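The percentage and angle calculation can be sketched as follows. The amounts below are hypothetical placeholders chosen only for illustration; they are not the family's figures from the example, and the function name `pie_slices` is also made up.

```python
# HYPOTHETICAL weekly amounts, used only to illustrate the arithmetic
expenditure = {"Mortgage": 300, "Food": 225, "Fuel": 75}


def pie_slices(amounts):
    """Return {item: (percentage of total, central angle in degrees)}."""
    total = sum(amounts.values())
    return {item: (value / total * 100, value / total * 360)
            for item, value in amounts.items()}


slices = pie_slices(expenditure)
for item, (percent, angle) in slices.items():
    print(f"{item:<9} {percent:5.1f}%  {angle:5.1f} degrees")
```

The percentages always add up to 100 and the angles to 360 degrees, which is a useful check before drawing the sections.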
Measures of Central Tendency (Averages)

A measure of central tendency is a univariate statistic that indicates, in one manner or another, the average or typical observed value of a variable in a data set.
Central tendency = values that summarize or represent the majority of scores in a distribution
Three main measures of central tendency (averages):
- Mean
- Median
- Mode
Mode
The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, the value at the point of greatest density).
Note: the mode is the value that is observed most frequently, not the frequency itself. The mode is defined for every type of variable (nominal, ordinal, interval, or ratio).
[Figure: frequency bar chart; vertical axis: Frequency, horizontal axis: DV]
Mode = most frequently occurring data point

Data Point   Frequency
0            2
1            5
2            7
3            14
4            15
5            8
6            5

Mode = 4 (it occurs 15 times, more often than any other data point)
Median
- The middle-most value: 50% of observations are above the median and 50% are below it.
- The difference in magnitude between the observations does not matter; therefore, the median is not sensitive to outliers.
- Formula: median location = (n + 1)/2
Median = the middle number when the data are arranged in numerical order

Data: 3 5 1
Step 1: Arrange in numerical order: 1 3 5
Step 2: Pick the middle number (3)
Data: 3 5 7 11 14 15
Median = (7 + 11)/2 = 9

Data Point   Frequency
0            2
1            5
2            7
3            14
4            15
5            8
6            5

Median location = (N + 1)/2 = (56 + 1)/2 = 28.5
The 28th ordered value is 3 and the 29th is 4, so:
Median = (3 + 4)/2 = 3.5
Mean
The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average, as the term is most commonly used). The mean is defined if and only if the variable is at least interval in nature (interval or ratio).
Mean = Average = ΣX/N

Data Point (X)   Frequency (f)   f x X
0                2               0
1                5               5
2                7               14
3                14              42
4                15              60
5                8               40
6                5               30
Total            56              191

ΣX = 191
Mean = 191/56 = 3.41
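As a cross-check, the three averages can be computed from the frequency table with Python's standard `statistics` module. This is a minimal sketch; the dictionary `freq_table` simply restates the data points 0 to 6 with their frequencies (2, 5, 7, 14, 15, 8, 5).

```python
from statistics import mean, median, mode

# Data point -> frequency, as in the table
freq_table = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}

# Expand the table into the 56 individual observations
values = [x for x, f in freq_table.items() for _ in range(f)]

n = len(values)               # number of cases, N
total = sum(values)           # sum of all observed values
average = mean(values)        # total / n
middle = median(values)       # average of the 28th and 29th ordered values
most_common = mode(values)    # data point with the highest frequency

print(n, total, round(average, 2), middle, most_common)
```

Note that the mode is the data point itself (4), not its frequency (15).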
Advantages and Disadvantages of the Measures: Median
1. Unaffected by extreme scores
   Data: 5 8 11, Median = 8
   Data: 5 8 5 million, Median = 8
2. Usually its value actually occurs in the data
3. But it cannot be entered into equations, because there is no equation that defines it
4. And it is not as stable from sample to sample, because it depends on the number of scores in the sample
Advantages and Disadvantages of the Measures: Mean
1. Defined algebraically
2. Stable from sample to sample
3. But usually does not actually occur in the data
4. And heavily influenced by outliers
   Data: 5 8 11, Mean = 8
   Data: 5 8 5 million, Mean = 1,666,671
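The effect of an outlier on the two measures can be reproduced in a few lines of Python. This is just a sketch of the two small data sets used above, with variable names chosen for illustration.

```python
from statistics import mean, median

typical = [5, 8, 11]
with_outlier = [5, 8, 5_000_000]   # 11 replaced by 5 million

# The median stays at 8, while the mean is pulled toward the outlier
results = {
    "typical": (mean(typical), median(typical)),
    "with_outlier": (mean(with_outlier), median(with_outlier)),
}

for name, (m, med) in results.items():
    print(f"{name:<13} mean = {m:,}  median = {med}")
```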
Measures of Variation
Measures of variation describe how spread out or scattered a set of data is. They are also known as measures of dispersion or measures of spread.
Measures of variation include:
1. The range
2. The variance
3. The standard deviation
The standard deviation is just the square root of the variance.
- Range: the difference between the extreme values (max - min). The actual values (min, max) are reported in the literature more often than the difference.
- Variance: a measure of variation in a sample of data; the mean of the squared deviations of the values from the mean, often referred to as the mean square (MS).
- Standard deviation: the square root of the variance; it measures the amount of variation of values around the mean.
Example
Heights (in inches) of the 5 starting players from basketball team A:
A: 72, 73, 76, 76, 78
The range is the difference between the maximum and minimum values of the data set.
Range of team A: 78 - 72 = 6
The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.
Step 1. Find the mean of the data: $\bar{x} = (72 + 73 + 76 + 76 + 78)/5 = 75$
Step 2. Find the deviation of each score from the mean.

$x_i$   $x_i - \bar{x}$
72      72 - 75 = -3
73      73 - 75 = -2
76      76 - 75 = 1
76      76 - 75 = 1
78      78 - 75 = 3

Note that the sum of the deviations is zero.
Step 3. Square each deviation from the mean. Find the sum of the squared deviations.

$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
72      72 - 75 = -3      9
73      73 - 75 = -2      4
76      76 - 75 = 1       1
76      76 - 75 = 1       1
78      78 - 75 = 3       9
Sum     0                 24
Step 4. The sample variance is determined by dividing the sum of the squared deviations by (n - 1), the number of scores minus one.
For Team A, the sample variance is
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{24}{5 - 1} = 6$
Step 5. The standard deviation is the square root of the variance.

The mathematical formula for the sample standard deviation is
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
The sample standard deviation for Team A is
$s = \sqrt{6} \approx 2.45$
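The five steps can be followed in code as a check on the arithmetic. The sketch below uses a hypothetical helper, `sample_variance`, that divides by n - 1 as described in Step 4; Python's built-in `statistics.variance` uses the same divisor and is included only as a cross-check.

```python
import math
from statistics import variance as stdlib_variance

heights = [72, 73, 76, 76, 78]   # Team A, in inches


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n                            # Step 1
    deviations = [x - mean for x in values]           # Step 2
    squared_sum = sum(d ** 2 for d in deviations)     # Step 3
    return squared_sum / (n - 1)                      # Step 4


data_range = max(heights) - min(heights)
var_a = sample_variance(heights)
std_a = math.sqrt(var_a)                              # Step 5

print(data_range, var_a, round(std_a, 2))
```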
Measures of Position
Measures of position identify the position of a data value in a data set, using measures such as percentiles and quartiles. They:
- Are used to locate the relative position of a data value in a data set
- Can be used to compare data values from different data sets
- Can be used to compare data values within the same data set
- Can be used to help determine outliers within a data set
- Include the z-score (standard score), percentiles, and quartiles
z-scores
- Also called the standard score.
- Represents the number of standard deviations that a data value is from the mean of a specific distribution.
- Always round the value to 2 decimal places.
- Can be used to compare data values from different data sets by converting raw data to a standardized scale.
- Calculation involves the mean and standard deviation of the data set.
- Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation.
- The symbol for both population and sample is z.
- Can be positive, negative, or zero.
- A data point can be considered unusual if its z-score is sufficiently large or small.
Formula
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Example
Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees. Find the z-score for temperatures of:
a. 100 degrees
b. 97 degrees
Solution
a. z = (100 - 98.20)/0.62 = 2.90
b. z = (97 - 98.20)/0.62 = -1.94
Significance of Z
- Z-scores above 2 or below -2 are considered to be UNUSUAL.
- Z-scores above 3 or below -3 are considered to be VERY UNUSUAL.
So:
The temperature of 100 degrees is UNUSUAL.
The temperature of 97 degrees is ordinary.
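The z-score calculation and the "unusual" rule of thumb can be sketched in Python as below. The function names `z_score` and `classify` are illustrative, and the cut-offs of 2 and 3 are the ones stated on the slide.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies from the mean, rounded to 2 places."""
    return round((x - mean) / std_dev, 2)


def classify(z):
    """Apply the slide's rule of thumb for unusual values."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"


MEAN_TEMP = 98.20   # mean body temperature from the example
STD_TEMP = 0.62     # standard deviation from the example

z_100 = z_score(100, MEAN_TEMP, STD_TEMP)
z_97 = z_score(97, MEAN_TEMP, STD_TEMP)

print(z_100, classify(z_100))
print(z_97, classify(z_97))
```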
Percentiles
- Are position measures used to indicate the position of an individual in a group.
- Divide the data set into 100 (per cent) equal groups.
- Are used to compare an individual data value with the national norm.
- Are symbolized by P1, P2, ...
- Percentile rank indicates the percentage of data values that fall below the specified value.
Percentile rank of x = $\frac{B + 0.5E}{n} \times 100$
where B = number of scores below x, E = number of scores equal to x, n = number of scores
A percentile tells the percent of scores that are lower than a given score.
Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:
$\frac{125 + 0.5(1)}{150} \times 100 = 83.67 \approx 84$
Jason's standing in the class, at the 84th percentile, is as high as or higher than that of 84% of the graduates.
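Jason's percentile rank can be reproduced with a short sketch. It assumes the common percentile-rank formula (B + 0.5E)/n x 100, which is consistent with the definitions of B, E and n given above and with the 84th percentile result; the function name `percentile_rank` is illustrative.

```python
def percentile_rank(below, equal, n):
    """Percentile rank = (B + 0.5 * E) / n * 100.
    Assumed formula: B scores below x, E scores equal to x, n scores in total."""
    return (below + 0.5 * equal) / n * 100


# Jason graduated 25th of 150: 125 students ranked below him, 1 score equal (his own)
jason = percentile_rank(below=125, equal=1, n=150)

print(round(jason, 2), "->", round(jason))
```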
Quartiles
Quartiles divide the data set into 4 groups, each of which has the same number of members.
- Q1 corresponds to P25
- Q2 corresponds to P50, or the median
- Q3 corresponds to P75
- Q1, Q2, and Q3 divide the ranked scores into four equal parts
Example
Find Q1, Q2, and Q3.
Q2 (median): the median is the average of the 6th and 7th scores.
Q2 = (80.2 + 82.5)/2 = 81.35
Q1: find the median of the first 6 scores.
Q1 = (78.6 + 79.2)/2 = 78.9
Q3: find the median of the last 6 scores.
Q3 = (84.3 + 84.6)/2 = 84.45
THE END
