Describing, Exploring, and Comparing Data
Introduction to Statistics
Statistics is a branch of mathematics that deals with reducing large masses of data to manageable proportions and with drawing conclusions from those data.
Allows us to draw conclusions from the data
Makes data more manageable
Allows us to do this objectively and quantitatively
Why Statistics?
products and processes.
Build an appreciation for the advantages and limitations of informed observation and experimentation.
Determine how to analyze data from designed experiments in order to build knowledge and continuously improve.
In a frequency distribution, the left column includes numerical intervals (classes) on a variable being studied. The right column is a list of the frequencies, or number of observations, for each class.
A grouped frequency distribution is used when the range of values in the data set is very large. The data must be grouped into classes that are more than one unit in width.
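Procedure for constructing a grouped frequency distribution:
Find the highest and lowest values.
Find the range.
Select the number of classes desired.
Find the class width by dividing the range by the number of classes and rounding up.
Select a starting point (usually the lowest value); add the width to get the lower limits.
Find the upper class limits.
Find the boundaries.
Tally the data, find the frequencies and find the cumulative frequency.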
Example
In a survey of 20 patients who smoked, the following data were obtained. Each value represents the number of cigarettes the patient smoked per day. Construct a frequency distribution using six classes.
10 22 11 13 5
8 13 9 12 11
6 17 18 15 16
14 19 14 15 11
Answer
Step 1: Find the highest and lowest values: H = 22 and L = 5.
Step 2: Find the range: R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes desired. In this case it is equal to 6.
Step 4: Find the class width by dividing the range by the number of classes and rounding up: 17 / 6 ≈ 2.83, which rounds up to 3.
Step 5: Select a starting point for the lowest class limit. For convenience, this value is chosen to be 5, the smallest data value. The lower class limits will be 5, 8, 11, 14, 17 and 20.
Step 6: The upper class limits will be 7, 10, 13, 16, 19 and 22.
Step 7: Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each upper class limit.
Class Limits   Class Boundaries   Frequency   Cumulative Frequency
5 to 7         4.5 - 7.5          2           2
8 to 10        7.5 - 10.5         3           5
11 to 13       10.5 - 13.5        6           11
14 to 16       13.5 - 16.5        5           16
17 to 19       16.5 - 19.5        3           19
20 to 22       19.5 - 22.5        1           20
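The steps above can be sketched in Python. This is a minimal illustration, not a general-purpose routine: the function name `frequency_distribution` is made up for this example, and it assumes whole-number data with the lowest value used as the starting point.

```python
import math

# Cigarettes smoked per day by the 20 surveyed patients.
data = [10, 22, 11, 13, 5, 8, 13, 9, 12, 11,
        6, 17, 18, 15, 16, 14, 19, 14, 15, 11]


def frequency_distribution(values, num_classes):
    """Return rows of (lower, upper, lower boundary, upper boundary,
    frequency, cumulative frequency). Assumes integer data."""
    low, high = min(values), max(values)
    # Class width: range divided by number of classes, rounded up.
    width = math.ceil((high - low) / num_classes)
    rows = []
    cumulative = 0
    for k in range(num_classes):
        lower = low + k * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows


table = frequency_distribution(data, 6)
for row in table:
    print(row)
```

The printed frequencies and cumulative frequencies should match the table in the worked answer.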
Histogram
What is a histogram?
It is "a representation of a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies".
A histogram is like a bar chart, but there are some important differences:
It can only be used to show continuous data.
It can only be used to show numerical data.
The data is always grouped.
So the width of a bar represents a quantitative variable x, such as age, rather than a category. The height of each bar indicates frequency.
How is a real histogram made?
Example: consider the set below.
{3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49}
A graph which shows how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as below.
Bin = the class size (width of the rectangles) in a histogram.
SOLUTION
With a bin width of 10:
Data Range   Frequency
0-10         1
10-20        3
20-30        6
30-40        4
40-50        2
Note: Changing the size of the bin changes the appearance of the graph.
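A short Python sketch of the binning step follows. The helper name `bin_counts` is invented for this illustration, and it assumes each bin includes its lower edge and excludes its upper edge (the slide's labels such as "10-20" do not say which convention is meant). Empty bins are simply omitted.

```python
from collections import Counter

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 31, 35, 36, 37, 45, 49]


def bin_counts(values, bin_width):
    """Count values per bin [start, start + bin_width).

    Assumption: bins start at multiples of bin_width and are half-open.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {(start, start + bin_width): counts[start] for start in sorted(counts)}


histogram = bin_counts(data, 10)
wider = bin_counts(data, 20)  # same data, different bin width

for (start, end), freq in histogram.items():
    print(f"{start}-{end}: {'#' * freq}")
```

Comparing `histogram` with `wider` shows how the choice of bin width changes the shape of the graph.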
Histogram shapes
Box plot
A box plot (also called a box-and-whisker diagram) is a diagram showing the distribution of a data set.
A box plot summarizes data using the median, the upper and lower quartiles, and the extreme (least and greatest) values. It allows you to see important characteristics of the data at a glance.
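[Data table: miles per gallon of 24 four-cylinder cars, not fully reproduced here; the ordered values are listed in the next step.]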
Then find the median of the data. It is 32. This divides the data in half.
The lower half: 24 25 28 28 29 29 29 30 30 31 31 32
The upper half: 32 32 33 34 37 38 38 39 42 44 44 44
Find the median of the upper half of the data: 32 32 33 34 37 38 38 39 42 44 44 44. This is called the high median, upper quartile or quartile 3. Q3 = 38.
Take the lower half of the data and find its median: 24 25 28 28 29 29 29 30 30 31 31 32. This is called the low median, lower quartile or quartile 1. Q1 = 29.
Next, find the lowest value, 24, and the highest value, 44.
Let's organize all 5 pieces of data together:
Lower extreme = 24
Lower quartile (Q1) = 29
Median (Q2) = 32
Upper quartile (Q3) = 38
Upper extreme (Q4) = 44
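The median-of-halves procedure can be sketched in Python as follows. The function name `five_number_summary` is illustrative, and the treatment of an odd number of values (leaving the middle value out of both halves) is an assumption, since the slides only show an even-sized data set. Other quartile conventions exist and can give slightly different results.

```python
from statistics import median

# Ordered miles-per-gallon values from the example (lower half, then upper half).
mpg = [24, 25, 28, 28, 29, 29, 29, 30, 30, 31, 31, 32,
       32, 32, 33, 34, 37, 38, 38, 39, 42, 44, 44, 44]


def five_number_summary(values):
    """Return (lower extreme, Q1, median, Q3, upper extreme).

    Q1 and Q3 are the medians of the lower and upper halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value (assumed convention).
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


summary = five_number_summary(mpg)
print(summary)
```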
Next, make a number line that will best display the 5 pieces of data (24 ,29 , 32 ,38, 44)
20 24 28 32 36 40 44
Mark a point above the number line for the lower extreme and one for the upper extreme. Put a vertical slash above the number line for the median and one each for the lower and upper quartiles.
Enclose the vertical slashes in a box. Draw a line from the right side of the box to the upper extreme and one from the left side of the box to the lower extreme, forming the whiskers.
All graphs must have a title that clearly represents what your graph is showing
[Figure: box plot titled "Miles per Gallon of 4-cylinder Cars", drawn over a number line marked 20, 24, 28, 32, 36, 40, 44]
OGIVE
An ogive, sometimes called a cumulative line graph, is a line that connects points that are the cumulative percentage of observations below the upper limit of each class in a cumulative frequency distribution.
How to construct ogives:
Make a frequency table showing class boundaries and cumulative frequencies.
For each class, put a dot over the upper class boundary at the height of the cumulative class frequency.
Place a dot on the horizontal axis at the lower class boundary of the first class.
Connect the dots.
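As a rough sketch, the points of an ogive can be computed in Python from class boundaries and frequencies. The values below reuse the cigarette-survey frequency distribution from earlier in these notes; the function name `ogive_points` is invented for this illustration.

```python
from itertools import accumulate

# Class boundaries and frequencies from the cigarette-survey example.
boundaries = [4.5, 7.5, 10.5, 13.5, 16.5, 19.5, 22.5]
frequencies = [2, 3, 6, 5, 3, 1]


def ogive_points(boundaries, frequencies):
    """Return (x, y) points: a zero at the first lower boundary, then the
    cumulative frequency at each upper class boundary."""
    cumulative = list(accumulate(frequencies))
    return [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))


points = ogive_points(boundaries, frequencies)
total = points[-1][1]
# The same points as cumulative percentages instead of counts.
percent_points = [(x, 100 * y / total) for x, y in points]
print(points)
```

Connecting these points in order with straight line segments gives the ogive.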
Example
Pie Chart
Pie graph - A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
How to make a pie chart:
1. Organize your information
2. Add the data all together and reach a sum
3. Find the angle of each piece: its share of the total multiplied by 360 degrees
4. Use a mathematical compass to draw a circle
5. Draw the radius
6. Draw each section division
7. Color each segment
Example
A family's weekly expenditure on its house mortgage, food and fuel is as follows:
Solution :
We can find what percentage of the total expenditure each item equals. Percentage of weekly expenditure on:
To draw a pie chart, divide the circle into 100 percentage parts. Then allocate the number of percentage parts required for each item.
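The percentage and angle calculations can be sketched in Python. The expenditure amounts below are hypothetical placeholders chosen for illustration, because the family's actual figures are not reproduced in these notes; the function name `pie_slices` is also made up.

```python
# Hypothetical weekly amounts (NOT the figures from the original example).
expenditure = {"Mortgage": 300, "Food": 225, "Fuel": 75}


def pie_slices(amounts):
    """Map each category to (percentage of total, central angle in degrees)."""
    total = sum(amounts.values())
    return {name: (100 * value / total, 360 * value / total)
            for name, value in amounts.items()}


slices = pie_slices(expenditure)
for name, (percent, angle) in slices.items():
    print(f"{name}: {percent:.1f}% of the circle, {angle:.1f} degrees")
```

The percentages always sum to 100 and the angles to 360, which is a quick check on the arithmetic before drawing.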
A measure of central tendency indicates, in one manner or another, the average or typical observed value of a variable in a data set.
Central tendency = values that summarize or represent the majority of scores in a distribution.
Three main measures of central tendency:
Mean
Median
Mode
Averages
Mode
The mode (or modal value) of a variable in a set of data is the value of the variable that is observed most frequently in that data (or, given a continuous frequency curve, the value at the highest point of the curve).
Note that the mode is the value that is observed most frequently, not the frequency itself.
The mode is defined for every type of variable [i.e., nominal, ordinal, interval, or ratio].
[Figure: bar chart of Frequency (axis 0 to 40) against DV]
DV   Frequency
0    2
1    5
2    7
3    14
4    15
5    8
6    5
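Finding the mode from a frequency table can be sketched in Python. The table values below are read from the slide's frequency table, whose layout was damaged in extraction, so treat them as an assumption; the function name `modes` is illustrative.

```python
# Value -> frequency, as read from the slide's table (assumed reconstruction).
frequency_table = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}


def modes(table):
    """Return every value whose frequency equals the highest frequency."""
    top = max(table.values())
    return [value for value, freq in table.items() if freq == top]


mode_values = modes(frequency_table)
print(mode_values)  # the value(s), not the frequency 15
```

Returning a list allows for data sets with more than one mode.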
Median
Middle-most value
50% of observations are above the median, 50% are below it.
The difference in magnitude between the observations does not matter. Therefore, it is not sensitive to outliers.
Formula: Median location = (n + 1) / 2
Median location = (N + 1) / 2 = (56 + 1) / 2 = 28.5
Median = (3 + 4) / 2 = 3.5
Data Point   Frequency
0            2
1            5
2            7
3            14
4            15
5            8
6            5
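The median-location calculation can be sketched in Python by walking the cumulative frequencies. The table values are read from the slide's damaged table and are an assumption, though they do add up to N = 56; the function name `median_from_table` is invented for this example.

```python
# Value -> frequency, as read from the slide's table (assumed reconstruction).
frequency_table = {0: 2, 1: 5, 2: 7, 3: 14, 4: 15, 5: 8, 6: 5}


def median_from_table(table):
    """Median of data given as a value -> frequency table."""
    n = sum(table.values())
    # 1-based positions of the middle value(s); equal when n is odd.
    positions = [(n + 1) // 2, (n + 2) // 2]
    found = []
    for position in positions:
        running = 0
        for value in sorted(table):
            running += table[value]
            if running >= position:
                found.append(value)
                break
    return sum(found) / 2


n = sum(frequency_table.values())
location = (n + 1) / 2
result = median_from_table(frequency_table)
print(n, location, result)
```

With 56 observations the location 28.5 falls between the 28th value (a 3) and the 29th value (a 4), so the median is their average.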
Mean
The mean (or mean value) of a variable in a set of data is the result of adding up all the observed values of the variable and dividing by the number of cases (the average, as the term is most commonly used).
The mean is defined if and only if the variable is at least interval in nature [i.e., interval or ratio].
Advantages and Disadvantages of the Measures: Median
1. Unaffected by extreme scores
   Data: 5 8 11, Median = 8
   Data: 5 8 5 million, Median = 8
2. Usually its value actually occurs in the data
3. But cannot be entered into equations, because there is no equation that defines it
4. And not as stable from sample to sample, because it depends on the number of scores in the sample
Advantages and Disadvantages of the Measures: Mean
1. Defined algebraically
2. Stable from sample to sample
3. But usually does not actually occur in the data
4. And heavily influenced by outliers
   Data: 5 8 11, Mean = 8
   Data: 5 8 5 million, Mean = 1,666,671
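The effect of an outlier on the two measures can be checked with Python's standard `statistics` module. This is only a quick demonstration using the two small data sets from the lists above; the variable names are illustrative.

```python
from statistics import mean, median

typical = [5, 8, 11]
with_outlier = [5, 8, 5_000_000]

# Both measures agree on the well-behaved data.
mean_typical = mean(typical)
median_typical = median(typical)

# One extreme score drags the mean far away but leaves the median alone.
mean_outlier = mean(with_outlier)
median_outlier = median(with_outlier)

print(mean_typical, median_typical)
print(mean_outlier, median_outlier)
```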
Measures of Variation
A measure of variation describes how spread out or scattered a set of data is. Such measures are also known as measures of dispersion or measures of spread.
Measures of Variation include:
1. The Range
2. The Variance
3. The Standard Deviation
Range: the difference between the maximum and minimum values; the actual values (min, max) are most often reported in the literature rather than the difference.
Variance: a measure of variation in a sample of data; the mean of the squared deviations of the values from the mean, often referred to as the mean square or MS.
Standard deviation: the square root of the variance; measures the amount of variation of values around the mean.
Example
Heights (in inches) of 5 starting players from basketball team A:
A: 72, 73, 76, 76, 78
The range is the difference between the maximum and minimum values of the data set.
Range of team A: 78 - 72 = 6
The sample standard deviation takes into account all data values. The following procedure is used to find the sample standard deviation.
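Step 1. Find the mean: x̄ = (72 + 73 + 76 + 76 + 78) / 5 = 75.
Step 2. Find the deviation of each value from the mean.
xi    xi - x̄
72    72 - 75 = -3
73    73 - 75 = -2
76    76 - 75 = 1
76    76 - 75 = 1
78    78 - 75 = 3
Note that the sum of the deviations is zero: (-3) + (-2) + 1 + 1 + 3 = 0.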
Step 3. Square each deviation from the mean. Find the sum of the squared deviations.
xi    xi - x̄          (xi - x̄)^2
72    72 - 75 = -3    9
73    73 - 75 = -2    4
76    76 - 75 = 1     1
76    76 - 75 = 1     1
78    78 - 75 = 3     9
Sum   0               24
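Step 4. The sample variance is determined by dividing the sum of the squared deviations by (n - 1), the number of scores minus one: 24 / (5 - 1) = 6.
Step 5. The sample standard deviation is the square root of the sample variance: √6 ≈ 2.45 inches.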
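The procedure can be followed step by step in Python. This is a minimal sketch for the team A heights; the function name `sample_variance` is illustrative, and the standard library's `statistics.variance` is used only as a cross-check.

```python
import math
from statistics import variance as library_variance

heights = [72, 73, 76, 76, 78]  # team A, in inches


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n                      # Step 1: the mean
    deviations = [x - mean for x in values]     # Step 2: deviations
    squared_sum = sum(d * d for d in deviations)  # Step 3: sum of squares
    return squared_sum / (n - 1)                # Step 4: divide by n - 1


var_a = sample_variance(heights)
std_a = math.sqrt(var_a)  # standard deviation = square root of the variance
print(var_a, round(std_a, 2))
```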
Measures of Position
Identify the position of a data value in a data set, using various measures of position such as percentiles and quartiles.
Measures of position:
Are used to locate the relative position of a data value in a data set
Can be used to compare data values from different data sets
Can be used to compare data values within the same data set
Can be used to help determine outliers within a data set
Include the z-(standard) score, percentiles and quartiles
z-scores
Also called the standard score
Always round the value to 2 decimal places
Can be used to compare data values from different data sets by converting raw data to a standardized scale
Calculation involves the mean and standard deviation of the data set
Represents the number of standard deviations that a data value is from the mean for a specific distribution
Z -score
Is obtained by subtracting the mean from the given data value and dividing the result by the standard deviation
The symbol for BOTH population and sample is z
Can be positive, negative or zero
A data point can be considered unusual if its z-score is sufficiently large or small
Formula
Sample: z = (x - x̄) / s
Population: z = (x - μ) / σ
Example
Human body temperatures have a mean of 98.20 degrees and a standard deviation of 0.62 degrees. Find the z score for temperatures of: a. 100 degrees b. 97 degrees
Significance of Z
Z scores above 2 or below -2 are considered to be UNUSUAL.
Z scores above 3 or below -3 are considered to be VERY UNUSUAL.
So the temperature of 100 degrees (z ≈ 2.90) is UNUSUAL.
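The body-temperature example can be worked in Python as a short sketch. The function names `z_score` and `describe` are made up for this illustration, and the label "ordinary" for scores between -2 and 2 is an assumption, since the slides only name the unusual cases.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def describe(z):
    """Apply the rule of thumb: beyond 2 is unusual, beyond 3 very unusual."""
    if abs(z) > 3:
        return "very unusual"
    if abs(z) > 2:
        return "unusual"
    return "ordinary"  # label assumed; not named in the slides


MEAN_TEMP = 98.20
STD_TEMP = 0.62

z_100 = round(z_score(100, MEAN_TEMP, STD_TEMP), 2)
z_97 = round(z_score(97, MEAN_TEMP, STD_TEMP), 2)

print(z_100, describe(z_100))
print(z_97, describe(z_97))
```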
Percentiles
Are position measures used to indicate the position of an individual in a group
Divide the data set into 100 (per cent) equal groups
Used to compare an individual data value with the national norm
Symbolized by P1, P2, ...
Percentile rank indicates the percentage of data values that fall below the specified rank
than a given score.
Example: If Jason graduated 25th out of a class of 150 students, then 125 students were ranked below Jason. Jason's percentile rank would be:
Jason's standing in the class at the 84th percentile is as high as or higher than 84% of the graduates.
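The formula used on the original slide is not reproduced in these notes, so the Python sketch below shows two common conventions as assumptions, not as the slide's own method. Both function names are invented, and both conventions arrive at the 84th percentile for Jason.

```python
def percentile_rank_inclusive(below, total):
    """Assumed convention: count the individual with those ranked below."""
    return 100 * (below + 1) / total


def percentile_rank_midpoint(below, total):
    """Assumed convention: count half of the individual's own position."""
    return 100 * (below + 0.5) / total


students_below = 150 - 25  # 125 students ranked below Jason
class_size = 150

inclusive = percentile_rank_inclusive(students_below, class_size)
midpoint = percentile_rank_midpoint(students_below, class_size)

print(inclusive)          # exactly 84
print(round(midpoint))    # rounds to 84
```

The inclusive version matches the wording "as high as or higher than 84% of the graduates", because it counts Jason himself.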
Quartiles
Quartiles divide the data set into 4 groups, each of which has the same number of members.
Q1 corresponds to P25
Q2 corresponds to P50, or the median
Q3 corresponds to P75
Q1, Q2 and Q3 divide the ranked scores into four equal parts
Example
Find : Q1,Q2,Q3 ?
Q2(Median)
The median is the
Q2= 81.35
Q1
Find the median of
Q3
Find the median
THE END