Professional Documents
Culture Documents
Data Description
Data Description
INTRODUCTION
This chapter shows the statistical methods that can be used to summarize data. The most familiar
of these methods is the finding of averages.
ENENDA30 – ENGINEERING DATA ANALYSIS
The word average is ambiguous, since several different methods can be used to obtain an average.
Lesson 3: Data Description Loosely stated, the average means the center of the distribution or the most typical case. Measures
of average are also called measures of central tendency and include the mean, median, mode,
and midrange.
Measures of Central Tendency (Mean, Median, Mode, Midrange)
In addition to knowing the average, you must know how the data values are dispersed. The
Distribution Shapes
measures that determine the spread of the data values are called measures of variation, or
Measures of Variation (Range, Variance, Standard Deviation) measures of dispersion. These measures include the range, variance, and standard deviation.
Chebyshev’s Theorem and The Empirical Rule
Measures of Position (z-score, Percentiles, Quartiles, Interquartile Range, Finally, another set of measures is necessary to describe data. These measures are called measures
Deciles) of position. They tell where a specific data value falls within the data set or its relative position in
Exploratory Data Analysis (Boxplot) comparison with other data values. The most common position measures are percentiles, deciles,
and quartiles. These measures are used extensively in psychology and education. Sometimes they
are referred to as norms.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
Rounding Rule for the Mean: Rounding Rule for the Mean:
The mean should be rounded to one more decimal place than occurs in the raw data. For example, if the raw data are given in The mean should be rounded to one more decimal place than occurs in the raw data. For example, if the raw data are given in
whole numbers, the mean should be rounded to the nearest tenth. whole numbers, the mean should be rounded to the nearest tenth.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
The procedure for finding the mean for grouped data assumes that the mean of all the raw data
values in each class is equal to the midpoint of the class.
In reality, this is not true, since the average of the raw data values in each class usually will not be
exactly equal to the midpoint. However, using this procedure will give an acceptable
approximation of the mean, since some values fall above the midpoint and other values fall below
the midpoint for each class, and the midpoint represents an estimate of all values in the class.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
The median is the midpoint of the data array. The symbol for the median is MD.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
NOTE:
NOTE: The mode is the only measure of central tendency that can be used in finding the most typical case when the data are nominal
The mode for grouped data is the modal class. The modal class is the class with the largest frequency. or categorical.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
NOTE:
In statistics, several measures can be used for an average. The most common measures are the mean, median, mode, and
midrange. Each has its own specific purpose and use.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
Consider the previous example, find the ranges for the paints.
For brand A, the range is
R = 60 − 10 = 50 months
For brand B, the range is
R = 45 − 25 = 20 months
CONCLUSION:
The range for these data is quite large since it depends on the highest data value and the lowest data value. To have a more
Conclusion: meaningful statistic to measure the variability, statisticians use measures called the variance and standard deviation.
The range for brand A shows that 50 months separate the largest data value from the smallest data value. For brand B,
20 months separate the largest data value from the smallest data value, which is less than one-half of brand A’s range. NOTE:
Make sure the range is given as a single number.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
It is necessary to know what data variation means. It is based on the difference or distance each data Step 1 Find the mean for the data. Step 5 Divide by N to get the variance.
value is from the mean. This difference or distance is called a deviation. σ𝑿 σ 𝑿−𝝁 𝟐
𝝁= 𝝈𝟐 =
The population variance is the average of the squares of the distance each value is from the 𝑵 𝑵
Step 2 Find the deviation for each data value. Step 6 Take the square root of the variance to
mean. The symbol for the population variance is 𝜎 2 and the formula is:
𝑿−𝝁 get the standard deviation.
σ 𝑿−𝝁 𝟐
𝝈𝟐 = Step 3 Square each of the deviations. 𝟐
𝑵 σ 𝑿−𝝁
𝑿−𝝁 2 𝝈=
where: X individual value 𝜇 population mean N population size Step 4 Find the sum of the squares. 𝑵
𝑿−𝝁 𝟐
The population standard deviation is the square root of the variance. The symbol for the
population standard deviation is 𝜎 and its formula is:
σ 𝑿−𝝁 𝟐
𝝈= 𝝈𝟐 = ROUNDING RULE FOR THE STANDARD DEVIATION:
𝑵 The rounding rule for the standard deviation is the same as that for the mean. The final answer should be rounded to one more
decimal place than that of the original data.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
NOTE:
Therefore, instead of dividing by n, find the variance of the sample by dividing by n − 1, giving a The procedure for finding the sample variance and the sample standard deviation is the same as the procedure for finding the
slightly larger value and an unbiased estimate of the population variance. population variance and the population standard deviation except the sum of the squares is divided by n – 1 (sample size
minus 1) instead of N (population size).
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
SOLUTION: 𝒏 σ𝑿 𝟐
− σ𝑿 𝟐 𝒏 σ 𝑿𝟐 − σ 𝑿 𝟐
𝒔𝟐 = 𝒔=
𝒏(𝒏−𝟏) 𝒏(𝒏−𝟏)
NOTE:
σ 𝑿𝟐 is not the same as σ 𝑿 𝟐 . The notation σ 𝑿𝟐 means to square the values first, then sum; σ 𝑿 𝟐
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
SOLUTION: 𝒏 σ 𝒇 ∙ 𝑿𝒎𝟐 − σ 𝒇 ∙ 𝑿𝒎 𝟐 𝒏 σ 𝒇 ∙ 𝑿 𝒎𝟐 − σ 𝒇 ∙ 𝑿 𝒎 𝟐
𝒔𝟐 = 𝒔=
𝒏(𝒏−𝟏) 𝒏(𝒏−𝟏)
Where 𝑋𝑚 is the midpoint of each class and f is the frequency of each class.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
Step 2 Multiply the frequency by the midpoint for each class, and place the products in column D.
Step 3 Multiply the frequency by the square of the midpoint, and place the products in column E.
Step 4 Find the sums of columns B, D, and E. (The sum of column B is n. The sum of column D is
σ 𝑓 ∙ 𝑋𝑚 . The sum of column E is. σ 𝑓 ∙ 𝑋 𝑚2 )
Step 5 Substitute in the formula and solve to get the variance.
𝑛 σ 𝑓 ∙ 𝑋𝑚 2 − σ 𝑓 ∙ 𝑋𝑚 2
𝑠2 =
𝑛(𝑛 − 1)
Step 6 Take the square root to get the standard deviation.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
• Chebyshev’s theorem specifies the proportions of the spread in terms of the standard deviation.
• The proportion of values from a data set that will fall within ‘k’ standard deviations of the mean
1
will be at least 1 − 𝑘 2 , where ‘k’ is a number greater than 1 (k is not necessarily an integer).
In summary, then, Chebyshev’s theorem states:
At least three-fourths, or 75%, of all data values fall within 2 standard deviations of the mean.
At least eight-ninths, or 89%, of all data values fall within 3 standard deviations of the mean.
This theorem can be applied to any distribution regardless of its shape.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
NOTE: Chebyshev’s theorem can be used to find the minimum percentage of data values that will fall between any
two given values.
PROCEDURE:
1. Subtract the mean from the larger value.
2. Divide the difference by the standard deviation to get k.
3. Use Chebyshev’s theorem to find the percentage.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
NOTE:
Because the empirical rule requires that the distribution be approximately bell-shaped, the results are
more accurate than those of Chebyshev’s theorem, which applies to all distributions.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
The z score represents the number of standard deviations that a data value falls above or below the mean.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
The percentile corresponding to a given value X is computed by using the following formula:
Note that a blood pressure of 130 corresponds to
𝒏𝒖𝒎𝒃𝒆𝒓 𝒐𝒇 𝒗𝒂𝒍𝒖𝒆𝒔 𝒃𝒆𝒍𝒐𝒘 𝑿 + 𝟎. 𝟓 approximately the 70th percentile. Also, the 40th
𝑷𝒆𝒓𝒄𝒆𝒏𝒕𝒊𝒍𝒆 = percentile corresponds to a value of approximately
𝒕𝒐𝒕𝒂𝒍 𝒏𝒖𝒎𝒃𝒆𝒓 𝒐𝒇 𝒗𝒂𝒍𝒖𝒆𝒔 118. Thus, if a person has a blood pressure of 118,
he or she is at the 40th percentile.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022
DECILES
Deciles divide the distribution into 10 groups, as shown. They are denoted by 𝐷1, 𝐷2, 𝐷3, etc.
Note that 𝐷1 corresponds to 𝑃10; 𝐷2 corresponds to 𝑃20; etc. Deciles can be found by using the
formulas given for percentiles.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
Check the following data set for outliers. The measure of central tendency used in EDA is the median.
5, 6,12, 13, 15, 18, 22, 50 The measure of variation used in EDA is the interquartile range.
The data are represented graphically using a boxplot (sometimes called a box and whisker plot).
The purpose of exploratory data analysis is to examine data to find out what information can be
discovered about the data, such as the center and the spread.
A boxplot can be used to graphically represent the data set. The five-number summary consists of:
1. The lowest value of the data set (i.e., minimum)
2. 𝑄1
3. The median
4. 𝑄3
5. The highest value of the data set (i.e., maximum)
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
Step 3 Locate the lowest data value, 𝑄1, the median, 𝑄3, and the highest data value; then draw a SOLUTION:
box whose vertical sides go through 𝑄1 and 𝑄3. Draw a vertical line through the median.
Finally, draw a line from the minimum data value to the left side of the box, and draw a line
from the maximum data value to the right side of the box.
No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical No part of this material may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law. methods, without the prior written permission of the owner, except for personal academic use and certain other noncommercial uses permitted by copyright law.
12/20/2022