Chapter 3 mgsc
Chapter 3: Summarizing Data
When confronted with a large mass of information, we usually start by trying to condense it.
Summarizing data may start with grouping similar values together, but we will also want to
find simple “landmark” values that describe important characteristics of the data set.
However, before summarizing the data, it is advisable to simply explore the data to see what is
there. Play with the data as if you were in a sandbox, before you start forming opinions about
the data and prematurely drawing conclusions.
If your data is in a Table, then there are some functions that make basic exploration easy. In
the header row at the top of the Table, each variable name is followed with a down arrow. If
you click on it you can see that you can sort or filter your data. The first two sorts (smallest to
largest and largest to smallest for numeric variables and A to Z vs Z to A for text) are self-
explanatory.
Figure 3-1 Sorting and Filtering a numeric variable in a Table
Figure 3-2 Sorting and Filtering a text variable in a Table
The third is labeled Sort by Color. This is a custom sort. You can sort at multiple levels. That is,
sort students by their Program, then within each Program you can sort by Gender, then among
each Gender-Program combination you can sort by GPA,… Note that the whole file is sorted,
not simply the variable that you selected. Normally in Excel, you have to select the data you
wish to sort.
Figure 3-3 Custom Sort
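For readers who also work outside Excel, the idea of a multi-level sort can be sketched in Python. This is only an illustration: the five student records below are hypothetical, and the field names simply mirror the column names used in this chapter.

```python
# Hypothetical student records; field names mirror the chapter's columns.
students = [
    {"ID": 1, "Program": "Business", "Gender": "Female", "GPA": 3.1},
    {"ID": 2, "Program": "Arts", "Gender": "Male", "GPA": 3.6},
    {"ID": 3, "Program": "Business", "Gender": "Female", "GPA": 2.8},
    {"ID": 4, "Program": "Arts", "Gender": "Female", "GPA": 3.9},
    {"ID": 5, "Program": "Business", "Gender": "Male", "GPA": 3.4},
]

# Sort whole rows by Program, then Gender within Program, then GPA
# within each Gender-Program combination (all ascending).
by_levels = sorted(
    students,
    key=lambda row: (row["Program"], row["Gender"], row["GPA"]),
)

# The IDs show that entire records moved together, not just one column.
sorted_ids = [row["ID"] for row in by_levels]
print(sorted_ids)
```

As in the Table, every row travels with its sort keys, so the record for each student stays intact.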
Suppose that we want to look just at the Business students. We can click on the down arrow
by Program and uncheck all programs except Business.
Maybe we want only the International Business students. Click on the down arrow
by Home and select International.
We can continue selecting just Females or just 1st year students. We can also use more
complex text filters. Or we can select just those that expect to earn over $50,000, or between
$40,000 and $80,000…
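The same stacking of filters can be sketched in Python. The records are hypothetical, and "Domestic" is an assumed label for non-International students that does not come from the data set.

```python
# Hypothetical records; "Domestic" is an assumed value for the Home column.
students = [
    {"Program": "Business", "Home": "International", "Salary": 45000},
    {"Program": "Business", "Home": "Domestic", "Salary": 90000},
    {"Program": "Arts", "Home": "International", "Salary": 60000},
    {"Program": "Business", "Home": "International", "Salary": 30000},
]

# First filter: keep only International Business students.
intl_business = [
    s for s in students
    if s["Program"] == "Business" and s["Home"] == "International"
]

# Second filter on top of the first: salary between $40,000 and $80,000.
in_range = [s for s in intl_business if 40000 <= s["Salary"] <= 80000]

print(len(intl_business), [s["Salary"] for s in in_range])
```

Each filter narrows the rows left by the previous one, which is how successive filters behave in a Table.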
We can also Sort the data in a similar fashion. If you select Sort by Color you can then choose
Custom Sort, which allows you to sort by several variables concurrently.
When first exploring data, doing a simple sort on each numeric variable can be insightful. For
example, sort Salary, either smallest to largest or largest to smallest.
Look at the smallest values. Someone expects to earn $0? $500? $9,450?
Are these students crazy or are these data entry errors? Looking at the extremes will give you
some sense of data quality issues. Should we keep these values when we do our analyses?
Maybe the $650,000 was supposed to be $65,000, but do we know for sure? Maybe we should
investigate this record more closely. Are there other responses for this student that may give
us insights? Maybe the student gave all erroneous answers and we should discard all
responses for the student.
When you are done sorting, it is recommended that you return your data to the original sort
order. Initially, the data was likely sorted by the record ID. Re-sort the Table by ID, smallest to
largest. If your data set does not have an ID column, it is recommended that you create one.
As we explore our data and come to understand it, we may start asking ourselves questions
about it.
Turn on the Table's Total Row (on the Table Design tab, check Total Row). A row will be added
to the bottom of the table. Scroll to the last row. Select the entry in the
Salary column. A down arrow should appear. Click on the arrow.
Select Count Numbers. 752 students reported a salary value. Look at Average, Max and Min.
The average salary was $52,914.76, with a maximum of $650,000 and a minimum of $0. The
$650,000 appears unrealistic. We noted before when Sorting, that some students gave values
that we thought were suspicious.
Go to the down arrow by Salary in the header row and select Number Filters.
Select Between and then select 20000 and 120000 as your limits. You now get an average of
$51,480.40. It would appear that the extremes at each end did not seriously distort our
average.
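A minimal Python sketch of the same comparison follows. The salary values are hypothetical and deliberately few, so the single $650,000 entry moves the mean far more here than it does among the 752 responses in our data set.

```python
from statistics import mean

# Hypothetical responses; None stands for a blank (unanswered) cell.
salaries = [0, 500, 9450, 40000, None, 50000, 60000, 65000, 650000]

# "Count Numbers" and "Average" both skip blanks.
reported = [s for s in salaries if s is not None]
count_numbers = len(reported)
overall_average = mean(reported)

# Number Filter "Between 20000 and 120000" keeps both end points.
kept = [s for s in reported if 20000 <= s <= 120000]
filtered_average = mean(kept)

print(count_numbers, overall_average, filtered_average)
```

Comparing the two averages is a quick check of how much the extreme values matter.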
Click on the Total row cell for ID (row 813, cell A813). Click on the arrow and select Count.
811 students attempted the survey. This is all of the students. Click on the Total cell for Year.
802 told us the year they started. Repeat this for Home and for Home2. Why is the count 811
for Home and 801 for Home2? When we applied VLOOKUP to Home, blanks were coded as
#N/A. This may be a problem.
What happens when you look at Gender and Gender2? 811 and 665? Why? Initially, we used
VLOOKUP to code 0 and 1 as Male and Female, but VLOOKUP treated a blank as zero = Male.
We added an IF statement to treat blanks as blanks. But, Excel coded blanks as blank text. So
the cells are no longer empty and they were counted.
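The difference between a truly empty cell and a cell holding blank text can be sketched in Python. This is an analogy rather than Excel's own logic: None plays the role of an empty cell, the empty string plays the role of blank text, and the five raw values are hypothetical.

```python
# Hypothetical raw codes: 0 = Male, 1 = Female, None = truly empty cell.
gender_raw = [0, 1, None, 1, None]
labels = {0: "Male", 1: "Female"}

# Recoding with an IF-style rule: blanks become blank TEXT (""), not empty.
gender2 = [labels[g] if g is not None else "" for g in gender_raw]

# A count of non-empty cells still counts "" because the cell holds text.
count_nonempty_cells = sum(1 for g in gender2 if g is not None)

# Counting only real answers requires excluding the blank text explicitly.
count_real_answers = sum(1 for g in gender2 if g != "")

print(count_nonempty_cells, count_real_answers)
```

The two counts differ by exactly the number of blanks that were converted to blank text.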
Question: why did 802 tell us when they started but only 665 tell us gender?
Are salary expectations of Business students similar to those of Arts students? We know that
we have some strange salaries reported, so let us Filter on those between $20,000 and
$120,000 (note: do not type $ signs or commas; enter 20000 and 120000). The average salary is
$51,480.80.
Now filter on Program and select Business. The average is $52,463.20. Filter on Program and
select Arts. The average is $48,185. If you filter on Science you will see they have the highest
expectations, at $57,774.77.
Although tedious, you can answer many simple questions with the Total row and Filters.
Suppose you take our breakdown of salaries from very low to unrealistic and count how many
fit each description.
Table 3-1
- Proposed grouping for Expected Salary data
Usually, frequency distributions do not have text descriptions, just the ranges of values.
Building frequency distributions by using VLOOKUP to categorize values and then having to
count values is clumsy.
Excel has functions to build the distribution and to graph it. Some are antiquated, but
commonly used, and others are more flexible. We will look at the most common one in this
chapter and an alternative in the next chapter.
If you are using Office 365 in the cloud, then installation is somewhat more complicated. The
Analysis Toolpak is not included in 365. Go to Insert and then select Add-Ins.
Figure 3-8 Excel 365 Add-ins
Select STORE from the options listed at the top of the pop-up screen. From the list of
categories at the left, select Data Analytics. Scroll down the list of suggestions until you
see XLMiner.
Figure 3-9 XLMiner Add-in
The XLMiner add-in is free and behaves in a very similar way to the Analysis Toolpak.
If you are using a Mac, then you must go through a similar process to the above. You may need
to google to find installation instructions.
Since the end of one bin is the start of the next, we only need to define the end of each bin. Note
that this differs from VLOOKUP, which required us to define the start of each group. (confusing!)
Let us use a bin width of $10,000 and define bins up to $120,000. To keep our data sheet clean,
we recommend that you put the bins on a separate sheet, such as your LookUp Table
worksheet.
In column R, put a heading “Salary bins” in the 1st row and then enter 10000, 20000, 30000, …
in the 2nd, 3rd, 4th, … rows. Note: do not use commas. Write 10,000 as 10000.
Go to the Data tab and on the far right, you should see Data Analysis. Click on it. A pop up
window appears. Select Histogram.
Ask for a New Worksheet, so we do not add clutter to our data sheet.
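To see what the bin counts represent, here is a minimal Python sketch. It assumes each bin holds values up to and including its upper limit, with anything above the last limit going into a final "More" group. The salary values are hypothetical.

```python
from bisect import bisect_left

# Upper limits of the bins: 10000, 20000, ..., 120000.
bins = list(range(10000, 130000, 10000))

# Hypothetical salaries, including values exactly on a bin limit.
salaries = [0, 9450, 10000, 10001, 50000, 52000, 120000, 175000, 650000]

# One counter per bin, plus a final counter for the "More" group.
counts = [0] * (len(bins) + 1)
for s in salaries:
    # bisect_left finds the first bin whose upper limit is >= s,
    # so a value equal to a limit lands in that bin, not the next one.
    counts[bisect_left(bins, s)] += 1

for limit, c in zip(bins, counts):
    print(limit, c)
print("More", counts[-1])
```

Note that $10,000 is counted in the first bin while $10,001 falls into the second, and both $175,000 and $650,000 end up in "More".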
Figure 3-13 - Histogram output
The histogram is a simple column chart. You can format the labels on the chart to make it
more attractive. It is common practice to remove the space between columns in a histogram.
To the right are several formatting options. Change the Gap Width to 0.
The More group on the right is deceptive. Excel makes all columns the same width. But this
group actually represents values from $120,000 to $650,000! The tail on the right should be
very very long, but that would distract us from looking at where most of the data is.
Figure 3-16 - Histogram with 65 bins
Don’t be distracted by outliers. They are important, but not the focus.
It would be nice to filter out extreme values. If you apply a filter and then construct your
histogram, Excel ignores the filter! The only way to filter data and then make a histogram is to
copy the filtered data, paste it into a new worksheet, and build the histogram from the copy.
To get a sense of any numeric value, you think of how big or small the value is. Big and small
are relative to some standard. The first standard that comes to mind is “where is the middle?”
There are multiple measures of the “middle”. The most common are the mean (average) and
the median.
What we commonly call “average”, statisticians call the mean. It is the total of all values
divided by the number of observations:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
This value is called “x-bar” and represents the sample mean. Each value is equally important.
All values are included. If you use the Total row in an Excel Table, the mean value is Average.
The mean or average expected salary was $52,914.76.
But we know that there were some outliers – unusually high salaries. Outliers are not reflective
of most of the data, but the mean includes everything and treats each value as equally valid. A
popular alternative is the median.
Median = value for which 50% of observations are larger and 50% of observations are smaller.
The median is the middle number if you sort the data from smallest to largest. If there are two
values in the middle, take the average of them. To find the median in the Total Row of
a Table, you must select More Functions and then look under Statistical Functions.
Figure 3-28 Median among More
Functions
It is also very easy to find the median manually in Excel. Sort the Salary column. How many
non-blank values are there in the column? Use Count, or scroll down the list to find the last
entry.
Count = 752 - last entry is in row 753 (the column has a label in row 1)
752/2 = 376
376 values are greater than the median and 376 are smaller.
376th value from the start is in the 377th row = $50,000
376th value from the end is in the 378th row = $50,000
The median salary is $50,000. In this case, the median is very similar to the mean.
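The manual procedure above can be sketched in Python. The helper name median_by_position and the sample values are hypothetical; the result is compared against the standard library's own median function.

```python
from statistics import median

def median_by_position(values):
    """Sort, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average two

# Hypothetical sample with an even count and one extreme value.
sample = [30000, 40000, 50000, 50000, 65000, 650000]

print(median_by_position(sample), median(sample))
```

With an even number of observations, as in our 752 salaries, the two middle values are averaged.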
Mean versus Median - The median is unaffected by outliers. That is, it is unaffected by any
especially large or small values – just the values exactly in the middle. But, most people think
of “average” when they think of “middle”, so the mean is the most commonly used measure.
If the distribution of the data is symmetrical, the mean and the median are very similar.
However, if the distribution is highly skewed, then the mean will be pulled in the direction of
the long tail. For example, income data usually has a long right tail reflecting those
with very high incomes, but is truncated on the left because incomes cannot be negative. In
this case, the mean income will be higher than the median. This means that more people have
incomes below the mean than those having incomes above the mean. It is not unusual to have
70% of values smaller than the mean. In this case, many believe that the mean is an
inappropriate measure of the “middle”.
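A small sketch in Python shows the effect of a long right tail. The ten incomes are hypothetical and were chosen only to make the pull on the mean easy to see.

```python
from statistics import mean, median

# Hypothetical incomes with one very high earner (long right tail).
incomes = [20000, 25000, 30000, 35000, 40000,
           45000, 50000, 60000, 80000, 400000]

avg = mean(incomes)
mid = median(incomes)

# Share of people whose income is below the mean.
share_below_mean = sum(1 for x in incomes if x < avg) / len(incomes)

print(avg, mid, share_below_mean)
```

Here the mean sits well above the median, and most of the incomes fall below the mean.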
Maximum and minimum are the extremes and may not be representative of many students.
We often classify high income earners as the top 1% or the top 10%.
1% of 752 students is 7.52. What is the 8th largest salary? Or the 745th smallest? $175,000
We could also say that 90% have lower salaries. 90% of 752 is 676.8
What is the 677th value from the bottom (smallest)? (row 678) $80,000
One might consider $80,000 to be a high salary. If 90% are below the value, we call it the 90th
percentile. We may use percentiles to define “low” and “high”. Percentiles are not influenced
by outliers the way Minimum and Maximum are.
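The counting rule used above (multiply the percentage by the number of observations, round up, and count that many values from the smallest) can be sketched in Python. The function names are hypothetical, and this is only one of several percentile rules in use.

```python
import math

def percentile_rank(n, pct):
    """1-based position, counted from the smallest, for the pct-th percentile."""
    return max(1, math.ceil(pct * n / 100))

def value_at_percentile(values, pct):
    """Value found at that position in the sorted data."""
    ordered = sorted(values)
    return ordered[percentile_rank(len(ordered), pct) - 1]

# Positions for our 752 salaries.
print(percentile_rank(752, 90))  # 90% of 752 is 676.8, rounded up
print(percentile_rank(752, 99))  # 99% of 752 is 744.48, rounded up

# Hypothetical tiny data set: the values 1 through 10.
print(value_at_percentile(range(1, 11), 90))
```

Adding 1 for the header row gives the worksheet row in which to look.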
The median is the 50th percentile. The 25th and 75th percentiles are called quartiles. The
bottom 25% are below the 1st quartile. The next 25% are between the 1st quartile and the
median (2nd quartile). The next 25% are between the median and the 3rd quartile. The last
25% are above the 3rd quartile. Quartiles split the data set into 4 quarters.
Often, researchers looking at how household income affects an issue will compare the bottom
25% to the top 25% of households (compare the average for the bottom 25% to the average
for the top 25%).
We might wish to develop profiles of the most successful students and compare them to the
least successful. “Most” and “least” could be defined by percentiles or quartiles.
Note that you may find that different software tools calculate percentiles and quartiles
differently. For example, consider the following data set:
2 4 7 8 8 9 11 14 20
There are 9 observations, so the median is the 5th value, 8. Four observations come before it
and four come after it. The first quartile is 7 if you include the median when splitting the first
half. But if you exclude the median, then the middle of the first half is between 4 and 7. With
tiny samples, the rules used to find percentiles and quartiles can give different results, but
with larger samples, the results are almost identical. In practice, the choice of rule is not
important.
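Python's standard library happens to offer two quartile rules that reproduce both answers for this data set. Its method names, "inclusive" and "exclusive", describe a slightly different assumption from the include-the-median wording used above, so treat this as an illustration of how rules can disagree, not as a statement of what any particular spreadsheet does. It requires Python 3.8 or later.

```python
from statistics import quantiles

data = [2, 4, 7, 8, 8, 9, 11, 14, 20]

# Two of the rules available in the standard library; n=4 asks for quartiles.
q_inclusive = quantiles(data, n=4, method="inclusive")
q_exclusive = quantiles(data, n=4, method="exclusive")

print(q_inclusive)  # first quartile is 7
print(q_exclusive)  # first quartile is 5.5, between 4 and 7
```

Both rules agree on the median; only the outer quartiles differ on a sample this small.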
Measuring Diversity
Is our data set homogeneous or diverse? Do most students expect to earn the same amount or
is there a lot of variation among students? One measure of diversity is the Range = Max – Min.
But Max and Min may simply reflect outliers and not most students.
Common measures of diversity (dispersion or variability) involve the relative gap between
observations and the mean (average). Are most values close to the middle or are there many
that are very far from the middle? $20,000 is $32,915 below average, whereas $65,000 is
$12,085 above average.
We don’t care whether values are positive or negative, just how far they are from the middle.
Why not ignore the sign and simply average the differences?
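Measure how far each value is from the mean, take the absolute value, and then average the
numbers. The result is the mean absolute deviation (MAD):
\[ \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right| \]
In Excel, use the AVEDEV function to get MAD.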
It is a simple measure, but almost no one uses it, except in forecasting. It is not unusual to hear
someone say that on average, their sales forecast deviates from the actual by 300 either way.
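The calculation can be sketched in a few lines of Python. The function name and the five salaries are hypothetical, with values chosen so that the arithmetic is easy to follow by hand.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average distance from the mean, ignoring the sign of each difference."""
    centre = mean(values)
    return mean(abs(x - centre) for x in values)

# Hypothetical salaries with a mean of 50000.
salaries = [20000, 40000, 50000, 60000, 80000]

# Distances from the mean: 30000, 10000, 0, 10000, 30000.
mad = mean_absolute_deviation(salaries)
print(mad)
```

On average, these salaries are $16,000 away from the mean, in either direction.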
Absolute values are difficult to work with mathematically. They are a particular problem in
calculus. Another way to remove minus signs is to square the values. The Variance is the
average squared difference between observations and the mean. The sample variance is
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Why would you divide by n-1 and not n? Because we are using the sample mean, which is
based upon the observations we are comparing, dividing by n will slightly underestimate the
variance and dividing by n-1 corrects for this bias.
For salaries, our units for variance would be “square dollars”. What does it mean to say that
the variance of salaries is 1,940,893,818 (dollars)²?
We can take the square root to convert our units back to dollars. We call this the Standard
Deviation.
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
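A short Python sketch shows the variance, the effect of dividing by n-1 instead of n, and the square root that returns us to dollars. The five salaries are hypothetical.

```python
from statistics import mean, pvariance, stdev, variance

# Hypothetical salaries with a mean of 50000.
salaries = [20000, 40000, 50000, 60000, 80000]

# Sum of squared differences from the mean, computed by hand.
centre = mean(salaries)
sum_sq = sum((x - centre) ** 2 for x in salaries)

sample_var = variance(salaries)   # divides sum_sq by n - 1
divide_by_n = pvariance(salaries)  # divides sum_sq by n (smaller result)
sample_sd = stdev(salaries)       # square root of the sample variance

print(sum_sq, sample_var, divide_by_n, round(sample_sd, 2))
```

The variance is in square dollars; only the standard deviation is back in dollars.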
Confused? The standard deviation is the most widely used measure of variability, so you will
have to get used to it. Thankfully, Excel and many calculators have functions to calculate the
standard deviation. In this case, the standard deviation of salaries is $44,055.58. How do you
interpret this number? What does it tell you about the variation in salary expectations among
students?
At the most basic level, we can say that when there is more variation, then the standard
deviation is larger. If you had two groups of students and one had a standard deviation of
$30,000 and the other had a standard deviation of $55,000, then we would say that the second
group was more diverse than the first.
But to interpret the actual number can be challenging. If the distribution of the data is
approximately bell shaped (Normal distribution), then we can make some statements about
high and low values in the data, using just the mean and the standard deviation.
Our data set is not bell shaped because of outliers, so let us trim off those values that are
below $10,000 and above $150,000. This removes 27 values. Of the remaining 725 values, the
mean is $50,688.48 and the standard deviation is $20,637.85. This data set still has a somewhat
long tail to the right, so is not exactly bell shaped.
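If a distribution is approximately bell shaped, like a Normal Distribution, then the Empirical
Rule states that approximately 68% of values lie within one standard deviation of the mean,
approximately 95% lie within two standard deviations, and approximately 99.7% lie within three.
Looking out two standard deviations above and below the mean, we find the range to be
$9,412.78 to $91,964.18.
Although our data is somewhat skewed, the rule appears to work well.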
By knowing just the mean and standard deviation, we can say that a “low” salary would be
$30,000 and a “very low” salary would be $9,000. A “high” salary would be $71,000 and a “very
high” salary would be $91,000.
The Empirical Rule generally works if the distribution is not severely skewed. Knowing just the
mean and standard deviation, you can quickly identify
Very low
Low
Middle
High, and
Very high
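These five groups can be sketched in Python using the trimmed mean and standard deviation reported above. The classify function and the choice of which group owns each boundary value are assumptions made for illustration.

```python
# Trimmed-sample figures reported in this chapter.
mean_salary = 50688.48
sd_salary = 20637.85

# Cut points one and two standard deviations either side of the mean.
cutoffs = {
    "very low": mean_salary - 2 * sd_salary,
    "low": mean_salary - sd_salary,
    "high": mean_salary + sd_salary,
    "very high": mean_salary + 2 * sd_salary,
}

def classify(value):
    """Label a salary by where it falls relative to the cut points."""
    if value < cutoffs["very low"]:
        return "very low"
    if value < cutoffs["low"]:
        return "low"
    if value <= cutoffs["high"]:
        return "middle"
    if value <= cutoffs["very high"]:
        return "high"
    return "very high"

for label, cut in cutoffs.items():
    print(label, round(cut, 2))
print(classify(25000), classify(50000), classify(650000))
```

Rounded to the nearest thousand, the cut points match the "low" and "high" figures quoted above.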
The output gives many different summary statistics. Many are ones we have already seen.
The mode is the most frequently occurring value. For interval or ratio data, the mode is often
meaningless since there is such a diversity of individual values that no one value occurs very
frequently. In our case, we are looking at personal estimates and most students rounded off
their estimates to the nearest 10,000. For this reason, we do have some values occurring
frequently.
Kurtosis is a measure of how “fat” the tails of the distribution are. Skewness measures
whether we have a long tail to the right or the left. Very few researchers pay a lot of attention
to these values and we can generally get a sense of the distribution shape by looking at a
histogram.
If you have a sample of at least 30 observations (we have 752), then we can make statements
about our sample mean that are similar to the Empirical Rule.
The standard error is $1,606.54. We can be very confident that the estimate of $52,914.76 is
within 2(1606.54) or $3,213.08 of the true mean salary expectation. That is, the correct value is
likely to be between $49,700 and $56,100.
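The figures above can be reproduced with a short Python sketch. It assumes the usual formula for the standard error of the mean, the standard deviation divided by the square root of the sample size, which gives the value reported here.

```python
import math

# Figures reported in this chapter for the full sample.
n = 752
mean_salary = 52914.76
sd_salary = 44055.58

# Standard error of the mean (assumed formula: s / sqrt(n)).
standard_error = sd_salary / math.sqrt(n)

# Two standard errors either side of the sample mean.
margin = 2 * standard_error
lower = mean_salary - margin
upper = mean_salary + margin

print(round(standard_error, 2), round(margin, 2))
print(round(lower, -2), round(upper, -2))  # rounded to the nearest $100
```

The last line rounds to the nearest $100, in keeping with the precision the estimate can support.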
Note that I rounded off my calculations to the nearest $100. Since there is error in my estimate
of the mean that could be as large as $3,213.08 or larger, it makes no sense to state my results
to the penny or the dollar. Displaying too many digits can be confusing to read and also
suggests precision that is not justified. It makes more sense to say that graduates expect to
earn approximately $53,000 (+/- $3,000). There is no loss of information in rounding off and it
is easier for the reader to imagine $53,000 rather than $52,914.76.
But if we have a lot of data, we should be able to get a more accurate estimate than if we had a
small sample. In our example, we had a sample of 752 observations and we were confident
that our estimate of the average was within $3,213. If we wanted to be confident of having an
estimate within $800 (cut our error to 1/4 of what it is), then we would need 16 times as many
observations (about 12,000 observations). More accurate estimates can be costly to obtain.
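The trade-off can be sketched in Python. Because the error shrinks with the square root of the sample size, the required sample grows with the square of the improvement. The function name is hypothetical, and the margins are rounded to $3,200 and $800 so that the ratio is exactly 4.

```python
import math

def required_sample_size(current_n, current_margin, target_margin):
    """Sample size needed to shrink the margin, assuming the same variability."""
    ratio = current_margin / target_margin
    return math.ceil(current_n * ratio ** 2)

# Cutting the margin to one quarter needs 4 squared = 16 times the data.
needed = required_sample_size(752, 3200, 800)
print(needed)

# Halving the margin needs only 4 times the data.
print(required_sample_size(752, 3200, 1600))
```

Each further halving of the error quadruples the number of observations required.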
The topic of estimating mean values with an interval (range of plausible values) is covered in
more depth in a basic statistics course.
Much of data analytics involves making predictions and it is important to have ways of
measuring the accuracy of our predictions.
Before leaving summary statistics using Excel, there is one important caution that must be
made. When we used the Total Row to find averages, we could filter our data and the averages
would reflect the filter (e.g., when we compared average salaries for Arts, Business and
Science). If you filter your data and then use the Descriptive Statistics in the Analysis Toolpak,
you will discover that Excel ignores the filter. This is the same problem we saw with
Histograms. If you want to compare sub-groups, you must filter and then copy and paste the
data into a new worksheet before you calculate the Descriptive Statistics. In the next chapter
we will look at a powerful and simple tool for doing comparisons.