
Exploring Data –

Chapter 3:
Summarizing Data
When confronted with a large mass of information, we usually start by trying to condense it.
Summarizing data may start with grouping similar values together, but we will also want to
find simple “landmark” values that describe important characteristics of the data set.

However, before summarizing the data, it is advisable to simply explore the data to see what is
there. Play with the data as if you were in a sandbox, before you start forming opinions about
the data and prematurely drawing conclusions.

If your data is in a Table, then there are some functions that make basic exploration easy. In
the header row at the top of the Table, each variable name is followed with a down arrow. If
you click on it you can see that you can sort or filter your data. The first two sorts (smallest to
largest and largest to smallest for numeric variables and A to Z vs Z to A for text) are self-
explanatory.
Figure 3-1 Sorting and Filtering a numeric variable in a Table
Figure 3-2 Sorting and Filtering a text variable in a Table

The third is labeled Sort by Color. This is a custom sort. You can sort at multiple levels. That is,
sort students by their Program, then within each Program you can sort by Gender, then among
each Gender-Program combination you can sort by GPA,… Note that the whole file is sorted,
not simply the variable that you selected. Normally in Excel, you have to select the data you
wish to sort.
Figure 3-3 Custom Sort

Figure 3-4 Custom Sort dialog box


You can also Filter your data. This could be by selecting the specific values you wish to
examine, or by using a variety of functions to select ranges of values. You can do a Custom
Filter that combines ranges by using logical operators, such as OR and AND. You can filter on
several different variables separately. This is useful when you want to explore specific sub-
groups in your data.

Suppose that we want to look just at the Business students. We can click on the down arrow
by Program and uncheck all programs except Business.

Maybe we want only the International Business students. Click on the down arrow
by Home and select International.

We can continue selecting just Females or just 1st year students. We can also use more
complex text filters. Or we can select just those that expect to earn over $50,000, or between
$40,000 and $80,000…

Remember to take filters off to return to the full data set.

We can also Sort the data in a similar fashion. If you select Sort by Color, you can then choose
Custom Sort, which allows you to sort by several variables concurrently.

When first exploring data, doing a simple sort on each numeric variable can be insightful. For
example, sort Salary, either smallest to largest or largest to smallest.

Look at the smallest values. Someone expects to earn $0? $500? $9,450?

Look at the largest values. Someone expects to earn $650,000? $560,000?

Are these students crazy or are these data entry errors? Looking at the extremes will give you
some sense of data quality issues. Should we keep these values when we do our analyses?
Maybe the $650,000 was supposed to be $65,000, but do we know for sure? Maybe we should
investigate this record more closely. Are there other responses for this student that may give
us insights? Maybe the student gave all erroneous answers and we should discard all
responses for the student.

Notice how many gave no answer. Why? Should we be concerned?
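
The same first look at the extremes can be sketched outside Excel. The short Python example below is only an illustration: the list of salaries is hypothetical (it borrows a few of the suspicious values mentioned above), and `None` is assumed to stand for a blank answer.

```python
# Hypothetical expected-salary responses; None marks a blank answer.
salaries = [48000, 0, 650000, None, 52000, 500, 9450, 560000, None, 45000]

# Sort only the values that were actually reported.
answered = sorted(s for s in salaries if s is not None)

# Count the blanks separately so they are not forgotten.
n_missing = sum(1 for s in salaries if s is None)

smallest = answered[:3]        # the three lowest reported values
largest = answered[-3:][::-1]  # the three highest, largest first

print("Smallest:", smallest)
print("Largest:", largest)
print("No answer:", n_missing)
```

Looking at both ends of the sorted list, plus the count of blanks, gives the same quick sense of data quality as sorting the column in the Table.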

When you are done sorting, it is recommended that you return your data to the original sort
order. Initially, the data was likely sorted by the record ID. Re-sort the Table by ID, smallest to
largest. If your data set does not have an ID column, it is recommended that you create one.
As we explore our data and come to understand it, we may start asking ourselves questions
about it.

3.3 Total Row


When in a Table, if you select the Design tab at the top of the screen, there is a checkbox
labeled Total Row. Click it.

Figure 3-5 - Adding a Total Row

A row will be added to the bottom of the table. Scroll to the last row. Select the entry in the
Salary column. A down arrow should appear. Click on the arrow.

Figure 3-6 Summary functions in Total Row

Select Count Numbers. 752 students reported a salary value. Look at Average, Max and Min.
The average salary was $52,914.76, with a maximum of $650,000 and a minimum of $0. The
$650,000 appears unrealistic. We noted before when Sorting, that some students gave values
that we thought were suspicious.

Go to the down arrow by Salary in the header row and select Number Filters.
Select Between and then select 20000 and 120000 as your limits. You now get an average of
$51,480.40. It would appear that the extremes at each end did not seriously distort our
average.

Click on the ID cell in the Total row (row 813, cell A813). Click on the arrow and select Count.

811 students attempted the survey. This is all of the students. Click on the Total cell for Year.
802 told us the year they started. Repeat this for Home and for Home2. Why is the count 811
for Home and 801 for Home2? When we applied VLOOKUP to Home, blanks were coded as
#N/A. This may be a problem.

What happens when you look at Gender and Gender2? 811 and 665? Why? Initially, we used
VLOOKUP to code 0 and 1 as Male and Female, but VLOOKUP treated a blank as zero = Male.
We added an IF statement to treat blanks as blanks. But, Excel coded blanks as blank text. So
the cells are no longer empty and they were counted.

Lesson: Recoding may have unintended consequences.

Question: why did 802 tell us when they started but only 665 tell us gender?

Lesson: Watch out for survey fatigue with long surveys.

Are salary expectations of Business students similar to those of Arts students? We know that
we have some strange salaries reported, so let us Filter on those between $20,000 and
$120,000 (note: do not insert $ and commas. Type 20000 and 120000). The average salary is
$51,480.80.

Now filter on Program and select Business. The average is $52,463.20. Filter on Program and
select Arts. The average is $48,185. If you filter on Science you will see they have the highest
expectations, at $57,774.77.
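
The filter-then-average routine can be sketched in a few lines of Python. This is a minimal illustration, not the survey data: the records and the helper `average_salary` are hypothetical, and the range check assumes that the Between filter includes both limits.

```python
from statistics import mean

# Hypothetical survey records; None marks a blank salary.
records = [
    {"program": "Business", "salary": 50000},
    {"program": "Business", "salary": 60000},
    {"program": "Arts", "salary": 40000},
    {"program": "Arts", "salary": 650000},   # outside the salary filter
    {"program": "Science", "salary": 58000},
    {"program": "Science", "salary": None},  # blank, ignored
]

def average_salary(rows, program=None, low=20000, high=120000):
    """Average the salaries between low and high, optionally for one program."""
    kept = [
        r["salary"]
        for r in rows
        if r["salary"] is not None
        and low <= r["salary"] <= high
        and (program is None or r["program"] == program)
    ]
    return mean(kept) if kept else None

print("All programs:", average_salary(records))
for name in ("Business", "Arts", "Science"):
    print(name, average_salary(records, program=name))
```

Each call plays the role of one combination of filters followed by reading Average in the Total row.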

Although tedious, you can answer many simple questions with the Total row and Filters.

3.4 Aggregating Data with Frequency Distributions

A frequency distribution is a tabular summary of how many values are in each group of values.

Suppose you take our breakdown of salaries from very low to unrealistic and count how many
fit each description.
Table 3-1 - Proposed grouping for Expected Salary data

Usually, frequency distributions do not have text descriptions, just the ranges of values.
Building frequency distributions by using VLOOKUP to categorize values and then having to
count values is clumsy.

Excel has functions to build the distribution and to graph it. Some are antiquated, but
commonly used, and others are more flexible. We will look at the most common one in this
chapter and an alternative in the next chapter.

To build a frequency distribution, you must group the data.

What are the “best practices” in grouping?


The objectives for grouping data are to:

• Make the summary concise.
• Make the summary informative.

What does this mean? How do we do it well?

• Don’t have too many groups – not concise – normally limit to a
maximum of 20.
• Don’t have too few groups – loss of information – have at least 4-5.
• Groups should be unambiguous (should not overlap so each value fits
in only one group).
• The summary should include all the data.
• Use “nice numbers”. We like multiples of 2, 5, 10, 25, 50, 100 and not
awkward starting or ending values, such as 13 or 27 or 146.3.
• It is easier to compare groups that are of “equal size”. The proposed
groups above went in increments of $20,000, but “very high” had a
range of $40,000. This is NOT a good practice.
• To satisfy all of the above, the 1st and/or last group may need to be
larger to capture outliers.

3.5 Frequency Distributions in Excel – Histograms

Excel has a data analysis function called Histogram. A Histogram is a Column Chart of a
frequency distribution. To access this function, you must install the Analysis ToolPak add-in
in Excel.

Installing the Analysis ToolPak


In Excel for Windows:

• Go to the File tab.
• Select Options (bottom of list).
• Select Add Ins (near bottom of list).
• Select Manage | Excel Add Ins | Go (at the bottom of the screen).
• Click on Analysis ToolPak.
Figure 3-7 - Installing the Analysis ToolPak in Excel for Windows

If you are using Office 365 in the cloud, then installation is somewhat more complicated. The
Analysis ToolPak is not included in 365. Go to Insert and then select Add-Ins.
Figure 3-8 Excel 365 Add-ins

Select STORE from the options listed at the top of the pop-up screen. From the list of
categories at the left, select Data Analytics. Scroll down the list of suggestions until you
see XLMiner.
Figure 3-9 XLMiner Add-in

The XLMiner add-in is free and behaves in a very similar way to the Analysis ToolPak.

If you are using a Mac, then you must go through a similar process to the above. You may need
to search online for installation instructions.

Using the Histogram Function


To use the Histogram function in Excel, you must define your groups. Excel calls the groups
“bins”.

Since the end of one bin is the start of the next, we only need to define the end of each bin. Note
that this differs from VLOOKUP, which required us to define the start of each group. (confusing!)
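
To see how bins defined only by their upper limits produce a frequency distribution, here is a minimal Python sketch. It assumes a value is counted in the first bin whose upper limit is greater than or equal to the value, and that anything above the last limit falls into a final "More" group; the function name `frequency_distribution` and the sample values are hypothetical.

```python
from bisect import bisect_left

# Upper limits of the bins: 10000, 20000, ..., 120000.
bin_limits = list(range(10000, 120001, 10000))

def frequency_distribution(values, limits):
    """Count values per bin; a value v goes in the first bin with limit >= v."""
    counts = [0] * (len(limits) + 1)  # the extra last slot is the "More" group
    for v in values:
        # bisect_left returns the index of the first limit that is >= v,
        # or len(limits) when v is larger than every limit.
        counts[bisect_left(limits, v)] += 1
    labels = [str(limit) for limit in limits] + ["More"]
    return list(zip(labels, counts))

# Hypothetical salaries chosen to sit on and around the bin edges.
sample = [0, 10000, 10001, 50000, 120000, 650000]
table = frequency_distribution(sample, bin_limits)

for label, count in table:
    print(label, count)
```

Note that 10000 lands in the first bin while 10001 lands in the second, which is what defining a bin by its end implies.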

Let us use a bin width of $10,000 and define bins up to $120,000. To keep our data sheet clean,
we recommend that you put the bins on a separate sheet, such as your LookUp Table
worksheet.
In column R, put a heading “Salary bins” in the 1st row and then enter 10000, 20000, 30000, …
in the 2nd, 3rd, 4th, … rows. Note: do not use commas. Write 10,000 as 10000.

Figure 3-10 - Salary bins for building Histograms

Go to the Data tab and on the far right, you should see Data Analysis. Click on it. A pop up
window appears. Select Histogram.

Figure 3-11 - Histogram function

A dialogue box opens.


Figure 3-12 - Histogram dialog box

The Input Range is the location of the data to be summarized: $L$1:$L$812.

The Bin Range is the location of the bins: 'Lookup Tables'!$R$1:$R$13.

Our columns have Labels, so check this box.

We want Chart Output, so check this box.

Ask for a New Worksheet, so we do not add clutter to our data sheet.
Figure 3-13 - Histogram output

The histogram is a simple column chart. You can format the labels on the chart to make it
more attractive. It is common practice to remove the space between columns in a histogram.

Click on any column and then right click.

Select Format Data Series.


Figure 3-14 - Formatting the histogram

To the right are several formatting options. Change the Gap Width to 0.

Figure 3-15 - Histogram without gaps

The More group on the right is deceptive. Excel makes all columns the same width. But this
group actually represents values from $120,000 to $650,000! The tail on the right should be
very long, but that would distract us from looking at where most of the data is.
Figure 3-16 - Histogram with 65 bins

Don’t be distracted by outliers. They are important, but not the focus.

It would be nice to filter out extreme values. If you apply a filter and then construct your
histogram, Excel ignores the filter! The only way to filter data and then make a histogram is to:

• Apply the filter.
• Copy the filtered data to a new sheet.
• Build the histogram using the copied data.

3.7 Numeric Summaries


Although relative frequency distributions and histograms are informative summaries of a data
set, they are often more than the reader is looking for. Many would like data summarized into
one or two important numbers.

Middle and Average


When a test is returned, what is the first thing that students usually ask the instructor? What
was the class average?

To get a sense of any numeric value, you think of how big or small the value is. Big and small
are relative to some standard. The first standard that comes to mind is “where is the middle?”
There are multiple measures of the “middle”. The most common are the mean (average) and
the median.
What we commonly call “average”, statisticians call the mean. It is the total of all values
divided by the number of observations.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

This value is called “x-bar” and represents the sample mean. Each value is equally important.
All values are included. If you use the Total row in an Excel Table, the mean value is Average.
The mean or average expected salary was $52,914.76.

But we know that there were some outliers – unusually high salaries. Outliers are not reflective
of most of the data, but the mean includes everything and treats each value as equally valid. A
popular alternative is the median.

Median = value for which 50% of observations are larger and 50% of observations are smaller.

The median is the middle number if you sort the data from smallest to largest. If there are two
values in the middle, take the average of them. To find the median in the Total Row of
a Table, you must select More Functions and then look under Statistical Functions.
Figure 3-28 Median among More Functions

It is also very easy to find the median manually in Excel. Sort the Salary column. How many
non-blank values are there in the column? Use Count, or scroll down the list to find the last
entry.

• Count = 752 - last entry is in row 753 (the column has a label in row 1)
• 752/2 = 376
• 376 values are greater than the median and 376 are smaller.
• 376th value from the start is in the 377th row = $50,000
• 376th value from the end is in the 378th row = $50,000

The median salary is $50,000. In this case, the median is very similar to the mean.
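
The manual procedure (sort, find the middle position, average the two middle values when the count is even) can be sketched as follows. The helper `manual_median` and the salary list are hypothetical; the result is checked against Python's built-in `statistics.median`.

```python
from statistics import median

def manual_median(values):
    """Sort the values and return the middle one (or the mean of the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

# Hypothetical salaries, including one extreme value.
salaries = [50000, 42000, 650000, 55000, 48000, 50000]

print(manual_median(salaries))  # same as statistics.median(salaries)
print(median(salaries))
```

Replacing 650000 with 65000 in this list leaves the median unchanged, which is the property that makes it attractive when outliers are present.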

Mean versus Median - The median is unaffected by outliers. That is, it does not change when
especially large or small values become more extreme; it depends only on the values in the
middle. But most people think of “average” when they think of “middle”, so the mean is the
most commonly used measure.

If the distribution of the data is symmetrical, the mean and the median are very similar.
However, if the distribution is highly skewed, then the mean will be pulled in the direction of
the long tail. For example, income data usually has a long right tail reflecting those
with very high incomes, but is truncated on the left because incomes cannot be negative. In
this case, the mean income will be higher than the median. This means that more people have
incomes below the mean than those having incomes above the mean. It is not unusual to have
70% of values smaller than the mean. In this case, many believe that the mean is an
inappropriate measure of the “middle”.
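
A small numeric illustration of this effect, using a hypothetical list of ten incomes (in thousands of dollars) with one very high earner:

```python
from statistics import mean, median

# Hypothetical incomes in $000s with a long right tail.
incomes = [20, 25, 30, 35, 40, 45, 50, 60, 80, 315]

avg = mean(incomes)    # pulled upward by the single very high income
mid = median(incomes)  # depends only on the two middle values

# Share of people whose income is below the mean.
share_below_mean = sum(1 for x in incomes if x < avg) / len(incomes)

print("Mean:", avg)
print("Median:", mid)
print("Share below the mean:", share_below_mean)
```

In this made-up example the mean is 70 while the median is 42.5, and 8 of the 10 incomes fall below the mean.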

High and Low – Percentiles and Quartiles


Students who receive above average grades on a test want to know what the top grade was or
how many students received higher grades. Students who receive below average grades don’t
ask, but wonder if many did worse.

Maximum and minimum are the extremes and may not be representative of many students.
We often classify high income earners as the top 1% or the top 10%.

1% of 752 students is 7.52. What is the 8th largest salary? Or the 745th smallest? $175,000

10% of 752 is 75.2. What is the 75th largest? $80,000

We could also say that 90% have lower salaries. 90% of 752 is 676.8

What is the 677th value from the bottom (smallest)? (row 678) $80,000

One might consider $80,000 to be a high salary. If 90% are below the value, we call it the 90th
percentile. We may use percentiles to define “low” and “high”. Percentiles are not influenced
by outliers like Minimum and Maximum are.
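
The counting used above (take the percentage of n, round up, and read off that position in the sorted list) is one simple percentile rule, often called the nearest-rank rule. The sketch below assumes that rule; the function names are hypothetical, and other software may interpolate instead.

```python
from math import ceil

def nearest_rank(n, pct):
    """Position (counting from the smallest, starting at 1) of the pct-th percentile."""
    return max(1, ceil(pct * n / 100))

def percentile(values, pct):
    """Value at the nearest-rank position in the sorted data."""
    ordered = sorted(values)
    return ordered[nearest_rank(len(ordered), pct) - 1]

# With 752 salaries, 90% of 752 is 676.8, so we look at the 677th smallest value.
print(nearest_rank(752, 90))

# A tiny data set to try the rule on.
data = [2, 4, 7, 8, 8, 9, 11, 14, 20]
print(percentile(data, 50))
```

Because the rule only uses positions in the sorted list, moving the maximum further out does not change any percentile below it.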

The median is the 50th percentile. The 25th and 75th percentiles are called quartiles. The
bottom 25% are below the 1st quartile. The next 25% are between the 1st quartile and the
median (2nd quartile). The next 25% are between the median and the 3rd quartile. The last
25% are above the 3rd quartile. Quartiles split the data set into 4 quarters.

Often, researchers looking at how household income affects an issue will compare the bottom
25% to the top 25% of households (compare the average for the bottom 25% to the average
for the top 25%).

We might wish to develop profiles of the most successful students and compare them to the
least successful. “Most” and “least” could be defined by percentiles or quartiles.

Note that you may find that different software tools calculate percentiles and quartiles
differently. For example, consider the following data set:
2 4 7 8 8 9 11 14 20

There are 9 observations, so the median value is 8. 4 observations are smaller and 4
observations are larger. The first quartile is 7 if you include the median when splitting the first
half. But if you exclude the median, then the middle of the first half is between 4 and 7. With
tiny samples, the rules used to find percentiles and quartiles can give different results, but
with larger samples, the results are almost identical. In practice, the choice of rule is not
important.
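
The two rules described above can be compared directly on the nine-value data set. This is a sketch of just those two "median of the lower half" rules; the function `first_quartile` is hypothetical, and real software packages offer several other definitions.

```python
from statistics import median

data = [2, 4, 7, 8, 8, 9, 11, 14, 20]

def first_quartile(values, include_median):
    """Median of the lower half, with or without the overall median included."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower = ordered[: half + 1]  # lower half plus the middle value
    else:
        lower = ordered[:half]       # lower half only
    return median(lower)

print(first_quartile(data, include_median=True))   # lower half is 2 4 7 8 8
print(first_quartile(data, include_median=False))  # lower half is 2 4 7 8
```

Including the median gives 7, while excluding it gives 5.5, the midpoint between 4 and 7.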

Measuring Diversity
Is our data set homogeneous or diverse? Do most students expect to earn the same amount or
is there a lot of variation among students? One measure of diversity is the Range = Max – Min.

But Max and Min may simply reflect outliers and not most students.

Common measures of diversity (dispersion or variability) involve the relative gap between
observations and the mean (average). Are most values close to the middle or are there many
that are very far from the middle? $20,000 is $32,915 below average, whereas $65,000 is
$12,085 above average.

• $20,000 - $52,915 = -$32,915 (negative)
• $65,000 - $52,915 = $12,085 (positive)

We don’t care whether values are positive or negative, just how far they are from the middle.
Why not ignore the sign and simply average the differences?

Measure how far each value is from the mean, take the absolute value, and then average the
numbers. The result is the mean absolute deviation (MAD). In Excel, use the AVEDEV function
to get MAD.

It is a simple measure, but almost no one uses it, except in forecasting. It is not unusual to hear
someone say that on average, their sales forecast deviates from the actual by 300 either way.
Absolute values are difficult to work with mathematically. They are a particular problem in
calculus. Another way to remove minus signs is to square the values. The Variance is the
average squared difference between observations and the mean. The sample variance is

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Why would you divide by n-1 and not n? Because we are using the sample mean, which is
based upon the observations we are comparing, dividing by n will slightly underestimate the
variance and dividing by n-1 corrects for this bias.

For salaries, our units for variance would be “square dollars”. What does it mean to say that
the variance of salaries is 1,940,893,818 (dollars)²?

We can take the square root to convert our units back to dollars. We call this the Standard
Deviation.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
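
To make the three measures concrete, the sketch below computes the mean absolute deviation, the sample variance (dividing by n - 1) and the standard deviation step by step for a small hypothetical data set, and compares the last two with Python's `statistics` functions.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical values chosen so that the mean is a whole number (5).
values = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = mean(values)
deviations = [x - x_bar for x in values]  # signed distances from the mean

# Mean absolute deviation: ignore the signs, then average over n.
mad = sum(abs(d) for d in deviations) / len(values)

# Sample variance: square the deviations and divide by n - 1.
sample_variance = sum(d * d for d in deviations) / (len(values) - 1)

# Standard deviation: square root, back in the original units.
sample_sd = sqrt(sample_variance)

print("MAD:", mad)
print("Variance:", sample_variance, "vs", variance(values))
print("Std dev:", sample_sd, "vs", stdev(values))
```

The signed deviations always sum to zero, which is why the signs must be removed (by absolute value or by squaring) before averaging.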

Confused? The standard deviation is the most widely used measure of variability, so you will
have to get used to it. Thankfully, Excel and many calculators have functions to calculate the
standard deviation. In this case, the standard deviation of salaries is $44,055.58. How do you
interpret this number? What does it tell you about the variation in salary expectations among
students?

At the most basic level, we can say that when there is more variation, then the standard
deviation is larger. If you had two groups of students and one had a standard deviation of
$30,000 and the other had a standard deviation of $55,000, then we would say that the second
group was more diverse than the first.
But to interpret the actual number can be challenging. If the distribution of the data is
approximately bell shaped (Normal distribution), then we can make some statements about
high and low values in the data, using just the mean and the standard deviation.

Our data set is not bell shaped because of outliers, so let us trim off those values that are
below $10,000 and above $150,000. This removes 27 values. Of the remaining 725 values, the
mean is $50,688.48 and the standard deviation is $20,637.85. This data set still has a somewhat
long tail to the right, so is not exactly bell shaped.

If a distribution is approximately bell shaped, like a Normal Distribution, then the Empirical
Rule states that

• Approximately 68% of the values should be between Mean – StdDev
and Mean + StdDev
• Approximately 95% of the values should be between Mean – 2*StdDev
and Mean + 2*StdDev
• And almost no values should be more than 3 StdDev from the Mean

For the salary data, we should find

• 68% between $50,688 - $20,638 = $30,050 and
$50,688 + $20,638 = $71,326.
• 482 values fall in this range. 482/725 = 66.5% - close to 68%.

Looking out two standard deviations above and below the mean, we find the range to be

• $50,688 - 2($20,638) = $9,412 to
$50,688 + 2($20,638) = $91,964.
• 691 values are in this range, or 691/725 = 95.3%.

Although our data is somewhat skewed, the rule appears to work well.
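
The arithmetic behind these ranges is easy to reproduce. The sketch below uses the trimmed mean and standard deviation reported above; the helper names are hypothetical, and `share_within` is shown on a tiny made-up list because the survey data itself is not reproduced here.

```python
def empirical_range(mean, std_dev, k):
    """Return (mean - k*std_dev, mean + k*std_dev)."""
    return mean - k * std_dev, mean + k * std_dev

def share_within(values, low, high):
    """Fraction of values lying between low and high (inclusive)."""
    return sum(1 for v in values if low <= v <= high) / len(values)

mean_salary = 50688.48  # trimmed mean reported in the text
sd_salary = 20637.85    # trimmed standard deviation reported in the text

one_sd = empirical_range(mean_salary, sd_salary, 1)
two_sd = empirical_range(mean_salary, sd_salary, 2)

print("Within 1 StdDev:", one_sd)
print("Within 2 StdDev:", two_sd)

# Observed shares quoted in the text: 482 and 691 out of 725 values.
print(round(482 / 725 * 100, 1), round(691 / 725 * 100, 1))

# Hypothetical mini data set to show how the share would be counted.
print(share_within([25000, 40000, 55000, 90000], *one_sd))
```

Comparing the observed shares with 68% and 95% is a quick check of how close the data is to bell shaped.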

By knowing just the mean and standard deviation, we can say that a “low” salary would be
$30,000 and a “very low” salary would be $9,000. A “high” salary would be $71,000 and a “very
high” salary would be $91,000.

The Empirical Rule is a quick and dirty way of defining

• Low/High as 1 StdDev below or above the mean, and
• Very Low/High as 2 StdDev below or above the mean.

The Empirical Rule generally works if the distribution is not severely skewed. Knowing just the
mean and standard deviation, you can quickly identify

• Very low
• Low
• Middle
• High, and
• Very high

3.8 Descriptive Statistics in Excel


We found mean and standard deviation by using the Total Row in a Table. You can obtain a
variety of simple Descriptive Statistics by using the Data Analysis function in Excel. Go to
the Data tab and select Data Analysis like we did for Histogram. This time select Descriptive
Statistics.

Figure 3-29 - Descriptive Statistics in Excel


Figure 3-30 - Descriptive Statistics dialog box
Table 3-2 - Descriptive Statistics output

The output gives many different summary statistics. Many are ones we have already seen.

The mode is the most frequently occurring value. For interval or ratio data, the mode is often
meaningless since there is such a diversity of individual values that no one value occurs very
frequently. In our case, we are looking at personal estimates and most students rounded off
their estimates to the nearest 10,000. For this reason, we do have some values occurring
frequently.

Kurtosis is a measure of how “fat” the tails of the distribution are. Skewness measures
whether we have a long tail to the right or the left. Very few researchers pay a lot of attention
to these values and we can generally get a sense of the distribution shape by looking at a
histogram.

Measuring the Precision (Accuracy) of Estimates


Standard Error deserves special mention. We are dealing with a sample of students. Our
summary statistics are estimates of what we would have seen if we had responses from every
student. If our sample was a random sample of all students, then the summary statistics
should be reasonable and unbiased estimates of the population values for all students. But
they are still estimates and not exact measures. We are most often interested in the mean. Our
estimate is $52,914.76. Is this close to the true value?

If you have a sample of at least 30 observations (we have 752), then we can make statements
about our sample mean that are similar to the Empirical Rule.

• Approximately 68% of the time, the sample mean is within 1 standard
error of the true mean, and
• Approximately 95% of the time, the sample mean is within 2 standard
errors of the true mean.

The standard error is $1,606.54. We can be very confident that the estimate of $52,914.76 is
within 2(1606.54) or $3,213.08 of the true mean salary expectation. That is, the correct value is
likely to be between $49,700 and $56,100.

Note that I rounded off my calculations to the nearest $100. Since there is error in my estimate
of the mean that could be as large as $3,213.08 or larger, it makes no sense to state my results
to the penny or the dollar. Displaying too many digits can be confusing to read and also
suggests precision that is not justified. It makes more sense to say that graduates expect to
earn approximately $53,000 (+/- $3,000). There is no loss of information in rounding off and it
is easier for the reader to imagine $53,000 rather than $52,914.76.

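For a simple average, the standard error is calculated as

\[ SE = \frac{s}{\sqrt{n}} \]

where s is the sample standard deviation and n is the number of observations.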


If our sample has a lot of variability, the standard deviation will be large, and this will make it
hard to estimate the average accurately.

But if we have a lot of data, we should be able to get a more accurate estimate than if we had a
small sample. In our example, we had a sample of 752 observations and we were confident
that our estimate of the average was within $3,213. If we wanted to be confident of having an
estimate within $800 (cut our error to 1/4 of what it is), then we would need 16 times as many
observations (about 12,000 observations). More accurate estimates can be costly to obtain.
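
These figures can be checked with a short calculation. The sketch below assumes the usual formula for the standard error of a mean (standard deviation divided by the square root of the sample size) and the "within 2 standard errors" rule used above; the function names are hypothetical.

```python
from math import ceil, sqrt

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return std_dev / sqrt(n)

def sample_size_for_margin(std_dev, target_margin):
    """Smallest n for which 2 standard errors is no larger than target_margin."""
    return ceil((2 * std_dev / target_margin) ** 2)

sd, n, mean_salary = 44055.58, 752, 52914.76  # values reported in the text

se = standard_error(sd, n)
margin = 2 * se
low, high = mean_salary - margin, mean_salary + margin

print("Standard error:", round(se, 2))
print("Likely range:", round(low, -2), "to", round(high, -2))

# Sixteen times the data cuts the standard error to one quarter.
print(round(standard_error(sd, 16 * n), 2))

# Sample size needed for a margin of about $800.
print(sample_size_for_margin(sd, 800))
```

Because the sample size sits under a square root, halving the error requires four times the data and quartering it requires sixteen times.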

The topic of estimating mean values with an interval (range of plausible values) is covered in
more depth in a basic statistics course.

Much of data analytics involves making predictions and it is important to have ways of
measuring the accuracy of our predictions.

Before leaving summary statistics using Excel, there is one important caution that must be
made. When we used the Total Row to find averages, we could filter our data and the averages
would reflect the filter (e.g., when we compared average salaries for Arts, Business and
Science). If you filter your data and then use the Descriptive Statistics in the Analysis ToolPak,
you will discover that Excel ignores the filter. This is the same problem we saw with
Histograms. If you want to compare sub-groups, you must filter and then copy and paste the
data into a new worksheet before you calculate the Descriptive Statistics. In the next chapter
we will look at a powerful and simple tool for doing comparisons.
