
Data Documentation

In general, data documentation should include:

1. Information on how the data was collected.
2. An explanation of what each field and row represents.
3. Details on how the data is stored, including data types (for example, whether a
field is an integer or a text field).

Summary Statistics
Mean is the mathematical average of a set of numbers. It’s a useful way to identify the
expected value for an entire column or row of data.

Median is the middle value of a set of quantitative values. To find the median of a
range of numbers, you must first arrange them in order from least to greatest. The
number in the middle is the median. If the set has an even number of values, the median
is the average of the two middle numbers.

Mode is the value that appears most frequently in a set of numbers.

You may also see these values referred to as descriptive statistics. Why? Because they
allow you to take a range of numerical inputs and output a number that’s descriptive of
that range.

=average(firstcell:lastcell) *to calculate mean*

=median(firstcell:lastcell)

=mode(firstcell:lastcell)

You may also see formulas expressed like this: =average(number1, number2...)

Note that (firstcell:lastcell) is used when the cells you're calculating are sequential (like
A1, A2, A3). Instead of typing each cell name out, we can simply ask Excel to calculate
the values in and between cells by naming the first and the last cell (A1:A3). This is a
major time saver when you're calculating formulas on large data ranges!
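
If you'd like to check these formulas outside Excel, here is a minimal sketch using Python's standard `statistics` module. The list `days_to_close` is made-up sample data standing in for cells A1:A8; it is not taken from the 311 data set.

```python
import statistics

# Hypothetical sample values (assumed for illustration only)
days_to_close = [0, 0, 1, 3, 3, 3, 7, 12]

mean_days = statistics.mean(days_to_close)      # like =average(A1:A8)
median_days = statistics.median(days_to_close)  # like =median(A1:A8)
mode_days = statistics.mode(days_to_close)      # like =mode(A1:A8)

print(mean_days, median_days, mode_days)
```

With eight values, the median is the average of the fourth and fifth values once they are sorted.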
The Power of Mean and Median
Mean and median may seem similar, but they’re actually quite different. The mean of a
data set will be affected by any numbers in the set that are abnormally large or small.
The median, on the other hand, is resilient — those outliers won’t have an impact.

So, if you compare the mean and median of your data set, you’ll be able to tell whether
it has a normal symmetric distribution, like a bell curve, or an asymmetric distribution.

A distribution’s degree of asymmetry is called its “skewness.” The distribution of a skewed
data set might look something like this:
If Mean > Median
If the mean is larger than the median, as it is here, the data set is positively skewed.
This usually means that there’s at least one number in the set that’s much larger than the
rest. In this case, it’s the ticket that took 320 days to close.
If Mean < Median
If the mean is smaller than the median, the data set is negatively skewed, usually with at
least one number that’s much smaller than the rest.
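
The mean-versus-median comparison can be sketched in a few lines of Python. The function name `skew_direction` and both sample lists are assumptions made for illustration. Treat the result as a rule of thumb rather than a formal skewness calculation.

```python
import statistics

def skew_direction(values):
    """Guess the direction of skew by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive"
    if mean < median:
        return "negative"
    return "no skew detected"

# Hypothetical data: one very large value pulls the mean above the median
high_outlier = skew_direction([1, 2, 3, 4, 320])

# Hypothetical data: one very small value pulls the mean below the median
low_outlier = skew_direction([-320, 1, 2, 3, 4])

print(high_outlier, low_outlier)
```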
Standard Deviation
When investigating your data, you may also want to get a sense of its spread. Is the
bulk of the numbers in the set close to the average, or do they cover a wider range?

To answer this question, you’ll want to calculate the standard deviation, a value that
represents how much a set of data departs from the mean. The greater the standard
deviation, the wider the spread of the data.

In Excel, you can use the following functions:

• STDEV.P will calculate the standard deviation of a population (a population
includes all members of a defined group).
• STDEV.S will calculate the standard deviation of a sample, a smaller subset of
the defined group.
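
To see how the two calculations differ, here is a small sketch using Python's standard `statistics` module, where `pstdev` plays the role of STDEV.P and `stdev` the role of STDEV.S. The numbers in `values` are invented for the example.

```python
import statistics

# Hypothetical data set (assumed for illustration only)
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: divides by n, like STDEV.P
population_sd = statistics.pstdev(values)

# Sample standard deviation: divides by n - 1, like STDEV.S
sample_sd = statistics.stdev(values)

print(population_sd, sample_sd)
```

The sample version is always a little larger for the same numbers, because it divides by one fewer value.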

Skewness of Time to Close


Because an effective 311 service responds quickly, let’s use the summary statistics
we’ve calculated to identify a standard time to close for all closed tickets. This number
can serve as a benchmark for the service, and as it’s based on the service’s past
performance, we know it’ll be reasonably achievable.

• Mean: The average number of days it took to close a ticket in 2017 was 12.6
days.
• Median: The middle value of days from open to close was three days.
• Mode: The most frequent number of days from open to close was zero days.

When the mean and median are similar, we can use either to guide our “standard.” If
there is a sizable difference between the two, it’s because the mean has been skewed
by outliers, and we should rely on the median.

In this case, we should set our standard time to close by the median value: three days.
Our mean has been skewed by the few tickets that took hundreds of days to close.
Visualizing the Skewness
A histogram takes a single quantitative variable and plots how frequently it occurred.
For example, we could use a histogram to show the frequency of “days to close.”

How frequently were tickets closed in 1–3 days? How frequently were they closed in 4–6
days?

When do we start to see outliers pulling up the average (mean) number of days to close?

Are there tickets we should exclude from our data set to avoid this skewing?

Histograms can help us answer these questions. With a data set of almost 200,000
tickets, you’re unlikely to want to scroll through the “days to close” column and work
this out by hand.

We can quickly create a histogram by:

• Selecting the data we want ’grammed: Column F, “Days From Open to Close.”
• Navigating to the “Insert” tab.
• Selecting the histogram icon (one of the statistic charts).

Histogram Attempt No. 1: Output


Recall that a histogram plots how frequently a quantitative variable occurs.

Take a moment to study the histogram output.

Frequency is on the y axis and days to close is on the x axis. You’ll notice that the x axis
is divided into groups of days to close, or bins. For instance, the first bin represents
tickets closed in 0 to 2 days, the second bin represents tickets closed in 3 to 5 days,
and so on.
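
The idea of sorting values into bins and counting them can be sketched in Python. This assumes three-day bins like the ones described above (0 to 2, 3 to 5, and so on). The helper `bin_label` and the sample values are hypothetical, and Excel's automatic bins may be chosen differently.

```python
from collections import Counter

def bin_label(days, width=3):
    """Return a label such as '0-2' for the bin that contains `days`."""
    start = (days // width) * width
    return f"{start}-{start + width - 1}"

# Hypothetical "days to close" values (assumed for illustration only)
days_to_close = [0, 1, 2, 2, 3, 5, 6, 40]

# Count how many values fall into each bin
frequency = Counter(bin_label(d) for d in days_to_close)

for label, count in frequency.items():
    print(label, count)
```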
Histogram Attempt No. 1: More Output
The histogram shows that:

• Around 90,000 tickets were closed between zero and two days after they were
opened. That’s almost half of our data set.
• Nearly 20,000 tickets took 3–5 days to close, and almost the same number of
tickets took 6–8 days to close.

So, what’s happening with the long, seemingly empty space to the right?

Funny you should ask — the histogram is showing the positive skew of the data set!

Doesn’t our histogram look similar to the graphic representing positively skewed data
we saw earlier?
Finding the Analysis ToolPak
With our first histogram, Excel chose our bins for us. Let’s try creating one a different
way, using the Analysis ToolPak. First, click on the “Data” menu and find the “Data
Analysis” button:

If it’s not there:

For Macs: Tools > Excel Add-Ins > check “Analysis ToolPak” > click “OK.”

For PCs: File > Options > click the “Add-Ins” category > choose “Excel Add-ins” in the
“Manage” box and click “Go” > check the “Analysis ToolPak” box > click “OK.”

The “Data Analysis” button should now be in your toolbar.

Note: The Analysis ToolPak is available in Excel for Windows, but on a Mac it requires
Excel 2016 or later. If you have Excel for Mac 2011 or 2008, you’ll need to install a
third-party data analysis tool such as StatPlus:mac LE.

Spoiler Alert
These are the values we need in order to create a histogram using the Analysis
ToolPak.

Input Range: This is where you enter the range of cells containing the data we’re
’gramming: the number of days each ticket took to close.

Bin Range: This is the range of cells that contain the bins we predetermine.

Output Range: This is where Excel will put the results.


Input and Bin Range
Because we’ve got our “Input Range” covered — Column F (F2:F191335) — we’ll move
right along to the “Bin Range.”

Our first histogram had a TON of bins, leaving us with a long tail to the right. Let’s see
what happens if we reduce the number of bins.

There’s no magic number of correct bins to use when designing a histogram. One or
two will usually be too few to see a pattern, while hundreds will result in a histogram
that’s difficult to read. A good rule of thumb is to use between five and 20 bins.

Min and Max


We’ll set up our second histogram attempt in a new tab called “Histogram Two.” We’ll
use the minimum and maximum values of days to close to guide our bins.

We can find these in a few ways:

• Looking at the first and last numbers in the filter on Column F.
• Sorting Column F in ascending order and looking at the first and last cells.
• Using functions!

Copy and paste these formulas into your “Histogram Two” tab, or type them in for
practice:

=min(first_number:last_number)

=min('All Closed Tickets'!F2:F191335)

=max(first_number:last_number)

=max('All Closed Tickets'!F2:F191335)


Bins
Our range is (0, 605).

Min: 0 days to close

Max: 605 days to close

Because 20 is the recommended maximum number of bins for an easy-to-read
histogram, we’ll round down to 600 and divide 600 by 20. This gives us bins that are
each 30 days wide.

So, our first bin will be tickets closed within 0–30 days, our second bin will be tickets
closed within 31–60 days, and so on.

In a column on the “Histogram Two” tab, copy and paste the following bin values:

30
60
90
120
150
180
210
240
270
300
330
360
390
420
450
480
510
540
570
600
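
The bin values above follow directly from the arithmetic in this section, which can be sketched in Python. Rounding 605 down to the nearest hundred is our assumption about how the text gets to 600. The variable names are made up for the example.

```python
max_days = 605   # maximum days to close, from the text
bin_count = 20   # recommended upper limit on bins, from the text

# Assumption: "round down to 600" means rounding down to the nearest hundred
rounded_max = (max_days // 100) * 100

# 600 / 20 = 30 days per bin
bin_width = rounded_max // bin_count

# Upper edge of each bin: 30, 60, ..., 600
bin_edges = list(range(bin_width, rounded_max + 1, bin_width))

print(bin_edges)
```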

A Familiar Face
We saw this before when we learned about the information the Analysis ToolPak needs
to create a histogram for us.

Fill in the “Input Range,” “Bin Range,” and “Output Range,” and make sure to check
“Chart Output.”

Yours should look something like this (the “Input Range” should read F2:F191335).

Histogram Attempt No. 2


This is what the Analysis ToolPak gives us. Interesting, right?

We still have a long, seemingly blank tail on the right. But more importantly, Excel has
provided a table with the frequency of tickets in each bin. With this table, we can quickly
note that almost 180,000 tickets (close to 94 percent of our data set) fall into
our first bin.

So, what now?

You guessed it…


Histogram Attempt No. 3
Thanks to the Analysis ToolPak, we have a quick and dirty way to assess the frequency
of each bin.

What if we make our maximum 45 days and ignore the outliers from then on?

Should we remove data from a histogram? No. But thankfully, the Analysis ToolPak
groups outliers into a “more” bin to condense the histogram without eliminating these
values.

Proceeding with 45 as the maximum days to close will move 13,415 tickets (the sum of
tickets that took more than 45 days to close) to the “more” bin.

Note that 13,415 tickets is less than 10 percent of our data set of 191,334 tickets!
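
Here is a rough Python sketch of how a “more” bin works. Each bin counts values up to and including its upper edge, and anything above the last edge lands in “More.” The function name, the sample values, and the bin edges 15, 30, and 45 are assumptions for illustration; the text only fixes the maximum at 45. The last line checks the percentage quoted above.

```python
from bisect import bisect_left

def toolpak_style_histogram(values, bin_edges):
    """Count values per bin; each bin includes its upper edge.

    Values above the last edge are counted in a final "More" bucket.
    """
    counts = [0] * (len(bin_edges) + 1)  # one extra slot for "More"
    for value in values:
        counts[bisect_left(bin_edges, value)] += 1
    labels = [str(edge) for edge in bin_edges] + ["More"]
    return dict(zip(labels, counts))

# Hypothetical "days to close" values and bin edges
days_to_close = [0, 3, 15, 16, 30, 45, 46, 605]
histogram = toolpak_style_histogram(days_to_close, [15, 30, 45])
print(histogram)

# Share of tickets in the "more" bin, using the counts given in the text
share_in_more = 13_415 / 191_334
print(f"{share_in_more:.1%}")
```

Nothing is deleted here: the two largest sample values are still counted, just in a single bucket.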
Aggregate Functions
OK, now that we’ve seen histograms, let’s move on to our last navigational tool:
aggregate functions.

Aggregate functions include average, sum, and count, and they help us summarize values
across multiple rows or columns.

You’ll most frequently find them used in business settings to calculate things like the
sum total of revenue or the number of units sold. But that’s not their only application.
They’ll come in handy for our 311 data, too.

How? Let’s see.

311 Aggregates
Average: The same as mean.

=average(firstcell:lastcell)

In our data set, if we calculate the average of a time column (such as time to update or
time to close), we’ll learn the average amount of time that step took across all tickets.

Sum: The total amount of a given set of numbers.

=sum(firstcell:lastcell)

For our client, the key measurement is time, so we might find the sum total number of
days taken to open, update, or close tickets.

Count: A quantity of something.

For numerical data (counts only the cells that contain numbers):

=count(firstcell:lastcell)

For alphanumeric data (counts every cell that isn’t empty):

=counta(firstcell:lastcell)

In this example, the relevant quantity is the total number of tickets in the data set.
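
To make the difference between these aggregates concrete, here is a minimal Python sketch. The list `column` is a made-up column of cells, with `None` standing in for an empty cell, and `is_number` is a hypothetical helper.

```python
# Hypothetical column of cells; None represents an empty cell
column = [12, 3, "n/a", None, 0, "closed"]

def is_number(cell):
    """True for numeric cells (booleans are not treated as numbers here)."""
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)

numbers = [cell for cell in column if is_number(cell)]

total = sum(numbers)                                   # like =sum(...)
average = total / len(numbers)                         # like =average(...)
count = len(numbers)                                   # like =count(...)
count_a = sum(1 for cell in column if cell is not None)  # like =counta(...)

print(total, average, count, count_a)
```

Note that the zero counts as a number, while the text cells only show up in the COUNTA-style total.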
The IF Function
This is the basic format for an IF statement in Excel:

=IF(Condition, Value to return if true, Value to return if false)

It’s telling Excel: If this condition is met, return the first value specified. Otherwise,
return the second value specified.

Let’s try it in action. For example, we could ask Excel IF the value in cell A4 below is 6.

We’d use this function: =IF(A4=6, "yes", "no")

Excel would return “no,” because the value in A4 is 24, so the condition is false.
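
The same logic can be sketched in Python. The function `excel_if` is a hypothetical stand-in for Excel's IF, and `a4` holds the value the text gives for cell A4.

```python
def excel_if(condition, value_if_true, value_if_false):
    """Return one of two values depending on a condition, like =IF(...)."""
    return value_if_true if condition else value_if_false

a4 = 24  # the value in cell A4, according to the text

result = excel_if(a4 == 6, "yes", "no")
print(result)  # prints "no"
```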

Conditional Operators
In the previous example, we used the equals sign to ask whether or not one value (A4)
was equivalent to another (6). But there is a range of other conditional operators we can
use to ask Excel questions.

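Here’s the list:

=   equal to
<>  not equal to
>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to
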
Two Conditions
IF statements are great, but they’re pretty simple. To ask Excel a slightly more complex
question, you may need to set multiple conditions. You can do this by creating logical
functions that use AND or OR.

What exactly does that mean? Let’s say the client wants to identify tickets that were
both opened within their target response time AND closed within three days.

Here, we’ll use the AND function to ask Excel if a value meets more than one condition.
If both conditions are true, the function will return TRUE, but if any are false, it will return
FALSE.

Breaking Down the AND Function


Here’s the basic AND function:
=AND(condition1, condition2...)

In this case:
Condition 1: Time to update is less than or equal to one day.
Condition 2: Time to close is less than three days.

So, we’ll use:


Operator 1: <= 1
Operator 2: < 3

The AND function does not give us the chance to designate an “action if true” and an
“action if false.”

The AND function will always return TRUE if all conditions are met and FALSE if any
one condition is not met.
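
Here is a small Python sketch of the two conditions above combined with AND. The function name `meets_target` and the sample inputs are assumptions for illustration.

```python
def meets_target(days_to_update, days_to_close):
    """True only if both conditions hold, like =AND(condition1, condition2)."""
    updated_in_time = days_to_update <= 1   # Condition 1: <= 1
    closed_in_time = days_to_close < 3      # Condition 2: < 3
    return updated_in_time and closed_in_time

both_met = meets_target(1, 2)        # both conditions are true
slow_close = meets_target(1, 3)      # Condition 2 fails
slow_update = meets_target(2, 1)     # Condition 1 fails

print(both_met, slow_close, slow_update)
```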

The OR Function
Like the AND function, the OR function allows us to ask Excel if a value meets more
than one condition.

Unlike the AND function, if any of the conditions are met, the OR function returns TRUE.

The OR function will only return FALSE if none of the conditions are met.

The basic OR function syntax is:

=OR(condition1, condition2,…)
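
For comparison, here is a sketch of OR in Python using the built-in `any`. The function `excel_or` and the example of flagging slow tickets are hypothetical; they are not part of the client's request.

```python
def excel_or(*conditions):
    """True if at least one condition is true, like =OR(condition1, ...)."""
    return any(conditions)

# Hypothetical check: flag a ticket that was slow to update OR slow to close
days_to_update, days_to_close = 0, 10
flagged = excel_or(days_to_update > 1, days_to_close >= 3)

# No condition is met, so the result is False
not_flagged = excel_or(0 > 1, 2 >= 3)

print(flagged, not_flagged)
```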
Now you’ve got a table — great!
Let’s say you’re in the middle of your analysis for the city of San Francisco, when the
311 office dredges up more data or decides to add another year of 311 logs to your
plate. A huge headache, right? Not with a table.

Tables easily accommodate new information. With a click of the “refresh” button,
any data object connected to your table automatically updates to reflect new or
changed data.

That means tables are also aces at dealing with live data. In our 311 example, you
might be working with live data if your Excel workbook was connected to San
Francisco’s current 311 logs as operators added and updated new tickets in real time.

To retrieve live data, you’d need to connect to a source like a Microsoft SQL Server
database, a PostgreSQL server, or a website. You can set this up using the buttons
highlighted in red.

Segmenting Data
OK, so we know that tables are pretty cool, but what do they actually do? They provide
insight into certain segments of your data.

This is in contrast to summary statistics, which describe features of your data set as a
whole.

For example, summary statistics make it easy to calculate the average amount of time it
took to close all tickets — simply find the mean, or average, of all values in the column
titled “Days From Open to Close.”

=average(F2:F191335)

But if we wanted to get more specific, we’d need to turn to a table: specifically, a
PivotTable. A PivotTable might help us answer questions like:

• On average, do tickets reported on Twitter get closed faster than those reported
by phone?
• How does the number of tickets reported in the Balboa Terrace neighborhood
compare to the number of tickets reported in Haight Ashbury?

PivotTables make it easy to calculate aggregate functions for small segments of
data, as opposed to the data set as a whole.
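
The idea behind a PivotTable, an aggregate calculated per segment, can be sketched in plain Python. The rows, the field names `source` and `days_to_close`, and the helper `average_by` are all invented for the example; they are not the real column names or values from the 311 data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ticket rows (assumed for illustration only)
tickets = [
    {"source": "Twitter", "days_to_close": 2},
    {"source": "Twitter", "days_to_close": 4},
    {"source": "Phone", "days_to_close": 1},
    {"source": "Phone", "days_to_close": 3},
    {"source": "Phone", "days_to_close": 8},
]

def average_by(rows, key, value):
    """Group rows by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}

averages = average_by(tickets, "source", "days_to_close")
print(averages)
```

Swapping `mean` for `sum` or `len` gives the per-segment sum or count instead.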

Insert Pivot
Follow along with us as we create a PivotTable:

• First, click on any cell inside the table on the All Closed Tickets sheet, then go to
“Insert” and click “PivotTable.” In the PivotTable dialogue, notice that it now says
“Table1” in the Table/Range box.
• Next, choose where you want your PivotTable to be located in your workbook. The
default setting creates a new worksheet for your PivotTable, but you could also choose
to add the PivotTable to an existing worksheet. Add this PivotTable to a new
worksheet.
• Click “OK” and Excel will display the PivotTable Builder. Think of this as the control
panel for our PivotTable.

This is where it gets interesting...
Remember, we wanted to compare the average time it takes to close Twitter
tickets versus time it takes to close phone tickets.

But right now, our PivotTable isn’t telling us what we need to know — it’s showing the
total number of days with open tickets from each source. Not super helpful. That’s
because the Values field in PivotTables defaults to “sum.”

Still, we can easily change the Values field to display an average instead.

On a Mac, click on the “i” next to “Sum of Time to Close (Days).” This brings up a
pop-up that lets you change how the values are summarized, among other
options. Select “average.”

On a PC, click the dropdown next to “Sum of Time to Close (Days).” Then, click “Value
Field Settings.” This brings up a pop-up that lets you change how the values are
summarized, among other options. Select “average.”
What if we want to investigate how the number of tickets opened varies by each
quarter?

We can build a PivotTable that provides the count of tickets opened, segmented by
quarter. The first thing we need to do is insert a new PivotTable. Since we’re looking to
measure the “count of tickets by quarter,” we need to identify a field in which each
record is represented once.

That would be the “CaseID” field. Every ticket filed is ascribed a unique CaseID. So,
we’ll drag the CaseID field to the “Values” section of the PivotTable Builder.

The value defaults to the “sum” of the CaseIDs, which is not what we’re interested in.
We can open field settings and change the value from “sum” to “count.”

Our PivotTable now tells us there are 191,334 tickets in our All Closed Tickets data
set. Now we need to group these tickets.

We will drag the “Opened” field to the “Rows” section of the PivotTable Builder.

Notice how "Years" and "Quarters" have been added to the Rows section of your
PivotTable Builder. This is because Excel has automatically recognized "Opened" as a
date and included the most popular date parts Excel users typically add.

Drag both “Years” and “Quarters” back to the field list so they are removed from Rows.

You should now be left with only the months of the “Opened” date, without any context
regarding corresponding years.

Right-click anywhere on a month that's displayed. Click on “Group.” This should display
a “Grouping” window, with “Months,” “Quarters,” and “Years” already highlighted.
To answer our question, select “Quarters” and “Years” using the shift key to highlight
both. Click “OK” to apply this aggregation.

This will separate your “Opened” dates out by the years, as well as the quarters, in
which they were filed, using calendar quarters (January through March is the first
quarter).

We can easily see that in the first three months of 2017, there were 22,218 tickets filed
that were later closed.
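
Grouping dates by year and quarter can be sketched in Python as well. The dates in `opened` are made up, and the helper `year_quarter` assumes calendar quarters, with January through March as the first quarter.

```python
from collections import Counter
from datetime import date

def year_quarter(day):
    """Return (year, quarter), where January through March is quarter 1."""
    return (day.year, (day.month - 1) // 3 + 1)

# Hypothetical "Opened" dates (assumed for illustration only)
opened = [
    date(2016, 12, 31),
    date(2017, 1, 15),
    date(2017, 3, 31),
    date(2017, 4, 1),
]

# Count of tickets per (year, quarter), like a PivotTable count grouped by date
tickets_per_quarter = Counter(year_quarter(day) for day in opened)

for (year, quarter), count in sorted(tickets_per_quarter.items()):
    print(year, f"Qtr{quarter}", count)
```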
Where We Stand
Your client, a program manager in the San Francisco city government, needs help
improving the city’s 311 system. They’ve given you data on 311 tickets filed in San
Francisco and asked for help finding patterns and identifying areas for improvement.

So far, you’ve cleaned the data, summarized it using descriptive statistics, and put it into
a table. You’ve gleaned the following insights:

• The baseline number of days to close a ticket should be three. We used the
median value of three days to arrive at this conclusion, as the mean (12.6 days)
was skewed by outliers.
• On average, tickets reported through Twitter and by phone are closed in a similar
amount of time. However, the average time to close for tickets reported by the
web is half that of tickets reported on Twitter.
• The number of tickets reported in Haight Ashbury is 10 times the number of
tickets reported in Balboa Terrace.

So Now What?
Now, it’s time to relay those insights to your client. You could simply tell them what
you’ve learned, but don’t you want to wow them with your findings? A chart will help you
do just that.

Charts can convey complex information in simple, impactful ways. The four most
common charts are: pie, line, bar, and scatter.
Pie Charts
Pie charts show the relationship of parts to a whole.

Look at the chart below. It’s comedic, but it uses a simple visual gag to make a clear
point; one we can all relate to.

Out of the total time people spend using their Tupperware (or any other food storage
container), most of it is devoted to looking for the correct lid.

Dos and Don’ts of Pie Charts


Just like their namesake (actual pie), pie charts are great — in moderation. Overdo
them or serve them up in the wrong context, and you’ll find they quickly lose their
appeal.
Here are some things to keep in mind when considering a pie chart:

• Pie charts should only be used for data sets that have five or fewer segments. It’s
hard for the human eye to distinguish between shares that differ by less than 20
percent.
• If segments are so similar in size that you need to label the percent represented
by each, opt for a different kind of chart.

We’ll make a pie chart that shows the share of tickets filed by source.

Like most things in Excel, there is more than one way to do this. Start by creating a
PivotTable of what we want to chart: in this case, the count of CaseID (or tickets) by
source.

Our PivotTable shows us there was one ticket filed by email and more than 89,000 filed
by phone. To display this data as a pie chart, go to the “Insert” tab, find the pie chart
icon, and select the 2-D option.

Here you can see that almost 50 percent of tickets were filed over the phone.
Because we created this pie chart from a PivotTable, it is technically a PivotChart.

If you’re using Excel 2016, there is no visible difference between a PivotChart and a
regular chart. If you’re using any other version of Excel, you’ll see some filterable
options in your PivotChart.

To create a pie chart without the filterable fields, you can copy and paste the values you
want to chart into a new range of cells. With those cells selected, follow the same steps
— go to the “Insert” tab and select the pie chart icon.
Bar Charts
Pie charts are eye-catching and popular, but bar charts are the true workhorse of the
chart world. They’re often the first kind of chart we learn to read, so people tend to be
most comfortable interpreting them.

Check out this kindergarten-level worksheet from K5 Learning — proof that people’s
exposure to bar charts starts early.

Bar charts display the measure or metric of a category. Generally, the x axis is a
categorical variable and the y axis is some measure.
More Bar Charts
Let’s go back to that messy European Parliament pie chart and re-bake the data into
something more useful: a bar chart.

As you can see, the number of representatives per party is much clearer when the data
is presented as a bar chart. The x axis displays the party (a categorical variable), while
the y axis represents the number of reps (a measure).

At a glance, we can see that the ALDE has greater representation than ECR in the
European Parliament. ALDE has about 80 representatives, while the ECR has about 60.

The reality is, most pie charts can and should be bar charts.
Line Charts
OK, we’ve got pie charts and bar charts down. Now it’s time for line charts.

Line charts show a series of data points connected by a line. They’re frequently used in
time trend, or time series, analysis — i.e., the analysis of metrics measured over time.

You’ve probably seen line charts used to visualize changes in stock prices over time,
but they have many other uses as well. Check out the chart below.

It shows the number of times per week Google users searched for the phrase "curling"
over the past four years. The peaks correspond with the Winter Olympics.

Line Charts 2.0


We know, we’ve been singing the praises of line charts that map a metric against time,
but they’re great for charting the relationship between almost any two metrics.

The example below charts the average number of stays booked in an Airbnb rental
property, based on its price per night. Here, the two metrics are the number of stays and
the price per night.

While there are some outliers, the chart does a good job of illustrating that total stays
per property generally decrease as prices increase:
Scatterplots
Now, it’s on to our last kind of chart: the scatterplot. Scatterplots can also be used to
compare the relationship between two metrics.

The graph below plots movement by all football (soccer) players at the 2014 World Cup.
The x axis represents meters run per minute when the player's team has possession,
and the y axis represents meters run per minute when the other team has possession.

Each dot represents an individual record or data point. In this case, each dot is a player.

You’ll notice that the chart also uses multiple colors. Each color represents a different
field position, making it easy to compare the relative amount of running done by various
positions. Other scatterplots create similar effects by altering the size or shape of the
dots.

(If you’re wondering why Lionel Messi is called out as an outlier here, it’s because he
notably struggled with fatigue in the 2014 World Cup.)
