With Early Release ebooks, you get books in their earliest form—the
author’s raw and unedited content as they write—so you can take advantage
of these technologies long before the official release of these titles.
Ethan Lang
Statistical Tableau
by Ethan Lang
Copyright © 2023 Ethan Lang. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles (http://oreilly.com). For
more information, contact our corporate/institutional sales department: 800-998-
9938 or corporate@oreilly.com.
On this page you will see Tableau’s different products. Find Tableau Desktop
and click on the link underneath that says “SEE ALL RELEASES” (see figure 1-
2).
Figure 1-2. Releases by Product on Tableau’s Website
You will find all the versions of Tableau Desktop on this page. At the time of
this writing I am using 2023.1. Tableau has a very aggressive release schedule
and typically pushes out a new version every quarter. These releases are
denoted by the year and the quarter, followed by the latest update number. Click
on 2023.1 to expand that version and then download the software (see figure 1-
3).
After installation open Tableau Desktop. This is a licensed product so you will
be prompted to enter your information. You can also opt to start a free trial of the
tool. Once you have finalized those details you will land on the Start Page as
shown in figure 1-4.
Figure 1-4. Start Page of Tableau Desktop
Tableau Desktop has hundreds of connectors that you can use to access data. On
the left hand side of the Start Page you can explore all of those options. For all
the demonstrations in this book I will be using the Sample - Superstore dataset.
To connect to this dataset, simply click on Sample - Superstore which I have
highlighted in figure 1-5 below.
Figure 1-5. Choose Sample - Superstore from the List of Connectors
After clicking on the sample dataset, you will be navigated from the Start Page
to Tableau Desktop’s authoring interface which you can see in figure 1-6.
Figure 1-6. Tableau Desktop’s Authoring Interface
I provide step by step instructions in the following chapters so I won’t spend too
much time covering every aspect of the authoring interface. However, to give
you a brief overview, on the left hand side you will find the Data pane as shown
in figure 1-7.
Figure 1-7. The Data Pane of the Authoring Interface
At the top of the Data pane you will see a list of the data sources you are
connected to. Moving down you will find a list of measures, dimensions, bins,
and sets. Last, you can find a list of your parameters.
To the right of the Data pane you will find the Marks shelf, Filters shelf, Pages
shelf, Columns shelf, Rows shelf, and canvas which you can see in figure 1-8.
The last major feature I want to call out in this chapter is in the bottom left
corner of the authoring interface. There you will find a button to navigate to the
Data Source page and three small buttons. These buttons are used to create new
sheets, new dashboards, or a story (see figure 1-9).
Figure 1-9. Key Actions of the Authoring Interface
Using Tableau Desktop is very intuitive and there are many different ways to do
things. To give you a basic example I am going to show you how to create two
simple charts and how to add them to a dashboard. First, double click on Sales in
the Data pane then double click Order Date. You should end up with a line chart
(see figure 1-10).
This will open Sheet 2; your first chart is still viewable by navigating back to
Sheet 1. Double click on Sales then Segment in the Data pane. This will create a
simple Bar Chart showing the SUM of Sales by Segment on the canvas similar
to figure 1-12.
This will open a new view where you can create dashboards as shown in figure
1-14. Dashboards are the bread and butter of Tableau and are ultimately what
you will publish online for users to interact with.
Figure 1-14. Dashboard Canvas in Tableau Desktop
Now add your two sheets on the dashboard canvas. On the left click and drag
Sheet 1 onto the canvas. Then click and drag Sheet 2 onto the canvas. Your
dashboard should now look similar to figure 1-15.
Figure 1-15. Creating a Simple Dashboard Layout in Tableau Desktop
Introduction to Statistics
According to the Merriam-Webster dictionary, statistics is defined as “a branch of
mathematics dealing with the collection, analysis, interpretation, and
presentation of masses of numerical data.” I personally think this definition
hits the nail on the head, especially in today’s landscape. To unlock deep insights in
your data you need to incorporate statistics into almost every aspect of the
analytics process. This includes collecting data in an efficient and ethical way,
understanding the data, finding deeper insights in the analysis, and presenting
your findings so your stakeholders can make informed decisions.
The best way I can introduce you to the power that statistical analysis can unlock
is to dive into a real world example. Let’s say that your company wants to test
some new marketing in an email. However, they are worried that if the new
marketing fails it could significantly impact sales for this quarter. Therefore,
they want to test the new marketing email by sending it to a subset of the total
email list, then analyze the performance. The results of that test are shown in
the contingency table (table 1-1).
The marketing team has done a simple analysis looking at the conversion rates
of the emails by taking Conversions / Total Sent. Using this calculation they
found that the original email had a conversion rate of about 3% (23/750 ≈ 0.031)
and the new marketing email had a conversion rate of about 6% (8/125 = 0.064).
They claim that the new email is an absolute success and that it will lead to
double the number of conversions when they send it out to their entire list next
time.
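As a quick check of the arithmetic above, the conversion rates can be reproduced in a few lines of Python. This is only a sketch: the variable names are my own, and the counts are the ones quoted in the text.

```python
# Counts quoted in the text: (conversions, total sent) for each email.
emails = {
    "original": (23, 750),
    "new": (8, 125),
}

# Conversion rate = conversions / total sent.
rates = {name: conv / sent for name, (conv, sent) in emails.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")

# How many times higher the new email's rate is than the original's.
lift = rates["new"] / rates["original"]
print(f"lift: {lift:.2f}x")
```

The lift comes out at roughly two, which is where the marketing team's "double the conversions" claim comes from. Whether that difference is more than chance is a separate question.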
Management is thrilled with the idea of doubling the amount of sales and wants
to invest in several new salespeople to help with the increase. However, they
come to you for a second opinion and ask if the analytics team could review the
data and confirm the marketing team’s assumptions.
Where do you begin? This is where statistical analysis will become your best
friend. Armed with some basic statistics you know that you can run a few simple
tests that will let you know if the difference between the two emails was
statistically significant or not. Before I get too far into the weeds, let’s set a
foundation so you better understand the test and results.
Hypothesis test
The first thing you need to do in this situation is to set up a hypothesis test. In a
standard hypothesis test you set two hypotheses. The first is referred to as the
null hypothesis and the second is called the alternative hypothesis. For this
example the hypotheses will be as follows:
Null hypothesis
The new marketing email has no real effect, therefore email
conversions will remain the same on average as the original.
Alternative hypothesis
The new marketing email has a real effect, therefore email
conversions will be higher on average than the original.
In statistics it’s important to understand that you are always trying to prove
yourself wrong. What do I mean by that? You always want to assume that
nothing is going to change when new things are introduced. Therefore you want
to assume that the null hypothesis is correct, and your test will determine whether
the data gives you enough evidence to say it is wrong. If the test does not give
you enough evidence, you would say you have failed to reject the null hypothesis
(which is not the same as proving it correct). If the result turns out to be
statistically significant, you would say you reject the null hypothesis in favor of
the alternative.
Chi-square test
Now that you have your hypotheses set up, it’s time to run a statistical analysis.
In the spirit of providing you with a foundational understanding, I have decided
to run a simple statistical test called a chi-square test. This is a perfect test to run
in this situation and very accessible even if you’re new to statistics. You don’t
have to have any special software or know any coding to calculate this test. You
can do it by hand, run it in Excel, or look for a calculator online.
To begin, let’s revisit the contingency table and add to it. As you can see in table
1-2, I added totals for each column and row, and a grand total.
You need to use these totals to calculate the expected values for each cell in the
original table. This can be expressed mathematically as:
$$E_{i,j} = \frac{R_i \times C_j}{N}$$
where $R_i$ is the total for row $i$, $C_j$ is the total for column $j$, and $N$ is
the grand total, or more simply: ((row total * column total) / the grand total) for
each cell. Starting with the cell in the upper left, which I will call $E_{1,1}$, I have
(750*844)/875 = 723.43. I will calculate $E_{1,2}$, $E_{2,1}$, and $E_{2,2}$ the
same way, and in table 1-3 you can visually see where the connections are.
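With your expected values calculated you need to finish by comparing those
values to the values you observed. This step is expressed mathematically by the
following formula:
$$\chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$
where $O_{i,j}$ is the observed value and $E_{i,j}$ is the expected value for the cell
in row $i$ and column $j$.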
Simply put, you need to take the observed value minus the expected value I just
calculated, square that, then divide by the expected value. You will do this for
each cell, then add up each of the values you get from that. To make it simple to
follow along I have put those calculations in table 1-4 below.
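The calculation described above can be sketched in plain Python. The observed counts below are my reconstruction from the numbers quoted in the text (750 original emails with 23 conversions, 125 new emails with 8 conversions). Treat the table layout and the variable names as assumptions, not the exact contents of the tables in this chapter.

```python
# Observed counts, rows = email version, columns = (not converted, converted).
# Reconstructed from the text: 750 sent / 23 converted, 125 sent / 8 converted.
observed = [
    [750 - 23, 23],  # original email
    [125 - 8, 8],    # new marketing email
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value for each cell: (row total * column total) / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(f"expected upper-left cell: {expected[0][0]:.2f}")
print(f"chi-square statistic: {chi_square:.3f}")
```

The statistic is then compared against a critical value from a chi-square table for the appropriate degrees of freedom, (rows - 1) * (columns - 1), which is 1 for a 2x2 table.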
You can see that the standard deviation, r, and mean are all the same across all
four datasets. However, if you were to plot the datasets and visualize them as
shown in figure 1-16, you can clearly see that each dataset is very different.
Figure 1-16. Visual Representation of Anscombe’s Quartet
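To see how closely the summary statistics agree, here is a minimal Python sketch. It uses the values of Anscombe's quartet as published by Anscombe in 1973, typed in by hand here and not taken from this book's tables. The helper name pearson_r is my own.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Mean of y, sample standard deviation of y, and r for each dataset.
summary = {
    name: (mean(ys), stdev(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (m, s, r) in summary.items():
    print(f"{name}: mean(y)={m:.2f}  sd(y)={s:.2f}  r={r:.3f}")
```

The four printed rows agree to two decimal places, even though the scatter plots of the four datasets look nothing alike.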
Summary
In this chapter I discussed what Tableau is and listed several of their key
products. I then showed you how to download Tableau and gave you a brief
overview of the product. This foundational knowledge will be key in later
chapters especially if you are new to Tableau. Then I showed you a basic
example of a statistical analysis and walked you through the importance of
running these tests. To drive home the importance of pairing both Tableau and
statistics together I showed you the Anscombe’s Quartet example. In the
following chapters I will show you how to start incorporating statistical analysis
into your data visualizations in Tableau.
Chapter 2. What is a Confidence Interval
As you begin applying statistics to your analysis you will be producing a lot of
estimates. These estimates will always have uncertainty around them because
you are making assumptions about things unknown to you. To quantify
uncertainty, statisticians turn to confidence intervals. There are many types of
confidence intervals, but for this book I will cover two-tailed confidence
intervals for the average and the median.
To be less abstract, a confidence interval is a range of values that you expect
your estimate to fall between a certain percentage of the time if you were to run
your experiment again or re-sample the population the same way. Confidence
intervals normally contain the average or median of the estimate and a plus and
minus variation from the estimate. This plus or minus variation is your
confidence interval range.
To set your interval range you first have to decide on what your confidence level
is. The standard level you will see is 95% confidence. However, you can
increase it, which would widen your interval range, or decrease your level of
confidence, which would narrow your interval range.
Your desired confidence level is usually one minus the alpha (α) value.
Confidence level = 1 - α
So if you use an alpha value of 0.05, then your confidence level would be 1 -
0.05 = 0.95, or 95%. If you wanted to be more confident you would use an alpha
value of 0.01, 1 - 0.01 = 0.99, or 99%. If you wanted to be less confident you
would use an alpha value of 0.1, 1 - 0.1 = 0.90, or 90%.
The level of confidence is completely up to you but important to understand. For
example, if you are working with healthcare data where your results could be the
difference between life or death, use a higher level of confidence. If you are
estimating the height of the next person to walk into a classroom of college
students, you have room to be less confident. Remember when someone says
they are 95% confident in an estimate they are basically saying that 95 out of
100 times the average of the estimate would fall between the upper and lower
values of the confidence interval.
If you visualize a confidence interval it would look like figure 2-1 below.
Figure 2-1. Confidence Interval Bell Curve
Looking at the visual you can see that the 95% is representing the majority of
this distribution with 2.5% on either side which represent the two tails of the
confidence interval. Let’s say I resampled the population and I ended up with
new results. 95% of the time the new average would fall in that 95% range like
figure 2-2 below.
Figure 2-2. Resampling and Falling within the Original Created Confidence Interval
If you were to re-run the experiment using this new sample of data you would
probably get similar results with some slight variation. However, you know that
5% of the time the average would fall outside the confidence interval range like
in figure 2-3.
Figure 2-3. Resampling and Falling Outside the Original Confidence Interval
If you were to re-run your experiment using this new sample of the data you
could get results that were farther off. This is why it is important to understand
that you can change your level of confidence to better suit your needs.
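One way to build intuition for what a confidence level means is a small simulation. The sketch below is my own illustration, not something taken from Tableau. It assumes a hypothetical normally distributed population (average 85, standard deviation 6.4), draws many samples of 12, builds a 95% interval around each sample average using the standard t-table value for 11 degrees of freedom (2.201), and counts how often the interval contains the true population average.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

TRUE_MEAN = 85.0   # hypothetical population average (assumption)
TRUE_SD = 6.4      # hypothetical population standard deviation (assumption)
N = 12             # sample size
T_VALUE = 2.201    # two-tailed t-value for 95% confidence, 11 degrees of freedom
TRIALS = 10_000

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    avg = mean(sample)
    margin = T_VALUE * stdev(sample) / sqrt(N)
    # Does this sample's interval contain the true population average?
    if avg - margin <= TRUE_MEAN <= avg + margin:
        hits += 1

coverage = hits / TRIALS
print(f"intervals containing the true average: {coverage:.1%}")
```

Under these assumptions the printed share should land close to 95%. Using a smaller t-value (a lower confidence level) narrows each interval and lowers that share.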
Since my degrees of freedom (DF) equal my sample size (n) minus one, I will
go to DF 11, then move across to the 0.95 column, which is my confidence level. I
end up with a t-value of 2.201. I now have all the data points I need to work
through my equation.
Upper confidence interval = 85 + 4.064 = 89.064
Lower confidence interval = 85 - 4.064 = 80.936
While this alone is powerful insight into the data, there is another quick statistic
you can calculate that will give you further insight, called the standard error. The
standard error is calculated by dividing the standard deviation (s) by the square
root of the sample size (n). Pulling those values from our formula I get the
following results.
$6.396 / \sqrt{12} = 1.846$
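The hand calculation above can be double-checked with a short Python sketch. The inputs are the values quoted in the text (average 85, standard deviation 6.396, sample size 12, t-value 2.201). The variable names are my own.

```python
from math import sqrt

# Values taken from the worked example in the text.
sample_mean = 85   # average test score
s = 6.396          # sample standard deviation
n = 12             # sample size
t_value = 2.201    # t-table value for DF = 11 at 95% confidence

# Standard error: standard deviation divided by the square root of n.
standard_error = s / sqrt(n)

# Margin of error: t-value times the standard error.
margin = t_value * standard_error

upper = sample_mean + margin
lower = sample_mean - margin

print(f"standard error: {standard_error:.3f}")
print(f"margin:         {margin:.3f}")
print(f"95% CI:         {lower:.3f} to {upper:.3f}")
```

Changing t_value to the table entry for a different confidence level is all it takes to widen or narrow the interval.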
The standard error tells you how accurately the sample data will reflect the total
population. The higher the standard error, the more volatility you can expect
from the total population; the lower the standard error, the less volatility. Since I
have a relatively low standard error you can say that the test scores in future
rounds will be close to the sample. However, 12 data points is a small amount.
My personal rule of thumb is that anything over 30 is a decent sample size. Be
careful of the assumptions you make with anything less than 30 rows of data.
A window will appear and ask you to choose the file you want to connect to.
Navigate to Chapter 3 - Test Scores.xlsx, select it, and click connect. From the
Data Source page, click Go to Sheet in the bottom left of the page (see figure 2-
7).
Figure 2-7. Data Source Page in Tableau After Connecting to Chapter 3 - Test_Scores.xlsx
To start, drag Test Scores onto the Rows shelf and Student ID to the Columns
shelf. This will give you a vertical bar chart in the view as shown in figure 2-8.
Figure 2-8. Bar Chart of Test Scores
From here, switch to the Analytics pane, drag Average with 95% CI onto the view,
and drop it on Table (see figure 2-9).
Figure 2-9. Adding Confidence Intervals to the View From the Analytics Pane
If you hover over the upper and lower bounds of the confidence interval a tooltip
will activate. This shows the upper and lower bounds as well as the sample
average when you hover over each line. You can see in figure 2-10 that the
confidence intervals match exactly what you got when you calculated them by
hand.
Getting the confidence intervals in Tableau took me 2 minutes total vs the 15-20
minutes it took me to calculate the results by hand. Not only that but Tableau is
extremely flexible and can compute these results extremely quickly even when
you begin to get more data.
From the Connect menu, choose Sample - Superstore, which is the second to last
option from the menu (see figure 2-12).
Figure 2-12. Data Connection Page in Tableau Desktop
Tableau will add a new data source into the top of the Data pane and you will see
a list of the measures and dimensions loaded in the Tables section of the Data
pane as shown in figure 2-13.
Create a new sheet by dragging Sales onto the Rows shelf then Sub-Category
onto the Columns shelf as shown in figure 2-14. This will create a vertical bar
chart that displays the SUM of Sales by the Sub-Category dimension.
Figure 2-14. Bar Chart of Sales by Sub-Category in Tableau Desktop
Toggle to the Analytics pane and drag Average with 95% CI onto the view. As
you can see in figure 2-15, Tableau automatically calculates this figure for you in
a matter of seconds. This dataset also has 10,194 rows of data. This is a
relatively small amount by today’s standards but large enough to know it would
not be feasible to calculate this by hand like you did for the test scores.
Figure 2-15. Implementing a Confidence Interval on Bar Chart of Sales by Sub-Category in Tableau
Desktop
Now that the confidence intervals are incorporated into the view, you can also
change the level of detail and Tableau will recalculate the results for you on the
fly. For example, replace Sub-Category with State/Province by dragging and
dropping the State/Province dimension onto Sub-Category in the Columns shelf.
Tableau will calculate the confidence interval for you immediately at this new
level of detail as shown in figure 2-16.
Figure 2-16. Bar Chart With 95% CI of Sales by State
Summary
In this chapter, I showed you how to calculate confidence intervals so you have a
better understanding of the math Tableau uses behind the scenes. This will allow
you to better understand and communicate the results to your stakeholders. I also
walked you through how to implement this model within Tableau using the built-in
feature from the Analytics pane. This allows you to apply this method at an
enterprise level on large amounts of data. Last, I showed you the flexibility of
Tableau when changing the level of detail. This allows you to move very quickly
when business requirements change.
Table of Contents
1. Introduction to Tableau
Download and start using Tableau Desktop
Introduction to Statistics
Hypothesis test
Chi-square test
Conclusions drawn from statistical analysis
Data Visualization and Statistics
Summary
2. What is a Confidence Interval
How to Calculate Confidence Intervals
Interpreting the Results
How to Calculate Confidence Intervals in Tableau
Check the confidence interval you solved by hand
Implement confidence intervals on sample dataset
Summary