Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 48

data science simplified!

Data Visualization
The Science Behind the Art of DataSt
orytelling
Agenda

▪ The Art of Data Storytelling


▪ Know Your Purpose and Data
▪ Craft Your Message
▪ Storytelling Assets
▪ Tips
▪ Superset
The Art of Data Storytelling
Why Visualize
▪ Today we receive 5X as much information as we did in 1986. This means how you
share your story will drastically determine the size of your audience. Researchers
have found that colour visuals increase the willingness to read by 80% and that we get
the sense of a visual scene in less than 1/10 of a second. Data visualization makes it
easy to recognize patterns and find exceptions while interpreting the data at a faster
pace. It allows access to challenging data sets, it allows exploration, can be fun and
provides useful information in an efficient way.

90%

5x 1/10
SECOND

THE AMOUNT OF THE AMOUNT OF TIME IT OF THE WORLD’S DATA


INFORMATION WE RECEIVE TAKES TO GET THE HAS BEEN CREATED IN
COMPARED TO 1985 SENSE OF A COLOURED THE PAST 2 YEARS
VISUAL
Visual Perception

▪ People are more inclined to perceive certain visual cues (variables) better than
non-visual cues.

MORE ACCURATE LESS


ACCURATE

Position Length Angle Direction Area Volume Saturation Hue


Visual Perception

Selection or Do changes in the visual cue allow you to


distinguish a point from others? Do changes in
Grouping
Visual the cue allow you to group data points?

Perception
Measurement Can you make a numeric observation from a
change?

Some visual cues, for example:

Position Length Direction


Ordering Does the visual cue have a perceived order?

are better at supporting:

Steps How many distinct “steps” can be perceived in


the cue?
Visual Perception

▪ Most quantitative
analysis can be
performed with charts
that use only four kinds
of objects.
Points Lines
▪ These objects (and their 2D position, 2D position,
subsequent related example: scatter example: line chart

charts) work because plot

we immediately and
more precisely perceive
both position and
length.

Bars/Columns Boxes
2D position + length, 2D position + length (unlike bars, show
example: bar chart distribution of an entire set of values)
example: bar chart
Know Your Purpose and Data
Know Your Data

▪ It is important to model your


data appropriately, before
you explore it, in order to be Numerical Data
able to answer your
business questions
correctly. Data types can be
used to model certain
characteristics of your data.
AND

Dimensions
(Categorical)
Measures
▪ Measures constitute numerical data
that are calculated or aggregated – like
the sum of Revenue, average Cost, or
Profit-per-capita or non-numeric data
that are counted. What do measures represent?
Measures can represent observations in your data or
calculated values.

Measures are objects that represent


calculations and aggregate functions that are
usually applied to numeric data. Aggregating
the object must make sense for the column to
How are they formatted?
be a measure.
Measures have an aggregation type associated with them.
Sales Revenue is a measure but summing up For example, if the chart includes Revenue by Country, and
sum is associated with Revenue, You can customize the
product list prices isn’t. That’s a dimension.
prefix or suffix to indicate data such as, units of measures,
You can create measures from categories by
like CAD, EUR, and USD.
counting their elements, for example, Number
of Countries visited by our Customers.
Dimensions
▪ Dimensions constitute
categorical data such as
year, product, country MEN’S WOMEN’S

and salary range. Categorical The dimension Product


Type may include the
(Also called “nominal”) values Men’s Clothing
▪ What do they represent? for discrete values. and Women’s Clothing.
VALUE VALUE

A dimension reflecting the


Ordinal outcome of a survey result
may include the values
The dimension
Agree, Neutral, Disagree
members have a set
that have an implicit order.
default order.
AGREE NEUTRAL DISAGREE

Interval The dimension Salary can


be categorized into the
Each value in the following salary ranges:
dimension represents a <$20k, $20 - 40k, $40 -
range of values. 80k, >$80k
<20k $20 to $40 to >$80k 40k
80k

13 • Chapter 2: Know Your Purpose/Data


Encoding Quantitative Versus Categorical Data
Craft Your Message
Craft Your Message

Keep these Then ask yourself:


questions in mind:
1.What questions do you want to answer with your
data?

1. 2.What kind of relationships exist in your data? What


are the best techniques for displaying these? Do you
What is the overall goal of your data
analysis? need a chart (overview), a table (details), or maybe
both, to convey your message?
3.Can you highlight specific data points to better get
2. your message across?
Who is this message intended for?
What do you know about your 4.How can you incorporate a summary of your
audience? message in your chart titles to emphasize on your
overall message?
Know Your Audience

“It is important to know ▪ Think about the story


you want to tell with
your audience’s your data. What
background and the insights do you want
to bring to light?
domain of your data.” Keeping your
audience in mind,
▪ you do not want to
include any
unnecessary noise to
obscure the meaning
of your message. It is
easy to misrepresent
data by choosing the
wrong visualization
type.

Every little detail matters in the


telling of your story and
connecting with your audience.
Storytelling Assets
Selecting the Right Visualizations

Change Over Comparison Ranking


Shows how a measur
Part-To-Whole
Shows the comparison of Shows the top or bottom N
changes over time, and allows categorical values, where the values to emphasize the largest, Shows how the categories
user to highlight temporal data does not have any intrinsic or smallest values contribute to the value
trends order, for example, a list of
products

Bar Chart: set to % scale


Line Chart: Highlights potential trends in data Bar Chart: used for comparing
categorical values
Bar Chart: shows categorical values in
Pie Chart: Compares percentage values
Bar Chart: Highlights comparison Trellis: uses multiple views to show decreasing or increasing order
between individual values different partitions of a dataset
Stacked Bar Chart: shows overall
measure total

Correlation Geographical
Distribution Overview
Shows, whether there is a Information and Maps
Shows how a measure is Shows the exact values in
potential correlation between Shows the geographical
spread across its domain table format
two measures distribution of measure values

Histogram: Column Chart showing the


Choropleth Chart: highlights geographical
count of binned measure values Scatter Plot: highlights potential data by colouring geographical areas
Box Plot: shows distributions for correlation of two measures according to their measure values
Table: highlights exact values
different categorical values Trellis: uses multiple views to show
Geo Bubble Chart: highlights
different partitions of a dataset
Heat and Tree Map: shows the geographical data by showing them
distribution of measure values as bubbles on a map
Change Over Time

LineandAreaC
hart Sales revenue in 2015 - 2017

Suggestions
The Line Chart displays measures over a time period. Line
Charts are used frequently to show trends and 1.
relationships between them. The Y-Axis always shows a Create a time hierarchy to
allow drilling up or down to
measure value, and the X-Axis denotes a time dimension
Days, Months, and Years
such as Month, Quarter, or Year.

2.
Used for Add a moving average line to
smooth the trend over time
•Trends The impact of different product lines by sales revenue
•Data over time in 2015 - 2017 3.
•Temporal patterns and correlation Add a forecast or linear
•Period-over-period regression to emphasize
current or future trends

4.
Consider an Area Chart for
showing cumulative totals
Change Over Time

ColumnLineC
hart Customer retention rate plays a role in increasing sales

The Column Line Chart is a combination of a Line Chart and


Column Chart. This chart type displays one measure as a Suggestions
column and a secondary measure as
a line. The two measures are displayed over a Time 1.
Dimension which may include Years, Quarters, or Months. Use this chart type to show
This chart is great for showing the relationship two trends of different types
between two measures over a period of time such as Gross (for example, Returning
Margin and Sales Revenue, or Net Income after Customers and Sold items)
Tax and Tax Rates. over time

Used for 2.
Other options for showing
•Trends change over time include
•Data over time Bar Charts or Tables.
•Temporal patterns and correlation
Year / Quarter /
Month
Comparison

BarChartandStacke
dBar Chart
Number of ships the company possesses based on entities they transport

Suggestions
1.
Use data labels to improve the
Bar Charts are probably the most frequently used
readability of data values.
chart type. Focus the attention of your audience to
important details by:
Quantity sold by Manager and Lines 2.
Customize hierarchies to
•Ranking data from largest to smallest or vice versa
allow drilling from a high-level
•Filtering out data that isn’t important for your message overview to more specific
details; users easily drill up
•Grouping data by combining values in a chart – if there
and down
are too many categories, you can group less relevant
categorical values together into an Other group (for
3.
example, “Other Clothing”)
Use Color to clearly
differentiate separate
Used for categorical values in
your dimension
Comparing different categorical values Which products are most likely to win deals?

25 • Chapter 4: Storytelling Assets


Comparison

Waterfal
Chart

Sales Revenue by Lines Suggestions


A waterfall chart is used to show the cumulative effect of
temporal (or other sequential) data. It is useful to visualize
1.
the fluctuation of a value in positive and negative values.
Show how you arrived at a
net value

Used for 2.
Break down the cumulative
•Cumulative effect
effect of positive and negative
•Deviations and differences contributions

3.
Visualize a starting
quantity
Comparison

TrelisLayoutofMultiple
Charts
Milk sold in glass containers generates the highest revenue

Organic Milk 2.00M

1.50M

$1,086,655.70

1.00M
The Trellis Layout, also known as Small Multiples,
$542,127,14
contains a set of charts based on the same set 500.00k
Suggestions
of data using the same axes to allow the viewer $39,054.26
0.00
categorical comparisons of different values within a Regular Milk 2.00M 1.
dimension. Used to compare values
$1,422,326.44
1.50M within
a category (such as Trellis by
1.00M
Used for Milk Type to show the Sales
500.00k $363,565.66 Values for each Milk Type in a
Comparison, identifying patterns across multiple $255,456.12

0.00
separate chart.
categorical values
Carton Glass Plastic
Ranking

BarCha
rt
Number of participants from top 10 countries in a design contest
The Ranking feature allows the user to sort and filter data
based on their importance. For example, we may want to
sort Countries based on their Number of Participants.
The Group by Selection functionality can be used in order
to group values.

Used for
Suggestions
Emphasizing on top or bottom values in a chart

1.
Often categorical values (in this
case Countries) that contribute
less to the overall measure
value might be filtered out or
grouped together in another
category.
Part-to-Whole

StackBarCh
art Percentage of items sold for each product line in 2015 - 2017
Suggestions
Used for
A Part-to-Whole relationship shows how measure values that 1.
make up the whole of something (for example, Number of You can use stacked or side-
containers sold) compare to one another and how they each by-side bars to compare
compare to the whole. different hierarchy levels
(Country Region) or
classifications (Men’s Clothing,
Women’s Clothing).

2.
You can use a 100% Stacked
Bar Chart (or Marimekko
Chart) to show the portion that
each segment makes up in a
category.

3.
In addition to Stacked Bar
Charts and Marimekko
Charts, other charts (such as
pie, ring, and funnel charts)
can be used to show Part-to-
Whole relationships.
Part-to-Whole

Pie,Ring,andFu
nnel Charts
The percentage of containers sold by container type
Suggestions
Pie, Ring (Donut), and Funnel Charts are used to discern
1.
part-to-whole comparisons to either highlight a portion of
Limit use of Pie Charts to a
the data or to compare values for different categorical
small number of slices (no
values. These chart types are generally
more than 5 slices). Do not
not recommended if they include too many segments,
use a pie chart if the slices are
as the viewer will have a difficult time differentiating
of similar size.
between too many different colors.

2.
Used for Consider showing data
labels
Comparing percentage values in proportion to the whole for ease of reading

3.
Highlight only the most
important slice if possible

4.
Compare with using a bar
chart or ring (donut) chart –
the viewer is more likely to
Distribution

HistogramandGrou
ping
Number of prescription drugs sold in a pharmacy
across different age groups

A Histogram is a type of Column Chart that shows the


distribution of measure values, for example, Number of
Items Sold. Instead of showing each measure value
directly, in a histogram, values are grouped first. For Suggestions
example, in Figure 9, instead of creating one column per
Age, we grouped the values first into the age ranges [11- 1.
20], [21-30]…...[80-90]. This allowed us to show the Create bins or ranges of
distribution of Number of Prescription Drugs Sold in an numbers to count the number
audience friendly way. of occurrences within your
data. This can be done either
in the Data view section
Used for using
Distribution of measure values, identifying data issues the Group by Range
including outliers functionality or in the Design
Age Group
view section using a
Calculated Dimension.
Distribution

BoxPl The distribution of dollars spent on our products by customers in


the Western United States

ot
A Box Plot visually displays statistical distribution of a
measure within a dataset. It is often used to also show the
range in values for each categorical value. The lines on the
box plot refer to the minimum, first quartile, median, third
quartile, and maximum range of variation. The dots on the
box plot are visual representations of the outliers. Suggestions
1.
Used for
Compare data distribution for
Comparison, distribution of values, identifying outliers several categorical values

2.
Show distribution of medians
in data

3.
Include a reference line for
the
overall median in your data
Distribution

HeatMapandTree
Map
Sales Revenue and Quantity sold by Country

In a Heat Map the categorical values are contained in a


matrix of tiles; based on a single measure, these tiles have
different shades. In contrast, in a Tree
Map, two measures are considered. Larger values are
represented by larger tiles and darker shades. Suggestions
1.
Used for Only use this if the resulting
Showing the distribution of measure values Heat Map shows visibly
different color intensities (it will
confuse the viewer if the heat
map segments are of similar
color intensities).
Correlation

ScatterP
lot Countries with higher household income have higher
population growth

Additional Chart Types Used For Showing


Correlation:

•The Scatter Matrix shows several Scatter Plots


Suggestions
in a Trellis layout in order to compare several
Scatter Plots in one chart. 1.
Use the color to show groups
•A Bubble Chart is similar to a Scatter Plot, but of points, but limit the number
allows visualization of a third measure. The size of colors used; too many
of the bubbles indicates this third measure. The colors or shapes will impact
larger the measure is, the larger the bubble. the readability of a chart

2.
Used for
Keep the aspect ratio square
Showing the correlation of two measures
3.
Create a Geo Hierarchy on
top
of location data (for example,

Population Change Percentage in the Last 10 Years


States, Cities) to enable
drilling up to higher levels of
geographical detail
Overview

Tabl
e
You can find the Table as one of the visualization types in
Suggestions
the Data view section.
Discount, Gross Margin and Sales Revenue by Lines
1.
Used for Best for showing exact
values
Show multiple measures in one or two categories or
2.
hierarchies
Often charts and Tables are
shown on the same page, as
they emphasize different
aspects

3.
Highlight key information
with the Conditional
Formatting feature

4.
Setting the correct precision
(number formatting) for
measures included in a
Table is paramount in order
to not overload the user
Geographical Information and Maps

Choropleth
Map
Unemployment percentage across different states in the Suggestions
United States in 2014

Unemployment
1.
Use the Choropleth Map for
A Choropleth Map uses differences in shading, coloring, or 0.13
locations of similar size, as
the placing of symbols within predefined regions to indicate 0.11
the size of the area coloured
measure values in those areas. 0.09 may overemphasize larger
0.07
areas (for example, Canada
Used for covers a much larger area
0.05
than Japan despite being
Supports location-based comparisons of standardized data 0.03
much smaller in terms of
such as Rates, Densities, or Percentages population)
2.
Make sure your measure
values are normalized by the
geographic properties, for
example, by the population of
a geographic area

3.
Remember that the
granularity
of your regions (counties, for
example) will impact the
signal (aggregated measure
values) from your data.
Geographical Information and Maps

GeoBubbleCh
art

The Geo Bubble Chart shows measure values in the form


of bubbles on a map. The larger the measure, the larger will
be the bubble on the map. Suggestions
1.
Used for Use to show values on a map
and to create an animation
Viewing a measure by Country, Region, or City; comparing
over time
measures across different geographic areas
2.
Use Geo Bubble or Pie
Charts
on maps to show measure
values if the relative size of
the
underlying regions cannot be
compared
Other Chart Types
Line Chart for Time Series Scatter Plot for Time Numeric Point
Series

42K
Show trends over time and forecast future
values based on historical data within the Use a scatterplot to find out whether there is a
chart. correlation between two specific measures.
Often used in Stories to reflect a important key
figure
Network Chart Radar Chart Parallel Coordinates

Used to show the relationship between the


categorical values of two dimensions, for example, Compares multiple measures across categorical values Used to compare multiple measures for
People and Locations a single category
Tag Cloud Tree Chart Funnel Chart

Visualization type used for text data. .


The size and color of the value reflects Often used to show the hierarchical relationship A funnel chart provides the percentage of
the between dimensions parts in a whole
measure values
Tips
A great visual design
standard will accelerate
understanding and
consumptions of your data.
It’s that simple. For your
business to reap the benefits 1. 2. 3.
Avoid decorative use of Avoid three-
of data visualizations, Less is more. Make every pixel
graphics.
and word count. dimensional
organizations need to create chart types.

a visual design standard that


incorporates best practices.
0 50 100 150

0 10% 20% 30%


0

4. 5. 6.
Use bullet graphs instead
Avoid pie charts. Start bar charts at zero. of gauges to save space.

Key Data

1991 1992 1993 1994

7. 8. 9.
Use sparklines to show trends Show time going from left to Use color only to highlight
on the X-axis. right on the X-axis. or accentuate meaning.
Tips

▪ Using Vivid and Subtle Colors Appropriately


Tip

▪ The Principle of Closure


Tips

▪ Reduce the non‐data pixels


▪ Enhance the data pixels.
Superset
Superset – What is?

▪ A modern data exploration and visualization web application.


▪ Superset’s main goal is to make it easy to slice, dice and visualize data.
▪ Developed by engineers at Airbnb now released under Apache license
2.0
▪ This project was originally named Panoramix, was renamed to Caravel in
March 2016, and is currently named Superset as of November 2016.
▪ Build over the Flask framework in Python.
▪ Works as a web app on all most used browsers so it does not require any
additional desktop installations.
▪ Easy to deploy on a Server and ability to handle multiple users with roles
and authentication.
Superset Advantages

▪ Features:
• A rich set of data visualizations
• Interactivity: You can create visualizations even without knowledge of SQL or
Python!
• An easy-to-use interface for exploring and visualizing data
• Create and share dashboards
• Enterprise-ready authentication with integration with major authentication
providers (database, OpenID, LDAP, OAuth & REMOTE_USER through Flask
AppBuilder)
• An extensible, high-granularity security/permission model allowing intricate
rules on who can access individual features and the dataset
• A simple semantic layer, allowing users to control how data sources are
displayed in the UI by defining which fields should show up in which drop-
down and which aggregation and function metrics are made available to the
user
• Integration with most SQL-speaking RDBMS through SQLAlchemy
• Deep integration with Druid.io(high performance real-time analytics database)
• Completely free, no user license or one time download fee
Superset – Data Sources

▪ Superset allows to connect following types of sources


▪ Database(using SQLAlchemy)
• PostgreSQL
• MySQL
• Oracle
• Microsoft SQL Server
• SQLite
▪ Files
▪ Tables
Superset – Connect Database(1/2)

▪ Enter Connection Details(Name, connection string)


▪ Test Connection and save
Superset – Connect Database(2/2)

▪ Required Packages
Superset – SQL Lab

▪ SQL Lab is a powerful SQL IDE that works with all SQLAlchemy
compatible databases.
▪ queries are executed in the scope of a web request.
▪ To enable support for long running queries that execute beyond the
typical web request’s timeout (30-60 seconds).
▪ In config.py file of superset set SQLLAB_TIMEOUT = 180
Superset – Creating chart/slice

▪ Superset provides wide range of chart


▪ Add Chart

▪ Configure Chart
Superset – Creating chart/slice

▪ Save Chart/Save and add into existing Dashboard/Save and add in new
Dashboard

▪ View Dashboard
Superset – Making your Dashboard Public

▪ To view your dashboard without login, please do following changes in


your superset roles settings.
▪ Provide all “can **” and “can view **” permissions to “Public” role.

You might also like