Python - Adv - 3 - Jupyter Notebook (Student)


8/22/22, 9:23 PM Python_Day6_MC - Jupyter Notebook

In [ ]: %matplotlib inline


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# At the time of creating this material, there was a versioning issue
# between seaborn and numpy that results in a FutureWarning. This does
# not affect the results and will presumably be fixed in some update cycle
# but creates an annoying warning message we don't want to see every time.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Visualization with Seaborn


Visualization with Seaborn
Introduction
Relationships Between Continuous Variables
Scatter plots
Line plots
Aggregating Data
Plotting Dates
Exercises
Relationships to Categorical Variables
Categorical Scatter Plots
Distribution Plots
Exercises
Element Ordering
Faceting
Under the Hood
Customizing Plots
Plot Text and Axis Labels
Axis Limits
Color
Themes
Saving Plots
Exercises

Introduction

localhost:8890/notebooks/2022/22Aug/PRJ63504 Capstone (Python)/CADS/Python for Analytics (Advanced)/Day 6/Python_Day6_MC.ipynb 1/18



The base library for visualization in Python is matplotlib . Nearly every other library for
visualizing data is built on top of it. However, despite being incredibly flexible and powerful,
matplotlib is difficult to use for data analysis. Instead of being developed with one single API
design, it has grown organically as every new update needed to ensure backwards compatibility
with old code (otherwise all libraries building on it would break until updated). This continuity is part
of what makes it so attractive and simultaneously complicated.

Furthermore, matplotlib is designed to visualize anything, not just data. Because we're most
interested in examining and presenting relationships between data, however, we will use a different
library, seaborn . This library is specifically designed for statistical data visualization and provides
a consistent and easy-to-use API.

Relationships Between Continuous Variables


Visualizing the relationship between continuous variables is as simple as plotting the values of
both variables for each data entry on the x- and y-axes of a plot.

Scatter plots

In [ ]: tips = pd.read_csv("../data/tips.csv")


tips.head()

In [ ]: sns.relplot(x="total_bill", y="tip", data=tips)

We may, of course, be interested in more than just the x- and y-values. We can use additional
arguments to relplot(...) to distinguish data points.

In [ ]: sns.relplot(x="total_bill", y="tip", hue="day", data=tips)

Points are now colored differently depending on the day of the week that the entry in the dataset
corresponds to. We can do the same for the size and style aesthetics as well.

In [ ]: sns.relplot(x="total_bill", y="tip", size="smoker", data=tips)

In [ ]: sns.relplot(x="total_bill", y="tip", style="day", data=tips)

The aesthetic mappings can be combined as desired to visualize up to 5 dimensions in our
datasets via the x , y , hue , style , and size arguments.

In [ ]: sns.relplot(x="total_bill", y="tip", hue="smoker", size="day", style="time",
            data=tips)

Be warned that this will make plots extremely difficult to parse visually.


The hue and size aesthetics have been categorical so far, meaning that distinct colors and
sizes were chosen for each possible, discrete value of the dataframe columns they were applied
to. They can also be applied to continuous, numerical variables. In this case, the color palette will
automatically be set to a gradient. We will see further on how to customize colors.

In [ ]: sns.relplot(x="total_bill", y="tip", hue="size", data=tips)

In [ ]: sns.relplot(x="total_bill", y="tip", size="tip", data=tips, kind="scatter")

Line plots
By default, seaborn will create a scatterplot. In the case of time series, we may be interested in
creating a line plot to better visualize trends. We can do this by simply adding a kind="line"
argument (by default, this argument is kind="scatter" ).

In [ ]: df = pd.DataFrame({
"time": np.arange(500),
"value": np.random.randn(500).cumsum()})

In [ ]: sns.relplot(x="time", y="value", kind="line", data=df)

By default, the dataframe will be sorted so that the x-values are in ascending order. This ensures
that the line plot looks like a timeseries plot. This can, however, be disabled by setting
sort=False . This could be useful, for example, if we are following the movement of an object or
tracking how two variables change simultaneously through time.

In [ ]: df = pd.DataFrame(np.random.randn(500, 2).cumsum(axis=0), columns=["x", "y"])

In [ ]: sns.relplot(x="x", y="y", sort=False, kind="line", data=df)

Line plots have the same aesthetic mapping possibilities as scatter plots, hue , size , and
style , and they can also be combined in the same way. Notice how multiple lines are created
and only points with the identical mapped aesthetics are connected. That means, if we create a
line plot that maps a variable to hue and to style , we will end up with an individual line for each
existing combination of variables in our data.


In [ ]: # One random walk per (region, division) combination.
# DataFrame.append was removed in pandas 2.0, so the pieces are
# combined with pd.concat instead.
combos = [("North", "A"), ("North", "B"), ("North", "C"),
          ("South", "A"), ("South", "B")]
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(500),
        "value": np.random.randn(500).cumsum(),
        "region": region, "division": division})
    for region, division in combos], ignore_index=True)

sns.relplot(
    x="time", y="value", kind="line", hue="region",
    style="division", data=df)

In [ ]: df.head()
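
To check in advance how many lines such a plot will contain, we can count the combinations of the mapped variables that actually occur in the data. The following is a small sketch on a made-up frame (the name demo and its values are illustrative; only the column names mirror the example above):

```python
import numpy as np
import pandas as pd

# Illustrative data: five (region, division) combinations, three points each
combos = [("North", "A"), ("North", "B"), ("North", "C"),
          ("South", "A"), ("South", "B")]
demo = pd.concat([
    pd.DataFrame({"time": np.arange(3), "value": np.zeros(3),
                  "region": region, "division": division})
    for region, division in combos], ignore_index=True)

# Each existing combination of the hue and style variables yields one line
n_lines = demo.groupby(["region", "division"]).ngroups
print(n_lines)
```

Combinations that never occur in the data, such as South/C here, do not produce a line.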

In [ ]: # Using size instead of style
sns.relplot(x="time", y="value", kind="line", hue="region", size="division",
            data=df)

If using the style parameter, we can also decide whether we want dashes, dots, or both.

In [ ]: # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([
    pd.DataFrame({
        "time": np.arange(20),
        "value": np.random.randn(20).cumsum(),
        "region": region})
    for region in ["North", "South"]], ignore_index=True)
sns.relplot(x="time", y="value", kind="line",
            style="region", markers=True, data=df)

In [ ]: sns.relplot(x="time", y="value", kind="line", style="region",
            dashes=False, data=df)

In [ ]: sns.relplot(x="time", y="value", kind="line", style="region",
            dashes=False, markers=True, data=df)

Aggregating Data


Often, we may have data with multiple measurements for the same data point, i.e. x-value. For
example, we might have several temperature sensors in a device as a failsafe. seaborn can
automatically aggregate y-values for identical x-values. By default, it plots the mean and the 95%
confidence interval around this mean in either direction.

In [ ]: fmri = pd.read_csv("../data/fmri.csv")


fmri.head()

In [ ]: fmri.loc[(fmri["timepoint"] == 18)].head()

In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", data=fmri)

Because seaborn uses bootstrapping to compute the confidence intervals and this is a
time-consuming process, it may be better to either switch to the standard deviation ( ci="sd" ) or turn
this off entirely and only plot the mean ( ci=None ). In seaborn 0.12 and later, ci is deprecated in
favor of the errorbar parameter, e.g. errorbar="sd" or errorbar=None .

In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=fmri)

In [ ]: sns.relplot(x="timepoint", y="signal", kind="line", ci=None, data=fmri)
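
To make the idea of a bootstrapped confidence interval concrete, here is a minimal sketch in plain numpy and pandas. It is not seaborn's actual implementation: the helper bootstrap_ci , the frame demo , and its values are made up for illustration, and a simple percentile bootstrap of the mean is assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative data: 3 timepoints with 50 repeated measurements each
demo = pd.DataFrame({
    "timepoint": np.repeat([0, 1, 2], 50),
    "signal": rng.normal(loc=np.repeat([0.0, 1.0, 0.5], 50), scale=0.3),
})

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of `values` (hypothetical helper)."""
    values = np.asarray(values)
    # Resample with replacement and record the mean of every resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lower = (100 - level) / 2
    return np.percentile(boot_means, [lower, 100 - lower])

rows = []
for timepoint, group in demo.groupby("timepoint"):
    low, high = bootstrap_ci(group["signal"])
    rows.append({"timepoint": timepoint,
                 "mean": group["signal"].mean(),
                 "std": group["signal"].std(),
                 "ci_low": low, "ci_high": high})
summary = pd.DataFrame(rows)
print(summary)
```

The resampling step is what makes the confidence interval expensive: every x-value requires n_boot resamples, whereas the standard deviation is a single pass over the data.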

We can also change our estimator to any aggregation function, such as np.median(...) ,
np.sum(...) , or even np.max(...) . If we want to turn off aggregation then we just set
estimator=None . Note that this plots every individual measurement, so the line jumps between
the repeated y-values at each x-value and can look strange.

In [ ]: sns.relplot(x="timepoint", y="signal", kind="line",
            estimator=np.median, data=fmri)

In [ ]: sns.relplot(x="timepoint", y="signal", kind="line",
            estimator=None, data=fmri)
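
What the estimator does can be reproduced with pandas: group by the x-variable and reduce the y-values of each group to a single number. A small sketch with made-up values (the frame demo is illustrative):

```python
import pandas as pd

# Three repeated measurements for each of two timepoints
demo = pd.DataFrame({
    "timepoint": [0, 0, 0, 1, 1, 1],
    "signal": [1.0, 2.0, 9.0, 2.0, 4.0, 6.0],
})

# One row per x-value, one column per candidate estimator
aggregated = demo.groupby("timepoint")["signal"].agg(["mean", "median", "max"])
print(aggregated)
```

Each column corresponds to the line that would be drawn for that choice of estimator. Note how the single large value at timepoint 0 pulls the mean well above the median.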

Plotting Dates

Because they're so ubiquitous, seaborn natively supports the date format and will automatically
format plots accordingly.

In [ ]: pd.date_range("2017-1-1", periods=5)

In [ ]: # ISO dates (year-month-day) avoid any ambiguity between day and month
pd.date_range("2017-01-01", "2017-03-22")

In [ ]: df = pd.DataFrame({
"time": pd.date_range("2017-1-1", periods=500),
"value": np.random.randn(500).cumsum()})
df.head()


In [ ]: g = sns.relplot(x="time", y="value", kind="line", data=df)


g.fig.autofmt_xdate()

Exercises
1. Load the iris.csv dataset and create a scatter plot relating the petal length to the petal
width.

In [ ]: ###

In [ ]: # MC
iris = pd.read_csv("../data/iris.csv")
sns.relplot(x="petal_length", y="petal_width", data=iris)

2. Load the diamonds.csv dataset. Plot the carats versus the price again, but this time make
sure that points are colored based on the cut.

In [ ]: ###

In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.relplot(data=diamonds, x="carat", y="price", hue="cut")

3. Load the mpg.csv dataset and create a line plot relating the mean mpg to the model_year .
Make sure each country of origin is shown in a separate line style.

In [ ]: ###

In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
sns.relplot(data=mpg, x="model_year", y="mpg",
kind="line", style="origin")

4. This time, use pandas to find the mean mpg value for each model_year and each country
of origin . Create a line plot relating the mean mpg to the model_year with one line for
each country of origin , as above.

Hint: Remember groupby ? Remember how we can use it for multiple columns
simultaneously?

Note: seaborn cannot use the index, even if it is named. You must use *.reset_index()
to ensure that the columns you grouped by are columns in the new data frame


In [ ]: ###

In [ ]: # MC
mpg = pd.read_csv("../data/mpg.csv")
mpg_mean = mpg.groupby(["model_year", "origin"])["mpg"].mean()
mpg_mean = mpg_mean.reset_index()
sns.relplot(data=mpg_mean, x="model_year", y="mpg", kind="line",
style="origin")
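
The effect of reset_index() is easiest to see on a tiny frame. In this sketch the values are made up and only the column names follow the mpg dataset:

```python
import pandas as pd

demo = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "origin": ["usa", "usa", "japan", "usa", "japan"],
    "mpg": [18.0, 16.0, 24.0, 20.0, 30.0],
})

grouped = demo.groupby(["model_year", "origin"])["mpg"].mean()
print(grouped.index.names)   # the grouping keys live in a MultiIndex

flat = grouped.reset_index()
print(list(flat.columns))    # now they are ordinary columns again
```

After the reset, model_year and origin can be passed to x=... and style=... by name like any other column.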

5. Consider the following (fake) stock data. Create a line plot from this data with one line for each
stock symbol and format the x-axis as a date.

In [ ]: ###
np.random.seed(101)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": symbol})
    for symbol in ["TRDS", "RISL"]], ignore_index=True)
stock_data.head()

In [ ]: # MC
np.random.seed(101)
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
stock_data = pd.concat([
    pd.DataFrame({
        "time": pd.date_range("2017-1-1", periods=500),
        "value": np.random.randn(500).cumsum(),
        "symbol": symbol})
    for symbol in ["TRDS", "RISL"]], ignore_index=True)

stock_fig = sns.relplot(data=stock_data, x="time", y="value",
                        hue="symbol", kind="line")
stock_fig.fig.autofmt_xdate()

Relationships to Categorical Variables


We've already seen how we can show dependence on categorical variables with the various
aesthetics in the previous section ( hue , size , and style ). Often, we may not have two
continuous variables to relate to each other, though. For this, we use the seaborn function
catplot(...) which can create multiple kinds of categorical plots.

Categorical Scatter Plots


The simplest way to represent the relationship between continuous and categorical data is with a
categorical scatter plot that represents the distribution of (continuous) values for each category.
For this, we can make use of the default value kind="strip" .

In [ ]: tips = pd.read_csv("../data/tips.csv")


tips.head()

In [ ]: sns.catplot(x="day", y="total_bill", data=tips)

seaborn automatically adds jitter to the points to reduce their overlap. We can adjust this jitter by
passing a value between 0 and 1 (exclusive) or eliminate this jitter entirely by passing a boolean
False . Note that a value of 1 is interpreted as True and the default jitter width is used!

In [ ]: sns.catplot(x="day", y="total_bill", jitter=False, data=tips)

In [ ]: # When a number is passed, this corresponds to a relative width
# jitter=0.5 will typically mean that the "point columns" touch.
sns.catplot(x="day", y="total_bill", jitter=0.3, data=tips)
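
Conceptually, jitter is just a small random sideways shift of each point around the position of its category. The following sketch illustrates the idea with numpy; it assumes a uniform shift and is not meant to reproduce seaborn's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

categories = np.array(["Thur", "Fri", "Sat", "Sun"])
# Integer position of each category on the x-axis: 0, 1, 2, 3
positions = np.arange(len(categories))

# 100 points per category, all sitting exactly on their category position
x = np.repeat(positions, 100).astype(float)

jitter = 0.3
# Shift every point sideways by a random offset in [-jitter, +jitter)
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)

print(x_jittered.min(), x_jittered.max())
```

Because neighbouring categories are one unit apart, an offset of up to 0.5 in either direction is where adjacent point columns start to touch.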

We can also prevent point overlap entirely by using a swarm plot. This will create a useful visual
approximation of the distribution of the values.

In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)

Categorical plots only support the hue aesthetic, not the style or size aesthetics.

In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", hue="sex", data=tips)

seaborn will make assumptions on the nature of your data. For example, if you pass two
continuous, numerical variables to catplot(...) , it will try to treat the x-axis as a categorical
variable.

In [ ]: sns.catplot(x="size", y="total_bill", kind="swarm", data=tips)

In [ ]: sns.catplot(x="total_bill", y="size", kind="swarm", data=tips)

Notice that seaborn breaks if we attempt to place the pseudo-categorical variable onto the
y-axis. We can, however, invert our axes if one of the variables is truly categorical, i.e. not
numerical.

In [ ]: sns.catplot(x="day", y="total_bill", kind="swarm", data=tips)

In [ ]: sns.catplot(x="total_bill", y="day", kind="swarm", data=tips)


Distribution Plots
Swarm plots are good for approximating distributions, but we often want to have an exact
description of the data distribution. For this, we can use box plots and variants thereof.

In [ ]: sns.catplot(x="day", y="total_bill", kind="box", data=tips)

Boxplots encode valuable information about our distribution. For each subset of the data, i.e. each
box, the following pieces of information are shown:

The central line of each box represents the median value.
The top and bottom of each box are the 3rd and 1st quartile, respectively.
    This means that 25% of all values are below the bottom line and 25% are above the top
    line, i.e. 50% of all values are within the colored region.
The whiskers denote the outlier limits. By default, they extend to the most extreme values that
    lie within 1.5 times the box height (the interquartile range) of the box edges. Any value
    between the whiskers is considered "normal".
The points outside of the whiskers are outliers that may require special attention.
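
The numbers behind a single box can be computed by hand. This sketch uses numpy on a small made-up sample and assumes the common convention of placing the outlier limits 1.5 interquartile ranges beyond the box edges:

```python
import numpy as np

values = np.array([3.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 30.0])

# Box: 1st quartile, median, 3rd quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Outlier limits: 1.5 * IQR beyond the box edges
low_limit = q1 - 1.5 * iqr
high_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the limits
inside = values[(values >= low_limit) & (values <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the limits is drawn as an individual outlier point
outliers = values[(values < low_limit) | (values > high_limit)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the box spans 8 to 12 with the median at 10, the whiskers reach from 3 to 13, and the value 30 is the only outlier.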

The hue argument can be used to show additional, nested relationships

In [ ]: sns.catplot(x="day", y="total_bill", kind="box", hue="sex", data=tips)

Note that hue assumes a categorical variable when used on catplot(...) and seaborn will
therefore automatically convert numerical variables into categorical ones.

In [ ]: sns.catplot(x="day", y="total_bill", kind="box", hue="size", data=tips)

When quantiles aren't enough, seaborn can also display a violin plot. This kind of plot estimates
a density and plots it as a distribution

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", data=tips)

If a variable has only two possible values and is mapped to the hue aesthetic, then split=True
can be used to combine the two density estimates to compare them more easily.

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            hue="sex", split=True, data=tips)

Violin plots estimate the density. This kernel density estimator (KDE) requires a parameter, called
bandwidth, that determines how smooth or how detailed the density plot will be. Violin plots can
therefore be more difficult to interpret than box plots and are potentially misleading.

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", bw=0.1, data=tips)

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin", bw=5, data=tips)
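
The influence of the bandwidth can be reproduced with a hand-written Gaussian KDE. In this sketch the helpers kde_sketch and count_peaks are hypothetical names, the data is made up, and the bandwidth is used directly as the width of each Gaussian bump. seaborn's bw is instead a scale factor relative to the spread of the data, so the numbers are not directly comparable.

```python
import numpy as np

def kde_sketch(data, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

def count_peaks(density):
    """Number of strict local maxima in a sampled curve."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

# Two clusters of three points each
data = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
grid = np.linspace(-10, 20, 3001)

narrow = kde_sketch(data, grid, bandwidth=0.1)
wide = kde_sketch(data, grid, bandwidth=5.0)

print(count_peaks(narrow), count_peaks(wide))
```

With the narrow bandwidth, every single observation produces its own spike. With the wide bandwidth, even the two clusters are smeared into a single hump. Neither curve describes the data well, which is why the bandwidth should be chosen and reported with care.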

Violin plots automatically show the corresponding box plot stats inside. We can change this to
show sticks , points , or nothing at all.

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner="stick", data=tips)

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner="points", data=tips)

In [ ]: sns.catplot(x="day", y="total_bill", kind="violin",
            inner=None, data=tips)

Like with line plots, we may be interested in summary statistics over our data. For this, we can use
a bar plot. seaborn will compute a summary statistic, such as the mean, as well as confidence
intervals for each individual category (denoted by the x-axis).

In [ ]: titanic = pd.read_csv("../data/titanic.csv")


titanic.head()

In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic)

In [ ]: # Compute the total number of survivors for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
estimator=np.sum, data=titanic)

If we're just interested in counting the number of occurrences of a single variable, we can use
kind="count" .

In [ ]: # Count the number of passengers by sex and class


sns.catplot(x="sex", hue="class", kind="count", data=titanic)
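
The bar heights in the three plots above correspond to simple group-wise aggregations. A sketch with pandas on a made-up miniature passenger list (the frame demo and its values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "class": ["First", "First", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0],
})

groups = demo.groupby(["sex", "class"])["survived"]
summary = pd.DataFrame({
    "rate": groups.mean(),        # bar height with the default estimator
    "survivors": groups.sum(),    # bar height with estimator=np.sum
    "passengers": groups.size(),  # bar height with kind="count"
})
print(summary)
```

Unlike the plots, this table carries no confidence intervals. It only shows which number each bar represents.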

An alternative to a barplot is a "point plot", which connects groups. This can be used to track
pseudo-timeseries data that may only have a few categorical time points, e.g. sales data for 5
years. Notice how it connects data subgroups with the same value of the variable mapped to the
hue aesthetic ( sex ).

In [ ]: sns.catplot(x="class", y="survived", hue="sex", kind="point", data=titanic)

As before, we can also change the estimator and confidence interval method for point plots.

In [ ]: sns.catplot(x="class", y="survived", hue="sex", kind="point",
            estimator=np.mean, ci="sd", data=titanic)

Exercises


1. Load the diamonds.csv dataset and create a categorical scatter plot that relates the price to
the cut

In [ ]: ###

In [ ]: # MC
diamonds = pd.read_csv("../data/diamonds.csv")
sns.catplot(data=diamonds, x="cut", y="price")

2. Change the jitter width of the previous plot so that the dot-columns are touching.

In [ ]: ###

In [ ]: # MC
sns.catplot(data=diamonds, x="cut", y="price", jitter=0.5)

3. This time, create a box plot that relates the carats to the clarity

In [ ]: ###

In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", y="carat", kind="box")

4. Create a subset of the diamonds data consisting of only diamonds with colors of "J" (worst)
and "D" (best) and only with clarity "IF".

Hint: We can combine boolean masks for Pandas like so: diamonds.loc[(condition1) &
(condition2)]

Create a violin plot relating the price to the clarity and map the color to the hue aesthetic.
Make sure the density estimates for each color are combined in each violin.

In [ ]: ###

In [ ]: # MC
diamonds_jd = diamonds.loc[
diamonds["color"].isin(("J", "D")) &
(diamonds["clarity"] == "IF")]

sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin")

5. Play with the bandwidth parameter ( bw ) for the previous plot. How can you interpret the plot
for bw=0.01 , bw=0.1 , and bw=1 ?


In [ ]: ###

In [ ]: # MC
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=0.01)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=0.1)
sns.catplot(data=diamonds_jd, x="clarity", y="price",
hue="color", split=True, kind="violin", bw=1)

6. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each cut.

In [ ]: ###

In [ ]: # MC
sns.catplot(data=diamonds, x="cut", kind="count")

Element Ordering
All of the above plots allow us to customize the order of elements, both on the axes as well as for
the aesthetics. Naturally, the functions will only enable ordering aesthetics that are supported, e.g.
catplot(...) has no size_order or style_order arguments and relplot(...) has no
order argument as both axes depict continuous values.

In [ ]: # Compute the mean survival rate for each sex and class as well as confidence intervals
sns.catplot(x="sex", y="survived", hue="class", kind="bar",
order=["female", "male"], data=titanic)

In [ ]: sns.catplot(x="sex", y="survived", hue="class", kind="bar",
            order=["female", "male"], data=titanic,
            hue_order=["First", "Second", "Third"])

In [ ]: sns.relplot(
x="total_bill", y="tip", hue="smoker", size="day", style="time",
style_order=["Lunch", "Dinner"],
size_order=["Thur", "Fri", "Sat", "Sun"],
hue_order=["Yes", "No"], data=tips)

Faceting
We can also instruct the functions relplot(...) and catplot(...) to create multiple plots if
we simply have too much detail to show in one. The parameters col=... and row=... let us
further split apart the data and show subsets in individual plots.


In [ ]: titanic.head()

In [ ]: # Compute the mean survival rate for each sex and class, one plot per embarkation town
sns.catplot(x="sex", y="survived", hue="class",
kind="bar", col="embark_town", data=titanic)

In [ ]: tips.head()

In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex",
            row="day", col="smoker", data=tips)

In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex",
            row="day", col="smoker", data=tips,
            row_order=["Thur", "Fri", "Sat", "Sun"])

Under the Hood


seaborn is a high-level interface for matplotlib . The two functions introduced here call other,
intermediate functions, which in turn call matplotlib functions.

relplot(kind=...)
    scatter: scatterplot() --> Calls matplotlib.pyplot.scatter()
    line: lineplot() --> Calls matplotlib.pyplot.plot()
catplot(kind=...)
    strip: stripplot() --> Calls multiple matplotlib functions
    swarm: swarmplot() --> Calls multiple matplotlib functions
    box: boxplot() --> Calls matplotlib.pyplot.boxplot()
    violin: violinplot() --> Calls multiple matplotlib functions
    bar: barplot() --> Calls matplotlib.pyplot.bar()
    count: countplot() --> Calls matplotlib.pyplot.bar()
    point: pointplot() --> Calls multiple matplotlib functions

seaborn is essentially a "convenience" to make matplotlib more accessible.
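
The two layers can be told apart by what the functions return. The following sketch uses a tiny made-up frame and renders off-screen (the Agg backend is an assumption made so that the example also runs outside of Jupyter):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs in a script

import matplotlib.axes
import pandas as pd
import seaborn as sns

demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})

# Figure-level function: returns a seaborn FacetGrid that manages a whole figure
grid = sns.relplot(x="x", y="y", kind="scatter", data=demo)
print(type(grid).__name__)

# Intermediate function: draws onto and returns a single matplotlib Axes
ax = sns.scatterplot(x="x", y="y", data=demo)
print(isinstance(ax, matplotlib.axes.Axes))
```

This is why the customization steps further on go through attributes of the returned grid object rather than through the plotting call itself.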

Customizing Plots

Plot Text and Axis Labels


Customizing the text of axis labels is unfortunately not as intuitive as building the plots. This is
because seaborn builds heavily on matplotlib but attempts to reduce the fine granularity of
building a plot with the latter. For example, to create the faceted plots above using matplotlib ,
we would have to subset the data into all possible variants, build each individual plot, arrange them
in a grid, and then add the legend and axis titles. seaborn makes this step somewhat easier, but
cannot get around this granularity when it comes to customizing plots.
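
For comparison, here is roughly what the manual route looks like. This is only a sketch of the steps described above, using plain matplotlib on a small made-up frame with the same column names as the tips data. The legend is left out to keep it short.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs in a script

import matplotlib.pyplot as plt
import pandas as pd

demo = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 15.0, 25.0, 35.0, 12.0, 22.0],
    "tip": [1.5, 3.0, 4.5, 2.0, 4.0, 5.0, 2.0, 3.5],
    "day": ["Sat", "Sat", "Sun", "Sun", "Sat", "Sun", "Sat", "Sun"],
    "smoker": ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],
})

days = sorted(demo["day"].unique())
smokers = sorted(demo["smoker"].unique())

# One subplot per (row, column) combination, sharing both axes
fig, axes = plt.subplots(len(days), len(smokers), sharex=True, sharey=True,
                         squeeze=False)
for i, day in enumerate(days):
    for j, smoker in enumerate(smokers):
        # Subset the data for this facet and draw it into its own Axes
        subset = demo[(demo["day"] == day) & (demo["smoker"] == smoker)]
        ax = axes[i, j]
        ax.scatter(subset["total_bill"], subset["tip"])
        ax.set_title(f"day = {day} | smoker = {smoker}")
        # Only label the outer edge of the grid
        if i == len(days) - 1:
            ax.set_xlabel("total_bill")
        if j == 0:
            ax.set_ylabel("tip")
fig.tight_layout()
```

The single call relplot(..., row="day", col="smoker") replaces all of this bookkeeping.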


Title

In [ ]: # Look at documentation for more info


myFigure = sns.catplot(
x="sex", y="survived", hue="class", kind="bar",
hue_order=["First", "Second", "Third"], data=titanic)
myFigure.fig.suptitle("Titanic Survivors", fontsize=15)

Legend

In [ ]: myFigure._legend.set_title("Passenger Class")


myFigure.fig

Legend labels are stored as Text(...) elements

In [ ]: myFigure._legend.texts

We can change these by calling *.set_text(...) on each of them

In [ ]: myFigure._legend.texts[0].set_text("1st")
myFigure._legend.texts[1].set_text("2nd")
myFigure._legend.texts[2].set_text("3rd")
myFigure._legend.texts

In [ ]: myFigure.fig

Axis Labels

In [ ]: myFigure.set_axis_labels(x_var="Passenger Sex", y_var="Survival Rate")


myFigure.fig

We can set the value of categorical tick labels as follows:

In [ ]: myFigure.set_xticklabels(labels=["Apples", "Oranges"])


myFigure.fig

Rotate Tick Labels

In some cases, tick labels may be too dense and must be rotated

In [ ]: myFigure.set_yticklabels(rotation=30)
myFigure.fig

Axis Limits
We use matplotlib to set our axis limits

In [ ]: sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)


plt.xlim(20, 40)
plt.ylim(2, 8)

Color
There are far more methods of creating and choosing color palettes in seaborn than could
possibly be shown here.

In general, we set the colors of our plot with the parameter palette=... . The simplest way to do
this is to pass a list with one color per level of the hue variable (here the passenger class) or a
dictionary relating the level names to colors. The colors can be given either as a string (insofar as
the color name is known to seaborn), in hexadecimal format indicating the color channel intensities
( #RRGGBB ), or as a tuple/list with 3 values indicating the color mixing ( [r, g, b] , values
should be between 0 and 1).

In [ ]: sns.catplot(
x="sex", y="survived", hue="class", kind="bar",
hue_order=["First", "Second", "Third"],
palette=["red", "#00FF00", (0, 1, 1)],
data=titanic)
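
All three notations describe the same kind of object, an RGB triple. The sketch below checks this with matplotlib's to_rgb(...) and builds the dictionary form of the palette. The name class_palette is illustrative.

```python
from matplotlib.colors import to_rgb

# The three color notations from the cell above describe ordinary RGB triples
print(to_rgb("red"))        # named color
print(to_rgb("#00FF00"))    # hexadecimal #RRGGBB
print(to_rgb((0, 1, 1)))    # tuple with values between 0 and 1

# A dictionary ties each level of the hue variable to a fixed color,
# independent of the order in which the levels are drawn
class_palette = {"First": "red", "Second": "#00FF00", "Third": (0, 1, 1)}
rgb_palette = {level: to_rgb(color) for level, color in class_palette.items()}
print(rgb_palette)
```

Passing palette=class_palette to catplot(...) would then give the same colors as the list above, without having to rely on hue_order .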

The xkcd color survey provides a set of 954 named colors (https://xkcd.com/color/rgb/). These
names can be used as color strings with the prefix xkcd: , e.g. "xkcd:sky blue" .

This becomes tiresome for many categories so seaborn offers several functions to generate
color palettes automatically. Some of these include:

sns.cubehelix_palette(...)
sns.diverging_palette(...)
sns.dark_palette(...)
Any of the ColorBrewer (http://colorbrewer2.org) presets
... and many more

Themes
Beyond color, seaborn also has support for themes. There are five built-in seaborn themes:
darkgrid , whitegrid , dark , white , and ticks . They can be invoked with
sns.set_style(...)

In [ ]: sns.set_style("dark")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)

In [ ]: sns.set_style("ticks")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)


We can edit these styles to our liking. Note that the floating point numbers are actually strings!

In [ ]: # See current style details


sns.set_style("ticks")
sns.axes_style()

In [ ]: # Overwrite styles


sns.set_style("ticks", {"text.color": '1'})
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)

Lastly, we can also use sns.set(...) to tweak our plots, such as font size scaling.

In [ ]: # Set the font scaling (1 is the default size)
sns.set(font_scale=1)
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)

Saving Plots
Typically, an analysis pipeline won't run in Jupyter, or any other interactive environment, but as a
script that generates a report. We can use seaborn to this end by saving our plots.

In [ ]: myFigure = sns.catplot(
    x="sex", y="survived", hue="class", kind="bar",
    hue_order=["First", "Second", "Third"],
    order=["male", "female"], data=titanic, ci=None)
myFigure.fig.suptitle("Titanic Survivors", fontsize=20)
myFigure._legend.set_title("Legend Title")
myFigure._legend.texts[0].set_text("1st")
myFigure._legend.texts[1].set_text("2nd")
myFigure._legend.texts[2].set_text("3rd")
myFigure.set_axis_labels(x_var="Passenger Sex", y_var="Survival Rate")
myFigure.ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, y: "{:.0f}%".fo
myFigure.set_xticklabels(labels=["Male", "Female"])
myFigure.set_xticklabels(rotation=30)

# Save the plot in all its glory
myFigure.savefig("output.png")

seaborn supports saving both in bitmap format, e.g. PNG, as well as in vector format, e.g. PDF.
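
The output format is chosen by the file extension. The sketch below demonstrates this with a plain matplotlib figure, on the assumption that the savefig(...) of a seaborn grid hands the file name on to matplotlib in the same way. The temporary directory is only used to keep the example from cluttering the working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # no display needed when running as a script

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "output.png")
pdf_path = os.path.join(out_dir, "output.pdf")

# The file extension selects the format; dpi only matters for bitmap output
fig.savefig(png_path, dpi=150)
fig.savefig(pdf_path)
plt.close(fig)

print(os.path.getsize(png_path), os.path.getsize(pdf_path))
```

A bitmap stores a fixed grid of pixels, so its quality depends on the chosen dpi, whereas a vector file stores the shapes themselves and stays sharp at any zoom level.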

Exercises

1. Using the full diamond dataset again, use a bar plot to determine how many diamonds there
are of each clarity. Create facets for the cut (columns) and color (rows).

In [ ]: ###


In [ ]: # MC
sns.catplot(data=diamonds, x="clarity", kind="count",
col="cut", row="color")

2. Create a box plot that relates the carats to the clarity and place the boxes in the correct order
(I1 , SI2, SI1, VS2, VS1, VVS2, VVS1, IF)

In [ ]: ###

In [ ]: # MC
sns.catplot(
data=diamonds, x="clarity", y="carat", kind="box",
order=("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))

3. Plot the relationship between the x and y columns of the diamonds dataframe. Limit the x-
axis to the interval [3, 11] and the y-axis to the interval [0, 15] to remove outliers

In [ ]: ###

In [ ]: # MC
myFig = sns.relplot(data=diamonds, x="x", y="y")
plt.xlim(3, 11)
plt.ylim(0, 15)

4. Load the exercise.csv dataset.


A. Plot the relationship between the pulse and diet as a boxplot.
B. Map the kind of exercise to the hue
C. Facet the data into columns so that we have one plot for each timepoint

In [ ]: ###

In [ ]: # MC
exercise = pd.read_csv("../data/exercise.csv")
efig = sns.catplot(data=exercise, x="diet", y="pulse",
kind="box", hue="kind", col="time")
efig.set_axis_labels(x_var="Diet", y_var="Pulse")

Open Exercise

5. Using the dataset tips , plot any relationship between variables you find worth investigating
and make the figure "presentation-ready". That means:

Use aesthetics ( hue , style , size ) and facets where appropriate.


Create a figure title and label the axes and legend.
Format tick marks if necessary.
Edit tick labels and legend entries if necessary.

Find a visually appealing color palette.


Choose one of the base themes and adjust its options until they are to
your liking.

Save your plot as "output.png" and "output.pdf" and compare the two images. What happens
to them when you zoom in very close?

In [ ]: ###

In [ ]: # MC
