Download as pdf or txt
Download as pdf or txt
You are on page 1of 39

A population is the entire group that you want to draw conclusions

about. A sample is the specific group that you will collect data from.
The size of the sample is always less than the total size of the
population.

Interpreting Data Using


Descriptive Statistics with
Python
Introduction
Descriptive Statistics is the building block of data science. Advanced
analytics is often incomplete without analyzing descriptive statistics of the
key metrics. In simple terms, descriptive statistics can be defined as the
measures that summarize a given data, and these measures can be
broken down further into the measures of central tendency and the
measures of dispersion.
Measures of central tendency include mean, median, and the mode, while
the measures of variability include standard deviation, variance, and the
interquartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.
We will cover the topics given below:

1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
We will begin by loading the dataset to be used in this guide.

Data
In this guide, we will be using fictitious data of loan applicants containing
600 observations and 10 variables, as described below:

1. Marital_status: Whether the applicant is married ("Yes") or not ("No").


2. Dependents: Number of dependents of the applicant.
3. Is_graduate: Whether the applicant is a graduate ("Yes") or not
("No").
4. Income: Annual Income of the applicant (in USD).
5. Loan_amount: Loan amount (in USD) for which the application was
submitted.
6. Term_months: Tenure of the loan (in months).
7. Credit_score: Whether the applicant's credit score was good
("Satisfactory") or not ("Not_satisfactory").
8. Age: The applicant’s age in years.
9. Sex: Whether the applicant is female (F) or male (M).
10. approval_status: Whether the loan application was approved
("Yes") or not ("No").

Let's start by loading the required libraries and the data.


1import pandas as pd
2import numpy as np
3import statistics as st
4
5# Load the data
6df = pd.read_csv("data_desc.csv")
7print(df.shape)
8print(df.info())
python
Output:
1 (600, 10)
2
3 <class 'pandas.core.frame.DataFrame'>
4 RangeIndex: 600 entries, 0 to 599
5 Data columns (total 10 columns):
6 Marital_status 600 non-null object
7 Dependents 600 non-null int64
8 Is_graduate 600 non-null object
9 Income 600 non-null int64
10 Loan_amount 600 non-null int64
11 Term_months 600 non-null int64
12 Credit_score 600 non-null object
13 approval_status 600 non-null object
14 Age 600 non-null int64
15 Sex 600 non-null object
16 dtypes: int64(5), object(5)
17 memory usage: 47.0+ KB
18 None
Five of the variables are categorical (labelled as 'object') while the
remaining five are numerical (labelled as 'int').

Measures of Central Tendency


Measures of central tendency describe the center of the data, and are often
represented by the mean, the median, and the mode.

Mean
Mean represents the arithmetic average of the data. The line of code below
prints the mean of the numerical variables in the data. From the output, we
can infer that the average age of the applicant is 49 years, the average
annual income is USD 705,541, and the average tenure of loans is 183
months. The command df.mean(axis = 0) will also give the same output.
1df.mean()
python
Output:
1 Dependents 0.748333
2 Income 705541.333333
3 Loan_amount 323793.666667
4 Term_months 183.350000
5 Age 49.450000
6 dtype: float64
It is also possible to calculate the mean of a particular variable in a data, as
shown below, where we calculate the mean of the variables 'Age' and
'Income'.
1print(df.loc[:,'Age'].mean())
2print(df.loc[:,'Income'].mean())
python
Output:
1 49.45
2 705541.33
In the previous sections, we computed the column-wise mean. It is also
possible to calculate the mean of the rows by specifying the (axis =
1) argument. The code below calculates the mean of the first five rows.
1df.mean(axis = 1)[0:5]
python
Output:
1 0 70096.0
2 1 161274.0
3 2 125113.4
4 3 119853.8
5 4 120653.8
6 dtype: float64

Median
In simple terms, median represents the 50th percentile, or the middle value
of the data, that separates the distribution into two halves. The line of code
below prints the median of the numerical variables in the data. The
command df.median(axis = 0) will also give the same output.
1df.median()
python
Output:
1 Dependents 0.0
2 Income 508350.0
3 Loan_amount 76000.0
4 Term_months 192.0
5 Age 51.0
6 dtype: float64
From the output, we can infer that the median age of the applicants is 51
years, the median annual income is USD 508,350, and the median tenure
of loans is 192 months. There is a difference between the mean and the
median values of these variables, which is because of the distribution of the
data. We will learn more about this in the subsequent sections.
It is also possible to calculate the median of a particular variable in a data,
as shown in the first two lines of code below. We can also calculate the
median of the rows by specifying the (axis = 1) argument. The third
line below calculates the median of the first five rows.
1#to calculate a median of a particular column
2print(df.loc[:,'Age'].median())
3print(df.loc[:,'Income'].median())
4
5df.median(axis = 1)[0:5]
python
Output:
1 51.0
2 508350.0
3
4 0 102.0
5 1 192.0
6 2 192.0
7 3 192.0
8 4 192.0
9 dtype: float64

Mode
Mode represents the most frequent value of a variable in the data. This is
the only central tendency measure that can be used with categorical
variables, unlike the mean and the median which can be used only with
quantitative data.
The line of code below prints the mode of all the variables in the data.
The .mode() function returns the most common value or most repeated
value of a variable. The command df.mode(axis = 0) will also give the same
output.
1df.mode()
python
Output:
1| | Marital_status | Dependents |
Is_graduate | Income | Loan_amount |
Term_months | Credit_score | approval_status
| Age | Sex |
2|--- |---------------- |------------ |---------
---- |-------- |------------- |------------- |---
----------- |----------------- |----- |-----
|
3| 0 | Yes | 0 | Yes
| 333300 | 70000 | 192.0 |
Satisfactory | Yes | 55 | M
|
The interpretation of the mode is simple. The output above shows that most
of the applicants are married, as depicted by the 'Marital_status' value of
"Yes". Similar interpreation could be done for the other categorical
variables like 'Sex' and 'Credit-Score'. For numerical variables, the mode
value represents the value that occurs most frequently. For example, the
mode value of 55 for the variable 'Age' means that the highest number (or
frequency) of applicants are 55 years old.

Measures of Dispersion
In the previous sections, we have discussed the various measures of
central tendency. However, as we have seen in the data, the values of
these measures differ for many variables. This is because of the extent to
which a distribution is stretched or squeezed. In statistics, this is measured
by dispersion which is also referred to as variability, scatter, or spread. The
most popular measures of dispersion are standard deviation, variance, and
the interquartile range.

Standard Deviation
Standard deviation is a measure that is used to quantify the amount of
variation of a set of data values from its mean. A low standard deviation for
a variable indicates that the data points tend to be close to its mean, and
vice versa. The line of code below prints the standard deviation of all the
numerical variables in the data.
1df.std()
python
Output:
1 Dependents 1.026362
2 Income 711421.814154
3 Loan_amount 724293.480782
4 Term_months 31.933949
5 Age 14.728511
6 dtype: float64
While interpreting standard deviation values, it is important to understand
them in conjunction with the mean. For example, in the above output, the
standard deviation of the variable 'Income' is much higher than that of the
variable 'Dependents'. However, the unit of these two variables is different
and, therefore, comparing the dispersion of these two variables on the
basis of standard deviation alone will be incorrect. This needs to be kept in
mind.
It is also possible to calculate the standard deviation of a particular
variable, as shown in the first two lines of code below. The third
line calculates the standard deviation for the first five rows.
1print(df.loc[:,'Age'].std())
2print(df.loc[:,'Income'].std())
3
4#calculate the standard deviation of the first five
rows
5df.std(axis = 1)[0:5]
python
Output:
1 14.728511412020659
2 711421.814154101
3
4 0 133651.842584
5 1 305660.733951
6 2 244137.726597
7 3 233466.205060
8 4 202769.786470
9 dtype: float64

Variance
Variance is another measure of dispersion. It is the square of the standard
deviation and the covariance of the random variable with itself. The line of
code below prints the variance of all the numerical variables in the dataset.
The interpretation of the variance is similar to that of the standard deviation.
1df.var()
python
Output:
1 Dependents 1.053420e+00
2 Income 5.061210e+11
3 Loan_amount 5.246010e+11
4 Term_months 1.019777e+03
5 Age 2.169290e+02
6 dtype: float64

Interquartile Range (IQR)


The Interquartile Range (IQR) is a measure of statistical dispersion, and is
calculated as the difference between the upper quartile (75th percentile)
and the lower quartile (25th percentile). The IQR is also a very important
measure for identifying outliers and could be visualized using a boxplot.
IQR can be calculated using the iqr() function. The first line of code below
imports the 'iqr' function from the scipy.stats module, while the second
line prints the IQR for the variable 'Age'.
1from scipy.stats import iqr
2iqr(df['Age'])
python
Output:
1 25.0

Skewness
Another useful statistic is skewness, which is the measure of the symmetry,
or lack of it, for a real-valued random variable about its mean. The
skewness value can be positive, negative, or undefined. In a perfectly
symmetrical distribution, the mean, the median, and the mode will all have
the same value. However, the variables in our data are not symmetrical,
resulting in different values of the central tendency.
We can calculate the skewness of the numerical variables using
the skew() function, as shown below.
1print(df.skew())
python
Output:
1 Dependents 1.169632
2 Income 5.344587
3 Loan_amount 5.006374
4 Term_months -2.471879
5 Age -0.055537
6 dtype: float64
The skewness values can be interpreted in the following manner:
 Highly skewed distribution: If the skewness value is less than −1 or
greater than +1.
 Moderately skewed distribution: If the skewness value is between −1
and −½ or between +½ and +1.
 Approximately symmetric distribution: If the skewness value is
between −½ and +½.

Data Visualisation in Python using


Matplotlib and Seaborn
It may sometimes seem easier to go through a set of data points and build insights
from it but usually this process may not yield good results. There could be a lot of
things left undiscovered as a result of this process. Additionally, most of the data sets
used in real life are too big to do any analysis manually. This is essentially where data
visualization steps in.
Data visualization is an easier way of presenting the data, however complex it is, to
analyze trends and relationships amongst variables with the help of pictorial
representation.
The following are the advantages of Data Visualization
 Easier representation of compels data
 Highlights good and bad performing areas
 Explores relationship between data points
 Identifies data patterns even for larger data points
While building visualization, it is always a good practice to keep some below
mentioned points in mind
 Ensure appropriate usage of shapes, colors, and size while building visualization
 Plots/graphs using a co-ordinate system are more pronounced
 Knowledge of suitable plot with respect to the data types brings more clarity to the
information
 Usage of labels, titles, legends and pointers passes seamless information the wider
audience
Python Libraries
There are a lot of python libraries which could be used to build visualization
like matplotlib, vispy, bokeh, seaborn, pygal, folium, plotly, cufflinks, and networkx.
Of the many, matplotlib and seaborn seems to be very widely used for basic to
intermediate level of visualizations.

Matplotlib

It is an amazing visualization library in Python for 2D plots of arrays, It is a multi-


platform data visualization library built on NumPy arrays and designed to work with
the broader SciPy stack. It was introduced by John Hunter in the year 2002. Let’s try
to understand some of the benefits and features of matplotlib
 It’s fast, efficient as it is based on numpy and also easier to build
 Has undergone a lot of improvements from the open source community since
inception and hence a better library having advanced features as well
 Well maintained visualization output with high quality graphics draws a lot of
users to it
 Basic as well as advanced charts could be very easily built
 From the users/developers point of view, since it has a large community support,
resolving issues and debugging becomes much easier

Seaborn

Conceptualized and built originally at the Stanford University, this library sits on top
of matplotlib. In a sense, it has some flavors of matplotlib while from the visualization
point, it is much better than matplotlib and has added features as well. Below are its
advantages
 Built-in themes aid better visualization
 Statistical functions aiding better data insights
 Better aesthetics and built-in plots
 Helpful documentation with effective examples
Nature of Visualization
Depending on the number of variables used for plotting the visualization and the type
of variables, there could be different types of charts which we could use to understand
the relationship. Based on the count of variables, we could have
 Univariate plot(involves only one variable)
 Bivariate plot(more than one variable in required)
A Univariate plot could be for a continuous variable to understand the spread and
distribution of the variable while for a discrete variable it could tell us the count
Similarly, a Bivariate plot for continuous variable could display essential statistic like
correlation, for a continuous versus discrete variable could lead us to very important
conclusions like understanding data distribution across different levels of a categorical
variable. A bivariate plot between two discrete variables could also be developed.
Box plot
A boxplot, also known as a box and whisker plot, the box and the whisker are clearly
displayed in the below image. It is a very good visual representation when it comes to
measuring the data distribution. Clearly plots the median values, outliers and the
quartiles. Understanding data distribution is another important factor which leads to
better model building. If data has outliers, box plot is a recommended way to identify
them and take necessary actions.
Syntax: seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
width=0.8, dodge=True, fliersize=5, linewidth=None, whis=1.5, ax=None, **kwargs)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
color: Color for all of the elements.
Returns: It returns the Axes object with the plot drawn onto it.
The box and whiskers chart shows how data is spread out. Five pieces of information
are generally included in the chart
1. The minimum is shown at the far left of the chart, at the end of the left ‘whisker’
2. First quartile, Q1, is the far left of the box (left whisker)
3. The median is shown as a line in the center of the box
4. Third quartile, Q3, shown at the far right of the box (right whisker)
5. The maximum is at the far right of the box
As could be seen in the below representations and charts, a box plot could be plotted
for one or more than one variable providing very good insights to our data.
Representation of box plot.

Box plot representing multi-variate categorical variables


Box plot representing multi-variate categorical variables

 Python3

# import required modules

import matplotlib as plt

import seaborn as sns

# Box plot and violin plot for Outcome vs BloodPressure

_, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

# box plot illustration

sns.boxplot(x='Outcome', y='BloodPressure', data=diabetes,


ax=axes[0])

# violin plot illustration

sns.violinplot(x='Outcome', y='BloodPressure', data=diabetes,


ax=axes[1])
Output for Box Plot and Violin Plot

 Python3

# Box plot for all the numerical variables

sns.set(rc={'figure.figsize': (16, 5)})

# multiple box plot illustration

sns.boxplot(data=diabetes.select_dtypes(include='number'))

Output Multiple Box PLot

Scatter Plot
Scatter plots or scatter graphs is a bivariate plot having greater resemblance to line
graphs in the way they are built. A line graph uses a line on an X-Y axis to plot a
continuous function, while a scatter plot relies on dots to represent individual pieces of
data. These plots are very useful to see if two variables are correlated. Scatter plot
could be 2 dimensional or 3 dimensional.
Syntax: seaborn.scatterplot(x=None, y=None, hue=None, style=None, size=None,
data=None, palette=None, hue_order=None, hue_norm=None, sizes=None,
size_order=None, size_norm=None, markers=True, style_order=None,
x_bins=None, y_bins=None, units=None, estimator=None, ci=95, n_boot=1000,
alpha=’auto’, x_jitter=None, y_jitter=None, legend=’brief’, ax=None, **kwargs)
Parameters:
x, y: Input data variables that should be numeric.
data: Dataframe where each column is a variable and each row is an observation.
size: Grouping variable that will produce points with different sizes.
style: Grouping variable that will produce points with different markers.
palette: Grouping variable that will produce points with different markers.
markers: Object determining how to draw the markers for different levels.
alpha: Proportional opacity of the points.
Returns: This method returns the Axes object with the plot drawn onto it.

Advantages of a scatter plot

 Displays correlation between variables


 Suitable for large data sets
 Easier to find data clusters
 Better representation of each data point

 Python3

# import module

import matplotlib.pyplot as plt

# scatter plot illustration

plt.scatter(diabetes['DiabetesPedigreeFunction'], diabetes['BMI'])
Output 2D Scattered Plot

 Python3

# import required modules

from mpl_toolkits.mplot3d import Axes3D

# assign axis values

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

y = [5, 6, 2, 3, 13, 4, 1, 2, 4, 8]

z = [2, 3, 3, 3, 5, 7, 9, 11, 9, 10]

# adjust size of plot

sns.set(rc={'figure.figsize': (8, 5)})

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c='r', marker='o')


# assign labels

ax.set_xlabel('X Label'), ax.set_ylabel('Y Label'),


ax.set_zlabel('Z Label')

# display illustration

plt.show()

Output 3D Scattered Plot

Histogram
Histograms display counts of data and are hence similar to a bar chart. A histogram
plot can also tell us how close a data distribution is to a normal curve. While working
out statistical method, it is very important that we have a data which is normally or
close to a normal distribution. However, histograms are univariate in nature and bar
charts bivariate.
A bar graph charts actual counts against categories e.g. height of the bar indicates the
number of items in that category whereas a histogram displays the same categorical
variables in bins.
Bins are integral part while building a histogram they control the data points which are
within a range. As a widely accepted choice we usually limit bin to a size of 5-20,
however this is totally governed by the data points which is present.

 Python3

# illustrate histogram

features = ['BloodPressure', 'SkinThickness']

diabetes[features].hist(figsize=(10, 4))

Output Histogram

Countplot
A countplot is a plot between a categorical and a continuous variable. The continuous
variable in this case being the number of times the categorical is present or simply the
frequency. In a sense, count plot can be said to be closely linked to a histogram or a
bar graph.
Syntax : seaborn.countplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, orient=None, color=None, palette=None, saturation=0.75,
dodge=True, ax=None, **kwargs)
Parameters : This method is accepting the following parameters that are described
below:

 x, y: This parameter take names of variables in data or vector data, optional,


Inputs for plotting long-form data.
 hue : (optional) This parameter take column name for colour encoding.
 data : (optional) This parameter take DataFrame, array, or list of arrays, Dataset
for plotting. If x and y are absent, this is interpreted as wide-form. Otherwise it is
expected to be long-form.
 order, hue_order : (optional) This parameter take lists of strings. Order to plot the
categorical levels in, otherwise the levels are inferred from the data objects.
 orient : (optional)This parameter take “v” | “h”, Orientation of the plot (vertical
or horizontal). This is usually inferred from the dtype of the input variables but
can be used to specify when the “categorical” variable is a numeric or when
plotting wide-form data.
 color : (optional) This parameter take matplotlib color, Color for all of the
elements, or seed for a gradient palette.
 palette : (optional) This parameter take palette name, list, or dict, Colors to use
for the different levels of the hue variable. Should be something that can be
interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib
colors.
 saturation : (optional) This parameter take float value, Proportion of the original
saturation to draw colors at. Large patches often look better with slightly
desaturated colors, but set this to 1 if you want the plot colors to perfectly match
the input color spec.
 dodge : (optional) This parameter take bool value, When hue nesting is used,
whether elements should be shifted along the categorical axis.
 ax : (optional) This parameter take matplotlib Axes, Axes object to draw the plot
onto, otherwise uses the current Axes.
 kwargs : This parameter take key, value mappings, Other keyword arguments are
passed through to matplotlib.axes.Axes.bar().
Returns: Returns the Axes object with the plot drawn onto it.
It simply shows the number of occurrences of an item based on a certain type of
category.In python, we can create a count plot using the seaborn library. Seaborn is a
module in Python that is built on top of matplotlib and used for visually appealing
statistical plots.
 Python3

# import required module

import seaborn as sns

# assign required values

_, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))


# illustrate count plots

sns.countplot(x='Outcome', data=diabetes, ax=axes[0])

sns.countplot(x='BloodPressure', data=diabetes, ax=axes[1])

Output Countplot

Correlation plot
Correlation plot is a multi-variate analysis which comes very handy to have a look at
relationship with data points. Scatter plots helps to understand the affect of one
variable over the other. Correlation could be defined as the affect which one variable
has over the other.
Correlation could be calculated between two variables or it could be one versus many
correlations as well which we could see the below plot. Correlation could be positive,
negative or neutral and the mathematical range of correlations is from -1 to 1.
Understanding the correlation could have a very significant effect on the model
building stage and also understanding the model outputs.

 Python3

# Finding and plotting the correlation for

# the independent variables


# import required module

import seaborn as sns

# adjust plot

sns.set(rc={'figure.figsize': (14, 5)})

# assign data

ind_var = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',

'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

# illustrate heat map.

sns.heatmap(diabetes.select_dtypes(include='number').corr(),

cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))

Output Correlation Plot

Heat Maps
Heat map is a multi-variate data representation. The color intensity in a heat map
displays becomes an important factor to understand the affect of data points. Heat
maps are easier to understand and easier to explain as well. When it comes to data
analysis using visualization, its very important that the desired message gets conveyed
with the help of plots.
Syntax:
seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robus
t=False, annot=None, fmt=’.2g’, annot_kws=None, linewidths=0, linecolor=’white’,
cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels=’auto’, ytick
labels=’auto’, mask=None, ax=None, **kwargs)
Parameters : This method is accepting the following parameters that are described
below:

 x, y: This parameter take names of variables in data or vector data, optional,


Inputs for plotting long-form data.
 hue : (optional) This parameter take column name for colour encoding.
 data : (optional) This parameter take DataFrame, array, or list of arrays, Dataset
for plotting. If x and y are absent, this is interpreted as wide-form. Otherwise it is
expected to be long-form.
 color : (optional) This parameter take matplotlib color, Color for all of the
elements, or seed for a gradient palette.
 palette : (optional) This parameter take palette name, list, or dict, Colors to use
for the different levels of the hue variable. Should be something that can be
interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib
colors.
 ax : (optional) This parameter take matplotlib Axes, Axes object to draw the plot
onto, otherwise uses the current Axes.
 kwargs : This parameter take key, value mappings, Other keyword arguments are
passed through to matplotlib.axes.Axes.bar().
Returns: Returns the Axes object with the plot drawn onto it.

 Python3

# import required module

import seaborn as sns

import numpy as np

# assign data

data = np.random.randn(50, 20)


# illustrate heat map

ax = sns.heatmap(data, xticklabels=2, yticklabels=False)

Output Heat Map

Pie Chart
Pie chart is a univariate analysis and are typically used to show percentage or
proportional data. The percentage distribution of each class in a variable is provided
next to the corresponding slice of the pie. The python libraries which could be used to
build a pie chart is matplotlib and seaborn.
Syntax: matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None,
autopct=None, shadow=False)
Parameters:
data represents the array of data values to be plotted, the fractional area of each slice
is represented by data/sum(data). If sum(data)<1, then the data values returns the
fractional area directly, thus resulting pie will have empty wedge of size 1-sum(data).
labels is a list of sequence of strings which sets the label of each wedge.
color attribute is used to provide color to the wedges.
autopct is a string used to label the wedge with their numerical value.
shadow is used to create shadow of wedge.
Below are the advantages of a pie chart
 Easier visual summarization of large data points
 Effect and size of different classes can be easily understood
 Percentage points are used to represent the classes in the data points
 Python3
# import required module

import matplotlib.pyplot as plt

# Creating dataset

cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR', 'MERCEDES']

data = [23, 17, 35, 29, 12, 41]

# Creating plot

fig = plt.figure(figsize=(10, 7))

plt.pie(data, labels=cars)

# Show plot

plt.show()

Output Pie Chart


 Python3

# Import required module

import matplotlib.pyplot as plt

import numpy as np

# Creating dataset

cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR', 'MERCEDES']

data = [23, 17, 35, 29, 12, 41]

# Creating explode data

explode = (0.1, 0.0, 0.2, 0.3, 0.0, 0.0)

# Creating color parameters

colors = ("orange", "cyan", "brown", "grey", "indigo", "beige")

# Wedge properties

wp = {'linewidth': 1, 'edgecolor': "green"}

# Creating autocpt arguments

def func(pct, allvalues):

absolute = int(pct / 100.*np.sum(allvalues))


return "{:.1f}%\n({:d} g)".format(pct, absolute)

# Creating plot

fig, ax = plt.subplots(figsize=(10, 7))

wedges, texts, autotexts = ax.pie(data, autopct=lambda pct:


func(pct, data), explode=explode, labels=cars,

shadow=True, colors=colors,
startangle=90, wedgeprops=wp,

textprops=dict(color="magenta"))

# Adding legend

ax.legend(wedges, cars, title="Cars", loc="center left",

bbox_to_anchor=(1, 0, 0.5, 1))

plt.setp(autotexts, size=8, weight="bold")

ax.set_title("Customizing pie chart")

# Show plot

plt.show()
Output

Error Bars
Error bars could be defined as a line through a point on a graph, parallel to one of the
axes, which represents the uncertainty or error of the corresponding coordinate of the
point. These types of plots are very handy to understand and analyze the deviations
from the target. Once errors are identified, it could easily lead to deeper analysis of the
factors causing them.
 Deviation of data points from the threshold could be easily captured
 Easily captures deviations from a larger set of data points
 It defines the underlying data
 Python3

# Import required module

import matplotlib.pyplot as plt

import numpy as np

# Assign axes

x = np.linspace(0,5.5,10)

y = 10*np.exp(-x)

# Assign errors regarding each axis

xerr = np.random.random_sample(10)

yerr = np.random.random_sample(10)

# Adjust plot

fig, ax = plt.subplots()
ax.errorbar(x, y, xerr=xerr, yerr=yerr, fmt='-o')

# Assign labels

ax.set_xlabel('x-axis'), ax.set_ylabel('y-axis')

ax.set_title('Line plot with error bars')

# Illustrate error bars

plt.show()

Output Error Plot

Python Seaborn Tutorial | Data


Visualization Using Seaborn
Python is a storehouse of numerous immensely powerful libraries and
frameworks. Among them, is Seaborn, which is a dominant data
visualization library. In this Python Seaborn Tutorial, you will be leaning
all the knacks of data visualization using Seaborn.
So let’s begin first by reasoning out the importance of Python Seaborn.

Why use Python Seaborn?


As mentioned earlier, the Python Seaborn library is used to ease the
challenging task of data visualization and it’s based on Matplotlib.
Seaborn allows the creation of statistical graphics through the following
functionalities:

An API that is based on datasets allowing comparison between multiple


variable

Supports multi-plot grids that in turn ease building complex visualizations

Univariate and bivariate visualizations available to compare between


subsets of data

Availability of different color palettes to reveal various kinds of patterns

Estimates and plots linear regression automatically

So, if you were wondering as to why use Seaborn when you already have
Matplotlib, here is the answer to it.

Python Seaborn vs Matplotlib:

“If Matplotlib “tries to make easy things easy and hard things possible”,
seaborn tries to make a well-defined set of hard things easy too” –
Michael Waskom (Creator of Seaborn).
Factually, Matplotlib is good but Seaborn is better. There are basically two
shortcomings of Matplotlib that Seaborn fixes:

Matplotlib can be personalized but it’s difficult to figure out what settings
are required to make plots more attractive. On the other hand, Seaborn
comes with numerous customized themes and high-level interfaces to
solve this issue.

When working with Pandas, Matplotlib doesn’t serve well when it comes
to dealing with DataFrames, while Seaborn functions actually work on
DataFrames.

Seaborn Plotting Functions

Visualizing Statistical Relationships:

The process of understanding relationships between variables of a


dataset and how these relationships, in turn, depend on other variables is
known as statistical analysis. Let’s now take a deeper look at functions
needed for this:

relplot():
This is a figure-level-function that makes use of two other axes functions
for Visualizing Statistical Relationships which are:

These functions can be specified using the ‘kind’ parameter of relplot(). In


case this parameter is given, it takes the default one which is scatterplot().
Before you begin writing your code, make sure to import the required
libraries as follows:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style="darkgrid")

Please note that the style attribute is also customizable and can take any
value such as darkgrid, ticks, etc which I will discuss later in the plot-
aesthetics section. Let’s now take a look at a small example:

EXAMPLE:

f = sns.load_dataset("flights")

sns.relplot(x="passengers", y="month", data=f);

OUTPUT:
As you can see, the points are plotted in 2-dimensions. However, you can
add another dimension using the ‘hue’ semantic. Let’s take a look at an
example of the same:

EXAMPLE:

f = sns.load_dataset("flights")

sns.relplot(x="passengers", y="month", hue="year", data=f);

You will see the following output:

OUTPUT:

However, there are many more customizations that you can try out such
as colors, styles, size, etc. Let me just show how you can change the
color in the following example:

EXAMPLE:
sns.set(style="darkgrid")

f = sns.load_dataset("flights")

sns.relplot(x="passengers", y="month", hue="year",palette="ch:r=-


.5,l=.75", data=f);

OUTPUT:

lineplot():
This function will allow you to draw a continuous line for your data. You
can use this function by changing the ‘kind’ parameter as follows:

EXAMPLE:

a=pd.DataFrame({'Day':[1,2,3,4,5,6,7],'Grocery':[30,80,45,23,51,46
,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56,87,5
4,34]},index=[1,2,3,4,5,6,7])

g = sns.relplot(x="Day", y="Clothes", kind="line", data=a)


g.fig.autofmt_xdate()

OUTPUT:

The default for lineplot is y as a function of x. However, it can be changed


if you wish to do so. There are many more options which you can try out
further.

Now let’s take a look at how to plot categorical data.

Plotting with Categorical Data:

This approach comes into the picture when our main variable is further
divided into discrete groups (categorical). This can be achieved using the
catplot() function.

catplot():
This is a figure-level-function like relplot(). It can be characterized by three
families of axes level functions namely:
Scatterplots – These include stripplot(), swarmplot()

Distribution Plots – which are boxplot(), violinplot(), boxenplot()

Estimateplots – namely pointplot(), barplot(), countplot()

Let’s now take a few examples to demonstrate this:

EXAMPLE:

import seaborn as sns

import matplotlib.pyplot as plt

sns.set(style="ticks", color_codes=True)

a = sns.load_dataset("tips")

sns.catplot(x="day", y="total_bill", data=a);

OUTPUT:
As you can see, in the above example I have not set the ‘kind’ parameter.
Therefore it has returned the graph as the default scatterplot. You can
specify any of the axes level function to change the graph as need be.
Let’s take an example of this as well:

EXAMPLE:

import seaborn as sns

import matplotlib.pyplot as plt

sns.set(style="ticks", color_codes=True)

a = sns.load_dataset("tips")

sns.catplot(x="day", y="total_bill", kind="violin", data=a);

OUTPUT:
The above output shows the violinplot for the tips dataset. Now let us try
to find how to visualize the distribution of a dataset.

Visualizing the distribution of a dataset:

This basically deals with understanding datasets with context to being


univariate or bivariate. Before starting with this, just import the following:

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from scipy import stats

sns.set(color_codes=True)
Once this is done, you can continue plotting univariate and bivariate
distributions.

Plotting Univariate distributions:

To plot them, you can make use of distplot() function as follows:

EXAMPLE:

a = np.random.normal(loc=5,size=100,scale=2)

sns.distplot(a);

OUTPUT:

As you can see in the above example, we have plotted a graph for the
variable a whose values are generated by the normal() function using
distplot.

Plotting bivariate distributions:


This comes into picture when you have two random independent variables
resulting in some probable event. The best function to plot these type of
graphs is jointplot(). Let us now plot a bivariate graph using jointplot().

EXAMPLE:

x=pd.DataFrame({'Day':[1,2,3,4,5,6,7],'Grocery':[30,80,45,23,51,46
,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56,87,5
4,34]},index=[1,2,3,4,5,6,7])

y=pd.DataFrame({'Day':[8,9,10,11,12,13,14],'Grocery':[30,80,45,23,
51,46,76],'Clothes':[13,40,34,23,54,67,98],'Utensils':[12,32,27,56
,87,54,34]},index=[8,9,10,11,12,13,14])

mean, cov = [0, 1], [(1, .5), (.5, 1)]

data = np.random.multivariate_normal(mean, cov, 200)

with sns.axes_style("white"):

sns.jointplot(x=x, y=y, kind="kde", color="b");

OUTPUT:
Now that you have understood the various functions in Python Seaborn,
let’s move on to build structured multi-plot grids.

You might also like