Data Visualization


UNIVERSITY OF TOMORROW

Data Visualization
Unit 1

Introduction

Learning Objectives

By the end of this unit, you will be able to understand:
● What is data visualization?
● Why is it important to learn visualization?
● Ways to visualize in Python
● Why choose Matplotlib over others?
● Installation of Matplotlib in Python

Introduction

Data visualization is a powerful tool that transforms complex datasets into meaningful and insightful visuals. In today’s data-driven world, where an immense amount of information is generated daily, the ability to effectively present and understand data is crucial. Data visualization bridges the gap between raw data and human comprehension, allowing us to explore patterns, trends, and relationships that might otherwise remain hidden in spreadsheets or databases.

The primary goal of data visualization is to communicate
information clearly, accurately, and efficiently. It goes
beyond mere aesthetics; a well-designed visualization
can convey information, insights, and narratives
that can drive informed decision-making, enhance
communication, and tell compelling stories. Whether
you’re a business analyst trying to convey sales trends,
a scientist explaining experimental results, or a journalist
presenting investigative findings, data visualization is
a universal language that can be understood by both
experts and laypeople alike.

There are various types of data visualization techniques, ranging from simple bar charts and line graphs to more complex visualizations like heatmaps, scatter plots, bubble charts, and network diagrams. The choice of visualization depends on the nature of the data and the specific insights you aim to communicate. Visualization tools and software have evolved significantly, making it easier for individuals with varying levels of technical expertise to create compelling visuals.

What is Data Visualization?

Data visualization is the graphical representation of information and data. It involves utilizing visual elements like charts, graphs, and maps to convey intricate data in a clear and easily comprehensible manner. The primary objective of data visualization is to aid individuals in comprehending the importance of the data by presenting it in a visually captivating and informative way.

Data visualization constitutes an essential component of data analysis and communication. It empowers individuals to discern patterns, trends, and insights from extensive datasets that might otherwise be challenging to grasp in their raw format. Through diverse visual aids, data visualization empowers decision-makers, analysts, and the general audience to swiftly interpret information and make informed decisions grounded in the presented data.

Various types of data visualizations exist, each suited for distinct data types and analytical intentions. A few typical data visualization types include:

● Bar Charts: Employed for comparing categorical data and illustrating the frequency or distribution of distinct categories.
● Line Graphs: Valuable for depicting trends and fluctuations in data over time.
● Pie Charts: Effective in illustrating the proportion of different constituents within a whole.
● Scatter Plots: Utilized to exhibit the correlation or patterns between two variables.
● Heatmaps: Visual portrayals of data utilizing color gradients to demonstrate the density or intensity of values in a matrix.
● Geographic Maps: Utilized to spatially represent data and showcase disparities across geographical regions.
● Tree Maps: Hierarchical data representation employing nested rectangles to visualize proportions and hierarchical relationships.
● Infographics: Fusion of various visual elements like charts, graphs, images, and text to succinctly present intricate information.

Data visualization tools and software, such as Tableau, Power BI, Python’s Matplotlib and Seaborn, and R’s ggplot2, have progressively gained popularity in recent years for creating interactive and dynamic visualizations that enrich the data exploration process.

Skillful data visualizations are purposefully designed to be lucid, succinct, and precise, bestowing valuable insights upon the audience at a glance. They hold a pivotal role across numerous domains,
encompassing business, science, journalism, and academia, wherein data-informed decisions and communication are pivotal for attaining success.

Why is it Important to Learn Visualization?

● Effective Communication: Visualizations are powerful tools for conveying complex information in a simple and understandable manner. Learning visualization techniques enables you to effectively communicate data-driven insights to both technical and non-technical audiences. This skill is valuable in various professional fields, including business, research, and academia.
● Data Exploration and Analysis: Visualization helps you explore and understand data more efficiently. By plotting data in various ways, you can identify patterns, trends, outliers, and relationships between variables that might be overlooked when examining raw data. It facilitates the data analysis process and leads to more accurate and insightful conclusions.
● Decision Making: In today’s data-driven world, decision-makers heavily rely on data to make informed choices. Data visualizations enable decision-makers to quickly grasp the implications of different options and make well-informed decisions. They can also help identify potential risks and opportunities.
● Storytelling with Data: Data visualizations allow you to tell compelling stories with data. By presenting data in a visually engaging manner, you can captivate your audience, highlight key insights, and persuade them to take action based on the information presented.
● Data Reporting: Whether it’s a business report, research paper, or presentation, data visualizations make your work more professional and impactful. Instead of overwhelming your audience with tables and numbers, you can present the information visually, making it easier for them to grasp the key takeaways.
● Detecting Errors and Anomalies: Visualization can help you identify data errors, inconsistencies, or outliers that may not be apparent when looking at raw data. Spotting and rectifying these issues is crucial to ensure data accuracy and the validity of analysis results.
● Collaboration and Teamwork: Data visualization facilitates collaboration among team members. When everyone can see and understand the data through visualizations, it becomes easier to discuss and analyze the information collectively, fostering better teamwork and decision-making.
● Innovation and Creativity: Data visualization allows you to experiment with different visual elements, design principles, and creative approaches. It encourages innovation in presenting data in unique and engaging ways.
● Career Advancement: Proficiency in data visualization is a valuable skill in today’s job market. Employers value individuals who can effectively analyze and communicate data using visualizations, making it an asset that can enhance your career prospects.

Overall, learning data visualization empowers you to become a more informed and effective data
practitioner, enabling you to extract meaningful insights from data and communicate them in a way that resonates with others.

Ways to Visualize in Python

Python offers several powerful libraries for data visualization. Some of the most popular visualization packages in Python are:

● Matplotlib: Matplotlib is one of the most widely used data visualization libraries in Python. It provides flexible and comprehensive plotting functionalities, enabling users to create a wide range of static, interactive, and publication-quality plots. It is highly customizable and suitable for various types of visualizations, including line plots, bar charts, scatter plots, histograms, and more.
● Seaborn: Built on top of Matplotlib, Seaborn provides a higher-level interface for creating attractive statistical graphics. It offers several built-in themes and color palettes, making it easier to create aesthetically pleasing visualizations. Seaborn is particularly useful for statistical data exploration and is often used to visualize distributions, correlations, and categorical data.
● Plotly: Plotly is a powerful library for creating interactive visualizations in Python. It allows users to create interactive charts and dashboards that can be displayed in web browsers or Jupyter notebooks. Plotly supports a variety of chart types, including line charts, scatter plots, bar charts, 3D plots, choropleth maps, and more.
● Bokeh: Bokeh is another library that focuses on interactive visualizations for web browsers. It is designed to handle large datasets and allows for interactive exploration of data. Bokeh provides various visualizations like line plots, scatter plots, bar charts, and heatmaps.
● Altair: Altair is a declarative statistical visualization library for Python. It uses a simple and concise syntax based on Vega-Lite, making it easy to create complex visualizations with minimal code. Altair is well-suited for exploratory data analysis and rapid prototyping of visualizations.
● ggplot (ggplot2 for Python): Inspired by the popular ggplot2 library in R, ggplot is a Python package that follows the “Grammar of Graphics” principles. It provides a high-level interface for creating complex visualizations with concise code.
● Pandas Plotting: Pandas, a popular data manipulation library in Python, also includes built-in plotting functionality. While not as powerful as some dedicated visualization libraries, it offers a convenient way to quickly visualize data directly from pandas DataFrames.
● Holoviews: Holoviews is a library that simplifies the process of creating interactive visualizations. It allows users to focus on the data and high-level semantics, automatically handling the details of visualization and interactivity.

These visualization packages cater to different needs and preferences, ranging from simple static plots to complex interactive visualizations. Depending on your project requirements, you can
choose the most suitable library to effectively convey insights from your data.

Why Choose Matplotlib Over Others?

While there are several excellent data visualization libraries in Python, there are specific reasons why one might choose Matplotlib over others:

● Maturity and Stability: Matplotlib is one of the oldest and most mature data visualization libraries in Python. It has been extensively used and tested over the years, making it a stable and reliable choice for creating static, high-quality visualizations.
● Wide Adoption: Matplotlib is widely adopted and has a large user community. As a result, you can find extensive documentation, tutorials, and examples online, making it easier to learn and troubleshoot issues.
● Flexibility and Customization: Matplotlib offers a high degree of customization, allowing users to control every aspect of the visualization. You can fine-tune the appearance of plots, axes, labels, and annotations to match your specific needs.
● Low-Level Control: While some other libraries like Seaborn and Plotly provide high-level abstractions, Matplotlib operates at a lower level, allowing more precise control over plot elements and layout. This is advantageous when you need to create complex and specialized visualizations.
● Integration with Other Libraries: Matplotlib integrates seamlessly with other scientific computing libraries in Python, such as NumPy and pandas. This makes it convenient to plot data directly from these libraries without any extra conversions.
● Publication-Quality Plots: Matplotlib is well-suited for creating publication-quality plots for academic papers, reports, and presentations. It allows you to export plots as high-resolution raster images (e.g., PNG) or as vector graphics (e.g., PDF, SVG).
● Matplotlib Gallery: The Matplotlib gallery provides an extensive collection of sample plots and code snippets, demonstrating the variety of visualizations you can create with the library. It serves as a valuable resource for learning and inspiration.
● Integration with Jupyter Notebooks: Matplotlib works seamlessly with Jupyter Notebooks, allowing you to create interactive visualizations and share your analyses with others.

However, it’s essential to note that the choice of a visualization library ultimately depends on your specific needs and preferences. While Matplotlib offers a lot of flexibility and control, libraries like Seaborn, Plotly, and Altair provide higher-level abstractions and are more suitable for specific types of visualizations or interactive plots. Some users might find other libraries more intuitive or better suited for their particular use cases. Therefore, it’s always a good idea to explore different visualization libraries to find the one that best aligns with your project requirements and workflow.

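Two of the points above, plotting NumPy data without conversion and exporting in several formats, can be sketched in a few lines. This is a minimal illustration rather than a recommended workflow: the sine data, the file name sine_example, and the temporary output folder are assumptions made for the example, and the Agg backend is selected only so the code runs without a display (drop that line in a notebook).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: a NumPy array is passed to Matplotlib as-is.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("NumPy array plotted directly")

# Save the same figure as raster (PNG) and vector (PDF, SVG) output.
# The file name and folder are hypothetical, chosen for this example.
out_dir = tempfile.mkdtemp()
saved_files = []
for ext in ("png", "pdf", "svg"):
    path = os.path.join(out_dir, f"sine_example.{ext}")
    fig.savefig(path, dpi=300)  # dpi matters mainly for raster output such as PNG
    saved_files.append(path)

plt.close(fig)
```

The output format is inferred from the file extension, so the same figure object can be written several times without being redrawn.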
Installation of Matplotlib in Python

If you are using the Anaconda distribution, packages such as matplotlib, pandas, and numpy are generally already installed. If you are not using conda and rely on pip instead, run these commands in a console:

pip install numpy

pip install pandas

pip install matplotlib

pip install seaborn

Note that all of the above commands can also be executed from a notebook by placing a % sign before the command, as shown below:

%pip install numpy

%pip install pandas

%pip install matplotlib

%pip install seaborn

If you don’t know how to install Anaconda or Python with pip, I have created a separate document for that; go through that document to understand the process of installing Anaconda or Python.
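After installing, it can help to confirm that the packages import correctly. The snippet below is a small sketch, not part of any official installation procedure: it tries to import each package named in the commands above and prints its version, or a notice if the package is missing.

```python
import importlib

# Packages installed by the pip commands above.
packages = ["numpy", "pandas", "matplotlib", "seaborn"]

versions = {}
for name in packages:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None  # not installed in this environment

for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name}: {status}")
```

If a package is reported as not installed, rerun the matching pip command in the same environment that runs your notebook or interpreter.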

Summary

Data visualization is the graphical representation of data to help people understand and make sense of
complex information. It involves using various visual elements such as charts, graphs, maps, and diagrams
to present data in a way that is easy to interpret and analyze. Data visualization can reveal patterns, trends,
and insights within data, making it a powerful tool for decision-making, storytelling, and communication
in fields ranging from business and science to journalism and education. Effective data visualization can
enhance data-driven decision-making, improve data communication, and facilitate a deeper understanding
of data for a wide range of audiences.

Unit 2

Different Approaches of Learning Matplotlib

Learning Objectives

By the end of this unit, you will be able to understand:
● Matplotlib
● Styles of writing Matplotlib

Introduction

Before we delve into the reasons behind the various approaches of Matplotlib, it’s important to understand the historical development of the library. This history is one of the main factors contributing to the diverse flavors that Matplotlib offers. Due to its lengthy presence as a package in Python and the involvement of multiple maintainers over time, Matplotlib has evolved in different directions for achieving the same goals. Therefore, comprehending the historical progression of Matplotlib is crucial.

Matplotlib

Matplotlib stands as one of the oldest and most influential data visualization libraries in Python. Its origins trace back to 2003 when John D. Hunter, a neurobiologist at the University of Chicago, created it. Hunter’s motivation stemmed from his need for a tool to generate figures of publication quality for his work in neurophysiology and electrophysiology. Faced with the absence of a suitable solution, he took the initiative to develop his own plotting library.



The name “Matplotlib” arises from the fusion of “Matlab” (a widely-used numerical computing environment) and “plotting.” The library drew inspiration from Matlab’s potent plotting capabilities and aimed to deliver a comparable experience within the Python ecosystem.

Key Milestones in the History of Matplotlib

● 2003: The initial version of Matplotlib was released by John D. Hunter.
● 2004: Matplotlib was open-sourced and made publicly available, allowing other users to contribute to its development.
● 2005: Version 0.87 introduced the object-oriented API, providing more flexibility and control over plot elements. This marked a significant turning point in the library’s design.
● 2007: Version 0.90 added support for interactive plots and introduced the pyplot interface, which simplified plot creation by providing a set of stateful plotting functions.
● 2010: Version 1.0 was released, marking a stable and feature-rich version of Matplotlib. The library’s core functionality was well-established by this point.
● 2012: Matplotlib became the foundation of the newly created Astropy project, which aimed to provide a core set of packages for astronomy-related Python projects.
● 2012: John D. Hunter passed away, but his legacy in creating Matplotlib continues to influence the data visualization community.
● 2014: Version 1.4 introduced significant performance improvements, making Matplotlib faster and more efficient for large datasets.
● 2017: Matplotlib 2.0 was released, featuring various enhancements, bug fixes, and new plotting functionalities.
● 2018: Matplotlib reached version 3.0, introducing further improvements and optimizations. It remains one of the go-to libraries for data visualization in Python. As of writing this book, the current version is Matplotlib 3.7.2.

Throughout its history, Matplotlib has been widely adopted and has played a pivotal role in the Python data science ecosystem. Its integration with Jupyter Notebooks, ease of use, and ability to create publication-quality plots have made it a popular choice among researchers, data analysts, and data scientists worldwide. Over time, several other visualization libraries have been developed in Python, but Matplotlib’s influence and utility have kept it relevant and essential for data visualization tasks.

Now that we understand its historical significance and recognize the significant changes after 2013, we can observe that older solutions may exhibit different coding styles compared to modern solutions using Matplotlib.

Styles of Writing Matplotlib

Let’s delve into the various styles of writing Matplotlib in general. In the context of Matplotlib, “stateful” and “stateless” refer to different approaches for creating and managing plots and


figures. Understanding the difference between these two modes is essential for effectively using Matplotlib.

Stateful: In the stateful approach, Matplotlib maintains an implicit global state in the background, and each plotting function modifies this state directly. This means that every time you call a plotting function, such as plt.scatter(), it automatically adds elements to the current active figure and axes. While this approach can be convenient for quick and interactive plotting, it can also lead to confusion and potential issues when dealing with multiple plots or complex figures.

Example of Stateful Approach:

import matplotlib.pyplot as plt

# Stateful approach: Implicitly modifies global state
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Stateful Plot')
plt.show()

Stateless: In the stateless approach, you explicitly manage figure and axes objects, and there is no implicit global state. Rather than directly modifying the current active figure, you explicitly create Figure and Axes objects and operate on them directly. This approach is more explicit and empowers you with greater control over your plots, simplifying the process of creating complex and customized visualizations.

Example of Stateless Approach:

import matplotlib.pyplot as plt

# Stateless approach: Explicitly creating and modifying Figure and Axes
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Stateless Plot')
plt.show()

Which approach should you use? While both approaches are valid and have their use cases, the stateless approach is generally considered more robust and recommended for most scenarios, especially when creating complex and multiple plots in the same figure. This approach avoids potential issues that might arise from global state modifications inherent in the stateful approach. Furthermore, the stateless approach is more aligned with the object-oriented design of Matplotlib, offering greater flexibility and control over your visualizations.

In summary, when working on extensive and intricate visualizations, it’s advisable to utilize the stateless approach involving explicit Figure and Axes creation. On the other hand, for rapid and interactive plotting in simpler scenarios, the stateful approach can be more convenient.

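The case of several plots in one figure, mentioned above as the main reason to prefer the stateless approach, can be sketched as follows. The data, the variable names ax_left and ax_right, and the figure size are illustrative assumptions; the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

# One Figure holding two Axes side by side.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Each call names the Axes it draws on, so the order of calls does not matter.
ax_right.plot(x, [v ** 3 for v in x])
ax_left.plot(x, [v ** 2 for v in x])

ax_left.set_title("Squares")
ax_right.set_title("Cubes")

# The stateful interface would instead draw on whichever Axes is "current",
# which is the one plt.gca() returns at that moment.
current_axes = plt.gca()

fig.tight_layout()
plt.close(fig)
```

With the stateful style, a call such as plt.title() would apply only to the current Axes, so you would have to keep track of which subplot is active before every call. Naming the Axes explicitly removes that bookkeeping.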


Another way to approach visualization challenges using Matplotlib is through pylab, a module that imports pyplot and NumPy names into a single namespace. However, starting from Matplotlib version 2.0, the use of this module has been discouraged, and it is not recommended for import. Instead, the standard practice is to import pyplot (and NumPy, if needed) explicitly. If you’re using Matplotlib, prefer the stateless (object-oriented) approach when tackling visualization tasks.
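As a brief sketch of the import practice described above, the following contrasts the discouraged pylab import (left commented out) with explicit imports under the conventional aliases plt and np. The plotted data is an arbitrary assumption for illustration.

```python
# Discouraged: pulls pyplot and NumPy names into one shared namespace.
# from pylab import *

# Preferred: explicit imports with the conventional aliases.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 50)   # NumPy function, clearly from np
fig, ax = plt.subplots()    # pyplot function, clearly from plt
line, = ax.plot(x, x ** 2)
ax.set_title("Explicit imports instead of pylab")
plt.close(fig)
```

With explicit aliases, every call shows which library it comes from, which makes longer scripts easier to read and debug.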

Summary

Matplotlib is a widely-used Python library for creating static, animated, and interactive visualizations of data. It
provides a flexible and extensive set of tools for producing high-quality charts, plots, and graphs. Matplotlib’s
features include support for various plot types, customization options for colors and styles, and the ability
to create complex visualizations for a wide range of data analysis and presentation needs. It is often used
in scientific research, data analysis, data exploration, and data communication, making it an essential tool
for those working with data in Python. Matplotlib’s intuitive syntax and extensive documentation make it
accessible to both beginners and experienced data scientists and researchers.



Unit 3

Understanding Data Types

Learning Objectives

By the end of this unit, you will be able to understand:
● Continuous data
● Discrete data
● Scales of measurement

Introduction

Understanding the data type before commencing the plotting process is of utmost importance, as numerous plots rely heavily on the intrinsic data type of the problem. They may not function as intended with differing data types.

Continuous Data

Continuous data, also referred to as quantitative or numerical data, is a kind of data that can assume any value
within a specified range. In simpler terms, it encompasses
data that is measurable and isn’t confined to particular
discrete values. Continuous data is usually depicted on
a continuous scale and can include fractional values. It
frequently finds application in scientific, engineering, and
statistical analyses.

Here are Some Examples of Continuous Data:

● Height: Individuals’ heights can assume any value within a specific range, categorizing it as continuous data. For instance, a person’s height might be 167.5 cm, 173.2 cm, 182.1 cm, and so on.


● Temperature: Temperature measurements can take on any value within a given range. For example, temperatures can be 25.5°C, 30.2°C, 18.9°C, and so forth.
● Weight: Weight provides another illustration of continuous data. It can encompass various values within a specific range, such as 65.7 kg, 72.3 kg, 58.9 kg, and more.
● Time: Time is continuously measurable, even down to fractions, such as 9:30 AM, 11:45 AM, 3:15 PM, and so on.
● Speed: Speed, being subject to continuous variation within a particular range, is also a form of continuous data. For instance, a car’s speed might be 45.6 km/h, 70.2 km/h, 100.8 km/h, and so forth.
● Distance: The distance covered by an object can adopt any value within a given range. As an example, a person could walk 2.4 km, 3.7 km, 5.1 km, and beyond.
● Pressure: Measurements of pressure, whether atmospheric or related to blood pressure, are also representable as continuous data. As an illustration, atmospheric pressure could be 1013.2 hPa, 1008.9 hPa, 1005.6 hPa, and more.

Continuous data is often susceptible to measurement errors and can be represented using various statistical techniques, including mean, standard deviation, and regression analysis. It differs from discrete data, which solely assumes specific and distinct values (e.g., the number of children in a family, the number of students in a class).

Discrete Data

Discrete data, on the other hand, is a form of data that can exclusively adopt specific, distinct values within a defined set of categories or levels. Unlike continuous data that spans any value within a range, discrete data remains confined to a countable number of values. Discrete data is frequently expressed as counts or frequencies and is commonly employed for categorization and classification purposes.

Here are Some Examples of Discrete Data:

● Number of Children: The count of children in a family is considered discrete data because it exclusively takes specific integer values, such as 0, 1, 2, 3, and so forth.
● Grade Levels: The grade levels of students in a school (e.g., 1st grade, 2nd grade, 3rd grade) are classified as discrete data due to their clear distinction among categories.
● Types of Cars: Car type categories (e.g., sedan, SUV, truck) represent discrete data since each car belongs to one of these distinct categories.
● Rolling a Die: The outcome of rolling a six-sided die is regarded as discrete data because the possible results are confined to the integers 1 through 6.
● Gender: Gender categories (e.g., male, female, non-binary) are treated as discrete data, as each individual falls within one specific category.
● Colors of Traffic Lights: Traffic light colors (red, yellow, green) qualify as discrete data since they


indicate separate states.
● Marital Status: Marital status categories (e.g., single, married, divorced) are labeled as discrete data, as individuals belong to one of these well-defined categories.

Discrete data is often analyzed using methods involving counts and proportions, such as bar charts, frequency tables, and contingency tables. It contrasts with continuous data, which can encompass any value within a range and requires distinct statistical techniques for analysis.

Now, these aforementioned categories can be further divided into subcategories, often referred to as scales of measurement.

Scales of Measurement

Scales of measurement, also known as levels of measurement or data types, categorize data based on the nature of the values they represent and the mathematical operations that can be performed on the data. There are four main types of scales of measurement: nominal, ordinal, interval, and ratio. Each type of scale has distinct characteristics and determines the types of analyses that can be conducted on the data.

● Nominal Scale: Nominal data involves categories or labels with no inherent order or ranking. It is used to classify data into distinct groups without any quantitative value associated with the categories. Nominal data allows for qualitative description and grouping but doesn’t support mathematical operations.
Example: Hair color (e.g., blonde, brown, black) or vehicle types (e.g., sedan, SUV, truck).
● Ordinal Scale: Ordinal data represents categories with a meaningful order or ranking, but the differences between the categories are not uniform or quantifiable. While you can determine which category is greater or lesser, you cannot ascertain the exact difference between them.
Example: Educational levels (e.g., elementary, middle school, high school), Likert scale responses (e.g., strongly disagree, disagree, neutral, agree, strongly agree).
● Interval Scale: Interval data has a meaningful order, and the differences between values are uniform and measurable. However, it lacks a true zero point, meaning that the value of zero doesn’t represent the absence of the measured attribute. Ratios and proportions are not meaningful on an interval scale.
Example: Temperature in Celsius or Fahrenheit. The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C, but a temperature of 0°C does not signify the complete absence of heat.
● Ratio Scale: Ratio data has a meaningful order, uniform intervals, and a true zero point, where zero indicates the absence of the measured attribute. Ratios and proportions are meaningful on a ratio scale.
Example: Height, weight, age, income, distance. A weight of 0 kg indicates no weight, and a weight of 10 kg is twice as heavy as a weight of 5 kg.

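The arithmetic behind these scales can be checked with a few lines of plain Python. This is an illustrative sketch only: the conversion to Kelvin is an addition not discussed in the text, brought in to show what changes once a true zero exists, and the variable names are chosen for the example.

```python
# Interval scale: differences are meaningful, ratios are not.
celsius_a, celsius_b = 10.0, 20.0
difference_c = celsius_b - celsius_a      # 10 degrees, a valid comparison
naive_ratio = celsius_b / celsius_a       # 2.0, but 20 °C is not "twice as hot"

# Kelvin has a true zero, so the same temperatures give a meaningful ratio.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
kelvin_ratio = kelvin_b / kelvin_a        # about 1.035

# Ratio scale: weight has a true zero, so ratios are meaningful.
weight_ratio = 10.0 / 5.0                 # 10 kg is twice as heavy as 5 kg

# Ordinal scale: order is meaningful, distances between ranks are not.
likert_order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
likert_rank = {label: rank for rank, label in enumerate(likert_order)}
agree_outranks_neutral = likert_rank["agree"] > likert_rank["neutral"]

print(f"Celsius difference: {difference_c}")
print(f"Naive Celsius ratio: {naive_ratio}, Kelvin ratio: {kelvin_ratio:.3f}")
print(f"Weight ratio: {weight_ratio}")
print(f"'agree' ranks above 'neutral': {agree_outranks_neutral}")
```

The ranks assigned to the Likert labels support comparisons such as greater or lesser, but subtracting them would suggest equal spacing between categories, which an ordinal scale does not guarantee.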


These scales of measurement dictate the types ● Meaningful Zero Point: A value of zero represents
of statistical analyses and operations that can be the complete absence of the measured attribute,
performed on the data. Nominal and ordinal data and ratios and proportions are meaningful.
are often analyzed using non-parametric methods,
while interval and ratio data can be analyzed using Discrete data falls within both the nominal scale
parametric statistical techniques that involve and the ordinal scale of measurement (a topic open
mathematical operations like addition, subtraction, to debate). These scales involve distinct categories
multiplication, and division. or levels, and the values within these categories
are limited and countable. Discrete data does not
It’s important to note that the ratio scale often falls encompass fractional or continuous values.
under continuous data. We can generally assume
that any metric falling under the interval scale or ratio scale can be considered as continuous data.
However, always consider the business context for confirmation.

Continuous data specifically falls under the ratio scale of measurement. The ratio scale is the only scale
that represents true continuous data, allowing values to take any numerical value within a range, including
fractional values. Additionally, the ratio scale possesses a meaningful zero point, facilitating meaningful
multiplication and division operations.

In summary, continuous data falls under the ratio scale and exhibits the following characteristics:

● Meaningful Order: Data points can be ranked and compared based on their magnitudes.

● Uniform Intervals: The differences between data points are consistent and measurable.

When we refer to categorical data, we generally talk about nominal or ordinal data. Hence, be aware of the
context when discussing categorical data. In programming, categorical data can be represented as text or
numbers. For example, gender can be represented as ‘male’ and ‘female’ or as 1 and 0. Note that the numerical
representation (1 and 0) doesn’t hold any inherent meaning; they are just nominal representations of ‘male’
and ‘female’. The numbers don’t signify any order in this context. Ordinal data, though also categorical, can
be ordered, leading to different interpretations between nominal and ordinal data, despite both being
considered categorical.

Now, I’d like to diverge slightly to discuss univariate and bivariate plots. These plots are highly useful
and receive significant attention during exploratory data analysis (EDA) for tabular structures. There are
numerous ways to generate univariate and bivariate plots. We’ll define these terms with examples and
demonstrate some univariate and bivariate charts.

We’ll break down the code step by step, and in future videos, we’ll delve into more complex examples to
understand more intricate code. For now and the next few chapters, we’re keeping things simple and
utilizing the stateful version of Matplotlib. Focus on the lines being executed and don’t worry too much
about programming principles at this point. We’ll explore the nuances of plots and other aspects later.
Please read the comments to comprehend the code here and in future code blocks as well.

Example:

import matplotlib.pyplot as plt
import numpy as np
# The two libraries above are required here; run these imports in your
# Python shell, interpreter, or Jupyter notebook.

# Generate random data
# The line below creates 1000 samples from a standard normal curve (randn does this).
# The chart is drawn with an unknown seed value; you can call np.random.seed(N),
# where N can be any number, to fix your output. It is left unseeded here for fun.
data = np.random.randn(1000)

# Create a univariate histogram using Matplotlib
# Note: on continuous data we can plot a histogram (more on this in the next chapter).
# plt.hist is the command for a histogram and takes the data as input.
# A histogram has several properties; one of them is bins. If you omit it,
# Matplotlib uses its default. The same goes for color and edgecolor:
# color is the fill color of the vertical bars (blue here) and
# edgecolor is the color of their boundary.
plt.hist(data, bins=20, color='blue', edgecolor='white')

# Set plot labels and title
plt.xlabel('Value')  # updating the label of x axis
plt.ylabel('Frequency')  # updating the label of y axis
plt.title('Univariate Histogram Example')  # updating the title of the plot

# Show the plot
plt.show()

# In this example:
'''
We generate random data using NumPy's np.random.randn() function.
The hist() function is used to create the histogram. We pass the data as the first argument.
The bins parameter specifies the number of bins to use in the histogram.
The color parameter specifies the color of the bars, and the edgecolor parameter sets the color of the edges.
xlabel(), ylabel(), and title() functions add labels and a title to the plot. show() displays the plot.
The resulting univariate histogram displays the frequency distribution of the random data in the
specified number of bins. You can customize the number of bins, colors, labels, and other plot
elements as needed.
'''



Figure 3.1: Univariate Histogram Example

From the next chapter onwards, extensive comments are written only where the logic is complex. You can use
the code above as a template to produce different types of chart: you just have to figure out what to plot
and what the equivalent command for it is. For example, replace plt.hist with another plotting command and
you have a different graph.

The chart above is called a histogram; it displays the frequency of values falling within each range. It is
a univariate chart because it represents a single variable, and in this case it shows the distribution of
that variable.
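As a rough illustration of the template idea, the sketch below keeps the same structure as the histogram example and swaps only the plotting command, here for plt.boxplot. The fixed seed and the title text are assumptions made for this sketch, not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)  # assumption: fixed seed so the sketch is repeatable
data = np.random.randn(1000)

# Same template as the histogram example; only the plotting command changes.
plt.boxplot(data)

# Set plot labels and title
plt.ylabel('Value')
plt.title('Univariate Box Plot Example')

# Show the plot
plt.show()
```

The data-generation, labeling, and plt.show() steps are unchanged; only the middle line decides which chart is drawn.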

Another example:

import matplotlib.pyplot as plt
import numpy as np

# Generate data
categories = ['A', 'B', 'C', 'D', 'E']
values = np.array([10, 25, 15, 30, 20])

# Create a univariate lollipop chart using Matplotlib
plt.stem(categories, values, basefmt=" ", markerfmt="o", linefmt="-")

# Set plot labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Univariate Lollipop Chart Example')

# Show the plot
plt.show()

# In this example:
'''
We define the categories and values for which we want to create the lollipop chart.
The stem() function is used to create the lollipop chart. We pass the categories as the x-values and
the values as the y-values.
The basefmt parameter is set to a blank string (a single space) to hide the baseline.
The markerfmt parameter specifies the marker style for the lollipops.
The linefmt parameter specifies the line style connecting the markers.
xlabel(), ylabel(), and title() functions add labels and a title to the plot. show() displays the plot.
The resulting univariate lollipop chart displays the values for each category using lollipop markers.
You can customize the marker styles, line styles, labels, and other plot elements as needed.
'''

Figure 3.2: Univariate Lollipop Chart Example
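The values drawn in a lollipop chart like this are often just frequencies derived from one raw categorical column. The sketch below assumes pandas is installed and uses a made-up column named category; it derives the counts with .value_counts() and plots them with numeric positions plus tick labels.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw data: a single categorical column, one row per observation
df = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A']})

# Derive the frequency of each category from that single column
counts = df['category'].value_counts().sort_index()

# Plot the derived frequencies as a lollipop chart
positions = list(range(len(counts)))
plt.stem(positions, counts.to_numpy(), basefmt=" ")
plt.xticks(positions, counts.index.tolist())
plt.xlabel('Categories')
plt.ylabel('Frequency')
plt.title('Lollipop Chart from One Raw Column')
plt.show()
```

Only the category column exists in the raw data; the frequency axis is computed from it.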



You might be pondering that the example above involves two variables: “categories” and “values,” and you’re
correct. However, this plot is still considered univariate because the information provided is singular even
with the involvement of two features. The reason for this is that one feature loses its meaning if the other
is removed. Additionally, in terms of the data itself, you would only have a single column named “categories,”
from which you can derive frequencies (using the .value_counts() method on a column). The resulting
frequencies are recorded under a distinct column name. In this case, the values play the role of the
frequencies for their corresponding categories.

I intended to highlight the importance of not hastily judging based solely on column count. Consider whether
these features can exist independently in a dataset or if one can be derived from raw information. If the
answer is “No” for the former and “Yes” for the latter, it signifies a univariate distribution. In this
instance, one of the columns containing raw information can generate the frequencies, solidifying its
classification as univariate. This type of plot also goes by a humorous moniker: the “lollipop” chart. As you
can observe, it can be employed with categorical data to create a univariate chart.

Hence, we’ve explored two examples, one involving continuous data and the other categorical, to generate
charts that represent univariate distributions.

Now, let us look at bivariate charts.

import matplotlib.pyplot as plt
import numpy as np

# Set the seed to 1 for replication
np.random.seed(1)

# Generate random bivariate data
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)

# Create a bivariate chart: a 2D histogram of x and y
plt.hist2d(x, y, bins=10, cmap='Blues')
plt.colorbar(label='Counts')

# Set plot labels and title
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Bivariate Chart with Histogram and Line Plot')

# Show the plot
plt.show()

'''
In this example:
We generate random bivariate data x and y using NumPy's np.random.randn() function.
The hist2d() function is used to create a 2D histogram. We pass the x and y values as the first and
second arguments, and specify the number of bins and color map.
The colorbar() function adds a color bar indicating the count of data points in each bin.
xlabel(), ylabel(), and title() functions add labels and a title to the plot.
show() displays the plot.
'''



The figures in this unit were generated using np.random without a fixed seed. Therefore, if you attempt to
recreate the plots, you won’t be able to replicate them exactly. To achieve reproducibility in your own runs,
set a seed value before generating the data, for example np.random.seed(1).
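A minimal sketch of what seeding does, using only NumPy: resetting the same seed restarts the random sequence, so the same numbers come out again. The variable names here are illustrative.

```python
import numpy as np

np.random.seed(1)
first = np.random.randn(5)

np.random.seed(1)              # resetting the same seed restarts the sequence
second = np.random.randn(5)

third = np.random.randn(5)     # no reseed: the sequence simply continues

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```

Any plot built from first and second would therefore look identical, which is the property you want when others need to reproduce your figures.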

In this scenario, we have two independent columns, x and y. Upon observing the graph, it’s evident that there
are numerous records for x values ranging from 0 to 0.5, and for y values between 0 and 1. This insight is
indicated by the presence of darker blue regions on the graph. Similarly, at the extreme values of x and y, the
frequency of occurrences is notably low. This observation is depicted by the presence of very light blue color
boxes (bars).

Figure 3.3: Bivariate Chart with Histogram
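The darker and lighter regions can also be checked numerically. The sketch below assumes the same seeded data as the listing and uses NumPy’s np.histogram2d to compute the per-bin counts, then looks up the densest bin. No expected numbers are quoted here, since they depend on the random draw.

```python
import numpy as np

np.random.seed(1)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)

# Counts per 2D bin, plus the bin edges along each axis
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Locate the bin holding the most points (the darkest cell in the chart)
i, j = np.unravel_index(np.argmax(counts), counts.shape)

print('Total points:', int(counts.sum()))
print('Densest bin: x in [%.2f, %.2f], y in [%.2f, %.2f], count = %d'
      % (x_edges[i], x_edges[i + 1], y_edges[j], y_edges[j + 1], counts[i, j]))
```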



A very simple example of a bivariate chart is a scatter plot between two continuous variables:

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)

# Generate bivariate data: evenly spaced x values and a noisy linear y
x = np.linspace(0, 10, 100)
y = 2 * x + np.random.randn(100)

# Create a bivariate chart with a scatter plot
plt.scatter(x, y, color='blue', label='Data Points')

# Set plot labels and title
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Bivariate Chart with scatter Plot')

# Show the plot
plt.show()

Figure 3.4: Bivariate Chart with Scatter Plot



In this graph, it’s evident that both x and y are increasing, indicating a positive association between them.
(However, it’s important to note that correlation does not imply causation.) This means that changes in x may
not necessarily cause changes in y, and vice versa. The observation we’re making is that when x increases,
we also typically observe an increase in y.
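The strength of such an association can be summarized with a number. As a sketch, assuming the same seeded data as the scatter plot listing, the Pearson correlation coefficient can be computed with np.corrcoef; a value close to +1 indicates a strong positive linear association.

```python
import numpy as np

np.random.seed(1)
x = np.linspace(0, 10, 100)
y = 2 * x + np.random.randn(100)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r(x, y)
r = np.corrcoef(x, y)[0, 1]
print('Pearson r = %.3f' % r)
```

As with the plot itself, a high r describes association only and says nothing about which variable, if either, drives the other.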

Now that we’ve comprehended univariate and bivariate charts, as well as created some basic plots using
dummy data, it’s crucial to understand the underlying grammar of charts. This understanding will help make
the code more meaningful and coherent to us.

Summary

Data types, in the context of computer programming and data analysis, refer to the classification or
categorization of data values to specify what kind of data a particular variable or object can hold. Data types
are essential for accurately representing and manipulating data in a computer program.

Choosing the appropriate data type for your variables or data structures is crucial for efficient memory usage
and accurate processing of data. Different programming languages may have variations in data types and
their behavior, so it’s essential to understand the data types available in the language you are working with
and how to use them effectively.



Unit 4

Matplotlib Grammar

Learning Objectives

By the end of this unit, you will be able to understand:

● Graph vocabulary

Introduction

Matplotlib is a widely used Python library for creating static, interactive, and animated visualizations in a
variety of formats. It provides a high-level interface for creating a wide range of plots and charts, allowing
users to present data in a visually appealing and informative manner. Matplotlib is highly customizable and
provides control over every aspect of the visualization, making it a versatile tool for both beginners and
experienced data scientists.

A useful way to reason about Matplotlib is through a grammar of graphics approach, inspired by the concepts
introduced by Leland Wilkinson in his book “The Grammar of Graphics.” This approach breaks down the process
of creating visualizations into fundamental components, each contributing to the final visual representation.

Graph Vocabulary

Understanding the Set of Terms Used in Graphs

The vocabulary of a graph encompasses a collection of terms and components that describe the various elements
and features of a visual representation. Grasping these terms is crucial for effectively interpreting and
conveying information through charts and graphs. Here are key terms frequently employed to elucidate the
vocabulary of a graph:

● Title: A concise descriptive statement summarizing the main intent or content of the graph.

● Axis Labels: Labels indicating what is being measured along the horizontal x-axis and the vertical y-axis.

● Data Points: Individual values or observations plotted on the graph.

● Legend: An area or box within the graph explaining the meaning of diverse colors or symbols used to
represent data series.

● X-Axis: The horizontal axis signifying the independent variable or categories.

● Y-Axis: The vertical axis denoting the dependent variable or values.

● Tick Marks: Small marks or lines on the axes signifying specific points or intervals on the scale.

● Tick Labels: Numeric or categorical labels accompanying tick marks, displaying values at those positions.

● Gridlines: Horizontal and vertical lines extending from tick marks to assist in reading the values.

● Bar: A rectangular representation utilized in bar charts to depict the magnitude of a variable.

● Line: A continuous line linking data points in a line chart, illustrating trends or relationships.

● Marker: A symbol or point employed to emphasize individual data points on a line or scatter plot.

● Axis Scale: The way values are mapped along an axis, such as linear, logarithmic, or other scales.

● Data Label: A numeric value or label connected with a data point, offering specific information.

● Axis Range: The range of values covered by an axis, spanning from the minimum to the maximum value.

● Annotations: Supplementary text, shapes, or lines incorporated into the graph to provide context or
explanations.

These terms constitute the fundamental building blocks of graph vocabulary, enabling individuals to
effectively comprehend and construct graphs in various domains, ranging from scientific research and
business analytics to education and communication.
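To connect these terms to code, here is a small sketch in the stateful pyplot style used so far. The monthly sales figures are made up for illustration, and each comment names the vocabulary term the line controls.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration only
months = [1, 2, 3, 4, 5, 6]
sales = [12, 15, 11, 18, 21, 19]

plt.plot(months, sales, marker='o', linestyle='-', label='Sales')  # line, markers, data points
plt.title('Monthly Sales')       # title
plt.xlabel('Month')              # x-axis label
plt.ylabel('Units sold')         # y-axis label
plt.xticks(months)               # tick marks and tick labels
plt.ylim(0, 25)                  # axis range
plt.grid(True)                   # gridlines
plt.legend()                     # legend
plt.annotate('Peak', xy=(5, 21), xytext=(3, 23),
             arrowprops=dict(arrowstyle='->'))  # annotation

ax = plt.gca()  # handle to the current axes, kept so the settings can be inspected
plt.show()
```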

Now, let’s examine a graph from the Matplotlib documentation and endeavor to correlate the above generic
anatomy of a graph with the specific components of a Matplotlib graph.

Figure 4.1: Anatomy of a Figure

Analyzing the above image, we can observe that while there are slight differences, most of the elements are
similar to what we discussed earlier. This similarity highlights the advantage of understanding Matplotlib, as
it closely aligns with the general principles of plotting grammar. Let’s delve into the components of the above
graph and briefly explain them.

The structure of a Matplotlib plot encompasses various constituents that collaboratively form a comprehensive
visual representation of data. Acquiring familiarity with these components is pivotal for crafting and tailoring
plots proficiently. Here’s an outline of some principal components within a Matplotlib plot:

● Figure: The highest-level container embracing all plot elements. It can accommodate one or multiple
subplots (axes). Think of the figure as the entire canvas on which the plot takes shape.

● Axes: Individual plotting areas within a figure, each with its own x-axis, y-axis, data, and graphical elements.
Unlike Cartesian axes, an “axes” object in Matplotlib represents a plotting region.

● Axis Labels: Labels on the x-axis and y-axis denoting the measured attributes. These labels provide
contextual information for the displayed data.

● Title: A descriptive heading above subplots or axes, encapsulating the main purpose or content of the plot.

● Data: The actual data points being plotted, visualized through various plot types like lines, bars, scatter
points, etc.

● Ticks: Marks or lines along axes indicating specific data values or intervals. Accompanied by tick labels,
which display corresponding values.

● Tick Labels: Numeric or categorical labels aligned with tick marks, aiding comprehension of data scale.

● Gridlines: Horizontal and vertical lines stemming from tick marks, facilitating data value interpretation and
relationship understanding.

● Legend: A section within the plot elucidating the significance of distinct colors, markers, or lines representing
data series.

● Color and Style: Different colors, markers, and line styles to differentiate data series or emphasize specific
data points.

● Annotations: Text or arrows incorporated into the plot to provide extra context or explanations about
specific data or trends.

● Spines: Lines connecting tick marks that frame the plot, outlining axis borders, and adjustable in visibility
and position.

● Background Color: The color of the plot area and entire figure canvas, customizable to match visualization
design.

● Subplots: When a figure includes multiple plots arranged in a grid, each individual plot in the grid is termed
a subplot. Subplots share the figure but possess distinct axes.
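The hierarchy of Figure, Axes, and subplots is easiest to see with explicit objects. The following is a minimal sketch, not the approach used in the stateful examples of this book so far: it creates one figure with two subplots via plt.subplots, sets a background color, and hides two spines. All data values and titles are made up for illustration.

```python
import matplotlib.pyplot as plt

# One figure (the canvas) holding two subplots, each with its own Axes
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
fig.suptitle('One Figure, Two Axes')
fig.set_facecolor('whitesmoke')   # background color of the whole figure

left, right = axes

# Left subplot: line with title, axis labels, and legend
left.plot([1, 2, 3], [2, 4, 3], color='blue', label='Series A')
left.set_title('Left subplot')
left.set_xlabel('X')
left.set_ylabel('Y')
left.legend()

# Right subplot: bars, with the top and right spines hidden
right.bar(['A', 'B', 'C'], [3, 5, 2], color='green')
right.set_title('Right subplot')
right.spines['top'].set_visible(False)
right.spines['right'].set_visible(False)

plt.show()
```

Each Axes carries its own title, labels, ticks, and spines, while the figure owns the overall canvas and background.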

Summary

The anatomy of a Matplotlib plot constitutes a hierarchy of components working harmoniously to present
data visually. Understanding these components is pivotal for effective data communication and manipulation
through visualizations.

Unit 5

Different Types of Graphs

Learning Objectives

By the end of this unit, you will be able to understand:

● Chart types

Introduction

Matplotlib offers an extensive array of chart types that cater to visualizing various data forms.

Chart Types

Here are some of the primary chart types achievable through Matplotlib:

● Line Chart: Depicts data points linked by straight lines, suitable for displaying trends over time or
continuous data.

● Bar Chart: Utilizes rectangular bars to represent data, commonly used for comparing categories (with or
without order).

● Horizontal Bar Chart: Similar to a bar chart, but with bars extending horizontally.

● Stacked Bar Chart: Variant of the bar chart with multiple bars stacked to showcase the contribution of
categories to a whole.

● Scatter Plot: Represents data points as dots on a two-dimensional plane, useful for depicting relationships
between two continuous variables.

● Histogram: Illustrates the distribution of a continuous variable by segmenting data into bins and plotting
frequency/count in each bin.

● Box Plot (Box-and-Whisker Plot): Visualizes the distribution of a dataset through summary statistics:
median, quartiles, and potential outliers.

● Violin Plot: Combines box plot and kernel density estimation for an enhanced data distribution
representation.

● Area Chart: Displays the area under a line plot, often showing cumulative data.

● Stacked Area Chart: Similar to an area chart, but with multiple data series stacked atop each other.

● Hexbin Plot: A two-dimensional histogram grouping data points into hexagonal bins, useful for visualizing
data density.

● 3D Plot: Represents three-dimensional data using surface plots, scatter plots, and wireframes.

● Error Bar Plot: Illustrates data point uncertainties by plotting error bars around each data point.

● Contour Plot: Visualizes three-dimensional data by showing contour lines on a two-dimensional plane.

These examples showcase only a fraction of the many chart types Matplotlib accommodates. Each chart type
serves unique purposes and is suitable for different data and visualization types. To effectively convey
information to your audience, you can choose an appropriate chart type based on your data and story.

While we won’t cover all of these chart types in our curriculum, we’ll delve into most of them. Selecting
the right chart for specific scenarios involves understanding data types, columns, and business context. We
simplify the process by considering the data types we discussed earlier. Keep in mind that the context can
alter data types, leading to entirely different plots. Now, let’s explore a generic rule to determine chart
choices. While this rule isn’t a universal solution, it provides a solid starting point for plotting.

Note that all the plots presented here are basic examples, and we’re not utilizing the stateless
(object-oriented) approach. However, in subsequent chapters, we’ll explore these graphs in more intricate
ways, employing the stateless approach. Also, remember to explicitly display the chart within your notebook
(Jupyter Notebook) every time, for example by calling plt.show().

Let’s Consider the First Chart Type:

Line Plot: When creating a line chart, you’ll typically require two columns: one representing time and the
other representing continuous data. An example could be stock market charts, where stock values are plotted
on the y-axis against time on the x-axis.

## code for line plot
import matplotlib.pyplot as plt

# Data
# Note: two numeric quantities are used here, but you can assume time on the x axis
x_values = [1, 2, 3, 4, 5]
y_values = [10, 15, 7, 12, 8]

# Create a line plot
plt.plot(x_values, y_values, marker='o', linestyle='-', color='b', label='Data Line')

# Add labels and title
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Line Plot Example')

# Add a legend
plt.legend()

# Show the plot
plt.show()

Figure 5.1: Line Plot Example
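Since a line plot usually has time on the x-axis, the sketch below shows the same kind of chart with actual dates. The dates and closing prices are made-up numbers for illustration; Matplotlib accepts datetime.date objects directly as x values.

```python
import datetime as dt
import matplotlib.pyplot as plt

# Hypothetical closing prices for five consecutive days (made-up numbers)
start = dt.date(2024, 1, 1)
dates = [start + dt.timedelta(days=i) for i in range(5)]
prices = [101.2, 102.8, 100.9, 103.5, 104.1]

# Create a line plot with dates on the x-axis
plt.plot(dates, prices, marker='o', linestyle='-', color='b', label='Closing price')

# Add labels and title
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Line Plot with Time on the X-axis')
plt.xticks(rotation=45)  # rotate date tick labels so they do not overlap

# Add a legend and show the plot
plt.legend()
plt.tight_layout()
plt.show()
```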


Scatter Plot: When generating a scatter plot or point plot, the primary consideration is whether the selected
columns are continuous. If both columns are continuous, you can effectively use a scatter plot. There exist
numerous variations of scatter plots, offering multiple options. Sometimes, with larger datasets, scatter plots
might appear congested. In such cases, techniques like jittering, hexbin plots, or sampling can help. Pair
plots, on the other hand, are grids of scatter plots over multiple columns of continuous data type, helping
visualize relationships between various pairs of columns.



import matplotlib.pyplot as plt

# Data
x_values = [1, 2, 3, 4, 5]
y_values = [10, 15, 7, 12, 8]

# Create a scatter plot
plt.scatter(x_values, y_values, color='b', marker='o', label='Data Points')

# Add labels and title
plt.xlabel('X-axis Label')
plt.ylabel('Y-axis Label')
plt.title('Scatter Plot Example')

# Add a legend
plt.legend()

# Show the plot
plt.show()

Figure 5.2: Scatter Plot Example
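For the congested case mentioned above, here is a sketch of two of the listed remedies, sampling and a hexbin plot. It uses synthetic data; the sample size of 500, the grid size of 30, and the seeded generator are arbitrary choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # assumption: seeded generator for repeatability
n = 20000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

# Option 1: plot a random sample instead of every point
idx = rng.choice(n, size=500, replace=False)
plt.figure()
plt.scatter(x[idx], y[idx], color='b', alpha=0.5)
plt.title('Scatter Plot of a 500-point Sample')
plt.show()

# Option 2: hexbin plot, where color encodes how many points fall in each hexagon
plt.figure()
plt.hexbin(x, y, gridsize=30, cmap='Blues')
plt.colorbar(label='Counts')
plt.title('Hexbin Plot of All Points')
plt.show()
```

Sampling keeps the familiar scatter look at the cost of dropping points, while the hexbin plot keeps every point but shows density rather than individual observations.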

Bar Chart: It’s important to note that a bar chart and a histogram are distinct visualizations, even though
these terms are sometimes used interchangeably in conversation. A bar chart that shows how often each
category occurs is commonly referred to as a frequency plot or, in some cases, a count plot. Typically, a
single column of categorical data serves as the input for creating such a bar chart. Depending on your
preference, you can adjust the orientation of the bars, making them either vertical or horizontal. There is
also a growing interest in a type of plot known as the ‘lollipop chart,’ which can be used instead of a
traditional bar chart.

import matplotlib.pyplot as plt

# Data
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [15, 8, 12, 6]

# Create a bar chart
plt.bar(categories, values, color='b', align='center')

# Add labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Bar Chart Example')

# Show the plot
plt.show()

Figure 5.3: Bar Chart Example
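Switching the orientation mentioned above is a one-line change. As a sketch with the same data, plt.barh draws horizontal bars; note that the axis labels swap, because the values now run along the x-axis.

```python
import matplotlib.pyplot as plt

# Data (same as the vertical bar chart example)
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [15, 8, 12, 6]

# Create a horizontal bar chart
bars = plt.barh(categories, values, color='b')

# Add labels and title (values are now on the x-axis)
plt.xlabel('Values')
plt.ylabel('Categories')
plt.title('Horizontal Bar Chart Example')

# Show the plot
plt.show()
```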



Histograms and Density Plots: These visualizations are highly valuable when dealing with columns of
continuous data type. It’s essential to understand that a histogram exclusively operates with continuous
data. If your dataset consists of categorical data, your objective is likely to create a bar chart or a count plot,
not a histogram. Histograms involve the concept of bins and widths. In essence, these bins encompass a
range of continuous data points, and when visualized, they communicate the frequency of those elements
within the specified bin range.

import matplotlib.pyplot as plt

# Data
data = [5, 7, 9, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 33, 35]

# Create a histogram
plt.hist(data, bins=5, color='b', edgecolor='black')

# Add labels and title
plt.xlabel('Value Range')
plt.ylabel('Frequency')
plt.title('Histogram Example')

# Show the plot
plt.show()

Figure 5.4: Histogram Example
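To make the idea of bins and widths concrete, the sketch below applies NumPy’s np.histogram to the same data with the same number of bins and prints the bin edges, the bin width, and the count in each bin.

```python
import numpy as np

data = [5, 7, 9, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 33, 35]

# Five equal-width bins spanning the data range (5 to 35)
counts, edges = np.histogram(data, bins=5)
widths = np.diff(edges)

print(edges.tolist())   # [5.0, 11.0, 17.0, 23.0, 29.0, 35.0]
print(widths.tolist())  # [6.0, 6.0, 6.0, 6.0, 6.0]
print(counts.tolist())  # [4, 2, 3, 2, 4]
```

Each bin covers a range of 6 units; for instance, the first bar counts the four values 5, 7, 9, and 10.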



Boxplots: Boxplots are valuable when the variable being summarized is continuous. It’s important to note that
you cannot generate a boxplot using categorical data alone. However, if you possess two columns, with at least
one of them being continuous, you can utilize a boxplot, for example by drawing one box per category of the
other column. When you have a column of continuous data, boxplots offer information that a histogram does not
show directly: they mark the median and quartiles and flag potential outliers. Consequently, boxplots are
occasionally the better choice.

import matplotlib.pyplot as plt

# Data
data = [15, 20, 22, 28, 30, 32, 33, 35, 40, 45, 50]

# Create a box plot
plt.boxplot(data, vert=False, notch=True, patch_artist=True)

# Add labels and title
plt.xlabel('Value')
plt.title('Box Plot Example')

# Show the plot
plt.show()

Figure 5.5: Box Plot Example
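The quartile and outlier information a boxplot encodes can be computed by hand. The sketch below uses the same data plus one artificial value, 90, added only to demonstrate an outlier. It applies the common 1.5 × IQR rule, which is also Matplotlib’s default whisker setting.

```python
import numpy as np

data = [15, 20, 22, 28, 30, 32, 33, 35, 40, 45, 50]
values = np.array(data + [90])  # 90 is an artificial outlier added for illustration

# Quartiles and interquartile range
q1, median, q3 = (float(v) for v in np.percentile(values, [25, 50, 75]))
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)     # 26.5 32.5 41.25
print(outliers.tolist())  # [90]
```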



Density Plots: A density plot serves as a graphical depiction of the distribution of continuous or numerical
values within a dataset. It offers an estimation of the probability density function of the underlying data,
facilitating the visualization of the distribution’s shape, peaks, and gaps. Density plots prove especially
beneficial for comprehending data distribution when a histogram might appear excessively detailed or noisy.

The fundamental element of a density plot is the kernel density estimate (KDE), which represents a smoothed
rendition of the data’s distribution. The KDE involves placing a kernel (typically a smoothing function like
Gaussian) at each data point and then aggregating these kernels to form a smooth curve that represents the
overall distribution.
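The kernel-placing idea can be sketched directly in NumPy. The code below is a hand-rolled illustration, not how any particular library implements it: the helper names, the synthetic data, and the bandwidth of 0.4 are assumptions chosen for the example.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def manual_kde(data, grid, bandwidth):
    """Average of Gaussian kernels centered on each data point."""
    data = np.asarray(data, dtype=float)
    # Rows: grid positions; columns: data points
    u = (grid[:, None] - data[None, :]) / bandwidth
    return gaussian_kernel(u).sum(axis=1) / (len(data) * bandwidth)

rng = np.random.default_rng(0)  # assumption: seeded synthetic data
data = rng.normal(size=200)
grid = np.linspace(-6, 6, 601)

density = manual_kde(data, grid, bandwidth=0.4)

# A density curve should enclose an area of about 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print('Area under the curve: %.3f' % area)
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths more aggressively.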

Density Plots Find Particular Utility When:

● You aim to visualize the underlying distribution of continuous data without being limited by bin sizes (as
seen in histograms).

● You intend to pinpoint modes (peaks) and identify any patterns with multiple modes in the data.

● You require a polished representation that offers a clearer view of the data’s distribution.

● You seek to compare distributions among various groups or datasets.

Notable variations of density plots encompass the violin plot, which blends the KDE with a box plot, and the
ridge plot, which stacks multiple KDEs to illustrate distributions across distinct categories.

Density plots can be generated using diverse visualization libraries, including Matplotlib and Seaborn in
Python. They provide insights into the general shape and characteristics of your data’s distribution.

## To plot a density plot, which represents the shape of a distribution, one can do this
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Generate random data
data = np.random.randn(1000)

# Create a density estimator
density = gaussian_kde(data)

# Create a density plot
x_vals = np.linspace(min(data), max(data), 100)
plt.plot(x_vals, density(x_vals), color='b')

# Add labels and title
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Density Plot Example')

# Show the plot
plt.show()

Figure 5.6: Density Plot Example

Violin Plot: Violin plots are exclusively applicable to continuous data and offer insights akin to both box plots
and density plots. Therefore, they can prove exceptionally valuable in certain scenarios. A violin plot not
only imparts information about summary statistics such as medians and quartiles but also showcases the
distribution’s shape, spread, and potential multimodal patterns. This amalgamation of features makes violin
plots a comprehensive tool for visualizing data distributions.

import matplotlib.pyplot as plt
import numpy as np

# Sample data for two categories
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(3, 1, 100)

# Combine the data
combined_data = [data1, data2]

# Create a violin plot
plt.violinplot(combined_data)

# Set x-axis labels
plt.xticks([1, 2], ['Category 1', 'Category 2'])

# Set plot labels and title
plt.xlabel("Categories")
plt.ylabel("Values")
plt.title("Violin Plot Example")

# Show the plot
plt.show()

Figure 5.7: Violin Plot Example
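By default, plt.violinplot draws the density shape and the extrema but not the median. As a small sketch, the showmedians option adds a median line so the plot carries one of the summary statistics mentioned above. The fixed seed is an assumption for repeatability.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)  # assumption: fixed seed for repeatability
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(3, 1, 100)

# showmedians adds a median line to each violin; showextrema marks min and max
parts = plt.violinplot([data1, data2], showmedians=True, showextrema=True)

plt.xticks([1, 2], ['Category 1', 'Category 2'])
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Violin Plot with Medians')
plt.show()
```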

All the aforementioned charts are fundamental representations, and numerous other charts are derived from
these basics. As the book progresses, you’ll encounter variations of these charts with additional options for
gaining a more comprehensive understanding of data. Later on, we’ll also explore how to generate multiple
plots efficiently without writing extensive code. For now, it’s important to recognize that understanding data
types can guide our initial choice of plots. Keep in mind that Matplotlib can also be used to draw geometric
shapes and trigonometric/algebraic functions. These concepts will become clearer in subsequent chapters.
In this chapter, the focus is on comprehending how basic plots can be inferred from data types, serving as a
foundation for further interpretation.



You might be curious about the absence of a pie chart in the current context. Can we not create pie charts
with Matplotlib? The answer is affirmative; however, pie charts might not always be the most suitable choice.
Let’s delve into a scenario where a pie chart might not be optimal. But before that, let’s grasp the essence of
a pie chart.

Pie Chart: A pie chart is a circular data visualization tool employed to exhibit the distribution of a categorical
dataset. It represents each category as a “slice” of the pie, with the size of each slice corresponding to the
proportion of the entire dataset that the category occupies. Key attributes of a pie chart include:

● Circular Shape: The entire pie denotes the entire dataset or 100% of the data. Each category’s slice is
proportional to the percentage it contributes to the whole.

● Categories: Categories encircle the circumference of the circle, each with its corresponding slice.

● Angles: The size of each slice is dictated by the angle it forms at the center of the circle. A larger proportion
corresponds to a larger angle.

● Labels: Typically, labels or percentages are positioned inside or outside each slice to provide supplementary
information.

● Legend: A legend is commonly used to associate category names with the colors of the slices.

Pie charts are most effective when you intend to communicate the relative sizes of different categories in
relation to a whole. However, they might be less effective when comparing precise values or when numerous
categories are involved. The circular shape can make it challenging to accurately assess differences in size.

import matplotlib.pyplot as plt

# Data
categories = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [25, 40, 15, 20]
colors = ['blue', 'green', 'orange', 'red']

# Create a pie chart
plt.pie(sizes, labels=categories, colors=colors, autopct='%1.1f%%', startangle=140)
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular

# Set plot title
plt.title('Pie Chart Example')

# Show the plot
plt.show()



Figure 5.8: Pie Chart Example

While pie charts may seem appealing, they are not commonly used in industries such as banking, at least in
my experience. This is due to concerns about financial sensitivity; people prefer more accurate representations,
and pie charts sometimes fall short in providing clear financial insights. Additionally, banking data tends to
be intricate, and for analyses involving multiple variables, other visualization techniques are often preferred.
Pie charts are more suitable for univariate representations, while for bivariate and multivariate graphs,
alternative methods are favored.

Despite their visual appeal and suitability for specific situations, pie charts have several limitations in the
realm of data science and data visualization:

● Difficulty in Comparing Slices: Accurately comparing the sizes of different slices in a pie chart can be
challenging, particularly when slices are similar in size or the chart contains numerous slices. Human
perception of angles and areas is not as precise as linear measurements, making it hard to determine
exact proportions.

● Limited Applicability to Few Categories: Pie charts work best when depicting a small number of categories.
As the number of categories increases, the chart can become cluttered and confusing, leading to difficulties
in distinguishing between slices.

● Inaccurate Data Representation: A slice’s size represents its proportion relative to the whole. However,
when precise data comparisons are required, people often make inaccurate estimations based on the
angle or area of slices.



● Challenges in Adding Detail: Pie charts do not handle labels or additional information well, especially with
small slices. Adding labels can clutter the chart, causing overlaps and decreased legibility.

● Unsuitability for Time Series Data: Pie charts are unsuitable for illustrating trends over time or any
sequential data. Time-related data is better presented using line charts or other time series visualizations.

● Unsuitability for Complex Data: Pie charts struggle to convey complex relationships or data with multiple
dimensions. They are confined to illustrating one-dimensional data distributions.

● Misleading 3D Effects: Adding a 3D effect to a pie chart can distort the perception of slice sizes, making
them appear larger or smaller than they truly are.

● Often Better Alternatives: In many instances, other visualization types such as bar charts, stacked bar
charts, and grouped bar charts provide clearer and more accurate data representations, rendering them
superior choices for data analysis.

In the realm of data science and effective data visualization, it’s crucial to thoughtfully select appropriate
chart types based on your data’s characteristics and the insights you aim to convey. While pie charts have
their place in specific contexts, they often fall short in conveying precise information or intricate relationships.
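To make the comparison problem concrete, the following minimal sketch draws the same values as a pie chart and as a bar chart. The category labels and share values are hypothetical and chosen only so that the slices are close in size.

```python
import matplotlib.pyplot as plt

# Hypothetical market-share figures (illustrative only, not real data)
labels = ["A", "B", "C", "D", "E"]
shares = [23, 21, 20, 19, 17]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: the slices are close in size, so ranking them by eye is hard
ax_pie.pie(shares, labels=labels, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

# Bar chart: the same values share a common baseline, so differences are visible
ax_bar.bar(labels, shares, color="steelblue")
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Bar chart")

fig.tight_layout()
plt.show()
```

Without the percentage annotations, the pie slices would be nearly impossible to rank, whereas the bar heights can be compared directly.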

Summary

In a general sense, a “graph” is a mathematical and data structure concept used to represent relationships
between objects. It consists of nodes (vertices) connected by edges (links or arcs). Graphs provide a powerful
way to visualize and analyze complex relationships and networks, making them a fundamental concept in
mathematics, computer science, and various real-world applications.



Unit 6
Different Types of Graphs:
Using Seaborn

Learning Objectives

By the end of this unit, you will be able to understand:
● Contour plot versus scatter pair plots

Introduction

We have already explored some fundamental types of graphs in Matplotlib. However, there are instances when generating graphs using Matplotlib can be challenging or visually unappealing. Sometimes, Seaborn’s plotting capabilities are more straightforward than those of Matplotlib. Here are some intriguing graphs that I believe you should be aware of, as they can be effortlessly plotted using Seaborn. Seaborn is a library built on top of Matplotlib, so the fundamental graph and chart terminology remains largely consistent. However, Seaborn’s plotting capabilities can vary in terms of simplicity.

Let’s begin with some simple exercises before delving into more complex examples.

Pairplot: This is specifically applicable to continuous columns, as a pairplot showcases correlations between variables or columns. Here’s a code snippet illustrating the creation of a pairplot:



import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
tips = sns.load_dataset("tips")

# Create a pair plot
sns.pairplot(tips, hue="sex", markers=["o", "s"], diag_kind="kde")

# Show the plot
plt.show()

Figure 6.1: PairPlot

If you were to attempt this using Matplotlib, it would be quite challenging. Generating a pairplot in Matplotlib
is a complex task. Additionally, Seaborn’s pairplot command comes in multiple variations. For instance,
using the same code, you can alter the parameter to obtain a pairplot with contours. We will delve into the
parameter later. For now, focus on comparing the differences between the plots in order to understand how
contour plots differ from scatter pair plots.
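To give a sense of why this is more work in plain Matplotlib, here is a rough sketch of a hand-built pair grid. It uses synthetic columns named a, b and c (an assumption for illustration, not the tips data) and omits the hue colouring and legend that Seaborn adds automatically.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data standing in for three continuous columns (an assumption)
rng = np.random.default_rng(0)
columns = {
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "c": rng.normal(size=100),
}
names = list(columns)
n = len(names)

fig, axes = plt.subplots(n, n, figsize=(7, 7))
for row, y_name in enumerate(names):
    for col, x_name in enumerate(names):
        ax = axes[row, col]
        if row == col:
            # Diagonal: distribution of a single column
            ax.hist(columns[x_name], bins=15, color="gray")
        else:
            # Off-diagonal: scatter of one column against another
            ax.scatter(columns[x_name], columns[y_name], s=8)
        if row == n - 1:
            ax.set_xlabel(x_name)
        if col == 0:
            ax.set_ylabel(y_name)

fig.tight_layout()
plt.show()
```

Every panel has to be positioned and labelled by hand here, which is the bookkeeping that sns.pairplot takes care of in a single call.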



Contour Plots Versus Scatter Pair Plots

In essence, a contour plot, also referred to as a level plot or isoline plot, is a technique used to visualize two-
dimensional representations of three-dimensional function surfaces. Contour plots are commonly employed
to illustrate functions involving two continuous variables.

In a contour plot, the x and y axes represent the two input variables, while the contours represent the function’s
values at various levels. Each contour line connects points with the same function value. These contour lines
unveil patterns, trends, and regions of significance within the data.

The spacing between contour lines signifies the rate of change of the function. Closer contour lines denote
rapid changes, while wider spacing indicates more gradual changes.

Contour plots prove particularly valuable when visualizing functions with intricate shapes or when identifying
crucial points such as saddle points, minima, maxima, and regions with consistent function values.
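The idea of contour lines can be sketched with plain Matplotlib before returning to Seaborn. The function z = x^2 - y^2 below is chosen only for illustration because it has a saddle point at the origin; it is not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Grid of the two input variables
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)

# Example function with a saddle point at the origin (chosen for illustration)
Z = X**2 - Y**2

fig, ax = plt.subplots()
contours = ax.contour(X, Y, Z, levels=10)
ax.clabel(contours, inline=True, fontsize=8)  # label each line with its value
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Contour lines of z = x^2 - y^2")
plt.show()
```

Lines that sit close together mark regions where the function changes quickly, and widely spaced lines mark flatter regions.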

# Create a pair plot
sns.pairplot(tips, hue="sex", markers=["o", "s"], kind="kde")

# Show the plot
plt.show()

Figure 6.2: HeatMap



Heatmap: A heatmap is a form of data visualization that employs colors to depict values within a matrix-
like structure. Seaborn offers a user-friendly function to generate heatmaps, which are useful for illustrating
relationships between variables or showcasing correlation matrices. This type of plot works well with
continuous data. Typically, heatmaps are utilized in scenarios involving variable correlation. However, in the
provided example, we are utilizing counts to create a plot with categorical columns. Consequently, this type
of plot can be employed for both continuous and categorical variables. Here’s an illustrative example of
producing a heatmap using Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
flights = sns.load_dataset("flights")

# Reshape to a month-by-year table; pivot keeps the passenger counts as integers,
# which the integer format fmt="d" below requires
pivot_flights = flights.pivot(index="month", columns="year", values="passengers")

# Create a heatmap
sns.heatmap(pivot_flights, cmap="YlGnBu", annot=True, fmt="d")

# Set plot title
plt.title("Passenger Counts Heatmap")

# Show the plot
plt.show()


Figure 6.3: Passenger Counts HeatMap
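Since heatmaps are typically used for correlation matrices, here is a minimal sketch of that use. The columns height, weight and noise are synthetic and exist only for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic continuous columns (illustrative; not from any real dataset)
rng = np.random.default_rng(42)
height = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height + rng.normal(0, 8, size=200),  # related to height
    "noise": rng.normal(size=200),                         # unrelated column
})

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]

# Fixing vmin and vmax keeps the colour scale comparable across datasets
sns.heatmap(corr, cmap="YlGnBu", annot=True, fmt=".2f", vmin=-1, vmax=1)
plt.title("Correlation Heatmap (synthetic data)")
plt.show()
```

Here the values are floats, so the annotation format is ".2f" rather than the integer format "d" used for counts.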



Beeswarm Plot: A beeswarm plot, closely related to a strip plot (a scatter plot with jitter), is a data visualization that arranges individual data points along a single axis while preventing them from overlapping. As with a boxplot, one of the columns is categorical and the other is continuous. Seaborn offers a function to generate such plots. Here’s an illustrative example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
tips = sns.load_dataset("tips")

# Create a beeswarm plot
sns.set(style="whitegrid")
sns.swarmplot(x="day", y="total_bill", data=tips)

# Set plot title
plt.title("Beeswarm Plot Example")

# Show the plot
plt.show()

Figure 6.4: Beeswarm Plot Example
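A strip plot and a swarm plot are easy to confuse, so the sketch below draws both from the same data. The group and value columns are synthetic and used only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: one categorical and one continuous column (an assumption)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 40),
    "value": rng.normal(10, 2, size=80),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

# Strip plot: points are shifted sideways at random, so some may overlap
sns.stripplot(x="group", y="value", data=df, jitter=True, ax=ax_strip)
ax_strip.set_title("stripplot (random jitter)")

# Swarm plot: points are placed so that they do not overlap
sns.swarmplot(x="group", y="value", data=df, ax=ax_swarm)
ax_swarm.set_title("swarmplot (non-overlapping)")

fig.tight_layout()
plt.show()
```

The swarm layout reveals the shape of each distribution more clearly, but it can become slow or crowded with many points, where a strip plot may be the more practical choice.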



Another variation of this plot involves three variables. For example, consider plotting ‘total_bill’ against ‘day’ separately for each value of the ‘smoker’ flag. Three columns are involved: ‘total_bill’, ‘day’, and ‘smoker’, and the third, categorical variable is supplied through the ‘hue’ parameter. As the previous plots show, the ‘hue’ parameter in seaborn introduces an additional variable. This is feasible with matplotlib as well, although until now we’ve abstained from discussing a third dimension in plots. Going forward, whenever we wish to show how a plot splits across the levels of a categorical variable, we can use ‘hue’.

Is it possible to use a continuous column in ‘hue’? Technically the software permits it, but the output is often hard to read, so we generally avoid placing continuous values in the ‘hue’ parameter of seaborn.

Achieving the same effect in matplotlib can be tricky. You can use the ‘color’ parameter to get something similar to seaborn’s ‘hue’, but creating a beeswarm plot in matplotlib is challenging. The seaborn code comes first, followed by matplotlib code that mimics the same structure; bear in mind that seaborn is generally easier than matplotlib in this case.

import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
tips = sns.load_dataset("tips")

# Create a beeswarm plot
sns.set(style="whitegrid")
sns.swarmplot(x="day", y="total_bill", data=tips, hue="smoker")

# Set plot title
plt.title("Beeswarm Plot Example")

# Show the plot
plt.show()

Figure 6.5: Beeswarm Plot Example



import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns  # used only to load the example dataset

# Load example dataset
tips = sns.load_dataset("tips")

# Create a beeswarm-style plot using Matplotlib
categories = np.unique(tips["day"])
jitter = 0.15

for i, category in enumerate(categories):
    subset = tips[tips["day"] == category]
    # Spread the points horizontally around position i
    x = np.random.normal(i, jitter, size=len(subset))
    # Blue for smokers, red for non-smokers
    colors = ["blue" if flag == "Yes" else "red" for flag in subset["smoker"]]
    plt.scatter(x, subset["total_bill"], label=category, alpha=0.7, c=colors)

plt.xticks(np.arange(len(categories)), categories)
plt.xlabel("Day")
plt.ylabel("Total Bill")
plt.title("Beeswarm Plot Example")
plt.legend()
plt.show()

Figure 6.6: Beeswarm Plot Example



As you can observe, seaborn is simpler to learn for these examples and demands less code. Therefore, we shouldn’t disregard the prowess of seaborn in generating intricate plots with reduced code complexity. The code written using matplotlib is more challenging to comprehend, the result lacks visual appeal, and the resulting plot also differs slightly from seaborn’s output. In such situations, opting for seaborn is advisable.

Countplot: These are essentially bar plots of category counts, and creating a countplot is easier than using the bar option in matplotlib. We can observe this difference by examining examples with both the countplot and bar options.

import seaborn as sns
import matplotlib.pyplot as plt

# Load example dataset
tips = sns.load_dataset("tips")

# Create a countplot
sns.set(style="darkgrid")
sns.countplot(x="day", data=tips)

# Set plot title
plt.title("Countplot Example")

# Show the plot
plt.show()

## matplotlib bar example
import matplotlib.pyplot as plt
import seaborn as sns  # used only to load the example dataset

# Load example dataset
tips = sns.load_dataset("tips")

# Compute category counts
day_counts = tips["day"].value_counts()

# Create a countplot using Matplotlib
plt.bar(day_counts.index, day_counts.values, color='b')

# Set plot labels and title
plt.xlabel('Day')
plt.ylabel('Count')
plt.title('Countplot Equivalent (Matplotlib)')

# Show the plot
plt.show()



Figure 6.7: Countplot Plot Example

Figure 6.8: Countplot Equivalent Example



We can observe that both matplotlib and seaborn are capable of plotting bar charts. However, the outputs are
slightly different. Seaborn’s defaults are more refined in terms of the distinction and order of days, compared
to matplotlib’s output (which requires sorting before plotting to match seaborn’s order). Essentially, there are
cases where seaborn is a significantly superior choice. It’s worth noting that this doesn’t imply matplotlib is
inadequate; it simply underscores the importance of understanding when to opt for seaborn or matplotlib.
The examples provided here serve to elucidate the mechanisms behind these visualizations.
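The sorting step mentioned above can be sketched as follows. The small days series is hypothetical, and reindex is one of several ways to impose an order; it is shown here as an assumption about how one might do it, not as the only approach.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical observations of a "day" column (illustrative only)
days = pd.Series(["Sat", "Sun", "Thur", "Sat", "Fri", "Sun", "Sat", "Thur", "Sun", "Sat"])

# value_counts() sorts by frequency, not by calendar order
day_counts = days.value_counts()

# Reindex to impose the order we want on the x axis
order = ["Thur", "Fri", "Sat", "Sun"]
ordered_counts = day_counts.reindex(order)

fig, ax = plt.subplots()
ax.bar(ordered_counts.index, ordered_counts.values, color="b")
ax.set_xlabel("Day")
ax.set_ylabel("Count")
plt.show()
```

Seaborn's countplot also accepts an order argument, so the same list could be passed there when the default ordering is not what you want.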

In this chapter, we’ve explored several new types of plots and recognized that seaborn is often a more
advantageous option compared to matplotlib, especially given its improved default visualizations. It’s
important to acknowledge that seaborn is built upon matplotlib, so we can also customize seaborn charts, a
topic we’ll delve into later in the course.

Summary

Seaborn is a Python data visualization library built on top of Matplotlib that specializes in creating informative
and aesthetically pleasing statistical graphics. It simplifies the process of creating various types of statistical
plots and enhances the visual appeal of your data visualizations. In summary, Seaborn is a valuable tool for
data visualization in Python, especially when you want to create informative, aesthetically pleasing statistical
plots with ease. It complements Matplotlib and integrates seamlessly with Pandas, making it a popular
choice for data analysts and scientists.



Unit 7
Using Pandas Plotting
Mechanism

Learning Objectives

By the end of this unit, you will be able to understand:
● Internal plotting mechanism of pandas

Introduction

Up until now, we’ve examined how matplotlib operates with single-column structures, enabling us to plot based on individual vectors or arrays of values.

Internal Plotting Mechanism of Pandas

We will now explore how matplotlib supports the internal plotting mechanism of pandas. I must acknowledge that this mechanism is often convenient for quickly comprehending distributions and other insights. However, it may not be effective in illustrating complex relationships between features. Nonetheless, understanding this approach can be immensely helpful when time is limited and we’re facing tight deadlines. So, let’s proceed.

We will use the ‘palmer_penguins.csv’ dataset to demonstrate this example. As you can observe in the code, the plot method is invoked on the series returned by penguins['bill_length_mm']. Thus, to obtain the histogram, we select a column as a series and then use .plot.hist to generate the histogram plot.



import pandas as pd
import matplotlib.pyplot as plt

# Load the Palmer Penguins dataset
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

# Create a histogram using Pandas plot
penguins["bill_length_mm"].plot.hist(bins=20, edgecolor='black', color='blue')

# Set plot labels and title
plt.xlabel('Bill Length (mm)')
plt.ylabel('Frequency')
plt.title('Histogram of Bill Length in Palmer Penguins')

# Show the plot
plt.show()

If you replace .plot.hist() with .plot(kind='kde'), you receive a density estimate of the continuous column instead (pandas relies on SciPy for this plot). Try it; you will find that it is really easy to tweak the call and see what the defaults are and how they work.

If you explore .plot a little more and run [item for item in dir(penguins["bill_length_mm"].plot) if not item.startswith('_')], you will see the plot types supported by .plot:

['area', 'bar', 'barh', 'box', 'density', 'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter']
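Each of these names can be used either as a string passed to kind or as a method on the .plot accessor. The sketch below draws the same histogram both ways on a synthetic series (an assumption standing in for a penguin column) so the two styles can be compared.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a continuous column (an assumption, not penguin data)
values = pd.Series(np.random.default_rng(0).normal(44, 5, size=200), name="length_mm")

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Style 1: the plot type is passed as a string
values.plot(kind="hist", bins=20, ax=ax_left, title='plot(kind="hist")')

# Style 2: the plot type is a method on the .plot accessor
values.plot.hist(bins=20, ax=ax_right, title="plot.hist()")

fig.tight_layout()
plt.show()
```

Both panels show the same bars; the difference is purely in how the call reads.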



We have already seen hist. We also drew a 'kde' plot, and the same plot can be produced with penguins["bill_length_mm"].plot.kde(). You may be wondering which style to pick. Use the one that is more explicit: .plot.kde states directly that we are plotting a kernel density estimate (KDE), so I would generally use this style. We can extend the idea to draw a fairly complex plot, like the one below, with a single chained plotting expression.

Figure 7.1: Density Graph

You are probably thinking that this plot looks good but carries no description. Don’t worry: we can add a legend with plt.legend() and a title with plt.title(). The code below makes the plot much more readable, and, as promised, the plotting itself is essentially a one-liner. The idea is to split the data into groups using .groupby, keeping only the columns needed for the plot: the categorical column island and the continuous column bill_length_mm. You can make the chain more readable by wrapping it in parentheses; the parentheses do not change the result, and they let you put each step on its own line with a comment.

(penguins                                  # dataset
    .filter(['island', 'bill_length_mm'])  # selecting columns
    .groupby('island')['bill_length_mm']   # group by island
    .plot.kde())                           # plotting the kde plot
plt.legend()  # adding legend
plt.title('Distribution of bill length of penguins across islands')  # add title
plt.show()  # show the plot



Figure 7.2: Distribution of Bill Length

We can take this idea in a different direction by using other plot types. For example, replacing plot.kde() with plot.hist(alpha=0.5) gives overlaid histograms for the three groups. Interestingly, you can do the same with a categorical column, but it takes a little more data processing: to generate a count plot (bar plot), you first need to count and then plot. To achieve this, we use groupby to count the rows in each group, then pivot to convert the data from long to wide form, and finally we plot.

## assuming you have already imported matplotlib.pyplot as plt
(penguins                                            # dataset
    .filter(['island', 'sex'])                       # selecting columns
    .groupby(['island', 'sex'])
    .size()                                          # counting rows per group
    .reset_index()                                   # back to a dataframe; the count column is named 0
    .pivot(index='island', columns='sex', values=0)  # pivoting to a row-column structure
    .plot.bar())
plt.legend()  # adding legend
plt.title('Distribution of gender of penguins across islands')  # add title
plt.show()  # show the plot



This is how the plot looks after we run through the previous steps:

Figure 7.3: Distribution of Gender of Penguins
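As an alternative to counting and pivoting by hand, pandas offers pd.crosstab, which builds the same kind of wide count table in one call. This swaps in a different technique from the one used above; the small table below is hypothetical and only illustrates the shape of the result.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small hypothetical table with two categorical columns (illustrative only)
df = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Dream", "Torgersen"],
    "sex": ["male", "female", "male", "male", "female", "female"],
})

# Rows are islands, columns are sexes, cells are counts
counts = pd.crosstab(df["island"], df["sex"])

ax = counts.plot.bar()
ax.set_title("Counts by island and sex (hypothetical data)")
plt.show()
```

Combinations that never occur, such as male penguins on Torgersen in this toy table, appear as zero rather than as missing values.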

As you can observe, with pandas as well, we can generate plots without directly calling matplotlib. However,
you still need to use matplotlib for generating legends and handling other aesthetic aspects. In general, you
can create plots using pandas, and the process is quite intuitive and often quicker. This is the concept behind
the accessor – it provides an interface to create straightforward plots without unnecessary complexity. I
hope this provides you with a better understanding of the pandas plotting mechanism.
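The link between the two libraries is that a pandas plotting call returns a Matplotlib Axes object, which can then be customized with the usual Matplotlib methods. A minimal sketch, using a short hypothetical series of measurements:

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements (illustrative only)
s = pd.Series([39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 41.1, 42.0], name="bill_length_mm")

# The pandas call returns a Matplotlib Axes object
ax = s.plot.hist(bins=4, edgecolor="black")
print(isinstance(ax, matplotlib.axes.Axes))

# From here on, ordinary Matplotlib methods apply
ax.set_xlabel("Bill Length (mm)")
ax.set_title("Histogram drawn by pandas, labelled via Matplotlib")
plt.show()
```

Keeping a reference to the returned axes is often more predictable than relying on plt.xlabel and plt.title to find the right plot.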

Summary

Pandas, a popular Python library for data manipulation and analysis, includes a convenient plotting mechanism
that simplifies the creation of basic plots directly from DataFrame and Series objects. In summary, Pandas’
plotting mechanism offers a convenient and user-friendly way to create basic data visualizations directly
from your data stored in DataFrames and Series. It’s a valuable tool for quick exploratory data analysis and
visualization tasks in data science and data analysis workflows.



Unit 8

Customizing Matplotlib Graphs

Learning Objectives

By the end of this unit, you will be able to understand:
● Plot example

Introduction

Until now, we have been content with utilizing the stateful approach and have explored a modest level of customization through it. However, let’s now delve into more intricate examples where the stateless approach proves to be more advantageous.

Plot Example

To adopt a completely stateless approach in Matplotlib, it’s recommended to generate your plots using the plt.subplots() function. This creates both figure and axes objects, which you can then utilize to plot your data.

Here is one example:

import numpy as np
import matplotlib.pyplot as plt

# Generating reproducible data
np.random.seed(1)
data = np.random.randn(100)

# To create a stateless plot, we need a figure and an axes so that we can make
# changes to the axes; plt.subplots gives us both (we will change some options
# within subplots later)
fig, ax = plt.subplots()

# ax.hist returns the bin counts (n), the bin edges (bins) and the drawn bars (patches).
# They are not important for now, but inspecting them shows the metadata that
# matplotlib charts carry
n, bins, patches = ax.hist(data, bins=10, edgecolor='white', color='brown', alpha=0.5)

# Adding x and y labels; note that we use set_xlabel (stateless approach),
# not xlabel (stateful approach)
ax.set_xlabel('bins')
ax.set_ylabel('counts')

# Similarly, we have set_title instead of title
ax.set_title('histogram using stateless approach')
plt.show()

We can see several differences from the stateful approach. One is that most command names start with .set_, which suggests that the changes happen on the object itself. Another is that we track the plot through a single object, ax: whatever method we apply to it is reflected in the plot. To me this is far more satisfactory than working through plt., because plt. alters the graph without showing us which object it is tracking, which can be confusing and a source of bugs.

Here is the output:

Figure 8.1: Histogram
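For readers curious about what ax.hist returns and how plt. keeps track of the plot, the following sketch inspects both. It repeats the setup of the example above and adds only the print calls.

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1)
data = np.random.randn(100)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(data, bins=10)

print(len(n))        # number of bars: one count per bin
print(len(bins))     # number of bin edges: one more than the number of bars
print(int(n.sum()))  # the counts add up to the number of observations

# pyplot keeps track of a "current" axes behind the scenes;
# here it is the same object we already hold in `ax`
print(plt.gca() is ax)
plt.show()
```

With a single plot the hidden current axes and ax coincide, but once several figures or subplots exist, holding explicit references avoids any doubt about which one a command will modify.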



Let’s make this task more intricate by plotting the same kind of graph for multiple continuous columns within the dataset. To achieve this, we’ll use the ‘palmer_penguins.csv’ dataset once more. However, before we proceed, let’s perform some data preprocessing and then move on to generating plots.

In the following code, we’ve processed the data and generated two objects: one for numeric values and another for categorical values.

# reading from a local file, palmer_penguins.csv
data = pd.read_csv('palmer_penguins.csv')  # raw data
# Note: you can download this data using the previous pd.read_csv call:
# pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")
# Converting each column to lower case is left for the reader to do on their own

# Replacing missing values with the mean; always use vectorized pandas methods
# rather than recreating something loop based
num_cols = list(data.select_dtypes(['float', 'int']).columns)  # numeric columns
data[num_cols] = data[num_cols].fillna(data[num_cols].mean())

# Objects in pandas are tricky (they can contain multiple different data types)
# Transforming the columns and imputation based on pandas methods
data['sex'] = data['sex'].fillna(data['sex'].mode()[0])
# Assigning the column directly so that year really becomes categorical
data['year'] = pd.Categorical(data['year'], categories=[2007, 2008, 2009], ordered=True)
data.drop('rowid', inplace=True, axis=1)  # dropping unnecessary column

object_cols = list(data.select_dtypes('object').columns)  # categorical columns
num_cols = list(data.select_dtypes(['float', 'int']).columns)  # numeric columns

Now, let’s proceed to process each of the numeric columns one by one. To achieve this, we can utilize a loop. If you’re new to Python, it might be helpful to read up on loops. Loops are employed when we need to perform a repetitive task with slight variations. Essentially, loops help us avoid extensive duplication in our code.

Let’s employ the power of loops to generate histograms for all the numeric columns in the dataset.

# Generating name labels from the column names, e.g. bill_length_mm -> Bill Length Mm
names = [' '.join([i.title() for i in item.split('_')]) for item in num_cols]

# Number of columns and rows of the grid where our plots are going to be placed;
# since we have 4 columns the grid size is 2x2, with 6 columns we could have used 3x2 etc.
cols, rows = 2, 2

# This is important: it is the setup for the stateless approach
# Note: figsize specifies the dimensions (width and height) of a figure in inches;
# plt.subplots() accepts the same parameter. Here the figure is 10x10
fig = plt.figure(figsize=(10, 10))

# enumerate yields a sequence number for each item, so i contains the plot
# number and col contains the name of the column to be plotted
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    # Here we use seaborn; it doesn't matter whether you choose matplotlib or
    # seaborn, and both are shown here. In the seaborn interface we pass the
    # axes through the ax parameter
    sns.histplot(x=data[col], ax=ax)
    ax.set_xlabel(names[i])  # putting the names on the x-axis label of each plot

fig.tight_layout()  # spacing between plots so that names don't run together
plt.show()

Now we can do the same with matplotlib using the code below. The result is the same kind of chart, although the aesthetics are different.

cols, rows = 2, 2
fig = plt.figure(figsize=(10, 10))
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    # Here is the change: we are using ax.hist instead of plt.hist,
    # which is the stateless approach in practice
    ax.hist(x=data[col], color='brown', edgecolor='white')
    ax.set_xlabel(names[i])

fig.tight_layout()
plt.show()

Thus, when presented with a dataset and the task of plotting a histogram for every numeric column, a for loop with a stateless approach can effectively accomplish this task.

Now, let’s delve into a few more examples. When it comes to plotting boxplots for numeric columns, you can employ the same loop structure, with a simple adjustment to the plotting command to achieve the desired outcome.

# This is exactly the same code; the only difference is the line where we call sns.boxplot.
# Similarly, we can use ax.boxplot() if we want to work with matplotlib;
# that code follows in the next box
cols = 2
rows = 2
fig = plt.figure(figsize=(cols*3, rows*3))
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    sns.boxplot(x=data[col], ax=ax)
    ax.set_xlabel(names[i])

fig.tight_layout()
plt.show()

# matplotlib code
# This code is more involved because the parameters of boxplot are not that simple; however, we can do a lot
# with them. Try changing the colors at your end and see how that goes!
cols = 2
rows = 2
fig = plt.figure(figsize=(cols*3, rows*3))
for i, col in enumerate(num_cols):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.boxplot(x=data[col], patch_artist=True,
               boxprops=dict(facecolor='yellow', color='red'),
               whiskerprops=dict(color='blue'))
    ax.set_xlabel(names[i])
    ax.set_title(f'Box plot {names[i]}\n using matplotlib')

fig.tight_layout()
plt.show()



Here are the outputs of both versions:

Figure 8.2: Output Chart



Figure 8.3: Output Chart
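The grid dimensions in the loops above were written by hand. As a small sketch of how they could be derived instead, the code below computes the number of rows from the number of plots. The six synthetic columns and the choice of two plots per row are assumptions made for illustration.

```python
import math
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical set of six numeric columns (synthetic data for illustration)
rng = np.random.default_rng(0)
columns = {f"col_{k}": rng.normal(size=50) for k in range(6)}

n_plots = len(columns)
cols = 2
rows = math.ceil(n_plots / cols)  # enough rows to hold every plot

fig = plt.figure(figsize=(cols * 3, rows * 3))
for i, (name, values) in enumerate(columns.items()):
    ax = fig.add_subplot(rows, cols, i + 1)  # subplot positions start at 1
    ax.hist(values, color="brown", edgecolor="white")
    ax.set_xlabel(name)

fig.tight_layout()
plt.show()
```

With an odd number of plots the last grid cell simply stays empty, so the same loop works without further changes.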

Please be informed that we will utilize our sessions to comprehensively interpret these graphs. At this point,
within this book, we will primarily focus on the techniques involved in generating these visualizations.

Let’s progress and employ the same approach to plot bar charts for various columns of character type. It’s
worth noting that this technique can be extended to various types of graphs you create.

lst = data.select_dtypes(include='object').columns  # collecting all the object (text) columns
names = [item.title() for item in lst]  # converting them all into title case

# Note that this is the same structure; apart from sns.countplot nothing is different
cols = 3
rows = 3
fig = plt.figure(figsize=(cols*3, rows*3))
for i, col in enumerate(lst):
    ax = fig.add_subplot(rows, cols, i + 1)
    sns.countplot(x=data[col], ax=ax, palette=["yellow", "green", "orange"])
    ax.set_xlabel(names[i])

fig.tight_layout()
plt.show()

Figure 8.4: Output Chart

Now we can try the same with matplotlib to see whether the stateless approach works there as well. To make it work with matplotlib, we can use the code below:

cols = 3
rows = 3
fig = plt.figure(figsize=(cols*3, rows*3))
for i, col in enumerate(lst):
    ax = fig.add_subplot(rows, cols, i + 1)
    temp = data[col].value_counts()  # a temporary object holding the frequency of each category
    # We put the counts in height and the labels in x, and we are done;
    # the rest of the code is very similar
    ax.bar(x=temp.index, height=temp.values, color=['red', 'blue', 'green'])
    ax.set_xlabel(names[i])

fig.tight_layout()
plt.show()



Figure 8.5: Output Chart

So, we can see that a similar approach can be used with categorical data to create charts. Let us switch gears and look at some difficult-looking graphs and understand how they work.

If you ever want to save graphs or interact with them (zoom in, zoom out, etc.), you can use this magic command in Jupyter Notebook or JupyterLab:

%matplotlib qt
# Note the above magic command, which says `qt`
# Note: use %matplotlib inline to stop this behavior
cols = 3
rows = 3
fig = plt.figure(figsize=(cols*3, rows*3))
for i, col in enumerate(lst):
    ax = fig.add_subplot(rows, cols, i + 1)
    sns.countplot(x=data[col], ax=ax, palette=["yellow", "green", "orange"])
    ax.set_xlabel(names[i])

fig.tight_layout()
plt.show()

What does %matplotlib qt do?

In a Jupyter Notebook environment, the %matplotlib qt magic command is employed to switch the backend of the Matplotlib library to the Qt backend. This enables the utilization of interactive plots in a separate window, detached from the notebook interface. The Qt backend provides a graphical user interface (GUI) for engaging with the visualizations.

Here’s what %matplotlib qt accomplishes and why it finds use:

● Backend Selection: Matplotlib supports a range of graphical backends that dictate how plots are displayed. By default, the “inline” backend is commonly used in Jupyter Notebook, rendering plots directly within the notebook. The %matplotlib qt command switches the backend to Qt, facilitating interactive plot presentation.

● Interactive Plots: Upon employing %matplotlib qt, Matplotlib-generated plots appear in an external, interactive window, often referred to as a GUI window. This allows dynamic interactions like panning, zooming, and other adjustments, enhancing the experience compared to static inline plots.

● Separate Window: The interactive plot window is dissociated from the notebook interface, granting the freedom to move it across your desktop or monitor while you continue working within the notebook.

● Resource Usage: It’s important to note that using the Qt backend can potentially consume more
system resources than the inline backend, especially if you have multiple interactive plot windows open
simultaneously. Therefore, it’s advisable to use this approach selectively, specifically when interactive
exploration of plots is required.
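The %matplotlib magic only exists inside IPython and Jupyter. In a plain Python script, the backend can be inspected and selected through the matplotlib module itself, as in the sketch below. "Agg" is used here as an assumption because it is a non-interactive, file-only backend that needs no GUI toolkit; the Qt backend additionally requires Qt bindings to be installed.

```python
import matplotlib

# Report which backend is currently active
print(matplotlib.get_backend())

# Outside a notebook, a backend can be selected in code before plotting.
# "Agg" is a non-interactive backend that only renders to files; it is used
# here because it needs no GUI toolkit, unlike the Qt backend.
matplotlib.use("Agg")
print(matplotlib.get_backend())
```

The first printed name depends on your environment, so it will differ between a notebook, a desktop session, and a server without a display.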

Once you execute the provided code, a new window will open. Through this window, you can save your plots
as images, perform zooming in and out, and even rotate 3D plots to view data from different perspectives.
This functionality can be particularly advantageous for comprehending complex data that is challenging to
conceptualize.

A screenshot of qt backend

Figure 8.6: qt Backend

Note the floppy-disk symbol for saving the figure, the zoom symbol, and the adjustment button for tuning figure properties. Experiment with these; in some cases they can be very helpful.

You can also save a graph without the qt backend, using the normal inline backend, but in that case you need to call plt.savefig. Here is an example to demonstrate.

## here are other code for visualisation

plt.savefig('myplot.png') # here is the code to save a plot using plt.savefig

plt.show()

The above command will save the figure in current working directory, to change a directory you can provide
a path like below

## here is the path that can be changed/edited

## The below code saves my current figure object into test folder present in E drive

plt.savefig('E:/test/myplot.png') # here is the code to save a plot using plt.savefig

plt.show()
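A slightly fuller sketch of saving a figure is shown below. The temporary folder stands in for a real output directory (an assumption so that the example runs anywhere), and the dpi and bbox_inches values are illustrative choices rather than recommendations.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8], marker="o")
ax.set_title("Example figure")

# A temporary folder stands in for a real output directory (an assumption)
out_dir = Path(tempfile.mkdtemp()) / "plots"
out_dir.mkdir(parents=True, exist_ok=True)  # savefig does not create folders

out_file = out_dir / "myplot.png"
# dpi controls resolution; bbox_inches="tight" trims surrounding whitespace
fig.savefig(out_file, dpi=150, bbox_inches="tight")

print(out_file.exists())
```

Saving is done before any call to plt.show(), as in the examples above, because with interactive backends the figure may already be closed once the window is dismissed.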



Let us now look at more complex graphs, like 3D graphs and meshes.

So, when is a 3D graph useful? In most cases, we don’t require 3D graphs; however, there are scenarios where
they prove beneficial. Visualizing more than three dimensions on a 2D computer screen is challenging, and
often we need to build our intuition on lower dimensions to extrapolate thoughts to higher dimensions.
Therefore, a plotting mechanism with 3D images can aid in specific contexts.

3D graphs, also referred to as three-dimensional plots, are valuable when you want to analyze and visualize
data involving three variables or dimensions. They offer a more comprehensive understanding of relationships
and patterns in data that go beyond what traditional 2D graphs can portray. Here are instances where 3D
graphs are particularly effective:

● Multivariate Data: When dealing with data containing three variables, a 3D graph can showcase interactions
and correlations among these variables. For instance, in scientific research, you might need to visualize the
relationships between temperature, pressure, and volume.

● Surface Visualizations: 3D graphs are adept at displaying surfaces such as elevation maps, terrains, or
response surfaces in scientific experiments. They enable the observation of changes in a variable over two
other variables.

● Time Series with Multiple Variables: In scenarios involving time-dependent data and multiple variables, a
3D plot can reveal how these variables evolve over time, shedding light on intricate temporal relationships.

● Physical Phenomena: Fields like physics and engineering use 3D graphs to illustrate complex physical
phenomena like fluid flow, heat distribution, and electromagnetic fields.

● Spatial Data: For spatial or geographical data, 3D graphs can represent elevation, terrain, or other
attributes on a map.

● Model Evaluation: 3D graphs facilitate the evaluation of models dependent on three input variables. In
finance, for instance, you could visualize the impacts of interest rates, inflation, and investment returns on
portfolio value.

● Data Clustering: In machine learning and data mining, 3D graphs aid in visualizing clusters of data points
when considering more than two features for clustering.

It’s important to acknowledge that while 3D graphs offer valuable insights in specific contexts, they also have
limitations. Complexity and interpretation challenges arise when dealing with more than three variables, and
depth perception can be demanding. The choice of viewing angle can impact the graph’s interpretation. When
employing 3D graphs, it’s advisable to offer multiple perspectives and use color mapping or other techniques
to ensure clear information conveyance.

Now, let’s draw a 3D plot using Matplotlib. Instead of using the inline backend, employ the qt backend, and
you’ll discover the capability to rotate the 3D graph in various directions to gain different perspectives. Feel
free to try it out at your end!

%matplotlib qt
import matplotlib.pyplot as plt
import numpy as np

# Generate example data
x = np.linspace(-5, 5, 50)  # generating x with np.linspace
y = np.linspace(-5, 5, 50)  # generating y with np.linspace
x, y = np.meshgrid(x, y)  # builds the coordinate grid; np.meshgrid is explained after this example
z = np.sin(np.sqrt(x**2 + y**2))  # the function that you want to plot, which takes two arrays as inputs

fig = plt.figure(figsize=(8, 6))  # this one is similar to earlier graphs
ax = fig.add_subplot(111, projection='3d')  # projection='3d' tells Matplotlib that we want a 3D graph
# The argument 111 specifies the layout of subplots in the figure created by Matplotlib.
# It is a shorthand notation often used to create a single subplot in a figure.
# In general, the three arguments are combined into a single integer whose digits
# represent the rows, columns, and index:
# The first digit represents the number of rows.
# The second digit represents the number of columns.
# The third digit represents the index of the subplot, counting from the top-left corner
# and going from left to right, top to bottom.

# plot_surface draws the 3D surface; you can play with different cmap values to change the colors
ax.plot_surface(x, y, z, cmap='viridis')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Surface Plot Example')

plt.show()

In this example:

● The x and y arrays are generated with np.linspace and turned into a coordinate grid with np.meshgrid; z is
computed from that grid.

● The fig and ax objects are created similarly to previous examples, but with projection='3d' added to
indicate that we want a 3D plot.

● The plot_surface() function is used to create the 3D surface plot. It takes the x, y, and z data arrays, and
a color map (cmap) to determine the color mapping of the surface.

● Axes labels and a title are set using the set_xlabel(), set_ylabel(), set_zlabel(), and set_title() methods.

● Finally, the plt.show() function displays the plot.

This example demonstrates how to create a 3D surface plot using Matplotlib in a stateless manner. You can
modify this code to visualize other types of 3D plots, such as scatter plots or wireframe plots, using different
plotting functions and data.

One thing we used in the code above is np.meshgrid, so let us understand what np.meshgrid does.

np.meshgrid is a function in the NumPy library that is used to create a grid of coordinates, often used for
creating 2D or 3D plots. It’s particularly helpful when you want to evaluate a function over a range of x and y
(and optionally z) values and visualize the result.

The basic idea behind np.meshgrid is to take two (or more) 1D arrays representing the coordinates along
different dimensions and generate coordinate matrices that cover all possible combinations of these
coordinates. These coordinate matrices are useful for evaluating functions or creating plots that require
coordinates across multiple dimensions.

Here’s how you can understand np.meshgrid:

● Generating 1D Coordinate Arrays: Start by generating two 1D arrays representing the coordinates along the
x and y dimensions. These arrays are often generated using np.linspace or np.arange.

● Using np.meshgrid: Call np.meshgrid with the generated x and y arrays as arguments. It will return two 2D
arrays where one holds the x-coordinates and the other holds the y-coordinates. Each value in these arrays
corresponds to a specific point on the grid.

● Creating a Grid: The output of np.meshgrid is a grid of coordinates, where each point in the grid is formed
by combining an x-coordinate from one array and a y-coordinate from the other array. This is particularly
useful for evaluating functions that require coordinates across multiple dimensions.

Here’s a simple example using np.meshgrid:

import numpy as np

# Generate 1D coordinate arrays
x = np.linspace(-5, 5, 10)
y = np.linspace(-5, 5, 8)

# Create a grid using np.meshgrid
X, Y = np.meshgrid(x, y)

# X and Y are now 2D arrays representing the grid of coordinates.
# Print them on your PC: the output shows how np.meshgrid turns the two input vectors (arrays)
# into a grid on which a value for another dimension can be computed.
print(X)
print(Y)

In this example, X and Y are 2D arrays that form a grid of coordinates by combining x and y values. These
coordinate arrays can be employed to evaluate functions or generate plots within a 2D space. If you’re
operating in a 3D context, you can extend the same concept by introducing a third coordinate array for the
z-dimension.
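To make the behavior of np.meshgrid concrete, here is a deliberately tiny sketch with three x values and two y values, so that the grids can be read at a glance. The input values are arbitrary and chosen only for illustration; the last lines show the extension to a third coordinate array mentioned above.

```python
import numpy as np

x = np.array([0, 1, 2])   # 3 x-coordinates
y = np.array([10, 20])    # 2 y-coordinates

# With the default indexing='xy', both outputs have shape (len(y), len(x))
X, Y = np.meshgrid(x, y)
print(X.shape, Y.shape)
print(X)  # every row repeats the x values
print(Y)  # every column repeats the y values

# Any function of X and Y is evaluated point by point on the grid
Z = X + Y
print(Z)

# A third 1D array yields three 3D coordinate arrays
z = np.array([100, 200, 300, 400])
X3, Y3, Z3 = np.meshgrid(x, y, z)
print(X3.shape)
```

Each entry of Z pairs one x value with one y value, which is exactly what plot_surface needs for the height at every grid point.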

Below is the output of the earlier 3D plot. Feel free to rotate it on your end to observe its behavior. For the sake
of completeness, I’m also providing the graph here.

Figure 8.7: 3D Surface Plot Example



Here is another view of the same graph, rotated in a different direction:

Figure 8.8: 3D Surface Plot Example
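The two perspectives shown in the figures can also be set in code rather than by dragging the plot with the mouse. The sketch below draws the same surface twice and uses the view_init method of the 3D axes to fix the elevation and azimuth angles. The angle values and the output file name are arbitrary choices for illustration, and the Agg backend is used only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # file-only backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))

fig = plt.figure(figsize=(12, 5))
views = [(30, -60), (75, 20)]  # (elevation, azimuth) in degrees, chosen arbitrarily

for i, (elev, azim) in enumerate(views, start=1):
    ax = fig.add_subplot(1, 2, i, projection='3d')  # 1 row, 2 columns, i-th subplot
    ax.plot_surface(x, y, z, cmap='viridis')
    ax.view_init(elev=elev, azim=azim)  # set the camera angle for this subplot
    ax.set_title(f'elev={elev}, azim={azim}')

fig.savefig('surface_two_views.png')
```

Placing several fixed views side by side is one way to offer the multiple perspectives recommended earlier for 3D graphs.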

Believe it or not, these two graphs are exactly the same, but the perspective changes drastically depending
on how you perceive them. Hence, I emphasize the importance of generating and observing this on your end
to gain a better understanding. With this in mind, we’ve covered a substantial number of important topics in
this chapter. In our next chapter, we’ll delve into small animations achievable using matplotlib. This chapter
promises to be both fun and intriguing, particularly for those intrigued by generative art. Moreover, it holds
potential for simplifying complex concepts when teaching a broader audience. So, until the next section on
animation, see you there!



Summary

Customizing Matplotlib graphs allows you to tailor your data visualizations to meet specific design and
presentation requirements, and to create professional-looking visualizations that effectively communicate
your data and insights to your target audience. It's an essential skill for data scientists, analysts, and
researchers working with data visualization in Python.



Unit 9

Animations in Matplotlib
Learning Objectives

By the end of this unit, you will be able to understand:
● Sine Curve
● Central limit theorem (CLT)

Introduction

This chapter is designed to be a more enjoyable journey, yet it offers a valuable perspective: animations can be
remarkably helpful in grasping complex concepts. Let’s kick things off with a code example that simplifies the
understanding of animation components. Subsequently, we will delve into an animation that brings the central
limit theorem to life.

Sine Curve

Allow me to present a straightforward illustration that animates a sine curve:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation  # very important: this is what we call at the end to animate a function

# Set up the figure and axis
fig, ax = plt.subplots()  # similar to earlier
xdata, ydata = [], []  # empty lists as initialisers
ln, = plt.plot([], [], 'ro')  # 'ro' specifies red circles as markers; the empty x and y values are filled in during the animation

def init():
    ax.set_xlim(0, 2*np.pi)  # x axis limits from 0 to 2*pi
    ax.set_ylim(-1, 1)  # y axis limits from -1 to +1 (sine and cosine values lie between -1 and +1, both inclusive, for any angle)
    return ln,

# this function is called later by FuncAnimation to update the plot
def update(frame):
    xdata.append(frame)  # frame is the value passed in for each step of the animation; the name itself is arbitrary
    ydata.append(np.sin(frame))
    ln.set_data(xdata, ydata)
    return ln,

# The update function takes each value from frames and updates xdata, ydata and ln,
# which is reflected in the fig object; the initial state of fig is set by the init function.
# blit=True redraws only the artists that changed, which speeds up rendering.
ani = FuncAnimation(fig, update, frames=np.linspace(0, 2*np.pi, 128), init_func=init, blit=True)

plt.show()

● We create a figure and axis using plt.subplots().

● xdata and ydata are empty lists that will store the data for the animation.

● We define an init() function to set up the initial plot limits.

● The update() function is called for each frame of the animation. It appends new data points to the lists and
updates the plot.

● The FuncAnimation class is used to create the animation. It takes the figure, the update function, the
frames to iterate through, the init function, and blit=True to improve rendering speed.

It’s important to note that this example serves as a basic illustration to convey the concept of animation using
Matplotlib. You have the flexibility to tailor the animation by adjusting plot data and visual aspects within the
function. Furthermore, you can explore more intricate animations and even integrate Matplotlib with libraries
like NumPy for advanced visual effects.

Central Limit Theorem (CLT)

Now, if we wish to delve into a more intricate topic, such as the Central Limit Theorem (CLT), we must
construct a function that computes averages for samples and subsequently employs these averages to
generate a bell curve through plotting.

Let’s begin by comprehending the essence of the Central Limit Theorem before venturing into the animation
aspect:

The Central Limit Theorem (CLT) constitutes a foundational concept within statistics, elucidating the behavior
of sample means extracted from a sufficiently large set of independent and identically distributed random
variables. It posits that, regardless of the distribution characterizing the original population, the distribution of
sample means will gravitate towards a normal distribution as the sample size expands.

In simpler terms, the Central Limit Theorem stipulates that when you draw numerous samples of a specific
size from any population, calculate the mean for each sample, and then depict the distribution of these
computed sample means, the resulting distribution will mirror the characteristic bell-shaped curve,
emblematic of a normal distribution.

Key considerations related to the Central Limit Theorem:

● Sample Size: The more substantial the sample size, the closer the distribution of sample means aligns with
a normal distribution.

● Independence: The random variables being sampled must be independent, meaning that the outcome of
one random variable does not influence the outcome of another.

● Identically Distributed: The random variables being sampled should exhibit the same distribution.

The Central Limit Theorem carries significant practical implications. It empowers statisticians and data
analysts to approximate the distribution of sample means and conduct diverse statistical tests and
calculations, even in scenarios where the original population distribution deviates from normality.

For instance, with a sufficiently large sample size, you can apply statistical tests that assume a normal
distribution to data that might not be inherently normal. This proves particularly advantageous in hypothesis
testing, confidence interval estimation, and various other statistical analyses.

Here is how the code goes.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

# Set up the figure and axis
fig, ax = plt.subplots()
n_bins = 20
x_range = (-10, 10)
ax.set_xlim(*x_range)
ax.set_ylim(0, 0.5)
ax.set_xlabel('Sample Mean')
ax.set_ylabel('Density')

# Initialize histogram and line
hist, _ = np.histogram([], bins=n_bins, range=x_range)  # note this is coming from numpy
# np.histogram returns two arrays (the counts and the bin edges), but we only need the first one,
# which we are calling hist.
# Essentially, hist contains the counts of data points that fall into each bin of the histogram.
# Since we provided an empty list [] as the input data, all the values in the histogram will be zeros.

line, = ax.plot([], [], lw=2)
# This line of code initializes an empty line plot on the given axes with a specified line width.
# The variable line is assigned the Line2D object representing the empty line plot, which is
# used later in the animation to update the data points of the line.

# Set the number of frames (sample sizes)
n_frames = 100

def init():
    line.set_data([], [])
    return line,

def update(frame):
    sample_size = frame + 1
    # generate 10000 sample means, each the average of sample_size uniform draws
    samples = np.mean(np.random.uniform(-15, 15, (sample_size, 10000)), axis=0)
    # generate the hist and edges values from np.histogram, normalised to a density
    hist, edges = np.histogram(samples, bins=n_bins, range=x_range, density=True)
    bin_centers = (edges[:-1] + edges[1:]) / 2  # getting the bin centers
    # line.set_data pairs each bin center with its histogram value
    line.set_data(bin_centers, hist)
    # the rest is just aesthetics
    ax.set_title(f'Sample Size: {sample_size}')
    return line,

# The FuncAnimation call is very similar to what we saw earlier: it has a figure which gets updated
# by the update function, which takes its input from frames.
# blit=False redraws the whole figure on every frame so that the changing title is shown.
ani = FuncAnimation(fig, update, frames=n_frames, init_func=init, blit=False, interval=100)

plt.show()

Essentially these are the steps:

● We initialize an empty histogram hist and an empty line plot line on the axes, and set the axis limits and
labels once.

● n_frames specifies the number of frames in the animation.

● init() initializes the line plot with empty data.

● update(frame) is the function called for each frame. It generates random samples, calculates the histogram
and bin centers, updates the line plot data, and sets the title to the current sample size.

● FuncAnimation creates the animation. It takes the figure, update function, frames (number of frames), init
function, and other parameters like blit and interval.

● plt.show() displays the animation.

The animations from the last two examples can’t be shown in a PDF, so no GIF or image file is attached. You
need to run them on your PC to see that they work.
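If you want to share an animation outside a live Python session, one option is to write it to a GIF file. The sketch below reuses the sine-curve animation and saves it with PillowWriter from matplotlib.animation. This relies on the Pillow package, which is normally installed together with Matplotlib. The file name, the frame count, and the fps value are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # file-only backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

fig, ax = plt.subplots()
ax.set_xlim(0, 2*np.pi)
ax.set_ylim(-1.1, 1.1)
xdata, ydata = [], []
ln, = ax.plot([], [], 'ro')

def init():
    # start every run of the animation from empty data
    xdata.clear()
    ydata.clear()
    ln.set_data([], [])
    return ln,

def update(frame):
    xdata.append(frame)
    ydata.append(np.sin(frame))
    ln.set_data(xdata, ydata)
    return ln,

ani = FuncAnimation(fig, update, frames=np.linspace(0, 2*np.pi, 32),
                    init_func=init, blit=True)

# Write the frames to a GIF; fps sets the playback speed
ani.save('sine_animation.gif', writer=PillowWriter(fps=10))
```

The resulting file can be opened in any image viewer or browser.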

In summary, this code generates an animation showing how the distribution of sample means approaches a
normal distribution as the sample size increases.

Hopefully this clarifies how the central limit theorem can be seen as an animation: as the sample size grows,
the distribution of the sample means is pushed toward the normal curve. This is one of the most important
concepts in statistics and comes up often in statistical discussions and readings.
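The same effect can be checked numerically, without any plotting. A standard result that accompanies the central limit theorem is that the standard deviation of the sample mean equals the population standard deviation divided by the square root of the sample size. The sketch below compares that prediction with simulated sample means drawn from the same uniform distribution used in the animation. The seed and the list of sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so repeated runs give the same numbers
low, high = -15, 15
# standard deviation of a uniform distribution on [low, high]
population_std = (high - low) / np.sqrt(12)

for n in (1, 4, 25, 100):
    # 10000 sample means, each the average of n uniform draws
    means = rng.uniform(low, high, size=(n, 10000)).mean(axis=0)
    predicted = population_std / np.sqrt(n)
    print(f'n={n:3d}  simulated std={means.std():.3f}  predicted std={predicted:.3f}')
```

The simulated and predicted columns should agree closely, and both shrink as n grows, which is why the curve in the animation becomes narrower and taller.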

Summary

Animations in Matplotlib refer to the ability to create dynamic, time-based visualizations where data changes
and evolves over a sequence of frames. These capabilities allow you to bring your data to life with
visualizations that convey changes and patterns over time. Animations are a powerful tool for storytelling and
communicating insights in various fields, from scientific research to data-driven presentations.

Unit 10
Using SymPy Commands With Matplotlib
Learning Objectives

By the end of this unit, you will be able to understand:
● Latex
● SymPy

Introduction

This is going to be a fairly small chapter, and it is more an introduction to LaTeX-based notation than to
sympy. However, writing LaTeX requires practice and time, so instead we use sympy (a third-party Python
package) to write LaTeX commands. If you are interested, you can learn LaTeX as well.

Latex

Let us first discuss each of the components in this chapter, and then look into how we can use sympy with
matplotlib for annotating and labeling objects.

To install sympy, you can use either of the commands below, depending on which package manager you are
using. With Anaconda, chances are that sympy is already there. Nevertheless, here are both commands in
case it is not present.

conda install -c anaconda sympy # if you are using anaconda
pip install sympy # if you are using pip as package manager
# Inside a conda environment, prefer conda install; mixing pip and conda packages can lead to dependency conflicts.

Now that we have sympy, to understand sympy (symbolic Python) we need some basic background in
LaTeX. Let us look at an equation. Ever wondered how books typeset equations like the ones below?

$ax^2 + bx + c = 0$

$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

Upon observation, you’ll encounter a plethora of intriguing symbols such as “±,” “x^2,” square root symbols
over certain expressions, and more. If you wish to incorporate similar expressions into your books or journals,
you’d need LaTeX. However, instead of using LaTeX directly, we’ll employ SymPy to formulate such
expressions. Subsequently, we can utilize these expressions to annotate text, legends, axes, and more within
our plots.

First, let’s establish a formal definition of LaTeX to provide you with a foundational understanding, followed by
an exploration of SymPy’s capabilities.

LaTeX serves as a typesetting system widely employed for crafting documents that require intricate
formatting, such as research papers, theses, reports, academic articles, and books. It enjoys notable
popularity within academic and scientific spheres due to its proficiency in generating meticulously formatted
documents with professional typography.

Diverging from traditional word processors that prioritize WYSIWYG (What You See Is What You Get) editing,
LaTeX leverages a markup language enabling you to describe your document’s structure and formatting using
plain text commands. Subsequently, the LaTeX engine processes these commands to generate exquisitely
typeset documents. It attends to details encompassing font styles, section headings, references, footnotes,
tables, and mathematical equations.

LaTeX offers a host of advantages:

● High-Quality Typesetting: LaTeX yields documents characterized by impeccable typography and formatting,
an attribute particularly pivotal in academic and scientific writing.

● Consistency: Formatting directives and templates ascertain consistent styling throughout your document,
even if it spans considerable length or complexity.

● Mathematics and Equations: LaTeX excels in rendering intricate mathematical equations and formulas,
rendering it the preferred choice for mathematical and scientific documents.

● Citations and References: LaTeX seamlessly integrates with bibliography management systems like BibTeX
and BibLaTeX, simplifying citation and reference management in academic writing.

● Version Control: LaTeX documents manifest as plain text files, rendering them compatible with version
control systems such as Git. This feature proves invaluable for collaborative writing endeavors.

● Templates: A wealth of LaTeX templates caters to diverse document types, streamlining the initiation of new
projects.

To create a document using LaTeX, you write your content using LaTeX markup commands in a plain text file
with a .tex extension. Then, you compile this .tex file using a LaTeX compiler, such as pdflatex, xelatex, or
lualatex. The compiler processes the markup and generates a PDF document as output.

While LaTeX does pose a steeper learning curve in comparison to conventional word processors, its rewards
lie in the extensive flexibility and command over document formatting. Many individuals deem it a worthy
investment of their time to master LaTeX, particularly if they frequently produce documents necessitating
precise formatting or containing mathematical content.

With a grasp of what LaTeX entails, let’s delve into comprehending the essence of SymPy. Subsequently, we’ll
explore multiple expressions to grasp the functionality and capabilities of SymPy that extend beyond its utility
in generating LaTeX. However, for the scope of this course, we’ll primarily concentrate on employing SymPy
for expressing mathematical equations, not delving into its broader array of features.

SymPy

SymPy, a Python library, specializes in symbolic mathematics. It equips users with tools to engage in symbolic
computations, encompassing algebraic manipulations, calculus, equation solving, manipulation of
mathematical expressions, and more. Diverging from numerical libraries like NumPy or SciPy, which center
on numerical computations, SymPy centers its focus on the symbolic manipulation of mathematical
expressions and equations.

Here are some of the features and capabilities of SymPy:

● Symbolic Expressions: SymPy empowers you to manipulate symbolic variables, constants, and expressions.
It facilitates the creation of mathematical expressions and the execution of algebraic manipulations on them.

● Equation Solving: SymPy excels in solving an array of equations symbolically, encompassing linear,
quadratic, and differential equations, as well as systems of equations.

● Calculus: It proficiently carries out symbolic differentiation and integration of expressions. This proves
especially valuable for symbolically computing derivatives, integrals, and limits.

● Simplification: SymPy adeptly simplifies and refines mathematical expressions to their most concise form
using diverse algebraic rules and techniques.

● Functions: You can define and manipulate symbolic functions via SymPy, including operations like
composition, differentiation, and integration.

● Matrices and Linear Algebra: SymPy provides a toolset for the symbolic manipulation of matrices and linear
algebra operations, encompassing matrix multiplication, inversion, and the computation of eigenvalues.

● Series Expansion: SymPy is equipped to compute Taylor and Laurent series expansions of functions
centered around a specified point.

● Number Theory: Functionalities relevant to number theory, such as prime factorization, greatest common
divisor (GCD), and least common multiple (LCM) computations, are incorporated.

● Geometry: SymPy supports geometric computations, encompassing manipulations of points, lines, and
planes, as well as geometric plotting.

● TeX Output: SymPy generates TeX code for mathematical expressions, rendering it invaluable for generating
equations and formulas in LaTeX documents.

SymPy is an open-source library and can be effortlessly installed using Python’s package manager, pip. It
finds extensive utility among mathematicians, scientists, engineers, and students for symbolic mathematics,
exploration of mathematical concepts, and generation of mathematical content. Its seamless integration with
Python streamlines the amalgamation of symbolic computations with other programming tasks.

Let’s examine some examples. The output shown after each snippet illustrates how it appears when the
commands are executed in Jupyter Lab/Notebook.

import sympy as sp
x, y = sp.symbols('x y')
expr = x**2 + 2*x + y
print(expr)  # output: x**2 + 2*x + y

Output of print(expr):

x**2 + 2*x + y

Output of evaluating expr on its own in a notebook cell:

$x^2 + 2x + y$

You can see that expr on its own is displayed as a typeset expression, like the ones we looked at earlier.
If you want to look at the raw expression, you can see that as well with the print command.

A = sp.Matrix([[1, 2], [3, 4]])
B = sp.Matrix([[5, 6], [7, 8]])
product = A * B
inverse = A.inv()
print(product)
print(inverse)

Output of the two print calls:

Matrix([[19, 22], [43, 50]])
Matrix([[-2, 1], [3/2, -1/2]])

Output of evaluating product on its own:

$\begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$

Output of evaluating inverse on its own:

$\begin{bmatrix} -2 & 1 \\ \frac{3}{2} & -\frac{1}{2} \end{bmatrix}$



Now that we have seen two examples, let us understand what is happening in each.

import sympy as sp
x, y = sp.symbols('x y')
expr = x**2 + 2*x + y
print(expr)  # Output: x**2 + 2*x + y

● import sympy as sp : This line imports the sympy module and gives it an alias sp. The alias is used to refer
to the module’s functions and classes in a shorter form.

● x, y = sp.symbols('x y') : This line creates symbolic variables x and y using the sp.symbols() function.
These variables are used to represent mathematical symbols or placeholders in symbolic computations.

● expr = x**2 + 2*x + y : Here, an expression expr is defined using the symbolic variables x and y. The
expression is a polynomial involving these variables. It’s x squared plus 2 times x plus y.

● print(expr) : This line prints the value of the expr variable. However, in the context of SymPy, this doesn’t
display a numerical result but rather the symbolic expression itself.

So, when you run the code, it will output the symbolic expression:

x**2 + 2*x + y

This means that expr represents the algebraic expression “x squared plus 2 times x plus y.” This is a purely
symbolic representation, which is the core concept of SymPy: working with mathematical expressions
symbolically rather than evaluating them numerically.

Now let us walk through the second example in the same way, ignoring the mathematical details of what
matrices are. We are only looking at how to represent objects in a more bookish form.

A = sp.Matrix([[1, 2], [3, 4]])
B = sp.Matrix([[5, 6], [7, 8]])
product = A * B
inverse = A.inv()
print(product)
print(inverse)

● A = sp.Matrix([[1, 2], [3, 4]]) : This line creates a 2x2 matrix named A using the sp.Matrix() constructor
from the SymPy library. The matrix is initialized with the values [1, 2] in the first row and [3, 4] in the second
row.

● B = sp.Matrix([[5, 6], [7, 8]]) : Similarly, this line creates another 2x2 matrix named B with the values [5, 6]
in the first row and [7, 8] in the second row.

● product = A * B : This line calculates the matrix product of matrices A and B and assigns it to the variable
product. Matrix multiplication is performed using the * operator.



● inverse = A.inv() : This line calculates the inverse of matrix A using the .inv() method provided by SymPy’s
Matrix class. The result is assigned to the variable inverse.

● print(product) : This line prints the value of the product matrix, which is the result of multiplying matrices A
and B.

● print(inverse) : This line prints the value of the inverse matrix, which is the inverse of matrix A.

When you run the code, you’ll get the following output:

For the product matrix (result of matrix multiplication):

Matrix([[19, 22], [43, 50]])

For the inverse matrix (inverse of matrix A):

Matrix([[-2, 1], [3/2, -1/2]])

So, the code demonstrates matrix operations using the SymPy library. It calculates the product of two
matrices and the inverse of one of the matrices, and then prints the results.

If you are interested in more examples, including more basic ones, here is the documentation for you:
01-intro-sympy

Now that we understand how symbolic Python works, let us look at a couple of examples where we annotate
our graphs using sympy in matplotlib.

Let us draw a parabola, where y depends on the squared power of x (e.g. x**2). In code, this is how it looks:

import sympy as sp
import matplotlib.pyplot as plt
import numpy as np

x = sp.symbols('x')  # here we defined a symbolic x
expr = x**2 + 2*x + 1  # we wrote our expression of the parabola

# Convert the symbolic expression to a numpy function
func = sp.lambdify(x, expr, 'numpy')  # we have to lambdify our expression
# In SymPy, the lambdify function is used to convert symbolic expressions into numerical functions
# that can be evaluated using external libraries like NumPy or standard Python's math module.
# This allows you to create efficient numerical functions from symbolic expressions, which can be
# particularly useful for tasks like plotting, numerical computations, and integration.
# The name "lambdify" comes from the concept of creating a lambda function, which is a concise
# way of defining small, anonymous functions in Python.

# Generate x values
x_vals = np.linspace(-5, 5, 400)

# Evaluate the function for the x values
y_vals = func(x_vals)

# Plot the function using Matplotlib
# Here comes the meat of what we understand with sympy
plt.plot(x_vals, y_vals)
plt.title('Plot of $x^2 + 2x + 1$')  # Note the dollar symbols around x^2 + 2x + 1
# If you look at the plot, the title doesn't show the raw text x^2 + 2x + 1 but rather a much cleaner, typeset title
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True)
plt.show()

This is how the title looks once the dollar-delimited expression is rendered:

Figure 10.1: Sympy
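In the parabola example the title string is typed by hand. SymPy can also generate that LaTeX string from the expression itself through sp.latex, so the title always matches the function being plotted. The sketch below is a variation on the example and not part of the original code. The output file name is an arbitrary choice, and the Agg backend is used only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # file-only backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

x = sp.symbols('x')
expr = x**2 + 2*x + 1

# sp.latex returns the LaTeX source of the expression as a plain string
title_tex = sp.latex(expr)
print(title_tex)

# Numerical version of the same expression for plotting
func = sp.lambdify(x, expr, 'numpy')
x_vals = np.linspace(-5, 5, 400)

fig, ax = plt.subplots()
ax.plot(x_vals, func(x_vals))
ax.set_title(f'Plot of ${title_tex}$')  # the dollar signs switch the title to math mode
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.grid(True)
fig.savefig('parabola_sympy_title.png')
```

Changing expr now updates both the curve and its title, with no LaTeX typed by hand.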



Try changing the expression 'Plot of $x^2 + 2x + 1$' to 'Plot of $x^2 + 2 \cdot x + 1$'; you will see a dot
appearing between 2 and x. LaTeX offers many other small commands like \dot and \cdot, and with practice
you will remember the ones you use often.

You might be wondering about that dollar symbol, so here is what it does:

In LaTeX, the dollar symbol $ is used to enter and exit math mode. Math mode is used for typesetting
mathematical content, such as equations, formulas, variables, and symbols. When you enclose text within
a pair of dollar symbols, LaTeX switches to math mode and treats the enclosed content as mathematical
notation. When you use a single dollar symbol, it enters or exits inline math mode, and when you use a pair of
dollar symbols $$, it enters or exits display math mode.
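As a small illustration of math mode in a Matplotlib title, the sketch below sets the same title with and without the \cdot command. It assumes only NumPy and Matplotlib, and the names plain_title and dotted_title are made up for this example. Matplotlib renders text between single dollar symbols with its own built-in math renderer, so no LaTeX installation is assumed here.

```python
import numpy as np
import matplotlib.pyplot as plt

x_vals = np.linspace(-5, 5, 400)
y_vals = x_vals**2 + 2 * x_vals + 1

# Raw strings (r'...') keep the backslash in \cdot intact for the math renderer
plain_title = r'Plot of $x^2 + 2x + 1$'
dotted_title = r'Plot of $x^2 + 2 \cdot x + 1$'

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(x_vals, y_vals)
ax1.set_title(plain_title)    # drawn as a formula, without the dollar symbols

ax2.plot(x_vals, y_vals)
ax2.set_title(dotted_title)   # same formula with a centered dot between 2 and x
```

Call plt.show() afterwards to display the two panels side by side and compare the titles.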

Coming back to the expression we started with: you can create it and display it in your notebook as follows.

import IPython.display
import sympy as sp

a, b, c = sp.symbols('a, b, c')

quadratic_formula = (-b + sp.sqrt(b**2 - 4*a*c)) / (2*a)

# display the sympy expression using pretty printing
IPython.display.display(quadratic_formula)

$\frac{-b + \sqrt{-4ac + b^2}}{2a}$
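The two ideas above (SymPy expressions and dollar-symbol titles) can be combined. The following is a minimal sketch, assuming the expression x**2 + 2*x + 1 from the parabola example; the names expr, expr_latex and func are illustrative choices. It uses sp.latex to produce the LaTeX text of an expression and sp.lambdify to evaluate the same expression numerically.

```python
import numpy as np
import sympy as sp
import matplotlib.pyplot as plt

x = sp.symbols('x')
expr = x**2 + 2*x + 1

# sp.latex turns the SymPy expression into LaTeX source text
expr_latex = sp.latex(expr)

# lambdify turns the same expression into a function that accepts NumPy arrays
func = sp.lambdify(x, expr, 'numpy')
x_vals = np.linspace(-5, 5, 400)

fig, ax = plt.subplots()
ax.plot(x_vals, func(x_vals))
ax.set_title(f'Plot of ${expr_latex}$')  # wrap the LaTeX text in dollar symbols
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.grid(True)
```

With this approach the title and the curve come from one expression, so changing expr updates both.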

Now that we are familiar with how Sympy interacts with Matplotlib and how we can use "$" to write mathematical expressions in annotations, let's delve into Sympy's plotting capabilities. Yes, Sympy has a lesser-known plotting feature as well, and it uses Matplotlib in the background. Let's examine an example to see how it operates.

Through the combination of Sympy and its plotting capabilities, we can effortlessly graph fundamental
mathematical functions. For instance:

from sympy import symbols  # "import sympy as sp" also works; this form is used here for simplicity
from sympy.plotting import plot

x = symbols('x')

p1 = plot(x*x, show=False)
p2 = plot(x, show=False)
p1.append(p2[0])

p1.show()  # required, because both plots were created with show=False


This is how output looks

Figure 10.2: Sympy

If you are given several mathematical functions and want to plot them together, you can simply do this.

plot(x, x**2, x**3, (x, -5, 5))

# a one-liner in which multiple functions such as f(x) = x, f(x) = x**2 and f(x) = x**3 are drawn at once

This is how the plot appears. Note that Sympy's plot does not draw a legend unless you ask for one (it accepts legend=True), and it offers little control over how the legend looks; for anything more, you must employ Matplotlib code. Therefore, while Sympy is not designed for visualization, it excels at swiftly plotting uncomplicated mathematical functions. Reserve Sympy plotting for cases where you merely want to understand a function quickly, as opposed to creating publishable content. For more formal and publication-oriented work, Matplotlib is the tool of choice.

Figure 10.3: MatPlotlib

To get full control over legends, you call Matplotlib directly. Here is an example that draws a similar plot with legends.

import sympy
import matplotlib.pyplot as plt

x = sympy.symbols('x')

## create two plots with show=False so that nothing is drawn yet; we only want the plot objects in line1 and line2
line1, line2 = sympy.plot((x**2, (x, -1, 0)), (x**3, (x, 0, 1)), label='$f(x)$', show=False)

## get the points from the objects above; these will be used in matplotlib
x1, y1 = line1.get_points()
x2, y2 = line2.get_points()

## Using the stateless version of the code
## These lines must be familiar by now, so we use them to finalise the plot
## define figure and axes
fig, ax = plt.subplots(1, 1)

## define the plots
ax.plot(x1, y1, label='x**2')
ax.plot(x2, y2, label='x**3')

## get the legend labels and handles
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)

plt.show()

This is how the output looks now

Figure 10.4: Output Looks

You will notice that we now have legends in the bottom-left corner. I’ll leave it to you to experiment with
placing the legends in any corner of the plot.
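As a starting point for that experiment, here is a minimal sketch of legend placement. It assumes the same two curves but generates the points with NumPy instead of Sympy, purely to keep the example short. The loc argument of ax.legend is a standard Matplotlib parameter.

```python
import numpy as np
import matplotlib.pyplot as plt

x1 = np.linspace(-1, 0, 100)
x2 = np.linspace(0, 1, 100)

fig, ax = plt.subplots(1, 1)
ax.plot(x1, x1**2, label='x**2')
ax.plot(x2, x2**3, label='x**3')

# loc accepts strings such as 'upper left', 'upper right',
# 'lower left', 'lower right', 'center' and 'best'
legend = ax.legend(loc='upper right')
```

Changing the loc string and re-running the cell moves the legend; 'best' lets Matplotlib pick the location that overlaps the data least.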

In this chapter, we learned about how Sympy operates and how symbols can be integrated into Matplotlib
charts to incorporate more mathematical notations in titles or legends. While text annotations are yet to
be covered in an upcoming chapter, we trust that this has provided you with a clearer understanding of
incorporating symbols into Matplotlib charts and utilizing Sympy for generating rapid plots.

Summary

Using SymPy commands with Matplotlib involves combining the capabilities of two Python libraries: SymPy
and Matplotlib. SymPy is a symbolic mathematics library that allows you to perform algebraic and symbolic
computations, while Matplotlib is a powerful library for creating visualizations and plots.

This combination allows you to visualize mathematical concepts, equations, and data generated through
symbolic computations in a graphical format, making it easier to understand and communicate mathematical
ideas and results.



Unit 11
Drawing Different Shapes and Functions

Learning Objectives
By the end of this unit, you will be able to understand:
● Annotating matplotlib

Introduction
Until now, we have focused on formal charts that don't demand extensive code or parameters. However, there are instances where custom shapes, such as rectangles or circles, necessitate distinct commands with varying parameters. This chapter will delve into creating these shapes. Additionally, we'll explore plotting mathematical functions and delve into customizations and annotations for various chart types. Let us start with a simple chart of drawing a rectangle.

import matplotlib.pyplot as plt
import matplotlib.patches as patches  ## most of the custom objects come from here

# Create a figure and axis
fig, ax = plt.subplots()

# Create a Rectangle patch
# The first parameter is the bottom-left corner (0.2, 0.3); the second and third
# parameters are the width and height, which are added to that corner:
# 0.2 + 0.6 and 0.3 + 0.4 give (0.8, 0.7) as the top-right corner
rectangle = patches.Rectangle((0.2, 0.3), 0.6, 0.4, linewidth=2, edgecolor='r', facecolor='none')

# Add the Rectangle patch to the axis
ax.add_patch(rectangle)

# Set limits for the plot
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# Show the plot
plt.show()

The output looks like this.

Figure 11.1: Rectangle

● Rectangle((x, y), width, height) creates a rectangle whose bottom-left corner is at the specified coordinates (x, y), with the given width and height.

● linewidth sets the border thickness of the rectangle.

● edgecolor sets the color of the rectangle's border.

● facecolor sets the color of the rectangle's interior ('none' leaves it unfilled).

Remember to adjust the coordinates, dimensions, colors, and other properties as needed to achieve the
desired appearance for your rectangle.
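The corner arithmetic described for the rectangle can be checked directly on the patch object. This is a small sketch using the getter methods of the Rectangle patch; the names bottom_left and top_right are chosen here for illustration.

```python
import matplotlib.patches as patches

rectangle = patches.Rectangle((0.2, 0.3), 0.6, 0.4,
                              linewidth=2, edgecolor='r', facecolor='none')

# The bottom-left corner is the anchor point passed as the first argument
bottom_left = (rectangle.get_x(), rectangle.get_y())

# The top-right corner is the anchor plus the width and the height
top_right = (rectangle.get_x() + rectangle.get_width(),
             rectangle.get_y() + rectangle.get_height())

print(bottom_left)
print(top_right)
```

Because these are floating-point sums, the printed top-right corner may differ from (0.8, 0.7) in the last decimal places.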



Now that we have seen how to draw a rectangle, let us draw a more involved shape (in this case, a triangle). To draw a polygon in Matplotlib, you can use the Polygon class from the matplotlib.patches module, with the closed=True parameter to tell it that the shape is closed. Here's an example of how to draw a simple polygon:

import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Create a figure and axis
fig, ax = plt.subplots()

# Define the vertices of the polygon
vertices = [(0.2, 0.3), (0.5, 0.8), (0.8, 0.3)]

# Create a Polygon patch
polygon = patches.Polygon(vertices, closed=True, linewidth=2, edgecolor='r', facecolor='none')

# Add the Polygon patch to the axis
ax.add_patch(polygon)

# Set limits for the plot
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# Show the plot
plt.show()

Here we are drawing a triangle:

● vertices is a list of tuples containing the (x, y) coordinates of the vertices of the polygon.

● closed=True indicates that the polygon is closed, meaning its last vertex is connected to the first vertex.

● linewidth sets the border thickness of the polygon.

● edgecolor sets the color of the polygon's border.

● facecolor sets the color of the polygon's interior.

You can adjust the vertices, colors, and other properties to customize the appearance of the polygon according to your needs.
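To see what the closed parameter changes, the sketch below builds the same triangle twice and inspects the vertices stored on each patch with get_xy(). The names closed_polygon and open_polygon are illustrative, and the behaviour described in the comments is what I would expect from current Matplotlib releases rather than something stated in this chapter.

```python
import matplotlib.patches as patches

vertices = [(0.2, 0.3), (0.5, 0.8), (0.8, 0.3)]

closed_polygon = patches.Polygon(vertices, closed=True)
open_polygon = patches.Polygon(vertices, closed=False)

# get_xy() returns the array of vertices stored on the patch
closed_xy = closed_polygon.get_xy()
open_xy = open_polygon.get_xy()

# The closed shape repeats its first vertex at the end so the outline joins up;
# the open shape keeps only the three vertices that were passed in
print(len(closed_xy))
print(len(open_xy))
```

Drawn with facecolor='none', the open version shows only two edges, because nothing connects the last vertex back to the first.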


Similarly, to draw a circle we use patches.Circle. Here the center and radius are the inputs needed to draw the circle.

import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Create a figure and axis
fig, ax = plt.subplots()

# Define the center and radius of the circle
center = (0.5, 0.5)
radius = 0.3

# Create a Circle patch
circle = patches.Circle(center, radius, linewidth=2, edgecolor='b', facecolor='none')

# Add the Circle patch to the axis
ax.add_patch(circle)

# Set limits for the plot
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

# Show the plot
plt.show()

You might be wondering how many commands patches supports. Run dir(patches) and you will see names such as these:

['Annulus', 'Arc', 'Arrow', 'ArrowStyle', 'BoxStyle', 'CapStyle', 'Circle', 'CirclePolygon', 'ConnectionPatch', 'ConnectionStyle', 'Ellipse', 'FancyArrow', 'FancyArrowPatch', 'FancyBboxPatch', 'JoinStyle', 'NonIntersectingPathException', 'Number', 'Patch', 'Path', 'PathPatch', 'Polygon', 'Rectangle', 'RegularPolygon', 'Shadow']

Not every name here creates a new object; some do, and others only help in building objects. I have omitted some other names from this list, but you can guess which ones create objects and which ones act as settings (e.g. JoinStyle is one such setting).

Annotating Matplotlib

We now want to annotate our charts, and while annotating you can also use the symbol notation we saw with Sympy. There is no limit to what you can annotate if you are creative enough. Having said that, annotations are not always useful, and the context tells us whether a given annotation helps.

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [5, 9, 3, 6, 8]

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the data
ax.plot(x, y, marker='o', label='Data')

# Annotate a point on the plot
# To use annotation we use ax.annotate, giving a label 'Annotated Point', and an arrow that points to the annotated point


ax.annotate('Annotated Point', xy=(3, 3), xytext=(4, 4), arrowprops=dict(facecolor='black', shrink=0.05),
            fontsize=10, color='blue', horizontalalignment='right', verticalalignment='top')

# Set labels and title
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Annotated Chart')

# Add a legend
ax.legend()

# Show the plot
plt.show()

This is what is happening

● ax.annotate is used to add an annotation. The xy parameter specifies the point to annotate, while xytext specifies the location of the annotation text.

● arrowprops specifies properties for the arrow that connects the annotation text to the point.

● fontsize and color determine the style of the annotation text.

● horizontalalignment and verticalalignment control the alignment of the annotation text.
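The parameters listed above can also be driven by the data instead of being typed by hand. The following is a minimal sketch that annotates the highest point of the same sample data; the names max_index, max_point and annotation are illustrative. It uses textcoords='offset points', which places the text at an offset measured in points from the annotated point.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [5, 9, 3, 6, 8]

fig, ax = plt.subplots()
ax.plot(x, y, marker='o', label='Data')

# Locate the highest point so the annotation follows the data
max_index = y.index(max(y))
max_point = (x[max_index], y[max_index])

annotation = ax.annotate(
    f'Max: {max_point[1]}',
    xy=max_point,                  # the point being annotated
    xytext=(30, -30),              # offset of the text from that point
    textcoords='offset points',    # read xytext in points, not in data units
    arrowprops=dict(arrowstyle='->'),
    fontsize=10, color='blue',
    horizontalalignment='left', verticalalignment='top')

ax.legend()
```

If the data changes, the annotation moves with the new maximum without any edits to the coordinates.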

This is how it looks:

Figure 11.2: Annotated Chart



Another example is to show the roots of an equation; we can mark the roots on a chart using scatter dots.

import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

# Define the variable
x = sp.Symbol('x')

# Define the equation
equation = x**3 - 6*x**2 + 11*x - 6

# Find the roots of the equation using SymPy
roots = sp.solve(equation, x)

# Convert SymPy roots to numerical values
numerical_roots = [float(root) for root in roots]

# Create a figure and axis
fig, ax = plt.subplots()

# Plot the equation
x_vals = np.linspace(-1, 4, 400)
y_vals = x_vals**3 - 6*x_vals**2 + 11*x_vals - 6
ax.plot(x_vals, y_vals, label='Equation')

# Mark the roots (using the plain floats rather than the SymPy objects)
ax.scatter(numerical_roots, [0, 0, 0], color=['red', 'blue', 'green'])

# Annotate the roots
for root in numerical_roots:
    # xytext is an offset in points: 0 sideways and 20 upwards from the root
    ax.annotate(f'Root at {root:.2f}', xy=(root, 0), xytext=(0, 20),
                textcoords='offset points', ha='center', fontsize=10,
                arrowprops=dict(facecolor='brown', arrowstyle='->', connectionstyle='arc3'),
                color='blue',
                bbox=dict(facecolor='yellow', edgecolor='none', boxstyle='round,pad=0.2'))

# Set labels and title
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Annotated Roots of Equation')

# Add a legend
ax.legend()

# Show the plot
plt.show()

In this example, we use SymPy to find the roots of the equation x**3 - 6*x**2 + 11*x - 6 . Then, we convert the
SymPy roots to numerical values and use Matplotlib to plot the equation. For each root, we annotate it with
its value on the x-axis.
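In the code above the polynomial is typed twice, once for SymPy and once for NumPy. A possible alternative, sketched here with the illustrative name equation_func, is to let sp.lambdify build the numerical function from the symbolic equation so that both stay in sync.

```python
import numpy as np
import sympy as sp

x = sp.Symbol('x')
equation = x**3 - 6*x**2 + 11*x - 6

# Build a NumPy-ready function from the symbolic equation
equation_func = sp.lambdify(x, equation, 'numpy')

# Solve symbolically, then convert the roots to sorted plain floats
roots = sorted(float(root) for root in sp.solve(equation, x))

x_vals = np.linspace(-1, 4, 400)
y_vals = equation_func(x_vals)   # same curve as the hand-written polynomial

# The curve should be numerically zero at each root
residuals = equation_func(np.array(roots))
print(roots)
print(residuals)
```

The arrays x_vals and y_vals can be passed to ax.plot exactly as before.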



This is how it looks

Figure 11.3: Annotated Roots

With the help of annotations we can point out the actual roots. There are three roots in this case, with values 1, 2 and 3. To show them we used ax.scatter for the dots and ax.annotate for the labels and, as in the earlier example, an arrow points from the text to each dot, this time in a different style.

In this chapter we looked at drawing different shapes and annotating charts; you can of course use the same techniques to annotate any chart drawn in Matplotlib. In the last chapter we will look at algorithm-based visualization, which common visualization tools may not support because the algorithm has to be implemented within the tool. This is also one of the reasons why we chose Python over other dashboarding tools.

Summary

Drawing different shapes and functions involves using programming libraries or graphical tools to create visual representations of geometric shapes, mathematical functions, custom shapes, or data. The choice of tools and libraries depends on your specific requirements and programming language preferences.


Unit 12
Miscellaneous Plotting Techniques

Learning Objectives
By the end of this unit, you will be able to understand:
● Some drawbacks of Kmeans
● DBSCAN

Introduction
At times, data can be complex, containing numerous variables and rows. In such intricate scenarios, grouping similar rows based on features can aid in comprehending common patterns exhibited by different individuals, thus categorizing them into specific groups. This grouping technique is known as clustering. Let's delve into a formal definition.

Please note that this book exclusively covers clustering algorithms' code and visualization aspects. If you find this technique intriguing, consider delving into another resource. I highly recommend "Hands-On Machine Learning" by Aurélien Géron, published by O'Reilly. The book comprehensively covers all facets of machine learning.

Clustering is an unsupervised machine learning technique that involves aggregating similar data points into clusters or groups. The objective of clustering is to identify patterns or structures within data without prior knowledge of labels or categories. It serves as a crucial exploratory data analysis method, revealing insights and patterns within datasets.


The fundamental concept of clustering revolves around partitioning a dataset into subsets or clusters, where data points within the same cluster exhibit higher similarity to each other than to data points in other clusters. Clustering finds application in customer segmentation, image segmentation, anomaly detection, document categorization, and more.

Various algorithms and methods are employed for clustering, contingent on data characteristics and desired outcomes. Common clustering algorithms include:

● KMeans Clustering: Divides data into a predefined number of clusters, assigning each data point to the cluster with the nearest mean (centroid).

● Hierarchical Clustering: Constructs a hierarchical tree of clusters through successive merging or splitting based on similarity.

● DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Clusters data points based on density, identifying clusters as high-density areas separated by low-density regions.

● Agglomerative Clustering: A hierarchical clustering type that initiates with individual data points as clusters and iteratively merges them based on a chosen linkage criterion.

● Mean Shift Clustering: Identifies clusters by locating high-density regions in the feature space.

● Gaussian Mixture Models (GMM): Assumes data points arise from a mixture of Gaussian distributions, accommodating more versatile cluster shapes.

● Spectral Clustering: Utilizes the spectrum of a similarity graph to partition data into clusters.

● Self-Organizing Maps (SOM): Employs a neural network approach to map high-dimensional data to a lower-dimensional grid, preserving spatial relationships.

Clustering algorithms handle diverse data types, including numerical, categorical, and mixed data. Algorithm selection hinges on factors like dataset size, data dimensionality, cluster nature, and available computational resources.

Visualizing clustering results is vital for comprehending data structure and validating clustering quality. Common visualization techniques encompass scatter plots, dendrogram plots, and t-SNE (t-Distributed Stochastic Neighbor Embedding) plots.

This book delves into two of these algorithms: KMeans and DBSCAN.

An example using KMeans:

Visualizing data via KMeans clustering is a prevalent technique for uncovering underlying data structure. KMeans, an unsupervised algorithm, segregates a dataset into K distinct non-overlapping clusters. Each data point joins the cluster with the closest mean (centroid). Here's how you can visualize KMeans clustering results using Python and well-known data visualization libraries like Matplotlib and Seaborn:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans  ## loading libraries

# Generating synthetic data
np.random.seed(10)
n_samples = 300
n_features = 2
n_clusters = 4

# Generating random data with 2 features and asking KMeans for 4 groups, for example's sake
data = np.random.randn(n_samples, n_features)
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(data)

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

# Visualize the clusters
ax.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='rainbow')

# Visualize the cluster centers
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='green', marker='o', alpha=0.5)

ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('KMeans Clustering')
plt.show()

This is how the chart looks:

Figure 12.1: Clustering

We can observe four clusters in the chart (though this is not surprising, since we asked KMeans for four groups; the random data has no built-in groups of its own). If you take an arbitrary real dataset instead, such as the palmer_penguins dataset, and apply clustering to it, you'll see that the technique can separate the penguin species from one another quite effectively. Let's perform this analysis to check how well the approach works.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the Palmer Penguins dataset
penguins = sns.load_dataset("penguins")
penguins.dropna(inplace=True)

# Select relevant features
features = penguins[["bill_length_mm", "flipper_length_mm"]]

# Drop missing values, if any
features = features.dropna()

# Number of clusters
n_clusters = 3

# Initialize KMeans
kmeans = KMeans(n_clusters=n_clusters)

# Fit KMeans to the data
# We need to fit the algorithm on data, that is why we are using fit for it.
kmeans.fit(features)

# Add cluster labels to the original dataset
penguins["cluster"] = kmeans.labels_

# Plot the data points and cluster centers
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="cluster", palette="Set1", ax=ax)

ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='orange', marker='o',
           label='Cluster Centers', alpha=0.8)

ax.set_title("KMeans Clustering of Penguins Data")
ax.set_xlabel("Bill Length (mm)")
ax.set_ylabel("Flipper Length (mm)")
ax.legend()

plt.show()


The output looks like this:

Figure 12.2: Clustering of Penguins Data

In this visualization, each data point is colored based on the cluster it belongs to, and the orange “O” markers
indicate the cluster centers.

However, it’s important to note that in real-world scenarios, you would typically conduct a more in-depth
analysis of the data, consider feature scaling, and experiment with various cluster numbers to determine the
best configuration for your specific needs.
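One common way to experiment with the number of clusters is an elbow plot. The sketch below is only an illustration under stated assumptions: it uses synthetic data from scikit-learn's make_blobs with three groups rather than the penguins data, and the names cluster_counts and inertias are chosen for this example. It records the inertia_ attribute of a fitted KMeans model, which is the sum of squared distances from each point to its nearest cluster center.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=10)

cluster_counts = list(range(1, 7))
inertias = []
for k in cluster_counts:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=10)
    kmeans.fit(X)
    # inertia_: sum of squared distances of points to their nearest center
    inertias.append(kmeans.inertia_)

fig, ax = plt.subplots()
ax.plot(cluster_counts, inertias, marker='o')
ax.set_xlabel('Number of clusters (K)')
ax.set_ylabel('Inertia')
ax.set_title('Elbow Plot')
```

The value of K where the curve stops dropping sharply (the "elbow") is a reasonable candidate for the number of clusters; it is a heuristic, not a guarantee.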

Interestingly, if you run the code provided below, which is unrelated to clustering, you’ll notice a similar plot.

sns.scatterplot(data=penguins, x=”bill_length_mm”, y=”flipper_length_mm”, hue=”species”, palette=”Set1”)



Figure 12.3: Bill Length

What does this tell us? It essentially indicates that even without prior knowledge of penguin species, we could have inferred the existence of distinct groups based on the bill_length_mm and flipper_length_mm measurements among the penguins. There seem to be three clear clusters. It is noteworthy that we didn't use the species column anywhere in the KMeans clustering process.

A note for the readers: In all the examples above, we utilized data that shares a common scale. If you intend
to compare and cluster features that are significantly different, it’s advisable to standardize the data to the
same scale before applying clustering techniques. This is because distance calculations, often used in
clustering, can be influenced by features with larger values. Therefore, standardizing the dataset can yield
more meaningful results. However, in the cases presented here, standardization wasn’t necessary since we
used features with the same physical unit. (You can refer to one example from DBScan.)
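To make the effect of scale concrete, here is a small NumPy-only sketch. The three rows are hypothetical measurements invented for this illustration (one column in millimetres, one in grams), and the helper name distance is also made up. The standardization step is the usual one: subtract each column's mean and divide by its standard deviation.

```python
import numpy as np

# Hypothetical measurements: column 0 in millimetres, column 1 in grams
data = np.array([[39.1, 3750.0],
                 [46.5, 5200.0],
                 [50.0, 3800.0]])

def distance(a, b):
    """Euclidean distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances are dominated by the gram column because its values are larger
raw_01 = distance(data[0], data[1])
raw_02 = distance(data[0], data[2])

# Standardize each column: subtract its mean, divide by its standard deviation
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

scaled_01 = distance(scaled[0], scaled[1])
scaled_02 = distance(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, the first distance is many times larger than the second purely because of the gram column; after scaling, both columns contribute on comparable terms.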

Please keep in mind that the Palmer Penguins dataset contains more features that you could also explore for
clustering purposes. This example, however, focused on two features for the sake of simplicity.

Some Drawbacks of Kmeans

KMeans is a popular and widely used clustering algorithm; however, like any algorithm, it comes with
limitations and drawbacks. Some of the key drawbacks of KMeans include:



● Sensitive to Initial Centroids: KMeans is highly sensitive to the initial placement of cluster centroids. Different initializations can lead to distinct final cluster assignments, occasionally resulting in suboptimal solutions. Various initialization techniques, such as KMeans++, aim to alleviate this issue.

● Requires Specifying the Number of Clusters: You must specify the number of clusters (K) in advance, which might not always be known or intuitive. Selecting the optimal number of clusters can be challenging and may involve trial and error or using methods like the elbow method or silhouette score.

● Assumes Equal Cluster Sizes and Shapes: KMeans assumes that clusters are approximately spherical and have uniform sizes. In reality, clusters might exhibit different shapes, densities, or sizes, leading to suboptimal clustering outcomes.

● Sensitive to Outliers: KMeans is sensitive to outliers, as they can disproportionately impact the calculation of cluster centroids. Outliers might exert a pulling effect on clusters, distorting the representation of the underlying data distribution.

● Doesn't Handle Non-Convex Clusters Well: KMeans tends to generate convex clusters, making it less effective at identifying non-convex clusters accurately. Alternative clustering algorithms like DBSCAN or hierarchical clustering can better handle non-convex shapes.

● Produces Circular Clusters: KMeans computes cluster centers using the mean of data points, often yielding circular or spherical clusters. This might not be suitable for data featuring elongated or irregularly shaped clusters.

● Sensitive to Scaling: KMeans is responsive to feature scaling. If features possess significantly different scales, KMeans could assign more weight to features with larger scales, potentially introducing bias in cluster assignments.

● May Require Multiple Runs: Due to its sensitivity to initialization, KMeans may need multiple runs with varying initializations to increase the likelihood of achieving a robust solution. However, this can lead to higher computational expenses.

● Global Optimization: KMeans aims to minimize the sum of squared distances between data points and cluster centroids. Nevertheless, this can lead to the problem of local minima, wherein the algorithm might converge to suboptimal solutions, particularly in intricate or high-dimensional data.

Despite these drawbacks, KMeans remains a valuable tool for clustering tasks, particularly when the dataset aligns with its assumptions. However, for datasets featuring irregular clusters, diverse cluster sizes, or noise, exploring alternative clustering methods might be more suitable. Always assess the appropriateness of KMeans based on your data's characteristics and your analysis objectives.

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that works well in cases where the data has varying cluster shapes and densities and is potentially noisy.


Here are some scenarios where DBSCAN can work better than KMeans:

● Irregular Cluster Shapes: DBSCAN can identify clusters of various shapes, including non-convex and elongated shapes, as it defines clusters based on density-connected regions rather than assuming circular clusters like KMeans.

● Varying Cluster Densities: DBSCAN defines clusters based on regions of higher data point density, so it can find clusters in data whose density varies from place to place. Because one eps value applies to the whole dataset, very large differences in density between clusters can still be difficult to capture.

● Noise and Outliers: DBSCAN can distinguish noise points from actual clusters and can ignore them in its clustering process. This is important when dealing with datasets containing noisy data points or outliers.

● Unknown Number of Clusters: DBSCAN does not require specifying the number of clusters beforehand, making it suitable for situations where the optimal number of clusters is not known or intuitive.

● Robust to Initialization: DBSCAN is less sensitive to the initial placement of points and does not rely on initial centroids like KMeans. This makes it less likely to converge to suboptimal solutions.

● Appropriate for Spatial Data: As the name suggests, DBSCAN was designed for spatial data or data with a notion of proximity. It works well for geographic data, image segmentation, and other forms of data with spatial relationships.

● No Assumption of Equal Cluster Sizes: DBSCAN does not assume that clusters have equal sizes, unlike KMeans. This makes it suitable for data where cluster sizes vary significantly.

● High-Dimensional Data: DBSCAN's density-based approach can be used with distance measures other than Euclidean distance, which can help in high-dimensional spaces where Euclidean distances might not effectively capture similarities between data points.

However, it's important to note that DBSCAN also has its own limitations. For example, DBSCAN might struggle with datasets where the densities of different clusters overlap significantly, making it hard to distinguish between clusters. Additionally, DBSCAN's performance can be sensitive to its hyperparameters, such as the epsilon (ε) distance threshold and the minimum number of points required to form a dense region (MinPts).

In summary, DBSCAN is particularly useful for datasets with irregular shapes, varying densities, noisy points, and when the number of clusters is not known. It's important to understand the characteristics of your data and the strengths of DBSCAN to decide if it's a more suitable clustering algorithm than KMeans for your specific task.

Let us take an example of DBSCAN.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Generate moon-shaped data
X, y = make_moons(n_samples=800, noise=0.05, random_state=42)

## Note: here we apply StandardScaler (standardization) to bring every column to one scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Initialize DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)

# Fit DBSCAN to the data
labels = dbscan.fit_predict(X_scaled)

# Visualize the clusters
## We are not using the stateless approach here
## As homework, you can redo this in the stateless style with several plots,
## using multiple values of eps and min_samples, and see how the result changes
## You should try KMeans too; you will realise that KMeans fails to identify the correct pattern in this dataset
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering of Moon Data')
plt.show()

The output would look like this:

Figure 12.4: Clustering of Moon Data



In this example, the make_moons function generates the moon-shaped data, and then DBSCAN is applied
to cluster the data points. The eps parameter controls the maximum distance between two samples to be
considered as neighbors, and the min_samples parameter sets the minimum number of samples required to
form a dense region.

The resulting plot will show the data points colored according to the clusters identified by DBSCAN. You’ll
likely see two clusters representing the two moon shapes.

Remember that parameter tuning is crucial in DBSCAN. Adjusting eps and min_samples can significantly
affect the results. Experiment with different parameter values to see how they impact the clustering outcome.

Also, note that DBSCAN can identify noise points as well, which will be assigned the label -1. These are data
points that do not belong to any dense region or cluster.
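The effect of eps and the meaning of the -1 label can both be seen by counting clusters and noise points for a few parameter values. This is a minimal sketch on the same moon data; the eps values in the loop are illustrative picks rather than recommendations, and the results dictionary is a name introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=800, noise=0.05, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

results = {}
for eps in [0.05, 0.3, 3.0]:   # illustrative values only
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    n_noise = int(np.sum(labels == -1))       # label -1 marks noise points
    n_clusters = len(set(labels) - {-1})      # every other label is a cluster
    results[eps] = (n_clusters, n_noise)
    print(f'eps={eps}: {n_clusters} clusters, {n_noise} noise points')
```

A very small eps tends to leave many points labelled as noise, while a very large eps merges everything into a single cluster; useful values lie in between and depend on the data.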

With this example, we have reached the conclusion of the book. I hope you found it enjoyable and informative.
It’s important to clarify that this book’s purpose is to provide a straightforward way of explaining data
visualization in Matplotlib. While the book does not cover storytelling extensively, that skill can be developed
through hands-on experience in solving real-world business problems and interactive sessions.

The aim of this book is to equip you with the ability to create visualizations, choose appropriate plots for
different situations, and interpret data insights from them. It serves as a gentle introduction to the fascinating
world of data visualization. However, it’s worth noting that this book is not exhaustive or comprehensive. It’s
only the beginning of your journey into data visualization.

There is a wealth of resources available on the internet and in books that can further enhance your knowledge.
I’d recommend exploring books like “Fundamentals of Data Visualization” by Claus Wilke and “Storytelling
with Data” published by Wiley. These resources can take your skills to the next level.

I chose a tool, Matplotlib, that may seem challenging for beginners, but I hope it has given you the perspective
that data visualization is not as daunting as it may appear. With a bit of effort and dedication, you can master
it. As you move forward in your careers, I wish you the very best and success in all your endeavors.

Summary

Miscellaneous plotting techniques encompass a wide range of creative and specialized methods for visualizing
data and information beyond traditional charts and graphs. These miscellaneous plotting techniques offer
unique ways to represent and explore data, making them valuable tools for data analysis, storytelling, and
decision-making across a wide range of domains. The choice of technique depends on the nature of the data
and the insights you want to convey to your audience.
