Logistic Regression
In Machine Learning, we often need to solve problems that have one of two possible answers. For
example, in the medical domain we might want to find out whether a tumor is malignant or benign,
and in the education domain we might want to see whether a student gets admission to a
specific university or not.
Such problems are binary classification problems, and logistic regression is a very popular algorithm for
solving them. With these examples in mind, it is easier to understand what logistic regression is.
What is Logistic Regression?
Logistic Regression is a Machine Learning algorithm used to make predictions to find the value of a dependent
variable such as the condition of a tumor (malignant or benign), classification of email (spam or not spam), or
admission into a university (admitted or not admitted) by learning from independent variables (various features
relevant to the problem).
For example, for classifying an email, the algorithm will use the words in the email as features and based on that
make a prediction whether the email is spam or not.
Logistic Regression is a supervised Machine Learning algorithm, which means the data provided for training is
labeled i.e., answers are already provided in the training set. The algorithm learns from those examples and their
corresponding answers (labels) and then uses that to classify new examples.
In mathematical terms, suppose the dependent variable is Y and the set of independent variables is X. Logistic
regression then predicts the probability P(Y=1) as a function of X.
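As a minimal sketch of this mapping, assuming the standard logistic (sigmoid) form of the model, a weighted sum of the features is passed through the sigmoid function to give P(Y=1). The weights, bias, and feature values below are made up for illustration and are not fitted to any dataset.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Return P(Y=1 | X) for one example under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and bias, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # 0.5 is a common default decision threshold
print(round(p, 3), label)
```

The threshold of 0.5 is only a convention; a different cut-off can be chosen when one kind of error is more costly than the other.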
Types of Logistic Regression
When we talk about Logistic Regression in general, we usually mean Binary logistic regression, although there are
other types of Logistic Regression as well.
Logistic Regression can be divided into types based on the type of classification it does. With that in view, there are
3 types of Logistic Regression. Let’s talk about each of them:
Binary Logistic Regression
Multinomial Logistic Regression
Ordinal Logistic Regression
Binary Logistic Regression
Binary Logistic Regression is the most commonly used type. It is the type we already discussed when defining
Logistic Regression. In this type, the dependent/target variable has two distinct values, either 0 or 1, malignant or
benign, passed or failed, admitted or not admitted.
Multinomial Logistic Regression
Multinomial Logistic Regression deals with cases where the target (dependent) variable has three or more possible
unordered values. For example, chest X-ray images can be used as features to indicate one of three possible
outcomes (No disease, Viral Pneumonia, COVID-19). Multinomial Logistic Regression will use the features to
classify the example into one of the three possible outcomes in this case. There can of course be more than three
possible values of the target variable.
Ordinal Logistic Regression
Ordinal Logistic Regression is used in cases when the target variable is of ordinal nature. In this type, the categories
are ordered in a meaningful manner and each category has quantitative significance. Moreover, the target variable
has more than two categories. For example, the grades obtained on an exam have categories that have quantitative
significance and they are ordered. Keeping it simple, the grades can be A, B, or C.
Difference between Logistic and Linear Regression
The major difference between Logistic and Linear Regression is that Linear Regression is used to solve regression
problems whereas Logistic Regression is used for classification problems. In regression problems, the target variable
can have continuous values such as the price of a product, the age of a participant, etc. Classification
problems, in contrast, deal with the prediction of a target variable that can only have discrete values, for example,
predicting the gender of a person or whether a tumor is malignant or benign.
In what type of problems does Logistic Regression work best?
Logistic Regression is most commonly used in problems of binary classification in which the algorithm predicts one
of the two possible outcomes based on various features relevant to the problem.
Logistic Regression finds its applications in a wide range of domains and fields, the following examples will
highlight its importance:
Education sector: In the Education sector, logistic regression can be used to predict:
Whether a student gets admission into a university program, based on test scores and various other factors.
Whether a student on an e-learning platform will complete a course on time, based on past activity and other
statistics relevant to the problem.
Business sector: In the business sector, logistic regression has the following applications:
Predicting whether a credit card transaction made by a user is fraudulent or not.
Medical sector: Medical sector also benefits from logistic regression through the following uses:
Predicting whether a person has a disease, based on values obtained from test reports or other factors.
A very innovative application of Machine Learning being used by researchers is to predict whether a person has
COVID-19 or not using Chest X-ray images.
Other applications: Logistic regression finds its applications in all major sectors, in addition to that, some of its
interesting applications are:
Email Classification – Spam or not spam
Sentiment Analysis – Person is sad or happy based on a text message
Object Detection and Classification – Classifying an image to be a cat image or a dog image
There are numerous other problems that can be solved using Logistic Regression. The above-mentioned examples
should be enough to give you an idea of how powerful and useful this algorithm is.
Example of Algorithm based on Logistic Regression and its implementation in Python
Now that the basic concepts about Logistic Regression are clear, it is time to study a real-life application of Logistic
Regression and implement it in Python.
Let’s work on classifying credit card transactions as fraudulent, also called credit card fraud detection. It is a very
important application of Logistic Regression being used in the business sector. A real-world dataset will be used for
this problem. It is quite a comprehensive dataset having information of over 280,000 transactions. Step by step
instructions will be provided for implementing the solution using logistic regression in Python.
So let’s get started:
Step 1 – Doing Imports
The first step is to import the libraries that are going to be used later. If you do not have them installed, you would
have to install them using pip or any other package manager for python.
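import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# load the dataset
dataset = pd.read_csv('User_Data.csv')
# input
x = dataset.iloc[:, [2, 3]].values
# output
y = dataset.iloc[:, 4].values
# split into training and test sets (an 80/20 split is assumed here)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
# scale the features; the scaler is fitted on the training set only
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)
print(xtrain[0:10, :])
# train the classifier and predict on the test set
classifier = LogisticRegression(random_state=0)
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
# evaluate the predictions
cm = confusion_matrix(ytest, y_pred)
print("Confusion Matrix : \n", cm)
print("Accuracy : ", accuracy_score(ytest, y_pred))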
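The confusion matrix printed by the code above can be turned into summary metrics. The following is a minimal sketch, assuming the scikit-learn layout for binary labels (rows are true classes, columns are predicted classes, so the matrix reads [[TN, FP], [FN, TP]]). The counts are made up for illustration and do not come from any dataset in this article.

```python
def metrics_from_confusion_matrix(cm):
    """Return (accuracy, precision, recall) from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts: 65 true negatives, 3 false positives,
# 8 false negatives, 24 true positives.
cm = [[65, 3], [8, 24]]
acc, prec, rec = metrics_from_confusion_matrix(cm)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

For heavily imbalanced problems such as fraud detection, precision and recall are usually more informative than accuracy alone, because a model that never predicts the rare class can still score a very high accuracy.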
Conclusion
There are different types of Logistic Regression, but the most widely used is the binary logistic regression in which
the classification takes place on one of the two possible values of the target variable. Logistic Regression is different
from Linear Regression because it is a classification algorithm and has discrete values as classification output, while
Linear Regression is a Regression algorithm having continuous values as output. We also covered that Logistic
Regression finds its use in a wide range of applications including the classification tasks in Business, Education, and
Medical industries.
A real-life example of Logistic Regression was studied. The analysis involved over 280,000 instances of
transactions which were further divided into training and test sets by a ratio of 80 to 20 respectively. After exploring
and preprocessing the dataset, the model was trained and a classification accuracy of 99.9% was obtained. It showed
that Logistic Regression was very successful in detecting fraudulent transactions, although more improvement can
also be made by tuning the model (an advanced topic).
Linear Regression (Python Implementation)
This article discusses the basics of linear regression and its implementation in the Python
programming language.
Linear regression is a statistical method for modeling the relationship between a dependent
variable and a given set of independent variables.
Note: In this article, we refer to dependent variables as responses and independent
variables as features for simplicity.
In order to provide a basic understanding of linear regression, we start with the most
basic version of linear regression, i.e. Simple linear regression.
Now, the task is to find a line that fits best in the above scatter plot so that we can
predict the response for any new feature values (i.e., a value of x not present in the dataset).
This line is called a regression line.
The equation of the regression line is represented as:
h(x_i) = b_0 + b_1 * x_i
Here,
h(x_i) represents the predicted response value for ith observation.
b_0 and b_1 are regression coefficients and represent y-intercept and slope of
regression line respectively.
To create our model, we must “learn” or estimate the values of regression coefficients
b_0 and b_1. And once we’ve estimated these coefficients, we can use the model to
predict responses!
In this article, we are going to use the principle of Least Squares.
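Now consider the sum of squared residuals as the cost function:
J(b_0, b_1) = sum over i of (y_i - h(x_i))^2
Our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum.
Without going into the mathematical details, we present the result here:
b_1 = SS_xy / SS_xx
b_0 = mean(y) - b_1 * mean(x)
where SS_xy = sum over i of (x_i - mean(x)) * (y_i - mean(y)) and SS_xx = sum over i of (x_i - mean(x))^2.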
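import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # least-squares estimates of the intercept b_0 and the slope b_1
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_xx = np.sum((x - x.mean()) ** 2)
    b_1 = ss_xy / ss_xx
    return y.mean() - b_1 * x.mean(), b_1

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # plotting the fitted regression line
    plt.plot(x, b[0] + b[1] * x, color="g")
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {:.4f}\nb_1 = {:.4f}".format(b[0], b[1]))
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()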
Output:
Estimated coefficients:
b_0 = 1.2364
b_1 = 1.1697
And the graph obtained looks like this:
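import numpy as np
# array created from a tuple
b = np.array((1, 3, 2))
print(b)
# 3 x 4 array of zeros
c = np.zeros((3, 4))
print(c)
# values from 0 up to (but not including) 30 in steps of 5
f = np.arange(0, 30, 5)
print(f)
# 10 evenly spaced values between 0 and 5 (both included)
g = np.linspace(0, 5, 10)
print(g)
# 12-element array (values assumed) so that it can be reshaped to 2 x 2 x 3
arr = np.arange(12)
print(arr.ndim)
newarr = arr.reshape(2, 2, 3)
print(arr)
print(newarr)
a1 = np.arange(8)
print(a1)
a2 = np.arange(8).reshape(2, 4)
print(a2)
a3 = np.arange(8).reshape(4, 2)
print(a3)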
Data visualization helps to understand data, however complex it is, by summarizing and
presenting a huge amount of data in a simple and easy-to-understand format, and it helps
communicate information clearly and effectively.
Pandas and Seaborn are two Python packages that make importing and analyzing data
much easier. In this article, we will use Pandas and Seaborn to analyze data.
Pandas
Pandas offers tools for cleaning and processing your data. It is the most popular Python
library used for data analysis. In pandas, a data table is called a DataFrame.
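import pandas as pd
# illustrative data (assumed; the original values are not shown)
data = {"Name": ["A", "B"], "Age": [25, 30]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)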
Output:
Example 2: load the CSV data from the system and display it through pandas.
# import module
import pandas
# load the csv
data = pandas.read_csv("nba.csv")
Output:
Seaborn
Seaborn is an amazing visualization library for statistical graphics plotting in Python. It
is built on the top of matplotlib library and also closely integrated into the data structures
from pandas.
Installation
For python environment :
pip install seaborn
For conda environment :
conda install seaborn
Let’s create some basic plots using seaborn:
# Importing libraries
import numpy as np
import seaborn as sns
# Selecting style as white,
# dark, whitegrid, darkgrid
# or ticks
sns.set( style = "white" )
Output:
Seaborn: statistical data visualization
Seaborn helps to visualize statistical relationships. To understand how variables in a
dataset are related to one another and how that relationship depends on other
variables, we perform statistical analysis. This statistical analysis helps to visualize the
trends and identify various patterns in the dataset.
These are the plots that help with visualization:
Line Plot
Scatter Plot
Box plot
Point plot
Count plot
Violin plot
Swarm plot
Bar plot
KDE Plot
Line plot:
A line plot is the most popular plot for drawing the relationship between x and y, with the
possibility of several semantic groupings.
Syntax : sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. Can pass data directly or reference columns
in data.
Let’s visualize the data with a line plot and pandas:
Example 1:
# import module
import seaborn as sns
import pandas
# loading csv
data = pandas.read_csv("nba.csv")
# plotting lineplot
sns.lineplot(x=data['Age'], y=data['Weight'])
Output:
Example 2: Use the hue parameter for plotting the graph.
# import module
import seaborn as sns
import pandas
# load csv and plot
data = pandas.read_csv("nba.csv")
sns.lineplot(x=data['Age'], y=data['Weight'], hue=data["Position"])
Output:
Scatter Plot:
A scatter plot can be used with several semantic groupings, which helps in understanding the
relationship between variables for continuous or categorical data. It draws a two-dimensional graph.
Syntax: seaborn.scatterplot(x=None, y=None)
Parameters:
x, y: Input data variables that should be numeric.
Returns: This method returns the Axes object with the plot drawn onto it.
Let’s visualize the data with a scatter plot and pandas:
Example 1:
# import module
import seaborn
import pandas
# load csv
data = pandas.read_csv("nba.csv")
# plotting
seaborn.scatterplot(x=data['Age'], y=data['Weight'])
Output:
Example 2: Use the hue parameter for plotting the graph.
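import seaborn
import pandas
data = pandas.read_csv("nba.csv")
# hue column assumed to be Position, as in the line plot example
seaborn.scatterplot(x=data['Age'], y=data['Weight'], hue=data['Position'])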
Output:
Box plot:
A box plot (or box-and-whisker plot) is a visual representation of groups of numerical
data through their quartiles, plotted against continuous or categorical data.
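To make the quartiles concrete, here is a minimal sketch that computes the numbers a box plot draws. It assumes the common convention that whiskers extend up to 1.5 times the interquartile range beyond the box; the sample values are made up.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
values = np.array([2, 4, 4, 5, 7, 8, 9, 10, 12, 30])

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are usually drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```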
Output:
Example 2:
# import module
import seaborn as sns
import pandas
Output:
Violin Plot:
A violin plot is similar to a box plot. It shows the distribution of quantitative data across one or more
categorical variables such that those distributions can be compared.
Syntax: seaborn.violinplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
Draw the violin plot with Pandas:
Example 1:
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
sns.violinplot(x=data['Age'])
Output:
Example 2:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.violinplot(x ="Age", y ="Weight",data = data)
Output:
Swarm plot:
A swarm plot is similar to a strip plot. It draws non-overlapping
points against categorical data.
Syntax: seaborn.swarmplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
Draw the swarm plot with Pandas:
Example 1:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv( "nba.csv" )
seaborn.swarmplot(x = data["Age"])
Output:
Example 2:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.swarmplot(x ="Age", y ="Weight",data = data)
Output:
Example 2:
# import module
import seaborn
import pandas
seaborn.set(style = 'whitegrid')
# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x ="Age", y ="Weight", data = data)
Output:
Point plot:
A point plot is used to show point estimates and confidence intervals using scatter plot
glyphs. A point plot represents an estimate of central tendency for a numeric variable by
the position of scatter plot points and provides some indication of the uncertainty around
that estimate using error bars.
Syntax: seaborn.pointplot(x=None, y=None, hue=None, data=None)
Parameters:
x, y: Inputs for plotting long-form data.
hue: (optional) column name for color encoding.
data: dataframe as a Dataset for plotting.
Return: The Axes object with the plot drawn onto it.
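The point estimate and its error bar can be sketched by hand. The example below assumes the estimate is the mean and that the interval is obtained by bootstrapping, which is one common way to get such an interval; the function name and the sample weights are hypothetical.

```python
import numpy as np

def mean_with_bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Return (mean, low, high), where low/high bound a bootstrap interval."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    boot_means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

# Hypothetical weights for one category on the x-axis.
weights = [180, 200, 210, 215, 220, 235, 240, 260]
mean, low, high = mean_with_bootstrap_ci(weights)
print(f"point estimate={mean:.1f}, interval=({low:.1f}, {high:.1f})")
```

In the plot, the marker sits at the point estimate and the error bar runs from the lower to the upper bound.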
Draw the point plot with Pandas:
Example:
# import module
import seaborn
seaborn.set(style = 'whitegrid')
Output:
Count plot:
A count plot is used to show the counts of observations in each categorical bin using bars.
Syntax : seaborn.countplot(x=None, y=None, hue=None, data=None)
Parameters :
x, y: (optional) Names of variables in data or vector data; inputs
for plotting long-form data.
hue: (optional) Column name for color encoding.
data: (optional) DataFrame, array, or list of arrays; the dataset for
plotting. If x and y are absent, this is interpreted as wide-form. Otherwise, it is
expected to be long-form.
Returns: Returns the Axes object with the plot drawn onto it.
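What a count plot displays can be sketched without any plotting library: it is simply the number of observations per category. The category values below are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical column, e.g. player positions.
positions = ["PG", "SF", "PG", "C", "SF", "PG", "PF"]

# Count how many observations fall into each category.
counts = Counter(positions)

# Print a small text version of the bars, one row per category.
for category, count in sorted(counts.items()):
    print(f"{category:>2} | {'#' * count} ({count})")
```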
Draw the count plot with Pandas:
Example:
# import module
import seaborn
seaborn.set(style = 'whitegrid')
Output:
KDE Plot:
A KDE (Kernel Density Estimate) plot is used for visualizing the probability
density of a continuous variable. It depicts the probability density at different values in a
continuous variable. We can also plot a single graph for multiple samples which helps in
more efficient data visualization.
Syntax: seaborn.kdeplot(x=None, *, y=None, vertical=False, palette=None, **kwargs)
Parameters:
x, y : vectors or keys in data
vertical : boolean (True or False)
data : pandas.DataFrame, numpy.ndarray, mapping, or sequence
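To show what a kernel density estimate computes, here is a minimal sketch that assumes a Gaussian kernel and a hand-picked bandwidth. The function name, samples, and bandwidth are assumptions for illustration; plotting libraries typically choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to about 1.
    return kernels.mean(axis=1) / bandwidth

# Hypothetical measurements of a continuous variable.
samples = [4.9, 5.1, 5.8, 6.0, 6.3, 6.5, 7.1]
grid = np.linspace(2.0, 10.0, 401)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# Approximate the area under the curve with a simple Riemann sum.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve.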
Draw the KDE plot with Pandas:
Example 1:
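# importing the required libraries
from sklearn import datasets
import pandas as pd
import seaborn as sns
# load the iris data into a DataFrame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['Target'] = iris.target
iris_df['Target'] = iris_df['Target'].replace(2, 'Iris_Virginica')
# plot call assumed; the original plotting line is not shown
sns.kdeplot(iris_df.loc[iris_df['Target'] == 'Iris_Virginica', 'sepal length (cm)'], fill=True)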
Output:
Example 2:
# import module
import seaborn as sns
import pandas
Output:
Output:
# import module
import seaborn as sns
import pandas
# read the top 5 rows
data = pandas.read_csv("nba.csv").head()
Output:
Let’s see an example of univariate data distribution:
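Example: Using a distribution plot
# import module
import seaborn as sns
import pandas
# read csv and plot
data = pandas.read_csv("nba.csv")
# distplot is deprecated in recent seaborn releases; histplot with kde=True is used instead
sns.histplot(data['Age'], kde=True)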
Output:
Data Visualization using Matplotlib
Data Visualization is the process of presenting data in the form of graphs or charts. It
helps to understand large and complex amounts of data very easily. It allows the
decision-makers to make decisions very efficiently and also allows them in identifying
new trends and patterns very easily. It is also used in high-level data analysis for Machine
Learning and Exploratory Data Analysis (EDA). Data visualization can be done with
various tools like Tableau, Power BI, Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib
library of Python.
Matplotlib
Matplotlib is a low-level Python library used for data visualization. It is easy
to use and emulates MATLAB-like graphs and visualizations. The library is built on
top of NumPy arrays and provides several plot types such as line charts, bar charts, and histograms.
It provides a lot of flexibility, but at the cost of writing more code.
Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is
designed to be as usable as MATLAB, with the ability to use Python and the advantage
of being free and open-source. Each pyplot function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc. The various plots we can utilize using Pyplot are Line
Plot, Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
Now that we have had a brief introduction to Matplotlib and pyplot, let’s see how to create a simple plot.
Example:
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
plt.show()
Output:
Adding Title
The title() method in matplotlib module is used to specify the title of the visualization
depicted and displays the title using various attributes.
Syntax:
matplotlib.pyplot.title(label, fontdict=None, loc='center', pad=None, **kwargs)
Example:
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Linear graph")
plt.show()
Output:
We can also change the appearance of the title by using the parameters of this function.
Example:
import matplotlib.pyplot as plt
plt.plot([10, 20, 30, 40], [20, 25, 35, 55])
plt.title("Linear graph", fontsize=25, color="green")
plt.show()
Output:
Note: For more information about adding the title and its customization,
refer Matplotlib.pyplot.title() in Python
Adding X Label and Y Label
In layman’s terms, the X label and the Y label are the titles given to X-axis and Y-axis
respectively. These can be added to the graph by using the xlabel() and ylabel() methods.
Syntax:
matplotlib.pyplot.xlabel(xlabel, fontdict=None, labelpad=None, **kwargs)
matplotlib.pyplot.ylabel(ylabel, fontdict=None, labelpad=None, **kwargs)
Example:
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
plt.show()
Output:
Setting Limits and Tick labels
You might have seen that Matplotlib automatically sets the values and the
markers(points) of the X and Y axis, however, it is possible to set the limit and markers
manually. xlim() and ylim() functions are used to set the limits of the X-axis and Y-axis
respectively. Similarly, xticks() and yticks() functions are used to set tick labels.
Example: In this example, we will be changing the limit of Y-axis and will be setting the
labels for X-axis.
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
# Setting the limit of y-axis
plt.ylim(0, 80)
# setting the labels of x-axis
plt.xticks(x, labels=["one", "two", "three", "four"])
plt.show()
Output:
Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the
data displayed in the graph’s Y-axis. It generally appears as the box containing a small
sample of each color on the graph and a small description of what this data means.
The attribute bbox_to_anchor=(x, y) of legend() function is used to specify the
coordinates of the legend, and the attribute ncol represents the number of columns that
the legend has. Its default value is 1.
Syntax:
matplotlib.pyplot.legend(["name1", "name2"], bbox_to_anchor=(x, y), ncol=1)
Example:
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
# Setting the limit of y-axis
plt.ylim(0, 80)
# setting the labels of x-axis
plt.xticks(x, labels=["one", "two", "three", "four"])
# Adding legends
plt.legend(["GFG"])
plt.show()
Output:
Before moving any further with Matplotlib let’s discuss some important classes that will
be used further in the tutorial. These classes are –
Figure
Axes
Note: Matplotlib takes care of creating built-in defaults such as Figure and Axes.
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It
is a top-level container that contains one or more axes. A figure can be created using
the figure() method.
Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None, facecolor=None,
edgecolor=None, linewidth=0.0, frameon=None, subplotpars=None, tight_layout=None,
constrained_layout=None)
Example:
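# Python program to show pyplot module
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# create a figure and add axes to it (data and sizes assumed for illustration)
fig = plt.figure(figsize=(7, 5))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x, y)
# Adding legends
plt.legend(["GFG"])
plt.show()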
Output:
Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given figure may
contain many axes, but a given axes can only be present in one figure. The axes()
function creates the axes object.
Syntax:
axes([left, bottom, width, height])
Just like the pyplot module, the Axes class also provides methods for adding titles, legends, limits,
labels, etc. Let’s see a few of them –
Adding Title – ax.set_title()
Adding X Label and Y label – ax.set_xlabel(), ax.set_ylabel()
Setting Limits – ax.set_xlim(), ax.set_ylim()
Tick labels – ax.set_xticklabels(), ax.set_yticklabels()
Adding Legends – ax.legend()
Example:
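import matplotlib.pyplot as plt
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# Creating the figure and adding axes (position values assumed)
fig = plt.figure(figsize=(5, 4))
ax = fig.add_axes([0.15, 0.15, 0.75, 0.75])
# Plotting two lines so that both legend labels are used
ax.plot(x, y)
ax.plot(y, x)
# Setting Title
ax.set_title("Linear Graph")
# Setting Label
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")
# Adding Legend
ax.legend(labels = ('line 1', 'line 2'))
plt.show()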
Output:
Multiple Plots
We have learned about the basic components of a graph that can be added so that it can
convey more information. One method can be by calling the plot function again and
again with a different set of values as shown in the above example. Now let’s see how to
plot multiple graphs using some functions and also how to plot subplots.
Method 1: Using the add_axes() method
The add_axes() method is used to add axes to the figure. It is a method of the Figure class.
Syntax:
add_axes(self, *args, **kwargs)
plt.show()
Output:
Method 2: Using subplot() method.
This method adds another plot at the specified grid position in the current figure.
Syntax:
subplot(nrows, ncols, index, **kwargs)
subplot(pos, **kwargs)
subplot(ax)
Output:
Method 3: Using subplots() method
This function is used to create figures and multiple subplots at the same time.
Syntax:
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False,
squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)
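As a minimal sketch of this function, the following creates one figure with two side-by-side subplots that share the Y-axis. The data values and figure size are assumptions for illustration, and the non-interactive Agg backend is selected only so that the sketch also runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# One call returns the figure and an array holding one Axes object per subplot.
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(8, 3))

axes[0].plot(x, y)
axes[0].set_title("Line")

axes[1].scatter(x, y)
axes[1].set_title("Scatter")

fig.tight_layout()

# Render to an in-memory PNG instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

With a single row or column, subplots() returns a one-dimensional array of axes; with several rows and columns it returns a two-dimensional array indexed as axes[row, col].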
Example:
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
Output:
Method 4: Using subplot2grid() method
This function creates axes object at a specified location inside a grid and also helps in
spanning the axes object across multiple rows or columns. In simpler words, this function
is used to create multiple charts within the same figure.
Syntax:
plt.subplot2grid(shape, location, rowspan, colspan)
Example:
import matplotlib.pyplot as plt
# two axes stacked in a 7 x 1 grid, each spanning two rows
axes1 = plt.subplot2grid(
    (7, 1), (0, 0), rowspan = 2, colspan = 1)
axes2 = plt.subplot2grid(
    (7, 1), (2, 0), rowspan = 2, colspan = 1)
plt.show()
Output:
Line Chart
A line chart is one of the basic plots and can be created using the plot() function. It is used
to represent the relationship between two variables, X and Y, plotted on different axes.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Example:
# Adding label on the y-axis
plt.ylabel('Y-Axis')
plt.show()
Output:
Let’s see how to customize the above-created line chart. We will be using the following
properties –
color: Changing the color of the line
linewidth: Customizing the width of the line
marker: For changing the style of actual plotted point
markersize: For changing the size of the markers
linestyle: For defining the style of the plotted line
Different Linestyle available
Character    Definition
-            Solid line
--           Dashed line
-.           Dash-dot line
:            Dotted line
Different markers available
Character    Definition
.            Point marker
o            Circle marker
,            Pixel marker
v            triangle_down marker
^            triangle_up marker
<            triangle_left marker
>            triangle_right marker
1            tri_down marker
2            tri_up marker
3            tri_left marker
4            tri_right marker
s            square marker
p            pentagon marker
*            star marker
h            hexagon1 marker
H            hexagon2 marker
+            Plus marker
x            X marker
D            Diamond marker
d            thin_diamond marker
|            vline marker
_            hline marker
Example:
import matplotlib.pyplot as plt
plt.show()
Output:
Bar Chart
A bar chart is a graph that represents categories of data with rectangular bars whose
lengths or heights are proportional to the values they represent. Bar plots
can be drawn horizontally or vertically. A bar chart describes comparisons between
discrete categories. It can be created using the bar() method.
In the example below, we will use the tips dataset. The tips dataset is a record of the tips
given by customers in a restaurant over two and a half months in the early 1990s. It
contains 7 columns: total_bill, tip, sex, smoker, day, time, and size.
Example:
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
Customization that is available for the Bar Chart –
color: For the bar faces
edgecolor: Color of edges of the bar
linewidth: Width of the bar edges
width: Width of the bar
Example:
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
Note: The lines in between the bars refer to the different values in the Y-axis of the
particular value of the X-axis.
Histogram
# Adding label on the x-axis
plt.xlabel('Total Bill')
plt.show()
Output:
plt.show()
Output:
Scatter Plot
plt.show()
Output:
plt.show()
Output:
Pie Chart
Pie chart is a circular chart used to display only one series of data. The area of slices of
the pie represents the percentage of the parts of the data. The slices of pie are called
wedges. It can be created using the pie() method.
Syntax:
matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None, autopct=None,
shadow=False)
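The relation between the data and the wedges can be sketched with plain arithmetic: each wedge takes the same share of the circle as its value takes of the total. The function name and the values below are made up for illustration.

```python
def wedge_percentages(values):
    """Return the percentage of the whole that each value represents."""
    total = sum(values)
    return [100.0 * v / total for v in values]

# Hypothetical series of four parts.
values = [30, 50, 20, 100]
percentages = wedge_percentages(values)

# Each wedge covers the same fraction of the full 360 degrees.
angles = [360.0 * p / 100.0 for p in percentages]

for value, pct, angle in zip(values, percentages, angles):
    print(f"value={value:>3}  share={pct:.1f}%  angle={angle:.1f} degrees")
```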
Example:
import matplotlib.pyplot as plt
import pandas as pd
plt.show()
Output:
Customizations that are available for the Pie chart are –
explode: Moving the wedges of the plot
autopct: Label the wedge with their numerical value.
colors: Used to provide colors for the wedges.
shadow: Used to create shadow of wedge.
Python Data Types
Data types are the classification or categorization of data items. It represents the kind of
value that tells what operations can be performed on a particular data. Since everything is
an object in Python programming, data types are actually classes and variables are
instance (object) of these classes.
The following are the standard built-in data types of Python:
Numeric
Sequence Type
Boolean
Set
Dictionary
Numeric
In Python, numeric data types represent data that has a numeric value. A numeric
value can be an integer, a floating-point number, or even a complex number. These values are
defined by the int, float, and complex classes in Python.
Integers – This value is represented by int class. It contains positive or negative
whole numbers (without fraction or decimal). In Python there is no limit to how long
an integer value can be.
Float – This value is represented by float class. It is a real number with floating
point representation. It is specified by a decimal point. Optionally, the character e or
E followed by a positive or negative integer may be appended to specify scientific
notation.
Complex Numbers – Complex number is represented by complex class. It is
specified as (real part) + (imaginary part)j. For example – 2+3j
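A few of the properties listed above can be checked directly. This is a small sketch with arbitrarily chosen values: an integer far larger than 64 bits, a float written in scientific notation, and the two parts of a complex number.

```python
# Integers have arbitrary precision, so this does not overflow.
big = 2 ** 100
print(big)

# Scientific notation: 1.5e-3 means 1.5 * 10**-3.
small = 1.5e-3
print(small)

# A complex number exposes its real and imaginary parts as floats.
z = 2 + 3j
print(z.real, z.imag)
```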
# Python program to
# demonstrate numeric value
a=5
print("Type of a: ", type(a))
b = 5.0
print("\nType of b: ", type(b))
c = 2 + 4j
print("\nType of c: ", type(c))
Output:
Type of a:  <class 'int'>

Type of b:  <class 'float'>

Type of c:  <class 'complex'>
Sequence Type
In Python, a sequence is an ordered collection of items of similar or different data types.
Sequences allow storing multiple values in an organized and efficient fashion. There
are several sequence types in Python –
String
List
Tuple
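The three sequence types share indexing, slicing, and len(), but differ in mutability. The short sketch below uses made-up values to show this.

```python
text = "data"          # string: immutable sequence of characters
numbers = [3, 1, 2]    # list: mutable sequence
point = (4.0, 5.0)     # tuple: immutable sequence

# All sequences support indexing, slicing, and len().
print(text[0], numbers[-1], point[1], len(text))
print(text[1:3])

# Lists can be changed in place.
numbers.append(7)

# Tuples cannot; assigning to an element raises TypeError.
try:
    point[0] = 1.0
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)
```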
Boolean
A data type with one of two built-in values, True or False. Boolean objects that are equal
to True are truthy (true), and those equal to False are falsy (false). Non-Boolean
objects can also be evaluated in a Boolean context and determined to be true or false.
The type is denoted by the class bool.
# Python program to
# demonstrate boolean type
print(type(True))
print(type(False))
Output:
<class 'bool'>
<class 'bool'>
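The remark that non-Boolean objects can be evaluated in a Boolean context can be sketched as follows. The sample values are arbitrary; bool() returns the truth value that an if statement would use:

```python
# Zero, empty containers and None are falsy; most other objects are truthy.
values = [0, 1, "", "text", [], [0], None]
truthiness = [bool(v) for v in values]

for v, t in zip(values, truthiness):
    print(repr(v), "->", t)
```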
Time Series
Time series data is an important form of structured data in many different fields, such as finance, economics,
ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time
series. Many time series are fixed frequency, which is to say that data points occur at regular intervals
according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units. How you mark and refer to time series
data depends on the application, and you may have one of the following:
• Timestamps, specific instants in time
• Fixed periods, such as the month January 2007 or the full year 2010
• Intervals of time, indicated by a start and end timestamp. Periods can be thought
of as special cases of intervals
• Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time (e.g., the
diameter of a cookie baking each second since being placed in the oven)
In this chapter, I am mainly concerned with time series in the first three categories, though many of the
techniques can be applied to experimental time series where the index may be an integer or floating-point
number indicating elapsed time from the start of the experiment. The simplest and most widely used kind of
time series are those indexed by timestamp.
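To make the idea of a fixed-frequency, timestamp-indexed series concrete, here is a minimal sketch using only the standard library. The start time and the placeholder values are assumptions chosen for illustration, not data from the text:

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 1)     # hypothetical start time
step = timedelta(seconds=15)     # fixed frequency: one point every 15 seconds

# Timestamps at regular intervals from the start time.
timestamps = [start + i * step for i in range(4)]

# A plain dict keyed by timestamp; the values are placeholders.
series = {ts: i for i, ts in enumerate(timestamps)}

# The same points expressed as elapsed time (in seconds) since the start.
elapsed = [(ts - start).total_seconds() for ts in timestamps]

print(timestamps[-1])
print(elapsed)
```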
11.1 Date and Time Data Types and Tools
The Python standard library includes data types for date and time data, as well as calendar-related
functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type,
or simply datetime, is widely used:
In [10]: from datetime import datetime
In [11]: now = datetime.now()
In [12]: now
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)
In [13]: now.year, now.month, now.day
Out[13]: (2017, 9, 25)
datetime stores both the date and time down to the microsecond. timedelta represents
the temporal difference between two datetime objects:
In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
In [15]: delta
Out[15]: datetime.timedelta(926, 56700)
In [16]: delta.days
Out[16]: 926
In [17]: delta.seconds
Out[17]: 56700
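A timedelta stores whole days and the leftover seconds separately, so delta.seconds above is not the full duration. As a short sketch, the total can be obtained with total_seconds() and checked by hand:

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

# Full duration in seconds, as a float.
total = delta.total_seconds()

# The same figure rebuilt from the two stored fields (86400 seconds per day).
manual = delta.days * 86400 + delta.seconds

print(total, manual)
```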
You can add (or subtract) a timedelta or multiple thereof to a datetime object to
yield a new shifted object:
In [18]: from datetime import timedelta
In [19]: start = datetime(2011, 1, 7)
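Written as a plain script rather than an interactive session, shifting a datetime by a timedelta, or by a multiple of one, might look like the following sketch. The 12-day offset and the names later and earlier are chosen here only for illustration:

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)

# The first positional argument of timedelta is a number of days.
later = start + timedelta(12)

# A timedelta can be multiplied by an integer before being subtracted.
earlier = start - 2 * timedelta(12)

print(later)
print(earlier)
```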
In [26]: datetime.strptime(value, '%Y-%m-%d')
Out[26]: datetime.datetime(2011, 1, 3, 0, 0)
In [27]: datestrs = ['7/6/2011', '8/6/2011']
In [28]: [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Out[28]:
[datetime.datetime(2011, 7, 6, 0, 0),
datetime.datetime(2011, 8, 6, 0, 0)]
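The format codes used by strptime are the same ones used by strftime, so a value can be formatted and parsed back with one format string. The sketch below also shows that a string which does not match the format raises ValueError; the variable names are only illustrative:

```python
from datetime import datetime

stamp = datetime(2011, 1, 3)

# Format to text and parse back using the same format string.
as_text = stamp.strftime('%Y-%m-%d')
parsed = datetime.strptime(as_text, '%Y-%m-%d')

# A string in a different layout does not match and raises ValueError.
try:
    datetime.strptime('7/6/2011', '%Y-%m-%d')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(as_text, parsed, mismatch_raised)
```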
datetime.strptime is a good way to parse a date with a known format. However, it
can be a bit annoying to have to write a format spec each time, especially for common
date formats. In this case, you can use the parser.parse method in the third-party
dateutil package (this is installed automatically when you install pandas):
In [29]: from dateutil.parser import parse
In [30]: parse('2011-01-03')
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)
dateutil is capable of parsing most human-intelligible date representations:
In [31]: parse('Jan 31, 1997 10:45 PM')
Out[31]: datetime.datetime(1997, 1, 31, 22, 45)
In international locales, day appearing before month is very common, so you can pass
dayfirst=True to indicate this:
In [32]: parse('6/12/2011', dayfirst=True)
Out[32]: datetime.datetime(2011, 12, 6, 0, 0)
pandas is generally oriented toward working with arrays of dates, whether used as an
axis index or a column in a DataFrame. The to_datetime method parses many different
kinds of date representations. Standard date formats like ISO 8601 can be
parsed very quickly:
In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It also handles values that should be considered missing (None, empty string, etc.):
In [35]: idx = pd.to_datetime(datestrs + [None])
In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
In [37]: idx[2]
Out[37]: NaT
In [38]: pd.isnull(idx)
Out[38]: array([False, False, True], dtype=bool)
NaT (Not a Time) is pandas’s null value for timestamp data.
dateutil.parser is a useful but imperfect tool. Notably, it will recognize
some strings as dates that you might prefer that it didn’t—
for example, '42' will be parsed as the year 2042 with today’s calendar
date.
datetime objects also have a number of locale-specific formatting options for systems
in other countries or languages. For example, the abbreviated month names will be
different on German or French systems compared with English systems. See
Table 11-3 for a listing.
Table 11-3. Locale-specific date formatting
Type  Description
%a    Abbreviated weekday name
%A    Full weekday name
%b    Abbreviated month name
%B    Full month name
%c    Full date and time (e.g., 'Tue 01 May 2012 04:20:57 PM')
%p    Locale equivalent of AM or PM
%x    Locale-appropriate formatted date (e.g., in the United States, May 1, 2012 yields '05/01/2012')
%X    Locale-appropriate time (e.g., '04:24:12 PM')
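A few of the codes in Table 11-3 can be tried with strftime. This is only a sketch: the date is taken from the table's own example, and the strings produced depend on the locale in effect, so the English names noted in the comments assume that the program has not changed Python's default locale:

```python
from datetime import datetime

moment = datetime(2012, 5, 1, 16, 20, 57)

# Format the same moment with several locale-specific codes.
# With the default locale this gives English names such as 'Tue' and 'May'.
names = {code: moment.strftime(code) for code in ('%a', '%A', '%b', '%B', '%p')}

for code, text in names.items():
    print(code, text)
```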