Logistic Regression

Logistic Regression in Python – Theory and Code Example with Explanation

In Machine Learning, we often need to solve problems that require one of the two possible answers, for
example in the medical domain, we might be looking to find whether a tumor is malignant or benign
and similarly in the education domain, we might want to see whether a student gets admission in a
specific university or not.
Such problems are binary classification problems, and logistic regression is a very popular algorithm for
solving them. With that context, it is easier to understand what logistic regression is.
What is Logistic Regression?
Logistic Regression is a Machine Learning algorithm used to make predictions to find the value of a dependent
variable such as the condition of a tumor (malignant or benign), classification of email (spam or not spam), or
admission into a university (admitted or not admitted) by learning from independent variables (various features
relevant to the problem).
For example, for classifying an email, the algorithm will use the words in the email as features and based on that
make a prediction whether the email is spam or not.
Logistic Regression is a supervised Machine Learning algorithm, which means the data provided for training is
labeled i.e., answers are already provided in the training set. The algorithm learns from those examples and their
corresponding answers (labels) and then uses that to classify new examples.
In mathematical terms, suppose the dependent variable is Y and the set of independent variables is X. Logistic
regression then predicts the probability P(Y=1) as a function of X, the set of independent variables.
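As a minimal sketch of how P(Y=1) can be computed from X, the snippet below applies the sigmoid function to a weighted sum of the features. The feature values, weights, and bias are hypothetical and chosen only for illustration; in practice they would be learned from the training data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    """Return P(Y=1 | X) for each row of X."""
    return sigmoid(X @ weights + bias)

# Hypothetical values, chosen only for illustration
X = np.array([[0.5, 1.2], [2.0, -0.3], [-1.0, -1.0]])
weights = np.array([1.5, -0.8])
bias = 0.1

probs = predict_proba(X, weights, bias)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5 to get a class
print(probs)
print(labels)
```

The 0.5 threshold is a common default rather than a requirement; a different cut-off can be used when one kind of error is more costly than the other.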
 
Types of Logistic Regression
When we talk about Logistic Regression in general, we usually mean Binary logistic regression, although there are
other types of Logistic Regression as well.
Logistic Regression can be divided into types based on the type of classification it does. With that in view, there are
3 types of Logistic Regression. Let’s talk about each of them:
 Binary Logistic Regression
 Multinomial Logistic Regression
 Ordinal Logistic Regression
 
Binary Logistic Regression
Binary Logistic Regression is the most commonly used type. It is the type we already discussed when defining
Logistic Regression. In this type, the dependent/target variable has two distinct values, either 0 or 1, malignant or
benign, passed or failed, admitted or not admitted.
 
Multinomial Logistic Regression
Multinomial Logistic Regression deals with cases when the target (dependent) variable has three or more possible
values. For example, the use of Chest X-ray images as features that give indication about one of the three possible
outcomes (No disease, Viral Pneumonia, COVID-19). The multinomial Logistic Regression will use the features to
classify the example into one of the three possible outcomes in this case. There can of course be more than three
possible values of the target variable.
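One common way to turn per-class scores into probabilities in the multinomial case is the softmax function. The sketch below assumes a model has already produced one score per class; the scores are hypothetical and the class names are taken from the X-ray example above.

```python
import numpy as np

def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

classes = ["No disease", "Viral Pneumonia", "COVID-19"]
scores = np.array([2.0, 0.5, 1.0])   # hypothetical scores, for illustration only

probs = softmax(scores)
predicted = classes[int(np.argmax(probs))]   # pick the most probable class
print(dict(zip(classes, probs.round(3))))
print(predicted)
```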
 
Ordinal Logistic Regression
Ordinal Logistic Regression is used in cases when the target variable is of ordinal nature. In this type, the categories
are ordered in a meaningful manner and each category has quantitative significance. Moreover, the target variable
has more than two categories. For example, the grades obtained on an exam have categories that have quantitative
significance and they are ordered. Keeping it simple, the grades can be A, B, or C.
 
Difference between Logistic and Linear Regression
The major difference between Logistic and Linear Regression is that Linear Regression is used to solve regression
problems whereas Logistic Regression is used for classification problems. In regression problems, the target variable
can have continuous values such as the price of a product, the age of a participant, etc., whereas classification
problems deal with the prediction of a target variable that can only have discrete values, for example, prediction of
the gender of a person, prediction of a tumor to be malignant or benign, etc.
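The contrast can be sketched side by side: the same weighted input gives an unbounded continuous value in the linear case, and a discrete class once it is passed through the sigmoid and thresholded. The coefficients w and b are hypothetical.

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
w, b = 1.5, 0.5   # hypothetical coefficients, for illustration only

linear_output = w * x + b                          # continuous values
probability = 1.0 / (1.0 + np.exp(-(w * x + b)))   # squashed into (0, 1)
class_label = (probability >= 0.5).astype(int)     # discrete values

print(linear_output)
print(class_label)
```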
 
In what type of problems does logistic regression work best?
Logistic Regression is most commonly used in problems of binary classification in which the algorithm predicts one
of the two possible outcomes based on various features relevant to the problem.

Logistic Regression finds applications in a wide range of domains and fields. The following examples
highlight its importance:
Education sector: In the Education sector, logistic regression can be used to predict:
 Whether a student gets admission into a university program or not, based on test scores and various other factors.
 Whether a student on an E-learning platform will complete a course on time or not, based on past activity and other
statistics relevant to the problem.
Business sector: In the business sector, logistic regression has the following applications:
 Predicting whether a credit card transaction made by a user is fraudulent or not.
Medical sector: The medical sector also benefits from logistic regression through the following uses:
 Predicting whether a person has a disease or not, based on values obtained from test reports or other factors in
general.
 A very innovative application of Machine Learning being used by researchers is to predict whether a person has
COVID-19 or not using Chest X-ray images.
Other applications: Logistic regression finds applications in all major sectors. In addition, some of its
interesting applications are:
 Email Classification – Spam or not spam
 Sentiment Analysis – Person is sad or happy based on a text message
 Object Detection and Classification – Classifying an image to be a cat image or a dog image
There are numerous other problems that can be solved using Logistic Regression. The above-mentioned examples
should be enough to give you an idea of how powerful and useful this algorithm is.
 
Example of Algorithm based on Logistic Regression and its implementation in Python
Now that the basic concepts about Logistic Regression are clear, it is time to study a real-life application of Logistic
Regression and implement it in Python.
Let’s work on classifying credit card transactions as fraudulent, also called credit card fraud detection. It is a very
important application of Logistic Regression being used in the business sector. A real-world dataset will be used for
this problem. It is quite a comprehensive dataset having information of over 280,000 transactions. Step by step
instructions will be provided for implementing the solution using logistic regression in Python.
So let’s get started:
Step 1 – Doing Imports
The first step is to import the libraries that are going to be used later. If you do not have them installed, you would
have to install them using pip or any other package manager for python.
 
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn import preprocessing

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Step 2 – The Data


The second step is to get data that is going to be used for the analysis and then perform preprocessing steps on the
data.
Step 2.1 – Downloading the Data
The dataset to be used in this example can be downloaded from Kaggle. After downloading, the archive would have
to be extracted and the CSV file would be obtained.
Step 2.2 – Loading the data using Pandas
The CSV file is placed in the same directory as the jupyter notebook (or code file), and then the following code can
be used to load the dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

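# Load the dataset (the CSV file must be in the working directory)
dataset = pd.read_csv('User_Data.csv')

# Input features: columns 2 and 3
x = dataset.iloc[:, [2, 3]].values

# Output label: column 4
y = dataset.iloc[:, 4].values

# Hold out 25% of the rows for testing
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Scale the features; the scaler is fitted on the training set only
sc_x = StandardScaler()
xtrain = sc_x.fit_transform(xtrain)
xtest = sc_x.transform(xtest)

print(xtrain[0:10, :])

# Train the classifier
classifier = LogisticRegression(random_state=0)
classifier.fit(xtrain, ytrain)

# Predict on the test set and evaluate the predictions
y_pred = classifier.predict(xtest)

cm = confusion_matrix(ytest, y_pred)

print("Confusion Matrix : \n", cm)

print("Accuracy : ", accuracy_score(ytest, y_pred))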
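To show how the accuracy relates to the confusion matrix, here is a small sketch that derives accuracy, precision, and recall from a 2x2 matrix. The counts are hypothetical and are not results from the dataset above; the layout follows scikit-learn's convention of true classes in rows and predicted classes in columns.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
cm = np.array([[65, 3],
               [8, 24]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found

print(accuracy, precision, recall)
```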

 
Conclusion
There are different types of Logistic Regression, but the most widely used is the binary logistic regression in which
the classification takes place on one of the two possible values of the target variable. Logistic Regression is different
from Linear Regression because it is a classification algorithm and has discrete values as classification output, while
Linear Regression is a Regression algorithm having continuous values as output. We also covered that Logistic
Regression finds its use in a wide range of applications including the classification tasks in Business, Education, and
Medical industries.
A real-life example of Logistic Regression was studied. The analysis involved over 280,000 instances of
transactions which were further divided into training and test sets by a ratio of 80 to 20 respectively. After exploring
and preprocessing the dataset, the model was trained and a classification accuracy of 99.9% was obtained. It showed
that Logistic Regression was very successful in detecting fraudulent transactions, although further improvement can
be made by tuning the model (an advanced topic).

Linear Regression (Python Implementation)
This article discusses the basics of linear regression and its implementation in the Python
programming language.
Linear regression is a statistical method for modeling the relationship between a dependent
variable and a given set of independent variables.
Note: In this article, we refer to dependent variables as responses and independent
variables as features for simplicity.
In order to provide a basic understanding of linear regression, we start with the most
basic version of linear regression, i.e. Simple linear regression. 

Simple Linear Regression


Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value(y) as accurately as possible as a function of the
feature or independent variable(x).
Let us consider a dataset where we have a value of response y for every feature x: 

x: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
y: 1, 3, 2, 5, 7, 8, 8, 9, 10, 12

For generality, we define:


x as feature vector, i.e., x = [x_1, x_2, ..., x_n],
y as response vector, i.e., y = [y_1, y_2, ..., y_n]
for n observations (in the above example, n = 10).
A scatter plot of the above dataset looks like:

Now, the task is to find the line that best fits the above scatter plot so that we can
predict the response for any new feature value (i.e., a value of x not present in the dataset).
This line is called a regression line.
The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 x_i

Here,  
 h(x_i) represents the predicted response value for ith observation.
 b_0 and b_1 are regression coefficients and represent y-intercept and slope of
regression line respectively.
To create our model, we must “learn” or estimate the values of regression coefficients
b_0 and b_1. And once we’ve estimated these coefficients, we can use the model to
predict responses!
In this article, we are going to use the principle of  Least Squares.
Now consider:

y_i = b_0 + b_1 x_i + e_i = h(x_i) + e_i, so that e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation. 


So, our aim is to minimize the total residual error.
We define the squared error or cost function, J as: 

J(b_0, b_1) = \frac{1}{2n} \sum_{i=1}^{n} e_i^2

and our task is to find the value of b_0 and b_1 for which J(b_0,b_1) is minimum!
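Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = \bar{y} - b_1 \bar{x}

where SS_xy is the sum of cross-deviations of y and x: 

SS_xy = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} y_i x_i - n \bar{x} \bar{y}

and SS_xx is the sum of squared deviations of x: 

SS_xx = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2
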
Note: The complete derivation for finding least squares estimates in simple linear
regression can be found here.
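The idea that the least squares estimates minimize the cost can be checked numerically. The sketch below assumes the common 1/(2n) scaling of the squared-error cost (the scaling does not change where the minimum lies) and uses the same ten observations as the implementation in this article.

```python
import numpy as np

def cost(b_0, b_1, x, y):
    """Squared-error cost J(b_0, b_1) = (1 / 2n) * sum of squared residuals."""
    residuals = y - (b_0 + b_1 * x)
    return np.sum(residuals ** 2) / (2 * len(x))

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Least squares estimates from the deviation sums SS_xy and SS_xx
m_x, m_y = x.mean(), y.mean()
b_1 = np.sum((x - m_x) * (y - m_y)) / np.sum((x - m_x) ** 2)
b_0 = m_y - b_1 * m_x

best = cost(b_0, b_1, x, y)

# Nudge each coefficient up and down; the cost should only get larger
nearby = [cost(b_0 + d0, b_1 + d1, x, y)
          for d0, d1 in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]
print(best, min(nearby))
```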
Code: Python implementation of above technique on our small dataset 

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):


    # number of observations/points
    n = np.size(x)

    # mean of x and y vector


    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x


    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients


    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
               marker = "o", s = 30)

    # predicted response vector


    y_pred = b[0] + b[1]*x

    # plotting the regression line


    plt.plot(x, y_pred, color = "g")

    # putting labels


    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot


    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients


    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting regression line


    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output: 
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
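As a cross-check, the same line can be fitted with NumPy's built-in polynomial fit. This is only a sketch for verifying the hand-written estimator on the ten observations used in this article; np.polyfit with deg=1 returns the slope first and the intercept second.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# np.polyfit returns coefficients from the highest degree to the lowest
slope, intercept = np.polyfit(x, y, deg=1)
print(round(float(intercept), 4), round(float(slope), 4))   # 1.2364 1.1697
```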
And graph obtained looks like
this:  

import numpy as np

a = np.array([[1, 2, 4], [5, 8, 7]], dtype = 'float')


print (a)
print(type(a))

b = np.array((1 , 3, 2))
print (b)
c = np.zeros((3, 4))
print( c)

f = np.arange(0, 30, 5)
print (f)

g = np.linspace(0, 5, 10)
print (g)

arr = np.array([[1, 2, 3, 4],


[5, 2, 4, 2],
[1, 2, 0, 1]])

print(arr.ndim)

newarr = arr.reshape(2, 2, 3)

print ( arr)
print ( newarr)

# np.arange (single "r") creates evenly spaced integers
a1 = np.arange(8)
print(a1)

a2 = np.arange(8).reshape(2, 4)
print(a2)

a3 = np.arange(8).reshape(4, 2)
print(a3)

Data Visualization with Python Seaborn


Data Visualization is the presentation of data in pictorial format. It is extremely important
for Data Analysis, and Python is well suited to it, primarily because of its fantastic ecosystem of data-centric
packages. Visualization helps to understand the data, however complex it is, by summarizing and
presenting a huge amount of data in a simple and easy-to-understand format, and it helps
communicate information clearly and effectively.
Pandas and Seaborn are two of those packages, and they make importing and analyzing data
much easier. In this article, we will use Pandas and Seaborn to analyze data.
Pandas
Pandas offers tools for cleaning and processing your data. It is the most popular Python
library used for data analysis. In pandas, a data table is called a DataFrame.

So, let’s start by creating a Pandas DataFrame:


Example 1:

import pandas as pd

# initialise data of lists.


data = {'Name':[ 'Mohe' , 'Karnal' , 'Yrik' , 'jack' ],
        'Age':[ 30 , 21 , 29 , 28 ]}

# Create DataFrame
df = pd.DataFrame( data )

# Print the output.


df

 Output:

 
Example 2: load the CSV data from the system and display it through pandas.
# import module
import pandas

# load the csv
data = pandas.read_csv("nba.csv")

# show the first 5 rows
data.head()

 
 Output:

 

Seaborn
 Seaborn is an amazing visualization library for statistical graphics plotting in Python. It
is built on top of the matplotlib library and is closely integrated with the data structures
from pandas.
 Installation
 For python environment : 
 pip install seaborn
 For conda environment : 
 conda install seaborn
 Let’s create Some basic plots using seaborn:

# Importing libraries
import numpy as np
import seaborn as sns
# Selecting style as white,
# dark, whitegrid, darkgrid
# or ticks
sns.set( style = "white" )

# Generate a random univariate


# dataset
rs = np.random.RandomState( 10 )
d = rs.normal( size = 50 )

# Plot a simple histogram with a KDE overlay,
# with the bin size determined automatically
# (histplot replaces the deprecated distplot)
sns.histplot(d, kde = True, color = "g")

Output:

Seaborn: statistical data visualization
Seaborn helps to visualize statistical relationships. To understand how variables in a
dataset are related to one another, and how that relationship depends on other
variables, we perform statistical analysis. This statistical analysis helps to visualize
trends and identify various patterns in the dataset.
The following plots help with this visualization:
 Line Plot
 Scatter Plot
 Box plot
 Point plot
 Count plot
 Violin plot
 Swarm plot
 Bar plot
 KDE Plot

Line plot:

Lineplot is the most popular plot for drawing a relationship between x and y, with the
possibility of several semantic groupings.
Syntax : sns.lineplot(x=None, y=None)
Parameters:
x, y: Input data variables; must be numeric. You can pass data directly or reference columns
in data.
Let’s visualize the data with a line plot and pandas:
Example 1:
# import module
import seaborn as sns
import pandas

# loading csv
data = pandas.read_csv("nba.csv")

# plotting lineplot (x and y are passed by keyword)
sns.lineplot(x="Age", y="Weight", data=data)

 Output:

 
 Example 2: Use the hue parameter for plotting the graph.

# import module
import seaborn as sns
import pandas

# read the csv data
data = pandas.read_csv("nba.csv")

# plot, coloring the lines by the Position column
sns.lineplot(x="Age", y="Weight", hue="Position", data=data)

Output:

Scatter Plot:

Scatterplot draws a two-dimensional graph of one variable against another. It can be used with several
semantic groupings, which help to understand the relationship between continuous or categorical data.
Syntax: seaborn.scatterplot(x=None, y=None)
Parameters:
x, y: Input data variables that should be numeric.
Returns: This method returns the Axes object with the plot drawn onto it.
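Because the function returns a Matplotlib Axes object, the plot can be customized further after it is drawn. The following is a minimal sketch using a small in-memory DataFrame; the values are hypothetical and only the column names mirror the nba.csv examples. The Agg backend is selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import pandas as pd
import seaborn as sns

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Age": [22, 25, 28, 31, 34],
    "Weight": [180, 190, 200, 215, 225],
    "Position": ["PG", "SG", "PG", "C", "C"],
})

# scatterplot returns the Axes it drew on
ax = sns.scatterplot(data=df, x="Age", y="Weight", hue="Position")

# The returned Axes accepts the usual Matplotlib calls
ax.set_title("Weight vs. Age")
print(type(ax).__name__)
```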
Let’s visualize the data with a scatter plot and pandas:
Example 1:
# import module
import seaborn
import pandas

# load csv
data = pandas.read_csv("nba.csv")

# plotting (x and y are passed by keyword)
seaborn.scatterplot(x="Age", y="Weight", data=data)

 Output:

 
 Example 2: Use the hue parameter for plotting the graph.

import seaborn
import pandas

data = pandas.read_csv("nba.csv")

# color the points by the Position column
seaborn.scatterplot(x="Age", y="Weight", hue="Position", data=data)

 
 
Output:
 

Box plot:

A box plot (or box-and-whisker plot) is a visual representation of groups of numerical
data through their quartiles, plotted against continuous/categorical data.

A box plot consists of 5 things.


 Minimum
 First Quartile or 25%
 Median (Second Quartile) or 50%
 Third Quartile or 75%
 Maximum
Syntax: 
seaborn.boxplot(x=None, y=None, hue=None, data=None)
Parameters: 
 x, y, hue: Inputs for plotting long-form data.
 data: Dataset for plotting. If x and y are absent, this is interpreted as wide-form.
Returns: It returns the Axes object with the plot drawn onto it. 
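The five values that make up a box plot can be computed directly. This sketch uses a hypothetical sample of ages and NumPy's default linear interpolation for the quartiles, which is one of several quartile conventions.

```python
import numpy as np

# Hypothetical sample, for illustration only
ages = np.array([22, 25, 25, 27, 28, 30, 31, 33, 36, 40])

minimum = ages.min()
q1, median, q3 = np.percentile(ages, [25, 50, 75])   # quartiles
maximum = ages.max()

print(minimum, q1, median, q3, maximum)
```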
 Draw the box plot with Pandas:
 Example 1:
# import module
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv( "nba.csv" )
sns.boxplot( x = data['Age'] )

 
 
Output: 

Example 2:
 

# import module
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv( "nba.csv" )
sns.boxplot( x = "Age", y = "Weight", data = data )

Output:

Violin plot:

A violin plot is similar to a box plot. It shows the distribution of quantitative data across one or more
categorical variables such that those distributions can be compared. 
Syntax: seaborn.violinplot(x=None, y=None, hue=None, data=None)
Parameters: 
 x, y, hue: Inputs for plotting long-form data. 
 data: Dataset for plotting. 
 Draw the violin plot with Pandas:
Example 1:
# import module
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv("nba.csv")
sns.violinplot(x = data['Age'])

 Output: 

 Example 2:

# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.violinplot(x ="Age", y ="Weight",data = data)

Output:

Swarm plot:

A swarm plot is similar to a strip plot. We can draw a swarm plot with non-overlapping
points against categorical data.
Syntax: seaborn.swarmplot(x=None, y=None, hue=None, data=None)
 
Parameters: 
 x, y, hue: Inputs for plotting long-form data. 
 data: Dataset for plotting. 
 
Draw the swarm plot with Pandas:
Example 1:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv( "nba.csv" )
seaborn.swarmplot(x = data["Age"])

Output:

Example 2:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.swarmplot(x ="Age", y ="Weight",data = data)

Output:

Bar plot:

A bar plot represents an estimate of central tendency for a numeric variable with
the height of each rectangle and provides some indication of the uncertainty around that
estimate using error bars. 
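The two quantities behind each bar can be sketched with pandas: a per-group mean for the bar height and a measure of uncertainty for the error bar. The data below is hypothetical, and the standard error is used here only as a simple stand-in; seaborn computes its error bars differently by default, so its numbers will not match.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "position": ["PG", "PG", "PG", "C", "C", "C"],
    "weight": [185, 190, 195, 240, 250, 260],
})

# mean -> height of each bar, sem -> one possible size for the error bar
summary = df.groupby("position")["weight"].agg(["mean", "sem"])
print(summary)
```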

Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None)


Parameters :
 x, y : Names of variables in data, or vector data. Inputs for
plotting long-form data.
 hue : (optional) Column name for colour encoding.
 data : (optional) DataFrame, array, or list of arrays. Dataset for
plotting. If x and y are absent, this is interpreted as wide-form. Otherwise it is
expected to be long-form.
Returns : Returns the Axes object with the plot drawn onto it. 
Draw the bar plot with Pandas:
Example 1:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x =data["Age"])

Example 2:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.barplot(x ="Age", y ="Weight", data = data)

Output:

Point plot:
A point plot is used to show point estimates and confidence intervals using scatter plot
glyphs. A point plot represents an estimate of central tendency for a numeric variable by
the position of scatter plot points and provides some indication of the uncertainty around
that estimate using error bars.
Syntax: seaborn.pointplot(x=None, y=None, hue=None, data=None)
Parameters:
 x, y: Inputs for plotting long-form data.
 hue: (optional) column name for color encoding.
 data: dataframe as a Dataset for plotting.
Return: The Axes object with the plot drawn onto it.
Draw the point plot with Pandas:
Example:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.pointplot(x = "Age", y = "Weight", data = data)

Output:

Count plot:

A count plot is used to show the counts of observations in each categorical bin using bars.
Syntax : seaborn.countplot(x=None, y=None, hue=None, data=None)
Parameters :
 x, y: (optional) Names of variables in data, or vector data. Inputs
for plotting long-form data.
 hue : (optional) Column name for color encoding.
 data : (optional) DataFrame, array, or list of arrays. Dataset for
plotting. If x and y are absent, this is interpreted as wide-form. Otherwise, it is
expected to be long-form.
Returns: Returns the Axes object with the plot drawn onto it.
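The quantity a count plot draws is simply the number of observations per category. A minimal sketch with the standard library, using hypothetical age values:

```python
from collections import Counter

# Hypothetical values, for illustration only
ages = [25, 27, 25, 30, 27, 25, 22]

# Each distinct value becomes one bar; its count is the bar height
counts = Counter(ages)
for age in sorted(counts):
    print(age, counts[age])
```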
 
Draw the count plot with Pandas:
Example:
# import module
import seaborn
import pandas

seaborn.set(style = 'whitegrid')

# read csv and plot
data = pandas.read_csv("nba.csv")
seaborn.countplot(x = data["Age"])

Output:

KDE Plot:

A KDE (Kernel Density Estimate) plot is used for visualizing the probability
density of a continuous variable. It depicts the probability density at different values of a
continuous variable. We can also plot a single graph for multiple samples, which helps in
more efficient data visualization.
Syntax: seaborn.kdeplot(x=None, *, y=None, vertical=False, palette=None, **kwargs)
Parameters:
x, y : vectors or keys in data
vertical : boolean (True or False)
data : pandas.DataFrame, numpy.ndarray, mapping, or sequence
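To make the idea of a kernel density estimate concrete, the sketch below builds one by hand: a Gaussian bump is centred on every sample and the bumps are averaged. The sample values and the bandwidth are hypothetical, and this is not how seaborn is implemented internally; it only illustrates the principle.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical measurements and bandwidth, for illustration only
samples = np.array([4.9, 5.1, 5.8, 6.3, 6.5, 7.0])
grid = np.linspace(2.0, 10.0, 801)
density = gaussian_kde(samples, grid, bandwidth=0.4)

# A probability density should integrate to approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(float(area), 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve.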
Draw the KDE plot with Pandas:
Example 1:
# importing the required libraries
from sklearn import datasets
import pandas as pd
import seaborn as sns

# Setting up the Data Frame
iris = datasets.load_iris()

iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length',
                       'Sepal_Width', 'Petal_Length', 'Petal_Width'])

# Map the numeric targets to species names
iris_df['Target'] = pd.Series(iris.target).map(
    {0: 'Iris_Setosa', 1: 'Iris_Versicolor', 2: 'Iris_Virginica'})

# Plotting the KDE Plot (fill shades the area under the curve)
sns.kdeplot(x = iris_df.loc[(iris_df['Target'] == 'Iris_Virginica'),
            'Sepal_Length'], color = 'b', fill = True, label = 'Iris_Virginica')

Output:

Example 2:
# import module
import seaborn as sns
import pandas

# read the first 5 rows
data = pandas.read_csv("nba.csv").head()

sns.kdeplot(x = data['Age'], y = data['Number'])

Output:

Bivariate and Univariate data using seaborn and pandas:


Before starting let’s have a small intro of bivariate and univariate data:
Bivariate data: This type of data involves two different variables. The analysis of this
type of data deals with causes and relationships and the analysis is done to find out the
relationship between the two variables.
Univariate data: This type of data consists of only one variable. The analysis of
univariate data is thus the simplest form of analysis since the information deals with only
one quantity that changes. It does not deal with causes or relationships and the main
purpose of the analysis is to describe the data and find patterns that exist within it.
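The difference can be sketched numerically: a univariate summary describes one variable on its own, while a bivariate measure such as the correlation coefficient describes how two variables move together. The values below are hypothetical.

```python
import numpy as np

# Hypothetical values, for illustration only
age = np.array([22, 25, 28, 31, 34])
weight = np.array([180, 190, 200, 215, 225])

# Univariate: describe a single variable
print(age.mean(), age.std(ddof=1))

# Bivariate: quantify the relationship between two variables
r = np.corrcoef(age, weight)[0, 1]
print(round(float(r), 3))
```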
Let’s see an example of bivariate data distribution:
Example 1: Using the box plot.
# import module
import seaborn as sns
import pandas

# read csv and plot
data = pandas.read_csv( "nba.csv" )
sns.boxplot( x = "Age", y = "Height", data = data )

Output:

Example 2: using KDE plot.

# import module
import seaborn as sns
import pandas

# read the first 5 rows
data = pandas.read_csv("nba.csv").head()

sns.kdeplot(x = data['Age'], y = data['Weight'])

Output:

Let’s see an example of univariate data distribution:
Example: Using a histogram with a KDE overlay (histplot, which replaces the deprecated distplot)
# import module
import seaborn as sns
import pandas

# read the first 5 rows
data = pandas.read_csv("nba.csv").head()

sns.histplot(data['Age'], kde = True)

Output:

Data Visualization using Matplotlib
Data Visualization is the process of presenting data in the form of graphs or charts. It
helps to understand large and complex amounts of data very easily. It allows
decision-makers to make decisions efficiently and helps them identify
new trends and patterns. It is also used in high-level data analysis for Machine
Learning and Exploratory Data Analysis (EDA).  Data visualization can be done with
various tools like Tableau, Power BI, and Python.
In this article, we will discuss how to visualize data with the help of the Matplotlib
library of Python.

Matplotlib
Matplotlib is a low-level Python library used for data visualization. It is easy
to use and emulates MATLAB-like graphs and visualization. This library is built on
top of NumPy arrays and consists of several plots such as line chart, bar chart, histogram, etc.
It provides a lot of flexibility but at the cost of writing more code.

Pyplot
Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is
designed to be as usable as MATLAB, with the ability to use Python and the advantage
of being free and open-source. Each pyplot function makes some change to a figure: e.g.,
creates a figure, creates a plotting area in a figure, plots some lines in a plotting area,
decorates the plot with labels, etc. The various plots we can utilize using Pyplot are Line
Plot, Histogram, Scatter, 3D Plot, Image, Contour, and Polar.
Now that we have a brief overview of Matplotlib and Pyplot, let’s see how to create a simple plot.
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

plt.show()

Output:

Adding Title
The title() method in the matplotlib.pyplot module is used to specify the title of the visualization
and to control how the title is displayed through various attributes.
Syntax:
matplotlib.pyplot.title(label, fontdict=None, loc='center', pad=None, **kwargs)
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Linear graph")

plt.show()

Output:

We can also change the appearance of the title by using the parameters of this function.
Example:
import matplotlib.pyplot as plt

# initializing the data


x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data


plt.plot(x, y)

# Adding title to the plot


plt.title("Linear graph", fontsize=25, color="green")

plt.show()

Output:
Note: For more information about adding the title and its customization,
refer Matplotlib.pyplot.title() in Python
Adding X Label and Y Label
In layman’s terms, the X label and the Y label are the titles given to X-axis and Y-axis
respectively. These can be added to the graph by using the xlabel() and ylabel() methods.
Syntax:
matplotlib.pyplot.xlabel(xlabel, fontdict=None, labelpad=None, **kwargs)
matplotlib.pyplot.ylabel(ylabel, fontdict=None, labelpad=None, **kwargs)
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

plt.show()

Output:

Setting Limits and Tick labels
You might have seen that Matplotlib automatically sets the values and the
markers (points) of the X and Y axes. However, it is possible to set the limits and markers
manually. The xlim() and ylim() functions are used to set the limits of the X-axis and Y-axis
respectively. Similarly, the xticks() and yticks() functions are used to set tick labels.
Example: In this example, we will be changing the limit of Y-axis and will be setting the
labels for X-axis.
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

# Setting the limit of y-axis
plt.ylim(0, 80)

# setting the labels of x-axis
plt.xticks(x, labels=["one", "two", "three", "four"])

plt.show()

Output:

Adding Legends
A legend is an area describing the elements of the graph. In simple terms, it reflects the
data displayed in the graph’s Y-axis. It generally appears as a box containing a small
sample of each color on the graph and a short description of what this data means.
The parameter bbox_to_anchor=(x, y) of the legend() function is used to specify the
coordinates of the legend, and the parameter ncol sets the number of columns that
the legend has. Its default value is 1.
Syntax:
matplotlib.pyplot.legend(["name1", "name2"], bbox_to_anchor=(x, y), ncol=1)
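As a small sketch of the bbox_to_anchor and ncol parameters, the snippet below plots two lines and places a two-column legend just above the plotting area. The second data series, the label names, and the anchor coordinates are hypothetical choices for illustration; the Agg backend is selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
plt.plot(x, [20, 25, 35, 55])
plt.plot(x, [15, 30, 32, 40])   # hypothetical second series

# Anchor the legend's top-centre point slightly above the axes
# and lay the two entries out side by side
legend = plt.legend(["first", "second"], loc="upper center",
                    bbox_to_anchor=(0.5, 1.15), ncol=2)

print([text.get_text() for text in legend.get_texts()])
```

The coordinates passed to bbox_to_anchor are in axes fractions by default, so (0.5, 1.15) means horizontally centred and a little above the top edge.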

Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="green")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

# Setting the limit of y-axis
plt.ylim(0, 80)

# setting the labels of x-axis
plt.xticks(x, labels=["one", "two", "three", "four"])

# Adding legends
plt.legend(["GFG"])

plt.show()

Output:

Before moving any further with Matplotlib, let’s discuss some important classes that will
be used further in the tutorial. These classes are:
• Figure
• Axes
Note: Matplotlib takes care of creating default Figure and Axes objects when they are not created explicitly.
Figure class
Consider the figure class as the overall window or page on which everything is drawn. It
is a top-level container that contains one or more axes. A figure can be created using
the figure() method.

Syntax:
class matplotlib.figure.Figure(figsize=None, dpi=None, facecolor=None,
edgecolor=None, linewidth=0.0, frameon=None, subplotpars=None, tight_layout=None,
constrained_layout=None)
Example:
# Python program to create a Figure explicitly
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# Creating a new figure with width = 7 inches
# and height = 5 inches with face color as
# green, edgecolor as blue and the line width
# of the edge as 7
fig = plt.figure(figsize=(7, 5), facecolor='g',
                 edgecolor='b', linewidth=7)

# Creating a new axes for the figure; the list is
# [left, bottom, width, height] as fractions of the figure
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

# Adding the data to be plotted
ax.plot(x, y)

# Adding title to the plot
plt.title("Linear graph", fontsize=25, color="yellow")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

# Setting the limit of y-axis
plt.ylim(0, 80)

# setting the labels of x-axis
plt.xticks(x, labels=["one", "two", "three", "four"])

# Adding legends
plt.legend(["GFG"])

plt.show()

Output:

Axes Class
Axes class is the most basic and flexible unit for creating sub-plots. A given figure may
contain many axes, but a given axes can only be present in one figure. The axes()
function creates the axes object. 
Syntax:
axes([left, bottom, width, height])
Just like the pyplot module, the Axes class also provides methods for adding titles, legends, limits,
labels, etc. Let’s see a few of them:
• Adding Title: ax.set_title()
• Adding X Label and Y label: ax.set_xlabel(), ax.set_ylabel()
• Setting Limits: ax.set_xlim(), ax.set_ylim()
• Tick labels: ax.set_xticklabels(), ax.set_yticklabels()
• Adding Legends: ax.legend()
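As a minimal sketch of the limit and tick-label methods listed above, the following reuses the data from the earlier examples. The axes rectangle passed to add_axes() is an assumption chosen so that the labels stay inside the figure; set_xticks() is called first so that the tick positions are fixed before they are named.

```python
import matplotlib.pyplot as plt

# Same illustrative data as in the earlier examples
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

fig = plt.figure(figsize=(5, 4))
# [left, bottom, width, height] as fractions of the figure (assumed values)
ax = fig.add_axes([0.15, 0.15, 0.75, 0.75])
ax.plot(x, y)

# Title and axis labels through the Axes methods
ax.set_title("Linear graph")
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")

# Limit of the y-axis
ax.set_ylim(0, 80)

# Fix the tick positions first, then give each position a label
ax.set_xticks(x)
labels = ax.set_xticklabels(["one", "two", "three", "four"])

print(ax.get_ylim())
print([label.get_text() for label in labels])

# Call plt.show() to display the figure in an interactive session.
```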
Example:
# Python program to show the Axes class
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

fig = plt.figure(figsize=(5, 4))

# Adding the axes to the figure; the list is
# [left, bottom, width, height] as fractions of the figure
ax = fig.add_axes([0.15, 0.15, 0.75, 0.75])

# plotting 1st dataset to the figure
line1 = ax.plot(x, y)

# plotting 2nd dataset to the figure
line2 = ax.plot(y, x)

# Setting Title
ax.set_title("Linear Graph")

# Setting Label
ax.set_xlabel("X-Axis")
ax.set_ylabel("Y-Axis")

# Adding Legend
ax.legend(labels=('line 1', 'line 2'))

plt.show()

Output:

Multiple Plots
We have learned about the basic components of a graph that can be added so that it can
convey more information. One method can be by calling the plot function again and
again with a different set of values as shown in the above example. Now let’s see how to
plot multiple graphs using some functions and also how to plot subplots. 

Method 1: Using the add_axes() method 
The add_axes() method is used to add axes to the figure. It is a method of the Figure class.
Syntax:
add_axes(self, *args, **kwargs)

# Python program to add two axes to one figure
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# Creating a new figure with width = 5 inches
# and height = 4 inches
fig = plt.figure(figsize=(5, 4))

# Creating first axes for the figure (left half)
ax1 = fig.add_axes([0.1, 0.1, 0.35, 0.8])

# Creating second axes for the figure (right half)
ax2 = fig.add_axes([0.6, 0.1, 0.35, 0.8])

# Adding the data to be plotted
ax1.plot(x, y)
ax2.plot(y, x)

plt.show()

Output:

Method 2: Using subplot() method.
This method adds another plot at the specified grid position in the current figure.
Syntax:
subplot(nrows, ncols, index, **kwargs)
subplot(pos, **kwargs)
subplot(ax)

import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# Creating figure object
plt.figure()

# adding first subplot
plt.subplot(121)
plt.plot(x, y)

# adding second subplot
plt.subplot(122)
plt.plot(y, x)

plt.show()

Output:

Method 3: Using subplots() method
This function is used to create figures and multiple subplots at the same time.
Syntax:
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False,
squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# Creating the figure and subplots
# according to the arguments passed
fig, axes = plt.subplots(1, 2)

# plotting the data in the 1st subplot
axes[0].plot(x, y)

# plotting more data in the 1st subplot only
axes[0].plot(y, x)

# plotting the data in the 2nd subplot only
axes[1].plot(x, y)

plt.show()

Output:

Method 4: Using subplot2grid() method
This function creates axes object at a specified location inside a grid and also helps in
spanning the axes object across multiple rows or columns. In simpler words, this function
is used to create multiple charts within the same figure.
Syntax:
matplotlib.pyplot.subplot2grid(shape, loc, rowspan=1, colspan=1)
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# adding the subplots on a grid of 7 rows and 1 column
axes1 = plt.subplot2grid(
    (7, 1), (0, 0), rowspan=2, colspan=1)

axes2 = plt.subplot2grid(
    (7, 1), (2, 0), rowspan=2, colspan=1)

# plotting the data
axes1.plot(x, y)
axes2.plot(y, x)

plt.show()

Output:

Different types of Matplotlib Plots


Matplotlib supports a variety of plots including line charts, bar charts, histograms, scatter
plots, etc. We will discuss the most commonly used charts in this article with the help of
some good examples and will also see how to customize each plot.  
Note: Some elements, such as the axes and colors, are common to every plot, whereas others
are plot specific.

Line Chart

A line chart is one of the basic plots and can be created using the plot() function. It is used
to represent the relationship between two variables, X and Y, plotted on different axes.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y)

# Adding title to the plot
plt.title("Line Chart")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

plt.show()

Output:

Let’s see how to customize the line chart created above. We will be using the following
properties:
• color: Changing the color of the line
• linewidth: Customizing the width of the line
• marker: For changing the style of the actual plotted points
• markersize: For changing the size of the markers
• linestyle: For defining the style of the plotted line
Different linestyles available
Character Definition
- Solid line
-- Dashed line
-. Dash-dot line
: Dotted line

Character Definition
. Point marker
o Circle marker
, Pixel marker
v triangle_down marker
^ triangle_up marker
< triangle_left marker
> triangle_right marker
1 tri_down marker
2 tri_up marker
3 tri_left marker
4 tri_right marker
s square marker
p pentagon marker
* star marker
h hexagon1 marker
H hexagon2 marker
+ Plus marker
x X marker
D Diamond marker
d thin_diamond marker
| vline marker
_ hline marker

Example:
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

# plotting the data
plt.plot(x, y, color='green', linewidth=3, marker='o',
         markersize=15, linestyle='--')

# Adding title to the plot
plt.title("Line Chart")

# Adding label on the y-axis
plt.ylabel('Y-Axis')

# Adding label on the x-axis
plt.xlabel('X-Axis')

plt.show()

Output:

Bar Chart

A bar chart is a graph that represents categories of data with rectangular bars whose
lengths or heights are proportional to the values they represent. Bar plots
can be drawn horizontally or vertically. A bar chart shows comparisons among
discrete categories. It can be created using the bar() method.
In the example below, we will use the tips dataset. The tips dataset is a record of the tips
given by customers in a restaurant over two and a half months in the early 1990s. It
contains 7 columns: total_bill, tip, sex, smoker, day, time, and size.
Example: 
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.bar(x, y)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()

Output:
Customizations that are available for the bar chart:
• color: For the bar faces
• edgecolor: Color of the edges of the bars
• linewidth: Width of the bar edges
• width: Width of the bars
Example:
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.bar(x, y, color='green', edgecolor='blue',
        linewidth=2)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()

Output:
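Note: Each day appears many times in the dataset, so bar() draws one bar per row and the
bars for the same day overlap. The horizontal lines visible inside each bar are the edges of
those overlapping bars, one for each total_bill value recorded on that day.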
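If a single bar per day is wanted instead, one option is to aggregate the rows before plotting. The sketch below is only an illustration: it uses a small hypothetical DataFrame with the same column names as tips.csv rather than the real file, and summing with groupby() is one possible choice of aggregation (a mean would work the same way).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Small hypothetical sample with the same column names as tips.csv
data = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 30.0, 12.0, 8.0],
})

# Sum the bills for each day so that every day maps to exactly one bar;
# sort=False keeps the days in their order of first appearance
totals = data.groupby("day", sort=False)["total_bill"].sum()

# One bar per day, with no overlapping bars
bars = plt.bar(list(totals.index), list(totals.values))

plt.title("Total bill per day")
plt.ylabel('Total Bill')
plt.xlabel('Day')

print(totals.to_dict())

# Call plt.show() to display the figure in an interactive session.
```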

Histogram

A histogram is used to represent data grouped into ranges. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives the frequency of the
values in each bin. The hist() function is used to compute and draw the histogram
of x.
Syntax:
matplotlib.pyplot.hist(x, bins=None, range=None, density=False, weights=None,
cumulative=False, bottom=None, histtype='bar', align='mid', orientation='vertical',
rwidth=None, log=False, color=None, label=None, stacked=False, *, data=None,
**kwargs)
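Because hist() both computes and draws the histogram, it also returns what it computed. The following minimal sketch, with a small list of values assumed purely for illustration, prints the per-bin counts and the bin edges that hist() returns.

```python
import matplotlib.pyplot as plt

# Illustrative values (assumed for this sketch)
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

# hist() returns the counts per bin, the bin edges and the drawn bars
counts, bin_edges, patches = plt.hist(values, bins=4)

plt.xlabel('Value')
plt.ylabel('Frequency')

print(counts)      # number of values falling into each of the 4 bins
print(bin_edges)   # the 5 edges that delimit the 4 equal-width bins

# Call plt.show() to display the figure in an interactive session.
```

The counts always add up to the number of values passed in, and there is one more bin edge than there are bins.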
Example:
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['total_bill']

# plotting the data
plt.hist(x)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Frequency')

# Adding label on the x-axis
plt.xlabel('Total Bill')

plt.show()

Output:

Customizations that are available for the histogram:
• bins: Number of equal-width bins
• color: For changing the face color
• edgecolor: Color of the edges
• linestyle: Style of the edge lines
• alpha: blending value, between 0 (transparent) and 1 (opaque)
Example:
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['total_bill']

# plotting the data
plt.hist(x, bins=25, color='green', edgecolor='blue',
         linestyle='--', alpha=0.5)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Frequency')

# Adding label on the x-axis
plt.xlabel('Total Bill')

plt.show()

Output:

Scatter Plot

Scatter plots are used to observe relationships between variables. The scatter() method
in the matplotlib library is used to draw a scatter plot.
Syntax:
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None,
cmap=None, vmin=None, vmax=None, alpha=None, linewidths=None,
edgecolors=None)
Example:
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.scatter(x, y)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()

Output:

Customizations that are available for the scatter plot are:
• s: marker size (can be a scalar or an array with the same size as x or y)
• c: color or sequence of colors for the markers
• marker: marker style
• linewidths: width of the marker border
• edgecolors: marker border color
• alpha: blending value, between 0 (transparent) and 1 (opaque)
Example:
import matplotlib.pyplot as plt
import pandas as pd

# Reading the tips.csv file
data = pd.read_csv('tips.csv')

# initializing the data
x = data['day']
y = data['total_bill']

# plotting the data
plt.scatter(x, y, c=data['size'], s=data['total_bill'],
            marker='D', alpha=0.5)

# Adding title to the plot
plt.title("Tips Dataset")

# Adding label on the y-axis
plt.ylabel('Total Bill')

# Adding label on the x-axis
plt.xlabel('Day')

plt.show()

Output:

Pie Chart

A pie chart is a circular chart used to display a single series of data. The area of each slice of
the pie represents that part's percentage of the whole. The slices of the pie are called
wedges. A pie chart can be created using the pie() method.
Syntax:
matplotlib.pyplot.pie(data, explode=None, labels=None, colors=None, autopct=None,
shadow=False)
Example:
import matplotlib.pyplot as plt

# initializing the data
cars = ['AUDI', 'BMW', 'FORD',
        'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]

# plotting the data
plt.pie(data, labels=cars)

# Adding title to the plot
plt.title("Car data")

plt.show()

Output:

Customizations that are available for the pie chart are:
• explode: Offsets the wedges from the center of the plot
• autopct: Labels each wedge with its percentage of the total
• colors: Sequence of colors for the wedges
• shadow: Draws a shadow beneath the pie
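The customizations listed above can be combined as in the following minimal sketch, which reuses the car data from the previous example. The particular offsets, color names, and percentage format are assumptions picked for illustration.

```python
import matplotlib.pyplot as plt

# Same data as in the pie chart example above
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 10, 35, 15, 12]

# Offset of each wedge from the centre, as a fraction of the radius;
# only the third wedge (FORD) is pulled out here
explode = [0, 0, 0.1, 0, 0]

# One color per wedge (illustrative choices)
colors = ['orange', 'cyan', 'brown', 'grey', 'indigo']

# With autopct set, pie() returns the wedges, the label texts
# and the percentage texts
wedges, texts, autotexts = plt.pie(
    data, labels=cars, explode=explode, colors=colors,
    autopct='%1.1f%%', shadow=True)

plt.title("Car data")

# Call plt.show() to display the figure in an interactive session.
```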
Python Data Types
Data types are the classification or categorization of data items. A data type describes the kind of
value an item holds and determines which operations can be performed on it. Since everything is
an object in Python programming, data types are actually classes and variables are
instances (objects) of these classes.
The following are the standard or built-in data types of Python:
• Numeric
• Sequence Type
• Boolean
• Set
• Dictionary

Numeric
In Python, numeric data types represent data that has a numeric value. A numeric
value can be an integer, a floating-point number, or a complex number. These values are
defined by the int, float and complex classes in Python.
• Integers: This value is represented by the int class. It contains positive or negative
whole numbers (without fraction or decimal). In Python there is no limit to how long
an integer value can be.
• Float: This value is represented by the float class. It is a real number with a floating-point
representation. It is specified by a decimal point. Optionally, the character e or
E followed by a positive or negative integer may be appended to specify scientific
notation.
• Complex Numbers: A complex number is represented by the complex class. It is
specified as (real part) + (imaginary part)j. For example: 2+3j
# Python program to
# demonstrate numeric value

a = 5
print("Type of a:", type(a))

b = 5.0
print("\nType of b:", type(b))

c = 2 + 4j
print("\nType of c:", type(c))
Output:
Type of a: <class 'int'>

Type of b: <class 'float'>

Type of c: <class 'complex'>
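A few of the properties described in the list above can be checked directly. The values in this short sketch are chosen only for illustration.

```python
# Integers have arbitrary precision: 2**100 is computed exactly
big = 2 ** 100
print(big)              # 1267650600228229401496703205376
print(type(big))        # <class 'int'>

# Scientific notation: 1.5e3 means 1.5 * 10**3 and is always a float
f = 1.5e3
print(f)                # 1500.0

# A complex number exposes its real and imaginary parts as floats
c = 2 + 3j
print(c.real, c.imag)   # 2.0 3.0
```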

Sequence Type
In Python, a sequence is an ordered collection of items of similar or different data types.
Sequences allow multiple values to be stored in an organized and efficient fashion. There
are several sequence types in Python:
• String
• List
• Tuple
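As a minimal sketch of the three sequence types named above (the variable names and values are assumptions for illustration), the following shows that all three support indexing and slicing, while only the list can be modified in place.

```python
text = "Python"            # string: a sequence of characters
items = [1, "two", 3.0]    # list: may mix different data types
point = (10, 20)           # tuple: a fixed group of values

# Indexing and slicing work the same way on all three
print(text[0])       # P
print(items[-1])     # 3.0
print(point[0:1])    # (10,)

# A list can be changed in place
items.append(4)
print(len(items))    # 4

# Strings and tuples are immutable, so item assignment raises TypeError
try:
    point[0] = 99
except TypeError as exc:
    print("tuples are immutable:", exc)
```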
Boolean
The Boolean data type has one of two built-in values, True or False. Boolean objects that are equal
to True are truthy (true), and those equal to False are falsy (false). Non-Boolean
objects can also be evaluated in a Boolean context and determined to be true or false.
The type is denoted by the class bool.
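How non-Boolean objects behave in a Boolean context can be checked with the built-in bool() function. The sample values in this small sketch are chosen for illustration.

```python
# Zero, empty containers and None are falsy; most other objects are truthy
print(bool(0), bool(1))          # False True
print(bool(""), bool("text"))    # False True
print(bool([]), bool([0]))       # False True
print(bool(None))                # False

# True and False compare equal to the integers 1 and 0
print(True == 1, False == 0)     # True True

# The same rule decides which branch of an if statement runs
names = []
message = "has names" if names else "no names"
print(message)                   # no names
```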
# Python program to
# demonstrate boolean type
print(type(True))
print(type(False))
Output:
<class 'bool'>
<class 'bool'>

Time Series
Time series data is an important form of structured data in many different fields, such as finance, economics,
ecology, neuroscience, and physics. Anything that is observed or measured at many points in time forms a time
series. Many time series are fixed frequency, which is to say that data points occur at regular intervals
according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units. How you mark and refer to time series
data depends on the application, and you may have one of the following:
• Timestamps, specific instants in time
• Fixed periods, such as the month January 2007 or the full year 2010
• Intervals of time, indicated by a start and end timestamp. Periods can be thought
of as special cases of intervals
• Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time (e.g., the
diameter of a cookie baking each second since being placed in the oven)
In this chapter, I am mainly concerned with time series in the first three categories, though many of the
techniques can be applied to experimental time series where the index may be an integer or floating-point
number indicating elapsed time from the start of the experiment. The simplest and most widely used kind of
time series are those indexed by timestamp.
11.1 Date and Time Data Types and Tools
The Python standard library includes data types for date and time data, as well as calendar-related
functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type,
or simply datetime, is widely used:
In [10]: from datetime import datetime

In [11]: now = datetime.now()
In [12]: now
Out[12]: datetime.datetime(2017, 9, 25, 14, 5, 52, 72973)
In [13]: now.year, now.month, now.day
Out[13]: (2017, 9, 25)
datetime stores both the date and time down to the microsecond. timedelta represents
the temporal difference between two datetime objects:
In [14]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
In [15]: delta
Out[15]: datetime.timedelta(926, 56700)
In [16]: delta.days
Out[16]: 926
In [17]: delta.seconds
Out[17]: 56700
You can add (or subtract) a timedelta or multiple thereof to a datetime object to
yield a new shifted object:
In [18]: from datetime import timedelta
In [19]: start = datetime(2011, 1, 7)
In [20]: start + timedelta(12)
Out[20]: datetime.datetime(2011, 1, 19, 0, 0)
In [21]: start - 2 * timedelta(12)
Out[21]: datetime.datetime(2010, 12, 14, 0, 0)
Table 11-1 summarizes the data types in the datetime module. While this chapter is
mainly concerned with the data types in pandas and higher-level time series manipulation,
you may encounter the datetime-based types in many other places in Python
in the wild.
Table 11-1. Types in datetime module
Type Description
date Store calendar date (year, month, day) using the Gregorian calendar
time Store time of day as hours, minutes, seconds, and microseconds
datetime Stores both date and time
timedelta Represents the difference between two datetime values (as days, seconds, and microseconds)
tzinfo Base type for storing time zone information
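The relationship between the types in Table 11-1 can be sketched as follows. The specific dates and times are illustrative values; only the standard library datetime module is used.

```python
from datetime import date, time, datetime

d = date(2011, 1, 7)    # calendar date only
t = time(8, 15)         # time of day only

# A datetime carries both parts; combine() builds one from a date and a time
dt = datetime.combine(d, t)
print(dt)                        # 2011-01-07 08:15:00

# date() and time() split a datetime back into its two parts
print(dt.date() == d, dt.time() == t)   # True True

# Subtracting two datetime values yields a timedelta, which is stored
# as days, seconds and microseconds
delta = dt - datetime(2008, 6, 24, 8, 15)
print(delta.days, delta.seconds, delta.microseconds)   # 927 0 0
```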
Converting Between String and Datetime
You can format datetime objects and pandas Timestamp objects, which I’ll introduce
later, as strings using str or the strftime method, passing a format specification:
In [22]: stamp = datetime(2011, 1, 3)
In [23]: str(stamp)
Out[23]: '2011-01-03 00:00:00'
In [24]: stamp.strftime('%Y-%m-%d')
Out[24]: '2011-01-03'
See Table 11-2 for a complete list of the format codes (reproduced from Chapter 2).
Table 11-2. Datetime format specification (ISO C89 compatible)
Type Description
%Y Four-digit year
%y Two-digit year
%m Two-digit month [01, 12]
%d Two-digit day [01, 31]
%H Hour (24-hour clock) [00, 23]
%I Hour (12-hour clock) [01, 12]
%M Two-digit minute [00, 59]
%S Second [00, 61] (seconds 60, 61 account for leap seconds)
%w Weekday as integer [0 (Sunday), 6]
%U Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of
the year are “week 0”
%W Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of
the year are “week 0”
%z UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
%F Shortcut for %Y-%m-%d (e.g., 2012-04-18)
%D Shortcut for %m/%d/%y (e.g., 04/18/12)
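Several of the codes in Table 11-2 can be tried on a single value, as in the minimal sketch below. The timestamp is an illustrative choice (3 January 2011 was a Monday). The %F and %D shortcuts are left out here on the assumption that their availability may depend on the platform's C library.

```python
from datetime import datetime

stamp = datetime(2011, 1, 3, 14, 5, 9)   # a Monday

print(stamp.strftime('%d/%m/%Y'))    # 03/01/2011
print(stamp.strftime('%y'))          # 11
print(stamp.strftime('%H:%M:%S'))    # 14:05:09
print(stamp.strftime('%I'))          # 02  (12-hour clock)
print(stamp.strftime('%w'))          # 1   (Sunday is 0)
print(stamp.strftime('%U %W'))       # 01 01
print(repr(stamp.strftime('%z')))    # ''  (time zone naive)

# The same codes parse a string back into an equal datetime
text = stamp.strftime('%d/%m/%Y %H:%M:%S')
parsed = datetime.strptime(text, '%d/%m/%Y %H:%M:%S')
print(parsed == stamp)               # True
```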
You can use these same format codes to convert strings to dates using
datetime.strptime:
In [25]: value = '2011-01-03'

In [26]: datetime.strptime(value, '%Y-%m-%d')
Out[26]: datetime.datetime(2011, 1, 3, 0, 0)
In [27]: datestrs = ['7/6/2011', '8/6/2011']
In [28]: [datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
Out[28]:
[datetime.datetime(2011, 7, 6, 0, 0),
datetime.datetime(2011, 8, 6, 0, 0)]
datetime.strptime is a good way to parse a date with a known format. However, it
can be a bit annoying to have to write a format spec each time, especially for common
date formats. In this case, you can use the parser.parse method in the third-party
dateutil package (this is installed automatically when you install pandas):
In [29]: from dateutil.parser import parse
In [30]: parse('2011-01-03')
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)
dateutil is capable of parsing most human-intelligible date representations:
In [31]: parse('Jan 31, 1997 10:45 PM')
Out[31]: datetime.datetime(1997, 1, 31, 22, 45)
In international locales, day appearing before month is very common, so you can pass
dayfirst=True to indicate this:
In [32]: parse('6/12/2011', dayfirst=True)
Out[32]: datetime.datetime(2011, 12, 6, 0, 0)
pandas is generally oriented toward working with arrays of dates, whether used as an
axis index or a column in a DataFrame. The to_datetime method parses many different
kinds of date representations. Standard date formats like ISO 8601 can be
parsed very quickly:
In [33]: datestrs = ['2011-07-06 12:00:00', '2011-08-06 00:00:00']
In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It also handles values that should be considered missing (None, empty string, etc.):
In [35]: idx = pd.to_datetime(datestrs + [None])
In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
In [37]: idx[2]
Out[37]: NaT
In [38]: pd.isnull(idx)
Out[38]: array([False, False, True], dtype=bool)
NaT (Not a Time) is pandas’s null value for timestamp data.
dateutil.parser is a useful but imperfect tool. Notably, it will recognize
some strings as dates that you might prefer that it didn’t—
for example, '42' will be parsed as the year 2042 with today’s calendar
date.
datetime objects also have a number of locale-specific formatting options for systems
in other countries or languages. For example, the abbreviated month names will be
different on German or French systems compared with English systems. See
Table 11-3 for a listing.
Table 11-3. Locale-specific date formatting
Type Description
%a Abbreviated weekday name
%A Full weekday name
%b Abbreviated month name
%B Full month name
%c Full date and time (e.g., ‘Tue 01 May 2012 04:20:57 PM’)
%p Locale equivalent of AM or PM
%x Locale-appropriate formatted date (e.g., in the United States, May 1, 2012 yields ’05/01/2012’)
%X Locale-appropriate time (e.g., ’04:24:12 PM’)
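A few of the codes in Table 11-3 are shown in the sketch below. Because the results depend on the active locale, the example pins the time-formatting locale to the portable "C" locale, which gives English names; this choice is an assumption made so that the output is predictable, and on a German or French locale the names would differ.

```python
import locale
from datetime import datetime

# Pin time formatting to the portable "C" locale (English names) so the
# output does not depend on the machine's regional settings
locale.setlocale(locale.LC_TIME, "C")

stamp = datetime(2012, 5, 1, 16, 20, 57)

print(stamp.strftime('%a'))   # Tue
print(stamp.strftime('%A'))   # Tuesday
print(stamp.strftime('%b'))   # May
print(stamp.strftime('%B'))   # May
print(stamp.strftime('%p'))   # PM
```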

