Download as pdf or txt
Download as pdf or txt
You are on page 1of 73

ISOM 2600 Business Analytics

TOPIC 2: PANDAS AND DATA


PREPROCESSING
X U H U WA N
HKUST
OCT 25, 2022

Monday, October 24, 2022 XUHU WAN, HKUST 1


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 2


Pandas:
Series and
DataFrame

Monday, October 24, 2022 XUHU WAN, HKUST 3


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 4


Pandas
Pandas offers data structures and operations
for manipulating numerical tables and time
series. This library is our main focus.

Pandas is a library for data analysis. It gives Python the ability to work with
spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among
other functions. In the first lecture , we will focus on DataFrame only and learn to read
data file and do column-wise operation.

Import pandas and read data

Monday, October 24, 2022 XUHU WAN, HKUST 5


csv Data File
Information is separated with a comma. The comma is not the only type of
delimiter, however. Some files are delimited by a tab (TSV) or even a
semicolon. The main reason why CSVs are a preferred data format when
collaborating and sharing data is because any program can open this kind of
data structure. It can even be opened in a text editor.

Monday, October 24, 2022 XUHU WAN, HKUST 6


Index: date and hour

Columns: Hourly power consumption in


Different regions of PJM LLC.

PJM Interconnection LLC (PJM) is a


regional transmission organization (RTO)
in the United States. It is part of the
Eastern Interconnection grid operating an
electric transmission system serving all or
parts of Delaware, Illinois, Indiana,
Kentucky, Maryland, Michigan, New
Jersey, North Carolina, Ohio,
Pennsylvania, Tennessee, Virginia, West
Virginia, and the District of Columbia.

Monday, October
24, 2022 7

XUHU WAN, HKUST


Data Model of Pandas

The most often used data structures are the Series (array data) and the
DataFrame (tabular data).
DataFrame consists of multiple columns of Series.

Monday, October 24, 2022 XUHU WAN, HKUST 8


DataFrame
In pandas, DataFrame is two dimensional data structure which has thes index and
columns, similar the excel spreadsheet. Each column is a Series.

Column

Index

Monday, October 24, 2022 XUHU WAN, HKUST 9


Series
1. A Series is used for one-dimensional data which has an index and
a name.

2. A column of DataFrame is a series.

3. The leftmost column contains the index and the rightmost


column in the output contains the values of series. The default
index is the counting number.

Monday, October 24, 2022 XUHU WAN, HKUST 10


Series and DataFrame

Pandas introduces two data types to Python: Series and DataFrame. The
DataFrame represents your entire spreadsheet or rectangular data, whereas
the Series is a single column of the DataFrame. A Pandas DataFrame can also
be thought of as a dictionary or collection of Series objects . The DataFrame
has Columns and Index.

A Series is a column of DataFrame without column name.

Monday, October 24, 2022 XUHU WAN, HKUST 11


Methods and Attributes
The pandas library provides a lot of functionality. Both series and
DataFrame have their own lists of methods and attibutes. There are over 400
method and attributes on a series.

Colab can pop up a list of attributes for


completions(after “.”) when you hit
Ctrl+Space

Monday, October 24, 2022 XUHU WAN, HKUST 12


Methods and Attributes
In python, the attributes are the characteristics of data type and methods
define the operation/actions

Attributes Methods

Monday, October 24, 2022 XUHU WAN, HKUST 13


Attributes

Monday, October 24, 2022 XUHU WAN, HKUST 14


Statistical methods of Series

Monday, October 24, 2022 XUHU WAN, HKUST 15


Methods of DataFrame

All Series methods can work in DataFrame. But the DataFrame is two-
dimensional. How do we know if the method work row-wisely or
column-wisely?

Monday, October 24, 2022 XUHU WAN, HKUST 16


DataFrame Axis
A DataFrame has two axes which are referred to as axis 0 and 1 or the index
(row) axis and the columns axes

Monday, October 24, 2022 XUHU WAN, HKUST 17


Monday, October 24, 2022 XUHU WAN, HKUST 18
Monday, October 24, 2022 XUHU WAN, HKUST 19
Advanced
Methods

Monday, October 24, 2022 XUHU WAN, HKUST 20


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 21


value_counts()
This method is powerful to find frequencies for discrete or categorical
variables.

Monday, October 24, 2022 XUHU WAN, HKUST 22


sort_value of Series

Monday, October 24, 2022 XUHU WAN, HKUST 23


Sort_value() of DataFrame

Monday, October 24, 2022 XUHU WAN, HKUST 24


Chaining manipulation

Ascending
order

Monday, October 24, 2022 XUHU WAN, HKUST 25


Chaining manipulation

plot

Monday, October 24, 2022 XUHU WAN, HKUST 26


Plot Methods of Series and
DataFrame
Pandas is integrated with matplotlib deeply. With plot attributes, we can easily
chain the plot with other methods.

Monday, October 24, 2022 XUHU WAN, HKUST 27


Histograms

Monday, October 24, 2022 XUHU WAN, HKUST 28


Boxplot

Monday, October 24, 2022 XUHU WAN, HKUST 29


Line plot

Monday, October 24, 2022 XUHU WAN, HKUST 30


Bar plot

Monday, October 24, 2022 XUHU WAN, HKUST 31


Monday, October 24, 2022 XUHU WAN, HKUST 32
Pie plots

Monday, October 24, 2022 XUHU WAN, HKUST 33


Data Slicing

34

Monday, October 24, 2022


XUHU WAN, HKUST
Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 35


Data Slicing (Subsetting )
The data is usually large and it is better to check details by displaying or make
a copy of a part of the data.

We can use the head or tail method to look at the first or last the several
rows . This is useful to see if the data is loaded correctly and has a sense of
the data contents, i.e. index, columns, values etc.
We can select one or multiple columns to compare different variables
We can select one or more rows to check data pattern only in certain
time window.
Mixed selections of rows and columns
Selection of data following certain conditions

Monday, October 24, 2022 XUHU WAN, HKUST 36


Head or Tail

Monday, October 24, 2022 XUHU WAN, HKUST 37


Selection of Columns
SELECT ONE COLUMN SELECT MULTIPLE
COLUMN

Monday, October 24, 2022 XUHU WAN, HKUST 38


Selection of Rows
LOC METHOD: ROW NAMES ILOC METHOD: ROW NUMBERS

ATTN: Loc method is right inclusive and iloc is right exclusive

Monday, October 24, 2022 XUHU WAN, HKUST 39


Mixed Selection

Monday, October 24, 2022 XUHU WAN, HKUST 40


Split data into train and test data (80% train and
20% test)

If the data is time series, the first 80% If the data is cross-sectional (not time series), we
rows as train and the most recent 20% need to shuffle data first before selecting using iloc
as shown in the left.
as the test data.

Slice data using iloc method

Monday, October 24, 2022 XUHU WAN, HKUST 41


Conditional Selection with One Condition
We also may select part of data which satisfy certain conditions. For example,
select all data, such that income is above median

Monday, October 24, 2022 XUHU WAN, HKUST 42


Conditional Selection with Multiple Conditions

And

Or

Monday, October 24, 2022 XUHU WAN, HKUST 43


Making
Changes to
Data

Monday, October 24, 2022 XUHU WAN, HKUST 44


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 45


Add Columns

Monday, October 24, 2022 XUHU WAN, HKUST 46


Update Columns
We make a new column “Season”

Change the new column directly

Value_counts() is to list the frequencies of all values.

Monday, October 24, 2022 XUHU WAN, HKUST 47


Algebraic Operation on Columns

Monday, October 24, 2022 XUHU WAN, HKUST 48


Algebraic Operation on Columns

Monday, October 24, 2022 XUHU WAN, HKUST 49


Dropping Column/Row
Drop Column

Drop Row

Monday, October 24, 2022 XUHU WAN, HKUST 50


Handling Missing
Data

Monday, October 24, 2022 XUHU WAN, HKUST 51


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 52


What is NaN Value
Rarely you will get a data set without any missing values. Pandas displays
missing values as NaN. The NaN value in Pandas comes from numpy. Missing
values may be used or displayed in a few ways in Python—NaN, NAN, or nan—
but they are all equivalent.
We can get NaN when we load in a data set with missing values, or from the
data preprocessing
NaN value may be useful in data preprocessing.

Monday, October 24, 2022 XUHU WAN, HKUST 53


Count Missing Values

Monday, October 24, 2022 XUHU WAN, HKUST 54


Replace NaN Values
We can use the fillna() method to recode the missing values to another value.
For example, suppose we wanted the missing values to be recoded as a 0 or
mode/median of the column.

Monday, October 24, 2022 XUHU WAN, HKUST 55


Fill Missing Values with Fill Forward Method

Monday, October 24, 2022 XUHU WAN, HKUST 56


Fill Missing Values with Fill BackWard

Monday, October 24, 2022 XUHU WAN, HKUST 57


Drop Missing Values

Monday, October 24, 2022 XUHU WAN, HKUST 58


Standardization
of Data

Monday, October 24, 2022 XUHU WAN, HKUST 59


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 60


Standardization/Normalization
Max-Min Standardization
Mean-standard deviation

61
Max-min Standardization
Standardize data into those with the range [0,1]

62
Mean-SD Standardization
The range of standardized data is not bounded but most of data
is in between -3 and 3. It is applicable for the bell-shaped data.

63
Methods for Time
Series Data

Monday, October 24, 2022

XUHU WAN, HKUST 64


Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 65


Convert string to Datetime
You read time series data from csv file and index is in the format of year-
month-days. However it is not in the type of datetime and their type is string.
We need to transform the index using pd.to_datetime

If the index is converted to datetime type, we can get day, month


,year and time easily

Monday, October 24, 2022 XUHU WAN, HKUST 66


Shift()

Monday, October 24, 2022 XUHU WAN, HKUST 67


Pct_change()

Monday, October 24, 2022 XUHU WAN, HKUST 68


Rolling Method
The rolling() function is used to provide rolling window calculations. In
financial times, it can be applied to compute all kinds of dynamic
statistic, e.g. moving average,.

Monday, October 24, 2022 XUHU WAN, HKUST 69


Compute Rolling STD

Monday, October 24, 2022 XUHU WAN, HKUST 70


Compute Cumulative Sum Cumulative sum is
different which compute
the sum starting from
the first row.

Monday, October 24, 2022 XUHU WAN, HKUST 71


Compute cumulative max

Monday, October 24, 2022 XUHU WAN, HKUST 72


Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame

Series and Handling Missing Methods for Time


Data Slicing
DataFrame Values Series

Make Changes to
Advanced methods Standardization
Data

value_counts() Create and update a Max-min


sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Monday, October 24, 2022 XUHU WAN, HKUST 73

You might also like