Topic 2 Pandas Data Processing

ISOM 2600 Business Analytics
TOPIC 2: PANDAS AND DATA

PREPROCESSING
X U H U WA N
HKUST
OCT 25, 2022
Monday, October 24, 2022 XUHU WAN, HKUST 1

Contents
Methods and
Attributes Select columns pd.to_datetime()
Type, .index, .column, Select rows with loc sort_index()
.shape and iloc shift() and
Methods of Drop missing values pct_change()
Mixed selection
statistics:sum(),size() Fill with a value
Conditional selection Rolling mean, std, sum
mean(),std(),min(),max
with one or multiple Fill forward Cumulative sum,
(),quantile(), median()
condition. Fill backward max,min
Axes of DataFrame
Series and Handling Missing Methods for Time

Data Slicing
DataFrame Values Series
Make Changes to
Advanced methods Standardization
Data
value_counts() Create and update a Max-min

sort_values() column
Mean-STD
Chaining styles of Algebraic operation on
methods columns
Plotting() Drop columns

Pandas:
Series and
DataFrame

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Pandas
Pandas offers data structures and operations
for manipulating numerical tables and time
series. This library is our main focus.
Pandas is a library for data analysis. It gives Python the ability to work with
spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among
other functions. In the first lecture , we will focus on DataFrame only and learn to read
data file and do column-wise operation.
Import pandas and read data

csv Data File
Information is separated with a comma. The comma is not the only type of
delimiter, however. Some files are delimited by a tab (TSV) or even a
semicolon. The main reason why CSVs are a preferred data format when
collaborating and sharing data is because any program can open this kind of
data structure. It can even be opened in a text editor.

Index: date and hour
Columns: Hourly power consumption in

Different regions of PJM LLC.
PJM Interconnection LLC (PJM) is a

regional transmission organization (RTO)
in the United States. It is part of the
Eastern Interconnection grid operating an
electric transmission system serving all or
parts of Delaware, Illinois, Indiana,
Kentucky, Maryland, Michigan, New
Jersey, North Carolina, Ohio,
Pennsylvania, Tennessee, Virginia, West
Virginia, and the District of Columbia.
Monday, October
24, 2022 7
XUHU WAN, HKUST

Data Model of Pandas
The most often used data structures are the Series (array data) and the
DataFrame (tabular data).
DataFrame consists of multiple columns of Series.

DataFrame
In pandas, DataFrame is two dimensional data structure which has thes index and
columns, similar the excel spreadsheet. Each column is a Series.
Column
Index

Series
1. A Series is used for one-dimensional data which has an index and
a name.
2. A column of DataFrame is a series.
3. The leftmost column contains the index and the rightmost

column in the output contains the values of series. The default
index is the counting number.

Series and DataFrame
Pandas introduces two data types to Python: Series and DataFrame. The
DataFrame represents your entire spreadsheet or rectangular data, whereas
the Series is a single column of the DataFrame. A Pandas DataFrame can also
be thought of as a dictionary or collection of Series objects . The DataFrame
has Columns and Index.
A Series is a column of DataFrame without column name.

Methods and Attributes
The pandas library provides a lot of functionality. Both series and
DataFrame have their own lists of methods and attibutes. There are over 400
method and attributes on a series.
Colab can pop up a list of attributes for

completions(after “.”) when you hit
Ctrl+Space

Methods and Attributes
In python, the attributes are the characteristics of data type and methods
define the operation/actions
Attributes Methods

Attributes

Statistical methods of Series

Methods of DataFrame
All Series methods can work in DataFrame. But the DataFrame is two-
dimensional. How do we know if the method work row-wisely or
column-wisely?

DataFrame Axis
A DataFrame has two axes which are referred to as axis 0 and 1 or the index
(row) axis and the columns axes

Advanced
Methods

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

value_counts()
This method is powerful to find frequencies for discrete or categorical
variables.

sort_value of Series

Sort_value() of DataFrame

Chaining manipulation
Ascending
order

Chaining manipulation
plot

Plot Methods of Series and
DataFrame
Pandas is integrated with matplotlib deeply. With plot attributes, we can easily
chain the plot with other methods.

Histograms

Boxplot

Line plot

Bar plot

Pie plots

Data Slicing
34
Monday, October 24, 2022

XUHU WAN, HKUST
Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Data Slicing (Subsetting )
The data is usually large and it is better to check details by displaying or make
a copy of a part of the data.
We can use the head or tail method to look at the first or last the several
rows . This is useful to see if the data is loaded correctly and has a sense of
the data contents, i.e. index, columns, values etc.
We can select one or multiple columns to compare different variables
We can select one or more rows to check data pattern only in certain
time window.
Mixed selections of rows and columns
Selection of data following certain conditions

Head or Tail

Selection of Columns
SELECT ONE COLUMN SELECT MULTIPLE
COLUMN

Selection of Rows
LOC METHOD: ROW NAMES ILOC METHOD: ROW NUMBERS
ATTN: Loc method is right inclusive and iloc is right exclusive

Mixed Selection

Split data into train and test data (80% train and
20% test)
If the data is time series, the first 80% If the data is cross-sectional (not time series), we
rows as train and the most recent 20% need to shuffle data first before selecting using iloc
as shown in the left.
as the test data.
Slice data using iloc method

Conditional Selection with One Condition
We also may select part of data which satisfy certain conditions. For example,
select all data, such that income is above median

Conditional Selection with Multiple Conditions
And
Or

Making
Changes to
Data

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Add Columns

Update Columns
We make a new column “Season”
Change the new column directly
Value_counts() is to list the frequencies of all values.

Algebraic Operation on Columns

Algebraic Operation on Columns

Dropping Column/Row
Drop Column
Drop Row

Handling Missing
Data

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

What is NaN Value
Rarely you will get a data set without any missing values. Pandas displays
missing values as NaN. The NaN value in Pandas comes from numpy. Missing
values may be used or displayed in a few ways in Python—NaN, NAN, or nan—
but they are all equivalent.
We can get NaN when we load in a data set with missing values, or from the
data preprocessing
NaN value may be useful in data preprocessing.

Count Missing Values

Replace NaN Values
We can use the fillna() method to recode the missing values to another value.
For example, suppose we wanted the missing values to be recoded as a 0 or
mode/median of the column.

Fill Missing Values with Fill Forward Method

Fill Missing Values with Fill BackWard

Drop Missing Values

Standardization
of Data

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Standardization/Normalization
Max-Min Standardization
Mean-standard deviation
61
Max-min Standardization
Standardize data into those with the range [0,1]
62
Mean-SD Standardization
The range of standardized data is not bounded but most of data
is in between -3 and 3. It is applicable for the bell-shaped data.
63
Methods for Time
Series Data
Monday, October 24, 2022
XUHU WAN, HKUST 64

Contents
Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Convert string to Datetime
You read time series data from csv file and index is in the format of year-
month-days. However it is not in the type of datetime and their type is string.
We need to transform the index using pd.to_datetime
If the index is converted to datetime type, we can get day, month

,year and time easily

Shift()

Pct_change()

Rolling Method
The rolling() function is used to provide rolling window calculations. In
financial times, it can be applied to compute all kinds of dynamic
statistic, e.g. moving average,.

Compute Rolling STD

Compute Cumulative Sum Cumulative sum is
different which compute
the sum starting from
the first row.

Compute cumulative max

Methods and
Mixed selection
Axes of DataFrame

Data Slicing
Make Changes to
Data

Mean-STD
methods columns

Topic 2 Pandas Data Processing

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Topic 2 Pandas Data Processing

Uploaded by

Copyright:

Available Formats

ISOM 2600 Business Analytics

TOPIC 2: PANDAS AND DATA

Monday, October 24, 2022 XUHU WAN, HKUST 1

Series and Handling Missing Methods for Time

value_counts() Create and update a Max-min

Monday, October 24, 2022 XUHU WAN, HKUST 2

Monday, October 24, 2022 XUHU WAN, HKUST 3

Series and Handling Missing Methods for Time

value_counts() Create and update a Max-min

Monday, October 24, 2022 XUHU WAN, HKUST 4

Import pandas and read data

Monday, October 24, 2022 XUHU WAN, HKUST 5

Monday, October 24, 2022 XUHU WAN, HKUST 6

Columns: Hourly power consumption in

PJM Interconnection LLC (PJM) is a

XUHU WAN, HKUST

Monday, October 24, 2022 XUHU WAN, HKUST 8

Monday, October 24, 2022 XUHU WAN, HKUST 9

2. A column of DataFrame is a series.

3. The leftmost column contains the index and the rightmost

Monday, October 24, 2022 XUHU WAN, HKUST 10

A Series is a column of DataFrame without column name.

Monday, October 24, 2022 XUHU WAN, HKUST 11

Colab can pop up a list of attributes for

Monday, October 24, 2022 XUHU WAN, HKUST 12

Monday, October 24, 2022 XUHU WAN, HKUST 13

Monday, October 24, 2022 XUHU WAN, HKUST 14

Monday, October 24, 2022 XUHU WAN, HKUST 15

Monday, October 24, 2022 XUHU WAN, HKUST 16

Monday, October 24, 2022 XUHU WAN, HKUST 17

Monday, October 24, 2022 XUHU WAN, HKUST 20

Series and Handling Missing Methods for Time

value_counts() Create and update a Max-min

Monday, October 24, 2022 XUHU WAN, HKUST 21

Monday, October 24, 2022 XUHU WAN, HKUST 22

Monday, October 24, 2022 XUHU WAN, HKUST 23

Monday, October 24, 2022 XUHU WAN, HKUST 24

Monday, October 24, 2022 XUHU WAN, HKUST 25

Monday, October 24, 2022 XUHU WAN, HKUST 26

Monday, October 24, 2022 XUHU WAN, HKUST 27

Monday, October 24, 2022 XUHU WAN, HKUST 28

Monday, October 24, 2022 XUHU WAN, HKUST 29

Monday, October 24, 2022 XUHU WAN, HKUST 30

Monday, October 24, 2022 XUHU WAN, HKUST 31

Monday, October 24, 2022 XUHU WAN, HKUST 33

Monday, October 24, 2022

Series and Handling Missing Methods for Time

value_counts() Create and update a Max-min

Monday, October 24, 2022 XUHU WAN, HKUST 35

Monday, October 24, 2022 XUHU WAN, HKUST 36

Monday, October 24, 2022 XUHU WAN, HKUST 37

Monday, October 24, 2022 XUHU WAN, HKUST 38

ATTN: Loc method is right inclusive and iloc is right exclusive

Monday, October 24, 2022 XUHU WAN, HKUST 39

Monday, October 24, 2022 XUHU WAN, HKUST 40

Slice data using iloc method

Monday, October 24, 2022 XUHU WAN, HKUST 41

Monday, October 24, 2022 XUHU WAN, HKUST 42

Monday, October 24, 2022 XUHU WAN, HKUST 43

Monday, October 24, 2022 XUHU WAN, HKUST 44

Series and Handling Missing Methods for Time

value_counts() Create and update a Max-min