
Prepared by Shankar Wagh

Linkedin Page (https://www.linkedin.com/in/shankar-wagh)


Pandas Full Tutorial

1. Pandas is built on top of NumPy.
2. Its performance-critical internals are implemented in C/Cython.
3. It can represent 2D, mixed-type (heterogeneous) data, unlike a NumPy array, which holds values of a single type.
4. We can give user-defined names (labels) to rows and columns (see the short sketch below).
5. A DataFrame represents data in a tabular format.
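For instance, here is a minimal sketch of points 3-5 (the names and numbers are made up for illustration):

    import pandas as pd

    # a small table with user-defined row and column labels and mixed column types
    scores = pd.DataFrame(
        {'name': ['Asha', 'Ravi'], 'marks': [81, 92]},   # hypothetical data
        index=['student1', 'student2']                   # user-defined row labels
    )
    print(scores)
    print(scores.loc['student2', 'marks'])   # label-based access -> 92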

In [2]: import pandas as pd


import numpy as np

Series Creation
In [3]: # Creating series from List
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
days_ser = pd.Series(days)
print(days_ser, type(days_ser))

0 Sunday

1 Monday

2 Tuesday

3 Wednesday

4 Thursday

5 Friday

6 Saturday

dtype: object <class 'pandas.core.series.Series'>


In [515]: print(days_ser[0])

Sunday

In [516]: print(days_ser[len(days_ser)-1])
print(days_ser.shape)
print(days_ser.size)

Saturday

(7,)

7

In [517]: # Negative indexing Not possible


# print(days_ser[-1]) # ValueError: -1 is not in range
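Since plain [] indexing on this Series does not accept negative positions, .iloc gives position-based access where negative positions do work (a small sketch using the same series):

    print(days_ser.iloc[-1])    # 'Saturday' - position-based access
    print(days_ser.iloc[-2:])   # last two entries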

In [518]: # Slicing
print(days_ser[1:4])

1 Monday

2 Tuesday

3 Wednesday

dtype: object

In [519]: # Explicit indexing


days_ser.index = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7']

In [520]: days_ser

Out[520]: day1 Sunday

day2 Monday

day3 Tuesday

day4 Wednesday

day5 Thursday

day6 Friday

day7 Saturday

dtype: object

In [521]: print(days_ser['day1'])
print(days_ser['day6'])

Sunday

Friday

In [522]: days_ser['day3':'day5': ]

Out[522]: day3 Tuesday

day4 Wednesday

day5 Thursday

dtype: object


In [523]: # Reversing series


days_ser[::-1]

Out[523]: day7 Saturday

day6 Friday

day5 Thursday

day4 Wednesday

day3 Tuesday

day2 Monday

day1 Sunday

dtype: object

In [524]: # Passing index parameter


states = ['MH', 'UP', 'MP', 'AP', 'KA', 'TN', 'WB', 'RJ', "DL"]
state_ser = pd.Series(states, index = ['st' + str(i) for i in range(1, len(states) + 1)])
state_ser

Out[524]: st1 MH

st2 UP

st3 MP

st4 AP

st5 KA

st6 TN

st7 WB

st8 RJ

st9 DL

dtype: object

In [525]: state_ser[0]

Out[525]: 'MH'

In [527]: # Slicing
state_ser[3:6:2]

Out[527]: st4 AP

st6 TN

dtype: object


In [528]: # Passing index parameter with duplicate index


capitals = [ 'MUM','LKO', 'BHP', 'AMT', 'BLR', 'CHN', 'KOL','JP', 'DL']
capitals_ser = pd.Series(capitals, index = list('abcdaedfg'))
capitals_ser

Out[528]: a MUM

b LKO

c BHP

d AMT

a BLR

e CHN

d KOL

f JP

g DL

dtype: object

In [529]: capitals_ser['a']

Out[529]: a MUM

a BLR

dtype: object

In [530]: # capitals_ser['a':'a'] # KeyError: "Cannot get left slice bound for non-unique label: 'a'"
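Label slicing fails here because the index is non-unique and unsorted. One workaround (a sketch, assuming reordering the rows is acceptable) is to sort the index first, which makes it monotonic and therefore sliceable:

    sorted_caps = capitals_ser.sort_index()
    print(sorted_caps.loc['a':'b'])   # both 'a' entries followed by 'b'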

In [531]: # Creating a series from a dictionary


d = dict(zip(states, capitals))
pd.Series(d)

Out[531]: MH MUM

UP LKO

MP BHP

AP AMT

KA BLR

TN CHN

WB KOL

RJ JP

DL DL

dtype: object


In [532]: d = dict(zip(states, capitals))


# none of the labels 'a'..'j' match the dictionary keys, so every value is NaN
pd.Series(d, index = list('abcdefghij'))

Out[532]: a NaN

b NaN

c NaN

d NaN

e NaN

f NaN

g NaN

h NaN

i NaN

j NaN

dtype: object

In [533]: d = dict(zip(states, capitals))


state_cap = pd.Series(d, index = ['MH', 'UP', 'MP', 'AP', 'KA', 'TN', 'WB', 'RJ', 'DL', 'PN'])
state_cap

Out[533]: MH MUM

UP LKO

MP BHP

AP AMT

KA BLR

TN CHN

WB KOL

RJ JP

DL DL

PN NaN

dtype: object

In [534]: # Adding data to Series


state_cap['PN'] = 'Chandigarh'
state_cap['JK'] = 'Kashmir'
state_cap['ZA'] = 'Zarkhand'

In [535]: state_cap

Out[535]: MH MUM

UP LKO

MP BHP

AP AMT

KA BLR

TN CHN

WB KOL

RJ JP

DL DL

PN Chandigarh

JK Kashmir

ZA Zarkhand

dtype: object


DataFrame
In [536]: # Creating dataframe from Dictionary
df_dict = {'Year' : [1990, 1994, 1998, 2002],
'Country' : ['Italy', 'USA', 'France', 'Japan'],
'Winner' : ['Germany', 'Brazil', 'France', 'Brazil'],
'GoalScored' : [115, 141, 171, 161]
}
df_dict = pd.DataFrame(df_dict)
df_dict

Out[536]:
Year Country Winner GoalScored

0 1990 Italy Germany 115

1 1994 USA Brazil 141

2 1998 France France 171

3 2002 Japan Brazil 161

In [537]: print(type(df_dict))

<class 'pandas.core.frame.DataFrame'>

In [538]: # Creating dataframe from List of tuples


df_lotuples = [(2002, 'Japan', 'Brazil', 161),
(2006, 'Germany', 'Italy', 147),
(2010, 'South Africa', 'Spain', 145),
(2014, 'Brazil', 'Germany', 171)
]
pd.DataFrame(df_lotuples, columns = ['Year', 'Country','Winner','GoalScored'])

Out[538]:
Year Country Winner GoalScored

0 2002 Japan Brazil 161

1 2006 Germany Italy 147

2 2010 South Africa Spain 145

3 2014 Brazil Germany 171


In [539]: # creating dataframe from list of list


df_listoflist = [[2002, 'Japan', 'Brazil', 161],
[2006, 'Germany', 'Italy', 147],
[2010, 'South Africa', 'Spain', 145],
[2014, 'Brazil', 'Germany', 171]
]
pd.DataFrame(df_listoflist, columns = ['Year', 'Country','Winner','GoalScored'])

Out[539]:
Year Country Winner GoalScored

0 2002 Japan Brazil 161

1 2006 Germany Italy 147

2 2010 South Africa Spain 145

3 2014 Brazil Germany 171

In [540]: # Creating dataframe using list of dictionary


df_lodict = [
{'year' : 2002, 'HostCountry' : 'Japan', 'Winner' : 'Brazil'},
{'year' : 2006, 'HostCountry' : 'Germany', 'Winner' : 'Italy'},
{'year' : 2010, 'HostCountry' : 'South Africa', 'Winner' : 'Spain'},
{'year' : 2014, 'HostCountry' : 'Brazil', 'Winner' : 'Germany'},
]
pd.DataFrame(df_lodict)

Out[540]:
year HostCountry Winner

0 2002 Japan Brazil

1 2006 Germany Italy

2 2010 South Africa Spain

3 2014 Brazil Germany

Pandas-Level Functions

pd.read_csv
Read a comma-separated values (csv) file into DataFrame.

pd.read_csv(
    filepath_or_buffer: 'FilePathOrBuffer',
    sep=NoDefault.no_default,
    delimiter=None,
    header='infer',
    names=NoDefault.no_default,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=NoDefault.no_default,
    mangle_dupe_cols=True,
    dtype: 'DtypeArg | None' = None,
    engine=None,
    converters=None,
    true_values=None,
    false_values=None,
    skipinitialspace=False,
    skiprows=None,
    skipfooter=0,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    verbose=False,
    skip_blank_lines=True,
    parse_dates=False,
    infer_datetime_format=False,
    keep_date_col=False,
    date_parser=None,
    dayfirst=False,
    cache_dates=True,
    iterator=False,
    chunksize=None,
    compression='infer',
    thousands=None,
    decimal: 'str' = '.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    doublequote=True,
    escapechar=None,
    comment=None,
    encoding=None,
    encoding_errors: 'str | None' = 'strict',
    dialect=None,
    error_bad_lines=None,
    warn_bad_lines=None,
    on_bad_lines=None,
    delim_whitespace=False,
    low_memory=True,
    memory_map=False,
    float_precision=None,
    storage_options: 'StorageOptions' = None,
)
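A short sketch of a few commonly used options (the file and column names are the ones used later in this notebook):

    df = pd.read_csv(
        'avocado.csv',
        usecols=['Region', 'Type', 'AveragePrice', 'Date'],  # read only selected columns
        dtype={'AveragePrice': 'float64'},                    # force a column's dtype
        parse_dates=['Date'],                                 # parse strings into datetime64
        nrows=100,                                            # read at most 100 rows
    )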

In [541]: avocado_data = pd.read_csv('avocado.csv')


avocado_data.head()

Out[541]:
       Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0     Atlanta       organic    89424.11      207.08    89631.19     190257.38          1.70  2018-03-25
1     Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18
2      Boston       organic   120465.39       18.83   120484.22     236822.98          1.58  2018-03-11
3      Boston  conventional   136877.43       60.60   136938.03     239135.67          1.57  2018-03-04
4  California       organic    66273.89       46.58    66320.47     179041.72          1.82  2018-02-25

In [542]: avocado_data = pd.read_csv('avocado.csv', usecols=['Region','Type','AveragePrice'])


avocado_data.head()

Out[542]:
Region Type AveragePrice

0 Atlanta organic 1.70

1 Atlanta conventional 1.75

2 Boston organic 1.58

3 Boston conventional 1.57

4 California organic 1.82

In [544]: avocado_data = pd.read_csv('avocado.csv', usecols=[0,1,6])


avocado_data.head()

Out[544]:
Region Type AveragePrice

0 Atlanta organic 1.70

1 Atlanta conventional 1.75

2 Boston organic 1.58

3 Boston conventional 1.57

4 California organic 1.82

pd.read_excel

Read an Excel file into a pandas DataFrame.

In [3]: # read excel data


pd.read_excel('football_worldcup.xlsx')

Out[3]:
Year Country Winner Runners-Up GoalsScored MatchesPlayed

0 1990 Italy Germany Argentina 115 52

1 1994 USA Brazil Italy 141 52

2 1998 France France Brazil 171 64

3 2002 Japan Brazil Germany 161 64

4 2006 Germany Italy France 147 64

5 2010 South Africa Spain Netherlands 145 64

6 2014 Brazil Germany Argentina 171 64

pd.read_clipboard
Read text from clipboard and pass to read_csv.

In [546]: # copy above data using mouse cursor


pd.read_clipboard(header=None)

Out[546]:
0 1 2 3 4 5 6

0 0 1990 Italy Germany Argentina 115 52

1 1 1994 USA Brazil Italy 141 52

2 2 1998 France France Brazil 171 64

3 3 2002 Japan Brazil Germany 161 64

4 4 2006 Germany Italy France 147 64

5 5 2010 South Africa Spain Netherlands 145 64

6 6 2014 Brazil Germany Argentina 171 64

pd.get_dummies
Convert categorical variable into dummy/indicator variables.


In [547]: avocado_data.head()

Out[547]:
Region Type AveragePrice

0 Atlanta organic 1.70

1 Atlanta conventional 1.75

2 Boston organic 1.58

3 Boston conventional 1.57

4 California organic 1.82

In [548]: # One Hot Encoding using get_dummies


pd.get_dummies(avocado_data)

Out[548]:
   AveragePrice  Region_Atlanta  Region_Boston  Region_California  Region_NewYork  Region_SanFrancisco  Type_conventional  Type_organic
0          1.70               1              0                  0               0                    0                  0             1
1          1.75               1              0                  0               0                    0                  1             0
2          1.58               0              1                  0               0                    0                  0             1
3          1.57               0              1                  0               0                    0                  1             0
4          1.82               0              0                  1               0                    0                  0             1
5          1.01               0              0                  1               0                    0                  1             0
6          1.38               0              0                  0               1                    0                  0             1
7          1.29               0              0                  0               1                    0                  1             0
8          1.16               0              0                  0               0                    1                  0             1
9          1.17               0              0                  0               0                    1                  1             0


In [550]: # One Hot Encoding using get_dummies


# for removing dummy variable trap use drop_first=True
pd.get_dummies(avocado_data, drop_first=True)

Out[550]:
   AveragePrice  Region_Boston  Region_California  Region_NewYork  Region_SanFrancisco  Type_organic
0          1.70              0                  0               0                    0             1
1          1.75              0                  0               0                    0             0
2          1.58              1                  0               0                    0             1
3          1.57              1                  0               0                    0             0
4          1.82              0                  1               0                    0             1
5          1.01              0                  1               0                    0             0
6          1.38              0                  0               1                    0             1
7          1.29              0                  0               1                    0             0
8          1.16              0                  0               0                    1             1
9          1.17              0                  0               0                    1             0

pd.to_datetime
Convert argument to datetime.

In [552]: daywise= pd.read_csv('daywise.csv', usecols=[0,1,2,3])


daywise.head()

Out[552]:
Date Confirmed Deaths Recovered

0 1/22/20 555 17 28

1 1/23/20 654 18 30

2 1/24/20 941 26 36

3 1/25/20 1434 42 39

4 1/26/20 2118 56 52


In [554]: # for checking datatype of Date column


daywise.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 264 entries, 0 to 263

Data columns (total 4 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Date 264 non-null object

1 Confirmed 264 non-null int64

2 Deaths 264 non-null int64

3 Recovered 264 non-null int64

dtypes: int64(3), object(1)

memory usage: 8.4+ KB

In [555]: # Converting Object to datetime using to_datetime


pd.to_datetime(daywise['Date'])

Out[555]: 0 2020-01-22

1 2020-01-23

2 2020-01-24

3 2020-01-25

4 2020-01-26

...

259 2020-09-05

260 2020-09-06

261 2020-09-07

262 2020-09-08

263 2020-09-09

Name: Date, Length: 264, dtype: datetime64[ns]
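Note that to_datetime returns a new Series; it is not stored back automatically. A short sketch (the format string is an assumption based on the dates shown above) that assigns the result and uses the .dt accessor:

    daywise['Date'] = pd.to_datetime(daywise['Date'], format='%m/%d/%y')
    print(daywise['Date'].dt.month.head())   # month number of each date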

pd.to_numeric
Convert argument to a numeric type.

In [556]: pd.to_numeric(daywise['Deaths'])

Out[556]: 0 17

1 18

2 26

3 42

4 56

...

259 879645

260 883414

261 892726

262 897463

263 903759

Name: Deaths, Length: 264, dtype: int64
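to_numeric is most useful when a column arrives as strings; with errors='coerce', entries that cannot be parsed become NaN. A small sketch with hypothetical values:

    s = pd.Series(['1', '2', 'three', '4'])
    print(pd.to_numeric(s, errors='coerce'))   # 1.0, 2.0, NaN, 4.0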



pd.unique
Uniques are returned in order
of appearance. This does NOT sort.

In [567]: avocado_data.head()

Out[567]:
Region Type AveragePrice

0 Atlanta organic 1.70

1 Atlanta conventional 1.75

2 Boston organic 1.58

3 Boston conventional 1.57

4 California organic 1.82

In [565]: pd.unique(avocado_data['Region'])

Out[565]: array(['Atlanta', 'Boston', 'California', 'NewYork', 'SanFrancisco'],

dtype=object)

pd.value_counts
Return a Series containing counts of unique values.

In [566]: pd.value_counts(avocado_data['Region'])

Out[566]: Atlanta 2

Boston 2

California 2

NewYork 2

SanFrancisco 2

Name: Region, dtype: int64
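The same counts are usually obtained from the Series method, which newer pandas versions recommend over the top-level function and which also supports relative frequencies:

    print(avocado_data['Region'].value_counts())
    print(avocado_data['Region'].value_counts(normalize=True))   # fractions instead of counts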

pd.factorize
Encode the object as an enumerated type or categorical variable.

In [570]: codes, uniques = pd.factorize(avocado_data['Region'])

In [571]: codes

Out[571]: array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4], dtype=int64)


In [572]: uniques

Out[572]: Index(['Atlanta', 'Boston', 'California', 'NewYork', 'SanFrancisco'], dtype='object')
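The codes index into uniques, so the original column can be rebuilt from the two pieces (a quick sketch):

    rebuilt = uniques[codes]
    print((rebuilt == avocado_data['Region'].values).all())   # True - round-trips the column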

DataFrame-Level Functions

df.abs
Return a Series/DataFrame with absolute numeric value of each element.

In [573]: avocado_data = pd.read_csv('avocado.csv')


avocado_data.head()

Out[573]:
       Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0     Atlanta       organic    89424.11      207.08    89631.19     190257.38          1.70  2018-03-25
1     Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18
2      Boston       organic   120465.39       18.83   120484.22     236822.98          1.58  2018-03-11
3      Boston  conventional   136877.43       60.60   136938.03     239135.67          1.57  2018-03-04
4  California       organic    66273.89       46.58    66320.47     179041.72          1.82  2018-02-25

In [574]: avocado_data.at[0, 'Large Bags'] = -340.8

In [575]: avocado_data.head()

Out[575]:
       Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0     Atlanta       organic    89424.11     -340.80    89631.19     190257.38          1.70  2018-03-25
1     Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18
2      Boston       organic   120465.39       18.83   120484.22     236822.98          1.58  2018-03-11
3      Boston  conventional   136877.43       60.60   136938.03     239135.67          1.57  2018-03-04
4  California       organic    66273.89       46.58    66320.47     179041.72          1.82  2018-02-25


In [576]: avocado_data['Large Bags'] = avocado_data['Large Bags'].abs()

In [578]: avocado_data.head()

Out[578]:
       Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0     Atlanta       organic    89424.11      340.80    89631.19     190257.38          1.70  2018-03-25
1     Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18
2      Boston       organic   120465.39       18.83   120484.22     236822.98          1.58  2018-03-11
3      Boston  conventional   136877.43       60.60   136938.03     239135.67          1.57  2018-03-04
4  California       organic    66273.89       46.58    66320.47     179041.72          1.82  2018-02-25

df.add
Get Addition of dataframe and other, element-wise (binary operator add ).

In [579]: avocado_data[['Large Bags','Total Bags','AveragePrice']].add(1)

Out[579]:
Large Bags Total Bags AveragePrice

0 341.80 89632.19 2.70

1 154.00 102871.50 2.75

2 19.83 120485.22 2.58

3 61.60 136939.03 2.57

4 47.58 66321.47 2.82

5 187.20 106985.89 2.01

6 93.29 124215.59 2.38

7 197.57 197282.89 2.29

8 1287.43 236418.93 2.16

9 610.20 166837.16 2.17

Among flexible wrappers ( add , sub , mul , div , mod , pow ) to


arithmetic operators: + , - , * , / , // , % , ** .

df.add_prefix

For DataFrame, the column labels are prefixed.

In [581]: avocado_data.add_prefix('New_')

Out[581]:
New_Small New_Large New_Total New_Total
New_Region New_Type New_AveragePrice N
Bags Bags Bags Volume

0 Atlanta organic 89424.11 340.80 89631.19 190257.38 1.70

1 Atlanta conventional 102717.50 153.00 102870.50 202790.74 1.75

2 Boston organic 120465.39 18.83 120484.22 236822.98 1.58

3 Boston conventional 136877.43 60.60 136938.03 239135.67 1.57

4 California organic 66273.89 46.58 66320.47 179041.72 1.82

5 California conventional 103033.73 186.20 106984.89 1203274.11 1.01

6 NewYork organic 119694.95 92.29 124214.59 777300.99 1.38

7 NewYork conventional 193813.92 196.57 197281.89 904333.98 1.29

8 SanFrancisco organic 231913.11 1286.43 236417.93 1051308.50 1.16

9 SanFrancisco conventional 162913.33 609.20 166836.16 984000.13 1.17

df.add_suffix
For DataFrame, the column labels are suffixed.


In [582]: avocado_data.add_suffix('_New')

Out[582]:
Small Large Total Total
Region_New Type_New AveragePrice_New D
Bags_New Bags_New Bags_New Volume_New

0 Atlanta organic 89424.11 340.80 89631.19 190257.38 1.70

1 Atlanta conventional 102717.50 153.00 102870.50 202790.74 1.75

2 Boston organic 120465.39 18.83 120484.22 236822.98 1.58

3 Boston conventional 136877.43 60.60 136938.03 239135.67 1.57

4 California organic 66273.89 46.58 66320.47 179041.72 1.82

5 California conventional 103033.73 186.20 106984.89 1203274.11 1.01

6 NewYork organic 119694.95 92.29 124214.59 777300.99 1.38

7 NewYork conventional 193813.92 196.57 197281.89 904333.98 1.29

8 SanFrancisco organic 231913.11 1286.43 236417.93 1051308.50 1.16

9 SanFrancisco conventional 162913.33 609.20 166836.16 984000.13 1.17

df.agg
Aggregate using one or more operations over the specified axis.

In [583]: avocado_data_num = avocado_data[['Small Bags','Large Bags','Total Bags','Total Volume','AveragePrice']]


avocado_data_num.agg(['sum','max', 'min'])

Out[583]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

sum 1327127.36 2990.50 1347979.87 5968266.20 14.43

max 231913.11 1286.43 236417.93 1203274.11 1.82

min 66273.89 18.83 66320.47 179041.72 1.01
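agg also accepts a dict mapping column names to one or more functions, so different columns can be summarized differently (a sketch):

    avocado_data_num.agg({'Large Bags': 'mean', 'AveragePrice': ['min', 'max']})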

df.aggregate

Aggregate using one or more operations over the specified axis.

In [587]: avocado_data_num.aggregate(['sum','max', 'min'])

Out[587]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

sum 1327127.36 2990.50 1347979.87 5968266.20 14.43

max 231913.11 1286.43 236417.93 1203274.11 1.82

min 66273.89 18.83 66320.47 179041.72 1.01

df.all
Return whether all elements are True, potentially over an axis.

In [588]: avocado_data_num.all()

Out[588]: Small Bags True

Large Bags True

Total Bags True

Total Volume True

AveragePrice True

dtype: bool

df.any
Return whether any element is True, potentially over an axis.

In [589]: avocado_data_num.any()

Out[589]: Small Bags True

Large Bags True

Total Bags True

Total Volume True

AveragePrice True

dtype: bool

df.append


In [590]: avocado_data_num

Out[590]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 340.80 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

In [591]: new_data = pd.DataFrame([[10,11,12,13,14]],
                                  columns=['Small Bags','Large Bags','Total Bags','Total Volume','AveragePrice'])

In [592]: new_data

Out[592]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 10 11 12 13 14


In [593]: avocado_data_num.append(new_data)

Out[593]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 340.80 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

0 10.00 11.00 12.00 13.00 14.00

In [594]: avocado_data_num.append(new_data, ignore_index=True)

Out[594]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 340.80 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

10 10.00 11.00 12.00 13.00 14.00
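Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the replacement and produces the same result here:

    pd.concat([avocado_data_num, new_data], ignore_index=True)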

df.apply
Apply a function along an axis of the DataFrame.

In [595]: import numpy as np


In [596]: avocado_data_num.apply(func = np.sqrt)

Out[596]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 299.038643 18.460769 299.384686 436.185030 1.303840

1 320.495710 12.369317 320.734314 450.322929 1.322876

2 347.081244 4.339355 347.108369 486.644614 1.256981

3 369.969499 7.784600 370.051388 489.015000 1.252996

4 257.437157 6.824954 257.527610 423.133218 1.349074

5 320.988676 13.645512 327.085448 1096.938517 1.004988

6 345.969580 9.606768 352.440903 881.646749 1.174734

7 440.243024 14.020342 444.164260 950.964763 1.135782

8 481.573577 35.866837 486.228269 1025.333360 1.077033

9 403.625235 24.681977 408.455824 991.967807 1.081665

In [597]: avocado_data_num.apply(func = lambda x:x*2)

Out[597]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 178848.22 681.60 179262.38 380514.76 3.40

1 205435.00 306.00 205741.00 405581.48 3.50

2 240930.78 37.66 240968.44 473645.96 3.16

3 273754.86 121.20 273876.06 478271.34 3.14

4 132547.78 93.16 132640.94 358083.44 3.64

5 206067.46 372.40 213969.78 2406548.22 2.02

6 239389.90 184.58 248429.18 1554601.98 2.76

7 387627.84 393.14 394563.78 1808667.96 2.58

8 463826.22 2572.86 472835.86 2102617.00 2.32

9 325826.66 1218.40 333672.32 1968000.26 2.34
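apply can also work row-wise with axis=1; for example, a sketch computing the share of small bags in the total for each row:

    avocado_data_num.apply(lambda row: row['Small Bags'] / row['Total Bags'], axis=1)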

df.applymap
Apply a function to a Dataframe elementwise.


In [598]: avocado_data_num.applymap(func = np.sqrt)

Out[598]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 299.038643 18.460769 299.384686 436.185030 1.303840

1 320.495710 12.369317 320.734314 450.322929 1.322876

2 347.081244 4.339355 347.108369 486.644614 1.256981

3 369.969499 7.784600 370.051388 489.015000 1.252996

4 257.437157 6.824954 257.527610 423.133218 1.349074

5 320.988676 13.645512 327.085448 1096.938517 1.004988

6 345.969580 9.606768 352.440903 881.646749 1.174734

7 440.243024 14.020342 444.164260 950.964763 1.135782

8 481.573577 35.866837 486.228269 1025.333360 1.077033

9 403.625235 24.681977 408.455824 991.967807 1.081665

df.astype
Cast a pandas object to a specified dtype dtype .

In [599]: avocado_data_num.astype('int64')

Out[599]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424 340 89631 190257 1

1 102717 153 102870 202790 1

2 120465 18 120484 236822 1

3 136877 60 136938 239135 1

4 66273 46 66320 179041 1

5 103033 186 106984 1203274 1

6 119694 92 124214 777300 1

7 193813 196 197281 904333 1

8 231913 1286 236417 1051308 1

9 162913 609 166836 984000 1
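astype also accepts a dict, so only selected columns are cast (a sketch):

    avocado_data_num.astype({'Total Volume': 'int64', 'AveragePrice': 'float32'}).dtypes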

df.at
Access a single value for a row/column label pair.


In [600]: avocado_data_num = avocado_data[['Small Bags','Large Bags','Total Bags','Total Volume','AveragePrice']]

In [601]: avocado_data_num

Out[601]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 340.80 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

In [602]: avocado_data_num.at[0, 'Large Bags']

Out[602]: 340.8

In [603]: avocado_data_num.at[0, 'Large Bags'] = 100

In [604]: avocado_data_num.head(3)

Out[604]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

df.iat
Access a single value for a row/column pair by integer position.

In [605]: avocado_data_num.iat[0, 1]

Out[605]: 100.0

df.boxplot

In [606]: avocado_data_num.boxplot()

Out[606]: <matplotlib.axes._subplots.AxesSubplot at 0x1db6dcc7390>

In [607]: # AttributeError: 'Series' object has no attribute 'boxplot'


# avocado_data_num['Small Bags'].boxplot()

df.columns
Gives columns of Dataframe

In [610]: avocado_data_num.columns

Out[610]: Index(['Small Bags', 'Large Bags', 'Total Bags', 'Total Volume',

'AveragePrice'],

dtype='object')

df.corr
Compute pairwise correlation of columns, excluding NA/null values.

In [611]: avocado_data_num.corr()

Out[611]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

Small Bags 1.000000 0.780559 0.999473 0.601784 -0.617340

Large Bags 0.780559 1.000000 0.784400 0.576798 -0.550959

Total Bags 0.999473 0.784400 1.000000 0.625409 -0.637999

Total Volume 0.601784 0.576798 0.625409 1.000000 -0.970196

AveragePrice -0.617340 -0.550959 -0.637999 -0.970196 1.000000


In [612]: import seaborn as sns


sns.heatmap(avocado_data_num.corr(), annot=True)

Out[612]: <matplotlib.axes._subplots.AxesSubplot at 0x1db6de18d30>

df.count
Count non-NA cells for each column or row.
If axis is 0 or 'index', counts are generated for each column.
If axis is 1 or 'columns', counts are generated for each row.

In [613]: avocado_data_num

Out[613]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17


In [614]: # Count non-NA cells for each column --> axis = 0


avocado_data_num.count(axis = 0)

Out[614]: Small Bags 10

Large Bags 10

Total Bags 10

Total Volume 10

AveragePrice 10

dtype: int64

In [615]: # Count non-NA cells for each row --> axis = 1


avocado_data_num.count(axis = 1)

Out[615]: 0 5

1 5

2 5

3 5

4 5

5 5

6 5

7 5

8 5

9 5

dtype: int64

df.cov
Compute pairwise covariance of columns, excluding NA/null values.

In [616]: avocado_data_num.cov()

Out[616]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

Small Bags 2.543504e+09 1.547827e+07 2.608450e+09 1.281268e+10 -8725.105398

Large Bags 1.547827e+07 1.545972e+05 1.596005e+07 9.574319e+07 -60.708644

Total Bags 2.608450e+09 1.596005e+07 2.677879e+09 1.366291e+10 -9252.211901

Total Volume 1.281268e+10 9.574319e+07 1.366291e+10 1.782244e+11 -114781.831256

AveragePrice -8.725105e+03 -6.070864e+01 -9.252212e+03 -1.147818e+05 0.078534

df.cumsum
Return cumulative sum over a DataFrame or Series axis.

To iterate over columns and find the sum in each row, use axis=1.
To iterate over rows and find the sum in each column, use axis=0.


In [618]: avocado_data_num

Out[618]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

In [619]: # Calculate cumulative sum wrt column


avocado_data_num.cumsum(axis=0)

Out[619]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 192141.61 253.00 192501.69 393048.12 3.45

2 312607.00 271.83 312985.91 629871.10 5.03

3 449484.43 332.43 449923.94 869006.77 6.60

4 515758.32 379.01 516244.41 1048048.49 8.42

5 618792.05 565.21 623229.30 2251322.60 9.43

6 738487.00 657.50 747443.89 3028623.59 10.81

7 932300.92 854.07 944725.78 3932957.57 12.10

8 1164214.03 2140.50 1181143.71 4984266.07 13.26

9 1327127.36 2749.70 1347979.87 5968266.20 14.43


In [620]: # Calculate cumulative sum wrt row


avocado_data_num.cumsum(axis=1)

Out[620]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 89524.11 179155.30 369412.68 369414.38

1 102717.50 102870.50 205741.00 408531.74 408533.49

2 120465.39 120484.22 240968.44 477791.42 477793.00

3 136877.43 136938.03 273876.06 513011.73 513013.30

4 66273.89 66320.47 132640.94 311682.66 311684.48

5 103033.73 103219.93 210204.82 1413478.93 1413479.94

6 119694.95 119787.24 244001.83 1021302.82 1021304.20

7 193813.92 194010.49 391292.38 1295626.36 1295627.65

8 231913.11 233199.54 469617.47 1520925.97 1520927.13

9 162913.33 163522.53 330358.69 1314358.82 1314359.99

df.cummin
Return cumulative minimum over a DataFrame or Series axis.

In [621]: avocado_data_num.cummin()

Out[621]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 89424.11 100.00 89631.19 190257.38 1.70

2 89424.11 18.83 89631.19 190257.38 1.58

3 89424.11 18.83 89631.19 190257.38 1.57

4 66273.89 18.83 66320.47 179041.72 1.57

5 66273.89 18.83 66320.47 179041.72 1.01

6 66273.89 18.83 66320.47 179041.72 1.01

7 66273.89 18.83 66320.47 179041.72 1.01

8 66273.89 18.83 66320.47 179041.72 1.01

9 66273.89 18.83 66320.47 179041.72 1.01

df.describe
Generate descriptive statistics.


Descriptive statistics include those that summarize the central tendency, dispersion and shape of a
dataset's distribution, excluding NaN values.

In [622]: avocado_data_num.describe()

Out[622]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

count 10.00000 10.00000 10.000000 1.000000e+01 10.00000

mean 132712.73600 274.97000 134797.987000 5.968266e+05 1.44300

std 50433.16001 393.18854 51748.226118 4.221664e+05 0.28024

min 66273.89000 18.83000 66320.470000 1.790417e+05 1.01000

25% 102796.55750 68.52250 103899.097500 2.112988e+05 1.20000

50% 120080.17000 126.50000 122349.405000 5.082183e+05 1.47500

75% 156404.35500 193.97750 159361.627500 9.640836e+05 1.67000

max 231913.11000 1286.43000 236417.930000 1.203274e+06 1.82000

In [626]: avocado_data_num.describe(percentiles=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])

Out[626]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

count 10.00000 10.00000 10.000000 1.000000e+01 10.00000

mean 132712.73600 274.97000 134797.987000 5.968266e+05 1.44300

std 50433.16001 393.18854 51748.226118 4.221664e+05 0.28024

min 66273.89000 18.83000 66320.470000 1.790417e+05 1.01000

10% 87109.08800 43.80500 87300.118000 1.891358e+05 1.14500

20% 100058.82200 57.79600 100222.638000 2.002841e+05 1.16800

30% 102938.86100 82.78300 105750.573000 2.266133e+05 1.25400

40% 113030.46200 96.91600 115084.488000 2.382106e+05 1.34400

50% 120080.17000 126.50000 122349.405000 5.082183e+05 1.47500

60% 127030.20600 166.28000 129303.966000 8.281142e+05 1.57400

70% 144688.20000 189.31100 145907.469000 9.282338e+05 1.61600

80% 169093.44800 279.09600 172925.306000 9.974618e+05 1.71000

90% 197623.83900 676.92300 201195.494000 1.066505e+06 1.75700

max 231913.11000 1286.43000 236417.930000 1.203274e+06 1.82000

df.drop
Drop specified labels from rows or columns.


In [627]: # For simplicity, read only 5 rows using the nrows parameter


citibike_tripdata = pd.read_csv('citibike_tripdata.csv', nrows=5)

In [628]: citibike_tripdata

Out[628]:
start start end end
tripduration starttime stoptime station station station station bikeid name_localizedValu
id name id name

2018-05- 2018-05-
Newport
0 338 01 01 3639 Harborside 3199 33558 Annual Membersh
Pkwy
00:04:47 00:10:25

2018-05- 2018-05-
1 1482 01 01 3681 Grand St 3185 City Hall 33593 24 Ho
01:31:10 01:55:53

2018-05- 2018-05- FREE Bonus Mon


McGinley Lincoln
2 232 01 01 3194 3193 29217 with Annu
Square Park
01:31:29 01:35:22 Membersh

2018-05- 2018-05-
Grove
3 190 01 01 3185 City Hall 3186 29662 24 Ho
St PATH
02:03:29 02:06:40

2018-05- 2018-05-
Oakland
4 303 01 01 3207 3195 Sip Ave 15271 Annual Membersh
Ave
04:27:12 04:32:16

In [629]: citibike_tripdata.drop('usertype', axis=1)

Out[629]:
start start end end
tripduration starttime stoptime station station station station bikeid name_localizedValu
id name id name

2018-05- 2018-05-
Newport
0 338 01 01 3639 Harborside 3199 33558 Annual Membersh
Pkwy
00:04:47 00:10:25

2018-05- 2018-05-
1 1482 01 01 3681 Grand St 3185 City Hall 33593 24 Ho
01:31:10 01:55:53

2018-05- 2018-05- FREE Bonus Mon


McGinley Lincoln
2 232 01 01 3194 3193 29217 with Annu
Square Park
01:31:29 01:35:22 Membersh

2018-05- 2018-05-
Grove
3 190 01 01 3185 City Hall 3186 29662 24 Ho
St PATH
02:03:29 02:06:40

2018-05- 2018-05-
Oakland
4 303 01 01 3207 3195 Sip Ave 15271 Annual Membersh
Ave
04:27:12 04:32:16


In [630]: # TypeError: drop() got multiple values for argument 'axis'


# citibike_tripdata.drop('usertype','name_localizedValue','bikeid', axis=1)

In [631]: # If we have to drop multiple columns then pass inside list


citibike_tripdata.drop(['usertype','name_localizedValue','bikeid'], axis=1)

Out[631]:
start start station end end station
tripduration starttime stoptime
station id name station id name

2018-05-01 2018-05-01 Newport


0 338 3639 Harborside 3199
00:04:47 00:10:25 Pkwy

2018-05-01 2018-05-01
1 1482 3681 Grand St 3185 City Hall
01:31:10 01:55:53

2018-05-01 2018-05-01 McGinley


2 232 3194 3193 Lincoln Park
01:31:29 01:35:22 Square

2018-05-01 2018-05-01 Grove St


3 190 3185 City Hall 3186
02:03:29 02:06:40 PATH

2018-05-01 2018-05-01
4 303 3207 Oakland Ave 3195 Sip Ave
04:27:12 04:32:16

In [632]: # drop rows using axis=0


citibike_tripdata.drop([1,3], axis=0)

Out[632]:
start start end end
tripduration starttime stoptime station station station station bikeid name_localizedValu
id name id name

2018-05- 2018-05-
Newport
0 338 01 01 3639 Harborside 3199 33558 Annual Membersh
Pkwy
00:04:47 00:10:25

2018-05- 2018-05- FREE Bonus Mon


McGinley Lincoln
2 232 01 01 3194 3193 29217 with Annu
Square Park
01:31:29 01:35:22 Membersh

2018-05- 2018-05-
Oakland
4 303 01 01 3207 3195 Sip Ave 15271 Annual Membersh
Ave
04:27:12 04:32:16

df.drop_duplicates
Return DataFrame with duplicate rows removed.

In [633]: citibike_tripdata = citibike_tripdata.drop_duplicates()


In [634]: citibike_tripdata

Out[634]:
start start end end
tripduration starttime stoptime station station station station bikeid name_localizedValu
id name id name

2018-05- 2018-05-
Newport
0 338 01 01 3639 Harborside 3199 33558 Annual Membersh
Pkwy
00:04:47 00:10:25

2018-05- 2018-05-
1 1482 01 01 3681 Grand St 3185 City Hall 33593 24 Ho
01:31:10 01:55:53

2018-05- 2018-05- FREE Bonus Mon


McGinley Lincoln
2 232 01 01 3194 3193 29217 with Annu
Square Park
01:31:29 01:35:22 Membersh

2018-05- 2018-05-
Grove
3 190 01 01 3185 City Hall 3186 29662 24 Ho
St PATH
02:03:29 02:06:40

2018-05- 2018-05-
Oakland
4 303 01 01 3207 3195 Sip Ave 15271 Annual Membersh
Ave
04:27:12 04:32:16

df.dropna
Remove missing values.

In [635]: weatherHistory = pd.read_csv('weatherHistory.csv')

In [636]: # Precip Type having 517 null(missing) values


weatherHistory.isnull().sum()

Out[636]: Formatted Date 0

Summary 0

Precip Type 517

Temperature (C) 0

Apparent Temperature (C) 0

Humidity 0

Wind Speed (km/h) 0

Wind Bearing (degrees) 0

Visibility (km) 0

Loud Cover 0

Pressure (millibars) 0

Daily Summary 0

dtype: int64

In [637]: weatherHistory = weatherHistory.dropna()


In [638]: weatherHistory.isnull().sum()

Out[638]: Formatted Date 0

Summary 0

Precip Type 0

Temperature (C) 0

Apparent Temperature (C) 0

Humidity 0

Wind Speed (km/h) 0

Wind Bearing (degrees) 0

Visibility (km) 0

Loud Cover 0

Pressure (millibars) 0

Daily Summary 0

dtype: int64

df.dtypes
Display the data type of each column.

In [639]: weatherHistory.dtypes

Out[639]: Formatted Date object

Summary object

Precip Type object

Temperature (C) float64

Apparent Temperature (C) float64

Humidity float64

Wind Speed (km/h) float64

Wind Bearing (degrees) float64

Visibility (km) float64

Loud Cover float64

Pressure (millibars) float64

Daily Summary object

dtype: object

df.duplicated
Return boolean Series denoting duplicate rows.

In [643]: df = pd.DataFrame({
              'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie', 'Indomie'],
              'style': ['cup', 'cup', 'cup', 'pack', 'pack', 'pack'],
              'rating': [4, 4, 3.5, 15, 5, 5]})


In [644]: df

Out[644]:
brand style rating

0 Yum Yum cup 4.0

1 Yum Yum cup 4.0

2 Indomie cup 3.5

3 Indomie pack 15.0

4 Indomie pack 5.0

5 Indomie pack 5.0

In [645]: # True means it is duplicate in dataframe


df.duplicated()

Out[645]: 0 False

1 True

2 False

3 False

4 False

5 True

dtype: bool

In [646]: # Find duplicate using below code


df[df.duplicated()==True]

Out[646]:
brand style rating

1 Yum Yum cup 4.0

5 Indomie pack 5.0

df.explode
Transform each element of a list-like to a row, replicating index values.


In [647]: data = pd.DataFrame({"city": ['P', 'Q', 'R'],


"day1": [22, 25, 21],
'day2':[31, 12, 67],
'day3': [27, 20, [41, 45, 67, 90, 21]],
'day4': [64, 47, 24],
'day5': [23, 54, 16]})
data

Out[647]:
city day1 day2 day3 day4 day5

0 P 22 31 27 64 23

1 Q 25 12 20 47 54

2 R 21 67 [41, 45, 67, 90, 21] 24 16

In [648]: data.explode(column = 'day3', ignore_index=True)

Out[648]:
city day1 day2 day3 day4 day5

0 P 22 31 27 64 23

1 Q 25 12 20 47 54

2 R 21 67 41 24 16

3 R 21 67 45 24 16

4 R 21 67 67 24 16

5 R 21 67 90 24 16

6 R 21 67 21 24 16

df.fillna
Fill NA/NaN values using the specified method.


In [652]: # creating dataframe using Dictionary


data = {'Year' : [1990, 1994, 1998, 2002],
'Country' : ['Italy', np.nan, 'France', 'Japan'],
'Winner' : ['Germany', 'Brazil', 'France', np.nan],
'GoalScored' : [115, np.nan, 171, 161]
}
data = pd.DataFrame(data)
data

Out[652]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 NaN Brazil NaN

2 1998 France France 171.0

3 2002 Japan NaN 161.0

In [653]: data.isnull().sum()

Out[653]: Year 0

Country 1

Winner 1

GoalScored 1

dtype: int64

In [654]: data.fillna(0)

Out[654]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 0 Brazil 0.0

2 1998 France France 171.0

3 2002 Japan 0 161.0

In [655]: data.fillna("Missing")

Out[655]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 Missing Brazil Missing

2 1998 France France 171.0

3 2002 Japan Missing 161.0
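fillna also accepts a per-column dict of fill values, and values can be propagated from the previous row (a sketch using the same frame):

    data.fillna({'Country': 'Unknown', 'GoalScored': data['GoalScored'].mean()})
    data.fillna(method='ffill')   # forward-fill; newer pandas prefers data.ffill()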

df.groupby
Group DataFrame using a mapper or by a Series of columns.


A groupby operation involves some combination of splitting the object, applying a function, and
combining the results. This can be used to group large amounts of data and compute operations
on these groups.

In [656]: avocado_data = pd.read_csv('avocado.csv')


avocado_data.head()

Out[656]:
       Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0     Atlanta       organic    89424.11      207.08    89631.19     190257.38          1.70  2018-03-25
1     Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18
2      Boston       organic   120465.39       18.83   120484.22     236822.98          1.58  2018-03-11
3      Boston  conventional   136877.43       60.60   136938.03     239135.67          1.57  2018-03-04
4  California       organic    66273.89       46.58    66320.47     179041.72          1.82  2018-02-25

In [657]: g = avocado_data.groupby(by='Type')

The resulting GroupBy object exposes, among others, the following methods and attributes:
'agg', 'aggregate', 'all', 'any', 'apply', 'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov',
'cumcount', 'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes', 'ewm', 'expanding',
'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad',
'max', 'mean', 'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad', 'pct_change',
'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample', 'rolling', 'sample', 'sem', 'shift', 'size', 'skew', 'std',
'sum', 'tail', 'take', 'transform', 'tshift', 'var'

In [658]: g.mean()

Out[658]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

Type

conventional 139871.182 241.114 142182.294 706706.926 1.358

organic 125554.290 330.242 127413.680 486946.314 1.528


In [659]: g.min()

Out[659]:
                Region  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
Type
conventional   Atlanta   102717.50       60.60   102870.50     202790.74          1.01  2018-02-25
organic        Atlanta    66273.89       18.83    66320.47     179041.72          1.16  2018-02-25

In [660]: g.max()

Out[660]:
                     Region  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
Type
conventional   SanFrancisco   193813.92      609.20   197281.89    1203274.11          1.75  2018-03-25
organic        SanFrancisco   231913.11     1286.43   236417.93    1051308.50          1.82  2018-03-25

In [661]: g.corr()

Out[661]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

Type

conventional Small Bags 1.000000 0.356233 0.998932 0.262657 -0.220026

Large Bags 0.356233 1.000000 0.379422 0.501285 -0.470016

Total Bags 0.998932 0.379422 1.000000 0.306574 -0.262626

Total Volume 0.262657 0.501285 0.306574 1.000000 -0.976248

AveragePrice -0.220026 -0.470016 -0.262626 -0.976248 1.000000

organic Small Bags 1.000000 0.917566 0.999667 0.864519 -0.930256

Large Bags 0.917566 1.000000 0.915467 0.774162 -0.772484

Total Bags 0.999667 0.915467 1.000000 0.877099 -0.938423

Total Volume 0.864519 0.774162 0.877099 1.000000 -0.958522

AveragePrice -0.930256 -0.772484 -0.938423 -0.958522 1.000000


In [662]: g.describe()

Out[662]:
Small Bags

count mean std min 25% 50% 75% max

Type

conventional 5.0 139871.182 39329.111592 102717.50 103033.73 136877.43 162913.33 193813

organic 5.0 125554.290 63623.861662 66273.89 89424.11 119694.95 120465.39 231913

2 rows × 40 columns
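The grouped object also supports per-column aggregation with agg, which is often more readable than a full describe (a sketch using the same groups):

    g.agg({'Total Volume': 'sum', 'AveragePrice': ['mean', 'max']})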

df.head
Return the first n rows (5 by default).

In [663]: avocado_data = pd.read_csv('avocado.csv')


avocado_data.head(2)

Out[663]:
    Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
0  Atlanta       organic    89424.11      207.08    89631.19     190257.38          1.70  2018-03-25
1  Atlanta  conventional   102717.50      153.00   102870.50     202790.74          1.75  2018-03-18

df.tail
Return the last n rows (5 by default).


In [664]: avocado_data.tail()

Out[664]:
         Region          Type  Small Bags  Large Bags  Total Bags  Total Volume  AveragePrice        Date
5    California  conventional   103033.73      186.20   106984.89    1203274.11          1.01  2018-03-25
6       NewYork       organic   119694.95       92.29   124214.59     777300.99          1.38  2018-03-18
7       NewYork  conventional   193813.92      196.57   197281.89     904333.98          1.29  2018-03-11
8  SanFrancisco       organic   231913.11     1286.43   236417.93    1051308.50          1.16  2018-03-04
9  SanFrancisco  conventional   162913.33      609.20   166836.16     984000.13          1.17  2018-02-25

df.hist
Make a histogram of the DataFrame's columns.

In [665]: avocado_data.hist()

Out[665]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB6DEEA748>,

<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB6D882400>],

[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB6BB25F60>,

<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB6D6D4A58>],

[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB689A8C18>,

<matplotlib.axes._subplots.AxesSubplot object at 0x000001DB6BC06C50>]],

dtype=object)

df.idxmax

Return index of first occurrence of maximum over requested axis.

The axis to use. 0 or 'index' for row-wise, 1 or 'columns' for column-wise.

In [666]: avocado_data_num

Out[666]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

In [667]: # axis=0 means row wise(in each column)


avocado_data_num.idxmax(axis=0)

Out[667]: Small Bags 8

Large Bags 8

Total Bags 8

Total Volume 5

AveragePrice 4

dtype: int64

In [668]: # axis=1 means column wise(in each row)


avocado_data_num.idxmax(axis=1)

Out[668]: 0 Total Volume

1 Total Volume

2 Total Volume

3 Total Volume

4 Total Volume

5 Total Volume

6 Total Volume

7 Total Volume

8 Total Volume

9 Total Volume

dtype: object

df.idxmin
Return index of first occurrence of minimum over requested axis.

In [669]: # axis=0 means row wise(in each column)


avocado_data_num.idxmin(axis=0)

Out[669]: Small Bags 4

Large Bags 2

Total Bags 4

Total Volume 4

AveragePrice 5

dtype: int64

In [670]: # axis=1 means column wise(in each row)


avocado_data_num.idxmin(axis=1)

Out[670]: 0 AveragePrice

1 AveragePrice

2 AveragePrice

3 AveragePrice

4 AveragePrice

5 AveragePrice

6 AveragePrice

7 AveragePrice

8 AveragePrice

9 AveragePrice

dtype: object

df.iloc
Purely integer-location based indexing for selection by position.

In [671]: avocado_data_num.iloc[2:4,1:4]

Out[671]:
Large Bags Total Bags Total Volume

2 18.83 120484.22 236822.98

3 60.60 136938.03 239135.67

df.loc
Access a group of rows and columns by label(s).


In [672]: avocado_data_num

Out[672]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

0 89424.11 100.00 89631.19 190257.38 1.70

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

5 103033.73 186.20 106984.89 1203274.11 1.01

6 119694.95 92.29 124214.59 777300.99 1.38

7 193813.92 196.57 197281.89 904333.98 1.29

8 231913.11 1286.43 236417.93 1051308.50 1.16

9 162913.33 609.20 166836.16 984000.13 1.17

In [673]: avocado_data_num.loc[1:4, ['Small Bags','Large Bags','Total Bags','Total Volume','AveragePrice']]

Out[673]:
Small Bags Large Bags Total Bags Total Volume AveragePrice

1 102717.50 153.00 102870.50 202790.74 1.75

2 120465.39 18.83 120484.22 236822.98 1.58

3 136877.43 60.60 136938.03 239135.67 1.57

4 66273.89 46.58 66320.47 179041.72 1.82

In [674]: avocado_data_num.loc[1:6, 'Small Bags':'Total Bags']

Out[674]:
Small Bags Large Bags Total Bags

1 102717.50 153.00 102870.50

2 120465.39 18.83 120484.22

3 136877.43 60.60 136938.03

4 66273.89 46.58 66320.47

5 103033.73 186.20 106984.89

6 119694.95 92.29 124214.59

df.index


In [675]: avocado_data_num.index

Out[675]: RangeIndex(start=0, stop=10, step=1)

In [676]: list(avocado_data_num.index)

Out[676]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

df.info
This method prints information about a DataFrame, including the index dtype and columns, non-null
values and memory usage.

In [677]: avocado_data_num.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 10 entries, 0 to 9

Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Small Bags 10 non-null float64

1 Large Bags 10 non-null float64

2 Total Bags 10 non-null float64

3 Total Volume 10 non-null float64

4 AveragePrice 10 non-null float64

dtypes: float64(5)

memory usage: 528.0 bytes

df.insert
Insert column into DataFrame at specified location.

In [678]: avocado_data_num.insert(loc = 0, column = 'new_column', value= [1,2,3,4,5,6,7,8,9,10])


In [679]: avocado_data_num

Out[679]:
new_column Small Bags Large Bags Total Bags Total Volume AveragePrice

0 1 89424.11 100.00 89631.19 190257.38 1.70

1 2 102717.50 153.00 102870.50 202790.74 1.75

2 3 120465.39 18.83 120484.22 236822.98 1.58

3 4 136877.43 60.60 136938.03 239135.67 1.57

4 5 66273.89 46.58 66320.47 179041.72 1.82

5 6 103033.73 186.20 106984.89 1203274.11 1.01

6 7 119694.95 92.29 124214.59 777300.99 1.38

7 8 193813.92 196.57 197281.89 904333.98 1.29

8 9 231913.11 1286.43 236417.93 1051308.50 1.16

9 10 162913.33 609.20 166836.16 984000.13 1.17

df.interpolate
Fill NaN values using an interpolation method.

method : str, default 'linear'


Interpolation technique to use. One of:

* 'linear': Ignore the index and treat the values as equally

spaced. This is the only method supported on MultiIndexes.

* 'time': Works on daily and higher resolution data to interpolate

given length of interval.

* 'index', 'values': use the actual numerical values of the index.

* 'pad': Fill in NaNs using existing values.

* 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'spline',

'barycentric', 'polynomial': Passed to

`scipy.interpolate.interp1d`. These methods use the numerical

values of the index. Both 'polynomial' and 'spline' require that

you also specify an `order` (int), e.g.

``df.interpolate(method='polynomial', order=5)``.

* 'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima',

'cubicspline': Wrappers around the SciPy interpolation methods of

similar names. See `Notes`.

* 'from_derivatives': Refers to

`scipy.interpolate.BPoly.from_derivatives` which

replaces 'piecewise_polynomial' interpolation method in

scipy 0.18.


In [680]: df_dict = {'Year' : [1990, 1994, 1998, 2002],


'Country' : ['Italy', 'USA', 'France', 'Japan'],
'Winner' : ['Germany', 'Brazil', 'France', 'Brazil'],
'GoalScored' : [115, 141, np.nan, 161]
}
df_dict = pd.DataFrame(df_dict)
df_dict

Out[680]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

In [681]: # filling missing value using interpolation


df_dict.interpolate(method='linear')

Out[681]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France 151.0

3 2002 Japan Brazil 161.0

In [683]: # filling specified column missing value using interpolation


df_dict['GoalScored'].interpolate(method='linear')

Out[683]: 0 115.0

1 141.0

2 151.0

3 161.0

Name: GoalScored, dtype: float64

df.isin
Whether each element in the DataFrame is contained in values.


In [684]: df_dict

Out[684]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

In [685]: # shows True wherever a value from the list is present


df_dict.isin(['1994','Japan','Germany',141.0])

Out[685]:
Year Country Winner GoalScored

0 False False True False

1 False False False True

2 False False False False

3 False True False False

df.isna
Detect missing values; returns True where a value is missing.

In [686]: df_dict.isna()

Out[686]:
Year Country Winner GoalScored

0 False False False False

1 False False False False

2 False False False True

3 False False False False

df.isnull
Detect missing values (alias of isna).

Returns True where a value is missing.


In [688]: df_dict.isnull()

Out[688]:
Year Country Winner GoalScored

0 False False False False

1 False False False False

2 False False False True

3 False False False False

df.items
Iterate over (column name, Series) pairs.

In [689]: df_dict

Out[689]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

In [690]: items = df_dict.items()


In [691]: for label, content in items:
              print(label)
              print(content)
              print()

Year

0 1990

1 1994

2 1998

3 2002

Name: Year, dtype: int64

Country

0 Italy

1 USA

2 France

3 Japan

Name: Country, dtype: object

Winner

0 Germany

1 Brazil

2 France

3 Brazil

Name: Winner, dtype: object

GoalScored

0 115.0

1 141.0

2 NaN

3 161.0

Name: GoalScored, dtype: float64

df.iteritems
Iterate over (column name, Series) pairs; an older alias of items (deprecated in newer pandas versions).

In [692]: iteritems = df_dict.iteritems()


In [693]: for label, content in iteritems:
              print(label)
              print(content)
              print('---------')

Year

0 1990

1 1994

2 1998

3 2002

Name: Year, dtype: int64

---------

Country

0 Italy

1 USA

2 France

3 Japan

Name: Country, dtype: object

---------

Winner

0 Germany

1 Brazil

2 France

3 Brazil

Name: Winner, dtype: object

---------

GoalScored

0 115.0

1 141.0

2 NaN

3 161.0

Name: GoalScored, dtype: float64

---------

df.iterrows
Iterate over DataFrame rows as (index, Series) pairs.

In [694]: df_dict

Out[694]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

In [695]: iterrows = df_dict.iterrows()


In [696]: for label, content in iterrows:
              print(label)
              print(content)
              print('---------')

Year 1990

Country Italy

Winner Germany

GoalScored 115.0

Name: 0, dtype: object

---------

Year 1994

Country USA

Winner Brazil

GoalScored 141.0

Name: 1, dtype: object

---------

Year 1998

Country France

Winner France

GoalScored NaN

Name: 2, dtype: object

---------

Year 2002

Country Japan

Winner Brazil

GoalScored 161.0

Name: 3, dtype: object

---------

df.itertuples
Iterate over DataFrame rows as namedtuples.

In [697]: itertuples = df_dict.itertuples()

In [698]: list(itertuples)

Out[698]: [Pandas(Index=0, Year=1990, Country='Italy', Winner='Germany', GoalScored=115.0),
          Pandas(Index=1, Year=1994, Country='USA', Winner='Brazil', GoalScored=141.0),
          Pandas(Index=2, Year=1998, Country='France', Winner='France', GoalScored=nan),
          Pandas(Index=3, Year=2002, Country='Japan', Winner='Brazil', GoalScored=161.0)]
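Because itertuples yields namedtuples, the columns can also be read as attributes. A quick sketch:

for row in df_dict.itertuples():
    print(row.Index, row.Year, row.Winner)   # attribute-style access to each row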

df.keys()
Return the 'info axis': the index for a Series and the columns for a DataFrame.

In [700]: df_dict

Out[700]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

In [701]: df_dict.keys()

Out[701]: Index(['Year', 'Country', 'Winner', 'GoalScored'], dtype='object')

df.values
Return a NumPy representation of the DataFrame. Only the values in the DataFrame will be returned; the axes labels will be removed.

In [702]: df_dict.values

Out[702]: array([[1990, 'Italy', 'Germany', 115.0],

[1994, 'USA', 'Brazil', 141.0],

[1998, 'France', 'France', nan],

[2002, 'Japan', 'Brazil', 161.0]], dtype=object)

df.kurt
Return unbiased kurtosis over requested axis.

In [703]: avocado_data_num.kurt()

Out[703]: new_column -1.200000

Small Bags 0.257027

Large Bags 5.437476

Total Bags 0.249708

Total Volume -2.102999

AveragePrice -1.446656

dtype: float64


In [704]: avocado_data_num.kurtosis()

Out[704]: new_column -1.200000

Small Bags 0.257027

Large Bags 5.437476

Total Bags 0.249708

Total Volume -2.102999

AveragePrice -1.446656

dtype: float64

df.skew
Return unbiased skew over requested axis.
axis : {index (0), columns (1)}

In [705]: avocado_data_num.skew()

Out[705]: new_column 0.000000

Small Bags 0.866584

Large Bags 2.328516

Total Bags 0.856746

Total Volume 0.214461

AveragePrice -0.150918

dtype: float64

df.max
Return the maximum of the values over the requested axis.

In [706]: avocado_data_num.max()

Out[706]: new_column 10.00

Small Bags 231913.11

Large Bags 1286.43

Total Bags 236417.93

Total Volume 1203274.11

AveragePrice 1.82

dtype: float64

df.min
Return the minimum of the values over the requested axis.


In [707]: avocado_data_num.min()

Out[707]: new_column 1.00

Small Bags 66273.89

Large Bags 18.83

Total Bags 66320.47

Total Volume 179041.72

AveragePrice 1.01

dtype: float64

df.median
Return the median of the values over the requested axis.

In [708]: avocado_data_num.median()

Out[708]: new_column 5.500

Small Bags 120080.170

Large Bags 126.500

Total Bags 122349.405

Total Volume 508218.330

AveragePrice 1.475

dtype: float64

df.std
Return sample standard deviation over requested axis.
{index (0), columns (1)}

In [709]: avocado_data_num.std()

Out[709]: new_column 3.027650

Small Bags 50433.160010

Large Bags 393.188540

Total Bags 51748.226118

Total Volume 422166.360013

AveragePrice 0.280240

dtype: float64

df.var
Return unbiased variance over requested axis.
{index (0), columns (1)}


In [710]: avocado_data_num.var()

Out[710]: new_column 9.166667e+00

Small Bags 2.543504e+09

Large Bags 1.545972e+05

Total Bags 2.677879e+09

Total Volume 1.782244e+11

AveragePrice 7.853444e-02

dtype: float64

df.melt
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

In [711]: avocado_data_num

Out[711]:
new_column Small Bags Large Bags Total Bags Total Volume AveragePrice

0 1 89424.11 100.00 89631.19 190257.38 1.70

1 2 102717.50 153.00 102870.50 202790.74 1.75

2 3 120465.39 18.83 120484.22 236822.98 1.58

3 4 136877.43 60.60 136938.03 239135.67 1.57

4 5 66273.89 46.58 66320.47 179041.72 1.82

5 6 103033.73 186.20 106984.89 1203274.11 1.01

6 7 119694.95 92.29 124214.59 777300.99 1.38

7 8 193813.92 196.57 197281.89 904333.98 1.29

8 9 231913.11 1286.43 236417.93 1051308.50 1.16

9 10 162913.33 609.20 166836.16 984000.13 1.17


In [712]: avocado_data_num.melt()

Out[712]:
variable value

0 new_column 1.00

1 new_column 2.00

2 new_column 3.00

3 new_column 4.00

4 new_column 5.00

5 new_column 6.00

6 new_column 7.00

7 new_column 8.00

8 new_column 9.00

9 new_column 10.00

10 Small Bags 89424.11

11 Small Bags 102717.50

12 Small Bags 120465.39

13 Small Bags 136877.43

14 Small Bags 66273.89

15 Small Bags 103033.73

16 Small Bags 119694.95

17 Small Bags 193813.92

18 Small Bags 231913.11

19 Small Bags 162913.33

20 Large Bags 100.00

21 Large Bags 153.00

22 Large Bags 18.83

23 Large Bags 60.60

24 Large Bags 46.58

25 Large Bags 186.20

26 Large Bags 92.29

27 Large Bags 196.57

28 Large Bags 1286.43

29 Large Bags 609.20

30 Total Bags 89631.19

31 Total Bags 102870.50

32 Total Bags 120484.22

33 Total Bags 136938.03


34 Total Bags 66320.47

35 Total Bags 106984.89

36 Total Bags 124214.59

37 Total Bags 197281.89

38 Total Bags 236417.93

39 Total Bags 166836.16

40 Total Volume 190257.38

41 Total Volume 202790.74

42 Total Volume 236822.98

43 Total Volume 239135.67

44 Total Volume 179041.72

45 Total Volume 1203274.11

46 Total Volume 777300.99

47 Total Volume 904333.98

48 Total Volume 1051308.50

49 Total Volume 984000.13

50 AveragePrice 1.70

51 AveragePrice 1.75

52 AveragePrice 1.58

53 AveragePrice 1.57

54 AveragePrice 1.82

55 AveragePrice 1.01

56 AveragePrice 1.38

57 AveragePrice 1.29

58 AveragePrice 1.16

59 AveragePrice 1.17
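melt can also keep identifier columns fixed and unpivot only selected value columns. A sketch using the columns above (var_name and value_name simply rename the output columns):

avocado_data_num.melt(id_vars='new_column',
                      value_vars=['Small Bags', 'Large Bags'],
                      var_name='bag_type', value_name='bags')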

df.memory_usage
Return the memory usage of each column in bytes.


In [713]: df_dict.memory_usage()

Out[713]: Index 128

Year 32

Country 32

Winner 32

GoalScored 32

dtype: int64

In [714]: avocado_data_num.memory_usage()

Out[714]: Index 128

new_column 80

Small Bags 80

Large Bags 80

Total Bags 80

Total Volume 80

AveragePrice 80

dtype: int64
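For object columns the default numbers only count the references; deep=True measures the string contents as well. A small sketch:

df_dict.memory_usage(deep=True)    # include the memory used by the string values
df_dict.memory_usage(index=False)  # leave the Index row out of the result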

df.multiply
Get element-wise multiplication of the DataFrame and other (binary operator mul); for string columns the strings are repeated, as in the example below.

In [715]: df_dict.multiply(2)

Out[715]:
Year Country Winner GoalScored

0 3980 ItalyItaly GermanyGermany 230.0

1 3988 USAUSA BrazilBrazil 282.0

2 3996 FranceFrance FranceFrance NaN

3 4004 JapanJapan BrazilBrazil 322.0

df.nunique
Count number of distinct elements in specified axis.

In [717]: df_dict.nunique()

Out[717]: Year 4

Country 4

Winner 3

GoalScored 3

dtype: int64


In [718]: df_dict

Out[718]:
Year Country Winner GoalScored

0 1990 Italy Germany 115.0

1 1994 USA Brazil 141.0

2 1998 France France NaN

3 2002 Japan Brazil 161.0

df.pivot_table
Create a spreadsheet-style pivot table as a DataFrame.

In [721]: df_dict.pivot_table(values=['GoalScored','Year'],index='Winner',aggfunc='mean')

Out[721]:
GoalScored Year

Winner

Brazil 151.0 1998

France NaN 1998

Germany 115.0 1990

In [722]: df_dict.pivot_table(values=['GoalScored','Year'],index='Country',aggfunc='mean')

Out[722]:
GoalScored Year

Country

France NaN 1998

Italy 115.0 1990

Japan 161.0 2002

USA 141.0 1994
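aggfunc can also be a list of aggregations, producing one block of columns per function. A sketch:

df_dict.pivot_table(values='GoalScored', index='Winner',
                    aggfunc=['mean', 'max', 'count'])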

df.pop
Return item and drop from frame. Raise KeyError if not found.


In [723]: df_dict.pop('Year')

Out[723]: 0 1990

1 1994

2 1998

3 2002

Name: Year, dtype: int64

In [724]: df_dict

Out[724]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0

df.rename
Rename column or index labels.

In [725]: df_dict.rename(columns = {'Country':'country','Winner':'winner_country','GoalScored':'goal_scored'})

Out[725]:
country winner_country goal_scored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0

df.replace
Replace values given in to_replace with value.

In [726]: df_dict.replace(to_replace = 'USA', value= 'usa')

Out[726]:
Country Winner GoalScored

0 Italy Germany 115.0

1 usa Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0


In [727]: df_dict.replace(to_replace = {'USA': 'usa', 'Brazil': 'brazil'})

Out[727]:
Country Winner GoalScored

0 Italy Germany 115.0

1 usa brazil 141.0

2 France France NaN

3 Japan brazil 161.0
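replace also understands regular expressions when regex=True. A small sketch (the pattern is only illustrative):

# any value starting with 'Fra' (here France) becomes 'france'
df_dict.replace(to_replace=r'^Fra.*', value='france', regex=True)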

df.reset_index
Reset the index of the DataFrame, and use the default one instead.

In [728]: df_dict.reset_index()

Out[728]:
index Country Winner GoalScored

0 0 Italy Germany 115.0

1 1 USA Brazil 141.0

2 2 France France NaN

3 3 Japan Brazil 161.0
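Passing drop=True discards the old index instead of keeping it as a new column, which is handy after sorting or filtering. A sketch:

df_dict.reset_index(drop=True)                              # old index is thrown away
df_dict.sort_values(by='Country').reset_index(drop=True)    # clean 0..n-1 index after a sort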

df.sample
Return a random sample of items from an axis of object.

In [729]: df_dict.sample(n=4)

Out[729]:
Country Winner GoalScored

2 France France NaN

1 USA Brazil 141.0

0 Italy Germany 115.0

3 Japan Brazil 161.0
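sample can also draw a fraction of the rows, sample with replacement, and be made repeatable with random_state. A sketch:

df_dict.sample(frac=0.5, random_state=42)   # half of the rows, same rows every run
df_dict.sample(n=6, replace=True)           # more rows than exist, so sampling with replacement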

df.shape
Return a tuple representing the dimensionality of the DataFrame.


In [730]: df_dict

Out[730]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0

In [731]: df_dict.shape

Out[731]: (4, 3)

df.size
Return an int representing the number of elements in this object.

In [732]: df_dict.size

Out[732]: 12

df.sort_index
Sort object by labels (along an axis).

In [733]: df_dict.sort_index(axis=0)

Out[733]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0


In [734]: df_dict.sort_index(axis=1)

Out[734]:
Country GoalScored Winner

0 Italy 115.0 Germany

1 USA 141.0 Brazil

2 France NaN France

3 Japan 161.0 Brazil

df.sort_values
Sort by the values along either axis.

In [735]: df_dict.sort_values(by='Country')

Out[735]:
Country Winner GoalScored

2 France France NaN

0 Italy Germany 115.0

3 Japan Brazil 161.0

1 USA Brazil 141.0

In [736]: df_dict.sort_values(by='GoalScored')

Out[736]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

3 Japan Brazil 161.0

2 France France NaN

In [737]: df_dict.sort_values(by=['Winner','GoalScored'])

Out[737]:
Country Winner GoalScored

1 USA Brazil 141.0

3 Japan Brazil 161.0

2 France France NaN

0 Italy Germany 115.0
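ascending and na_position control the sort direction and where missing values land. A sketch:

df_dict.sort_values(by='GoalScored', ascending=False)      # largest first, NaN still last
df_dict.sort_values(by='GoalScored', na_position='first')  # put the NaN row on top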

df.to_clipboard

Copy object to the system clipboard.

In [738]: df_dict.to_clipboard()

In [739]: avocado_data_num.to_clipboard()

df.to_csv
Write object to a comma-separated values (csv) file.

In [458]: df_dict.to_csv('new_dict.csv')
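The row index is often dropped when writing out; a small sketch (the file name is just an example):

df_dict.to_csv('new_dict_no_index.csv', index=False)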

df.to_dict
Convert the DataFrame to a dictionary.

In [457]: df_dict.to_dict()

Out[457]: {'Country': {0: 'Italy', 1: 'USA', 2: 'France', 3: 'Japan'},

'Winner': {0: 'Germany', 1: 'Brazil', 2: 'France', 3: 'Brazil'},

'GoalScored': {0: 115.0, 1: 141.0, 2: nan, 3: 161.0}}
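The orient parameter controls the shape of the resulting dictionary. A sketch:

df_dict.to_dict(orient='records')   # list of {column: value} dicts, one per row
df_dict.to_dict(orient='list')      # {column: [values]} with one list per column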

df.to_excel
Write object to an Excel sheet.

In [459]: df_dict.to_excel('new_excel.xlsx')

df.to_html
Render a DataFrame as an HTML table.

In [462]: df_dict.to_html('dict_html.html')

df.to_json
Convert the object to a JSON string.

In [463]: df_dict.to_json('dict_json.json')


df.to_numpy
Convert the DataFrame to a NumPy array.

In [464]: df_dict.to_numpy()

Out[464]: array([['Italy', 'Germany', 115.0],

['USA', 'Brazil', 141.0],

['France', 'France', nan],

['Japan', 'Brazil', 161.0]], dtype=object)

df.to_parquet
Write a DataFrame to the binary parquet format.

In [466]: df_dict.to_parquet('parquet_file')

df.to_pickle
Pickle (serialize) object to file.

In [467]: df_dict.to_pickle('file.pkl')

df.transform
Call func on self producing a DataFrame with transformed values.

In [740]: df_dict.transform(func = lambda x:x*2)

Out[740]:
Country Winner GoalScored

0 ItalyItaly GermanyGermany 230.0

1 USAUSA BrazilBrazil 282.0

2 FranceFrance FranceFrance NaN

3 JapanJapan BrazilBrazil 322.0

df.transpose()
Transpose index and columns.


In [741]: df_dict.transpose()

Out[741]:
0 1 2 3

Country Italy USA France Japan

Winner Germany Brazil France Brazil

GoalScored 115.0 141.0 NaN 161.0

In [742]: df_dict.T

Out[742]:
0 1 2 3

Country Italy USA France Japan

Winner Germany Brazil France Brazil

GoalScored 115.0 141.0 NaN 161.0

In [743]: df_dict

Out[743]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0
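df.truncate
Truncate a Series or DataFrame before and after some index value (row labels by default; column labels when axis=1).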

In [744]: df_dict.truncate(before=2, after=3,axis=0)

Out[744]:
Country Winner GoalScored

2 France France NaN

3 Japan Brazil 161.0


In [745]: avocado_data_num

Out[745]:
new_column Small Bags Large Bags Total Bags Total Volume AveragePrice

0 1 89424.11 100.00 89631.19 190257.38 1.70

1 2 102717.50 153.00 102870.50 202790.74 1.75

2 3 120465.39 18.83 120484.22 236822.98 1.58

3 4 136877.43 60.60 136938.03 239135.67 1.57

4 5 66273.89 46.58 66320.47 179041.72 1.82

5 6 103033.73 186.20 106984.89 1203274.11 1.01

6 7 119694.95 92.29 124214.59 777300.99 1.38

7 8 193813.92 196.57 197281.89 904333.98 1.29

8 9 231913.11 1286.43 236417.93 1051308.50 1.16

9 10 162913.33 609.20 166836.16 984000.13 1.17

In [746]: avocado_data_num.sort_index(axis=1).truncate(before='Small Bags', after='Total Volume', axis=1)

Out[746]:
Small Bags Total Bags Total Volume

0 89424.11 89631.19 190257.38

1 102717.50 102870.50 202790.74

2 120465.39 120484.22 236822.98

3 136877.43 136938.03 239135.67

4 66273.89 66320.47 179041.72

5 103033.73 106984.89 1203274.11

6 119694.95 124214.59 777300.99

7 193813.92 197281.89 904333.98

8 231913.11 236417.93 1051308.50

9 162913.33 166836.16 984000.13

df.update
Modify in place using non-NA values from another DataFrame.


In [747]: df = pd.DataFrame({'A': [1, 2, 3],
                             'B': [400, 500, 600]})
          df

Out[747]:
A B

0 1 400

1 2 500

2 3 600

In [748]: new_df = pd.DataFrame({'B': [4, 5, 6],
                                 'C': [7, 8, 9]})
          new_df

Out[748]:
B C

0 4 7

1 5 8

2 6 9

In [749]: df.update(new_df)

In [750]: df

Out[750]:
A B

0 1 4

1 2 5

2 3 6
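update only takes non-NA values from the other frame, so a NaN in the source leaves the target value untouched. A small sketch:

df2 = pd.DataFrame({'B': [40, np.nan, 60]})
df.update(df2)   # row 1 keeps its existing B value because the source is NaN
df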

df.value_counts
Return a Series containing counts of unique rows in the DataFrame.

In [753]: df_dict.value_counts()

Out[753]: Country Winner GoalScored

Italy Germany 115.0 1

Japan Brazil 161.0 1

USA Brazil 141.0 1

dtype: int64


In [754]: df_dict['Winner'].value_counts()

Out[754]: Brazil 2

Germany 1

France 1

Name: Winner, dtype: int64
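Rows containing NaN are left out of DataFrame.value_counts by default; normalize turns counts into proportions. A sketch (dropna on the DataFrame version needs a reasonably recent pandas):

df_dict['Winner'].value_counts(normalize=True)   # relative frequencies instead of counts
df_dict.value_counts(dropna=False)               # keep the row whose GoalScored is NaN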

df.where
Replace values where the condition is False, keeping the original values where it is True.
In [755]: df_dict

Out[755]:
Country Winner GoalScored

0 Italy Germany 115.0

1 USA Brazil 141.0

2 France France NaN

3 Japan Brazil 161.0

In [756]: avocado_data_num.where(avocado_data_num > 100)

Out[756]:
new_column Small Bags Large Bags Total Bags Total Volume AveragePrice

0 NaN 89424.11 NaN 89631.19 190257.38 NaN

1 NaN 102717.50 153.00 102870.50 202790.74 NaN

2 NaN 120465.39 NaN 120484.22 236822.98 NaN

3 NaN 136877.43 NaN 136938.03 239135.67 NaN

4 NaN 66273.89 NaN 66320.47 179041.72 NaN

5 NaN 103033.73 186.20 106984.89 1203274.11 NaN

6 NaN 119694.95 NaN 124214.59 777300.99 NaN

7 NaN 193813.92 196.57 197281.89 904333.98 NaN

8 NaN 231913.11 1286.43 236417.93 1051308.50 NaN

9 NaN 162913.33 609.20 166836.16 984000.13 NaN


In [757]: df_dict.where(df_dict == 'USA')

Out[757]:
Country Winner GoalScored

0 NaN NaN NaN

1 USA NaN NaN

2 NaN NaN NaN

3 NaN NaN NaN
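Cells where the condition fails can be filled with a replacement instead of NaN via other. A sketch:

avocado_data_num.where(avocado_data_num > 100, other=0)   # failing cells become 0 instead of NaN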

Happy Learning
