Professional Documents
Culture Documents
03 DataFrames
03 DataFrames
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We
can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use
pandas to explore this topic!
In [75]:
import pandas as pd
import numpy as np
In [76]:
from numpy.random import randn
np.random.seed(101)
In [78]:
df
Out [78]:
W X Y Z
In [79]: df['X']
Out [80]:
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
W Y
C -2.018168 0.528813
D 0.188695 -0.933237
E 0.190794 2.605967
In [82]:
type(df['X'])
In [83]:
df['new'] = df['W'] + df['Y']
In [84]:
df
Out [84]:
W X Y Z new
Removing Columns
In [85]:
df.drop('new',axis=1)
Out [85]:
W X Y Z
Out [86]:
W X Y Z new
In [87]:
df.drop('new',axis=1,inplace=True)
In [88]:
df
Out [88]:
W X Y Z
Selecting Rows
In [89]: df.loc['A']
In [90]: df.iloc[1]
In [91]: df.loc['B','Y']
In [92]:
df.loc[['A','B'],['W','Y']]
Out [92]:
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
In [93]:
df
Out [93]:
W X Y Z
In [94]:
df>0
Out [94]:
W X Y Z
In [95]:
df[df>0]
Out [95]:
W X Y Z
In [96]: df[df['W']>0]
Out [96]:
W X Y Z
In [97]:
df[df['W']>0]['Y']
In [98]:
df[df['W']>0][['Y','X']]
Out [98]:
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
For two conditions you can use | and & with parenthesis:
In [105]:
df[(df['W']>0) & (df['X'] > 1)]
Out [105]:
W X Y Z States
In [106]: df
Out [106]:
W X Y Z States
Out [107]:
index W X Y Z States
In [108]:
newind = 'CA NY WY OR CO'.split()
In [109]:
df['States'] = newind
In [110]: df
Out [110]:
W X Y Z States
In [55]:
df.set_index('States')
Out [55]:
W X Y Z
States
In [111]:
df
Out [111]:
W X Y Z States
In [57]:
df
Out [57]:
W X Y Z States
In [58]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
In [59]: hier_index
In [60]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
Out [60]:
A B
G1 1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
G2 1 0.166905 0.184502
A B
2 0.807706 0.072960
3 0.638787 0.329646
Now let's show how to index this! For index hierarchy we use df.loc[], if this was on the columns axis, you
would just use normal bracket notation df[]. Calling one level of the index returns the sub-dataframe:
In [61]:
df.loc['G1']
Out [61]:
A B
1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
In [62]:
df.loc['G1'].loc[1]
In [63]:
df.index.names
In [64]:
df.index.names = ['Group','Num']
In [65]:
df
Out [65]:
A B
Group Num
G1 1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
G2 1 0.166905 0.184502
2 0.807706 0.072960
3 0.638787 0.329646
In [66]: df.xs('G1')
Out [66]:
A B
Num
1 0.302665 1.693723
2 -1.706086 -1.159119
A B
Num
3 -0.134841 0.390528
In [67]:
df.xs(['G1',1])
C:\Users\RETECH\anaconda3\envs\DataAnalytics\lib\site-packages\ipykernel_launcher.py:1:
FutureWarning: Passing lists as key for xs is deprecated and will be removed in a future
version. Pass key as a tuple instead.
"""Entry point for launching an IPython kernel.
In [68]:
df.xs(1,level='Num')
Out [68]:
A B
Group
G1 0.302665 1.693723
G2 0.166905 0.184502
Great Job!
In [ ]: