Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

pandas_group_by (5)

January 12, 2020

##
Pandas Group By
In this tutorial we are going to look at weather data from various cities and see how
group by can be used to run some analytics.
[1]: import pandas as pd
df = pd.read_csv("weather_by_cities.csv")
df

[1]: day city temperature windspeed event


0 1/1/2017 new york 32 6 Rain
1 1/2/2017 new york 36 7 Sunny
2 1/3/2017 new york 28 12 Snow
3 1/4/2017 new york 33 7 Sunny
4 1/1/2017 mumbai 90 5 Sunny
5 1/2/2017 mumbai 85 12 Fog
6 1/3/2017 mumbai 87 15 Fog
7 1/4/2017 mumbai 92 5 Rain
8 1/1/2017 paris 45 20 Sunny
9 1/2/2017 paris 50 13 Cloudy
10 1/3/2017 paris 54 8 Cloudy
11 1/4/2017 paris 42 10 Cloudy

0.0.1 For this dataset, get following answers,


1. What was the maximum temperature in each of these 3 cities?

2. What was the average windspeed in each of these 3 cities?


[2]: g = df.groupby("city")
print(g)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018BD6E2C808>


DataFrameGroupBy object looks something like below,
[3]: for city, data in g:
print("city:",city)

1
print( )
print("data:",data)

city: mumbai

data: day city temperature windspeed event


4 1/1/2017 mumbai 90 5 Sunny
5 1/2/2017 mumbai 85 12 Fog
6 1/3/2017 mumbai 87 15 Fog
7 1/4/2017 mumbai 92 5 Rain
city: new york

data: day city temperature windspeed event


0 1/1/2017 new york 32 6 Rain
1 1/2/2017 new york 36 7 Sunny
2 1/3/2017 new york 28 12 Snow
3 1/4/2017 new york 33 7 Sunny
city: paris

data: day city temperature windspeed event


8 1/1/2017 paris 45 20 Sunny
9 1/2/2017 paris 50 13 Cloudy
10 1/3/2017 paris 54 8 Cloudy
11 1/4/2017 paris 42 10 Cloudy
This is similar to SQL,
SELECT * from weather_data GROUP BY city
[4]: g.get_group('mumbai')

[4]: day city temperature windspeed event


4 1/1/2017 mumbai 90 5 Sunny
5 1/2/2017 mumbai 85 12 Fog
6 1/3/2017 mumbai 87 15 Fog
7 1/4/2017 mumbai 92 5 Rain

[5]: g.max()

[5]: day temperature windspeed event


city
mumbai 1/4/2017 92 15 Sunny
new york 1/4/2017 36 12 Sunny
paris 1/4/2017 54 20 Sunny

[6]: g.mean()

2
[6]: temperature windspeed
city
mumbai 88.50 9.25
new york 32.25 8.00
paris 47.75 12.75

This method of splitting your dataset in smaller groups and then applying an operation
(such as min or max) to get aggregate result is called Split-Apply-Combine. It is
illustrated in a diagram below
[7]: g.min()

[7]: day temperature windspeed event


city
mumbai 1/1/2017 85 5 Fog
new york 1/1/2017 28 6 Rain
paris 1/1/2017 42 8 Cloudy

[8]: b= g.describe()
b

[8]: temperature \
count mean std min 25% 50% 75% max
city
mumbai 4.0 88.50 3.109126 85.0 86.50 88.5 90.50 92.0
new york 4.0 32.25 3.304038 28.0 31.00 32.5 33.75 36.0
paris 4.0 47.75 5.315073 42.0 44.25 47.5 51.00 54.0

windspeed
count mean std min 25% 50% 75% max
city
mumbai 4.0 9.25 5.057997 5.0 5.00 8.5 12.75 15.0
new york 4.0 8.00 2.708013 6.0 6.75 7.0 8.25 12.0
paris 4.0 12.75 5.251984 8.0 9.50 11.5 14.75 20.0

[9]: b.T

[9]: city mumbai new york paris


temperature count 4.000000 4.000000 4.000000
mean 88.500000 32.250000 47.750000
std 3.109126 3.304038 5.315073
min 85.000000 28.000000 42.000000
25% 86.500000 31.000000 44.250000
50% 88.500000 32.500000 47.500000
75% 90.500000 33.750000 51.000000
max 92.000000 36.000000 54.000000
windspeed count 4.000000 4.000000 4.000000

3
mean 9.250000 8.000000 12.750000
std 5.057997 2.708013 5.251984
min 5.000000 6.000000 8.000000
25% 5.000000 6.750000 9.500000
50% 8.500000 7.000000 11.500000
75% 12.750000 8.250000 14.750000
max 15.000000 12.000000 20.000000

[10]: g.size()

[10]: city
mumbai 4
new york 4
paris 4
dtype: int64

[11]: g.count()

[11]: day temperature windspeed event


city
mumbai 4 4 4 4
new york 4 4 4 4
paris 4 4 4 4

[12]: %matplotlib inline


g.plot()

[12]: city
mumbai AxesSubplot(0.125,0.125;0.775x0.755)
new york AxesSubplot(0.125,0.125;0.775x0.755)
paris AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object

4
5
Group data using custom function: Let’s say you want to group your data using custom function.
Here the requirement is to create three groups
Days when temperature was between 80 and 90
Days when it was between 50 and 60
Days when it was anything else
For this you need to write custom grouping function and pass that to groupby
[13]: def grouper(df, idx, col):
if 80 <= df[col].loc[idx] <= 90:
return '80-90'
elif 50 <= df[col].loc[idx] <= 60:
return '50-60'
else:
return 'others'

[14]: g = df.groupby(lambda x: grouper(df, x, 'temperature'))


g

[14]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018BD7760F48>

[15]: for key, d in g:


print("Group by Key: {}\n".format(key))
print(d)

6
Group by Key: 50-60

day city temperature windspeed event


9 1/2/2017 paris 50 13 Cloudy
10 1/3/2017 paris 54 8 Cloudy
Group by Key: 80-90

day city temperature windspeed event


4 1/1/2017 mumbai 90 5 Sunny
5 1/2/2017 mumbai 85 12 Fog
6 1/3/2017 mumbai 87 15 Fog
Group by Key: others

day city temperature windspeed event


0 1/1/2017 new york 32 6 Rain
1 1/2/2017 new york 36 7 Sunny
2 1/3/2017 new york 28 12 Snow
3 1/4/2017 new york 33 7 Sunny
7 1/4/2017 mumbai 92 5 Rain
8 1/1/2017 paris 45 20 Sunny
11 1/4/2017 paris 42 10 Cloudy

[ ]:

[ ]:

You might also like