
Pandas plots

Pandas also provides options to visualize data. Here are some examples:

Syntax:

    df.plot(x, y, marker, kind)

x = column to plot on the x-axis

y = column to plot on the y-axis

marker = marker shape, used in specific plots such as a scatter plot

kind = type of plot ('line', 'scatter', 'bar', 'hist', ...)

A scatter plot to visualize the trend of acceleration in different years.

    df.plot(x = 'model_year', y = 'acceleration', marker = 'o', kind = 'scatter');
A bar plot to visualize mean acceleration in different years.

    df.groupby('model_year').mean()[['acceleration']].plot(kind = 'bar');

A histogram to visualize the frequency distribution of cylinders.

    df['cylinders'].plot(kind = 'hist')
A scatter plot to visualize the relationship between weight and mpg.

    df.plot(x = 'weight', y = 'mpg', kind = 'scatter')
A bar plot to visualize the sorted mean values of acceleration with respect to the number of cylinders.

    df.groupby('cylinders').mean().sort_values('acceleration')[['acceleration']].plot(kind = 'bar')
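The calls above can be tried end-to-end on a small stand-in DataFrame. The column names below follow the auto-mpg dataset used in these examples, but the values are made up; the Agg backend is used so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: no window needed
import pandas as pd

# Tiny stand-in for the auto-mpg data (values are made up)
df = pd.DataFrame({
    'model_year':   [70, 70, 71, 71, 72],
    'acceleration': [12.0, 11.5, 13.2, 14.1, 13.8],
    'cylinders':    [8, 8, 4, 6, 4],
    'weight':       [3504, 3693, 2372, 2833, 2130],
    'mpg':          [18.0, 15.0, 24.0, 20.0, 27.0],
})

# Each plotting call returns a matplotlib Axes that can be customized further
ax1 = df.plot(x='model_year', y='acceleration', marker='o', kind='scatter')
ax2 = df.groupby('model_year')[['acceleration']].mean().plot(kind='bar')
ax3 = df['cylinders'].plot(kind='hist')
ax4 = df.plot(x='weight', y='mpg', kind='scatter')
```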

Pandas Exercise

Problem Statement:
Use the rainfall dataset provided for this exercise. The data contains region (district) wise rainfall across India.

1. Import the data into the Python environment as a Pandas DataFrame.
2. Check for missing values, if any, and drop the corresponding rows.
3. Find the district that gets the highest annual rainfall.
4. Display the top 5 states that get the highest annual rainfall.
5. Drop the columns 'Jan-Feb', 'Mar-May', 'Jun-Sep', 'Oct-Dec'.
6. Display the state-wise mean rainfall for all the months using a pivot table.
7. Display the count of districts in each state.
8. For each state, display the district that gets the highest rainfall in May. Also display the recorded rainfall.
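A sketch of how steps 2, 3, 4, 7 and 8 might look. The frame below is a made-up miniature; the column names used here ('STATE_UT_NAME', 'DISTRICT', 'MAY', 'ANNUAL') are assumptions and may differ in your copy of the dataset.

```python
import pandas as pd

# Hypothetical miniature of the rainfall data; column names are assumptions
df = pd.DataFrame({
    'STATE_UT_NAME': ['KERALA', 'KERALA', 'PUNJAB'],
    'DISTRICT':      ['IDUKKI', 'KOLLAM', 'AMRITSAR'],
    'MAY':           [250.0, 180.0, 20.0],
    'ANNUAL':        [3600.0, 2700.0, 650.0],
})

df = df.dropna()                                          # 2. drop rows with missing values
top_district = df.loc[df['ANNUAL'].idxmax(), 'DISTRICT']  # 3. wettest district
top_states = (df.groupby('STATE_UT_NAME')['ANNUAL']       # 4. top states by mean annual rainfall
                .mean().sort_values(ascending=False).head(5))
counts = df['STATE_UT_NAME'].value_counts()               # 7. districts per state
wettest_may = df.loc[df.groupby('STATE_UT_NAME')['MAY'].idxmax(),   # 8. per-state May maximum
                     ['STATE_UT_NAME', 'DISTRICT', 'MAY']]
```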
Exercise
Problem Statement:
Given below is a DataFrame df:

    import pandas as pd

    df = pd.DataFrame([[0.23, 'f1'], [5.36, 'f2']],
                      index = list('pq'),
                      columns = list('ab'))

Do the following:

1. Change the column name from 'a' to 'A'.

2. Add a new column 'c' filled with random values.

3. Change the datatype of column 'A' values to complex.

4. Display the rows in which any element matches any element of the given list:

       lst = ['f30', 'f50', 'f2', 'f0']
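One possible solution sketch for the four tasks (rename, random column, complex cast, and isin-based row matching):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.23, 'f1'], [5.36, 'f2']],
                  index=list('pq'),
                  columns=list('ab'))

df = df.rename(columns={'a': 'A'})      # 1. rename column 'a' to 'A'
df['c'] = np.random.rand(len(df))       # 2. new column of random values
df['A'] = df['A'].astype(complex)       # 3. cast 'A' to complex numbers
lst = ['f30', 'f50', 'f2', 'f0']
matches = df[df.isin(lst).any(axis=1)]  # 4. rows where any element is in lst
```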

Merging
There are a few more commonly used methods in Pandas, as listed:

- Merging
- Reshaping
- Pivot Tables
- Grouping

Let us understand their importance through some illustrations, starting with the merging of datasets. Suppose we are given two datasets from an experiment, each recording its own features for the same observations. Our task is to form a single dataset combining all the features for each observation. To do so we can take the help of the concat() function.

    import pandas as pd

    data1 = pd.DataFrame([[15, 12, -3],
                          [33, 54, 21],
                          [10, 32, 22]],
                         columns = list('ABC'))
    data2 = pd.DataFrame([[10, 1, 3],
                          [33, -54, 2],
                          [10, 0.32, 2]],
                         columns = list('DEF'))
    print(data1)
    #     A   B   C
    # 0  15  12  -3
    # 1  33  54  21
    # 2  10  32  22

    print(data2)
    #     D      E  F
    # 0  10   1.00  3
    # 1  33 -54.00  2
    # 2  10   0.32  2

    # axis = 1 glues the frames side by side (column-wise)
    print(pd.concat([data1, data2], axis = 1))
    #     A   B   C   D      E  F
    # 0  15  12  -3  10   1.00  3
    # 1  33  54  21  33 -54.00  2
    # 2  10  32  22  10   0.32  2
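concat() simply glues frames together by position. When two datasets instead share a key column, pd.merge() joins rows by matching key values, similar to a SQL join. A minimal sketch (the 'id' column and values here are made up):

```python
import pandas as pd

left  = pd.DataFrame({'id': [1, 2, 3], 'A': [10, 20, 30]})
right = pd.DataFrame({'id': [2, 3, 4], 'B': [200, 300, 400]})

# inner join: keep only ids present in both frames
merged = pd.merge(left, right, on='id', how='inner')
print(merged)
#    id   A    B
# 0   2  20  200
# 1   3  30  300
```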

Now suppose an individual recorded the same 3 features in two separate batches of observations and wants to combine all these samples into a single dataset. We can achieve this with the same concat() function, this time changing the axis.

    import numpy as np
    import pandas as pd

    data1 = pd.DataFrame(np.random.randn(9).reshape(3,3),
                         columns = list('ABC'))
    data2 = pd.DataFrame(np.arange(9).reshape(3,3),
                         columns = list('ABC'))
    print(data1)   # random values; yours will differ
    #           A         B         C
    # 0  1.957218  0.433266  1.214950
    # 1 -0.143500 -0.092030 -0.823898
    # 2  0.481486 -0.024111 -0.769195

    print(data2)
    #    A  B  C
    # 0  0  1  2
    # 1  3  4  5
    # 2  6  7  8

    # axis = 0 stacks the frames vertically (row-wise); note the repeated index
    print(pd.concat([data1, data2], axis = 0))
    #           A         B         C
    # 0  1.957218  0.433266  1.214950
    # 1 -0.143500 -0.092030 -0.823898
    # 2  0.481486 -0.024111 -0.769195
    # 0  0.000000  1.000000  2.000000
    # 1  3.000000  4.000000  5.000000
    # 2  6.000000  7.000000  8.000000
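Note the repeated row labels (0, 1, 2, 0, 1, 2) in the stacked result. If the original labels don't matter, ignore_index=True tells concat() to build a fresh RangeIndex instead, for example:

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2]})
b = pd.DataFrame({'A': [3, 4]})

stacked = pd.concat([a, b], axis=0, ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3] rather than [0, 1, 0, 1]
```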

Pivot Tables
Reshaping a dataset makes for quick textual inspection. Pivot tables go a step further, rearranging the data around chosen index and column features for an even clearer view.

To understand pivot tables we take a DataFrame similar to the previous ones, this time adding a new feature, 'Score'.

    import pandas as pd

    df = pd.DataFrame([
            ['IND', 'Gold',   'Game1', '9.9'],
            ['IND', 'Bronze', 'Game2', '8'],
            ['USA', 'Silver', 'Game1', '9.5'],
            ['USA', 'Gold',   'Game2', '8.6'],
         ], columns = ['Country', 'Medal', 'Game', 'Score'],
         index = ['Year1', 'Year2', 'Year1', 'Year2'])

    print(df)
    #       Country   Medal   Game Score
    # Year1     IND    Gold  Game1   9.9
    # Year2     IND  Bronze  Game2     8
    # Year1     USA  Silver  Game1   9.5
    # Year2     USA    Gold  Game2   8.6

Pivot tables come in handy when we have to break a large dataset (in terms of features) down into fewer features for quick visualization, for example, finding which medal is common to both IND and USA, or listing the game(s) in which India won Silver.

We implement a pivot table on the given dataset:

    # Listing all the features
    print(df.pivot(index = 'Country', columns = 'Medal'))
    #           Game                 Score
    # Medal   Bronze   Gold  Silver Bronze Gold Silver
    # Country
    # IND      Game2  Game1    None      8  9.9   None
    # USA       None  Game2   Game1   None  8.6    9.5

    # Listing only the Score feature
    print(df.pivot(index = 'Country', columns = 'Medal',
                   values = 'Score'))
    # Medal   Bronze Gold Silver
    # Country
    # IND          8  9.9   None
    # USA       None  8.6    9.5
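One caveat: Score was entered as strings ('9.9', '8', ...) above, so the pivoted values are strings too. Casting the column to float first makes the result numeric and usable for arithmetic; a small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([
        ['IND', 'Gold',   'Game1', '9.9'],
        ['IND', 'Bronze', 'Game2', '8'],
        ['USA', 'Silver', 'Game1', '9.5'],
        ['USA', 'Gold',   'Game2', '8.6'],
     ], columns=['Country', 'Medal', 'Game', 'Score'])

df['Score'] = df['Score'].astype(float)  # '9.9' -> 9.9, '8' -> 8.0, ...
wide = df.pivot(index='Country', columns='Medal', values='Score')
print(wide)
```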

Pivot table limitations


The previous dataset contains no duplicate samples. Suppose instead that, say, two people from India win a Gold and a Silver medal in Year1 of Game1. In that case pivot() breaks, because a single (Country, Game) cell would have to hold more than one value. To overcome this limitation, pandas provides another function, pivot_table(), whose aggfunc argument specifies how to resolve the ambiguity. Let's understand it through an example.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([
            ['IND', 'Gold',   'Game1', '9.9'],
            ['IND', 'Silver', 'Game1', '9.5'],
            ['IND', 'Bronze', 'Game2', '8'],
            ['USA', 'Bronze', 'Game1', '9.0'],
            ['USA', 'Silver', 'Game2', '8.6'],
         ], columns = ['Country', 'Medal', 'Game', 'Score'],
         index = ['Year1', 'Year1', 'Year2', 'Year1', 'Year2'])

    print(df)
    #       Country   Medal   Game Score
    # Year1     IND    Gold  Game1   9.9
    # Year1     IND  Silver  Game1   9.5
    # Year2     IND  Bronze  Game2     8
    # Year1     USA  Bronze  Game1   9.0
    # Year2     USA  Silver  Game2   8.6

    print(df.pivot_table(index = 'Country',
                         columns = 'Game',
                         values = 'Score',
                         aggfunc = np.max))
    # Game    Game1 Game2
    # Country
    # IND       9.9     8
    # USA       9.0   8.6
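pivot_table() also accepts aggregations by name. With Score held as numbers, aggfunc='mean' would, for instance, average the two duplicate IND/Game1 rows instead of keeping the larger one; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame([
        ['IND', 'Gold',   'Game1', 9.9],
        ['IND', 'Silver', 'Game1', 9.5],
        ['IND', 'Bronze', 'Game2', 8.0],
        ['USA', 'Bronze', 'Game1', 9.0],
        ['USA', 'Silver', 'Game2', 8.6],
], columns=['Country', 'Medal', 'Game', 'Score'])

# 'mean' resolves the IND/Game1 ambiguity as (9.9 + 9.5) / 2 = 9.7
table = df.pivot_table(index='Country', columns='Game',
                       values='Score', aggfunc='mean')
print(table)
```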

Grouping
To understand grouping, a concept similar to GROUP BY in databases, consider a dataset of laptop and desktop system sales. Observations for a particular category can repeat with different sale prices. If we need to calculate the total sales of each category, we can group the similar rows and apply an aggregate function.

To do so we create a small dummy dataset and sum up the total sales for each category.

    import pandas as pd

    df = pd.DataFrame([["Laptop", 1000],
                       ["Laptop", 2520],
                       ["Desktop", 3000],
                       ["Desktop", 400]],
                      columns = ['Category', 'Sales'])

    print(df)
    #   Category  Sales
    # 0   Laptop   1000
    # 1   Laptop   2520
    # 2  Desktop   3000
    # 3  Desktop    400

    # sort = False keeps the groups in order of first appearance
    print(df.groupby(['Category'], sort = False).sum())
    #           Sales
    # Category
    # Laptop     3520
    # Desktop    3400
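groupby() is not limited to a single aggregate; agg() can compute several at once, for instance:

```python
import pandas as pd

df = pd.DataFrame([["Laptop", 1000],
                   ["Laptop", 2520],
                   ["Desktop", 3000],
                   ["Desktop", 400]],
                  columns=['Category', 'Sales'])

# sum, mean and count of Sales per category in one pass
stats = df.groupby('Category')['Sales'].agg(['sum', 'mean', 'count'])
print(stats)
#            sum    mean  count
# Category
# Desktop   3400  1700.0      2
# Laptop    3520  1760.0      2
```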


Exercise

Problem Statement:
Given below is a DataFrame df with three attributes: set_name (system names), spd_per_day (the day of each speed measurement), and speed (network speed in MBps).

    import pandas as pd

    sys = ['s1','s1','s1','s1',
           's2','s2','s2','s2']
    net_day = ['d1','d1','d2','d2',
               'd1','d1','d2','d2']
    spd = [1.3, 11.4, 5.6, 12.3,
           6.2, 1.1, 20.0, 8.8]
    df = pd.DataFrame({'set_name':sys,
                       'spd_per_day':net_day,
                       'speed':spd})

Do the following:

1. Construct a dataframe new_df in which the given dataset is grouped by system (s1 and s2) and day (d1 and d2), with the median speed for each day per system. Also, give the speed attribute a secondary name 'Median'.
2. Sort the dataframe new_df in ascending order of the median speed.

Hints/Answers:

1. Answer:

         set_name spd_per_day  speed
                              Median
      0        s1          d1   6.35
      1        s1          d2   8.95
      2        s2          d1   3.65
      3        s2          d2  14.40

2. Answer:

         set_name spd_per_day  speed
                              Median
      2        s2          d1   3.65
      0        s1          d1   6.35
      1        s1          d2   8.95
      3        s2          d2  14.40
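One way to reproduce the answers above (a sketch; attaching the secondary name via a MultiIndex is just one of several spellings that work):

```python
import pandas as pd

sys = ['s1','s1','s1','s1', 's2','s2','s2','s2']
net_day = ['d1','d1','d2','d2', 'd1','d1','d2','d2']
spd = [1.3, 11.4, 5.6, 12.3, 6.2, 1.1, 20.0, 8.8]
df = pd.DataFrame({'set_name': sys, 'spd_per_day': net_day, 'speed': spd})

# 1. median speed per (system, day)
new_df = (df.groupby(['set_name', 'spd_per_day'], as_index=False)['speed']
            .median())
# give the speed column the secondary name 'Median'
new_df.columns = pd.MultiIndex.from_tuples(
    [('set_name', ''), ('spd_per_day', ''), ('speed', 'Median')])

# 2. sort ascending by the median speed
sorted_df = new_df.sort_values(('speed', 'Median'))
print(sorted_df)
```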
