Python Pandas II Notes XII


CLASS – XII Sub. – Informatics Practices (065) Study Material – 02


CHAPTER – Data Handling Using Python Pandas – II

Descriptive Statistics:
• Descriptive statistics are used to describe and summarize large data sets in ways that are
meaningful and useful. They are the "must know" facts about any set of data. They give us a
general idea of the trends in our data, including:
• The mean, mode, median and range.
• Variance, standard deviation and quartiles.
• Sum, count, maximum and minimum.
• Descriptive statistics are useful because they help us take decisions. For example, suppose we
have data on the incomes of one million people. No one is going to read a million pieces of
data, and even if they did, they would not be able to extract any useful information from it. On
the other hand, if we summarize it, it becomes useful: an average wage, or a median income, is
much easier to understand than reams of raw data.
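These summary measures can all be computed directly in pandas; a minimal sketch (the marks data below is illustrative, not from the notes):

```python
import pandas as pd

# illustrative data: marks of five students
df = pd.DataFrame({"Marks": [45, 67, 89, 55, 72]})
print(df["Marks"].mean())    # average of the marks: 65.6
print(df["Marks"].median())  # middle value of the sorted marks: 67.0
print(df.describe())         # count, mean, std, min, quartiles and max in one table
```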
• Some Essential functions:
➢ min() and max(): The min() and max() functions find the minimum and maximum
value respectively from a given set of data.
Syntax:
<df>.min(axis=None, skipna=None, numeric_only=None)
<df>.max(axis=None, skipna=None, numeric_only=None)
Parameters:
axis : {index (0), columns (1)}
skipna : Exclude NA/null values when computing the result
numeric_only : Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data. Not implemented for
Series.
Example of max():
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], "B":[5, np.nan, 54, 3, 2],
"C":[20, 16, 7, 3, 8], "D":['ABC', 3, 17, 2, 6]})
print(df)
# numeric_only=True skips the mixed-type column D
print(df.max(axis = 0, numeric_only=True))
output:

1|Page By Sumit Sir…


➢ mean(), mode(), median():
• Mean - Mean is an average of all the numbers. The steps required to calculate a
mean are:
• sum up all the values of a target variable in the dataset
• divide the sum by the number of values
• Median - Median is the middle value of a sorted list of numbers. The steps
required to get a median from a list of numbers are:
• sort the numbers from smallest to largest
• if the list has an odd number of values, the value in the middle position is
the median.
• if the list has an even number of values, the average of the two values in
the middle will be the median.
• Mode - To find the mode, or modal value, it is best to put the numbers in order,
then count how many there are of each number. The number that appears most
often is the mode.
e.g.
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3',
'Basket4', 'Basket5', 'Basket6'])
print(df)
output:



print("\n------ Calculate Mean ---\n")
print(df.mean())
print("\n----- Calculate Median ---\n")
print(df.median())
print("\n------ Calculate Mode ----\n")
print(df.mode())
output:

➢ count(): This function counts the non-NA entries for each row or column. The values
None, NaN, NaT etc. are treated as NA in pandas.
Syntax:
<df>.count(axis=0, numeric_only=False)
e.g.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[-5, 8, 12, None, 5, 3],
"B":[-1, None, 6, 4, None, 3],
"C":["sam", "haris", "alex", np.nan, "peter", "nathan"]})
print(df)
output:

#count of non-NA values in each column (axis=0)


print(df.count(axis = 0))
output:



#Use count() function to find the number of non-NA/null values in each row (axis=1)
print(df.count(axis = 1))
output:

➢ sum(): This function returns the sum of the values for the requested axis.
Syntax:
<df>.sum(axis=None, skipna=None, numeric_only=None, min_count=0)
Parameters:
axis : {index (0), columns (1)}
skipna : Exclude NA/null values when computing the result.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
min_count : The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
e.g.
i) Example of sum():

ii) Example of sum():
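A minimal runnable sketch of sum(), using illustrative data (the column names here are assumptions for the demonstration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [10, 20, np.nan], "B": [1, 2, 3]})
print(df.sum())                  # column-wise sums; NaN skipped: A -> 30.0, B -> 6
print(df.sum(axis=1))            # row-wise sums
print(df["A"].sum(min_count=3))  # NaN, because only 2 non-NA values are present
```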



➢ var(): The variance function in python pandas is used to calculate the variance of a
given set of numbers: the variance of a whole data frame, of a column, or of rows.
Let's see an example of each.

Formula to calculate variance: s² = Σ(x − x̄)² / (n − 1)
e.g.
import pandas as pd
import numpy as np
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
print(df.var())
#df.loc[:,"Age"].var() for variance of specific column #df.var(axis=0) column variance
#df.var(axis=1) row variance
Output:

➢ quantile():
• The word "quantile" comes from the word quantity. A quantile is a cut point
that divides a sample into equal-sized subgroups (that is why it is sometimes
called a "fractile"). It can also refer to dividing a probability distribution
into areas of equal probability.
• The median is a kind of quantile; it is placed at the centre of a probability
distribution, so that exactly half of the data lies below the median and half
of the data lies above it. Because the median cuts a distribution into two
equal parts, it is sometimes called the 2-quantile.
• Quartiles are quantiles that divide the distribution into four equal parts.
Deciles are quantiles that divide a distribution into 10 equal parts, and
percentiles divide a distribution into 100 equal parts.
• Common Quantiles: Certain types of quantiles are used commonly enough to
have specific names. Below is a list of these:
• The 2-quantile is called the median
• The 3-quantiles are called terciles
• The 4-quantiles are called quartiles
• The 5-quantiles are called quintiles
• The 6-quantiles are called sextiles
• The 7-quantiles are called septiles
• The 8-quantiles are called octiles
• The 10-quantiles are called deciles



• The 12-quantiles are called duodeciles
• The 20-quantiles are called vigintiles
• The 100-quantiles are called percentiles
• The 1000-quantiles are called permilles
• Lower Quartile (Q1) has one-fourth of the data values at or below it (the
middle of the smaller half)
• Upper Quartile (Q3) has three-fourths of the data values at or below it (the
middle of the larger half)
• Interquartile range (IQR): middle half of data values
IQR = upper quartile(Q3) – lower quartile (Q1)
Syntax of quantile():
<df>.quantile(q=0.5, axis=0, numeric_only=True)
Parameters:
q : float or array-like, default 0.5 (the 50% quantile). 0 <= q <= 1, the quantile(s) to
compute, e.g. q=[0.25, 0.50, 0.75, 1.0]
axis : {0, 1, 'index', 'columns'}
numeric_only : If False, the quantile of datetime and timedelta data will be computed as well.
e.g.
#Program in python to find the 0.25 quantile of the series [1, 10, 100, 1000]
import pandas as pd
import numpy as np
s = pd.Series([1, 10, 100, 1000])
r=s.quantile(.25)
print(r)
Steps to find the quantile value:
1. q = 0.25 (the 0.25 quantile)
2. n = 4 (number of elements)
position = (n – 1)*q + 1
= (4 – 1)*0.25 + 1
= 3*0.25 + 1
= 1.75
3. The integer part is a = 1 and the fractional part is b = 0.75, where T denotes a
term of the sorted data. The interpolation formula here is:
= T1 + b*(T2 – T1) #the terms used change according to the position
= 1 + 0.75*(10 – 1)
= 1 + 0.75*9
= 1 + 6.75
= 7.75
Output: Quantile is 7.75
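The quartiles and IQR defined above follow directly from quantile(); a short sketch on the same series:

```python
import pandas as pd

s = pd.Series([1, 10, 100, 1000])
q1 = s.quantile(0.25)    # 7.75, matching the worked example above
q3 = s.quantile(0.75)    # 325.0
print("IQR =", q3 - q1)  # interquartile range: Q3 - Q1
```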
➢ mad() (Mean absolute deviation): This function is used to calculate the mean absolute
deviation of the values for the requested axis.
The MAD of a set of data is the average distance between each data value and the mean.



Formula: MAD = Σ|x − x̄| / n
e.g.
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
print (df.mad())
Output:
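Note: DataFrame.mad() was removed in pandas 2.0. On newer versions the same result can be computed manually from the definition above (a sketch using the same data):

```python
import pandas as pd

df = pd.DataFrame({'Age': [30, 25, 25, 30, 20], 'Score': [80, 60, 90, 50, 50]})
# mean of |x - mean| for each column, equivalent to the old df.mad()
mad = (df - df.mean()).abs().mean()
print(mad)  # Age -> 3.2, Score -> 15.2
```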

➢ Standard Deviation (std()):
Standard deviation measures the amount of variation or dispersion of a set of values.
A low standard deviation means the values tend to be close to the mean of the set, and a
high standard deviation means the values are spread out over a wider range.
Standard deviation is one of the most important concepts as far as finance is concerned.
Finance and banking are all about measuring risk, and standard deviation measures
risk.
std() is simply the square root of the variance returned by var().
e.g.
import pandas as pd
import numpy as np
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
print(df.std())
Output:



Apply Functions on a Subset of DataFrame
➢ Apply Functions on a Column of a DataFrame
Syntax: <df>[<col name>]
e.g.
a) df['Age'].min()
b) df['Name'].count()
➢ Apply Functions on Multiple Columns of a DataFrame
Syntax: <df>[[<col name1>, <col name2>, …]]
e.g. df[['Age','Score']].count()
Output:

➢ Apply Functions on a Row of a DataFrame


Syntax: <df>.loc[<row index>,:]
e.g.
df.loc[1,:].max()
Output:

➢ Apply Functions on a Range of Rows of a DataFrame


Syntax: <df>.loc[<start row>:<end row>,:]
e.g.
df.loc[2:4, :].sum()
Output:

DataFrame Sorting
➢ Sorting means arranging the contents in ascending or descending order. There are two
kinds of sorting available in pandas(Dataframe).
• By value (column): Sorting a dataframe by the elements of one or more
columns is supported by the sort_values() method. We will cover here three
aspects of sorting the values of a dataframe:
• Sort a pandas dataframe in python by Ascending and Descending
• Sort a python pandas dataframe by single column
• Sort a pandas dataframe by multiple columns.
Syntax:
<df>.sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')
Parameter:
na_position: Takes two string inputs, 'last' or 'first', to set the position of null
values. Default is 'last'.
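A short sketch of na_position in action (the Score column is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Score": [80, np.nan, 90, 50]})
print(df.sort_values(by="Score"))                       # NaN row placed last (default)
print(df.sort_values(by="Score", na_position="first"))  # NaN row placed first
```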



e.g.
i) Sort the python pandas Dataframe by single column – Ascending order:
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
Output:

df=df.sort_values(by='Score')
print("Dataframe contents after sorting")
print (df)
Output:

ii) Sort the python pandas Dataframe by single column – Descending order:
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
Output:

df=df.sort_values(by='Score',ascending=0)
print("Dataframe contents after sorting")
print (df)
Output:



Note: In the above example a dictionary object is used to create the dataframe.
The elements of dataframe object df are sorted by the sort_values() method. We
pass 0 or False for the ascending parameter, which sorts the data in
descending order of Score.
iii) Sort the pandas Dataframe by Multiple columns:
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Smith','Russel','Gayle','Rohit','Morgan']),
'Age':pd.Series([30,25,25,30,20,28,34]),
'Score':pd.Series([80,60,90,50,50,98,48]),
'Country': pd.Series(['Ind','Ind','Aus','WI','WI','Ind','Eng'])}
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print(df)
Output:

df.sort_values(by=['Country','Name'],ascending=[True,False],inplace=True)
print("Dataframe contents after sorting")
print(df)

• By index: Sorting over the dataframe index is supported by the sort_index()
method. We will cover here two aspects of sorting the index of a dataframe:
• how to sort a pandas dataframe in python by index in Ascending order
• how to sort a pandas dataframe in python by index in Descending order
Syntax:
<df>.sort_index(axis=0, ascending=True, inplace=False, na_position='last')
e.g.
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}

df = pd.DataFrame(d)
df= df.reindex([1,4,3,2,0])
print("Dataframe contents without sorting")
print (df)
Output:

df=df.sort_index()
print("Dataframe contents after sorting")
print(df)

Note: In the above example a dictionary object is used to create the dataframe.
The elements of dataframe object df are first reindexed by the reindex() method
(index 1 is positioned at 0, index 4 at 1, and so on), then sorted by the
sort_index() method. By default it sorts in ascending order of the index.
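Descending index order only needs ascending=False; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Score": [80, 60, 90]}, index=[2, 0, 1])
print(df.sort_index())                 # rows rearranged into index order 0, 1, 2
print(df.sort_index(ascending=False))  # rows rearranged into index order 2, 1, 0
```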
DataFrame: reindex()
➢ reindex(): This function is used to change the row labels and column
labels of a DataFrame. To reindex means to conform the data to match a
given set of labels along a particular axis. It will:
Reorder the existing data to match a new set of labels.
Insert missing value (NaN) markers in label locations where no data for the
label existed.
Syntax:
<df>.reindex(labels=None, index=None, columns=None, fill_value=np.nan,
axis=None)
Parameters:
labels: New labels/index to conform the axis specified by 'axis' to.
fill_value: Value to use for missing positions introduced by the reindexing.
Defaults to NaN, but can be any compatible value.

Example #1: Use reindex() function to reindex the dataframe. By default values in the new index
that do not have corresponding records in the dataframe are assigned NaN.
import pandas as pd
df = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]},
index =["first", "second", "third", "fourth", "fifth"])
print(df)
Output:

# reindexing with new index values


df1=df.reindex(["first", "dues", "trois", "fourth", "fifth"])
print(df1)
Output:

Example #2: Use reindex() function to reindex the column axis and specify fill values.
import pandas as pd
df1=pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4], "C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]})
# reindexing the column axis with
# old and new index values
df1 = df.reindex(columns=["A", "B", "D", "E"])
print(df1)
Output:

# reindex the columns
# fill the missing values with 30
df2 = df.reindex(columns=["E", "B", "C", "A", "D"], fill_value=30)
print(df2)
Output:

Pivoting DataFrame
➢ Pivot – Pivot reshapes data and uses unique values from index/columns to
form the axes of the resulting dataframe. index is the column name to use to
make the new frame's index, columns is the column name to use to make the
new frame's columns, and values is the column name to use for populating
the new frame's values.
➢ Exception: a ValueError is raised if there are any duplicate index/columns pairs.
Syntax:
<df>.pivot(index=None, columns=None, values=None)
#Example of pivot()
import pandas as pd
import numpy as np
data={'Item':['TV','TV','AC','AC'],
'Company':['LG','VIDEOCON','LG','SONY'],
'Rupees':[12000,10000,15000,14000], 'USD':[700,650,800,750]}
df1=pd.DataFrame(data)
print(df1)
Output:

pvt=df1.pivot(index='Item', columns='Company', values='Rupees')
print(pvt)

print(pvt[pvt.index=='TV'].LG.values)
Note: pivot() creates a new table/DataFrame whose columns are the unique
values in Company and whose rows are indexed with the unique values
of Item. The last statement of the above program returns the value of the TV
item for the LG company, i.e. 12000.
Now in the previous example, if we want to pivot the values of both RUPEES and
USD together, we have to use the pivot function in the manner below.
p2=df1.pivot(index='Item', columns='Company')

#Common Mistake in Pivoting


The pivot method takes at least 2 column names as parameters - the index and
the columns named parameters. Now the problem is: what happens if we have
multiple rows with the same values for these columns? What will be the value
of the corresponding cell in the pivoted table using the pivot method? The
following diagram depicts the problem:

df1.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
It throws an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape

➢ Pivot table: Pivot table is used to summarize and aggregate data inside
dataframe.
Syntax:
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean')
OR
<df>.pivot_table(values=None, index=None, columns=None, aggfunc='mean')
#Pivot Table
The pivot_table() method comes to solve this problem. It works like pivot, but it
aggregates the values from rows with duplicate entries for the specified columns.

df1.pivot_table(index='Item', columns='Company',
values='Rupees', aggfunc=np.mean)
In essence pivot_table is a generalization of pivot, which allows you to
aggregate multiple values with the same destination in the pivoted table.
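A runnable sketch of this difference, using data similar to the earlier example but with a duplicate Item/Company pair:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['TV', 'TV', 'TV'],
                   'Company': ['LG', 'LG', 'SONY'],
                   'Rupees': [12000, 14000, 15000]})

# pivot() fails on the duplicate ('TV', 'LG') rows
try:
    df.pivot(index='Item', columns='Company', values='Rupees')
except ValueError as err:
    print("pivot raised:", err)

# pivot_table() aggregates the duplicates (mean by default)
pvt = df.pivot_table(index='Item', columns='Company', values='Rupees')
print(pvt)  # LG -> 13000.0 (mean of 12000 and 14000), SONY -> 15000.0
```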
DataFrame: groupby()
➢ Any groupby operation involves one of the following operations on the
original object. They are −
o Splitting the Object
o Applying a function
o Combining the results
➢ In many situations, we split the data into sets and we apply some
functionality on each subset. In the apply functionality, we can perform
the following operations
o Aggregation − computing a summary statistic
o Transformation − perform some group-specific operation
Syntax:
<df>.groupby(by=None, axis=0)
Parameters:
by: label or list of labels to be used for grouping.
axis: {0 for index, 1 for column}, default 0; split along rows (0) or
columns (1).
e.g.
df1.groupby('Tutor')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x0a399393>
The result of groupby() is also an object, the DataFrameGroupBy object.
gdf=df1.groupby('Tutor')
gdf is a GroupBy object.
You can store the GroupBy object in a variable name and then use the
following attributes
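For example, attributes such as groups, ngroups and indices can be inspected (a sketch; the Tutor data here is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Tutor': ['Amit', 'Beena', 'Amit'], 'Fees': [100, 200, 150]})
gdf = df1.groupby('Tutor')
print(gdf.ngroups)  # number of groups: 2
print(gdf.groups)   # mapping of group label -> row labels in that group
print(gdf.indices)  # mapping of group label -> integer positions
```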

e.g.
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,
2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,
701,804,690]}
df = pd.DataFrame(ipl_data)
print (df)
Output:

#store groupobject
gdf=df.groupby('Team')
#view groups
print(gdf.groups)
Output:
{'Devils': Int64Index([2, 3], dtype='int64'),
'Kings': Int64Index([4, 6, 7], dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
'Royals': Int64Index([9, 10], dtype='int64'),
'kings': Int64Index([5], dtype='int64')}
#select a group
print(gdf.get_group('Riders'))
Output:

#get size of group


print(gdf.size())
Output:

#count the value from group


print(gdf.count())
Output:

Group by with multiple columns −


gdf1=df.groupby(['Team','Year'])
#print grouping record
print(gdf1.groups)
Output:

#print size of nested groups
print(gdf1.size())
Output:

print(gdf1.get_group(('Kings',2016)))
Output:

#print count value of nested group


print(gdf1.count())
Output:

➢ Aggregations via groupby()


An aggregation function returns a single aggregated value for each group.
Once the groupby object is created, several aggregation operations can be
performed on the grouped data.

Syntax:
<GroupbyObject>.agg(func, axis=0)
<df>.groupby(by).agg(func)
func: function, str, list or dict
Example of GroupBy() using agg():
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
gdf2=df.groupby('Year')
print(gdf2['Points'].agg(np.mean))
Output:

#Applying Multiple Aggregation Functions at Once


print(gdf2['Points'].agg([np.mean,np.sum,np.median]))
Output:

Importing data from a MySQL database into a Pandas data frame:


e.g.
import mysql.connector as sql
import pandas as pd
db_connection = sql.connect(host='localhost', database='bank', user='root', password='')
db_cursor = db_connection.cursor()
db_cursor.execute('SELECT * FROM bmaster')
table_rows = db_cursor.fetchall()
df = pd.DataFrame(table_rows)
print(df)
OUTPUT
The output will be the data available in the table bmaster in mysql.

Note :- for the mysql.connector library, use the command pip install mysql-connector-python in the command prompt.
Pass the proper host name, database name, user name and password (if any) in the connect method.
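pandas can also do the fetch-and-build in one step with read_sql(). The sketch below uses an in-memory SQLite database (an assumption, so it runs without a MySQL server); with MySQL you would pass the db_connection object from above instead:

```python
import sqlite3
import pandas as pd

# in-memory SQLite stands in for the MySQL connection above
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bmaster (accno INTEGER, name TEXT)")
con.executemany("INSERT INTO bmaster VALUES (?, ?)", [(101, "vishal"), (102, "ram")])

df = pd.read_sql("SELECT * FROM bmaster", con)  # query result straight into a DataFrame
print(df)
```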

Exporting data to a MySQL database from a Pandas data frame
e.g.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://root:root@localhost/bank')
lst = ['vishal', 'ram']
lst2 = [11, 22]
# Calling DataFrame constructor after zipping
# both lists, with columns specified
df = pd.DataFrame(list(zip(lst, lst2)), columns =['Name', 'val'])
df.to_sql(name='bmaster', con=engine, if_exists = 'replace', index=False)

The connection URL has the form 'mysql+mysqlconnector://<user name>:<password>@<server>/<database name>'.

Note :- Create the dataframe as per the structure of the table. The to_sql() method is used to write data from a
dataframe to a mysql table. The third-party sqlalchemy library (install with pip install sqlalchemy) is used for writing the data.

