Python Pandas ch-2


Pandas is one of the most preferred and widely used data science libraries.

A DataFrame is one of the data structures that Pandas provides.
The chapter will cover pivoting, sorting, aggregation,
descriptive statistics, histograms and quantiles.
Series and DataFrames are the basic data structures
• import numpy as np
• import pandas as pd
• A Series is a pandas data structure that
represents a one-dimensional, array-like object
containing an array of data (of any NumPy data
type) and an associated array of data labels
called the index.
• A DataFrame is a two-dimensional data
structure, i.e., data is aligned in a tabular
fashion in rows and columns.
• Features of DataFrame
• Columns can be of different types
• Size-mutable
• Labeled axes (rows and columns)
• Arithmetic operations can be performed on rows
and columns
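The features listed above can be seen in a small sketch. The student names and marks are made up for illustration and are not taken from the slides.

```python
import pandas as pd

# Hypothetical student data, invented for this illustration.
students = pd.DataFrame(
    {
        "Name": ["Asha", "Ravi", "Meena"],   # strings
        "Marks": [82, 74, 91],               # integers
        "Passed": [True, True, True],        # booleans
    },
    index=["r1", "r2", "r3"],                # labeled row axis
)

# Columns of different types live side by side.
print(students.dtypes)

# Labeled axes: rows and columns both carry labels.
print(list(students.index), list(students.columns))

# Arithmetic on a column; assigning the result adds a column (size-mutable).
students["Bonus"] = students["Marks"] + 5
print(students)
```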
Structure
Let us assume that we are creating a DataFrame with students' data.
You can think of it as an SQL table or a spreadsheet data representation.
Series and DataFrames are the basic data structures
• A DataFrame is a pandas data structure that represents a two-
dimensional labelled array: an ordered collection of
columns, where columns may store different types of data, e.g.
numeric, string, float or boolean.
Major characteristics of the DataFrame data structure:
• It has two indices/axes: the row index (axis=0) and the column
index (axis=1).
• Conceptually it is like a spreadsheet where each value is
identifiable by the combination of row index and column
index.
• Indices can be numbers, strings or letters.
• It is value-mutable, i.e. you can change its values.
• It is size-mutable: you can add or delete rows/columns in a DataFrame.
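A minimal sketch of these characteristics follows. The subject names, student names and marks are hypothetical values chosen only for the illustration.

```python
import pandas as pd

# Hypothetical marks table, invented for this illustration.
df = pd.DataFrame({"Maths": [80, 65], "Science": [72, 90]}, index=["Anu", "Bala"])

# Two axes: axis=0 runs down the rows, axis=1 runs across the columns.
col_totals = df.sum(axis=0)   # one total per column
row_totals = df.sum(axis=1)   # one total per row
print(col_totals)
print(row_totals)

# Value-mutable: a cell is addressed by (row label, column label).
df.loc["Bala", "Maths"] = 70

# Size-mutable: add a row and a column, then delete the column again.
df.loc["Chitra"] = [55, 60]
df["English"] = [75, 68, 81]
del df["English"]
print(df)
```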
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)
The parameters of the constructor are as follows −

Sr.No  Parameter  Description
1      data       The input data. It takes various forms such as ndarray, Series, map, list, dict, constants, or another DataFrame.
2      index      Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3      columns    Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4      dtype      Data type of the columns. Only a single dtype can be given.
5      copy       Whether to copy the input data. The default given here is False.
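As a rough sketch, the call below passes every constructor parameter explicitly and compares the result with a call that relies on the defaults. The array values and labels are arbitrary.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])   # arbitrary sample data

df = pd.DataFrame(
    data=values,
    index=["x", "y", "z"],          # row labels
    columns=["first", "second"],    # column labels
    dtype=float,                    # one dtype applied to all columns
    copy=True,                      # work on a copy of `values`
)
print(df)

# With no index or columns, both axes are labeled 0, 1, 2, ...
default_df = pd.DataFrame(values)
print(default_df)
```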
Create DataFrame
A pandas DataFrame can be created using various inputs like −
– Lists
– dict
– Series
– Numpy ndarrays
– Another DataFrame
In the subsequent sections of this chapter, we will see how to create a DataFrame
using these inputs.
• Create an Empty DataFrame
• The most basic DataFrame that can be created is an empty DataFrame.
• Example
import pandas as pd
df = pd.DataFrame()
print(df)
Its output is as follows −
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists
• The DataFrame can be created using a single list or a list of lists.
• Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print (df)
Its output is as follows −
   0
0  1
1  2
2  3
3  4
4  5
• Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Its output is as follows −
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
• Example 3
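import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
# Cast only the numeric column: recent pandas versions raise an error when
# dtype=float is applied to the whole frame, because 'Name' holds strings
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})
print(df)
Its output is as follows −
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0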
• Example 4
• Let us now create an indexed DataFrame from a dictionary of lists.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)
Its output is as follows −
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
• Note − Observe, the index parameter assigns an index to each
row.
Create a DataFrame from List of Dicts
• List of Dictionaries can be passed as input data to create a DataFrame. The
dictionary keys are by default taken as column names.
Example 1
• The following example shows how to create a DataFrame by passing a list
of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Its output is as follows −
   a   b     c
0  1   2   NaN
1  5  10  20.0

Note − Observe, NaN (Not a Number) is appended in missing areas.


• Example 2
• The following example shows how to create a DataFrame by
passing a list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Its output is as follows −
        a   b     c
first   1   2   NaN
second  5  10  20.0
Example 3
• The following example shows how to create a DataFrame with a list of dictionaries, row indices,
and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
# With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# With two column indices, one of which ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)

Its output is as follows −
        a   b
first   1   2
second  5  10
        a   b1
first   1  NaN
second  5  NaN
Note − Observe, df2 is created with a column label ('b1') that is not a dictionary key, so NaN is filled in for that column.
Data Series
• A Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, Python
objects, etc.). The axis labels are collectively called the index.
• pandas.Series
• A pandas Series can be created using the following
constructor −
pandas.Series(data, index, dtype, copy)
A Series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant
The parameters of the constructor are as follows −
Sr.No  Parameter  Description
1      data       The input data. It takes various forms such as ndarray, list, dict, or constants.
2      index      Index values must be hashable and, for array-like data, the same length as data. Duplicate labels are permitted. Defaults to np.arange(n) if no index is passed.
3      dtype      The data type. If None, the data type will be inferred.
4      copy       Copy data. Default False.
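The constructor parameters can be tried out with a short sketch. It also covers the scalar case listed above, where the scalar is repeated for every label in the index. The values and labels are arbitrary.

```python
import pandas as pd

# data + index + dtype: integers are stored as floats because dtype=float.
from_list = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype=float)
print(from_list)

# Scalar data: an index is needed so pandas knows how many times to repeat it.
from_scalar = pd.Series(5, index=[0, 1, 2, 3])
print(from_scalar)
```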

Create an Empty Series


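The most basic Series that can be created is an empty Series.
Example
#import the pandas library and aliasing as pd
import pandas as pd
# dtype is given explicitly because the default dtype of an empty Series differs between pandas versions
s = pd.Series(dtype='float64')
print(s)
Its output is as follows −
Series([], dtype: float64)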
Create a Series from ndarray

If data is an ndarray, then the index passed must be of the same length. If no index is passed, then
by default the index will be range(n), where n is the array length, i.e., [0, 1, 2, …, len(array)-1].
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Its output is as follows −
0 a
1 b
2 c
3 d
dtype: object

We did not pass any index, so by default, it assigned the indexes ranging
from 0 to len(data)-1, i.e., 0 to 3.
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed
values in the output.
Create a Series from dict
A dict can be passed as input. If no index is
specified, the dictionary keys are taken in insertion order to construct the index.
If an index is passed, the values in data corresponding to the labels
in the index will be pulled out.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.
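The second case described above, passing an index together with a dict, can be sketched as follows. The label 'd' is deliberately absent from the dict to show what happens to a missing key.

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}

# Values are pulled out in the order of the index labels;
# a label with no matching key ('d') gets NaN.
s = pd.Series(data, index=["b", "c", "d", "a"])
print(s)
```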
Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a
DataFrame. The resultant index is the union of
all the series indexes passed.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

print(df)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
Selecting/Accessing a Subset from a Dataframe using Row/Column
Names

• You can use the following syntax to select/access a subset
from a DataFrame object:
<DataFrameObject>.loc[<startrow>:<endrow>,
<startcolumn>:<endcolumn>]
• To access a row, just give the row name/label like this:
<DF object>.loc[<row label>, :]. Make sure not to miss
the COLON AFTER COMMA.
• To access multiple rows, use:
<DF object>.loc[<start row> : <end row>, :]. Make sure
not to miss the COLON AFTER COMMA.
Selecting subset from DataFrame using Rows/Columns
data = {'Population':[10927986, 12691836, 4631392,4328063],
'Average_income' :[72167810876544, 85007812691836,
422678431392,5261782328063]}

df = pd.DataFrame(data,index=['Delhi','Mumbai','Kolkata','Chennai'])
print(df)
         Population  Average_income
Delhi      10927986  72167810876544
Mumbai     12691836  85007812691836
Kolkata     4631392    422678431392
Chennai     4328063   5261782328063
To access multiple rows make sure not to
miss the COLON after COMMA
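A minimal sketch of row selection with loc, reusing the city data from this slide. Note that a label slice includes both its start and end labels.

```python
import pandas as pd

data = {
    "Population": [10927986, 12691836, 4631392, 4328063],
    "Average_income": [72167810876544, 85007812691836, 422678431392, 5261782328063],
}
df = pd.DataFrame(data, index=["Delhi", "Mumbai", "Kolkata", "Chennai"])

# One row: a row label, then a colon after the comma for "all columns".
one_row = df.loc["Delhi", :]
print(one_row)

# Several rows: a label slice; 'Mumbai' and 'Kolkata' are both included.
row_range = df.loc["Mumbai":"Kolkata", :]
print(row_range)
```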
• To access selective columns, use:
<DF object>.loc[:, <start column> : <end column>]
Make sure not to miss the COLON BEFORE
COMMA. Like rows, all columns falling between
the start and end columns will also be listed.
• To access a range of columns from a range of
rows, use:
<DF object>.loc[<startrow> : <endrow>,
<startcolumn> : <endcolumn>]
To access selective columns make sure not to
miss the COLON before COMMA
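Column selection with loc can be sketched in the same way. The 'Hospitals' and 'Schools' columns and their values are hypothetical, added only so that there is a range of columns to slice.

```python
import pandas as pd

# 'Hospitals' and 'Schools' are made-up columns for this illustration.
df = pd.DataFrame(
    {
        "Population": [10927986, 12691836, 4631392, 4328063],
        "Hospitals": [189, 208, 149, 157],
        "Schools": [7916, 8508, 7226, 7617],
    },
    index=["Delhi", "Mumbai", "Kolkata", "Chennai"],
)

# All rows (colon before the comma), columns from 'Population' to 'Hospitals'.
cols = df.loc[:, "Population":"Hospitals"]
print(cols)

# A range of rows combined with a range of columns.
block = df.loc["Delhi":"Mumbai", "Hospitals":"Schools"]
print(block)
```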
To access a range of columns from a range of rows

df.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])
Its output is as follows −
one    2.0
two    2.0
Name: b, dtype: float64
Obtaining subset from DataFrame using Rows/Columns
Numeric index position
df.iloc[<startrow index>:<endrow index>, <startcolumn index>:<endcolumn index>]
• Selection by integer location
• Rows can be selected by passing an integer location to
the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])
Its output is as follows −
one    3.0
two    3.0
Name: c, dtype: float64
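One difference between the two indexers is easy to miss: an iloc slice stops before its end position, whereas a loc slice includes its end label. The sketch below selects the same block both ways, using the same data as the example above.

```python
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

# Positions 0 and 1 of the rows, position 0 of the columns (end is excluded).
by_position = df.iloc[0:2, 0:1]

# The same block by label (end labels are included).
by_label = df.loc["a":"b", "one":"one"]

print(by_position)
print(by_position.equals(by_label))
```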
Selecting/Accessing individual Values

<DF object>.<column>[<row name or row numeric index>]
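A short sketch of single-value access, reusing the city data from earlier in the chapter. Besides the column-then-row form shown above, pandas also provides the at (label-based) and iat (position-based) accessors, which state explicitly whether a label or a position is meant.

```python
import pandas as pd

data = {
    "Population": [10927986, 12691836, 4631392, 4328063],
    "Average_income": [72167810876544, 85007812691836, 422678431392, 5261782328063],
}
df = pd.DataFrame(data, index=["Delhi", "Mumbai", "Kolkata", "Chennai"])

# Column first, then the row label.
v1 = df.Population["Delhi"]

# Label-based scalar access: (row label, column label).
v2 = df.at["Delhi", "Population"]

# Position-based scalar access: (row position, column position).
v3 = df.iat[0, 0]

print(v1, v2, v3)
```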


Deleting columns in DataFrames
del <DF object>[<column name>]
Using the previous DataFrame, we will delete one
column using the del statement and another using the pop() function.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)
# using del statement
print("Deleting the first column using DEL function:")
del df['one']
print(df)
# using pop function
print("Deleting another column using POP function:")
df.pop('two')
print(df)
Iteration over a DataFrame
<df>.items() <df>.iterrows()
items()
Iterates over each column as a (key, value) pair, with the column
label as key and the column values as a Series
object. Older pandas versions called this method iteritems(); that name was removed in pandas 2.0.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
for key,value in df.items():
    print(key,value)
iterrows()
iterrows() returns an iterator yielding each index value along with a
Series containing the data in each row.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print(row_index,row)
Descriptive Statistics with Pandas
import pandas as pd
dfSales = {2016: {'Qtr1':34500, 'Qtr2':56000, 'Qtr3':47000, 'Qtr4':49000},
           2017: {'Qtr1':44900, 'Qtr2':46100, 'Qtr3':57000, 'Qtr4':59000},
           2018: {'Qtr1':54500, 'Qtr2':51000, 'Qtr3':57000, 'Qtr4':58900},
           2019: {'Qtr1':34500}}
sal_df = pd.DataFrame(dfSales)
Functions min() and max()
The min() and max() functions find the minimum or maximum value, respectively, from a given set of data.
Syntax
<dataFrame>.min(axis=None,skipna=None,numeric_only=None)
<dataFrame>.max(axis=None,skipna=None,numeric_only=None)
axis (0 or 1): by default, min and max are calculated along axis 0.
skipna (True or False): exclude NA/null values when computing the result.
numeric_only (True or False): include only float, int, boolean columns.
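Applied to the sales data defined above, min() and max() might be used as in this sketch. The 2019 column has only one quarter filled in, which makes the effect of skipna visible.

```python
import pandas as pd

dfSales = {
    2016: {"Qtr1": 34500, "Qtr2": 56000, "Qtr3": 47000, "Qtr4": 49000},
    2017: {"Qtr1": 44900, "Qtr2": 46100, "Qtr3": 57000, "Qtr4": 59000},
    2018: {"Qtr1": 54500, "Qtr2": 51000, "Qtr3": 57000, "Qtr4": 58900},
    2019: {"Qtr1": 34500},
}
sal_df = pd.DataFrame(dfSales)

# Default axis=0: one minimum per column (per year).
yearly_min = sal_df.min()

# axis=1: one maximum per row (per quarter).
quarterly_max = sal_df.max(axis=1)

# skipna=False: a column containing NaN yields NaN instead of a number.
strict_min = sal_df.min(skipna=False)

print(yearly_min)
print(quarterly_max)
print(strict_min)
```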
Functions count() and sum()
The function count() counts the non-NA entries for each row or column. The values None, NaN, NaT, etc.
are considered NA in pandas. The function sum() returns the sum of the values along the requested axis. The syntax for using count() and sum() is:
Syntax
<dataFrame>.count(axis=0,numeric_only=False)
<dataFrame>.sum(axis=None,skipna=None,numeric_only=None)
axis (0 or 1): by default, count and sum are calculated along axis 0.
skipna (True or False): exclude NA/null values when computing the result (sum() only; count() always ignores NA values).
numeric_only (True or False): include only float, int, boolean columns.
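A brief sketch of count() and sum() on the same sales data. Because 2019 has a value only for Qtr1, its count is 1 rather than 4.

```python
import pandas as pd

dfSales = {
    2016: {"Qtr1": 34500, "Qtr2": 56000, "Qtr3": 47000, "Qtr4": 49000},
    2017: {"Qtr1": 44900, "Qtr2": 46100, "Qtr3": 57000, "Qtr4": 59000},
    2018: {"Qtr1": 54500, "Qtr2": 51000, "Qtr3": 57000, "Qtr4": 58900},
    2019: {"Qtr1": 34500},
}
sal_df = pd.DataFrame(dfSales)

# Non-NA entries per column (axis 0) and per row (axis 1).
per_year_count = sal_df.count()
per_quarter_count = sal_df.count(axis=1)

# Column totals; NaN entries are skipped by default.
per_year_sum = sal_df.sum()

print(per_year_count)
print(per_quarter_count)
print(per_year_sum)
```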
What are Quantiles?
• Quantiles are points in a distribution that relate to the rank order
of values in that distribution. The quantile of a value is the
fraction of observations less than or equal to that value. The
quantile of the median is 0.5 by definition.
• The quantile() function
<dataframe>.quantile(q=0.5,axis=0,numeric_only=True)
q: float or array-like, default 0.5 (50% quantile), 0 <= q <= 1; the quantile(s) to
compute.
axis: 0 or 'index', 1 or 'columns' (default 0). 0 or 'index' computes the quantile of each column (down the rows); 1 or 'columns' computes the quantile of each row (across the columns).
numeric_only: if False, the quantile of datetime and timedelta data will be
computed as well.
• The median splits the data set in half, i.e. 50% on
the left / below the median and 50% on the right
/ above the median. That means if we divide a
distribution into 2 exact halves, we call the dividing
marker the median.
• The divided parts are called quantiles, which
means the median divides a distribution into 2
quantiles (50% each).
• If we divide a distribution into 4 exact quarters (25%
each), we call the dividing markers QUARTILES.
• That is, quartiles divide a distribution into 4
quantiles (25% each).
• sal_df.quantile(q=[0.25,0.5,0.75,1.0])
• sal_df.quantile(q=[0.25,0.5,0.75,1.0], axis=1)
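The two calls above can be run on the sales data as in the following sketch. It relies on quantile()'s default linear interpolation between neighbouring values. For 2016 the sorted values are 34500, 47000, 49000 and 56000, so the 0.5 quantile is halfway between 47000 and 49000.

```python
import pandas as pd

dfSales = {
    2016: {"Qtr1": 34500, "Qtr2": 56000, "Qtr3": 47000, "Qtr4": 49000},
    2017: {"Qtr1": 44900, "Qtr2": 46100, "Qtr3": 57000, "Qtr4": 59000},
    2018: {"Qtr1": 54500, "Qtr2": 51000, "Qtr3": 57000, "Qtr4": 58900},
    2019: {"Qtr1": 34500},
}
sal_df = pd.DataFrame(dfSales)

# Quartiles of each column (year): rows of the result are the q values.
by_year = sal_df.quantile(q=[0.25, 0.5, 0.75, 1.0])
print(by_year)

# Quartiles of each row (quarter).
by_quarter = sal_df.quantile(q=[0.25, 0.5, 0.75, 1.0], axis=1)
print(by_quarter)

# The 0.5 quantile is the median.
print(by_year.loc[0.5, 2016], sal_df[2016].median())
```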
The var() function
• The var() function computes the unbiased
variance over the requested axis.
• The syntax of the var() function is:
<dataFrame>.var(axis=None,skipna=None,numeric_only=None)
axis (index (0), columns (1)): default 0.
skipna (True or False): exclude NA/null values when
computing the result.
numeric_only (True or False): include only float, int,
boolean columns.
sal_df.std()
sal_df.var(axis=1)
The std() Function
• The std() function computes the standard deviation
over the requested axis.
• Syntax
<dataFrame>.std(axis=None,skipna=None,numeric_only=None)
axis (index (0), columns (1)): default 0.
skipna (True or False): exclude NA/null values when
computing the result.
numeric_only (True or False): include only float, int,
boolean columns.
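As a sketch of how var() and std() relate, the code below computes both on the sales data and checks that the standard deviation is the square root of the variance. "Unbiased" here means the sum of squared deviations is divided by n - 1, so a column with a single value (2019) has no defined variance and yields NaN.

```python
import numpy as np
import pandas as pd

dfSales = {
    2016: {"Qtr1": 34500, "Qtr2": 56000, "Qtr3": 47000, "Qtr4": 49000},
    2017: {"Qtr1": 44900, "Qtr2": 46100, "Qtr3": 57000, "Qtr4": 59000},
    2018: {"Qtr1": 54500, "Qtr2": 51000, "Qtr3": 57000, "Qtr4": 58900},
    2019: {"Qtr1": 34500},
}
sal_df = pd.DataFrame(dfSales)

variance = sal_df.var()        # per column, divisor n - 1
std_dev = sal_df.std()         # per column
row_variance = sal_df.var(axis=1)   # per quarter

# Manual variance for 2016, to show what var() computes.
values = sal_df[2016]
manual_var = ((values - values.mean()) ** 2).sum() / (len(values) - 1)

print(variance)
print(std_dev)
print(row_variance)
print(np.isclose(manual_var, variance[2016]))
print(np.isclose(np.sqrt(variance[2016]), std_dev[2016]))
```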
describe() Function
• The describe() function computes summary statistics
for the DataFrame columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve',
     'Smith','Jack', 'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,
     4.10,3.65]) }
#Create a DataFrame
df = pd.DataFrame(d)
print(df.describe())
Aggregation/Descriptive statistics - Dataframe

Data aggregation
Aggregation is the process of turning the values of a dataset (or a
subset of it) into one single value. In other words, data aggregation is a
multivalued function, which requires multiple values and returns a
single value as a result. There are a number of aggregations possible,
like count, sum, min, max, median, quartile etc. These (count, sum etc.)
are descriptive statistics and other related operations on a
DataFrame. Let us make this clear! If we have a DataFrame of five players with Name, Age and Score columns,
then a simple aggregation method is to calculate the sum of the
Score column, which is 87+67+89+55+47 = 345. A different aggregation method
would be to count the number of Names, which is 5.
Aggregation/Descriptive statistics - dataframe
# e.g. program for data aggregation/descriptive statistics
import pandas as pd
import numpy as np
# Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
# Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print(df)
print(df.count())
print("count age",df[['Age']].count())
print("sum of score",df[['Score']].sum())
print("minimum age",df[['Age']].min())
print("maximum score",df[['Score']].max())
print("mean age",df[['Age']].mean())
print("mode of age",df[['Age']].mode())
print("median of score",df[['Score']].median())
Syntax of DataFrame
<dataFrameObject> = pandas.DataFrame(<a 2D data structure>,
[columns=<column sequence>], [index=<index sequence>])

https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm
