Unit 1 Pandas - Series and DataFrame


Introduction to Python libraries – Pandas and Matplotlib

Python Library – Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is used to:
1. Develop publication-quality plots with just a few lines of code.
2. Use interactive figures that can zoom, pan, and update.
We can customize and take full control of line styles, font properties, and axes properties, as well as export and embed to a number of file formats and interactive environments.

Python Library – Pandas

Pandas is one of the best-known Python packages for data science. It offers powerful and flexible data structures that make data analysis and manipulation easy, and it makes importing and analyzing data much easier. Pandas builds on packages like NumPy and Matplotlib to give us a single, convenient place for data analysis and visualization work.

Data Structures in Pandas


Two important data structures of pandas are Series and DataFrame.

Series
A Series is a one-dimensional, array-like structure with homogeneous data. For example, a Series can hold a collection of integers.
Basic features of a Series: homogeneous data, size immutable, values mutable.

DataFrame
A DataFrame is a two-dimensional, array-like structure with heterogeneous data. For example, a DataFrame can hold columns of different data types.
Basic features of a DataFrame: heterogeneous data, size mutable, values mutable.
Pandas, Series creation using various Python concepts
A pandas Series is a one-dimensional, array-like structure with homogeneous data. We cannot change its size after creation, but we can change its values using the index. It also supports forward/backward index access and slicing, which work as they do for a list, an array, or a string. A Series stores a single sequence of values under one data type; if values of different kinds are mixed, pandas stores them all under one common dtype (object).
Syntax- pandas.Series(data, index, dtype, copy)
The data argument can be a list, a scalar value, a NumPy 1-d array, or a dictionary.

Using a list: [ ]
Here a single list is converted into a pandas Series.
import pandas as pd
data = [11, 12, 13, 14, 15]
s = pd.Series(data)
print(s)
output-
0 11
1 12
2 13
3 14
4 15
The same code also works without the data variable:
s = pd.Series([11, 12, 13, 14, 15])
print(s)

Using a scalar value (a single constant value is repeated)
Here we must pass index=[0, 1, 2, 3, 4] inside the Series function.
import pandas as pd
data = 15
s = pd.Series(data, index=[0, 1, 2, 3, 4])
print(s)
output-
0 15
1 15
2 15
3 15
4 15
The same approach also works directly with a string constant:
s = pd.Series('ram', index=[0, 1, 2, 3, 4])
print(s)
output-
0 ram
1 ram
2 ram
3 ram
4 ram

Using a NumPy 1-d array: np.array([ ])
Here the list is first converted into a 1-d array, and the 1-d array is then converted into a pandas Series.
import pandas as pd
import numpy as np
arr = np.array([11, 12, 13, 14, 15])
s = pd.Series(arr)
print(s)
output-
0 11
1 12
2 13
3 14
4 15
The same code also works without the arr variable:
s = pd.Series(np.array([11, 12, 13, 14, 15]))
print(s)

Using a dictionary: {key : value}
Here a dictionary is converted into a pandas Series; the keys become the index.
import pandas as pd
d = {0: 11, 1: 12, 2: 13, 3: 14, 4: 15}
s = pd.Series(d)
print(s)
output-
0 11
1 12
2 13
3 14
4 15
The same code also works without the d variable:
s = pd.Series({0: 11, 1: 12, 2: 13, 3: 14, 4: 15})
print(s)
Note- the variables are named data and d rather than list and dict, so that Python's built-in list and dict are not overwritten.
In all of the examples above we can replace the default index values with values of our own choice. The default index always starts at 0. Here the default index values 0, 1, 2 are changed to 101, 102, 103 by passing index=[101, 102, 103] inside the Series function.
# using a list
data = [11, 12, 13]
s = pd.Series(data, index=[101, 102, 103])
print(s)
# using a scalar value
s = pd.Series(15, index=[101, 102, 103])
print(s)
# using a NumPy 1-d array
arr = np.array([11, 12, 13])
s = pd.Series(arr, index=[101, 102, 103])
print(s)
# using a dictionary (the keys already match the requested index)
d = {101: 11, 102: 12, 103: 13}
s = pd.Series(d, index=[101, 102, 103])
print(s)
output- (list, array, and dictionary)
101 11
102 12
103 13
output- (scalar value)
101 15
102 15
103 15
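The four creation routes above can be compared side by side. The following is a minimal sketch, assuming the same values 11, 12, 13 and labels 101, 102, 103 used in the examples; the variable names (from_list, from_array, from_dict, from_scalar) are chosen only for illustration.

```python
import numpy as np
import pandas as pd

values = [11, 12, 13]
labels = [101, 102, 103]

# Build the same Series from a list, a NumPy array, and a dictionary.
from_list = pd.Series(values, index=labels)
from_array = pd.Series(np.array(values), index=labels)
from_dict = pd.Series(dict(zip(labels, values)))

# A scalar is repeated once for every index label.
from_scalar = pd.Series(15, index=labels)

# The list-based and dictionary-based Series hold the same labels and values.
print(from_list.equals(from_dict))
print(from_scalar)
```

In each case the index argument (or the dictionary keys) decides the labels, while the data argument decides the values.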
Pandas, Series - indexing
For indexing, use [index] to access a single value.

Index layout for the values 11, 12, 13:
Forward index (default)    0     1     2
Forward label index        'a'   'b'   'c'
Values                     11    12    13
Backward index             -3    -2    -1

Using the default index
# to display all values of a Series using the default integer index [0, 1, 2]
s = pd.Series([11, 12, 13])
print(s)
output-
0 11
1 12
2 13
# to display each value of a Series using the default forward index
print(s[0])
print(s[1])
print(s[2])
output-
11
12
13
# to display each value of a Series using the negative integer backward index
print(s.iloc[-3])
print(s.iloc[-2])
print(s.iloc[-1])
output-
11
12
13

Using a user-defined label index
# to display all values of a Series using the user-defined label index ['a', 'b', 'c']
s = pd.Series([11, 12, 13], index=['a', 'b', 'c'])
print(s)
output-
a 11
b 12
c 13
# to display each value of a Series using the user-defined label index, with loc or with []
print(s.loc['a']) or print(s['a']) or print(s.iloc[0])
print(s.loc['b']) or print(s['b']) or print(s.iloc[1])
print(s.loc['c']) or print(s['c']) or print(s.iloc[2])
output-
11
12
13
Note- on a label-indexed Series, s[0] relies on a positional fallback that recent pandas versions deprecate; use s.iloc[0] for access by position.
# to display each value of a Series using the negative integer backward index
print(s.iloc[-3])
print(s.iloc[-2])
print(s.iloc[-1])
output-
11
12
13
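The difference between label access and position access is easiest to see when the labels are integers that do not match the positions. This is a small sketch under that assumption; the labels 101, 102, 103 are reused from the creation examples.

```python
import pandas as pd

s = pd.Series([11, 12, 13], index=[101, 102, 103])

# loc looks up the label 101, not the position 101.
first_by_label = s.loc[101]

# iloc looks up the position, counting from 0 (or from -1 backwards).
first_by_position = s.iloc[0]
last_by_position = s.iloc[-1]

# The label 0 does not exist in this index, so s.loc[0] would raise a KeyError.
has_label_zero = 0 in s.index

print(first_by_label, first_by_position, last_by_position, has_label_zero)
```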
Pandas, Series arithmetic operation examples and basic information

Arithmetic operations
import pandas as pd
import numpy as np
s = pd.Series([2, 4])
s1 = pd.Series([1, 2])
print(s + s1) -> 3 6 (adds the values of s and s1 directly with an arithmetic expression; no loop is needed)
or print(s.add(s1)) -> 3 6 (adds the values of s and s1 using the pandas add function)
or print(np.add(s, s1)) -> 3 6 (adds the values of s and s1 using the NumPy add function)
print(s + 3) -> 5 7 (adds 3 to every element of s using a vector operation)
or print(np.add(s, 3)) -> 5 7 (adds 3 to every element of s using the NumPy add function with a scalar)

Basic information
import pandas as pd
s = pd.Series([2, 4, 6])
print(s.size) -> 3
print(s.shape) -> (3,)
print(s.ndim) -> 1
print(s.values) -> [2 4 6]
print(s.axes) -> [RangeIndex(start=0, stop=3, step=1)]
print(s.dtypes) -> int64
print(s.sum()) -> 12
print(s.max()) -> 6
print(s.min()) -> 2
print(s.count()) -> 3
print(s.var()) -> 4.0
print(s.std()) -> 2.0
print(s.describe()) -> displays all summary statistics
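The var() and std() results for the Series [2, 4, 6] can be checked by hand. The sketch below assumes pandas' default of dividing by n - 1 (the sample variance); the helper names are illustrative only.

```python
import math
import pandas as pd

s = pd.Series([2, 4, 6])

# Step 1: mean = (2 + 4 + 6) / 3
mean = s.sum() / s.count()

# Step 2: sum of squared deviations from the mean
squared_dev = ((s - mean) ** 2).sum()

# Step 3: divide by n - 1 for the sample variance, then take the square root
sample_var = squared_dev / (s.count() - 1)
sample_std = math.sqrt(sample_var)

print(sample_var, s.var())
print(sample_std, s.std())
```

Here the mean is 4, the squared deviations sum to 8, and 8 / 2 gives a variance of 4 and a standard deviation of 2, matching the values listed above.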
Pandas, Series – slicing
Slicing is used to access more than one selected value.
For slicing use- loc[start label index : stop label index] – for label index location
For slicing use- iloc[start integer index : stop integer index] – for integer index location

Index layout for the values 11, 12, 13:
Forward index (default)    0     1     2
Forward label index        'a'   'b'   'c'
Values                     11    12    13
Backward index             -3    -2    -1

Using the default index
import pandas as pd
s = pd.Series([11, 12, 13])
# to display selected values from starting index 0; here the stop index is one less (excluded)
print(s[0:1]) -> 11
print(s[0:2]) -> 11 12
print(s[0:3]) -> 11 12 13
# to display selected values using iloc from starting index 0; the stop index is excluded
print(s.iloc[0:1]) -> 11
print(s.iloc[0:2]) -> 11 12
print(s.iloc[0:3]) -> 11 12 13
# to display selected values using loc from starting index 0; the stop index is included
print(s.loc[0:1]) -> 11 12
print(s.loc[0:2]) -> 11 12 13
print(s.loc[0:3]) -> 11 12 13

Using a user-defined label index
import pandas as pd
s = pd.Series([11, 12, 13], index=['a', 'b', 'c'])
# to display selected values from starting label index 'a'; here the stop index is included
print(s['a' : 'a']) -> 11
print(s['a' : 'b']) -> 11 12
print(s['a' : 'c']) -> 11 12 13
# to display selected values using iloc from starting index 0; the stop index is excluded
print(s.iloc[0:1]) -> 11
print(s.iloc[0:2]) -> 11 12
print(s.iloc[0:3]) -> 11 12 13
# to display selected values using loc from starting index 'a'; the stop index is included
print(s.loc['a' : 'a']) -> 11
print(s.loc['a' : 'b']) -> 11 12
print(s.loc['a' : 'c']) -> 11 12 13
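The inclusive and exclusive stop rules can be placed next to each other in one short sketch, assuming the same label-indexed Series as above; the result variable names are illustrative.

```python
import pandas as pd

s = pd.Series([11, 12, 13], index=['a', 'b', 'c'])

# loc slices by label and INCLUDES the stop label 'b'.
by_label = s.loc['a':'b']

# iloc slices by position and EXCLUDES the stop position 2,
# so 0:2 is needed to reach the same two values.
by_position = s.iloc[0:2]

# With the stop position 1, only the first value is returned.
by_position_short = s.iloc[0:1]

print(by_label.tolist(), by_position.tolist(), by_position_short.tolist())
```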
Pandas head() and tail() function
head() -> to access a number of rows from the top.
tail() -> to access a number of rows from the bottom.
Note- the head and tail functions are available on both Series and DataFrame to access rows from the top and from the bottom.

Using the default index, s = pd.Series([11, 12, 13])
print(s.head(1))
0 11
print(s.tail(1))
2 13
print(s.head(2))
0 11
1 12
print(s.tail(2))
1 12
2 13
print(s.head(3))
0 11
1 12
2 13
print(s.tail(3))
0 11
1 12
2 13

Using a user-defined label index, s = pd.Series([11, 12, 13], index=['a', 'b', 'c'])
print(s.head(1))
a 11
print(s.tail(1))
c 13
print(s.head(2))
a 11
b 12
print(s.tail(2))
b 12
c 13
print(s.head(3))
a 11
b 12
c 13
print(s.tail(3))
a 11
b 12
c 13
Pandas, DataFrame creation using various Python concepts
A pandas DataFrame is a two-dimensional, array-like (table-like) structure with heterogeneous data. We can change its size after creation, and we can also change its values using the index. It also supports forward/backward index access and slicing, which work as they do for a list, a 1-d or 2-d array, or a string. We can store multiple rows and columns, holding values of different data types, in one pandas DataFrame variable.

Syntax- pandas.DataFrame(data, index, columns, dtype, copy)

All three examples below produce the same table:
  rollno name
0 101 ram
1 102 mohan
2 103 sohan

Using a list of sub-lists
[
[ , ],
[ , ]
]
import pandas as pd
data = [
    [101, 'ram'],
    [102, 'mohan'],
    [103, 'sohan']
]
df = pd.DataFrame(data, columns=['rollno', 'name'])
print(df)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan

Using a dictionary of lists
{
Key1 : [a set of values],
Key2 : [a set of values]
}
import pandas as pd
data = {
    'rollno': [101, 102, 103],
    'name': ['ram', 'mohan', 'sohan']
}
df = pd.DataFrame(data, columns=['rollno', 'name'])
print(df)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan

Using a dictionary of pandas Series
{
Key1 : pd.Series([a set of values]),
Key2 : pd.Series([a set of values])
}
import pandas as pd
data = {
    'rollno': pd.Series([101, 102, 103]),
    'name': pd.Series(['ram', 'mohan', 'sohan'])
}
df = pd.DataFrame(data, columns=['rollno', 'name'])
print(df)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
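As a quick check that the three constructions above describe the same table, the following sketch builds all three and compares their contents. The variable names (from_lists, from_dict_of_lists, from_dict_of_series) are made up for this illustration.

```python
import pandas as pd

columns = ['rollno', 'name']

# 1. A list of sub-lists: each inner list is one row.
from_lists = pd.DataFrame(
    [[101, 'ram'], [102, 'mohan'], [103, 'sohan']], columns=columns
)

# 2. A dictionary of lists: each key is one column.
from_dict_of_lists = pd.DataFrame(
    {'rollno': [101, 102, 103], 'name': ['ram', 'mohan', 'sohan']},
    columns=columns,
)

# 3. A dictionary of Series: each key is one column, aligned on the index.
from_dict_of_series = pd.DataFrame(
    {'rollno': pd.Series([101, 102, 103]),
     'name': pd.Series(['ram', 'mohan', 'sohan'])},
    columns=columns,
)

print(from_lists)
print(from_lists.equals(from_dict_of_lists))
print(from_lists.equals(from_dict_of_series))
```

The row-oriented form (list of sub-lists) needs the columns argument to name the columns, while the dictionary forms take the column names from the keys.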
Using a list of dictionaries
[
{ key1 : value, key2 : value},
{ key1 : value, key2 : value}
]
import pandas as pd
data = [
    {'rollno': 101, 'name': 'ram'},
    {'rollno': 102, 'name': 'mohan'},
    {'rollno': 103, 'name': 'sohan'}
]
df = pd.DataFrame(data, columns=['rollno', 'name'])
print(df)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan

Using a csv/txt file
Syntax- pd.read_csv(r"path")
CSV- comma separated values
Step1 – Open Notepad and save a file named data.csv on the D drive.
Step2 – Type the following data directly into data.csv, separating the values with commas:
rollno,name
101,ram
102,mohan
103,sohan
import pandas as pd
df = pd.read_csv(r"d:\data.csv")
print(df)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
Note- pd.read_csv() is the function used to read a csv file from another location. The r prefix (written without a space before the quote) makes the path a raw string, so the backslashes in the Windows path are not treated as escape characters.
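To try read_csv without creating a file on the D drive, the same CSV text can be read from memory. This is a sketch that assumes the file contents are the header row plus the three student rows; io.StringIO simply stands in for the file.

```python
import io
import pandas as pd

# The text that would otherwise be saved in data.csv.
csv_text = "rollno,name\n101,ram\n102,mohan\n103,sohan\n"

# read_csv accepts a file-like object as well as a path.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text becomes the column names, and each following line becomes one row with a default integer index.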
DataFrame Basic
import pandas as pd
x = pd.DataFrame({
    'month': ['jan', 'feb', 'mar'],
    'sales1': [5, 7, 6],
    'sales2': [3, 5, 8]})

  month sales1 sales2
0 jan 5 3
1 feb 7 5
2 mar 6 8
------------------------------------------
#describe():- shows summary statistics of the numeric columns of the DataFrame.
print(x.describe())
output:-
      sales1 sales2
count 3.0 3.000000
mean 6.0 5.333333
std 1.0 2.516611
min 5.0 3.000000
25% 5.5 4.000000
50% 6.0 5.000000
75% 6.5 6.500000
max 7.0 8.000000
------------------------------------------
#T:- transposes the DataFrame (rows are converted into columns and columns into rows).
print(x.T)
output:-
       0 1 2
month jan feb mar
sales1 5 7 6
sales2 3 5 8
------------------------------------------
#dtypes:- shows the data types of the DataFrame columns.
print(x.dtypes)
output:-
month object
sales1 int64
sales2 int64
------------------------------------------
#ndim:- shows the dimension (the total number of axes) of the DataFrame.
print(x.ndim)
output:-
2
note:- the total number of axes is also called the rank.
------------------------------------------
#values:- shows all values of the DataFrame.
print(x.values)
output:-
[['jan' 5 3]
 ['feb' 7 5]
 ['mar' 6 8]]
------------------------------------------
#size:- shows the total number of elements.
print(x.size)
output:-
9
------------------------------------------
#shape:- shows the total number of rows and columns.
print(x.shape)
output:-
(3, 3)
------------------------------------------
#axes:- shows the structure details of the DataFrame.
print(x.axes)
output:-
[RangeIndex(start=0, stop=3, step=1),
Index(['month', 'sales1', 'sales2'],
dtype='object')]
------------------------------------------
#count():- counts the values in each column.
print(x.count())
output:-
month 3
sales1 3
sales2 3
------------------------------------------
#sum():- sums all numeric columns and concatenates string columns.
print(x.sum())
output:-
month janfebmar
sales1 18
sales2 16
------------------------------------------
#max():- shows the maximum value of each numeric column and of each string column.
print(x.max())
output:-
month mar
sales1 7
sales2 8
------------------------------------------
#min():- shows the minimum value of each numeric column and of each string column.
print(x.min())
output:-
month feb
sales1 5
sales2 3
------------------------------------------
#var():- shows the variance of all numeric columns. In recent pandas versions, pass numeric_only=True so that the string column month is skipped.
print(x.var(numeric_only=True))
output:-
sales1 1.000000
sales2 6.333333
------------------------------------------
#std():- shows the standard deviation of all numeric columns (the square root of the variance gives the standard deviation).
print(x.std(numeric_only=True))
output:-
sales1 1.000000
sales2 2.516611
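Because the month column holds strings, variance and standard deviation only make sense for sales1 and sales2. One way to restrict the calculation to numeric columns is select_dtypes; the sketch below assumes the same x as above, and the name numeric is illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'month': ['jan', 'feb', 'mar'],
    'sales1': [5, 7, 6],
    'sales2': [3, 5, 8]})

# Keep only the numeric columns before computing statistics.
numeric = x.select_dtypes(include='number')

variance = numeric.var()   # sample variance (divides by n - 1)
std_dev = numeric.std()    # square root of the sample variance

print(numeric.columns.tolist())
print(variance)
print(std_dev)
```

For sales2 the mean is 16 / 3, the squared deviations sum to 38 / 3, and dividing by n - 1 = 2 gives 19 / 3, about 6.333333.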
DataFrame: accessing columns, rows, and each value
import pandas as pd

x = pd.DataFrame({'rollno': [101, 102, 103, 104, 105],
                  'name': ['arpit', 'himmat', 'rohit', 'vinod', 'nitin'],
                  'city': ['kota', 'jamnagar', 'kota', 'jamnagar', 'ajmer'],
                  'state': ['rajasthan', 'gujarat', 'rajasthan', 'gujarat', 'rajasthan'],
                  'marks': [50, 70, 60, 80, 90]})

# to display the entire DataFrame
print(x)
  rollno name city state marks
0 101 arpit kota rajasthan 50
1 102 himmat jamnagar gujarat 70
2 103 rohit kota rajasthan 60
3 104 vinod jamnagar gujarat 80
4 105 nitin ajmer rajasthan 90

#to display each column separately
print(x['rollno']) or print(x.rollno)
0 101
1 102
2 103
3 104
4 105
print(x['name'])
0 arpit
1 himmat
2 rohit
3 vinod
4 nitin
print(x['city'])
0 kota
1 jamnagar
2 kota
3 jamnagar
4 ajmer

#to display more than one column
print(x[['rollno', 'name']])
  rollno name
0 101 arpit
1 102 himmat
2 103 rohit
3 104 vinod
4 105 nitin
print(x[['marks', 'city']])
  marks city
0 50 kota
1 70 jamnagar
2 60 kota
3 80 jamnagar
4 90 ajmer

#to display each row separately using -> loc[row index number]
A single row is returned as a Series, so it is printed vertically, one column label per line.
print(x.loc[0])
rollno 101
name arpit
city kota
state rajasthan
marks 50
print(x.loc[3])
rollno 104
name vinod
city jamnagar
state gujarat
marks 80

#to display multiple rows with all columns:-
using -> loc[start row index number : stop row index number]
print(x.loc[0:1])
  rollno name city state marks
0 101 arpit kota rajasthan 50
1 102 himmat jamnagar gujarat 70
print(x.loc[3:4])
  rollno name city state marks
3 104 vinod jamnagar gujarat 80
4 105 nitin ajmer rajasthan 90

#to display multiple rows with selected columns:-
using -> loc[start row : stop row , start column name : stop column name]
print(x.loc[1:2, "city":"marks"])
  city state marks
1 jamnagar gujarat 70
2 kota rajasthan 60
print(x.loc[2:4, "name":"city"])
  name city
2 rohit kota
3 vinod jamnagar
4 nitin ajmer

#to display each value from the DataFrame:- using iloc[row index, column index]
print(x.iloc[0,0]) -> 101
print(x.iloc[1,0]) -> 102
print(x.iloc[2,0]) -> 103
print(x.iloc[0,1]) -> arpit
print(x.iloc[1,1]) -> himmat
print(x.iloc[2,1]) -> rohit
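The same block of rows and columns can be reached by label with loc or by position with iloc. The sketch below assumes the student DataFrame x from this section and shows how the stop values differ between the two; by_label and by_position are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({'rollno': [101, 102, 103, 104, 105],
                  'name': ['arpit', 'himmat', 'rohit', 'vinod', 'nitin'],
                  'city': ['kota', 'jamnagar', 'kota', 'jamnagar', 'ajmer'],
                  'state': ['rajasthan', 'gujarat', 'rajasthan', 'gujarat', 'rajasthan'],
                  'marks': [50, 70, 60, 80, 90]})

# loc: rows labelled 1 to 2 and columns 'city' to 'marks', stops included.
by_label = x.loc[1:2, 'city':'marks']

# iloc: row positions 1 to 2 and column positions 2 to 4, stops excluded,
# so the stop values are written as 3 and 5.
by_position = x.iloc[1:3, 2:5]

print(by_label)
print(by_position)

# A single value by label and by position.
print(x.loc[1, 'name'], x.iloc[1, 1])
```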
DataFrame- Boolean indexing
A pandas DataFrame also supports Boolean indexing, so we can search our data directly based on True or False index values. We can use loc[ ] for this purpose.
import pandas as pd
data1 = {
    'rollno': [101, 102, 103, 104],
    'name': ['ram', 'mohan', 'sohan', 'rohan']
}
student1 = pd.DataFrame(data1,
                        index=[True, False, True, False],
                        columns=['rollno', 'name'])
print(student1)
output-
      rollno name
True 101 ram
False 102 mohan
True 103 sohan
False 104 rohan
----------------------
print(student1.loc[True])
output-
     rollno name
True 101 ram
True 103 sohan
-----------------------
print(student1.loc[False])
output-
      rollno name
False 102 mohan
False 104 rohan

Importing/Exporting Data between CSV files and DataFrames
For export- DataFrame.to_csv()
For import- pd.read_csv(r"path")
#export this student1 DataFrame to the D drive, as a new file std1.csv.
import pandas as pd
data1 = {
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
student1.to_csv(
    r'd:\std1.csv',
    index=False,
    header=True
)
Note- In the std1.csv file the index values 0, 1 are not written (index=False), but the rollno and name headings are written (header=True).
-------------------------------------------------------------
#Now import this std1.csv file from the D drive into a DataFrame student1 again.
import pandas as pd
student1 = pd.read_csv(r"d:\std1.csv")
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
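Boolean indexing is also commonly written with a True/False mask computed from a condition, rather than with a Boolean index stored on the DataFrame. The following is a sketch of that variant; the condition rollno > 102 is an assumption chosen only to produce a small example.

```python
import pandas as pd

student1 = pd.DataFrame({
    'rollno': [101, 102, 103, 104],
    'name': ['ram', 'mohan', 'sohan', 'rohan']
})

# Comparing a column with a value gives one True/False per row.
mask = student1['rollno'] > 102

# Passing the mask to loc keeps only the rows where it is True.
selected = student1.loc[mask]

print(mask.tolist())
print(selected)
```

The rows keep their original index labels (2 and 3 here), which shows that the mask filters rows without renumbering them.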
Adding, Deleting, Renaming

Adding
a new row using pd.concat() (the older append() method was removed in pandas 2.0) and a new column in an existing DataFrame
import pandas as pd
data1 = {
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
#to add a new row to an existing DataFrame
new_row = pd.DataFrame([{'rollno': 103, 'name': 'sohan'}])
student1 = pd.concat([student1, new_row], ignore_index=True)
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
#to add a new column to an existing DataFrame
student1['marks'] = [50, 60, 70]
print(student1)
output-
  rollno name marks
0 101 ram 50
1 102 mohan 60
2 103 sohan 70

Deleting
an existing row using the drop(index position) method and an existing column using the pop(column name) method
import pandas as pd
data1 = {
    'rollno': [101, 102, 103],
    'name': ['ram', 'mohan', 'sohan']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
#to delete an existing row in a DataFrame
student1 = student1.drop(0)
print(student1)
output-
  rollno name
1 102 mohan
2 103 sohan
#to delete an existing column in a DataFrame
student1.pop('rollno')
or
del student1['rollno']
print(student1)
output-
  name
1 mohan
2 sohan

Renaming
an existing index using the rename() method and a column using the rename() method
import pandas as pd
data1 = {
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
#to rename an existing index
student1 = student1.rename(index={0: 'a', 1: 'b'})
print(student1)
output-
  rollno name
a 101 ram
b 102 mohan
#to rename an existing column
student1 = student1.rename(columns={'rollno': 'sid', 'name': 'fullname'})
print(student1)
output-
  sid fullname
a 101 ram
b 102 mohan
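DataFrame.append() was removed in pandas 2.0, so on current versions a row is usually added by concatenating a one-row DataFrame. The sketch below shows that route; new_row is a name introduced only for this example.

```python
import pandas as pd

student1 = pd.DataFrame({
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
})

# Wrap the new record in a one-row DataFrame with the same column names.
new_row = pd.DataFrame([{'rollno': 103, 'name': 'sohan'}])

# ignore_index=True renumbers the rows 0, 1, 2 instead of repeating 0.
student1 = pd.concat([student1, new_row], ignore_index=True)

# A new column is added by assigning one value per row.
student1['marks'] = [50, 60, 70]

print(student1)
```

Without ignore_index=True the new row would keep its own index label 0, giving two rows labelled 0.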
Iteration
Value by value- iloc[row, col], loc[row, col]
Row by row- iterrows(), itertuples()
Column by column- items() (called iteritems() in older pandas; iteritems() was removed in pandas 2.0)
import pandas as pd
data1 = {
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
#to iterate/access row by row
for index, row in student1.iterrows():
    print(row["rollno"], row["name"])
-------------------or----------------------------------
for i in range(len(student1)):
    print(student1.iloc[i, 0], student1.iloc[i, 1])
-------------------or----------------------------------
for i in range(len(student1)):
    print(student1.loc[i, "rollno"], student1.loc[i, "name"])
output- (each of the three loops above)
101 ram
102 mohan
-------------------or----------------------------------
for row in student1.itertuples():
    print(row)
(prints each row as a named tuple that includes the index)
#to iterate/access column by column
for key, value in student1.items():
    print(key, value)
(prints each column name followed by that column's values)

Show rows from top and bottom
head(no. of rows from top)
tail(no. of rows from bottom)
by default- head() shows the top 5 rows and tail() shows the bottom 5 rows
import pandas as pd
data1 = {
    'rollno': [101, 102, 103, 104, 105, 106],
    'name': ['ram', 'mohan', 'sohan', 'arun', 'rohan', 'shyam']
}
student1 = pd.DataFrame(data1, columns=['rollno', 'name'])
print(student1)
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 arun
4 105 rohan
5 106 shyam
# show top 5 rows
print(student1.head())
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
3 104 arun
4 105 rohan
# show bottom 5 rows
print(student1.tail())
output-
  rollno name
1 102 mohan
2 103 sohan
3 104 arun
4 105 rohan
5 106 shyam
# show top 2 rows
print(student1.head(2))
output-
  rollno name
0 101 ram
1 102 mohan
# show bottom 2 rows
print(student1.tail(2))
output-
  rollno name
4 105 rohan
5 106 shyam
# show top 3 rows
print(student1.head(3))
output-
  rollno name
0 101 ram
1 102 mohan
2 103 sohan
# show bottom 3 rows
print(student1.tail(3))
output-
  rollno name
3 104 arun
4 105 rohan
5 106 shyam

Arithmetic operation of DataFrames
add(a,b), subtract(a,b), multiply(a,b), divide(a,b) and mod(a,b)
import pandas as pd
import numpy as np
a = pd.DataFrame([
    [4., 6.],
    [10., 12.]
])
b = pd.DataFrame([
    [3., 5.],
    [6., 7.]
])
print(a + b)
or
print(np.add(a, b))
or
print(a.add(b))
Output-
  0 1
0 7.0 11.0
1 16.0 19.0
print(a + 3)
or
print(np.add(a, 3))
Output-
  0 1
0 7.0 9.0
1 13.0 15.0
Vertical sum
print(a.sum(axis=0))
Output-
0 14.0
1 18.0
Horizontal sum
print(a.sum(axis=1))
Output-
0 10.0
1 22.0
Note- we can also perform sum, max, min, count, mean, std and var of a single row or column using axis.
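Row-by-row and column-by-column iteration can be compared in one short sketch. It assumes a recent pandas version where the column iterator is items(); the names rows and columns are illustrative.

```python
import pandas as pd

student1 = pd.DataFrame({
    'rollno': [101, 102],
    'name': ['ram', 'mohan']
})

# Row by row: itertuples yields one named tuple per row.
# index=False leaves the index out of each tuple.
rows = [(row.rollno, row.name) for row in student1.itertuples(index=False)]

# Column by column: items() yields (column name, column Series) pairs.
columns = {label: column.tolist() for label, column in student1.items()}

print(rows)
print(columns)
```

itertuples is generally the lighter of the two row iterators, because iterrows builds a new Series for every row.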
1-D Array, 2-D Array, Series & DataFrame:- Each Element Value Accessing Using Loop

1-D NumPy array: each element accessed using a loop
import numpy as np
n1 = np.array([10, 20, 13])
for i in range(0, len(n1)):
    print(n1[i])
output:-
10
20
13
note:- len(n1) -> returns the total number of elements.

Series numeric array: each element accessed using a loop
import pandas as pd
s1 = pd.Series([10, 20, 13])
for i in range(0, len(s1)):
    print(s1[i])
output:-
10
20
13
note:- len(s1) -> returns the total number of elements.

2-D NumPy array: each element accessed using a loop
  0 1 2
0 5 3 4
1 7 5 8
2 6 8 3
3 9 10 6
import numpy as np
arr = np.array([[5, 3, 4],
                [7, 5, 8],
                [6, 8, 3],
                [9, 10, 6]])
for row in range(0, len(arr)):
    print()
    for col in range(0, len(arr[row])):
        print(arr[row, col], end=' ')
output:-
5 3 4
7 5 8
6 8 3
9 10 6
note:-
len(arr) -> returns the total number of rows.
len(arr[row]) -> returns the total number of columns in each row.

DataFrame numeric array: each element accessed using a loop
  0 1 2
0 5 3 4
1 7 5 8
2 6 8 3
3 9 10 6
import pandas as pd
df = pd.DataFrame([[5, 3, 4],
                   [7, 5, 8],
                   [6, 8, 3],
                   [9, 10, 6]])
rc = df.shape
totalrow = rc[0]
totalcol = rc[1]
for row in range(0, totalrow):
    print()
    for col in range(0, totalcol):
        print(df.iloc[row, col], end=' ')
output:-
5 3 4
7 5 8
6 8 3
9 10 6
note:-
df.shape -> returns the total number of rows and the total number of columns.
rc[0] -> holds the total number of rows.
rc[1] -> holds the total number of columns.

DataFrame table: each element accessed using a loop
  month sales1 sales2
0 jan 5 3
1 feb 7 5
2 mar 6 8
3 apr 9 10
import pandas as pd
df = pd.DataFrame({
    'month': ['jan', 'feb', 'mar', 'apr'],
    'sales1': [5, 7, 6, 9],
    'sales2': [3, 5, 8, 10]})
rc = df.shape
totalrow = rc[0]
totalcol = rc[1]
for row in range(0, totalrow):
    print()
    for col in range(0, totalcol):
        print(df.iloc[row, col], end=' ')
output:-
jan 5 3
feb 7 5
mar 6 8
apr 9 10
note:-
df.shape -> returns the total number of rows and the total number of columns.
rc[0] -> holds the total number of rows.
rc[1] -> holds the total number of columns.
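The nested loop over df.shape visits every cell once, which can be confirmed by adding the cells up and comparing the result with a whole-table sum. This is a small sketch using the numeric DataFrame from above; total_by_loop is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[5, 3, 4],
                   [7, 5, 8],
                   [6, 8, 3],
                   [9, 10, 6]])

total_rows, total_cols = df.shape

# Visit each cell by position and accumulate its value.
total_by_loop = 0
for row in range(total_rows):
    for col in range(total_cols):
        total_by_loop += df.iloc[row, col]

# The same total computed on the whole table at once.
total_direct = df.to_numpy().sum()

print(total_rows, total_cols)
print(total_by_loop, total_direct)
```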
Pandas DataFrame loc and iloc comparison
loc – includes the stop index. iloc – does not include the stop index.
DataFrame.loc[start : stop : step , start : stop : step] -> row slice , column slice
DataFrame.iloc[start : stop : step , start : stop : step] -> row slice , column slice

Pandas DataFrame, label index and label column access using loc
     c0 c1 c2 c3 Sum
r0   3  2  3  4  12
r1   7  4  6  8  25
r2   5  6  9  2  22
r3   8  8  4  6  26
Sum  23 20 22 20

Pandas DataFrame, default index and column access using iloc
     0  1  2  3  Sum
0    3  2  3  4  12
1    7  4  6  8  25
2    5  6  9  2  22
3    8  8  4  6  26
Sum  23 20 22 20

Creation of a label pandas DataFrame and insert data
import pandas as pd
data = {'c0': [3, 7, 5, 8],
        'c1': [2, 4, 6, 8],
        'c2': [3, 6, 9, 4],
        'c3': [4, 8, 2, 6]}
df = pd.DataFrame(data, index=['r0', 'r1', 'r2', 'r3'], columns=['c0', 'c1', 'c2', 'c3'])

Creation of a default pandas DataFrame and insert data
import pandas as pd
data = {0: [3, 7, 5, 8],
        1: [2, 4, 6, 8],
        2: [3, 6, 9, 4],
        3: [4, 8, 2, 6]}
df = pd.DataFrame(data, index=[0, 1, 2, 3], columns=[0, 1, 2, 3])


To select a row
Using loc                Using iloc               Selected values
print(df.loc['r0'])      print(df.iloc[0])        3 2 3 4
print(df.loc['r1'])      print(df.iloc[1])        7 4 6 8
print(df.loc['r2'])      print(df.iloc[2])        5 6 9 2
print(df.loc['r3'])      print(df.iloc[3])        8 8 4 6

To select a column
Using loc                Using iloc               Selected values
print(df.loc[:, 'c0'])   print(df.iloc[:, 0])     3 7 5 8
print(df.loc[:, 'c1'])   print(df.iloc[:, 1])     2 4 6 8
print(df.loc[:, 'c2'])   print(df.iloc[:, 2])     3 6 9 4
print(df.loc[:, 'c3'])   print(df.iloc[:, 3])     4 8 2 6
To sum a selected row
Using loc                      Using iloc                     Result
print(df.loc['r0'].sum())      print(df.iloc[0].sum())        12
print(df.loc['r1'].sum())      print(df.iloc[1].sum())        25
print(df.loc['r2'].sum())      print(df.iloc[2].sum())        22
print(df.loc['r3'].sum())      print(df.iloc[3].sum())        26

To sum a selected column
Using loc                      Using iloc                     Result
print(df.loc[:, 'c0'].sum())   print(df.iloc[:, 0].sum())     23
print(df.loc[:, 'c1'].sum())   print(df.iloc[:, 1].sum())     20
print(df.loc[:, 'c2'].sum())   print(df.iloc[:, 2].sum())     22
print(df.loc[:, 'c3'].sum())   print(df.iloc[:, 3].sum())     20

To sum - row wise directly (same for both DataFrames)
print(df.sum(axis=1))
12
25
22
26

To sum - column wise directly (same for both DataFrames)
print(df.sum(axis=0))
23
20
22
20
To show various statistics of the first row
Using loc                              Using iloc                           Result
print(df.loc['r0'])                    print(df.iloc[0])                    3 2 3 4
print(df.loc['r0'].sum())              print(df.iloc[0].sum())              12
print(df.loc['r0'].max())              print(df.iloc[0].max())              4
print(df.loc['r0'].min())              print(df.iloc[0].min())              2
print(df.loc['r0'].count())            print(df.iloc[0].count())            4
print(df.loc['r0'].mean())             print(df.iloc[0].mean())             3.0
print(df.loc['r0'].median())           print(df.iloc[0].median())           3.0
print(df.loc['r0'].mode())             print(df.iloc[0].mode())             3
print(df.loc['r0'].var())              print(df.iloc[0].var())              0.67
print(df.loc['r0'].std())              print(df.iloc[0].std())              0.82
print(df.loc['r0'].quantile(.25))      print(df.iloc[0].quantile(.25))      2.75
print(df.loc['r0'].quantile(.50))      print(df.iloc[0].quantile(.50))      3.0
print(df.loc['r0'].quantile(.75))      print(df.iloc[0].quantile(.75))      3.25
print(df.loc['r0'].quantile(1))        print(df.iloc[0].quantile(1))        4.0

To show various statistics of the first column
Using loc                              Using iloc                           Result
print(df.loc[:,'c0'])                  print(df.iloc[:,0])                  3 7 5 8
print(df.loc[:,'c0'].sum())            print(df.iloc[:,0].sum())            23
print(df.loc[:,'c0'].max())            print(df.iloc[:,0].max())            8
print(df.loc[:,'c0'].min())            print(df.iloc[:,0].min())            3
print(df.loc[:,'c0'].count())          print(df.iloc[:,0].count())          4
print(df.loc[:,'c0'].mean())           print(df.iloc[:,0].mean())           5.75
print(df.loc[:,'c0'].median())         print(df.iloc[:,0].median())         6.0
print(df.loc[:,'c0'].mode())           print(df.iloc[:,0].mode())           3, 5, 7, 8
print(df.loc[:,'c0'].var())            print(df.iloc[:,0].var())            4.92
print(df.loc[:,'c0'].std())            print(df.iloc[:,0].std())            2.22
print(df.loc[:,'c0'].quantile(.25))    print(df.iloc[:,0].quantile(.25))    4.5
print(df.loc[:,'c0'].quantile(.50))    print(df.iloc[:,0].quantile(.50))    6.0
print(df.loc[:,'c0'].quantile(.75))    print(df.iloc[:,0].quantile(.75))    7.25
print(df.loc[:,'c0'].quantile(1))      print(df.iloc[:,0].quantile(1))      8.0
Note- var and std are rounded to two decimal places. Every value in column c0 occurs once, so mode() returns all four values in sorted order.
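The quantile values for the first row (3, 2, 3, 4) can be reproduced by hand. The sketch below assumes pandas' default linear interpolation between the two nearest sorted values; linear_quantile is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

row = pd.Series([3, 2, 3, 4])

def linear_quantile(values, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)     # fractional position in the sorted list
    lower = int(position)                 # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower           # how far between lower and upper
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

for q in (0.25, 0.50, 0.75, 1):
    print(q, linear_quantile(row, q), row.quantile(q))
```

For q = 0.25 the sorted values are 2, 3, 3, 4 and the position is 0.25 * 3 = 0.75, so the result lies three quarters of the way from 2 to 3, which is 2.75.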

Slice for rows
loc includes the stop index. iloc does not include the stop index.
Using loc                          Using iloc                      Selected rows
print(df.loc['r0' : 'r0', :])      print(df.iloc[0 : 1, :])        r0
print(df.loc['r0' : 'r1', :])      print(df.iloc[0 : 2, :])        r0, r1
print(df.loc['r0' : 'r2', :])      print(df.iloc[0 : 3, :])        r0, r1, r2
print(df.loc['r0' : 'r3', :])      print(df.iloc[0 : 4, :])        r0, r1, r2, r3

Slice for columns
loc includes the stop index. iloc does not include the stop index.
Using loc                          Using iloc                      Selected columns
print(df.loc[:, 'c0' : 'c0'])      print(df.iloc[:, 0 : 1])        c0
print(df.loc[:, 'c0' : 'c1'])      print(df.iloc[:, 0 : 2])        c0, c1
print(df.loc[:, 'c0' : 'c2'])      print(df.iloc[:, 0 : 3])        c0, c1, c2
print(df.loc[:, 'c0' : 'c3'])      print(df.iloc[:, 0 : 4])        c0, c1, c2, c3
In the default-index DataFrame the selected rows and columns carry the labels 0, 1, 2, 3 instead of r0..r3 and c0..c3.
Slicing using step for pandas DataFrame
loc includes the stop label. iloc does not include the stop position. The third value in the slice is the step.

Slice for rows using loc
print( df.loc['r0' : 'r1' : 2 , : ] )     -> row r0
print( df.loc['r0' : 'r2' : 2 , : ] )     -> rows r0, r2
print( df.loc['r0' : 'r3' : 2 , : ] )     -> rows r0, r2
print( df.loc['r0' : 'r4' : 2 , : ] )     -> rows r0, r2

Slice for columns using loc
print( df.loc[ : , 'c0' : 'c1' : 2 ] )    -> column c0
print( df.loc[ : , 'c0' : 'c2' : 2 ] )    -> columns c0, c2
print( df.loc[ : , 'c0' : 'c3' : 2 ] )    -> columns c0, c2
print( df.loc[ : , 'c0' : 'c4' : 2 ] )    -> columns c0, c2

Note: 'r4' and 'c4' are not present in the labels. The slice still works here because the labels are in sorted order.

Slice for rows using iloc
print( df.iloc[ 0 : 1 : 2 , : ] )         -> row 0
print( df.iloc[ 0 : 2 : 2 , : ] )         -> row 0
print( df.iloc[ 0 : 3 : 2 , : ] )         -> rows 0, 2
print( df.iloc[ 0 : 4 : 2 , : ] )         -> rows 0, 2

Slice for columns using iloc
print( df.iloc[ : , 0 : 1 : 2 ] )         -> column 0
print( df.iloc[ : , 0 : 2 : 2 ] )         -> column 0
print( df.iloc[ : , 0 : 3 : 2 ] )         -> columns 0, 2
print( df.iloc[ : , 0 : 4 : 2 ] )         -> columns 0, 2
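A step of 2 keeps every second label or position, starting from the start value. The following sketch, using the sample values and illustrative variable names, shows a label-based and a position-based step slice.

```python
import pandas as pd

# Sample frame assumed from the examples above.
df = pd.DataFrame(
    [[3, 2, 3, 4], [7, 4, 6, 8], [5, 6, 9, 2], [8, 8, 4, 6]],
    index=['r0', 'r1', 'r2', 'r3'],
    columns=['c0', 'c1', 'c2', 'c3'],
)

every_other_row = df.loc['r0':'r3':2, :]   # labels r0 and r2
every_other_col = df.iloc[:, 0:4:2]        # positions 0 and 2

print(every_other_row)
print(every_other_col)
```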

Selected rows with selected columns

Using loc (labels must be written in quotes)
print( df.loc['r0' : 'r4' : 2 , 'c1' : 'c3' ] )    -> rows r0, r2 with columns c1, c2, c3
print( df.loc['r1' : 'r3' , 'c0' : 'c4' : 2 ] )    -> rows r1, r2, r3 with columns c0, c2

Using iloc
print( df.iloc[ 0 : 4 : 2 , 1 : 4 ] )              -> rows 0, 2 with columns 1, 2, 3
print( df.iloc[ 1 : 4 , 0 : 4 : 2 ] )              -> rows 1, 2, 3 with columns 0, 2
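Row and column slices can be combined in one statement, with the row slice written first. The sketch below (sample values assumed from these notes, variable names illustrative) selects the same block once by label and once by position.

```python
import pandas as pd

# Sample frame assumed from the examples above.
df = pd.DataFrame(
    [[3, 2, 3, 4], [7, 4, 6, 8], [5, 6, 9, 2], [8, 8, 4, 6]],
    index=['r0', 'r1', 'r2', 'r3'],
    columns=['c0', 'c1', 'c2', 'c3'],
)

# Every second row, columns c1 to c3 (stop label included).
sub_loc = df.loc['r0':'r3':2, 'c1':'c3']

# The same block by position (stop position not included).
sub_iloc = df.iloc[0:4:2, 1:4]

print(sub_loc)
print(sub_loc.equals(sub_iloc))
```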

Numpy 2-d and pandas DataFrame axis concept

axis = 0
Always performs vertically, down the rows, so the result is produced column by column.
In a DataFrame it means: for a selected column and all its rows.
In numpy, vstack and vsplit work along axis 0 (one array below the other).

axis = 1
Always performs horizontally, across the columns, so the result is produced row by row.
In a DataFrame it means: for a selected row and all its columns.
In numpy, hstack and hsplit work along axis 1 (arrays side by side).

   0  1  2  3
0  3  2  3  4
1  7  4  6  8
2  5  6  9  2
3  8  8  4  6
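One way to check the axis rule is to add up the sample values along each axis and to stack the array with itself. This is a minimal sketch with illustrative variable names; sum(), np.vstack() and np.hstack() are used only as examples of axis-aware operations.

```python
import numpy as np
import pandas as pd

arr = np.array([[3, 2, 3, 4], [7, 4, 6, 8], [5, 6, 9, 2], [8, 8, 4, 6]])
df = pd.DataFrame(arr)

col_totals = df.sum(axis=0)   # moves down the rows: one total per column
row_totals = df.sum(axis=1)   # moves across the columns: one total per row
print(col_totals.tolist())    # [23, 20, 22, 20]
print(row_totals.tolist())    # [12, 25, 22, 26]

stacked_down = np.vstack([arr, arr])   # joins along axis 0
stacked_side = np.hstack([arr, arr])   # joins along axis 1
print(stacked_down.shape, stacked_side.shape)
```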
Python programs with solutions
1. Create a pandas series from a dictionary of values and an ndarray.
# Create a pandas series from a dictionary of values.
import pandas as pd

# the name data is used instead of dict so that the built-in dict type is not overwritten
data = {0 : 11, 1 : 12, 2 : 13, 3 : 14, 4 : 15}

s = pd.Series(data)
print(s)
output-
0 11
1 12
2 13
3 14
4 15

# Create a pandas series from an ndarray.


import pandas as pd
import numpy as np

arr=np.array( [11,12,13,14,15] )
s =pd.Series(arr)
print(s)
output-
0 11
1 12
2 13
3 14
4 15

2. Given a Series, print all the elements that are above the 75th percentile.
import pandas as pd

x=pd.Series([10,20,30,40,50,50,60,70,70,70,80,90,100])
print(x.loc[x>= x.quantile(.75)])
output-
7 70
8 70
9 70
10 80
11 90
12 100
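The program above uses >=, so values equal to the 75th percentile (70 here) are also printed. If only values strictly above the percentile are wanted, > can be used instead. The sketch below compares both; the variable names are only for illustration.

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40, 50, 50, 60, 70, 70, 70, 80, 90, 100])

cutoff = x.quantile(.75)           # 75th percentile of the series
at_or_above = x[x >= cutoff]       # keeps the values equal to the cutoff
strictly_above = x[x > cutoff]     # drops the values equal to the cutoff

print(cutoff)
print(at_or_above.tolist())
print(strictly_above.tolist())
```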

3. To count total even and total odd values in a given dataframe.


import pandas as pd
import numpy as np

df=pd.DataFrame([
[3,2,3,4],
[7,4,6,8],
[5,6,9,2],
[8,8,4,6]
])
evencount=np.sum(np.array(df)%2== 0)
print( evencount)
output-
11
------------
oddcount=np.sum(np.array(df)%2== 1)
print( oddcount )
output-
5
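The same counts can also be obtained without converting the dataframe to a numpy array. This is an alternative sketch, not a replacement for the solution above: the comparison gives a dataframe of True/False values, the first sum() counts them per column and the second sum() adds the column counts.

```python
import pandas as pd

df = pd.DataFrame([
    [3, 2, 3, 4],
    [7, 4, 6, 8],
    [5, 6, 9, 2],
    [8, 8, 4, 6],
])

evencount = int((df % 2 == 0).sum().sum())   # True counts as 1
oddcount = int((df % 2 == 1).sum().sum())

print(evencount)
print(oddcount)
```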
Important - Board Sample question and answers

What is series? Explain with the help of an example.
Answer: Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.
Example
import pandas as pd
# simple array
data = pd.Series([1,2,3,4,5])
print(data)
Hitesh wants to display the last four rows of the dataframe df and has written the following code :
df.tail()
But last 5 rows are being displayed. Identify the error and rewrite the correct code so that last 4 rows get displayed.
Answer: tail() shows 5 rows when no number is given, so the number of rows must be passed: df.tail(4)
Write the command using insert() function to add a new column in the last place (3rd place) named "Salary" from the list Sal=[10000,15000,20000] in an existing dataframe named EMP already having 2 columns.
Answer: EMP.insert(loc=2, column="Salary", value=Sal)
Note: loc is counted from 0, so the 3rd place is loc=2. With only 2 existing columns, loc=3 raises an error.
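A quick way to check the insert position is to try it on a small dataframe. In this sketch the two existing columns of EMP ('Name' and 'Dept') and their values are hypothetical, since the question does not give them. Using len(EMP.columns) as loc always places the new column last.

```python
import pandas as pd

# Hypothetical EMP with 2 columns; the question does not list them.
EMP = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Meena'],
                    'Dept': ['HR', 'IT', 'Sales']})
Sal = [10000, 15000, 20000]

# loc counts from 0, so with 2 existing columns the last place is loc=2.
EMP.insert(loc=len(EMP.columns), column='Salary', value=Sal)

print(EMP)
```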
Consider the following python code and write the output for statement S1
import pandas as pd
K=pd.Series([2,4,6,8,10,12,14])
K.quantile([0.50, 0.75])     ---------------------- S1
Answer:
0.50     8.0
0.75    11.0
CSV stands for _____________
Answer: Comma separated values
Write a python code to create a dataframe with appropriate headings from the list given below :
['S101', 'Amy', 70], ['S102', 'Bandhi', 69], ['S104', 'Cathy', 75], ['S105', 'Gundaho', 82]
Answer:
import pandas as pd
# initialize list of lists
data = [['S101', 'Amy', 70],
        ['S102', 'Bandhi', 69],
        ['S104', 'Cathy', 75],
        ['S105', 'Gundaho', 82]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['ID', 'Name', 'Marks'])
# print dataframe.
print(df )
Write a small python code to create a dataframe with headings (a and b) from the list given below :
[ [1,2], [3,4], [5,6], [7,8] ]
Answer:
import pandas as pd
# DataFrame.append() was removed in pandas 2.0, so all four rows are passed at once
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns = ['a','b'])
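DataFrame.append() was removed in pandas 2.0, so on current versions two dataframes are joined with pd.concat(). The sketch below builds the four rows in two parts and checks that the result matches a dataframe created from the full list in one step; the names combined and direct are only for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True renumbers the rows 0, 1, 2, 3 in the result.
combined = pd.concat([df, df2], ignore_index=True)

# The same dataframe built directly from the whole list.
direct = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=['a', 'b'])

print(combined)
print(combined.equals(direct))
```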
Find the output of the following code:
import pandas as pd
data = [{'a': 10, 'b': 20}, {'a': 6, 'b': 32, 'c': 22}]
#with two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print(df1)
print(df2)
Answer:
         a   b
first   10  20
second   6  32
         a  b1
first   10 NaN
second   6 NaN
Write the code in pandas to create the following dataframes :

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'mark1':[30,40,15,40],
'mark2':[20,45,30,70]});
df2 = pd.DataFrame({'mark1':[10,20,20,50],
'mark2':[15,25,30,30]});
print(df1)
print(df2)
Write the commands to do the following operations on the dataframes given above :
(i) To rename column mark1 as marks1 in both the dataframes df1 and df2.
Note: inplace=True changes the original dataframe directly. Without it, rename() returns a renamed copy and leaves the original unchanged.
df1.rename(columns={'mark1':'marks1'}, inplace=True)
print(df1)
df2.rename(columns={'mark1':'marks1'}, inplace=True)
print(df2)
Given a dataframe namely data as shown below (fruit names are row labels).
        color  count  price
apple   red        3    120
apple   green      9    110
pear    red       25    125
pear    green     26    150
lime    green     99     70
Write code statement to –
a. Find all rows with the label 'apple'. Extract all columns.
data.loc[ 'apple', : ]
b. List only the columns count and price using loc.
data.loc[ : , ['count', 'price'] ]
c. List only rows with labels 'apple' and 'pear' using loc.
data.loc[ ['apple', 'pear'] ]
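The three statements can be tried on a rebuilt copy of the table. In this sketch the dataframe is typed in by hand from the figure, and the last row label is written as 'lime', which is an assumption about the intended fruit name. Because 'apple' and 'pear' each appear twice as labels, loc returns every matching row.

```python
import pandas as pd

# Rebuilt from the table in the question; 'lime' is an assumed label.
data = pd.DataFrame(
    {'color': ['red', 'green', 'red', 'green', 'green'],
     'count': [3, 9, 25, 26, 99],
     'price': [120, 110, 125, 150, 70]},
    index=['apple', 'apple', 'pear', 'pear', 'lime'],
)

apples = data.loc['apple', :]                  # a. both rows labelled apple
count_price = data.loc[:, ['count', 'price']]  # b. two columns, all rows
apple_pear = data.loc[['apple', 'pear']]       # c. four rows

print(apples)
print(count_price)
print(apple_pear)
```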
Syntax and examples of various Pandas data structure operation
Rename columns y=x.rename(columns={'month' : 'monthly', 'sales1':'total sales1'})

Boolean indexing student1.loc[True]     (rows whose index label is True; the row index must hold True/False labels)
Boolean indexing student1.loc[False]    (rows whose index label is False)
Exporting DataFrame.to_csv( )
Importing pd.read_csv( r"path" )

To add a new row student1 = pd.concat([student1, pd.DataFrame([{ 'rollno' : 103, 'name': 'sohan' }])], ignore_index=True)
(DataFrame.append() was removed in pandas 2.0, so pd.concat() is used)
To delete a row student1 = student1.drop(0)
To delete a column student1.pop('rollno')
To rename columns student1 = student1.rename(columns={'rollno' : 'sid', 'name':'fullname'})

To show top rows print(student1.head(2))


To show bottom rows print(student1.tail(2))
To show, one by one, all the values of a dataframe
rc = df.shape      # assumed: rc holds (number of rows, number of columns)
totalrow = rc[0]
totalcol = rc[1]
for row in range(0, totalrow):
    print()
    for col in range(0, totalcol):
        print(df.iloc[row , col ], end=' ')
To show, one by one, all the values of a 2-d numpy array
for row in range(0, len(arr)):
    print()
    for col in range(0, len(arr[row])):
        print(arr[row , col ], end=' ')
Arithmetic operation style is common in numpy 2d array and pandas dataframe. Here a and b are two 2-d arrays or two dataframes of the same size.
using function         expression    array with a number
np.add(a,b)            a+b           a+2
np.subtract(a,b)       a-b           a-2
np.multiply(a,b)       a*b           a*2
np.divide(a,b)         a/b           a/2
np.remainder(a,b)      a%b           a%2
np.mod(a,b)            a%b           a%2      (np.mod is the same as np.remainder)
Each value access in numpy
#numpy use [row index, column index]
print(c[0,0]) -> 4
print(c[0,1]) -> 6
print(c[1,0]) -> 10
print(c[1,1]) -> 12

Each value access in DataFrame
#dataframe use iloc[row index, column index]
print(c.iloc[0,0]) -> 4
print(c.iloc[0,1]) -> 6
print(c.iloc[1,0]) -> 10
print(c.iloc[1,1]) -> 12
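The function form and the operator form give the same result, and the same [row, column] idea is used to read single values. In the sketch below the arrays a and b are hypothetical; they are chosen only so that a + b contains 4, 6, 10 and 12 as in the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs chosen so that a + b gives 4, 6, 10, 12.
a = np.array([[1, 2], [3, 4]])
b = np.array([[3, 4], [7, 8]])

c = a + b                                   # operator form
same_result = np.array_equal(c, np.add(a, b))   # function form
print(same_result)

# numpy uses [row index, column index]
print(c[0, 0], c[0, 1], c[1, 0], c[1, 1])

# A dataframe supports the same arithmetic; single values use iloc.
cdf = pd.DataFrame(a) + pd.DataFrame(b)
print(cdf.iloc[0, 0], cdf.iloc[0, 1], cdf.iloc[1, 0], cdf.iloc[1, 1])
```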
