
Data Manipulation with Pandas
Introduction
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating
data.
• The name "Pandas" refers to both "Panel Data" and "Python Data
Analysis". The library was created by Wes McKinney in 2008.
• It provides various data structures and operations for manipulating
numerical data and time series.
• This library is built on top of the NumPy library.
• Pandas offers high performance and improves productivity for users.
Advantages

• Fast and efficient for manipulating and analyzing data.
• Data from different file objects can be loaded.
• Easy handling of missing data in floating point as well as non-floating
point data.
• Size mutability: columns can be inserted and deleted from DataFrame
and higher dimensional objects
• Data set merging and joining.
• Flexible reshaping and pivoting of data sets
• Provides time-series functionality.
• Powerful group by functionality for performing split-apply-combine
operations on data sets
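As a minimal sketch of two of the points above (missing-data handling and split-apply-combine), the following uses a small made-up table. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: marks for students in two divisions, one value missing.
df = pd.DataFrame({
    "Division": ["A", "A", "B", "B"],
    "Marks": [80.0, np.nan, 70.0, 90.0],
})

# Missing data: count the NaN values, then fill them with the column mean.
missing_count = int(df["Marks"].isna().sum())
filled = df["Marks"].fillna(df["Marks"].mean())

# Split-apply-combine: split rows by Division, apply mean, combine the results.
mean_by_division = df.groupby("Division")["Marks"].mean()
print(mean_by_division)
```

Note that mean() skips missing values by default, so the NaN does not affect the group averages.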
Why Pandas is used for Data Science

• It is built on top of the NumPy library, which means that a lot of
structures of NumPy are used or replicated in Pandas.
• The data produced by Pandas are often used as input for plotting
functions of Matplotlib, statistical analysis in SciPy, and machine
learning algorithms in Scikit-learn.
• A Pandas program can be run from any text editor, but Jupyter
Notebook is recommended because it can execute the code in a
particular cell rather than the entire file. Jupyter also provides an
easy way to visualize pandas data frames and plots.
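The relationship to NumPy can be sketched as follows. The column names x and y are made up for illustration; to_numpy() and np.sqrt are standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1.0, 4.0, 9.0]})

# to_numpy() returns the underlying data as a NumPy ndarray,
# which is the form many other libraries expect as input.
arr = df.to_numpy()
print(type(arr), arr.shape)

# NumPy functions can also be applied to a column directly;
# the result is still a pandas Series with the same index.
root = np.sqrt(df["y"])
print(root)
```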
Pandas data structures for manipulating data
• Pandas provides two main data structures for manipulating data.
They are:
• Series
• DataFrame
Series

• Pandas Series is a one-dimensional labeled array capable of holding
data of any type (integer, string, float, python objects, etc.).
• The axis labels are collectively called the index.
• A Pandas Series is comparable to a single column in an Excel sheet.
• Labels need not be unique but must be a hashable type.
• The object supports both integer and label-based indexing and
provides a host of methods for performing operations involving the
index.
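A short sketch of the points above about labels and indexing. The values and labels are made up; .loc selects by label and .iloc selects by integer position.

```python
import pandas as pd

# Labels need not be unique; here "a" appears twice.
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

# Label-based access: a repeated label returns every matching element.
by_label = s.loc["a"]

# Position-based access: position 1 is the second element.
by_position = s.iloc[1]

print(by_label)
print(by_position)
```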
Creating a Series

• In the real world, a Pandas Series is usually created by loading data
from existing storage such as a SQL database, a CSV file, or an Excel file.
A Pandas Series can also be created from a list, a dictionary, a scalar
value, etc.
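import pandas as pd
import numpy as np

# Creating an empty series
# (an explicit dtype avoids a warning in some pandas versions)
ser = pd.Series(dtype="float64")
print(ser)

# Creating a series from a simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)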
DataFrame

• Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and
columns). Data is aligned in a tabular fashion in rows and columns.
A Pandas DataFrame consists of three principal components: the data,
the rows, and the columns.
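The three components can be inspected directly, as in this minimal sketch. The row labels r1 and r2 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Tom", "Jack"], "Age": [28, 34]},
    index=["r1", "r2"],
)

print(df.index)       # the row labels
print(df.columns)     # the column labels
print(df.to_numpy())  # the data itself, as a NumPy array
print(df.shape)       # (number of rows, number of columns)
```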
Creating a DataFrame:

• In the real world, a Pandas DataFrame is usually created by loading data
from existing storage such as a SQL database, a CSV file, or an Excel file.
A Pandas DataFrame can also be created from a list, a dictionary, a list
of dictionaries, etc.
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)
pandas.Series

A pandas Series can be created using the following constructor −
pandas.Series(data, index, dtype, copy)
data
data takes various forms like ndarray, list, dict, or a scalar constant.
index
Index values must be hashable and the same length as data; they need not be unique.
Defaults to 0, 1, ..., n-1 (equivalent to np.arange(n)) if no index is passed.
dtype
dtype is for data type. If None, the data type will be inferred.
copy
• Copy data. Default False.
Example
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Create a Series from dict
# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# the dictionary keys become the index
s = pd.Series(data)
print(s)

# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
# the index sets the order; 'd' is not a key in data, so its value is NaN
s = pd.Series(data,index=['b','c','d','a'])
print(s)
Create a Series from Scalar

# import the pandas library and alias it as pd
import pandas as pd
import numpy as np
# the scalar value is repeated for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Accessing Data from Series with Position
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the first element by position
# (.iloc is used because s[0] on a labeled Series is deprecated in recent pandas)
print(s.iloc[0])

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the first three elements
print(s.iloc[:3])

import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
# retrieve the last three elements
print(s.iloc[-3:])
pandas.DataFrame

• A pandas DataFrame can be created using the following constructor −
pandas.DataFrame(data, index, columns, dtype, copy)
data
data takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
index
Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
columns
Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
dtype
Data type to force on the columns. If None, the data type is inferred.
copy
• Whether to copy data from the input. The default is False for DataFrame and 2-D ndarray input.
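The parameters above can be combined as in this minimal sketch. The array values and the labels r1..r3, a, b are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2], [3, 4], [5, 6]])

# Explicit row labels, column labels, and a forced data type.
df = pd.DataFrame(
    values,
    index=["r1", "r2", "r3"],
    columns=["a", "b"],
    dtype="float64",
)
print(df)
print(df.dtypes)

# Without index and columns, both default to 0, 1, 2, ...
default_df = pd.DataFrame(values)
print(default_df)
```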
Create an Empty DataFrame

# import the pandas library and alias it as pd
import pandas as pd
df = pd.DataFrame()
print(df)
Create a DataFrame from Lists
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
# convert only the numeric column; passing dtype=float to the constructor
# raises an error in recent pandas because 'Name' holds strings
df = df.astype({'Age': float})
print(df)
Create a DataFrame from Dict of ndarrays / Lists

• All the ndarrays must be of the same length. If an index is passed, the
length of the index must equal the length of the arrays.
• If no index is passed, then by default the index will be range(n),
where n is the array length.
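The two rules above can be sketched like this. The labels rank1..rank3 are made up, and mismatch_error is a hypothetical variable name used only to capture the exception.

```python
import pandas as pd

data = {"Name": ["Tom", "Jack", "Steve"], "Age": [28, 34, 29]}

# An explicit index must have the same length as the lists (3 here).
df = pd.DataFrame(data, index=["rank1", "rank2", "rank3"])
print(df)

# Lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [28]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```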
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)
Create a DataFrame from List of Dicts

import pandas as pd
# the first dictionary has no key 'c', so that cell is NaN
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
The following example shows how to create a DataFrame by passing a
list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)
Create a DataFrame from Dict of Series

import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print(df)
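As a sketch of what happens when the Series have different indexes: the resulting row index is the union of the Series indexes, and cells with no matching label become NaN. The data repeats the example above.

```python
import pandas as pd

d = {
    "one": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "two": pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)

# Row "d" exists only in "two", so column "one" holds NaN there.
print(df.loc["d"])

# Because of the NaN, column "one" is stored as float rather than int.
print(df.dtypes)
```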
Pandas Dataframe/Series.head() method

• Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.

• The head() method returns the first n rows (5 by default) of a DataFrame or
Series.

• Syntax: DataFrame.head(n=5)

• Parameters:
• n: integer value, number of rows to be returned

• Return type: DataFrame (or Series) with the first n rows
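A minimal sketch of the n parameter on a small in-memory frame, so it runs without downloading a file. The column name n and the ten-row range are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_five = df.head()          # default n=5
first_three = df.head(3)        # explicit n
all_but_last_two = df.head(-2)  # negative n: everything except the last 2 rows

print(first_three)
print(len(first_five), len(all_but_last_two))
```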


# importing pandas module
import pandas as pd

# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# calling head() method and storing the result in a new variable
data_top = data.head()

# display
data_top
• In this example, the .head() method is called on a Series with a custom value for the n
parameter to return the top 9 rows of the series.
# importing pandas module
import pandas as pd
# making data frame
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
# number of rows to return
n = 9
# creating series
series = data["Name"]
# returning top n rows
top = series.head(n=n)
# display
top
ADDING A NEW ROW
• We can add a single row using DataFrame.loc. To add the row at the
end of the DataFrame, we can use len(DataFrame.index) to get the
number of rows, which is the position at which the new row is
added when the index is the default 0, 1, 2, ...
• Example
import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
print("After adding new row DataFrame")
# label 5 does not exist yet, so this appends a new row
df.loc[5]=[6,'Sunita',77]
print(df)
OUTPUT

Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
After adding new row DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
5       6  Sunita     77
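The position can also be computed with len(df.index) instead of being written out by hand. The sketch below shows that, plus pd.concat as an alternative that is not part of the original example. The shortened three-row table is an assumption made to keep the output small.

```python
import pandas as pd

df = pd.DataFrame({
    "RollNO": [1, 2, 3],
    "Name": ["Nina", "Mohan", "John"],
    "Marks": [89, 96, 78],
})

# len(df.index) is the next free label when the index is 0, 1, 2, ...
df.loc[len(df.index)] = [4, "Priya", 60]

# Alternative: build a one-row DataFrame and concatenate it.
new_row = pd.DataFrame({"RollNO": [5], "Name": ["Aakash"], "Marks": [99]})
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

With ignore_index=True the combined frame is renumbered from 0, so the row labels stay consecutive.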
Adding Column in Data Frame

• Add columns at the end of the table.
• Add columns at a specific index.
• Add columns with the loc method.
METHOD 1: ADDING COLUMNS AT THE END
import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
# assigning to a new column name appends the column; the scalar fills every row
df["Division"]='IVA'
print(df)
OUTPUT

Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
   RollNO    Name  Marks Division
0       1    Nina     89      IVA
1       2   Mohan     96      IVA
2       3    John     78      IVA
3       4   Priya     60      IVA
4       5  Aakash     99      IVA
METHOD 2: ADD COLUMNS AT A SPECIFIC INDEX

import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
# insert(position, column name, value); position 3 is after the three
# existing columns, so here the new column lands at the end
df.insert(3,"Division",'IVA')
print(df)
OUTPUT
Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
   RollNO    Name  Marks Division
0       1    Nina     89      IVA
1       2   Mohan     96      IVA
2       3    John     78      IVA
3       4   Priya     60      IVA
4       5  Aakash     99      IVA
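In the example above the position equals the number of existing columns, so the result looks the same as appending. A minimal sketch with a smaller position shows the difference; the shortened three-row table is an assumption made for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    "RollNO": [1, 2, 3],
    "Name": ["Nina", "Mohan", "John"],
    "Marks": [89, 96, 78],
})

# Position 1 places the new column between RollNO and Name.
df.insert(1, "Division", "IVA")
print(df.columns.tolist())
```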
METHOD 3: ADD COLUMNS WITH LOC

import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
# ":" selects all rows; "Division" is a new column label, so it is created
df.loc[:, "Division"] = 'IVA'
print(df)
OUTPUT
Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
   RollNO    Name  Marks Division
0       1    Nina     89      IVA
1       2   Mohan     96      IVA
2       3    John     78      IVA
3       4   Priya     60      IVA
4       5  Aakash     99      IVA
Using drop() to remove a specific row
• Dropping a row by its index label
import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
# axis=0 means rows; inplace=True modifies df instead of returning a copy
df.drop(4,axis=0,inplace=True)
print(df)
OUTPUT
Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
   RollNO   Name  Marks
0       1   Nina     89
1       2  Mohan     96
2       3   John     78
3       4  Priya     60
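As a sketch of two variations not shown above: drop() also accepts a list of labels, and without inplace=True it returns a new DataFrame and leaves the original unchanged. The variable names trimmed and renumbered are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "RollNO": [1, 2, 3, 4, 5],
    "Name": ["Nina", "Mohan", "John", "Priya", "Aakash"],
    "Marks": [89, 96, 78, 60, 99],
})

# Drop the rows labeled 0 and 4; df itself is not modified.
trimmed = df.drop([0, 4], axis=0)

# The surviving rows keep their old labels; reset_index renumbers them from 0.
renumbered = trimmed.reset_index(drop=True)
print(renumbered)
```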
Deleting Column
import pandas as pd
df = pd.DataFrame({'RollNO':[1,2,3,4,5],
                   'Name':['Nina','Mohan','John','Priya','Aakash'],
                   'Marks':[89,96,78,60,99]})
print("Original DataFrame")
print(df)
# axis=1 means columns; inplace=True modifies df instead of returning a copy
df.drop('Marks', inplace=True, axis=1)
print(df)
OUTPUT
Original DataFrame
   RollNO    Name  Marks
0       1    Nina     89
1       2   Mohan     96
2       3    John     78
3       4   Priya     60
4       5  Aakash     99
   RollNO    Name
0       1    Nina
1       2   Mohan
2       3    John
3       4   Priya
4       5  Aakash
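Two alternatives to drop(..., axis=1), sketched on a shortened version of the same table: the columns= keyword, and pop(), which removes a column and returns it. Neither appears in the original example; both are standard DataFrame methods.

```python
import pandas as pd

df = pd.DataFrame({
    "RollNO": [1, 2, 3],
    "Name": ["Nina", "Mohan", "John"],
    "Marks": [89, 96, 78],
})

# columns= is equivalent to passing the label with axis=1.
without_marks = df.drop(columns=["Marks"])

# pop() removes the column from df in place and returns it as a Series.
names = df.pop("Name")

print(without_marks.columns.tolist())
print(df.columns.tolist())
```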
