Lab Manual ET Lab III

EMERGING TECHNOLOGY LAB III

PRACTICAL NO: 1

TITLE: Introduction to Data Science

AIM: To study Data Science and the use of Pandas for data science.

PRIOR CONCEPT:

Data Science is a process that combines statistics, scientific methods, and algorithms to
derive meaningful insights from large volumes of data. It is an interdisciplinary field
founded on Statistics, Mathematics, Computer Science, and Business. Because it draws on so
many disciplines, it can be difficult to pin down exactly what Data Science is and why data
scientist is one of the most sought-after professions today.

Pandas is an open-source library that provides high-performance data manipulation in
Python. The name Pandas is derived from "panel data", an econometrics term for
multidimensional data sets. It is used for data analysis in Python and was developed by
Wes McKinney in 2008.

Data analysis requires a lot of processing, such as restructuring, cleaning, or merging.
Several tools are available for fast data processing, such as NumPy, SciPy, Cython, and
Pandas. We prefer Pandas because working with it is fast, simple, and more expressive than
the other tools. Pandas is built on top of the NumPy package, which means NumPy is required
to run Pandas.

Before Pandas, Python was capable of data preparation, but it provided only limited support
for data analysis. Pandas filled this gap and enhanced Python's data analysis capabilities.
It can perform the five significant steps required for processing and analyzing data,
irrespective of the origin of the data: load, manipulate, prepare, model, and analyze.

Key Features of Pandas :

• It provides a fast and efficient DataFrame object with default and customized indexing.
• It supports reshaping and pivoting of data sets.
• It supports group-by operations for aggregations and transformations.
• It provides data alignment and integrated handling of missing data.
• It provides time series functionality.
Department of Computer Science & Engineering 1
• It processes data sets in a variety of formats, such as matrix data, heterogeneous
tabular data, and time series.
• It handles many operations on data sets, such as subsetting, slicing, filtering,
grouping, re-ordering, and re-shaping.
• It integrates with other libraries such as SciPy and scikit-learn.
• It provides fast performance, and if you want to speed it up even more, you can use
Cython.

Python Pandas Data Structure :

Pandas provides two data structures for processing data, i.e., Series and DataFrame,
which are discussed below:

1) Series : It is defined as a one-dimensional array that is capable of storing various data
types. The row labels of a Series are called the index. We can easily convert a list, tuple,
or dictionary into a Series using the Series() constructor. A Series cannot contain multiple
columns. Its main parameter is:

Data: It can be any list, dictionary, or scalar value.
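
As an illustration of the Data parameter, the following minimal sketch builds a Series from a list, from a dictionary, and from a scalar value. The values and labels are made up for this example.

```python
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# From a scalar: the value is repeated for every label in the given index
from_scalar = pd.Series(5, index=['x', 'y', 'z'])

print(from_list)
print(from_dict)
print(from_scalar)
```

When a scalar is passed, an index should be supplied so that pandas knows how many times to repeat the value.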

2) DataFrame : It is a widely used data structure of pandas: a two-dimensional array with
labeled axes (rows and columns). A DataFrame is a standard way to store data and has two
different indexes, i.e., a row index and a column index. It has the following properties:

• The columns can be of heterogeneous types like int, bool, and so on.
• It can be seen as a dictionary of Series structures where both the rows and columns are
indexed. The column labels are denoted as “columns” and the row labels as “index”.
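
The "dictionary of Series" view can be sketched as follows. The names, marks, and row labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Three Series that share the same row labels
names = pd.Series(['Asha', 'Ravi'], index=['r1', 'r2'])
passed = pd.Series([True, False], index=['r1', 'r2'])
marks = pd.Series([81, 47], index=['r1', 'r2'])

# Each dictionary key becomes a column; the shared labels become the row index
df = pd.DataFrame({'name': names, 'passed': passed, 'marks': marks})

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.dtypes)   # each column keeps its own type
```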

How to install pandas using pip?


Step-1

First head over to https://www.python.org and click on Downloads in the navigation bar.

Step-2

Be sure to download the latest version of Python.

Step-3

On running the downloaded installer, you will get this window. Click on ‘Install Now’.

Step-4

After the installation finishes, it is recommended to choose the ‘Disable path length
limit’ option to avoid any problems with your Python installation.

Step-5

Now that Python is installed, open a terminal or command prompt from which you can install
Pandas. Go to the search bar on your desktop and search for cmd. An application called
Command Prompt should show up. Click to start it.

Step-6

Type in the command “pip install pandas”. Pip is the package installer for Python,
and it is installed alongside recent Python distributions.

Step-7

Wait for the downloads to be over and once it is done you will be able to run Pandas inside
your Python programs on Windows.
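
One simple way to confirm that the installation worked is to import the library and print its version. This is only a quick check; the version number printed depends on what pip installed.

```python
import pandas as pd

# If this import succeeds, pandas is installed for this Python interpreter
print(pd.__version__)
```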

CONCLUSION:

QUESTIONS:-

1. What is Data Science?


2. Explain the key features of Pandas.
3. Write down the major applications of data science.

PRACTICAL NO 2

TITLE: Python Data Series

AIM: Write a pandas program to add, subtract, multiply & divide two pandas series.

PRIOR CONCEPT:

A Pandas Series is a one-dimensional labeled array capable of holding data of any type
(integer, string, float, Python objects, etc.). The axis labels are collectively called the
index. A Pandas Series can be thought of as a single column in an Excel sheet. Labels need
not be unique but must be of a hashable type. The object supports both integer and
label-based indexing and provides a host of methods for performing operations involving the
index. In the real world, a Pandas Series is usually created by loading a dataset from
existing storage, such as a SQL database, a CSV file, or an Excel file. A Pandas Series can
also be created from a list, a dictionary, or a scalar value.
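
Because arithmetic on Series involves the index, it may help to see how labels are matched. The sketch below uses made-up labels to show that addition pairs values by label rather than by position, and that Series.add() with fill_value can supply a default for missing labels.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are paired by label; labels present in only one Series give NaN
total = a + b

# add() with fill_value treats a missing label as 0 instead of NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

In the program of this practical both Series use the default index 0 to 4, so every label matches and no NaN values appear.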

Python Code :

import pandas as pd

ds1 = pd.Series([2, 4, 6, 8, 10])

ds2 = pd.Series([1, 3, 5, 7, 9])

ds = ds1 + ds2

print("Add two Series:")

print(ds)

print("Subtract two Series:")

ds = ds1 - ds2

print(ds)

print("Multiply two Series:")

ds = ds1 * ds2

print(ds)

print("Divide Series1 by Series2:")

ds = ds1 / ds2

print(ds)

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. Write a code to create a simple Pandas Series from a list?


2. How to create our own labels in Pandas?

PRACTICAL NO: 3

TITLE: Python data frames

AIM: Write a Pandas program to create and display a DataFrame from a specified dictionary
data which has the index labels.

PRIOR CONCEPT :

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table
with rows and columns. Pandas uses the loc attribute to return one or more specified row(s).
If your data sets are stored in a file, Pandas can load them into a DataFrame.
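
The two ideas above, loading data and selecting rows with loc, can be sketched together. To keep the example self-contained, the CSV content is held in a string and read through io.StringIO; with a real file, its path would be passed to read_csv instead. The data and row labels are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "name,score\nAsha,12.5\nRavi,9.0\nMeena,16.5\n"

df = pd.read_csv(io.StringIO(csv_text))
df.index = ['a', 'b', 'c']  # custom row labels

one_row = df.loc['b']           # a single label returns a Series
two_rows = df.loc[['a', 'c']]   # a list of labels returns a DataFrame

print(one_row)
print(two_rows)
```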

PYTHON CODE:

import pandas as pd

import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',


'Laura', 'Kevin', 'Jonas'],

'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],

'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data, index=labels)

# Display the DataFrame, as required by the aim
print(df)

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. What is the difference between Dataset and Dataframe?


2. How to load files into a DataFrame?

PRACTICAL NO.: 4

TITLE: Insert data frame

AIM: Write a pandas program to insert a new column in an existing data frame.

PRIOR CONCEPT:

There are multiple ways to insert a new column in an existing data frame.

1. By declaring a new list as a column.
2. By using DataFrame.insert(): It gives the freedom to add a column at any position we
like and not just at the end. It also provides different options for inserting the column
values.
3. By using the DataFrame.assign() method: This method returns a new DataFrame with the
new column added, leaving the old DataFrame unchanged.
4. By using a dictionary: We can use a Python dictionary to add a new column to a
pandas DataFrame. The values of an existing column serve as the dictionary keys, and the
corresponding dictionary values become the values of the new column.
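
Methods 2 to 4 can be sketched as below. The column names (roll_no, passed, house) and the data are hypothetical; for the dictionary approach, this sketch uses Series.map() to look up each name in the dictionary, which is one common way to do it.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Asha', 'Ravi', 'Meena'],
                   'score': [12.5, 9.0, 16.5]})

# Method 2: insert() places the new column at position 0, modifying df in place
df.insert(0, 'roll_no', [101, 102, 103])

# Method 3: assign() returns a new DataFrame; df itself is left unchanged
df2 = df.assign(passed=df['score'] >= 10)

# Method 4: a dictionary keyed by the values of an existing column
house = {'Asha': 'Red', 'Ravi': 'Blue', 'Meena': 'Green'}
df2['house'] = df2['name'].map(house)

print(df)
print(df2)
```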

PYTHON CODE:

import pandas as pd

import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew',


'Laura', 'Kevin', 'Jonas'],

'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],

'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(exam_data , index=labels)

print("Original rows:")

print(df)

color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']

df['color'] = color

print("\nNew DataFrame after inserting the 'color' column")

print(df)

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. How to add up columns in python?


2. How to add multiple columns to a DataFrame in Pandas?

PRACTICAL NO: 5

TITLE: Display Pandas Index

AIM: Write a pandas program to display the default index & set a column as an index in a
given data frame.

PRIOR CONCEPT:

To get the index of a Pandas DataFrame, use the DataFrame.index property. It returns an
object of type Index that represents the row labels of the DataFrame. Individual labels can
be accessed with any looping technique in Python; for example, we can print the elements of
the Index object using a for loop.
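
A minimal sketch of reading the index and looping over it follows; the labels t1 to t3 and the weight values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'weight': [35, 32, 33]}, index=['t1', 't2', 't3'])

idx = df.index  # an Index object holding the row labels

labels = []
for label in idx:  # an Index can be iterated like a list
    labels.append(label)
    print(label)

# Without an explicit index, pandas creates a RangeIndex starting at 0
default_idx = pd.DataFrame({'weight': [35, 32]}).index
print(default_idx)
```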

PYTHON CODE:

import pandas as pd

df = pd.DataFrame({

'school_code': ['s001','s002','s003','s001','s002','s004'],

'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],

'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill',


'David Parkes'],

'date_Of_Birth':
['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],

'weight': [35, 32, 33, 30, 31, 32],

'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4'],

't_id':['t1', 't2', 't3', 't4', 't5', 't6']})

print("Default Index:")

print(df.head(10))

print("\nt_id as new Index:")

df1 = df.set_index('t_id')

print(df1)

print("\nReset the index:")

df2 = df1.reset_index(inplace=False)

print(df2)

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. How could we get first row index in Pandas?


2. Write the syntax to select a specific index in Pandas.

PRACTICAL NO.: 6

TITLE: Create Index labels

AIM: Write a pandas program to create index labels using 64-bit integers and using
floating-point numbers in a given data frame.

PRIOR CONCEPT :

Indexing in pandas simply means selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the
rows and all of the columns, or some of each of the rows and columns. Int64Index is a special
case of Index with purely integer labels. Its parameters are:

data : array-like (1-dimensional)
dtype : NumPy dtype (default: int64)
copy : bool. Make a copy of the input ndarray.
name : object. Name to be stored in the index.

Note that the dedicated Int64Index and Float64Index classes were removed in pandas 2.0; in
current versions a plain Index with an int64 or float64 dtype plays the same role.
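
As a small sketch of these parameters, an index with 64-bit integer or floating-point labels can be built with the general pd.Index constructor, which accepts data, dtype, and name. The label values and the name roll_no are made up for this example.

```python
import pandas as pd

# Integer labels stored as 64-bit integers, with a name attached to the index
int_index = pd.Index([1, 2, 3], dtype='int64', name='roll_no')

# Floating-point labels stored as 64-bit floats
float_index = pd.Index([0.1, 0.2, 0.3], dtype='float64')

print(int_index)
print(float_index)
print(int_index.dtype, float_index.dtype)
```

Either object can be passed as the index argument when constructing a DataFrame.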

PYTHON CODE:

import pandas as pd

print("Create an Int64Index:")

df_i64 = pd.DataFrame({

'school_code': ['s001','s002','s003','s001','s002','s004'],

'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],

'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill',


'David Parkes'],

'date_Of_Birth':
['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],

'weight': [35, 32, 33, 30, 31, 32],

'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4']},

index=[1, 2, 3, 4, 5, 6])

print(df_i64)

print("\nView the Index:")

print(df_i64.index)

print("\nFloating-point labels using Float64Index:")

df_f64 = pd.DataFrame({

'school_code': ['s001','s002','s003','s001','s002','s004'],

'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],

'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill',


'David Parkes'],

'date_Of_Birth':
['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],

'weight': [35, 32, 33, 30, 31, 32],

'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4']},

index=[.1, .2, .3, .4, .5, .6])

print(df_f64)

print("\nView the Index:")

print(df_f64.index)

OUTPUT:

CONCLUSION :

QUESTIONS:-

1. What are the index labels in Pandas?


2. Write the four types of data labels.

PRACTICAL NO: 7

TITLE: Pandas string

AIM: Write a pandas program to convert all the string values to upper and lower case in a
given pandas series. Also find the length of each string.

PRIOR CONCEPT :

Pandas provides a set of string functions which make it easy to operate on string data. Most
importantly, these functions ignore (or exclude) missing/NaN values.
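
The handling of missing values can be seen in a short sketch. The sample strings are arbitrary; the point is that the NaN entry passes through unchanged instead of raising an error.

```python
import numpy as np
import pandas as pd

s = pd.Series(['cat', np.nan, 'Horse'])

upper = s.str.upper()  # the missing entry stays missing
lengths = s.str.len()  # the length of a missing entry is also missing

print(upper)
print(lengths)
```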

PYTHON CODE:

import pandas as pd

import numpy as np

s = pd.Series(['X', 'Y', 'Z', 'Aaba', 'Baca', np.nan, 'CABA', None, 'bird', 'horse', 'dog'])

print("Original series:")

print(s)

print("\nConvert all string values of the said Series to upper case:")

print(s.str.upper())

print("\nConvert all string values of the said Series to lower case:")

print(s.str.lower())

print("\nLength of the string values of the said Series:")

print(s.str.len())

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. What is the Pandas data type of string data?

2. How do you find the string value of a dataframe?

PRACTICAL NO: 8

TITLE: Pandas regular expression

AIM: Write a pandas program to remove whitespace, left-sided whitespace & right-sided
whitespace from the string values.

PRIOR CONCEPT:

Pandas provides 3 methods to handle whitespace (including newlines) in text data. As the
names suggest, str.lstrip() removes whitespace from the left side of each string,
str.rstrip() removes it from the right side, and str.strip() removes it from both sides.
Since these methods share their names with Python’s built-in string methods, they are
reached through the .str accessor, which applies them element-wise to every value of a
Series or Index.
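
The following sketch, with made-up strings, shows the three methods on a Series and confirms that tabs and newlines are treated as whitespace too.

```python
import pandas as pd

s = pd.Series(['  Green\n', '\tBlack  '])

both = s.str.strip()    # whitespace removed from both sides
left = s.str.lstrip()   # only leading whitespace removed
right = s.str.rstrip()  # only trailing whitespace removed

# repr() makes the remaining whitespace visible
print([repr(v) for v in both])
print([repr(v) for v in left])
print([repr(v) for v in right])
```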

PYTHON CODE:

import pandas as pd

color1 = pd.Index([' Green', 'Black ', ' Red ', 'White', ' Pink '])

print("Original series:")

print(color1)

print("\nRemove whitespace")

print(color1.str.strip())

print("\nRemove left sided whitespace")

print(color1.str.lstrip())

print("\nRemove Right sided whitespace")

print(color1.str.rstrip())

OUTPUT:

CONCLUSION:

QUESTIONS:-
1. How to get rid of leading and trailing spaces in Pandas?
2. How to check if a string has whitespaces or not?

PRACTICAL NO.: 9

TITLE: Joining data frames


AIM: Write a pandas program to join the two given dataframes along rows & assign all data.

PRIOR CONCEPT :

Pandas provides various facilities for easily combining together Series or DataFrame with
various kinds of set logic for the indexes and relational algebra functionality in the case of
join / merge-type operations. In addition, pandas also provides utilities to compare two Series
or DataFrame and summarize their differences.
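
One detail worth seeing before joining along rows is what happens to the row index. In this sketch (with invented data) pd.concat keeps each frame's original labels by default, so labels repeat; passing ignore_index=True renumbers the rows instead.

```python
import pandas as pd

df1 = pd.DataFrame({'student_id': ['S1', 'S2'], 'marks': [200, 210]})
df2 = pd.DataFrame({'student_id': ['S3', 'S4'], 'marks': [190, 222]})

# Default: original row labels are kept, so 0 and 1 appear twice
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```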

PYTHON CODE:

import pandas as pd

student_data1 = pd.DataFrame({

'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],

'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],

'marks': [200, 210, 190, 222, 199]})

student_data2 = pd.DataFrame({

'student_id': ['S4', 'S5', 'S6', 'S7', 'S8'],

'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston'],

'marks': [201, 200, 198, 219, 201]})

print("Original DataFrames:")

print(student_data1)

print("-------------------------------------")

print(student_data2)

print("\nJoin the said two dataframes along rows:")

result_data = pd.concat([student_data1, student_data2])

print(result_data)

OUTPUT:

CONCLUSION:

QUESTIONS:-

1. Explain the concat( ) function in Pandas.


2. What is difference between joining and merging in pandas DataFrame?

PRACTICAL NO. : 10

TITLE: Merging data frames.

AIM:- Write a pandas program to append a list of dictionaries or series to an existing
dataframe & display the combined data.

PRIOR CONCEPT :

The merge() method combines two DataFrames on one or more common columns or index levels,
using the specified join method, and returns a new DataFrame. Pandas provides a single
function, merge, as the entry point for all standard database join operations between
DataFrame objects.
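
A minimal sketch of merge as a database-style join is given below. The two small tables and their contents are hypothetical; the how parameter selects the join type.

```python
import pandas as pd

students = pd.DataFrame({'student_id': ['S1', 'S2', 'S3'],
                         'name': ['Asha', 'Ravi', 'Meena']})
marks = pd.DataFrame({'student_id': ['S2', 'S3', 'S4'],
                      'marks': [210, 190, 222]})

# Inner join: only student_id values present in both tables are kept
inner = pd.merge(students, marks, on='student_id', how='inner')

# Left join: every row of students is kept; missing marks become NaN
left = pd.merge(students, marks, on='student_id', how='left')

print(inner)
print(left)
```

Note that the program in this practical appends rows rather than joining on a key, so it relies on concatenation and not on merge.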

PYTHON CODE:-

import pandas as pd

student_data1 = pd.DataFrame({

'student_id': ['S1', 'S2', 'S3', 'S4', 'S5'],

'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin'],

'marks': [200, 210, 190, 222, 199]})

s6 = pd.Series(['S6', 'Scarlette Fisher', 205], index=['student_id', 'name', 'marks'])

dicts = [{'student_id': 'S6', 'name': 'Scarlette Fisher', 'marks': 203},

{'student_id': 'S7', 'name': 'Bryce Jensen', 'marks': 207}]

print("Original DataFrames:")

print(student_data1)

print("\nSeries:")

print(s6)

# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement

combined_data = pd.concat([student_data1, pd.DataFrame(dicts)], ignore_index=True, sort=False)

print("\nCombined Data:")

print(combined_data)

OUTPUT:

CONCLUSION:

QUESTIONS:-
1. How to merge 3 DataFrames in pandas Python?
2. How to merge a list of DataFrames in pandas?
3. Which are the 3 main ways of combining DataFrames together?

PRACTICAL NO. : 11

TITLE: Pandas Time Series.

AIM:- Write a pandas program to create a date from a given year, month, and day & another
date from a given string format.

PRIOR CONCEPT :

A time series is a sequence of data points that occur in sequential order over a given time
period. Values measured or observed over time form a time series structure. Pandas’ time
series tools are very useful when data is timestamped. Timestamp is the pandas equivalent of
Python’s datetime. It is the type used for the entries that make up a DatetimeIndex and other
time-series-oriented data structures in pandas. The simplest time series is a Series
indexed by timestamps.
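
The pandas types named above can be sketched as follows. The dates and readings are invented; the sketch builds a Timestamp from components, parses one from a string with pd.to_datetime, and creates a Series indexed by a DatetimeIndex.

```python
import pandas as pd

# A Timestamp built from year, month, and day
ts = pd.Timestamp(year=2020, month=12, day=25)

# A Timestamp parsed from a string
parsed = pd.to_datetime('2021-01-01')

# Three consecutive days form a DatetimeIndex
days = pd.date_range(start='2021-01-01', periods=3, freq='D')

# The simplest time series: a Series indexed by timestamps
readings = pd.Series([10, 12, 11], index=days)

print(ts)
print(parsed)
print(readings)
```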

PYTHON CODE :-

from datetime import datetime

date1 = datetime(year=2020, month=12, day=25)

print("Date from a given year, month, day:")

print(date1)

from dateutil import parser

date2 = parser.parse("1st of January, 2021")

print("\nDate from a given string formats:")

print(date2)

OUTPUT:

CONCLUSION:

QUESTIONS:-
1. How does pandas handle time series data?
2. Which are the three data structures to work with the time series in Pandas?

PRACTICAL NO. : 12

TITLE: Pandas grouping aggregate.

AIM:- Write a Pandas program to split the following dataframe by school code and get mean,
min, and max value of age for each school.

PRIOR CONCEPT :

Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and return a summary of the result. Aggregation can be used to
get a summary of columns in our dataset, such as the sum, minimum, or maximum of a
particular column. The function used for aggregation is agg(); its parameter is the
function (or list of functions) we want to apply. Pandas’ groupby() is a powerful and
versatile method. It allows you to split your data into separate groups and perform
computations on each group for better analysis.
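
The split-then-aggregate idea can be checked by hand on a tiny made-up table. Here the age column is selected before calling agg(), so the result has plain column names mean, min, and max.

```python
import pandas as pd

df = pd.DataFrame({'school_code': ['s001', 's002', 's001', 's002'],
                   'age': [12, 14, 13, 12]})

# Split the rows by school_code, then summarize the age column of each group
summary = df.groupby('school_code')['age'].agg(['mean', 'min', 'max'])

print(summary)
```

For s001 the ages are 12 and 13, so the mean is 12.5; for s002 the ages are 14 and 12, so the mean is 13.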

PYTHON CODE:-

import pandas as pd

pd.set_option('display.max_rows', None)

#pd.set_option('display.max_columns', None)

student_data = pd.DataFrame({

'school_code': ['s001','s002','s003','s001','s002','s004'],

'class': ['V', 'V', 'VI', 'VI', 'V', 'VI'],

'name': ['Alberto Franco','Gino Mcneill','Ryan Parkes', 'Eesha Hinton', 'Gino Mcneill',


'David Parkes'],

'date_Of_Birth':
['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],

'age': [12, 12, 13, 13, 14, 12],

'height': [173, 192, 186, 167, 151, 159],

'weight': [35, 32, 33, 30, 31, 32],

'address': ['street1', 'street2', 'street3', 'street1', 'street2', 'street4']},

index=['S1', 'S2', 'S3', 'S4', 'S5', 'S6'])

print("Original DataFrame:")

print(student_data)

print('\nMean, min, and max value of age for each value of the school:')

grouped_single = student_data.groupby('school_code').agg({'age': ['mean', 'min', 'max']})

print(grouped_single)

OUTPUT:

CONCLUSION:

QUESTIONS:-
1. Which functions are used in the aggregation?
2. Does pandas GroupBy return a Series? Explain.
