Lab Manual ET Lab III
PRACTICAL NO: 1
AIM: To study Data Science and the use of Pandas for data science.
PRIOR CONCEPT:
Data Science is the process that combines statistics, scientific methods, and algorithms to
derive meaningful insights from large volumes of data. It is an
interdisciplinary field whose foundation lies in Statistics, Mathematics, Computer
Science, and Business. This breadth makes it a little difficult to define exactly what Data
Science is and why data scientist is one of the most sought-after professions today.
Data analysis requires a lot of processing, such as restructuring, cleaning, or merging.
Different tools are available for fast data processing, such as NumPy, SciPy,
Cython, and Pandas. We prefer Pandas because working with it is fast, simple, and
more expressive than the other tools. Pandas is built on top of the NumPy package,
which means NumPy is required for Pandas to operate.
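The relationship between Pandas and NumPy can be seen in a small sketch. The values and the variable names (s, arr, roots) are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# A Series stores its values in a NumPy-compatible array.
s = pd.Series([1.0, 4.0, 9.0])
arr = s.to_numpy()
print(type(arr))

# NumPy functions can be applied to a Series directly; the result is again a Series.
roots = np.sqrt(s)
print(roots)
```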
Before Pandas, Python was capable of data preparation, but it provided only limited support
for data analysis. Pandas came into the picture and enhanced these capabilities.
It can perform the five significant steps required for processing and analysis of data,
irrespective of the origin of the data: load, manipulate, prepare, model, and analyze.
Key features of Pandas:
• It has a fast and efficient DataFrame object with default and customized indexing.
• It is used for reshaping and pivoting of data sets.
• It groups data for aggregations and transformations.
• It is used for data alignment and integrated handling of missing data.
• It provides time series functionality.
Department of Computer Science & Engineering 1
EMERGING TECHNOLOGY LAB III
• It processes a variety of data sets in different formats, such as matrix data, heterogeneous
tabular data, and time series.
• It handles multiple operations on data sets, such as subsetting, slicing, filtering,
groupBy, re-ordering, and re-shaping.
• It integrates with other libraries such as SciPy and scikit-learn.
• It provides fast performance, and if you want to speed it up even more, you can use
Cython.
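A few of the features listed above (missing data handling, group by, and reshaping) can be sketched as follows. The sales table, its column names, and its values are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one value is missing on purpose.
sales = pd.DataFrame({
    "city": ["Pune", "Pune", "Nagpur", "Nagpur"],
    "year": [2021, 2022, 2021, 2022],
    "units": [10.0, np.nan, 7.0, 9.0],
})

# Missing data: replace NaN in the 'units' column with 0 before aggregating.
filled = sales.fillna({"units": 0.0})

# Group by: total units per city.
totals = filled.groupby("city")["units"].sum()

# Reshaping: one row per city, one column per year.
wide = filled.pivot(index="city", columns="year", values="units")

print(totals)
print(wide)
```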
Pandas provides two data structures for processing data, Series and DataFrame,
which are discussed below:
1) Series : It is a one-dimensional labeled array capable of holding data of any type.
2) DataFrame : It is a widely used data structure of Pandas and works as a
two-dimensional array with labeled axes (rows and columns). A DataFrame is a standard
way to store data and has two different indexes, i.e., a row index and a column index. It has
the following properties:
• The columns can be of heterogeneous types such as int, bool, and so on.
• It can be seen as a dictionary of Series structures where both the rows and columns are
indexed. The column labels are denoted as “columns” and the row labels as “index”.
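The properties above can be checked with a minimal sketch. The column names (marks, passed) and the row labels are assumptions made for illustration.

```python
import pandas as pd

# Two hypothetical columns with different types, sharing the same row labels.
marks = pd.Series([81, 67, 92], index=["r1", "r2", "r3"])
passed = pd.Series([True, False, True], index=["r1", "r2", "r3"])

# A DataFrame built from a dictionary of Series.
df = pd.DataFrame({"marks": marks, "passed": passed})

print(df.index)            # row labels ("index")
print(df.columns)          # column labels ("columns")
print(df.dtypes)           # one type per column
print(type(df["marks"]))   # each column is itself a Series
```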
Step-1
First, head over to https://www.python.org and click on Downloads in the navigation bar.
Step-2
Step-3
On running the downloaded installer, you will get this window. Click on ‘Install Now’.
Step-4
After the installation finishes, it is recommended to choose the option to disable the path
length limit to avoid any problems with your Python installation.
Step-5
Now that Python is installed, head over to the terminal or command prompt, from
where you can install Pandas. Go to the search bar on your desktop and search for cmd.
An application called Command Prompt should show up. Click to start it.
Step-6
Type in the command “pip install pandas”. Pip is a package installer for Python
and it is installed alongside recent Python distributions.
Step-7
Wait for the downloads to finish. Once that is done, you will be able to run Pandas inside
your Python programs on Windows.
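One simple way to confirm that the installation worked is to import the package and print its version. This is only a quick check, not part of the installer steps.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string is printed.
print(pd.__version__)
```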
CONCLUSION:
QUESTIONS:-
PRACTICAL NO 2
AIM: Write a pandas program to add, subtract, multiply & divide two pandas series.
PRIOR CONCEPT:
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index. A Pandas
Series can be thought of as a column in an Excel sheet. Labels need not be unique but must be of
a hashable type. The object supports both integer and label-based indexing and provides a host of
methods for performing operations involving the index. In the real world, a Pandas Series
is usually created by loading a dataset from existing storage, which can be a SQL database,
a CSV file, or an Excel file. A Pandas Series can also be created from a list, a dictionary, or
a scalar value.
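The three creation routes mentioned above can be sketched as follows. All values and labels are illustrative assumptions.

```python
import pandas as pd

from_list = pd.Series([10, 20, 30])                 # default index 0, 1, 2
from_dict = pd.Series({"a": 1.5, "b": 2.5})         # dictionary keys become labels
from_scalar = pd.Series(7, index=["x", "y", "z"])   # the scalar is repeated for each label

print(from_list)
print(from_dict)
print(from_scalar)

# Label-based and position-based access to the same element.
print(from_dict["b"], from_dict.iloc[1])
```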
Python Code :
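import pandas as pd
# Example values (assumed); the listing does not show how ds1 and ds2 are created.
ds1 = pd.Series([2, 4, 6, 8, 10])
ds2 = pd.Series([1, 3, 5, 7, 9])
# Arithmetic operators work element by element on matching index labels.
ds = ds1 + ds2
print(ds)
ds = ds1 - ds2
print(ds)
ds = ds1 * ds2
print(ds)
ds = ds1 / ds2
print(ds)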
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO: 3
AIM: Write a Pandas program to create and display a DataFrame from a specified dictionary
data which has the index labels.
PRIOR CONCEPT :
PYTHON CODE:
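import pandas as pd
import numpy as np
# Only the 'qualify' column survives in this listing; the other columns are not shown.
exam_data = {
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print(df)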
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO.: 4
AIM: Write a pandas program to insert a new column in an existing data frame.
PRIOR CONCEPT:
There are multiple ways to insert a new column in an existing data frame, such as direct
assignment, the insert() method, and the assign() method.
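Three common ways of adding a column can be sketched as follows. The small table and the column names (attempts, grade, passed) are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"], "score": [12.5, 9.0, 16.5]})

# 1. Direct assignment appends the column at the end and modifies df.
df["attempts"] = [1, 3, 2]

# 2. insert() places the column at a chosen position and modifies df.
df.insert(1, "grade", ["B", "C", "A"])

# 3. assign() returns a new DataFrame and leaves df unchanged.
df2 = df.assign(passed=df["score"] >= 10)

print(df)
print(df2)
```

Direct assignment is the shortest form; insert() is useful when the column position matters, and assign() when the original data frame should stay untouched.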
PYTHON CODE:
import pandas as pd
import numpy as np
# Only the 'qualify' column survives in this listing; the other columns are not shown.
exam_data = {
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data, index=labels)
print("Original rows:")
print(df)
# The list must have one value per row of the data frame.
color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']
df['color'] = color
print(df)
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO: 5
AIM: Write a pandas program to display the default index & set a column as an index in a
given data frame.
PRIOR CONCEPT:
PYTHON CODE:
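import pandas as pd
# Only two columns survive in this listing. The 't_id' values are assumed:
# the code below needs that column, but the listing does not show it.
df = pd.DataFrame({
    'school_code': ['s001','s002','s003','s001','s002','s004'],
    'date_Of_Birth':
    ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    't_id': ['t1', 't2', 't3', 't4', 't5', 't6']})
print("Default Index:")
print(df.head(10))
# set_index returns a new data frame that uses 't_id' as the row index.
df1 = df.set_index('t_id')
print(df1)
# reset_index restores the default integer index and turns 't_id' back into a column.
df2 = df1.reset_index(inplace=False)
print(df2)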
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO.: 6
AIM: - Write a pandas program to create index labels by using 64-bit integers and by using
floating point numbers in a given data frame.
PRIOR CONCEPT :
Indexing in pandas means selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the
rows and all of the columns, or some of each of the rows and columns. Int64Index is a special
case of Index with purely integer labels (in pandas 2.0 and later it is replaced by Index
with dtype int64). Its parameters are:
data : array-like (1-dimensional)
dtype : NumPy dtype (default: int64)
copy : bool. Make a copy of input ndarray.
name : object. Name to be stored in the index.
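The three kinds of selection described above can be sketched with loc (label-based) and iloc (position-based). The table and the variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"school_code": ["s001", "s002", "s003"], "weight": [35, 32, 33]},
    index=[1, 2, 3],
)

all_rows_some_cols = df.loc[:, ["weight"]]   # all rows, one column
some_rows_all_cols = df.loc[[1, 3], :]       # rows labeled 1 and 3, all columns
some_of_each = df.iloc[0:2, 0:1]             # first two rows, first column, by position

print(all_rows_some_cols)
print(some_rows_all_cols)
print(some_of_each)
print(df.index.dtype)                        # integer labels are stored as int64
```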
PYTHON CODE:
import pandas as pd
# Only two columns survive in this listing; the other columns are not shown.
print("Create an Int64Index:")
df_i64 = pd.DataFrame({
    'school_code': ['s001','s002','s003','s001','s002','s004'],
    'date_Of_Birth':
    ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997']},
    index=[1, 2, 3, 4, 5, 6])
print(df_i64)
print(df_i64.index)
# The float labels below are assumed; the listing does not show them.
df_f64 = pd.DataFrame({
    'school_code': ['s001','s002','s003','s001','s002','s004'],
    'date_Of_Birth':
    ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997']},
    index=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(df_f64)
print(df_f64.index)
OUTPUT:
CONCLUSION :
QUESTIONS:-
PRACTICAL NO: 7
AIM: Write a pandas program to convert all the string values to upper and lower case in a
given pandas series. Also find the length of each string.
PRIOR CONCEPT :
Pandas provides a set of string functions which make it easy to operate on string data. Most
importantly, these functions ignore (or exclude) missing/NaN values.
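How missing values pass through the string functions can be seen in a short sketch. The three values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Aaba", np.nan, "dog"])

# The string functions skip the missing entry instead of raising an error.
upper = s.str.upper()
lengths = s.str.len()

print(upper)
print(lengths)   # the missing entry stays missing in the result
```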
PYTHON CODE:
import pandas as pd
import numpy as np
s = pd.Series(['X', 'Y', 'Z', 'Aaba', 'Baca', np.nan, 'CABA', None, 'bird', 'horse', 'dog'])
print("Original series:")
print(s)
print(s.str.upper())
print(s.str.lower())
print(s.str.len())
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO: 8
AIM: Write a pandas program to remove whitespace, left-sided whitespace & right-sided
whitespace from string values.
PRIOR CONCEPT:
Pandas provides three methods to handle whitespace (including newlines) in text data. As
the names suggest, str.lstrip() removes whitespace from the left side of a
string, str.rstrip() removes it from the right side, and str.strip() removes it
from both sides. Since these Pandas methods have the same names as Python’s built-in string
methods, they are called through the .str accessor so that they apply to every element.
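The effect of the three methods on spaces, tabs, and newlines can be sketched as follows. The sample strings are assumptions for illustration.

```python
import pandas as pd

s = pd.Series(["  Green\n", "\tBlack  ", "Red"])

both = s.str.strip().tolist()     # whitespace removed from both sides
left = s.str.lstrip().tolist()    # whitespace removed from the left side only
right = s.str.rstrip().tolist()   # whitespace removed from the right side only

print(both)
print(left)
print(right)
```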
PYTHON CODE:
import pandas as pd
color1 = pd.Index([' Green', 'Black ', ' Red ', 'White', ' Pink '])
print("Original Index:")
print(color1)
print("\nRemove whitespace")
print(color1.str.strip())
print(color1.str.lstrip())
print(color1.str.rstrip())
OUTPUT:
CONCLUSION:
QUESTIONS:-
1. How to get rid of leading and trailing spaces in Pandas?
2. How to check if a string has whitespaces or not?
PRACTICAL NO.: 9
PRIOR CONCEPT :
Pandas provides various facilities for easily combining together Series or DataFrame with
various kinds of set logic for the indexes and relational algebra functionality in the case of
join / merge-type operations. In addition, pandas also provides utilities to compare two Series
or DataFrame and summarize their differences.
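Row-wise combining and comparison can be sketched with pd.concat and DataFrame.compare. The names and marks below are hypothetical, and compare() assumes pandas 1.1 or later.

```python
import pandas as pd

a = pd.DataFrame({"name": ["Ann", "Bo"], "marks": [200, 210]})
b = pd.DataFrame({"name": ["Cy", "Di"], "marks": [190, 222]})

# Stack the rows of b under a; ignore_index renumbers the rows 0..3.
stacked = pd.concat([a, b], ignore_index=True)

# compare() needs identically labeled frames and reports only the differing cells.
a_changed = a.copy()
a_changed.loc[1, "marks"] = 215
diff = a.compare(a_changed)

print(stacked)
print(diff)
```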
PYTHON CODE:
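import pandas as pd
# Only the 'name' column survives in this listing; the other columns are not shown.
student_data1 = pd.DataFrame({
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin']})
student_data2 = pd.DataFrame({
    'name': ['Scarlette Fisher', 'Carla Williamson', 'Dante Morse', 'Kaiser William', 'Madeeha Preston']})
print("Original DataFrames:")
print(student_data1)
print("-------------------------------------")
print(student_data2)
# Assumed step: the listing does not show how result_data is built.
# pd.concat is used here to join the two DataFrames along rows.
result_data = pd.concat([student_data1, student_data2])
print(result_data)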
OUTPUT:
CONCLUSION:
QUESTIONS:-
PRACTICAL NO. : 10
PRIOR CONCEPT :
The merge() method combines two DataFrames using the specified join method(s) and returns
a new DataFrame; the inputs are left unchanged. Pandas provides a single function, merge, as
the entry point for all standard database join operations between DataFrame objects.
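A minimal sketch of two join types follows. The tables, the key column student_id, and the values are assumptions for illustration.

```python
import pandas as pd

students = pd.DataFrame({"student_id": ["S1", "S2", "S3"], "name": ["Ann", "Bo", "Cy"]})
marks = pd.DataFrame({"student_id": ["S1", "S3", "S4"], "marks": [200, 190, 222]})

# Inner join: only the ids present in both tables.
inner = pd.merge(students, marks, on="student_id", how="inner")

# Left join: every student is kept; missing marks become NaN.
left = pd.merge(students, marks, on="student_id", how="left")

print(inner)
print(left)
print(students.shape)   # the inputs are unchanged
```

The how parameter also accepts "right" and "outer" for the remaining standard join types.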
PYTHON CODE:-
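import pandas as pd
# Only the 'name' column survives in this listing; the other columns are not shown.
student_data1 = pd.DataFrame({
    'name': ['Danniella Fenton', 'Ryder Storey', 'Bryce Jensen', 'Ed Bernal', 'Kwame Morin']})
print("Original DataFrames:")
print(student_data1)
# Assumed: the listing does not show s6 or combined_data. Here s6 is a
# one-row dictionary that is appended to student_data1 with pd.concat.
s6 = {'name': 'Scarlette Fisher'}
print("\nDictionary:")
print(s6)
combined_data = pd.concat([student_data1, pd.DataFrame([s6])], ignore_index=True)
print("\nCombined Data:")
print(combined_data)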
OUTPUT:
CONCLUSION:
QUESTIONS:-
1. How to merge 3 DataFrames in pandas Python?
2. How to merge a list of DataFrames in pandas?
3. What are the three main ways of combining DataFrames?
PRACTICAL NO. : 11
AIM:- Write a pandas program to create a date from a given year, month, and day, and another
date from a given string format.
PRIOR CONCEPT :
A time series is a sequence of data points that occur in sequential order over a given time
period. Values measured or observed over time form a time series structure. Pandas’ time
series tools are very useful when data is timestamped. Timestamp is the pandas equivalent of
Python’s datetime. It is the type used for the entries that make up a DatetimeIndex and other
time-series-oriented data structures in pandas. The simplest time series is a Series
structure indexed by timestamp.
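Creating the two dates and a small time series can be sketched as follows. The dates, the string format, and the values are assumptions chosen only for illustration.

```python
import pandas as pd

# A date from a given year, month, and day (example values are assumed).
date1 = pd.Timestamp(year=2020, month=12, day=25)

# A date from a string with a known format (example string is assumed).
date2 = pd.to_datetime("25/12/2020", format="%d/%m/%Y")

print(date1)
print(date2)

# A Series indexed by timestamps: one value per day for three days.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2020-12-25", periods=3, freq="D"))
print(ts)
```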
PYTHON CODE :-
print(date1)
print(date2)
OUTPUT:
CONCLUSION:
QUESTIONS:-
1. How does pandas handle time series data?
2. Which are the three data structures to work with the time series in Pandas?
PRACTICAL NO. : 12
AIM:- Write a Pandas program to split the following dataframe by school code and get mean,
min, and max value of age for each school.
PRIOR CONCEPT :
PYTHON CODE:-
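import pandas as pd
pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
# Only two columns survive in this listing. The 'age' values are assumed:
# the aim needs that column, but the listing does not show it.
student_data = pd.DataFrame({
    'school_code': ['s001','s002','s003','s001','s002','s004'],
    'date_Of_Birth':
    ['15/05/2002','17/05/2002','16/02/1999','25/09/1998','11/05/2002','15/09/1997'],
    'age': [12, 12, 13, 13, 14, 12]})
print("Original DataFrame:")
print(student_data)
print('\nMean, min, and max value of age for each value of the school:')
# Split the rows by school_code, then aggregate age within each group.
grouped_single = student_data.groupby('school_code').agg({'age': ['mean', 'min', 'max']})
print(grouped_single)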
OUTPUT:
CONCLUSION:
QUESTIONS:-
1. Which functions are used in the aggregation?
2. Does pandas groupby return a Series? Explain.