
Arvind Padhiyar PDS 210130107114

Government Engineering College


Sec-28 Gandhinagar

Sem – V
Subject: Python for Data Science

Subject Code: 3150713


Certificate

This is to certify that Mr./Ms. .................................................................. of class ....................
has satisfactorily completed his/her term work in ........................................ subject for the
term ending in ............... 2023.

Date: -


Preface

The main aim of any laboratory/practical/field work is to enhance the required skills and to build students' ability to solve real-time problems by developing relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for its engineering degree programs in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement among students, and it encourages students, instructors, and faculty members to use every second of the time allotted for practicals to achieve the relevant outcomes by performing experiments, rather than conducting merely study-type experiments. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical be carefully designed to serve as a tool to develop and enhance, in every student, the relevant competencies required by industry. These psychomotor skills are very difficult to develop through the traditional chalk-and-board method of content delivery in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove concepts and theory.

By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea prior to the session. This in turn enhances the predetermined outcomes among students. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes, and practical outcomes (objectives). The students will also learn the safety measures and necessary precautions to be taken while performing the practical.

This manual also provides guidelines to faculty members to facilitate student-centric lab activities through each experiment by arranging and managing the necessary resources, so that the students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.

Data Science is about data gathering, analysis, and decision-making. It is about finding patterns in data through analysis and making future predictions. By using Data Science, companies are able to make:

• Better decisions (should we choose A or B?)
• Predictive analyses (what will happen next?)
• Pattern discoveries (finding patterns, or perhaps hidden information, in the data)

Data Science is used in many industries around the world today, e.g. banking, consultancy, healthcare, and manufacturing. Python is an open-source, interpreted, high-level language that provides a great approach to data science, machine learning, and research. It is one of the best languages to use for various data science applications and projects, and it has great utility when it comes to dealing with mathematical, statistical, and scientific functions.

Utmost care has been taken while preparing this lab manual; however, there is always room for improvement. We therefore welcome constructive suggestions for improvement and for the removal of any errors.


Practical – Course Outcome matrix

Course Outcomes (COs):


1. Apply various Python data structures to effectively manage various types of data.
2. Explore various steps of data science pipeline with role of Python.
3. Design applications applying various operations for data cleansing and transformation.
4. Use various data visualization tools for effective interpretations and insights of data.
5. Perform data Wrangling with Scikit-learn applying exploratory data analysis.

Objective(s) of Experiment (mapped COs):

1. Develop a program to understand the control structures of Python. (CO1)
2. Develop a program to learn different types of structures (list, dictionary, tuples) in Python. (CO1)
3. Develop a program that reads a .csv dataset file using the Pandas library and displays the following content of the dataset: a) first five rows; b) complete data; c) summary or metadata. (CO1, CO2)
4. Develop a program that shows application of slicing and dicing over the rows and columns of the dataset. (CO1, CO2)
5. Develop a program that shows usage of aggregate functions over the input dataset: a) describe b) max c) min d) mean e) median f) count g) std h) corr. (CO1, CO2)
6. Develop a program that applies split and merge operations on the datasets. (CO1, CO2)
7. Develop a program that shows the various data cleaning tasks over the dataset: a) identifying the null values; b) identifying the empty values; c) identifying the incorrect timestamps. (CO1, CO2, CO3)
8. Develop a program that shows usage of the following NumPy array operations: a) any() b) all() c) isnan() d) isinf() e) isfinite() f) zeros() g) isreal() h) iscomplex() i) isscalar() j) less() k) greater() l) less_equal() m) greater_equal(). (CO1, CO2)
9. Develop a program that shows usage of the following NumPy vector functions: a) arange() b) reshape() c) linspace() d) randint() e) dot(). (CO1, CO2)
10. Write a program to display a plot using the matplotlib library for values of X: [1, 2, 3, ..., 49] and values of Y (thrice of X): [3, 6, 9, 12, ..., 144, 147]. (CO1, CO4)
11. Write a program to display a bar plot using the matplotlib library for languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++'] and popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]. (CO1, CO4)
12. Write a program to display a pie plot using the matplotlib library for languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++'], popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7], and colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]. (CO1, CO4)


13. Write a program to display a scatter plot using the matplotlib library for 200 random points for both X and Y. (CO1, CO4)
14. Develop a program that reads a dataset file from the URL (https://github.com/chris1610/pbpython/blob/master/data/sample salesv3.xlsx?raw=true) and plots the data of the dataset stored in the file. (CO2, CO3, CO4)
15. Write a text classification pipeline using a custom preprocessor and CharNGramAnalyzer, using data from Wikipedia articles as a training set, and evaluate the performance on some held-out test sets. (CO1, CO2, CO3, CO4, CO5)
16. Write a text classification pipeline to classify movie reviews as either positive or negative; find a good set of parameters using grid search and evaluate the performance on a held-out test set. (CO1, CO2, CO3, CO4, CO5)


Industry Relevant Skills

The following industry-relevant competencies are expected to be developed in the students undertaking the practical work of this laboratory.
1. Programming Languages
2. Mathematics, Statistical Analysis, and Probability
3. Data Mining
4. Machine Learning and AI
5. Data Visualization

Guidelines for Faculty members


1. The teacher should provide guidelines, with a demonstration of the practical, to the students, covering all features.
2. The teacher shall explain the basic concepts/theory related to the experiment to the students before starting each practical.
3. Involve all the students in the performance of each experiment.
4. The teacher is expected to share the skills and competencies to be developed in the students and ensure that the respective skills and competencies are developed in the students after the completion of the experimentation.
5. Teachers should give students the opportunity for hands-on experience after the demonstration.
6. The teacher may impart additional knowledge and skills to the students, even if not covered in the manual, that are expected of the students by the concerned industry.
7. Give practical assignments and assess the performance of students based on the assigned tasks, checking whether they are as per the instructions or not.
8. The teacher is expected to refer to the complete curriculum of the course and follow the guidelines for implementation.

Instructions for Students


1. Students are expected to listen carefully to all the theory classes delivered by the faculty members and understand the COs, the content of the course, the teaching and examination scheme, the skill set to be developed, etc.
2. Students shall organize the work in groups and make a record of all observations.
3. Students shall develop the maintenance skills expected by industries.
4. Students shall attempt to develop related hands-on skills and build confidence.
5. Students shall make a small project/application in Python.
6. Students shall develop the habit of evolving more ideas, innovations, skills, etc. beyond those included in the scope of the manual.
7. Students shall refer to technical magazines and data books.
8. Students should develop the habit of submitting the experimentation work as per the schedule, and they should be well prepared for the same.

Common Safety Instructions


Students are expected to

1. Switch on the PC carefully (do not use wet hands).
2. Shut down the PC properly at the end of your lab.
3. Handle the peripherals (mouse, keyboard, network cable, etc.) carefully.
4. Use a laptop in the lab only after getting permission from the teacher.


Index (Progressive Assessment Sheet)

Sr. No. | Objective(s) of Experiment | Page No. | Date of performance | Date of submission | Assessment Marks | Sign. of Teacher with date | Remarks

Total


Institute Vision/Mission
Vision:

• To be a premier engineering institution, imparting quality education for innovative solutions relevant to society and the environment.

Mission:
• To develop human potential to its fullest extent so that
intellectual and innovative engineers can emerge in a wide range of
professions.
• To advance knowledge and educate students in engineering and
other areas of scholarship that will best serve the nation and the
world in future.
• To produce quality engineers, entrepreneurs and leaders to meet the
present and future needs of society as well as environment.

Computer Engineering Department

Vision/Mission:
Vision:

• To achieve excellence in providing value-based education in Computer Engineering through innovation, teamwork, and ethical practices.

Mission:

• To produce computer science and engineering graduates according to the needs of industry, government, society, and the scientific community.
• To develop partnerships with industries, government agencies, and R&D organizations.
• To motivate students/graduates to be entrepreneurs.
• To motivate students to participate in reputed conferences, workshops, symposiums, seminars, and related technical events.

Program Educational Outcome (PEO)


• To provide students with a strong foundation in the mathematical,
scientific and engineering fundamentals necessary to formulate,
solve and analyze engineering problems and to prepare them for
graduate studies, R&D, consultancy and higher learning.
• To develop an ability to analyze the requirements of the software,
understand the technical specifications, design and provide novel
engineering solutions and efficient product designs.
• To provide exposure to emerging cutting-edge technologies,
adequate training & opportunities to work as teams on
multidisciplinary projects with effective communication skills and
leadership qualities.
• To prepare the students for a successful career and work with values &
social concern bridging the digital divide and meeting the requirements
of Indian and multinational companies.
• To promote student awareness of life-long learning and to introduce them to professional ethics and codes of professional practice.

PSO
By the completion of Computer Engineering program, the student will have
following Program specific outcomes.

• Design, develop, test, and evaluate computer-based systems by applying standard software engineering practices and strategies in the areas of algorithms, web design, data structures, and computer networks.
• Apply knowledge of ethical principles required to work in a team as well as to lead
a team


Experiment No: 1
Develop a program to understand the control structures of Python.

Competency and Practical Skills:


Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.
Practical skills:

• Basic understanding of Python programming language


• Understanding of Python control structures
• Ability to use Python's built-in functions and libraries
• Familiarity with Python's syntax
• Problem-solving skills

Relevant CO: CO1

Objectives: (a) To learn and understand the different control structures in Python, such as
loops, conditional statements, and functions.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Conditional statements: Conditional statements in Python allow you to execute certain blocks
of code based on whether a certain condition is true or false. The two main types of conditional
statements in Python are "if" statements and "if-else" statements.

Loops: Loops in Python allow you to repeat a block of code multiple times, either for a fixed
number of times or until a certain condition is met. The two main types of loops in Python are
"for" loops and "while" loops.

Functions: Functions in Python allow you to encapsulate blocks of code and reuse them
throughout your program. Functions can accept parameters and return values, making them a
powerful tool for organizing and structuring your code.

Scope: Scope in Python refers to the region of your program where a variable or function is
visible and accessible. Understanding scope is critical for avoiding errors and ensuring that
your code is organized and easy to maintain.

Error handling: Error handling in Python involves detecting and responding to errors that may
occur during program execution. Proper error handling can help you avoid crashes and ensure
that your program continues to run smoothly.
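A minimal sketch tying these constructs together (the names and values are illustrative):

# Conditional statement: branch on a condition
age = 20
if age >= 18:
    print("Adult")
else:
    print("Minor")

# for loop: repeat a fixed number of times
for i in range(3):
    print("iteration", i)

# while loop: repeat until a condition becomes false
count = 0
while count < 3:
    count += 1

# Function: encapsulate a block of code for reuse
def square(x):
    return x * x

print(square(5))  # 25

# Error handling: respond to runtime errors instead of crashing
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")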


Safety and necessary Precautions:

1. Data validation.
2. Check the data types.
3. Input sanitization.
4. Error Handling and Secure coding practices.
5. Use comments.
6. Test your code.

Procedure:

1. Plan the program structure and flow: Develop a plan for the program structure, including the control structures that will be included, and the flow of the program logic.

2. Implement the control structures in Python: Write the code to implement the different
control structures in Python, including conditional statements, loops, and functions.

3. Test and debug the program: Conduct thorough testing of the program to ensure that it
is functioning correctly and identify and troubleshoot any errors or bugs.

4. Refine and optimize the program: Refine the program as needed to improve
performance and optimize its functionality, based on user feedback and testing results.

5. Document the program: Provide clear documentation of the program's purpose, functionality, and limitations, as well as any potential security risks or necessary precautions.

6. Deploy and maintain the program: Deploy the program for use by users, and maintain
it by addressing any issues or bugs that arise and providing updates and new features as
needed.

Observations:

1. To check whether a string is a palindrome or not.

Code :
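A minimal sketch of such a palindrome check (input handling kept simple):

s = input("Enter a string: ").lower().replace(" ", "")
if s == s[::-1]:          # a palindrome reads the same when reversed
    print("Palindrome")
else:
    print("Not a palindrome")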


2. To check whether a number is prime or not.

Code :
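A minimal sketch of a primality check using trial division up to the square root:

n = int(input("Enter a number: "))
is_prime = n > 1
for i in range(2, int(n ** 0.5) + 1):
    if n % i == 0:        # found a divisor, so n is not prime
        is_prime = False
        break
print("Prime" if is_prime else "Not prime")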

3. To find nth Fibonacci number.

Code :
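A minimal iterative sketch (here the sequence is taken to start 0, 1, 1, 2, ...):

def fib(n):
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b   # shift the pair one step along the sequence
    return a

n = int(input("Enter n: "))
print("Fibonacci number", n, "is", fib(n))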

Conclusion:

In this experiment I learned about control structures and functions. Control structures allow us to make decisions and repeat actions based on certain conditions. We can define a function once and then reuse it multiple times throughout the program, reducing code duplication and making the codebase more maintainable.


Quiz:


Suggested Reference:

1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:


Experiment No: 2
Develop a program to learn different types of structures (list, dictionary,
tuples) in Python.

Competency and Practical Skills:


Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.
Practical skills:

• Basic programming concepts: You should have a good grasp of basic programming
concepts such as variables, data types, conditional statements, loops, and functions.
• Python programming language: You should have a good understanding of Python
syntax, data structures, and standard library functions.
• Sequences: Sequences are ordered collections of elements that can be accessed by their
index or key. You should have a good understanding of the different types of sequences
such as string, tuple, list, dictionary, and set, and their respective properties.
• String manipulation: You should know how to manipulate them using methods such as
slicing, concatenation, and formatting.
• Collection manipulation: Collections such as lists, tuples, dictionaries, and sets can be
manipulated using methods such as append, insert, remove, pop, and sort.
• Iteration: You should know how to use for loops and list comprehensions to iterate over
sequences.
• Conditional statements: You should know how to use conditional statements to check
for specific conditions in sequences.
• Functions: You should know how to define functions that operate on sequences and
return values.

Relevant CO: CO1

Objectives: (a) To learn how to manipulate and access their elements, iterate over them,
perform conditional operations on them, and use them in functions.

(b) To learn how to select the appropriate sequence type for a given task based on its properties
and performance characteristics.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
1. In the Python programming language, there are four built-in sequence types: strings, lists, tuples, and ranges. Additionally, Python includes the set and dictionary data structures, which are implemented as unordered collections of unique elements and of key-value pairs, respectively.

2. The string data type in Python represents a sequence of characters and is immutable,
meaning its contents cannot be changed once it is created. Strings can be manipulated
using various methods such as slicing, concatenation, and formatting.


3. Lists and tuples are similar in many ways, but tuples are immutable, whereas lists are
mutable. Lists and tuples can hold elements of any data type and can be indexed and
sliced like strings. However, lists offer additional methods such as append, insert,
remove, and pop that allow for manipulation of the list's contents.

4. Dictionaries are another important built-in data structure in Python and are implemented as unordered collections of key-value pairs. Each element in a dictionary consists of a key and a corresponding value. Dictionaries can be used to store and retrieve data quickly based on the key.

5. Sets are collections of unique elements that are unordered and mutable. Sets are often
used to perform set operations such as union, intersection, and difference.

Safety and necessary Precautions:

1. Use of proper data validation.


2. Secure data storage.
3. Proper error handling.
4. Testing and debugging.
5. Keeping software up to date.
6. Proper code formatting and documentation.

Procedure:
1. Create a string variable using single or double quotes.
Use string methods like upper(), lower(), strip(), split(), join(), and replace() to manipulate
the string as needed.
Use indexing and slicing to access specific characters or substrings within the string.
2. Create a tuple variable using parentheses.
Use indexing and slicing to access specific elements or subsets within the tuple.
Tuples are immutable, so you cannot add, remove or modify elements once created.
3. Create a list variable using square brackets.
Use indexing and slicing to access specific elements or subsets within the list.
Use list methods like append(), insert(), remove(), pop(), extend(), and sort() to modify the
list as needed.
Lists are mutable, so you can add, remove or modify elements once created.
4. Create a dictionary variable using curly braces or the dict() constructor.
Use keys to access values within the dictionary.
Use dictionary methods like keys(), values(), and items() to access different parts of the
dictionary.
Use del or pop() to remove elements from the dictionary.
Use assignment to add or modify elements in the dictionary.
5. Create a set variable using curly braces or the set() constructor.
Use set methods like add(), remove(), pop(), union(), and intersection() to modify or
perform operations on the set.
Sets do not allow duplicate elements, so adding the same element multiple times will only
add it once.


Code :

print('Example of String')
s1 = 'gecg 28sector'
print("\nThe String is : ",s1)
print("\nConvert string in upper case : ", s1.upper())
print("\nConvert string in lower case : ", s1.lower())
print("\nString split example : ", s1.split())
s2 = " gecg "
print("\nString strip example : ",s2.strip(),"is best college")
s3 = ('', 'z', 'b', '')
print("\nString join example : ", 'a'.join(s3))
print("\nString replace example : ", s1.replace("28sector", "28 sector"))

print("\nString Indexing example : ")


print("Indexing using positive number : ",s1[1])
print("Indexing using negative number : ",s1[-4])
print("\nString Slicing example : ")
print(s1[3:9])
print(" \n")

print("\nExample of Tuple")
tup = ('ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot')
print("\nItems in Tuple is : ",tup)
print("\nTuple Indexing example : ")
print("Indexing using positive number : ",tup[1])
print("Indexing using negative number : ",tup[-1])
print("\nTuple Slicing example : ")
print(tup[1:3])
print(" \n")

print("\nExample of List")
l = ['ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot']
print("\nItems in List is : ",l)
print("\nList Indexing example : ")
print("Indexing using positive number : ",l[2])
print("Indexing using negative number : ",l[-2])
print("\nList Slicing example : ")
print(l[2:5])
l.append('mehsana')
print("\nList append example : ", l)
l.insert(2, 'mandvi')
print("\nList insert example : ", l)
l.remove('mandvi')
print("\nList remove example : ", l)
l.pop(3)
print("\nList pop example : ", l)
country = ['India', 'Europe', 'Pakistan']
l.extend(country)
print("\nList extend example : ", l)


list_sort = [1, 4, 56, 2, 13]


list_sort.sort()
print("\nList sort example : ", list_sort)
print(" \n")

print("\nExample of Dictionary")
dict = {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}
print("\nKeys and Values in Dictionary is : ",dict)
print("\nAccessing by Keys : ", dict.get(4), " , ", dict.get(2))
print("\nKEYS in Dictionary : ", dict.keys())
print("\nVALUES in Dictionary: ", dict.values())
print("\nITEMS in Dictionary: ", dict.items())
print("\nPOPPING from Dictionary : ", dict.pop(3))
del dict[1]
print("\nDELETE from Dictionary: ", dict)
print(" \n")

print("\nExample of Set")
set = {'India', 'Europe', 'Pakistan'}
print("\nValues in Set is : ",set)
set.add('America')
print("\nAdding item in set : ", set)
set.remove('America')
print("\nRemoving item from set : ", set)
set.pop()
print("\nPopping item from set : ", set)
set1={1,2,3,4}
set2={4,5,6,7}
print("\nUnion of set : ",set1.union(set2))
print("\nIntersection of set : ",set1.intersection(set2))


Output :


Conclusion:

In this experiment I learned the different types of data structures used in Python, such as strings, tuples, and lists. Strings handle and modify text, and different operations can be performed on a string using its methods, such as upper() and lower(). Tuples are unchangeable collections; we cannot add or remove elements once a tuple is created. Lists are mutable sequences with various methods; we can append() or remove() elements to modify a list as needed. Dictionaries hold key-value pairs for efficient lookups, and their keys and values can be accessed directly. Sets store distinct elements and support set operations like union and intersection; a set does not allow duplicate elements, so adding the same element multiple times will only add it once.

Quiz:

1. What method can you use to convert a string to uppercase in Python?
Ans: We can use the upper() method to convert a string to uppercase in Python.

Example:
s = "Hello, World!"
uppercase_string = s.upper()
print(uppercase_string) # Output: "HELLO, WORLD!"

2. What is the difference between a tuple and a list in Python?


Ans:

Tuple: Ordered, immutable collection of elements. Defined with parentheses ( ). Typically used for grouping related but unchangeable data.

Example :
tup = ('ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot')
print(tup) # Output: ('ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot')
List: Ordered, mutable collection of elements. Defined with square brackets [ ]. Suited for
storing and modifying sequences of data.
Example :
l = ['ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot']
print(l) # Output: ['ahmedabad', 'baroda', 'kutch', 'gandhinagar', 'rajkot']

3. How do you add an element to a list in Python?


Ans: We can use the append() method to add an element to the end of a list, or the insert()
method to add an element at a specific position.

Example:
my_list = [1, 2, 3]
my_list.append(4)     # Adds 4 to the end of the list
my_list.insert(1, 5)  # Inserts 5 at index 1

4. How do you access a value in a dictionary using its key in Python?


Ans: We can use the key within square brackets to access the value associated with that key
in a dictionary.

Example:
my_dict = {'name': 'John', 'age': 30}
person_name = my_dict['name']  # Accessing value using key 'name'
print(person_name)             # Output: 'John'



5. What is a set in Python?


Ans: A set is an unordered collection of unique elements. It is defined using curly braces { }
or the set() constructor. Sets are useful for tasks involving membership testing, removing
duplicates, and set operations like union, intersection, and difference.

Example:
my_set = {1, 2, 3}
my_set.add(4)     # Adds element 4 to the set
my_set.remove(2)  # Removes element 2 from the set
print(my_set)     # Output: {1, 3, 4}

Suggested Reference:

1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:


Experiment No: 3

Develop a program that reads a .csv dataset file using Pandas library and
display the following content of the dataset.

a) First five rows of the dataset


b) Complete data of the dataset
c) Summary or metadata of the dataset.

Competency and Practical Skills:

Competency skills:

• Knowledge of the Python programming language and its libraries, particularly the Pandas library.
• Understanding of the structure of .csv files and how to read and manipulate them
using Pandas.
• Familiarity with the different methods and functions available in Pandas, such as
"head()", "print()", "display()", "info()", and "describe()".
• Ability to write and debug code, and troubleshoot errors that may arise when
working with datasets.
• Experience in working with datasets, including data cleaning, data wrangling, and
data analysis.
• Ability to understand the content and structure of datasets, and use them to derive
insights and information.

Practical skills:

• Writing code to load a .csv dataset file into a Pandas DataFrame using the "read_csv()" function.
• Using the "head()" method to display the first five rows of the dataset.
• Using the "print()" function or "display()" method to display the complete data of the dataset.
• Using the "info()" method or "describe()" method to display the summary or metadata of the dataset.
• Handling errors and exceptions that may arise when working with datasets.
• Writing clean and efficient code that is easy to read and maintain.
• Testing the program with different datasets to ensure its accuracy and reliability.

Relevant CO: CO1, CO2

Objectives:


(a) To read and load the .csv dataset file into a Pandas DataFrame.
(b) To display the first five rows of the dataset using the "head()" method.
(c) To display the complete data of the dataset using the "print()" function or
"display()" method.
(d) To display the summary or metadata of the dataset using the "info()" method or
"describe()" method.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:

Pandas is a popular data manipulation library for Python, widely used in data science and machine learning. It provides a powerful and flexible toolset for working with structured data, including loading, manipulating, and analyzing datasets in various formats such as .csv files.

Safety and necessary Precautions:

1. Data security, quality and privacy.


2. Memory and performance optimization.
3. Error handling and exception handling.
4. Use comments.
5. Test your code.

Procedure:

1. Import the Pandas library: To use the Pandas library in Python, it is essential to import
it into your program. You can do this by using the "import pandas as pd" statement.
2. Load the dataset: The next step is to load the dataset into a Pandas DataFrame using
the "read_csv()" function. This function takes the path to the .csv file as an argument
and returns a DataFrame object that contains the data from the file.
3. Display the first five rows: To display the first five rows of the dataset, you can use
the "head()" method. This method returns the first five rows of the DataFrame by
default, but you can specify the number of rows you want to display as an argument.
4. Display the complete data: To display the complete data of the dataset, you can use
the "print()" function or "display()" method. This will output the entire DataFrame to
the console or Jupyter Notebook.
5. Display summary or metadata: To display the summary or metadata of the dataset,
you can use the "info()" method or "describe()" method. The "info()" method provides
information about the DataFrame, including the number of rows and columns, data
types, and memory usage. The "describe()" method provides statistical summary of
the dataset, including count, mean, standard deviation, minimum, maximum, and
quartiles for each column.
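Putting these steps together, a minimal sketch of the program (the filename data.csv is an assumption):

import pandas as pd

# Load the .csv dataset file into a DataFrame
df = pd.read_csv("data.csv")

# a) First five rows of the dataset
print(df.head())

# b) Complete data of the dataset
print(df)

# c) Summary or metadata of the dataset
df.info()             # columns, non-null counts, dtypes, memory usage
print(df.describe())  # statistical summary of the numeric columns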


Experiment No: 4

Develop a program that shows application of slicing and dicing over the
rows and columns of the dataset.

Competency and Practical Skills:

Competency skills:

1. Basic knowledge of computer systems, operating systems, and file systems.


2. Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
3. Understanding of programming languages, syntax, and logic.

Practical skills:

• Knowledge of Python programming language.


• Familiarity with Pandas library.
• Ability to read and load dataset files.
• Familiarity with slicing and dicing operations.
• Understanding of data indexing.
• Familiarity with data cleaning and preprocessing.
• Knowledge of data visualization.
• Problem-solving skills.
• Strong analytical and statistical skills.

Relevant CO: CO1, CO2

Objectives: (a) To gain insights into the dataset and extract meaningful information from it.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:

Slicing and dicing are powerful operations that allow data analysts to manipulate data
by selecting specific subsets of data from a larger dataset. These operations are widely
used in data analysis and are a crucial aspect of data manipulation.

In the context of Python, slicing refers to extracting specific portions of data from a
larger data structure, such as a list, tuple, or DataFrame. Slicing is performed by
specifying the start and end indices of the portion of data to be extracted. For
example, in a list of numbers, slicing can be used to extract the first three numbers or
the last five numbers. In a DataFrame, slicing can be used to extract specific rows or
columns based on specific conditions or criteria.

Dicing, on the other hand, refers to grouping and aggregating data based on specific
criteria. This involves dividing the data into smaller subsets based on specific
categories or conditions and performing aggregation functions on each subset. For
example, in a dataset containing sales data, dicing can be used to group the data by
product type, region, or time period and calculate the total sales for each group.


In Python, the Pandas library provides powerful tools for slicing and dicing data in a
DataFrame. The .loc and .iloc methods are used for slicing rows and columns based
on specific conditions or criteria. The .groupby method is used for grouping data
based on specific categories, and aggregation functions such as .sum(), .mean(), and
.count() can be used to perform calculations on each group. The .pivot_table method
is used for creating pivot tables, which provide a summarized view of the data by
grouping and aggregating data based on specific categories.
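As a small illustration, a sketch on a made-up DataFrame (the column names and values are invented for the example):

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 200, 250],
})

print(df.iloc[0:2])                                    # slice the first two rows by position
print(df.loc[df["sales"] > 120, ["region", "sales"]])  # rows by condition, columns by label
print(df.groupby("region")["sales"].sum())             # dice: group by region and aggregate
print(df.pivot_table(values="sales", index="region",
                     columns="product", aggfunc="sum"))  # summarized pivot view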

Safety and necessary Precautions:

1. Backup the original data.


2. Validate the data.
3. Check the output.
4. Secure the data.
5. Use appropriate tools: the Pandas library provides powerful tools for data
manipulation and analysis.

Procedure:

1. Load the dataset: Load the dataset into Python using the Pandas library's read_csv
function.
2. Explore the dataset: Use the head, tail, and info functions to explore the dataset and
get a sense of its structure and contents.
3. Slice and dice the data: Use the Pandas DataFrame's indexing and slicing operations
to select specific rows and columns of the dataset. Examples of slicing operations
include loc, iloc, and [ ].
4. Apply filtering: Use Boolean indexing to filter rows of the dataset based on specific
criteria.
5. Aggregate the data: Use the groupby function to group the data by specific columns
and apply aggregation functions such as sum, mean, and count.
6. Visualize the data: Use visualization libraries such as Matplotlib or Seaborn to create
visualizations of the sliced and diced data.
7. Refine and iterate: Refine the analysis and iterate as needed based on the insights
gained from the analysis.


Observation:

Titanic.csv

a) Load the dataset

Code :

import pandas as pd

df = pd.read_csv('/content/titanic_train.csv')

b) Explore the dataset

Code : First five rows :


Code : Last five rows :

Code : Summary of dataset :
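A sketch covering these three exploration calls, continuing with the DataFrame df loaded above:

print(df.head())   # first five rows
print(df.tail())   # last five rows
df.info()          # summary: columns, non-null counts, dtypes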

c) Slice and dice the data:

Code :
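A sketch of slicing by position and by label, using the standard Titanic column names:

print(df.iloc[0:5, 0:4])             # rows 0-4 and the first four columns, by integer position
print(df.loc[0:4, ["Name", "Age"]])  # the same rows by label, with selected columns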


d) Apply filtering:

Code :
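A sketch of Boolean indexing on the same DataFrame:

adults = df[df["Age"] > 18]                                            # single condition
women_first_class = df[(df["Sex"] == "female") & (df["Pclass"] == 1)]  # combined conditions
print(adults.shape, women_first_class.shape)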

e) Aggregate the data

Code:
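A sketch of grouping and aggregation:

print(df.groupby("Pclass")["Fare"].mean())  # average fare per passenger class
print(df.groupby("Sex")["Survived"].sum())  # number of survivors per sex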

f) Visualize the data

Code:
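One possible sketch of visualizing the grouped data:

import matplotlib.pyplot as plt

df.groupby("Pclass")["Fare"].mean().plot(kind="bar")  # bar chart of mean fare per class
plt.xlabel("Passenger class")
plt.ylabel("Mean fare")
plt.show()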


Conclusion:

In this experiment I learned slicing and dicing, filtering data, aggregate functions, and visualizing data. The .loc and .iloc methods are used for slicing rows and columns based on specific conditions or criteria. The .groupby method is used for grouping data based on specific categories, and aggregation functions such as .sum(), .mean(), and .count() can be used to perform calculations on each group. I used visualization libraries such as Matplotlib to create visualizations of the sliced and diced data, and used the groupby function to group the data by specific columns and apply aggregation functions such as sum, mean, count, min, and max.

Quiz:

1. What is the purpose of slicing and dicing in data analysis?


Ans: Slicing and dicing involves selecting and extracting specific portions of data from
a dataset.
It serves several purposes:

• Focus on relevant information.


• Simplify complex datasets.
• Facilitate exploration and analysis of specific aspects.
• Enable easier visualization and interpretation.

2. Which function of the Pandas library is used to load a .csv dataset file into Python?

Ans: The function used is pandas.read_csv(). It reads a comma-separated values (csv) file into a DataFrame.

3. What is the difference between loc and iloc in Pandas DataFrame indexing?
Ans:
loc[ ]: Uses labels to access rows and columns. It includes the last element in the range.
Example: df.loc['A':'C', 'Column1'].
iloc[ ]: Uses integer positions to access rows and columns. It excludes the end of the range.
Example: df.iloc[0:2, 0].


4. How can Boolean indexing be used to filter rows of a dataset based on specific
criteria?

Ans: Boolean indexing involves applying a condition to a column, resulting in a Boolean Series. This Series is then used to filter rows.

Example:

filtered_data = df[df['Column'] > 10]

5. What is the purpose of aggregation functions in data analysis?


Ans: Aggregation functions (e.g., sum, mean, count) are used to summarize or condense
large amounts of data into a single value. They provide insights into the overall trends or
characteristics of the dataset.

6. Which visualization libraries can be used to create visualizations of the sliced and
diced data?

Ans: Common visualization libraries include Matplotlib and Seaborn. They allow you to
create various types of plots and charts to visually represent the data.

7. What is the importance of documenting the slicing and dicing process during data
analysis?

Ans: Documentation ensures transparency, repeatability, and collaboration in data analysis. It helps others understand the steps taken and the rationale behind them. It also aids in troubleshooting and reproducing results.

8. What is the advantage of iterating and refining the analysis during the slicing and
dicing process?

Ans: Iterating allows for a more comprehensive exploration of the data. Refining the analysis
helps in uncovering deeper insights and ensuring that the conclusions drawn are robust and
accurate.

9. Can slicing and dicing be applied only to numerical data or can it also be applied to
categorical data?

Ans: Slicing and dicing can be applied to both numerical and categorical data. For
categorical data, it involves selecting specific categories, grouping, and performing analyses
based on categorical attributes.

10. How can the insights gained from slicing and dicing be used to make data-driven
decisions?

Ans: Insights gained from slicing and dicing guide informed decision-making. They can
inform strategies, optimizations, and actions within an organization or project, based on a
thorough understanding of the data.


Suggested Reference:

1. "Python for Data Analysis" by Wes McKinney


2. "Python Data Science Handbook" by Jake VanderPlas
3. "Pandas User Guide" on the Pandas documentation website
4. "Data Wrangling with Pandas" course on DataCamp
5. "Data Manipulation with Pandas" course on Coursera

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/


Experiment No: 5

Develop a program that shows usage of aggregate function over the input
dataset. a) describe b) max c) min d) mean e) median f) count g) std h) Corr

Competency and Practical Skills:

Competency skills:

• Knowledge of the input dataset format (e.g. CSV, Excel, JSON) and how to load it
into a data structure in Python using libraries like Pandas.
• Understanding of the different aggregate functions available in Pandas, such as
describe, max, min, mean, median, count, std, and corr.
• Familiarity with the syntax of Pandas functions for applying aggregate functions, such
as groupby, apply, and agg.
• Ability to interpret and analyze the results of the aggregate functions to gain insights
about the dataset.

Practical skills:

• Loading the input dataset into a Pandas DataFrame object.


• Applying the desired aggregate functions to the DataFrame using the appropriate

syntax.

• Displaying the results of the aggregate functions in a user-friendly format, such as a

table or chart.

• Handling any errors or exceptions that may arise during the data manipulation
process.

Relevant CO: CO1, CO2

Objectives: (a) To understand the concept of aggregate functions and their usage in
data analysis.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:

In data analysis, aggregate functions are used to calculate summary statistics over a
dataset. These functions are applied to columns or rows of a dataset to calculate
values like the maximum, minimum, mean, median, count, standard deviation, and
correlation.


Here is a brief overview of the aggregate functions:

a) describe: This function generates descriptive statistics that summarize the central
tendency, dispersion, and shape of a dataset's distribution.

b) max: This function is used to find the maximum value of a column or row.

c) min: This function is used to find the minimum value of a column or row.

d) mean: This function is used to find the average value of a column or row.

e) median: This function is used to find the median value of a column or row.

f) count: This function is used to count the number of non-null values in a column or
row.

g) std: This function is used to calculate the standard deviation of a column or row.

h) Corr: This function is used to calculate the correlation between columns or rows of
a dataset.

In Python, these aggregate functions can be applied using the Pandas library. The
groupby() function is used to group data based on a specified column, and the
aggregate functions can then be applied to the grouped data.

Safety and necessary Precautions:

1. Make sure that the input dataset is clean and well-formatted.


2. Check the data types of the columns in the dataset.
3. Be careful when working with large datasets, as some aggregate functions may
require a lot of computational power and memory.
4. Double-check the output of the aggregate functions to ensure that they make sense
and match the expected results.

Procedure:

1. Import necessary libraries: You will need to import Pandas library to load the dataset
and perform various operations on it.
2. Load the dataset: Load the dataset in a Pandas dataframe using the read_csv()
function. Make sure the dataset is in a CSV format and is saved in your working
directory.
3. Check the dataset: Print the first few rows of the dataset using the head() function to
check if the dataset is loaded correctly.
4. Describe the dataset: Use the describe() function to get the summary statistics of the
dataset, such as count, mean, standard deviation, minimum, and maximum values.
5. Apply aggregate functions: Apply the aggregate functions such as max(), min(), mean(), median(), count(), std(), and corr() on the dataset.

6. Display the results: Display the results of the aggregate functions to the user.


Observation:

Titanic.csv

a. Load the dataset and exploring dataset

b. Describe the dataset


c. Apply aggregate functions
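A minimal sketch of these steps on the Titanic file (the filename and the choice of columns are assumptions):

import pandas as pd

df = pd.read_csv("titanic_train.csv")

print(df.head())                             # explore the dataset
print(df.describe())                         # a) summary statistics of numeric columns
print(df["Age"].max(), df["Age"].min())      # b), c) maximum and minimum
print(df["Age"].mean(), df["Age"].median())  # d), e) mean and median
print(df["Age"].count(), df["Age"].std())    # f), g) non-null count and standard deviation
print(df[["Age", "Fare"]].corr())            # h) correlation between two columns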

Conclusion:

In this experiment I learned aggregate functions such as max(), min(), mean(), median(), count(), std(), and corr() on the dataset. I used the describe() function to get the summary statistics of the dataset, such as count, mean, standard deviation, minimum, and maximum values, and found the minimum, maximum, standard deviation, etc. using aggregate functions.


Quiz:

1. What is the purpose of using aggregate functions in a dataset?


Ans: The purpose of using aggregate functions in a dataset is to perform mathematical
operations on a set of values, typically within a specific column or group of columns. These
functions summarize or aggregate the data, providing insights into the overall characteristics
or trends within the dataset.
2. Which aggregate function calculates the average of a numerical column?

Ans: The aggregate function that calculates the average of a numerical column is typically
referred to as the "mean" or "average" function. In SQL, it is often represented as AVG().

3. Which of the following aggregate functions calculates the correlation between two
numerical columns?

Ans: The aggregate function that calculates the correlation between two numerical columns is
not a standard aggregate function in most database systems. Instead, it's usually calculated
using specialized functions or methods in data analysis libraries like pandas in Python.

4. Which of the following aggregate functions returns the number of non-missing values in
a column?

Ans: The aggregate function that returns the number of non-missing values in a column is
typically referred to as the "count" function. In SQL, it is represented as COUNT().

5. What is the purpose of using the describe() function in Pandas?

Ans: In Pandas, the describe() function is used to generate summary statistics of a DataFrame
or a Series. It provides information such as count, mean, standard deviation, minimum and
maximum for numerical columns. This function is helpful in quickly understanding the
distribution and characteristics of the data in a DataFrame.

Suggested Reference:

1. https://pandas.pydata.org/docs/
2. https://numpy.org/doc/stable/

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/


Experiment No: 6
Develop a program that applies split and merge operations on the datasets.
Competency and Practical Skills:
Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.

Practical skills:

• Understanding of Data Structures


• Knowledge of Programming Languages
• Familiarity with Data Manipulation Libraries
• Understanding of Splitting and Merging Operations
• Proficiency in Using IDEs and Text Editors
• Problem Solving and Troubleshooting Skills

Relevant CO: CO1, CO2

Objectives: (a) To split large datasets into smaller ones for ease of handling and processing.
(b) To consolidate information and make it easier to analyze.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Python provides several built-in functions and libraries for performing split and merge
operations on datasets. Here are some examples:

Splitting a Dataset:

Using the split() method: split() is a built-in string method in Python that splits a string into a list of substrings based on a specified delimiter; Pandas exposes the same operation for whole columns via Series.str.split(). This can be useful for splitting delimited fields of a dataset.

Using the numpy.array_split() function: The numpy.array_split() function can be used to split
a numpy array into smaller arrays of equal or nearly equal size.

Merging Datasets:

Using the pandas.concat() function: The pandas.concat() function can be used to concatenate
pandas dataframes along a specified axis.

Using the numpy concatenate() function: The concatenate() function can be used to merge two
or more arrays into a single array.

Safety and necessary Precautions:

1. Check for data consistency.


2. Avoid overwriting original data.


3. Check for duplicates.
4. Handle missing data.
5. Test the code thoroughly.

Procedure:
1. Define the input datasets: Determine the input datasets and their format. It could be
CSV files, Excel files, or other file types. Also, define the delimiter or separator
character for splitting the data.

2. Load the datasets: Load the datasets into the program using the appropriate libraries and
functions. Check that the data is loaded correctly and perform any necessary data
cleaning or formatting.

3. Split the datasets: Use the appropriate function or library to split the datasets into
smaller chunks. Specify the size or number of chunks to create and ensure that the
resulting datasets are consistent and valid.

4. Merge the datasets: Use the appropriate function or library to merge the datasets into a
single dataset. Specify the method of merging and ensure that the resulting dataset is
consistent and valid.

5. Handle missing or duplicate data: Check for any missing or duplicate data in the merged
dataset and handle them appropriately. You can choose to remove the records with
missing data or impute the missing values.

6. Perform calculations or analysis: Once the datasets are merged, you can perform any
necessary calculations or analysis on the resulting dataset. This could include
aggregating data, calculating averages, or performing statistical analysis.

Observation:

Titanic.csv


Code :
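A minimal sketch of the split and merge steps on the Titanic file (the filename is an assumption):

import numpy as np
import pandas as pd

df = pd.read_csv("titanic_train.csv")

# Split the dataset into three roughly equal chunks
chunks = np.array_split(df, 3)
for i, chunk in enumerate(chunks):
    print("chunk", i, "shape:", chunk.shape)

# Merge the chunks back into a single dataset
merged = pd.concat(chunks, ignore_index=True)

# Handle missing and duplicate data in the merged dataset
merged["Age"] = merged["Age"].fillna(merged["Age"].median())
merged = merged.drop_duplicates()
print(merged.shape)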


Output :

Conclusion:

In this experiment, we loaded, split, merged, cleaned, and analyzed the "titanic.csv" dataset. Through
this process, we've acquired fundamental skills for data manipulation and analysis. We've learned how
to manage data quality, handle missing values, and remove duplicates to create a reliable dataset. These
skills are critical for data analysis, machine learning, and informed decision-making, forming the
foundation for working with real-world datasets in a structured and insightful manner.

Quiz:

1. What are the key steps involved in developing a program that applies split and
merge operations on datasets?
Ans: The key steps involved in developing a program that applies split and merge
operations on datasets include data loading, data splitting, data merging, handling
missing or duplicate data, and performing analysis.

2. What library or function can be used to split the input datasets into smaller
chunks?


Ans: The Pandas library in Python can be used to split the input datasets into smaller
chunks.

3. What should you do if the merged dataset contains missing or duplicate data?
Ans: If the merged dataset contains missing data, you should handle it by removing or
imputing missing values. For duplicate data, remove the duplicates to ensure data
integrity.

4. What should you do after developing the program?


Ans: After developing the program, you can apply it to your specific dataset,
customize it as needed, and use it to split, merge, clean, and analyze your data. The
final step would depend on the purpose of your analysis, which might include making
data-driven decisions or building machine learning models.

Suggested Reference:

1. https://pandas.pydata.org/docs/
2. https://numpy.org/doc/stable/

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/


Experiment No: 7
Develop a program that shows the various data cleaning tasks over the dataset. a)
Identifying the null values. b) Identifying the empty values c) Identifying the
incorrect timestamp
Competency and Practical Skills:
Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.

Practical skills:

• Basic understanding of Python programming language


• Familiarity with data cleaning techniques, including identifying null and empty values,
handling incorrect timestamps, and removing outliers.
• Knowledge of statistical methods and data visualization techniques to identify anomalies and
outliers in the data.
• Familiarity with data cleaning libraries and tools, such as Pandas and NumPy in Python
• Problem-solving skills

Relevant CO: CO1, CO2, CO3

Objectives: (a) To identify and handle missing or incomplete data in the dataset.
(b) To identify and handle invalid or incorrect data in the dataset.
(c) To remove duplicate data in the dataset.
(d) To standardize data formats and values to ensure consistency across the dataset.
(e) To handle outliers and extreme values that may skew data analysis results.
(f) To ensure data accuracy and completeness for reliable data analysis.
(g) To improve data quality by reducing errors and inconsistencies in the dataset.
(h) To prepare the dataset for further analysis and modeling.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Data cleaning is an essential step in the data preparation process that involves identifying and handling
missing, incorrect, or inconsistent data in the dataset. In Python, data cleaning is typically performed
using libraries such as NumPy and Pandas, which provide functions for data manipulation and analysis.

The theory behind data cleaning in Python involves several key steps:

Importing data: The first step in data cleaning is to import the data into Python using the appropriate
library and data format. Common data formats include CSV, Excel, and JSON.

Identifying missing data: Once the data is imported, the next step is to identify missing data in the
dataset. This can be done using the isnull() function in Pandas, which returns a Boolean value
indicating whether a value is missing or not.

Handling missing data: Once missing data is identified, the next step is to handle it appropriately. This


can be done by either removing the rows or columns with missing values or imputing the missing
values with a suitable value such as the mean or median of the column.

Identifying incorrect data: After handling missing data, the next step is to identify incorrect data in the
dataset, such as values that are outside the expected range or format. This can be done using statistical
techniques such as data visualization and analysis.

Handling incorrect data: Once incorrect data is identified, the next step is to handle it appropriately.
This can be done by removing the outliers or replacing the incorrect values with a suitable value such
as the median or mode of the column.

Standardizing data formats and values: To ensure consistency across the dataset, it is often necessary
to standardize data formats and values. This can be done by converting data types, renaming columns,
or applying formatting rules.

Removing duplicates: Duplicate data can skew analysis results and should be removed from the
dataset. This can be done using the drop_duplicates() function in Pandas.

Quality control: The final step in data cleaning is to perform quality control checks to ensure that the
data is accurate, complete, and consistent. This involves comparing the cleaned dataset to the original
dataset and verifying that the data has been cleaned appropriately.

Safety and necessary Precautions:

1. Backup data.
2. Use secure and updated software.
3. Access control.
4. Data privacy.
5. Data encryption
6. Error handling.
7. Test and validate.

Procedure:
1. Import the required libraries: Import the necessary libraries such as pandas, numpy, and
matplotlib to read, manipulate and visualize the dataset.

2. Load the dataset: Load the dataset into the program using a pandas dataframe.

3. Identify null values: Use the isnull() function to identify null values in the dataset. If any null
values are found, decide on a strategy to handle them. This could involve replacing null values
with a mean or median value, dropping the null values or imputing them with a different value.

4. Identify empty values: Empty values are cells that contain an empty or whitespace-only string
rather than a proper null, so isnull() does not flag them. Identify them by comparing cells to the
empty string, and convert them to proper missing values (for example, df.replace('', np.nan)). If
any empty values are found, decide on a strategy to handle them. This could involve replacing
empty values with a mean or median value, dropping them, or imputing them with a different value.

5. Identify incorrect timestamps: Use the pd.to_datetime() function to convert the timestamp column
to a datetime object. Invalid entries raise a ValueError (or become NaT when errors='coerce' is
passed), which identifies any incorrect timestamp values. If any incorrect timestamp values are
found, decide on a strategy to handle them. This could involve dropping the rows with incorrect
timestamp values or imputing them with a different value.

6. Remove duplicates: Use the drop_duplicates() function to remove any duplicate rows in the
dataset.

7. Data normalization: Use the normalization technique to transform the data into a standard
format to make it more consistent and easier to analyze.

8. Data standardization: Use the standardization technique to transform the data into a standard
scale to make it more consistent and easier to analyze.

9. Save the cleaned dataset: Save the cleaned dataset to a new file for future use.

10. Visualize the cleaned dataset: Use matplotlib or other visualization libraries to create
visualizations of the cleaned dataset to better understand the data and identify any further
cleaning that may be required.

Observations:

Titanic.csv

Code:
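A minimal sketch of the cleaning steps above, assuming the usual Kaggle Titanic layout (the Age and Embarked column names are assumptions about the dataset):

import pandas as pd
import numpy as np

# Load the dataset (assumes Titanic.csv is in the working directory)
df = pd.read_csv("Titanic.csv")

# Identify null values column by column
print(df.isnull().sum())

# Treat empty strings as missing values, then impute or drop
df = df.replace("", np.nan)
df["Age"] = df["Age"].fillna(df["Age"].median())  # impute a numeric column
df = df.dropna(subset=["Embarked"])               # drop rows missing a category

# Remove duplicate rows
df = df.drop_duplicates()

# Save the cleaned dataset for further analysis
df.to_csv("Titanic_cleaned.csv", index=False)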

Output:

Quiz:

1. What is the first step in developing a program for data cleaning in Python?
Ans: The first step in developing a program for data cleaning in Python is to import the
required libraries, such as Pandas, NumPy, and Matplotlib, to read, manipulate, and visualize
the dataset.
2. How can null values be identified in a dataset?
Ans: Null values can be identified in a dataset using the ‘isnull()’ function, which returns a
Boolean mask of the same shape as the input, where True represents a null value.
3. How can empty values be handled in a dataset?
Ans: Empty values can be identified by checking for empty or whitespace-only strings, for
example with str.strip() == '' (note that str.isspace() returns True only for non-empty
whitespace strings, not for truly empty ones). Once identified, they can be converted to NaN
(e.g., df.replace('', np.nan)) and handled with the usual missing-value strategies.
4. How can incorrect timestamp values be identified in a dataset?
Ans: Incorrect timestamp values in a dataset can be identified by attempting to convert a
timestamp column to a datetime object using the to_datetime() function. If this operation raises
a ValueError, it indicates incorrect timestamp values.
5. What is the purpose of data normalization in data cleaning?
Ans: The purpose of data normalization in data cleaning is to transform data into a standard format,
making it more consistent and easier to analyse. Normalization ensures that the data falls within a
specific range or scale, reducing the impact of outliers and making comparisons more meaningful in
various analysis and modelling tasks.

Suggested Reference:

1. https://pandas.pydata.org/docs/
2. https://numpy.org/doc/stable/

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 8
Develop a program that shows usage of following NumPy array operations: a)
any() b) all() c) isnan() d) isinf() e) isfinite() f) isinf() g) zeros() h) isreal() i)
iscomplex() j) isscalar() k) less() l) greater() m) less_equal() n) greater_equal()
Competency and Practical Skills:
Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.
Practical skills:

• Understanding of Data Structures


• Knowledge of Programming Languages
• Familiarity with Data Manipulation Libraries
• Proficiency in Using IDEs and Text Editors
• Problem Solving and Troubleshooting Skills

Relevant CO: CO2

Objectives: (a) To perform complex mathematical and logical operations on large arrays
and matrices efficiently.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
NumPy is a popular Python library for scientific computing that provides efficient and powerful
array operations. It enables users to work with multidimensional arrays and perform a variety of
mathematical and logical operations on them.

Here are the explanations of some of the NumPy array operations mentioned in the question:

a) any(): It returns True if any of the elements of an array evaluate to True, and False otherwise.

b) all(): It returns True if all the elements of an array evaluate to True, and False otherwise.

c) isnan(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is NaN (Not a Number), and False elsewhere.

d) isinf(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is +/-inf (positive or negative infinity), and False
elsewhere.

e) isfinite(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is finite (i.e., not NaN and not +/-inf), and False
elsewhere.

f) isinf(): Listed twice in the experiment statement; it behaves exactly as described in (d) above.

g) zeros(): It returns a new array of the specified shape and data type, filled with zeros.

h) isreal(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is real, and False where it is complex.

i) iscomplex(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is complex, and False where it is real.

j) isscalar(): It returns True if the input is a scalar (i.e., a single value, not an array), and
False otherwise.

k) less(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is less than the corresponding element of the second
input array, and False otherwise.

l) greater(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than the corresponding element of the
second input array, and False otherwise.

m) less_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is less than or equal to the corresponding element of
the second input array, and False otherwise.

n) greater_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than or equal to the corresponding element
of the second input array, and False otherwise.

Safety and necessary Precautions:

1. Make sure to import NumPy correctly.


2. Use appropriate data types.
3. Watch out for NaN and Inf.
4. Be careful with memory usage.
5. Test the program thoroughly.

Procedure:
1. Import the NumPy library: To use NumPy array operations, you need to import the
NumPy library into your Python environment. You can do this using the import
statement.

2. Create a NumPy array: You need to create a NumPy array to perform the various
operations. You can create an array using the np.array() function.

3. Use the array operations: Once you have created the array, you can use various NumPy array
operations such as any(), all(), isnan(), isinf(), isfinite(), zeros(), isreal(), iscomplex(),
isscalar(), less(), greater(), less_equal(), and greater_equal().

4. Print the output: After performing the operations, you should print the output to see the results.

Observations:

Code :
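A minimal sketch covering the listed operations; the sample values are illustrative:

import numpy as np

a = np.array([1, 2, np.nan, np.inf])
print(np.any(a))               # True: at least one element is truthy
print(np.all(a))               # True here; np.all([1, 0, 3]) would be False
print(np.isnan(a))             # [False False  True False]
print(np.isinf(a))             # [False False False  True]
print(np.isfinite(a))          # [ True  True False False]
print(np.zeros((2, 3)))        # new 2x3 array filled with zeros
c = np.array([1 + 0j, 2 + 3j])
print(np.isreal(c))            # [ True False]
print(np.iscomplex(c))         # [False  True]
print(np.isscalar(42))         # True: a single value
print(np.isscalar(a))          # False: an array is not a scalar
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(np.less(x, y))           # [ True False False]
print(np.greater(x, y))        # [False False  True]
print(np.less_equal(x, y))     # [ True  True False]
print(np.greater_equal(x, y))  # [False  True  True]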

Output :

Conclusion:
The program demonstrates various NumPy array operations. It starts by creating a sample NumPy
array containing integers, a NaN value, infinity, and complex numbers. The program then applies
different operations to this array. It checks if any element is non-zero (any()) and if all elements are
non-zero (all()), both of which return True in this case. It identifies NaN and infinity values using
isnan() and isinf(), respectively, and marks finite values with isfinite(). The program also
distinguishes between real and complex numbers using isreal() and iscomplex(). It checks if 42 is a
scalar (isscalar()) and determines that it is, whereas the array is not. Additionally, comparisons
between two arrays are made using functions like less(), greater(), less_equal(), and greater_equal(),
providing results indicating which elements satisfy the respective conditions.

Quiz:
1. What does the NumPy function 'any()' return?
Ans: The NumPy function any() returns True if at least one element in the input array is
evaluated as True (non-zero). If all elements are False (zero), it returns False.
2. What is the purpose of the NumPy function 'isnan()'?
Ans: The purpose of the NumPy function isnan() is to identify and return a boolean mask where
each element in the input array is checked for being NaN (Not a Number). It returns a boolean
array of the same shape as the input, where True indicates that the corresponding element is NaN,
and False indicates it is a valid number.
3. What does the NumPy function 'zeros()' do?
Ans: The NumPy function zeros() creates a new NumPy array filled with zeros. It takes a shape
as input and generates an array of that shape entirely populated with zeros. For example,
np.zeros((2,3)) would produce a 2x3 matrix filled with zeros.
4. What does the NumPy function 'isreal()' do?
Ans: The NumPy function isreal() checks each element in an array and returns a boolean
mask indicating whether the element is a real number. It returns True for real numbers and
False for complex numbers.

Suggested Reference:

1. NumPy User Guide: https://numpy.org/doc/stable/user/index.html


2. NumPy Tutorial: https://www.tutorialspoint.com/numpy/index.htm
3. NumPy Cheat Sheet:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 9
Develop a program that shows usage of following NumPy library vector functions.
a) arange() b) reshape() c) linspace() d) randint() e) dot()

Competency and Practical Skills:


Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.
Practical skills:

• Understanding of Data Structures


• Knowledge of Programming Languages
• Familiarity with Data Manipulation Numpy Libraries
• Proficiency in Using IDEs and Text Editors
• Problem Solving and Troubleshooting Skills

Relevant CO: CO2

Objectives: (a) To provide efficient and powerful tools for working with large arrays and matrices
in Python, along with a wide range of mathematical and scientific functions for manipulating and
analyzing these arrays.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Here is a brief theory for each of the NumPy vector functions:

a) arange(): This function is used to create a one-dimensional array with evenly spaced values
within a specified range. The function takes in three arguments: start (optional), stop, and step
(optional). The start argument is the starting value of the sequence (inclusive), the stop argument is
the ending value of the sequence (exclusive), and the step argument is the step size between values.
For example, np.arange(0, 10, 2) creates an array with values [0, 2, 4, 6, 8].

b) reshape(): This function is used to reshape an array into a new shape without changing its data.
The function takes in one argument: the new shape of the array, specified as a tuple of integers. For
example, np.reshape(my_array, (3, 4)) reshapes the array my_array into a 3x4 matrix.

c) linspace(): This function is used to create a one-dimensional array with evenly spaced values
between a specified range. The function takes in three arguments: start, stop, and num (optional).
The start argument is the starting value of the sequence, the stop argument is the ending value of
the sequence, and the num argument is the number of values to generate. For example,
np.linspace(0, 1, 5) creates an array with values [0., 0.25, 0.5, 0.75, 1.].

d) randint(): This function is used to generate an array of random integers within a specified
range. The function takes in three arguments: low (optional), high, and size (optional). The low
argument is the lower bound of the range (inclusive), the high argument is the upper bound of the
range (exclusive), and the size argument is the shape of the output array. For example,
np.random.randint(0, 10, size=(2, 3)) generates a 2x3 array of random integers from 0 up to
(but not including) 10.

e) dot(): This function is used to perform matrix multiplication between two arrays. The function
takes in two arguments: the two arrays to be multiplied. The arrays must have compatible shapes
for matrix multiplication. For example, if A is a 2x3 array and B is a 3x2 array, np.dot(A, B)
performs matrix multiplication between A and B and returns a 2x2 array.

Overall, these NumPy vector functions are commonly used for manipulating and analyzing
arrays in scientific computing and data analysis. By using these functions in a program, you can
efficiently perform operations on large arrays and matrices in Python.

Safety and necessary Precautions:

1. Install NumPy from a trusted source.


2. Keep NumPy updated.
3. Understand data types.
4. Avoid modifying arrays in place.
5. Use vectorized operations.
6. Handle exceptions and errors.

Procedure:
1. Import the NumPy library: Begin your program by importing the NumPy library
using the import statement.

2. Create an array: Create an array using one of the NumPy functions such as
arange() or linspace(). You can also create an array from an existing data source
such as a CSV file.

3. Reshape the array: Use the reshape() function to reshape the array to the desired
shape. For example, you can reshape a one-dimensional array into a two-dimensional
array.

4. Generate random numbers: Use the randint() function to generate an array of random
integers within a specified range.

5. Perform matrix multiplication: Use the dot() function to perform matrix multiplication
between two arrays.

6. Print the results: Print the resulting arrays to the console using the print() function.

Observations:

Code :
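A minimal sketch of the five functions; the sample values are illustrative:

import numpy as np

a = np.arange(0, 10, 2)           # [0 2 4 6 8]
print(a)
m = np.arange(12).reshape(3, 4)   # reshape a 1D array of 12 values into a 3x4 matrix
print(m)
print(np.linspace(0, 1, 5))       # [0.   0.25 0.5  0.75 1.  ]
r = np.random.randint(0, 10, size=(3, 3))  # random integers in [0, 10)
print(r)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B))               # matrix multiplication of two 2x2 arrays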

Output:

Conclusion:
In this experiment, arange() is used to create an array containing values from 0 to 8 with a step size
of 2. Secondly, reshape() is employed to transform a 1D array into a 2D array with dimensions 3x4.
Next, linspace() generates an array with five evenly spaced values between 0 and 1. The randint()
function is then utilized to create a 3x3 array filled with random integers ranging from 0 to 9.
Finally, the dot() function is employed to compute the dot product of two 2x2 arrays, producing
the result of matrix multiplication. Overall, this program effectively showcases the versatility and
power of NumPy's vector functions in performing a variety of array manipulations and
calculations.

Quiz:
1. What is the purpose of the NumPy library?
Ans: The purpose of the NumPy library is to provide a powerful set of tools for numerical
computations in Python. It offers high-performance multidimensional array operations, along
with a collection of mathematical functions to work with these arrays. NumPy is particularly
valuable for tasks involving numerical computations in fields like mathematics, physics,
engineering, data science, and machine learning.

2. Which NumPy function can be used to create an array with evenly spaced values?
Ans: The NumPy function linspace() can be used to create an array with evenly spaced values.
It generates an array of specified length with values evenly spaced within a specified range.

3. Which NumPy function can be used to generate an array of random integers within a
specified range?
Ans: The NumPy function randint() is used to generate an array of random integers
within a specified range. It allows for the creation of arrays with randomly selected
integer values.

4. How can you perform matrix multiplication between two arrays in NumPy?
Ans: Matrix multiplication between two arrays in NumPy can be performed using the
np.dot() function. This function takes two arrays as input and computes their dot product,
which is equivalent to matrix multiplication for 2D arrays.

5. What is the purpose of the reshape() function in NumPy?


Ans: The purpose of the reshape() function in NumPy is to change the shape of an array. It
allows for the transformation of a multidimensional array into a different shape, as long as the
total number of elements remains constant. This is useful for tasks like converting a 1D array
into a 2D matrix or vice versa, or for preparing data in a suitable format for various
computations.

Suggested Reference:

1. https://numpy.org/doc/stable/
2. https://numpy.org/doc/stable/user/index.html

References used by the students:


1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 10
Write a program to display the below plot using the matplotlib library. For values of
X: [1, 2, 3, ..., 49], values of Y (thrice of X): [3, 6, 9, 12, ..., 144, 147].

Competency and Practical Skills:


Competency skills:

• Understanding the basics of data visualization


• Familiarity with Python programming language
• Knowledge of the different types of plots and when to use them
• Knowledge of the syntax and parameters for different matplotlib functions
• Understanding of data structures like arrays and data frames

Practical skills:

• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.

Relevant CO: CO4

Objectives: (a) To create informative and visually appealing data visualizations that enable
users to explore, understand, and communicate complex data.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Matplotlib is a Python library that provides a variety of tools for creating high-quality data
visualizations. It is one of the most popular data visualization libraries due to its ease of use and
versatility. The library is built on NumPy and provides a range of options for creating different
types of plots and graphs, including line plots, scatter plots, bar charts, histograms, and many more.

The main components of the Matplotlib library are:

pyplot module: This is the main module of Matplotlib, which provides a simple interface for
creating plots and charts. It is a collection of functions that allow users to create plots with
minimal coding.

Figure and Axes objects: The Figure object is the top-level container for all the plot elements. It
represents the entire plot and contains one or more Axes objects. The Axes object is the individual
plot area where data is plotted.

Plotting functions: Matplotlib provides a range of plotting functions that can be used to create
different types of plots and charts. These functions include plot(), scatter(), bar(), hist(), and many
more.

Customization options: Matplotlib allows users to customize the appearance of plots in various ways,
including changing the plot color, adding labels, titles, and legends, adjusting the axis limits, and
more.

To use Matplotlib, you first need to import the library and its pyplot module. Then, you can
create a figure object and one or more axes objects using the subplots() function. After that, you
can use the various plotting functions to create different types of plots and customize them as
needed.

Overall, Matplotlib provides a powerful and flexible tool for creating data visualizations in
Python. With its wide range of options and customization features, it can be used for a variety of
data analysis and communication tasks.

Safety and necessary Precautions:

1. Keep Matplotlib libraries up-to-date.


2. Use Comments
3. Test your code.

Procedure:
1. Import the required libraries - Matplotlib and NumPy.
2. Create two NumPy arrays for X and Y values using np.arange() and multiplication.
3. Create a figure and an axis object using plt.subplots().
4. Use the ax.plot() function to plot X and Y values as a line plot.
5. Customize the plot with axis labels and a title.
6. Display the plot using plt.show() function.

Observations:
Code:
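A minimal sketch following the procedure above:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 50)  # X values 1, 2, ..., 49
y = 3 * x             # Y values 3, 6, ..., 147

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X values")
ax.set_ylabel("Y values (thrice of X)")
ax.set_title("Line plot of Y = 3X")
plt.show()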

Output:

Quiz:
1. What is Matplotlib?
Ans: Matplotlib is a widely used Python library for creating static, animated, and interactive
visualizations in a variety of formats. It provides a high-level interface for drawing
attractive and informative statistical graphics, plots, and charts.

2. What are the two basic types of plots in Matplotlib?
Ans: The two basic types of plots in Matplotlib are:

Line plots: These are used to display data points in a continuous line. They are suitable for
representing trends or relationships between variables over a continuous range.
Scatter plots: These display individual data points as markers without connecting lines.

3. How can you change the color of a plot in Matplotlib?


Ans: The color of a plot in Matplotlib can be changed by specifying the color parameter in the
plot function. For example, to change the color to red, you can use plt.plot(x, y, color='red').

4. How can you add a legend to a plot in Matplotlib?


Ans: To add a legend to a plot in Matplotlib, you can use the plt.legend() function. This
function takes an optional labels parameter that allows you to specify the labels for each plot.
The legend provides information about the data represented in the plot, making it easier for
viewers to interpret the visualization.

5. What is the function used to save a plot to a file in Matplotlib?


Ans: The function used to save a plot to a file in Matplotlib is plt.savefig(). This function
allows you to save the current figure to a specified file format, such as PNG, PDF, SVG, or
others. For example, plt.savefig('my_plot.png') will save the current plot as a PNG file with
the filename 'my_plot.png'.

Suggested Reference:

1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547

References used by the students:


1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 11
Write a program to display the below bar plot using the matplotlib library. For the values:
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]

Competency and Practical Skills:


Competency skills:

• Understanding the basics of data visualization


• Familiarity with Python programming language
• Knowledge of the different types of plots and when to use them
• Knowledge of the syntax and parameters for different matplotlib functions
• Understanding of data structures like arrays and data frames

Practical skills:

• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.

Relevant CO: CO4

Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to
draw insights and make informed decisions.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
A bar plot is a type of chart that displays data as rectangular bars. The length or height of each bar is
proportional to the value of the data it represents. Bar plots are useful for comparing the values of
different categories or groups.

Matplotlib is a popular data visualization library in Python that provides a wide range of functions
for creating different types of plots, including bar plots.

Use the bar() function to create the bar plot by passing the languages and popularity lists as
arguments. The bar() function automatically generates the rectangular bars for each category and
sets their lengths proportional to the values in the popularity list.

Safety and necessary Precautions:

1. Keep Matplotlib libraries up-to-date.


2. Use Comments
3. Test your code.

Procedure:
1. Define the data for the plot as lists or arrays.
2. Use the bar() function to create the plot, passing the data as arguments.
3. Customize the plot by changing the colors, labels, and other attributes.
4. Add a title and labels to the plot to provide context and improve its readability.
5. Display the plot using the show() function.

Observations:
Code:
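A minimal sketch following the procedure above:

import matplotlib.pyplot as plt

languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]

# Bar lengths are set automatically from the popularity values
plt.bar(languages, popularity)
plt.xlabel("Programming language")
plt.ylabel("Popularity (%)")
plt.title("Popularity of programming languages")
plt.show()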

Output:

Conclusion:
The provided Python program effectively utilizes the Matplotlib library to create a bar plot
illustrating the popularity of various programming languages. With data points for languages, the
plot vividly showcases their respective popularity percentages. Each bar corresponds to a language,
with the height of the bar representing its popularity.

Quiz:
1. What is a bar plot?
Ans: A bar plot, also known as a bar chart or bar graph, is a graphical representation of
categorical data. It uses rectangular bars to display the frequency, count, or any other measure
associated with different categories. The length or height of each bar corresponds to the value
of the category it represents, making it easy to compare and visualize the differences between
categories.

2. Which library is used to create a bar plot in Python?


Ans: The library used to create a bar plot in Python is Matplotlib. Matplotlib is a popular data
visualization library that provides a wide range of tools for creating static, animated, and
interactive plots and charts.

3. What are the steps involved in creating a bar plot using Matplotlib?
Ans: The steps involved in creating a bar plot using Matplotlib are as follows:
Import the Matplotlib library.
Define the data to be plotted (e.g., categories and corresponding values).
Use the plt.bar() function to create the bar plot.
Customize the plot by adding labels, titles, colors, and other attributes.
Display the plot using plt.show().

4. What is the correct syntax to create a bar plot using Matplotlib?
Ans:
import matplotlib.pyplot as plt
# Define data
categories = [...]  # List of categories
values = [...]      # List of corresponding values
# Create bar plot
plt.bar(categories, values, color='...')  # Additional customization can be applied here
# Add labels and title
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Title of the Bar Plot')
# Display the plot
plt.show()

5. What are the parameters required by the bar() function to create a bar plot?
Ans: The bar() function in Matplotlib requires two main parameters:
x: This parameter represents the categories or labels on the x-axis. It can be a list of
strings, numbers, or any other categorical data.
height: This parameter corresponds to the values or heights of the bars. It should be a
list of numerical values, with each value representing the height of a respective bar.

Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer:
https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547

References used by the students:


• https://www.w3schools.com/python/
• https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 12
Write a program to display the below bar plot using the matplotlib library; for the below data,
also display a pie plot.
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
Colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]

Competency and Practical Skills:


Competency skills:

• Understanding the basics of data visualization


• Familiarity with Python programming language
• Knowledge of the different types of plots and when to use them
• Knowledge of the syntax and parameters for different matplotlib functions
• Understanding of data structures like arrays and data frames

Practical skills:

• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.

Relevant CO: CO1, CO4

Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
A bar plot is a type of chart that displays data as rectangular bars. The length or height of each bar is
proportional to the value of the data it represents. Bar plots are useful for comparing the values of
different categories or groups.

Matplotlib is a popular data visualization library in Python that provides a wide range of functions for
creating different types of plots, including bar plots.

Use the bar() function to create the bar plot by passing the languages and popularity lists as arguments.
The bar() function automatically generates the rectangular bars for each category and sets their lengths
proportional to the values in the popularity list.

Safety and necessary Precautions:

1. Keep Matplotlib libraries up-to-date.


2. Use Comments
3. Test your code.

Procedure:
1. Import the necessary libraries (matplotlib.pyplot)
2. Define the data to be used (Languages, Popularity, Colors)
3. Create a figure object and set the figure size
4. Define the title of the plot and add the data to be displayed (Popularity) and their corresponding
labels (Languages)
5. Set the colors of the pie chart using the Colors list
6. Add a legend to the chart with the labels and colors used
7. Display the plot.

Observations:
Code:
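A minimal sketch matching the Conclusion below (a bar plot followed by a pie plot):

import matplotlib.pyplot as plt

languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b"]

# Bar plot of language popularity
plt.figure(figsize=(8, 5))
plt.bar(languages, popularity, color=colors)
plt.title("Popularity of programming languages")
plt.show()

# Pie plot of the same data
plt.figure(figsize=(6, 6))
plt.pie(popularity, labels=languages, colors=colors, autopct="%1.1f%%")
plt.title("Popularity of programming languages")
plt.legend(languages, loc="best")
plt.show()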

Output:

Conclusion:
The provided Python program uses the Matplotlib library to create visual representations of
programming language popularity. It first generates a bar plot displaying the popularity percentages of
various languages. Then, it creates a pie plot to further visualize the distribution. The colors chosen
enhance visual appeal, and labels provide context. These plots offer clear insights into the relative
popularity of programming languages.

Quiz:
1. What libraries do you need to import to create the pie chart using matplotlib?
Ans: To create a pie chart using Matplotlib, you need to import the matplotlib.pyplot module. In
the provided program, this is done with the line import matplotlib.pyplot as plt.
2. What is the purpose of defining the Colors list in the program?
Ans: The purpose of defining the Colors list in the program is to specify the colors that will be used
for the different segments of the pie chart. Each color in the list corresponds to a programming
language in the Languages list, providing a visually appealing representation of the data.
3. What is the purpose of setting the figure size in the program?
Ans: Setting the figure size in the program using plt.figure(figsize=(width, height)) determines the
dimensions of the plot window. This allows for control over the size of the generated plot, ensuring
that it is displayed in an appropriately scaled manner.

4. How do you add a legend to the pie chart in matplotlib?


Ans: To add a legend to the pie chart in Matplotlib, you can use the plt.legend() function. However,
in the context of a pie chart, legends are typically not used because the labels of the pie slices
themselves serve as a form of legend.

Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547

References used by the students:


1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 13
Write a program using the matplotlib library: for 200 random points for both X and Y, display
a scatter plot.

Competency and Practical Skills:


Competency skills:

• Understanding the basics of data visualization


• Familiarity with Python programming language
• Knowledge of the different types of plots and when to use them
• Knowledge of the syntax and parameters for different matplotlib functions
• Understanding of data structures like arrays and data frames

Practical skills:

• Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
• Ability to customize the appearance of plots including labels, colors, legends, and titles
• Ability to add text, annotations, and shapes to the plots
• Ability to work with multiple plots and subplots
• Ability to export plots in different file formats like png, pdf, svg, etc.
• Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.

Relevant CO: CO4

Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
In Matplotlib, a scatter plot is a chart type that displays data as a collection of points with the position
determined by the values of two variables. Each point on the scatter plot represents an observation, and
the position of the point on the X-Y axis is determined by the values of the two variables.

A scatter plot is useful for exploring the relationship between two continuous variables. It can be used
to identify patterns or trends in the data and to detect the presence of outliers or unusual observations.
Scatter plots can also be used to assess the correlation between the two variables.

Matplotlib provides the scatter() function for creating scatter plots. The function takes two arrays, one
for the X-axis data and one for the Y-axis data, as its input arguments. Additional parameters can be
used to customize the appearance of the scatter plot, such as the color, size, and transparency of the
points.

Safety and necessary Precautions:

1. Keep Matplotlib libraries up-to-date.


2. Use Comments
3. Test your code.

Procedure:
1. Import necessary libraries: We will need the Matplotlib and NumPy libraries for this task.
2. Generate random data for the X and Y axes: We can use the NumPy library to generate random
data for both the X and Y axes
3. Create a scatter plot: We can use the scatter method of the Matplotlib library to create a scatter
plot. We need to pass the X and Y data as arguments and specify the marker style and color
using the marker and c parameters, respectively
4. Add title and labels: We can add a title and labels for the X and Y axes using the title, xlabel,
and ylabel methods of the Matplotlib library.
5. Set axes limits: We can set the limits for the X and Y axes using the xlim and ylim methods of
the Matplotlib library.
6. Display the plot: We can display the plot using the show method of the Matplotlib library.

Observations:

Code:
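A minimal sketch following the procedure above:

import numpy as np
import matplotlib.pyplot as plt

# 200 random points for both X and Y in [0, 1)
x = np.random.rand(200)
y = np.random.rand(200)

plt.scatter(x, y, marker="o", c="blue", alpha=0.7)
plt.title("Scatter plot of 200 random points")
plt.xlabel("X")
plt.ylabel("Y")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()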

Output:

Conclusion:
The program generates 200 random points for X and Y and displays them as a scatter plot in a
2D space. These points are distributed across the plot, showcasing a random pattern. The program
demonstrates the versatility of Matplotlib in creating diverse visualizations for different types of
data.

Quiz:

1. What is a scatter plot?


Ans: A scatter plot is a type of data visualization that displays individual data points as markers on
a two-dimensional graph. Each point represents the values of two variables, typically on the x and y
axes. Scatter plots are useful for visualizing the relationship between two continuous variables and
identifying patterns, trends, or correlations in the data.

2. What is the function used for creating scatter plots in Matplotlib?


Ans: The function used for creating scatter plots in Matplotlib is plt.scatter().

3. What are the input arguments for the scatter() function?


Ans: The input arguments for the scatter() function in Matplotlib include:
x: The values for the x-axis.
y: The values for the y-axis.
s: The size of the markers.
c: The color of the markers.
marker: The type of marker to use.

4. What can a scatter plot be used for?


Ans: A scatter plot can be used for various purposes, including:
Identifying patterns or trends in the relationship between two variables.
Visualizing the distribution of data points.
Detecting outliers or unusual observations.
Assessing the strength and direction of a correlation between variables.
Comparing multiple groups or categories within the same plot.

5. Can the appearance of the scatter plot be customized?


Ans: Yes, the appearance of a scatter plot can be customized in several ways. You can adjust the
marker size (s argument), marker color (c argument), marker type (marker argument), and
transparency (alpha argument) of the data points. Additionally, you can customize the axes labels,
titles, grid, and other visual attributes to enhance the clarity and aesthetics of the plot.

Suggested Reference:

1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer:
https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 14
Develop a program that reads a .csv file and plots the data of the dataset stored in the file,
from the URL: (https://github.com/chris1610/pbpython/blob/master/data/sample
salesv3.xlsx?raw=true)

Competency and Practical Skills:


Competency skills:
• Data analysis, data visualization, file handling and programming.

Relevant CO: CO3, CO4

Objectives: (a) To analyze and visualize the data in an efficient and effective way.
(b) To identify patterns, trends, and outliers in the data.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Reading a .csv file from a URL and plotting the data is a common data analysis and visualization task
in many fields. Here are the main steps involved in this process:

Importing the necessary libraries: To read and plot the .csv file, we typically use the pandas and
matplotlib libraries. We need to import them at the beginning of our program.

Loading the data from the URL: We can use the pandas library's read_csv function to read the data
from the URL. We need to provide the URL of the .csv file as an argument to this function.

Data cleaning and preparation: Once we have loaded the data, we may need to clean and prepare it for
visualization. This may include dropping unnecessary columns, filling missing values, and
transforming the data.

Data visualization: Once the data is cleaned and prepared, we can use matplotlib's various plotting
functions to create visualizations such as line plots, scatter plots, bar plots, and more. We can
customize the plot with various parameters such as colors, labels, titles, and more.

Displaying the plot: After creating the plot, we need to display it using the show function provided by
the matplotlib library.

Safety and necessary Precautions:

1. Validate inputs.
2. Handle errors.
3. Secure the program
4. Optimize performance
5. Test and review.

Procedure:
1. Import the necessary libraries: You will need the pandas library to read the .csv file, and
matplotlib library to create the plot.
2. Read the .csv file from the URL: Use the pandas library to read the .csv file from the URL and
store it as a DataFrame object.

3. Preprocess the data: Preprocess the data as required. This may involve cleaning the data,
removing duplicates, handling missing values, and converting data types.
4. Visualize the data: Use the matplotlib library to create a visualization of the data. You can
create scatter plots, line graphs, histograms, and other types of visualizations based on the data.
5. Save or display the visualization: Save the visualization to a file or display it on the screen,
depending on the user requirements.
6. Test and validate the program: Test the program thoroughly to ensure that it works as expected
for various input datasets. Validate the results against the expected output and fix any issues or
errors.
7. Document the program: Document the program by providing clear and concise comments in the
code and a user manual that explains how to use the program.

Observations:

Code:
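A minimal sketch of the approach described in the Conclusion below. The file behind the URL is an Excel workbook, so read_excel() is used; the exact filename (with a hyphen) and the 'name' and 'ext price' column names are assumptions about the sample sales dataset and may need adjusting:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical exact URL; the printed URL above contains a line-break space
url = ("https://github.com/chris1610/pbpython/blob/master/data/"
       "sample-salesv3.xlsx?raw=true")
df = pd.read_excel(url)  # for a plain .csv the call would be pd.read_csv(url)

# Group by name and total the sales (assumed column names)
totals = df.groupby("name")["ext price"].sum().sort_values(ascending=False)

totals.plot(kind="bar", figsize=(10, 5))
plt.ylabel("Total sales")
plt.title("Total sales by name")
plt.tight_layout()
plt.show()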

Output:

Conclusion:
The provided Python program effectively utilizes the Pandas library to load data from an Excel file. It
then groups the data by product name and calculates the total sales for each product. The program
proceeds to create a clear and informative bar chart displaying the total sales figures. This visual
representation allows for easy comparison of product sales performance.

Quiz:
1. What library is required to read a .csv file in Python?
Ans: The library required to read a .csv file in Python is Pandas.

2. What library is required to create plots in Python?


Ans: The library required to create plots in Python is Matplotlib.

3. What is the first step in developing a program that reads a .csv file from a URL and plots
the data?
Ans: The first step in developing a program that reads a .csv file from a URL and plots the data is
to import the necessary libraries, namely Pandas for data handling and Matplotlib for plotting.

4. How do you read a .csv file from a URL in Python using the pandas library?
Ans: To read a .csv file from a URL in Python using the Pandas library, you can use the
pd.read_csv() function. For example, df = pd.read_csv(url) would read the .csv file from the
provided URL and store it in a DataFrame called df.

5. How do you create a scatter plot of two columns from a DataFrame using the matplotlib
library?
Ans: To create a scatter plot of two columns from a DataFrame using the Matplotlib library, you
can use the plt.scatter() function. For example, plt.scatter(df['column1'], df['column2']) would
create a scatter plot of 'column1' on the x-axis and 'column2' on the y-axis.

6. How do you save a plot to a file using the matplotlib library?


Ans: To save a plot to a file using the Matplotlib library, you can use the plt.savefig() function. For
example, plt.savefig('plot.png') would save the current plot as a PNG file with the filename
'plot.png'. You can specify the file format by providing the appropriate file extension (e.g., .png,
.pdf, .svg).

Suggested Reference:

1. Pandas documentation on reading a CSV file from a URL:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-csv-files
2. Matplotlib documentation on creating plots:
https://matplotlib.org/stable/tutorials/introductory/pyplot.html
3. Real Python tutorial on reading and writing CSV files in Python:
https://realpython.com/python-csv/
4. DataCamp tutorial on data visualization with Matplotlib:
https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
5. Towards Data Science tutorial on creating visualizations with Pandas and Matplotlib:
https://towardsdatascience.com/data-visualization-with-pandas-and-matplotlib-8dadc69f2f79

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubric 1 - Knowledge of subject (2): Good (2) / Average (1)
Rubric 2 - Programming Skill (2): Good (2) / Average (1)
Rubric 3 - Team work (2): Good (2) / Satisfactory (1)
Rubric 4 - Communication Skill (2): Good (2) / Satisfactory (1)
Rubric 5 - Ethics (2): Good (2) / Average (1)
Total Marks:

Experiment No: 15
Write a text classification pipeline using a custom preprocessor and
CharNGramAnalyzer using data from Wikipedia articles as a training set.
Evaluate the performance on some held out test sets.
Date:

Competency and Practical Skills:


Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.

Practical skills:

• Proficiency in Python programming language.


• Familiarity with the scikit-learn library for machine learning.
• Knowledge of text preprocessing techniques, such as tokenization, stop word removal,
stemming, and lemmatization.
• Understanding of feature extraction techniques, such as bag-of-words and character n-grams.
• Ability to evaluate the performance of a text classification model using metrics such as
accuracy, precision, recall, and F1 score.
• Knowledge of cross-validation techniques for evaluating model performance on held-out test
sets.
• Familiarity with data collection and preprocessing techniques for building a training set from
Wikipedia articles.

Relevant CO: CO3, CO4, CO5

Objectives: (a) To develop a machine learning model that can accurately classify text documents into
predefined categories that can be used for various applications such as sentiment analysis, spam
detection, and topic modeling.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
Text classification is the task of assigning predefined categories or labels to text documents based on
their content. A text classification pipeline typically consists of several stages, including data
preprocessing, feature extraction, model training, and evaluation.

In the context of Wikipedia articles, the first step in building a text classification pipeline is to collect a
dataset of articles with their corresponding labels. These labels can be either manually assigned or
obtained from existing metadata such as categories or tags.

Once a dataset is obtained, the next step is data preprocessing. This typically involves text
normalization, tokenization, stop word removal, and stemming/lemmatization.

After preprocessing, the text is converted into numerical features that can be used as input to a machine
learning model. A popular technique for feature extraction is the bag-of-words model, which represents
each document as a vector of word frequencies.

An alternative approach is to use character n-grams, such as CharNGramAnalyzer, which captures the
sequence of characters in the text. This method is particularly useful for capturing the morphology and
syntax of the text and can improve the performance of the classifier.

The final stage in the text classification pipeline is model training and evaluation. A common approach
is to use supervised learning algorithms such as Naive Bayes, Logistic Regression, or Support Vector
Machines. The performance of the model is evaluated using metrics such as accuracy, precision, recall,
and F1 score on held-out test sets.

In summary, building a text classification pipeline using a custom preprocessor and
CharNGramAnalyzer involves data preprocessing, feature extraction, model training, and evaluation.
This approach can be particularly useful for text classification tasks where the meaning and
relationships of words are important.

Safety and necessary Precautions:

1. Data privacy.
2. Bias and fairness.
3. Model accuracy and reliability
4. Ethical considerations
5. Test and review.

Procedure:
Collect and preprocess the data: Download a set of Wikipedia articles that represent the different
categories you want to classify (e.g., sports, politics, entertainment, etc.). Preprocess the data by
removing any unnecessary characters, converting all text to lowercase, and removing any stop words.

Split the data: Split the preprocessed data into two sets: training and test sets. The training set will be
used to train the model, while the test set will be used to evaluate the model's performance.

Feature extraction: Extract the features from the preprocessed text using CharNGramAnalyzer. This
will convert each text document into a vector of features that can be used as input to the classification
model.

Train the model: Train a text classification model using the extracted features and the training set. You
can use any machine learning algorithm, such as Naive Bayes, SVM, or Neural Networks.

Evaluate the model: Use the trained model to classify the test set and evaluate its performance using
metrics such as accuracy, precision, recall, and F1-score.

Tune the model: If the model's performance is not satisfactory, you can tune the hyperparameters of the
algorithm or try different algorithms to improve its performance.

Deploy the model: Once you are satisfied with the model's performance, you can deploy it in
production to classify new text documents.

Observations:

Wikipedia.csv

Code:
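A minimal sketch, assuming a Wikipedia.csv (as in the Observations above) with 'text' and 'category' columns; the column names are assumptions, and the CharNGramAnalyzer is realized here with scikit-learn's built-in character n-gram analyzer:

import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

def preprocess(text):
    # Custom preprocessor: lowercase and keep only letters and whitespace
    return re.sub(r"[^a-z\s]", " ", text.lower())

# Assumed file layout: a 'text' column and a 'category' label column
df = pd.read_csv("Wikipedia.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, random_state=42)

# Character n-grams of length 2-4 stand in for the CharNGramAnalyzer
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess,
                              analyzer="char_wb",
                              ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))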

Output:

Conclusion:

In the conclusion, the code evaluates the model's performance on two separate held-out test sets.
The results include accuracy scores and detailed classification reports, providing insights into
the model's effectiveness in classifying text from the test sets.
This code can serve as a versatile framework for text classification tasks, demonstrating the
importance of preprocessing and feature extraction techniques like CharNGramAnalyzer. The
evaluation results on multiple test sets help assess the model's generalization capabilities.

Quiz:

1. What is the purpose of using a custom preprocessor in a text classification pipeline?


Ans: The purpose of using a custom preprocessor in a text classification pipeline is to
perform specific data cleaning and preprocessing tasks on the raw text data before it is fed
into the machine learning model. This can include tasks like converting text to lowercase,
removing special characters, handling stopwords, and other custom operations that are
specific to the dataset or the classification task. The custom preprocessor helps in improving
the quality of the features fed into the model, which in turn can lead to better classification
performance.

2. Which analyzer is used in the given scenario?
"Writing a text classification pipeline using a custom preprocessor and CharNGramAnalyzer
using data from Wikipedia articles as a training set."
Ans: In the given scenario, the analyzer used is the CharNGramAnalyzer. This analyzer is
specifically designed to generate character-level n-grams (sequences of characters of length
'n') from the input text. It's especially useful when word-level features might not be as
informative or when there's a need to capture information at the character level. In this
context, CharNGramAnalyzer will generate character-level features from the Wikipedia
articles, which can be valuable for certain text classification tasks.

3. What is the purpose of evaluating the performance on held-out test sets in


textclassification?
Ans: The purpose of evaluating performance on held-out test sets in text classification is to
assess how well the trained model generalizes to unseen data. This step is crucial to ensure
that the model's performance is not overfitted to the training data. Here's why it's important:
• Generalization Testing: It allows us to test the model's ability to make accurate predictions
on data it has never seen before. This simulates real-world scenarios where the model
encounters new, unseen examples.
• Assessing Model Quality: It provides an unbiased evaluation of the model's performance,
which helps in understanding how well it will perform in practical applications.
• Avoiding Overfitting: It helps detect whether the model has overfit to the training data.
Overfitting occurs when the model learns the training data too well, including noise or
irrelevant patterns, which can lead to poor performance on new data.
• Parameter Tuning: It aids in fine-tuning model parameters or hyperparameters to achieve
the best performance on unseen data.
• Comparing Models: It enables comparison of different models or approaches to see which
one performs better on the same test set, helping in model selection.
• Building Trust: It builds confidence in the model's predictive ability, as it demonstrates its
performance on independent, unseen examples.
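For example, a held-out test set is typically carved out once, before any training, with a fixed random seed for reproducibility (toy data shown):

    # Toy illustration of reserving a held-out test set before training.
    from sklearn.model_selection import train_test_split

    texts = ["good", "bad", "fine", "awful", "great", "poor"]
    labels = [1, 0, 1, 0, 1, 0]
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.33, random_state=0, stratify=labels)
    # The model never sees X_test/y_test until final evaluation.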

Suggested Reference:

1. "Building a Text Classification Pipeline with Python" by Dipanjan Sarkar: This article
provides a step-by-step guide on how to build a text classification pipeline using
Python and scikit-learn library. It covers preprocessing techniques, feature extraction,

model selection, andevaluation.



2. "Text Classification with NLTK and Scikit-Learn" by Ahmed Besbes: This tutorial
provides a detailed guide on how to perform text classification using Python and two
popularlibraries, NLTK and scikit-learn. It covers data preprocessing, feature
extraction, and model training and evaluation.

3. "Using Wikipedia Articles for Text Classification" by Nikolay Krylov: This article
demonstrates how to use Wikipedia articles as a training set for text classification. It
covers datacollection, preprocessing, feature extraction using TF-IDF and
CharNGramAnalyzer, model training, and evaluation.

4. "Text Classification with Python and Scikit-Learn" by Sebastian Raschka: This book
chapter provides a comprehensive guide on how to perform text classification using
Python andscikit-learn. It covers data preprocessing, feature extraction, model training,
and evaluation, as well as advanced topics such as model selection and parameter
tuning.

5. "A Complete Tutorial on Text Classification using Naive Bayes Algorithm" by Divya
Gupta: This tutorial provides a detailed guide on how to perform text classification
using NaiveBayes algorithm in Python. It covers data preprocessing, feature extraction,
model training and evaluation, as well as parameter tuning.

References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubrics (each out of 2; Good = 2, Average/Satisfactory = 1):
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks:


Experiment No: 16
Write a text classification pipeline to classify movie reviews as either positive or negative.
Find a good set of parameters using grid search.
Evaluate the performance on a held-out test set.

Date:

Competency and Practical Skills:


Competency skills:

• Basic knowledge of computer systems, operating systems, and file systems.


• Familiarity with command-line interfaces (CLI) and graphical user interfaces (GUI).
• Understanding of programming languages, syntax, and logic.
Practical skills:

• Strong understanding of Natural Language Processing (NLP) concepts, such as tokenization,
stemming, lemmatization, and feature extraction.
• Familiarity with popular NLP libraries such as NLTK, SpaCy, and Scikit-learn.
• Knowledge of machine learning algorithms, such as Naive Bayes, Support Vector Machines, and
Neural Networks.
• Ability to preprocess text data, including removing stop words, cleaning text, and performing
feature engineering.
• Experience with data exploration and visualization tools, such as Pandas and Matplotlib.
• Familiarity with Python programming language and its data science ecosystem, including
NumPy, SciPy, and Pandas.
• Ability to evaluate the performance of a classification model using appropriate metrics such as
accuracy, precision, recall, and F1-score.
• Knowledge of different hyperparameter tuning techniques and cross-validation methods to
optimize the model's performance.

Relevant CO: CO3, CO4, CO5

Objectives: (a) To create an accurate and reliable model that can automatically classify movie reviews as
positive or negative, which is useful for analyzing large volumes of reviews quickly and efficiently,
as well as for providing recommendations to users based on their preferences.

Equipment/Instruments: Personal Computer, Internet, Python

Theory:
The theory behind writing a text classification pipeline to classify movie reviews as either positive or
negative involves several key steps:

Data preprocessing: This step involves cleaning and preparing the raw text data by removing stop words,
converting text to lowercase, and performing stemming or lemmatization.

Feature extraction: This step involves converting the preprocessed text data into a numerical representation
that can be used as input to a machine learning algorithm. Common techniques include Bag-of-Words,
TF-IDF, and Word Embeddings.

Model selection and training: This step involves selecting an appropriate machine learning algorithm and
training it on the preprocessed and transformed data. Popular algorithms include Naive Bayes, Support
Vector Machines, and Neural Networks.

Hyperparameter tuning: This step involves selecting the optimal hyperparameters for the chosen machine
learning algorithm. This can be done using techniques such as grid search or random search.

Evaluation: This step involves evaluating the performance of the trained model on a held-out test set. This
can be done using metrics such as accuracy, precision, recall, and F1-score.

Deployment: This step involves deploying the trained model in a production environment, where it can be
used to classify new movie reviews.

Grid search is a hyperparameter tuning technique that involves searching for the optimal set of
hyperparameters for a given machine learning algorithm by exhaustively trying all possible combinations
of hyperparameter values. This can be done by training and evaluating the model with different
combinations of hyperparameters on a validation set, and selecting the combination that yields the best
performance.
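In scikit-learn this exhaustive search is provided by GridSearchCV. A minimal sketch (parameter names chosen purely for illustration) is:

    # Sketch of exhaustive hyperparameter search with cross-validation.
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],  # 2 options
        "clf__C": [0.1, 1, 10],                  # x 3 options = 6 combinations
    }
    search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
    # search.fit(X_train, y_train)  # fits 6 x 5 = 30 models, keeps the best
    # print(search.best_params_, search.best_score_)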

Evaluating the performance of the trained model on a held-out test set is important to ensure that the model
generalizes well to new, unseen data. This helps to avoid overfitting, where the model performs well on
the training data but poorly on new data.

Overall, the theory behind writing a text classification pipeline to classify movie reviews as either positive
or negative involves a combination of data preprocessing, feature extraction, model selection and training,
hyperparameter tuning, evaluation, and deployment.

Safety and necessary Precautions:

1. Data preprocessing
2. Feature extraction
3. Model selection
4. Hyperparameter tuning
5. Evaluation

Procedure:
1. Preprocess the data: Preprocess the movie review data by cleaning the text, removing stop words,
and performing stemming or lemmatization to reduce the dimensionality of the feature space.

2. Split the data: Split the preprocessed data into training, validation, and test sets. The training set
will be used to train the model, the validation set will be used to tune the hyperparameters, and the
test set will be used to evaluate the final performance of the model.

3. Extract features: Extract features from the preprocessed text using techniques such as Bag-of-
Words, TF-IDF, or Word Embeddings. This will convert the text data into a numerical
representation that can be used as input to a machine learning algorithm.

4. Select a model: Choose a suitable machine learning algorithm, such as Naive Bayes, Support
Vector Machines, or Neural Networks, and train it on the preprocessed and transformed data.

5. Hyperparameter tuning: Use grid search to find the best set of hyperparameters for the chosen
machine learning algorithm. This involves training and evaluating the model with different
combinations of hyperparameters on the validation set, and selecting the combination that yields
the best performance.


Observations:

Movie_reviews.csv


Code:
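The original code is shown only as a screenshot; a minimal end-to-end sketch of the pipeline described above, assuming Movie_reviews.csv has 'review' and 'sentiment' columns (hypothetical names), could look like:

    # Sketch only; CSV layout and column names are assumptions.
    import pandas as pd
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, classification_report

    df = pd.read_csv("Movie_reviews.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["review"], df["sentiment"], test_size=0.2, random_state=42)

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("clf", LinearSVC()),
    ])
    params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "clf__C": [0.1, 1, 10]}

    # Grid search over cross-validation folds picks the best hyperparameters.
    grid = GridSearchCV(pipe, params, cv=5)
    grid.fit(X_train, y_train)
    print("Best parameters:", grid.best_params_)

    # Final evaluation on the held-out test set.
    y_pred = grid.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))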

Output:


Conclusion:
The provided code implements a text classification pipeline to classify movie reviews as either positive
or negative. It uses a Support Vector Machine (SVM) classifier, along with TF-IDF vectorization for
feature extraction. The grid search technique is employed to find the best combination of
hyperparameters for the SVM classifier, enhancing its performance. Finally, the model's performance is
evaluated on a held-out test set using metrics such as accuracy, precision, recall, and F1-score,
providing a comprehensive assessment of the model's effectiveness in classifying movie reviews.
In conclusion, this pipeline demonstrates an effective approach to sentiment analysis on movie reviews,
showcasing the power of grid search in fine-tuning model parameters for optimal performance on
real-world data. Keep in mind that this is a versatile framework and can be adapted to other text
classification tasks with an appropriate dataset and labels.


Quiz:
1. What is the first step you should take when developing a text classification pipeline?
Ans: The first step in developing a text classification pipeline is to acquire and prepare the data.
This involves collecting a dataset of labeled text samples (where each sample has a
corresponding category or label), and performing preprocessing tasks like cleaning,
tokenization, and possibly stemming or lemmatization.

2. What are some techniques for feature extraction in text classification?


Ans: Techniques for feature extraction in text classification include:
• Bag-of-Words (BoW): Represents text as a collection of unique words, ignoring
grammar and word order.
• Term Frequency-Inverse Document Frequency (TF-IDF): Weights terms based on their
frequency in a document relative to their frequency in the entire corpus.
• Word Embeddings (e.g., Word2Vec, GloVe): Represent words as vectors in a
continuous vector space based on their semantic relationships.
• Character-level N-grams: Analyzes character sequences of length 'n' to capture
morphological and spelling patterns.
• Part-of-Speech Tagging: Identifies the grammatical components of words in a sentence.
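A toy comparison of the first two techniques (illustrative only) shows how raw counts differ from TF-IDF weights:

    # Bag-of-Words counts vs. TF-IDF weights on a two-document corpus.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = ["the movie was great", "the movie was terrible"]
    print(CountVectorizer().fit_transform(corpus).toarray())           # raw counts
    print(TfidfVectorizer().fit_transform(corpus).toarray().round(2))  # weights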

3. Which of the following algorithms is not suitable for text classification?


Ans: Nearest Neighbor Algorithm (e.g., k-Nearest Neighbors or k-NN) is generally not suitable for
text classification. While it can work for some simple cases, it tends to struggle with high-
dimensional data like text, as it doesn't handle the "curse of dimensionality" well. Other algorithms
like Support Vector Machines (SVM), Naive Bayes, and Neural Networks are typically more
effective for text classification tasks.

4. What is grid search used for in text classification?


Ans: Grid search is used to systematically search through a hyperparameter space to find the best
combination of hyperparameters for a machine learning model. In text classification, it helps to
fine-tune parameters like the choice of kernel in SVM, the number of layers in a neural network, or
the learning rate in a gradient boosting model.

5. How do you evaluate the performance of a text classification model?


Ans: The performance of a text classification model can be evaluated using metrics such as:
• Accuracy: The proportion of correctly classified instances.
• Precision: The proportion of true positives among all predicted positives.
• Recall: The proportion of true positives among all actual positives.


• F1-score: The harmonic mean of precision and recall, balancing the trade-off
between false positives and false negatives.
• Confusion Matrix: Provides a detailed breakdown of true positives, true
negatives, false positives, and false negatives.
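These metrics are all one-liners in scikit-learn (toy predictions shown):

    # Computing the metrics above on toy labels.
    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]
    print("accuracy :", accuracy_score(y_true, y_pred))   # 5/6 correct
    print("precision:", precision_score(y_true, y_pred))  # 3/3 = 1.0
    print("recall   :", recall_score(y_true, y_pred))     # 3/4 = 0.75
    print("f1       :", f1_score(y_true, y_pred))         # ~0.857
    print(confusion_matrix(y_true, y_pred))               # [[TN FP] [FN TP]]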

6. What is the purpose of a held-out test set?


Ans: The purpose of a held-out test set is to provide an independent dataset that the model has never
seen during training. It allows for an unbiased evaluation of the model's performance on new, unseen
data. This helps to ensure that the model generalizes well and performs accurately in real-world
scenarios. The held-out test set is crucial for validating the effectiveness of the model before
deploying it in production or using it for decision-making.

Suggested Reference:

1. "Introduction to Machine Learning with Python" by Andreas C. Müller and Sarah Guido - This
book provides a comprehensive introduction to machine learning and includes a section on text
classification. It covers topics such as preprocessing text data, feature extraction, and model
evaluation.

2. "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward Loper -
This book provides an introduction to natural language processing and includes a section on
text classification. It covers topics such as feature selection, training classifiers, and evaluation
metrics.

"Scikit-learn documentation" - Scikit-learn is a popular machine learning library in Python. The


documentation includes a section on text classification, which covers preprocessing text data, feature
extraction, and model selection. It also provides examples of how to use grid search to find the best set of
hyperparameters for a model

3. "Text Classification in Python using spaCy" by Dipanjan Sarkar - This tutorial provides an
introduction to text classification using spaCy, a popular NLP library in Python. It covers topics
such as preprocessing text data, feature extraction, model selection, and hyperparameter tuning.

4. "Sentiment Analysis on Movie Reviews" Kaggle competition - This Kaggle competition


provides a dataset of movie reviews labeled as positive or negative. It includes notebooks from
participants that demonstrate how to preprocess the data, extract features, and train models. It
also provides examples of how to use grid search to find the best set of hyperparameters for a
model.


References used by the students:

1. https://www.w3schools.com/python/
2. https://collab.research.google.com/

Rubric wise marks obtained:

Rubrics (each out of 2; Good = 2, Average/Satisfactory = 1):
1. Knowledge of subject (2): Good (2) / Average (1)
2. Programming Skill (2): Good (2) / Average (1)
3. Team work (2): Good (2) / Satisfactory (1)
4. Communication Skill (2): Good (2) / Satisfactory (1)
5. Ethics (2): Good (2) / Average (1)
Total Marks:

