
Practical No.-01

Aim: Write a Python program to understand data using the following operations:

a) Loading data from CSV file

b) Retrieve the basic information of the data in the dataset

Software required: Python (Jupyter Notebook)

Dataset: Iris Dataset

Link for Dataset: https://www.kaggle.com/datasets/uciml/iris

Theory:

Pandas is an open-source library designed for working with relational or labeled data easily
and intuitively. It provides various data structures and operations for manipulating numerical data
and time series. The library is built on top of NumPy, and it offers high performance and
productivity for users.

Pandas dataframe.shape: returns a tuple representing the dimensionality of the DataFrame.

The Pandas dataframe.info() function is used to get a concise summary of the dataframe. It comes
in handy during exploratory analysis of the data, so we use the dataframe.info() function to get a
quick overview of the dataset.

Pandas dataframe.head() returns the first n rows (observe the index values). The default number of rows
displayed is five, but you may pass a custom number.

Pandas dataframe.tail() returns the last n rows (observe the index values).
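As a sketch of these functions, using a few hand-typed rows in place of the downloaded file (with the real dataset you would simply call pd.read_csv("Iris.csv"), assuming that is the file name from the Kaggle download):

```python
import io
import pandas as pd

# A few sample rows standing in for the downloaded Iris CSV.
# Replace the StringIO with the real path, e.g. pd.read_csv("Iris.csv").
csv_data = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,4.9,3.0,1.4,0.2,Iris-setosa\n"
    "3,7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "4,6.3,3.3,6.0,2.5,Iris-virginica\n"
)
df = pd.read_csv(csv_data)

print(df.shape)    # tuple of (rows, columns)
df.info()          # column names, dtypes, non-null counts, memory usage
print(df.head(2))  # first 2 rows
print(df.tail(2))  # last 2 rows
```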

Task to be performed

a) Loading iris data from CSV file

b) Retrieve the basic information of the data in the iris dataset

Code Snippet:

PRMITR/CSE/DWM/2022-23 Page 1
Output:

Viva Question:

1) What is pandas?
2) What is a DataFrame?
3) How do you load data into the Python environment?
4) What is the role of the info() function?
5) What is the role of the head() and tail() functions?

Conclusion:

Practical No.-02

Aim: Write a Python program to retrieve a basic statistical description of data from a dataset.

Software required: Python (Jupyter Notebook)

Dataset: Iris Dataset

Link for Dataset: https://www.kaggle.com/datasets/uciml/iris

Theory:

Statistical analysis is the process of collecting and analyzing data in order to discern patterns
and trends. It is a method for removing bias from evaluating data by employing numerical analysis.
This technique is useful for collecting the interpretations of research, developing statistical models,
and planning surveys and studies.

Statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify
common patterns and trends to convert them into meaningful information. In simple words, statistical
analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured
data.

The conclusions drawn through statistical analysis facilitate decision-making and help
businesses make predictions on the basis of past trends. Statistical analysis can be defined as the
science of collecting and analyzing data to identify trends and patterns and presenting them. It
involves working with numbers and is used by businesses and other institutions to derive meaningful
information from data.

Pandas describe() is used to view basic statistical details like percentiles, mean, std, etc. of a
data frame or a series of numeric values. When this method is applied to a Series of strings, it
returns a different summary (count, unique, top, and freq).

Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None)

Parameters:
percentiles: list-like of numbers between 0 and 1, specifying which percentiles to return
include: list of data types to be included while describing the dataframe. Default is None
exclude: list of data types to be excluded while describing the dataframe. Default is None
Return type: statistical summary of the data frame.

The describe() method returns description of the data in the DataFrame.

If the DataFrame contains numerical data, the description contains the following information for each column:

count - the number of non-empty values.
mean - the average (mean) value.
std - the standard deviation.
min - the minimum value.
25% - the 25th percentile.
50% - the 50th percentile (median).
75% - the 75th percentile.
max - the maximum value.
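A minimal sketch of describe() on a toy frame with one numeric and one string column (the column names mirror the Iris CSV, but the rows here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Numeric columns: count, mean, std, min, quartiles, max
print(df.describe())

# A Series of strings gets a different summary: count, unique, top, freq
print(df["Species"].describe())

# Summarize every column, numeric or not, in one table
print(df.describe(include="all"))
```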

Task to be Performed :

1. Retrieve the basic statistics of the IRIS dataset

2. Write the inferences drawn from the basic statistics of the IRIS dataset

Code Snippet:

Output:

Viva Questions
1. What are the measures of central tendency?
2. What is descriptive statistics?
3. What are the different types of statistics with respect to data analysis?
4. Which functions are required for retrieving statistics?
5. What is the difference between an inference and a conclusion?

Conclusion:

Practical No.-03

Aim: Write a Python program to implement basic data preprocessing.

Software required: Python (Jupyter Notebook)

Dataset: Iris Dataset

Link for Dataset: https://www.kaggle.com/datasets/uciml/iris

Theory:

Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique used to convert raw data into a clean data set. In
other words, whenever data is gathered from different sources it is collected in a raw format
which is not feasible for analysis.

Need of Data Preprocessing

• For achieving better results from the applied model in Machine Learning projects, the data
has to be in a proper format. Some Machine Learning models need information in a specified
format; for example, the Random Forest algorithm does not support null values, so null values
have to be managed in the original raw data set before the algorithm can be executed.

• Another aspect is that the data set should be formatted in such a way that more than one
Machine Learning or Deep Learning algorithm can be executed on it, and the best of them
chosen.

Task to be performed:

Write a python program to impute missing values with various techniques on given dataset.

a) Remove rows/attributes

b) Replace with mean or mode
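Both techniques can be sketched on a toy frame with injected missing values (the column names follow the Iris CSV; the rows are illustrative stand-ins for the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values standing in for the real dataset
df = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 7.0, 6.3],
    "Species": ["Iris-setosa", "Iris-setosa", None, "Iris-virginica"],
})

# a) Remove: dropna() discards rows with any missing value
#    (dropna(axis=1) would drop whole columns instead)
dropped = df.dropna()

# b) Replace: mean for the numeric column, mode for the categorical one
filled = df.copy()
filled["SepalLengthCm"] = filled["SepalLengthCm"].fillna(filled["SepalLengthCm"].mean())
filled["Species"] = filled["Species"].fillna(filled["Species"].mode()[0])

print(dropped)
print(filled)
```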

Code Snippet:

Output:

Viva Questions
1. What are the measures of central tendency?
2. What is descriptive statistic?
3. What are the different types of statistic with respect to data analysis
4. Write function required for retrieving statistics
5. Difference between inference and conclusion

Conclusion:

Practical No.-04

Aim: Write a Python program to implement data visualization technique

Software required: Python (Jupyter Notebook)

Dataset: Iris Dataset

Link for Dataset: https://www.kaggle.com/datasets/uciml/iris

Theory

Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for
employees or business owners to present data to non-technical audiences without confusion.

General Types of Visualizations:

• Chart: Information presented in a tabular, graphical form with data displayed along two axes.
Can be in the form of a graph, diagram, or map.
• Table: A set of figures displayed in rows and columns.
• Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables
in comparison to each other, usually along two axes at a right angle.
• Geospatial: A visualization that shows data in map form, using different shapes and colors to
show the relationship between pieces of data and specific locations.
• Infographic: A combination of visuals and words that represents data. Usually uses charts or
diagrams.
• Dashboard: A collection of visualizations and data displayed in one place to help with
analyzing and presenting data.
More specific examples:

• Area Map: A form of geospatial visualization, area maps are used to show specific values set
over a map of a country, state, county, or any other geographic location. Two common types
of area maps are choropleths and isopleths.
• Bar Chart: Bar charts represent numerical values compared to each other. The length of the
bar represents the value of each variable.
• Box-and-whisker Plot: These show a selection of ranges (the box) across a set measure (the
bar).
• Bullet Graph: A bar marked against a background to show progress or performance against a
goal, denoted by a line on the graph.
• Gantt Chart: Typically used in project management, Gantt charts are a bar-chart depiction of
timelines and tasks.
• Heat Map: A type of geospatial visualization in map form which displays specific data
values as different colors (these do not need to be temperatures, but that is a common use).

• Highlight Table: A form of table that uses color to categorize similar data, allowing the
viewer to read it more easily and intuitively.
• Histogram: A type of bar chart that splits a continuous measure into different bins to help
analyze the distribution.
• Pie Chart: A circular chart with wedge-shaped segments that shows data as percentages of a
whole.
• Treemap: A type of chart that shows different, related values in the form of rectangles nested
together.
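Four of the chart types above (histogram, scatter plot, box-and-whisker plot, bar chart) can be sketched with matplotlib on a small stand-in for the Iris data; in Jupyter you would drop the Agg backend line and load the real CSV with pd.read_csv("Iris.csv") instead:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows standing in for the Iris dataset
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "PetalLengthCm": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "Species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# 1) Histogram: distribution of one continuous measure, split into bins
axes[0, 0].hist(df["SepalLengthCm"], bins=4)
axes[0, 0].set_title("Histogram")

# 2) Scatter plot: relationship between two measures
axes[0, 1].scatter(df["SepalLengthCm"], df["PetalLengthCm"])
axes[0, 1].set_title("Scatter")

# 3) Box-and-whisker plot: quartile ranges of a measure
axes[1, 0].boxplot(df["PetalLengthCm"])
axes[1, 0].set_title("Box plot")

# 4) Bar chart: row count per species
counts = df["Species"].value_counts()
axes[1, 1].bar(counts.index, counts.values)
axes[1, 1].set_title("Bar chart")

fig.tight_layout()
fig.savefig("iris_plots.png")
```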

Task to be performed:

Perform any four visualization types on the IRIS dataset.

Code Snippet:

Output:

Viva Questions

1. What are the different data visualization graphs?
2. What are the different data visualization charts?
3. What are the different data visualization maps?
4. What is the difference between a chart and a graph?

Conclusion:

Practical No.-05

Aim: Write a Python program to implement correlation analysis.

Software required: Python (Jupyter Notebook)

Theory

Correlation coefficients quantify the association between variables or features of a dataset.
These statistics are of high importance for science and technology, and Python has great tools that you
can use to calculate them. SciPy, NumPy, and Pandas correlation methods are fast, comprehensive,
and well-documented.

Pandas is convenient for calculating statistics. It offers statistical methods
for Series and DataFrame instances. For example, given two Series objects with the same number of
items, you can call .corr() on one of them with the other as the first argument:

>>> import pandas as pd


>>> x = pd.Series(range(10, 20))
>>> y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
>>> x.corr(y) # Pearson's r
>>> y.corr(x)
>>> x.corr(y, method='spearman') # Spearman's rho
>>> x.corr(y, method='kendall') # Kendall's tau
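The snippet above can be made runnable as a script by printing the coefficients instead of relying on the REPL echo; note also that calling .corr() on a whole DataFrame returns the full pairwise correlation matrix:

```python
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r measures linear association; Spearman's rho works on ranks,
# so it is less affected by the outlier-like jump at 96.
pearson = x.corr(y)                      # Pearson's r
spearman = x.corr(y, method="spearman")  # Spearman's rho
kendall = x.corr(y, method="kendall")    # Kendall's tau
print(pearson, spearman, kendall)

# .corr() on a DataFrame gives every pairwise coefficient at once
df = pd.DataFrame({"x": x, "y": y})
print(df.corr())
```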

Task to be performed:

Run the above snippet on basic Series variables.

Code Snippet:

Output:

Viva Questions

1. What is correlation?
2. What is Pearson's correlation?
3. What is Kendall's correlation?
4. What is Spearman's correlation?
5. What is the difference between correlation and association?

Conclusion:

Practical No.-06

Aim: Create an Employee Table with the help of Data Mining Tool WEKA.

Software Required: Weka

Description:

We need to create an Employee Table with a training data set which includes attributes like name, id,
salary, experience, gender, and phone number.

Steps:

1) Open Start → Programs → Accessories → Notepad

2) Type the following training data set with the help of Notepad for Employee Table.

@relation employee

@attribute name {x,y,z,a,b}

@attribute id numeric

@attribute salary {low,medium,high}

@attribute exp numeric

@attribute gender {male,female}

@attribute phone numeric

@data

x,101,low,2,male,250311

y,102,high,3,female,251665

z,103,medium,1,male,240238

a,104,low,5,female,200200

b,105,high,2,male,240240

3) After that the file is saved with .arff file format.

4) Minimize the arff file and then open Start → Programs → weka-3-4.

5) Click on weka-3-4; the Weka dialog box is displayed on the screen.

6) In that dialog box there are four modes; click on Explorer.

7) Explorer shows many options. Click on 'Open file' and select the arff file.

8) Click on the Edit button, which shows the Employee Table in Weka.

Conclusion:

Practical No.-07

Aim: Create a Weather Table with the help of Data Mining Tool WEKA.

Software Required: Weka

Theory:

We need to create a Weather Table with a training data set which includes attributes like outlook,
temperature, humidity, windy, and play.

Steps:

1) Open Start → Programs → Accessories → Notepad

2) Type the following training data set with the help of Notepad for Weather Table.

@relation weather

@attribute outlook {sunny,rainy,overcast}

@attribute temperature numeric

@attribute humidity numeric

@attribute windy {true,false}

@attribute play {yes,no}

@data

sunny,85.0,85.0,false,no

overcast,80.0,90.0,true,no

sunny,83.0,86.0,false,yes

rainy,70.0,86.0,false,yes

rainy,68.0,80.0,false,yes

rainy,65.0,70.0,true,no

overcast,64.0,65.0,false,yes

sunny,72.0,95.0,true,no

sunny,69.0,70.0,false,yes

rainy,75.0,80.0,false,yes

3) After that the file is saved with .arff file format.

4) Minimize the arff file and then open Start → Programs → weka-3-4.

5) Click on weka-3-4; the Weka dialog box is displayed on the screen.

6) In that dialog box there are four modes; click on Explorer.

7) Explorer shows many options. Click on 'Open file' and select the arff file.

8) Click on the Edit button, which shows the Weather Table in Weka.

Training Data Set → Weather Table

Conclusion:

Practical No.-08

Aim: Apply the Add pre-processing technique to the training data set of the Weather Table.

Software Required: Weka

Theory:

Real-world databases are highly susceptible to noise, missing values, and inconsistency because of
their huge size, so the data is pre-processed to improve its quality; pre-processing also improves
the efficiency of mining.

There are 3 pre-processing techniques:

1) Add

2) Remove

3) Normalization

Creation of Weather Table:

Procedure:

1) Open Start → Programs → Accessories → Notepad

2) Type the following training data set with the help of Notepad for Weather Table.

@relation weather

@attribute outlook {sunny,rainy,overcast}

@attribute temperature numeric

@attribute humidity numeric

@attribute windy {true,false}

@attribute play {yes,no}

@data

sunny,85.0,85.0,false,no

overcast,80.0,90.0,true,no

sunny,83.0,86.0,false,yes

rainy,70.0,86.0,false,yes

rainy,68.0,80.0,false,yes

rainy,65.0,70.0,true,no

overcast,64.0,65.0,false,yes

sunny,72.0,95.0,true,no

sunny,69.0,70.0,false,yes

rainy,75.0,80.0,false,yes

3) After that the file is saved with .arff file format.

4) Minimize the arff file and then open Start → Programs → weka-3-4.

5) Click on weka-3-4; the Weka dialog box is displayed on the screen.

6) In that dialog box there are four modes; click on Explorer.

7) Explorer shows many options. Click on 'Open file' and select the arff file.

8) Click on the Edit button, which shows the Weather Table in Weka.

Add Pre-Processing Technique:

Procedure:

1) Open Start → Programs → weka-3-4.

2) Click on explorer.

3) Click on open file.

4) Select Weather.arff file and click on open.

5) Click on Choose button and select the Filters option.

6) In Filters, we have Supervised and Unsupervised filters.

7) Click on Unsupervised filters.

8) Select the attribute filter Add.

9) A new window is opened.

10) In that, enter the attribute index, type, data format, and nominal label values for the new Climate attribute.

11) Click on OK.

12) Press the Apply button, then a new attribute is added to the Weather Table.

13) Save the file.

14) Click on the Edit button, it shows a new Weather Table on Weka.

Weather Table after adding new attribute CLIMATE:

Task: Apply Pre-Processing techniques to the training data set of Employee Table

Conclusion:

