Practical No.-01
Theory:
Pandas is an open-source Python library built for working with relational or labeled
data easily and intuitively. It provides data structures and operations for manipulating
numerical tables and time series. The library is built on top of NumPy and is designed for
fast, high-performance data manipulation.
The pandas DataFrame.info() method prints a concise summary of a DataFrame: the index range,
column names, non-null counts, data types and memory usage. It is handy during exploratory
analysis for getting a quick overview of a dataset.
The pandas DataFrame.head() method returns the first n rows (observe the index values). The
default is five rows, but you may pass a custom number.
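As a minimal sketch of the two methods described above, the following example builds a small DataFrame and calls info(), head() and tail(). The column names and values are hypothetical and used only for illustration; in the practical you would load your own dataset instead.

```python
import io

import pandas as pd

# Hypothetical sample data; in the practical you would load your own
# dataset instead, for example with pd.read_csv on your file.
df = pd.DataFrame(
    {
        "name": ["Asha", "Ravi", "Meena", "John", "Sara", "Amit", "Lina"],
        "age": [23, 35, 31, 28, 40, 26, 33],
        "salary": [32000.0, 54000.0, None, 41000.0, 67000.0, 38000.0, 50000.0],
    }
)

# info() prints column names, non-null counts, dtypes and memory usage.
# It returns None, so the text is captured here through the buf argument.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# head() returns the first five rows by default; pass n for another count.
first_five = df.head()
first_two = df.head(2)

# tail() is the counterpart that returns the last n rows.
last_three = df.tail(3)

print(first_five)
print(last_three)
```

Note how the index values in the head() output start at 0, while the tail() output keeps the original index of the last rows.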
Task to be performed
Code Snippet:
PRMITR/CSE/DWM/2022-23 Page 1
Output:
Viva Question:
1) What is pandas?
2) What is a DataFrame?
3) How do you load data into the Python environment?
4) What is the role of the info() function?
5) What is the role of the head() and tail() functions?
Conclusion:
Practical No.-02
Aim: Write a Python program to retrieve a basic statistical description of the data in a dataset.
Theory:
Statistical analysis is the process of collecting and analyzing data in order to discern patterns
and trends. It is a method for removing bias from evaluating data by employing numerical analysis.
This technique is useful for collecting the interpretations of research, developing statistical models,
and planning surveys and studies.
Statistical analysis is a scientific tool for collecting and analyzing large amounts of data to identify
common patterns and trends and convert them into meaningful information. In simple words, it helps
draw meaningful conclusions from raw and unstructured data.
Conclusions drawn through statistical analysis facilitate decision-making and help
businesses make future predictions on the basis of past trends. Because it works with numbers, it is
used by businesses and other institutions to derive meaningful information from data.
The pandas describe() method reports basic statistical details such as percentiles, mean and standard
deviation for a DataFrame or a Series of numeric values. When applied to a Series of strings, it returns
a different summary (count, unique, top and freq).
Parameters:
percentiles: List-like of numbers between 0 and 1 giving the percentiles to return. Default is None, which returns the 25th, 50th and 75th percentiles
include: List of data types to be included while describing the DataFrame. Default is None
exclude: List of data types to be excluded while describing the DataFrame. Default is None
Return type: Statistical summary of the DataFrame.
If the DataFrame contains numerical data, the description contains the following information for each column:
count, mean, std, min, the 25%, 50% and 75% percentiles, and max.
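The behaviour of describe() can be illustrated with a short sketch. The DataFrame below, with its "marks" and "grade" columns, is hypothetical and exists only to show the difference between the numeric and the string summary.

```python
import pandas as pd

# Hypothetical dataset used only for illustration.
df = pd.DataFrame(
    {
        "marks": [10, 20, 30, 40, 50],
        "grade": ["C", "B", "B", "A", "A"],
    }
)

# Default call: only numeric columns are summarised.
numeric_summary = df.describe()

# Custom percentiles must lie between 0 and 1.
custom_summary = df.describe(percentiles=[0.1, 0.9])

# A Series of strings reports count, unique, top and freq instead.
text_summary = df["grade"].describe()

print(numeric_summary)
print(custom_summary)
print(text_summary)
```

Passing the percentiles argument changes only the percentile rows of the summary; count, mean, std, min and max are always reported for numeric columns.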
Task to be Performed :
Code Snippet:
Output:
Viva Questions
1. What are the measures of central tendency?
2. What is descriptive statistics?
3. What are the different types of statistics with respect to data analysis?
4. Which function is used to retrieve statistics?
5. What is the difference between inference and conclusion?
Conclusion:
Practical No.-03
Theory:
Pre-processing refers to the transformations applied to data before feeding it to an
algorithm. Data preprocessing converts raw data into a clean data set. In other words,
data gathered from different sources arrives in a raw format that is not suitable for
analysis.
To achieve better results from a Machine Learning model, the data has to be in a proper
format. Some models need information in a specified format; for example, many Random
Forest implementations do not support null values, so null values have to be handled in
the original raw data set before the algorithm is run.
Another aspect is that the data set should be formatted so that more than one Machine
Learning or Deep Learning algorithm can be executed on the same data set and the best of
them chosen.
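Handling null values, as described above, is usually done by imputation. The following is a minimal sketch of a few common techniques in pandas; the "age" and "city" columns and their values are hypothetical, and the choice of techniques is an assumption about what "various techniques" may cover, not a prescribed list.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps, for illustration only.
df = pd.DataFrame(
    {
        "age": [20.0, np.nan, 30.0, 40.0, np.nan],
        "city": ["Pune", "Nagpur", None, "Pune", "Pune"],
    }
)

# Count the missing values per column before imputing.
missing_before = df.isnull().sum()

# Technique 1: replace numeric gaps with the column mean.
mean_filled = df["age"].fillna(df["age"].mean())

# Technique 2: replace numeric gaps with the column median.
median_filled = df["age"].fillna(df["age"].median())

# Technique 3: replace categorical gaps with the most frequent value.
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Technique 4: carry the previous valid value forward.
forward_filled = df["age"].ffill()

# Technique 5: drop every row that has a missing value.
rows_dropped = df.dropna()

print(missing_before)
print(mean_filled)
print(rows_dropped)
```

Mean and median suit numeric columns, the mode suits categorical columns, and dropping rows is only reasonable when few rows are affected.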
Task to be performed:
Write a Python program to impute missing values with various techniques on a given dataset.
Code Snippet:
Output:
Viva Questions
1. What are the measures of central tendency?
2. What is descriptive statistics?
3. What are the different types of statistics with respect to data analysis?
4. Which function is used to retrieve statistics?
5. What is the difference between inference and conclusion?
Conclusion:
Practical No.-04
Theory
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and
understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for
employees or business owners to present data to non-technical audiences without confusion.
Chart: Information presented in a tabular, graphical form with data displayed along two axes.
Can be in the form of a graph, diagram, or map.
Table: A set of figures displayed in rows and columns.
Graph: A diagram of points, lines, segments, curves, or areas that represents certain variables
in comparison to each other, usually along two axes at a right angle.
Geospatial: A visualization that shows data in map form using different shapes and colors to
show the relationship between pieces of data and specific locations.
Infographic: A combination of visuals and words that represent data. Usually uses charts or
diagrams.
Dashboards: A collection of visualizations and data displayed in one place to help with
analyzing and presenting data.
More specific examples
Area Map: A form of geospatial visualization, area maps are used to show specific values set
over a map of a country, state, county, or any other geographic location. Two common types
of area maps are choropleths and isopleths.
Bar Chart: Bar charts represent numerical values compared to each other. The length of the
bar represents the value of each variable.
Box-and-whisker Plots: These show a selection of ranges (the box) across a set measure (the
bar).
Bullet Graph: A bar marked against a background to show progress or performance against a
goal, denoted by a line on the graph.
Gantt Chart: Typically used in project management, Gantt charts are a bar chart depiction of
timelines and tasks.
Heat Map: A type of geospatial visualization in map form which displays specific data
values as different colors (this doesn’t need to be temperatures, but that is a common use).
Highlight Table: A form of table that uses color to categorize similar data, allowing the
viewer to read it more easily and intuitively.
Histogram: A type of bar chart that splits a continuous measure into different bins to help
analyze the distribution.
Pie Chart: A circular chart with wedge-shaped segments that shows data as a percentage of a
whole.
Treemap: A type of chart that shows different, related values in the form of rectangles nested
together.
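Three of the chart types listed above (bar chart, histogram and pie chart) can be sketched with matplotlib as follows. The use of matplotlib, the department and marks data, and the output file name are all assumptions made for illustration; the practical may call for a different library or dataset.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {
        "department": ["CSE", "IT", "ENTC", "MECH"],
        "students": [120, 90, 60, 75],
    }
)
marks = [35, 42, 48, 51, 55, 58, 61, 64, 67, 72, 78, 85]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bar chart: bar length encodes the value of each category.
axes[0].bar(df["department"], df["students"])
axes[0].set_title("Bar chart")

# Histogram: a continuous measure split into bins.
counts, bin_edges, _ = axes[1].hist(marks, bins=5)
axes[1].set_title("Histogram")

# Pie chart: each wedge is a share of the whole.
axes[2].pie(df["students"], labels=df["department"], autopct="%1.0f%%")
axes[2].set_title("Pie chart")

fig.tight_layout()
fig.savefig("charts.png")  # hypothetical output file name
```

In an interactive session, plt.show() can replace the savefig call to display the figure directly.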
Task to be performed:
Code Snippet:
Output:
Viva Questions
Conclusion:
Practical No.-05
Theory
Task to be performed:
Code Snippet:
Output:
Viva Questions
1. What is correlation?
2. What is Pearson correlation?
3. What is Kendall correlation?
4. What is Spearman correlation?
5. What is the difference between correlation and association?
Conclusion:
Practical No.-06
Aim: Create an Employee Table with the help of Data Mining Tool WEKA.
Description:
We need to create an Employee Table from a training data set with the attributes name, id,
salary, experience, gender and phone number.
Steps:
2) Type the following training data set with the help of Notepad for Employee Table.
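@relation employee
@attribute name {x,y,z,a,b}
@attribute id numeric
@attribute salary {low,medium,high}
@attribute experience numeric
@attribute gender {male,female}
@attribute phone numeric
@data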
x,101,low,2,male,250311
y,102,high,3,female,251665
z,103,medium,1,male,240238
a,104,low,5,female,200200
b,105,high,2,male,240240
4) Minimize the arff file and then open Start > Programs > weka-3-4.
7) The Explorer shows many options. Click ‘Open file’ and select the arff file.
Conclusion:
Practical No.-07
Aim: Create a Weather Table with the help of Data Mining Tool WEKA.
Theory:
We need to create a Weather Table from a training data set with the attributes outlook,
temperature, humidity, windy and play.
Steps:
2) Type the following training data set with the help of Notepad for Weather Table.
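@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true,false}
@attribute play {yes,no}
@data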
sunny,85.0,85.0,false,no
overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes
rainy,70.0,86.0,false,yes
rainy,68.0,80.0,false,yes
rainy,65.0,70.0,true,no
overcast,64.0,65.0,false,yes
sunny,72.0,95.0,true,no
sunny,69.0,70.0,false,yes
rainy,75.0,80.0,false,yes
4) Minimize the arff file and then open Start > Programs > weka-3-4.
6) The dialog box offers four modes; click on Explorer.
7) The Explorer shows many options. Click ‘Open file’ and select the arff file.
Conclusion:
Practical No.-08
Aim: Apply the Add pre-processing technique to the training data set of the Weather Table.
Theory:
Real-world databases are highly susceptible to noisy, missing and inconsistent data because of their
huge size. Pre-processing the data improves its quality, and with it the quality and efficiency of the
mining results. The pre-processing techniques are:
1) Add
2) Remove
3) Normalization
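WEKA applies these techniques through its filters. As a rough illustration of what each one does to the data, the three operations can be sketched in pandas. The temperature and humidity values are copied from the Weather training data; the min-max formula is an assumption about the kind of scaling intended, and the "unknown" value for the Climate attribute is a hypothetical placeholder. This is not WEKA's own code.

```python
import pandas as pd

# Temperature and humidity values copied from the Weather training data.
weather = pd.DataFrame(
    {
        "temperature": [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0],
        "humidity": [85.0, 90.0, 86.0, 86.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0],
    }
)

# Add: create a new attribute (placeholder value, hypothetical).
with_climate = weather.assign(climate="unknown")

# Remove: drop an existing attribute.
without_humidity = weather.drop(columns="humidity")

# Normalization (min-max): (x - min) / (max - min) maps each column to [0, 1].
normalized = (weather - weather.min()) / (weather.max() - weather.min())

print(with_climate.head())
print(without_humidity.head())
print(normalized.round(3))
```

After normalization, the smallest value in each column becomes 0 and the largest becomes 1, so attributes measured on different scales become comparable.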
Procedure:
2) Type the following training data set with the help of Notepad for Weather Table.
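@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true,false}
@attribute play {yes,no}
@data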
sunny,85.0,85.0,false,no
overcast,80.0,90.0,true,no
sunny,83.0,86.0,false,yes
rainy,70.0,86.0,false,yes
rainy,68.0,80.0,false,yes
rainy,65.0,70.0,true,no
overcast,64.0,65.0,false,yes
sunny,72.0,95.0,true,no
sunny,69.0,70.0,false,yes
rainy,75.0,80.0,false,yes
4) Minimize the arff file and then open Start > Programs > weka-3-4.
7) The Explorer shows many options. Click ‘Open file’ and select the arff file.
Procedure:
2) Click on explorer.
8) Select the attribute Add.
10) Enter the attribute index, type, data format and nominal label values for Climate.
12) Press the Apply button; a new attribute is added to the Weather Table.
14) Click the Edit button; it shows the new Weather Table in WEKA.
Task: Apply Pre-Processing techniques to the training data set of Employee Table
Conclusion: