
Dr. Mahalingam College of Engineering & Technology

Department of Artificial Intelligence and Data Science
19ADPN6401 Mini Project
Zeroth Review
Team Number: 23BADA01
Domain: Data Mining

Title: Data Cleaning Using Python

Guide: N. Suba Rani

Team Members:
Viswanthan.R (727621BAD002)
Santhosh.K (727621BAD035)
Vigneshwaran.R (727621BAD044)
Venkateesh.V.K (727622BAD302)



Contents
⚫ Problem Statement
⚫ Literature Identified and Findings
⚫ Objective
⚫ Module Description
⚫ Software and Hardware Requirements
⚫ Data Set
⚫ References
⚫ Online Certification Courses
⚫ Weekly Plan



Problem Statement

⮚ Analysts may lose actionable insights when data is incomplete. This is common when missing observations and outliers are simply dropped.
⮚ The problem can grow when cleaning is automated. Some automated data cleaning tools are not very smart and may end up mishandling some observations in the dataset.
⮚ Data cleaning is time-consuming, especially when dealing with large data.
⮚ The process is very expensive.



Objective
⮚ To improve the quality of the data set.
⮚ To improve decision-making capabilities.
⮚ To remove unwanted observations.
⮚ To fix structural errors.
⮚ To manage unwanted outliers.
⮚ To handle missing data.
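
As an illustration of the structural-error objective, the following minimal sketch normalizes inconsistent text labels (stray whitespace, mixed capitalization) so that equivalent values compare equal. The function name `normalize_label` and the sample genre values are hypothetical and chosen only for illustration; they are not taken from the project data.

```python
def normalize_label(value):
    """Collapse runs of whitespace, trim the ends, and title-case a text label."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(value.split()).title()


# Hypothetical raw labels that should denote only two distinct categories.
raw_genres = ["  action", "Action ", "ACTION", "sci  fi", "Sci Fi"]
clean_genres = [normalize_label(g) for g in raw_genres]
unique_genres = sorted(set(clean_genres))
print(unique_genres)  # ['Action', 'Sci Fi']
```

Title-casing is only one possible convention; lower-casing every label would work equally well as long as it is applied consistently.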



Block Diagram
[Figure: block diagram containing the following blocks]
⚫ Data Input
⚫ Data cleaning
⚫ Remove redundant data
⚫ NaN values are replaced with median
⚫ Standardization of data
⚫ Dimensionality reduction
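
The blocks of the diagram can be sketched in plain Python as below. This is only an illustration under several assumptions: the order of the steps is assumed, missing (NaN) entries are represented as `None`, dimensionality reduction is approximated by simply dropping constant columns (the slides do not say which method is used), and all function names and sample rows are hypothetical.

```python
import statistics


def remove_duplicates(rows):
    """Drop repeated rows, keeping the first occurrence and the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_with_median(rows):
    """Replace None in each column with the median of that column's known values."""
    columns = list(zip(*rows))
    medians = [statistics.median(v for v in col if v is not None) for col in columns]
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def drop_constant_columns(rows):
    """Assumed stand-in for dimensionality reduction: remove columns with one value."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if len(set(col)) > 1]
    return [[row[j] for j in keep] for row in rows]


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - means[j]) / stds[j] if stds[j] else 0.0 for j, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical numeric input: one duplicate row, one missing value, one constant column.
data_input = [
    [1.0, 10.0, 5.0],
    [2.0, None, 5.0],
    [1.0, 10.0, 5.0],
    [3.0, 30.0, 5.0],
]

cleaned = remove_duplicates(data_input)
cleaned = fill_with_median(cleaned)
cleaned = drop_constant_columns(cleaned)
cleaned = standardize(cleaned)
print(cleaned)
```

After these steps the sample has three rows and two columns, and each remaining column has zero mean.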



Module Description
● Data cleaning - In this stage, data is collected from the data set and redundant data is eliminated.

● Missing values - Many values are missing or filled with garbage entries. Values can be missing for several reasons: they may be missing because of a specific cause such as a sensor error or other factors, or they may be missing completely at random.

● Outliers - Certain numeric values are often much larger or much smaller than the average value of the specific column. These are considered outliers. Outliers need special
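
One common way to flag such values is the interquartile-range (IQR) rule, sketched below. The slides do not state which detection rule the project uses, so the rule itself, the factor `k=1.5`, the function names, and the sample runtimes are all assumptions made for illustration.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (low, high) limits outside which a value is flagged as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """List the values that fall outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical movie runtimes in minutes; 400 is far from the rest.
runtimes = [90, 95, 100, 105, 110, 115, 400]
print(iqr_bounds(runtimes))     # (65.0, 145.0)
print(find_outliers(runtimes))  # [400]
```

The quartiles are less sensitive to extreme values than the mean, which is why this rule is often preferred over a plain distance-from-average check on small samples.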


Noisy data - Noisy data unnecessarily increases the amount of storage space
required and can also adversely affect the results of any data mining
analysis.
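
A classic way to reduce noise in a numeric attribute is smoothing by bin means: the sorted values are split into equal-size bins and every value is replaced by the mean of its bin. The slides do not name a smoothing method, so this technique is an assumption; the function name `smooth_by_bin_means` and the sample values are illustrative only.

```python
import statistics


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace each by its bin mean.

    The result is returned in sorted order, not in the original input order.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([statistics.fmean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative values for a single numeric attribute.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```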

Ignore the tuples - This approach is suitable only when the dataset we have
is quite large and multiple values are missing within a tuple.
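
The tuple-ignoring approach can be sketched as follows: garbage placeholders are first marked as missing, then any row with more than one missing value is dropped. The placeholder set `GARBAGE_MARKERS`, the threshold `max_missing`, the function names, and the sample rows are all hypothetical choices for this sketch.

```python
# Hypothetical strings treated as garbage entries; adjust to the actual dataset.
GARBAGE_MARKERS = {"", "?", "N/A", "NA", "null", "-"}


def mark_missing(row):
    """Replace garbage placeholder strings in a row with None."""
    return [
        None if isinstance(v, str) and v.strip() in GARBAGE_MARKERS else v
        for v in row
    ]


def ignore_sparse_tuples(rows, max_missing=1):
    """Keep only rows that have at most max_missing missing values."""
    kept = []
    for row in rows:
        cleaned = mark_missing(row)
        if sum(v is None for v in cleaned) <= max_missing:
            kept.append(cleaned)
    return kept


# Hypothetical rows: title, release year, rating.
movies = [
    ["Movie A", 2010, 7.5],
    ["Movie B", "N/A", 6.1],
    ["Movie C", "?", None],
    ["", None, "-"],
]
kept = ignore_sparse_tuples(movies)
print(kept)  # [['Movie A', 2010, 7.5], ['Movie B', None, 6.1]]
```

Rows with a single gap are kept here so that the remaining gap can still be filled later, for example with the column median.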

Requirements
Software requirements:
● Windows 10
● Jupyter Notebook

Hardware requirements:
● Processor – i5
● Hard Disk – 4 GB
● Memory – 1 GB RAM



Data Set
The dataset is obtained from kaggle.com and contains over 900 unique movies.



References
Text books:
Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, 3rd Edition, Elsevier, 2012.
Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, “Mining of Massive Datasets”, 2nd Edition, Cambridge University Press, 2014.
Ian H. Witten, Eibe Frank, Mark A. Hall, “Data Mining: Practical Machine Learning Tools and Techniques”, 3rd Edition, Elsevier, 2011.
EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, Wiley, 2015.
Bill Franks, “Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics”, John Wiley & Sons, 2013.

Web references:
http://www.cs.waikato.ac.nz/ml/weka/documentation.html
https://cran.r-project.org/manuals.html
https://archive.ics.uci.edu/ml/index.html



Online Certification Courses
⚫ Data mining course on Great Learning.
