Project


DATA CLEANING

Data cleaning, also called data cleansing, is the process of
detecting and correcting (or removing) corrupt or inaccurate
records from a record set, table, or database. It involves
identifying incomplete, incorrect, inaccurate, or irrelevant parts
of the data and then replacing, modifying, or deleting the
dirty data.
STEPS FOR DATA CLEANING

1. IMPORTING THE DATA
2. EXPLORING THE RAW DATA
3. REMOVAL OF UNWANTED OBSERVATIONS
4. FIXING STRUCTURAL ERRORS
5. MANAGING UNWANTED DATA
6. HANDLING MISSING DATA
7. EXPORTING THE DATASET
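The seven steps above can be sketched end to end in base R. This is only an illustrative outline, not a prescribed method: the toy data frame, its column names (id, city, score, notes), the temporary file paths, and the choice of mean imputation are all assumptions made for the example.

```r
# Hypothetical toy data standing in for a real file (all names are assumptions)
raw <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  city  = c("Delhi", "delhi ", "delhi ", "Mumbai", NA),
  score = c(10, 12, 12, NA, 15),
  notes = c("a", "b", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# 1. Import: write the toy data to a temporary CSV and read it back
path_in <- tempfile(fileext = ".csv")
write.csv(raw, path_in, row.names = FALSE)
abc <- read.csv(path_in, stringsAsFactors = FALSE)

# 2. Explore the raw data
str(abc)

# 3. Remove unwanted observations (here: exact duplicate rows)
abc <- abc[!duplicated(abc), ]

# 4. Fix structural errors (stray whitespace, inconsistent capitalization)
abc$city <- tolower(trimws(abc$city))

# 5. Manage unwanted data (drop a column that is not needed)
abc$notes <- NULL

# 6. Handle missing data: impute the numeric column with its mean,
#    then drop rows where the city is still missing
abc$score[is.na(abc$score)] <- mean(abc$score, na.rm = TRUE)
abc <- abc[!is.na(abc$city), ]

# 7. Export the cleaned dataset
path_out <- tempfile(fileext = ".csv")
write.csv(abc, path_out, row.names = FALSE)
```

Whether to impute or drop missing values depends on the dataset; the example does both only to show each option once.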
DATA CLEANING WITH R
• TO UNDERSTAND THE DATA, WE LOAD THE DPLYR LIBRARY AND USE THE FOLLOWING FUNCTIONS

library(dplyr)
abc <- read.csv("abc.csv")   # load the dataset

• View its class: class(abc)
• View its dimensions: dim(abc)
• For column names: names(abc)
• For the structure of the data: str(abc)
• glimpse(abc)   # from dplyr; similar to str()
• summary(abc)
• head(abc)
• tail(abc)
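As a rough illustration of the exploration functions, the sketch below runs them on a small made-up data frame. The column names xy, ty, and grp are assumptions for the example; glimpse() is called only if dplyr happens to be installed, since the other functions are base R.

```r
# Small made-up data frame (column names are assumptions for illustration)
abc <- data.frame(
  xy  = c(1.5, 2.0, NA, 4.2),
  ty  = c(10, 20, 30, 40),
  grp = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

class(abc)     # "data.frame"
dim(abc)       # 4 rows, 3 columns
names(abc)     # column names
str(abc)       # compact structure: types and first values
summary(abc)   # per-column summaries, including NA counts for numeric columns
head(abc, 2)   # first two rows
tail(abc, 2)   # last two rows

# glimpse() comes from dplyr; call it only when the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(abc)
}
```

Note that R is case-sensitive, so these functions must be typed in lowercase.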
• FOR VISUALIZING
We use
• hist(abc$xy) for a single variable
• plot(abc$xy, abc$ty) for the relationship between two variables

• FOR MISSING VALUES
Checking for NAs
• is.na(abc)
• which(is.na(x)) gives the positions of NAs in a particular row or column x
• any(is.na(abc))
• sum(is.na(abc))
• summary(abc)
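The NA checks can be tried on a tiny data frame. This is a minimal sketch with made-up values; the colSums() line is an addition not on the slide, included because it gives a per-column count.

```r
# Made-up data with three missing values in total
abc <- data.frame(xy = c(1, NA, 3, NA), ty = c(NA, 2, 3, 4))

na_matrix <- is.na(abc)            # logical matrix, TRUE where a value is missing
has_na    <- any(is.na(abc))       # TRUE: at least one NA exists
n_na      <- sum(is.na(abc))       # 3: total number of NAs
na_rows   <- which(is.na(abc$xy))  # 2 4: positions of NAs in column xy
na_by_col <- colSums(is.na(abc))   # xy = 2, ty = 1: NAs per column

summary(abc)                       # also reports the NA count for each numeric column
```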

For tidy data
• Observations as rows and variables as columns
• One type of observational unit per table
We use (from the tidyr library)
• gather(data, key, value)
• spread(data, key, value)
• separate(data, col, into)
• unite(data, col, ….)

Another method to remove rows with NAs
• na.omit(abc)

To deal with dates and times
We use the lubridate library
Ex: library(lubridate)
weather$day <- ymd(weather$date)
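The tidying and date functions can be illustrated as follows. This sketch assumes the tidyr and lubridate packages are installed, and the data frames (wide, weather) and their columns are invented for the example. In recent tidyr versions gather() and spread() are superseded by pivot_longer() and pivot_wider(), though they remain available.

```r
library(tidyr)
library(lubridate)

# Made-up wide table: one column per month
wide <- data.frame(id = c(1, 2), jan = c(5, 7), feb = c(6, 8))

# gather(): collapse the month columns into key/value pairs (wide to long)
long <- gather(wide, key = "month", value = "sales", jan, feb)

# spread(): the reverse operation (long back to wide)
wide_again <- spread(long, key = month, value = sales)

# Made-up weather table with dates stored as text
weather <- data.frame(date = c("2014-12-01", "2014-12-02"),
                      stringsAsFactors = FALSE)

# separate(): split one column into several
parts <- separate(weather, col = date,
                  into = c("year", "month", "day"), sep = "-")

# unite(): paste several columns back into one
joined <- unite(parts, col = "date", year, month, day, sep = "-")

# ymd(): parse year-month-day text into a Date
weather$day <- ymd(weather$date)
class(weather$day)   # "Date"
```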

Dealing with missing values


Rows with no missing values
• complete.cases(abc)
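A short sketch of how complete.cases() is typically used to filter rows, next to na.omit() for comparison. The data frame is made up for illustration.

```r
# Made-up data: only the first row has no missing values
abc <- data.frame(xy = c(1, NA, 3),
                  ty = c("a", "b", NA),
                  stringsAsFactors = FALSE)

# complete.cases() returns one logical per row: TRUE when the row has no NA
keep <- complete.cases(abc)   # TRUE FALSE FALSE

# Use it as a row index to keep only the complete rows
abc_complete <- abc[keep, ]

# na.omit() drops the same rows in a single call
abc_omit <- na.omit(abc)
```

complete.cases() is convenient when the logical vector itself is needed, for example to count or inspect the incomplete rows with abc[!keep, ].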
