
Cleaning and Preparing Data
TE AIML (Hon)
Athang Joshi

What is data cleaning?

• Process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Combining multiple data sources causes duplication or mislabelling of data.
• If data is incorrect, outcomes and algorithms are unreliable, even though they may
look correct.
• Data cleaning processes vary from dataset to dataset.
• A template should always be established for the data cleaning process.

Steps to clean data

• Remove duplicate or irrelevant observations
• Fix structural errors
• Filter unwanted outliers
• Handle missing data
• Validate and QA
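
The first step above can be sketched in a few lines of plain Python. This is only an illustration: the records, the field names (id, city, segment) and the helper names drop_duplicates and keep_relevant are assumptions made for the example, not part of the slides.

```python
# Hypothetical records; the field names are illustrative assumptions.
records = [
    {"id": 1, "city": "Mumbai", "segment": "retail"},
    {"id": 2, "city": "Pune", "segment": "wholesale"},
    {"id": 1, "city": "Mumbai", "segment": "retail"},  # exact duplicate
    {"id": 3, "city": "Nagpur", "segment": "retail"},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of (field, value) pairs is hashable and order-independent.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def keep_relevant(rows, segment):
    """Keep only the observations that matter for the analysis at hand."""
    return [row for row in rows if row["segment"] == segment]


deduped = drop_duplicates(records)
retail_only = keep_relevant(deduped, "retail")
print(len(records), len(deduped), len(retail_only))
```

What counts as "irrelevant" depends on the question being asked; here it is assumed that only the retail segment is of interest.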

Pratibha Sharma 4
Filter unwanted outliers

• Outliers are observations that do not appear to fit within the data.
• They should be removed only if there is a legitimate reason.
• Sometimes, it is the appearance of an outlier that will prove a theory.
• Just because an outlier exists doesn't mean it is incorrect.
• An outlier should be removed only if it proves to be irrelevant for analysis or is
a mistake.
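
One common way to find candidate outliers is the interquartile range (IQR) rule. The slides do not prescribe a method, so the rule itself, the 1.5 multiplier and the sample values below are assumptions for illustration. The sketch only flags values for review; whether to remove them is still a judgment call.

```python
from statistics import quantiles

# Hypothetical measurements; 120 looks suspicious but may be legitimate.
values = [52, 55, 51, 54, 53, 50, 56, 120]


def flag_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] without deleting them."""
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lower or v > upper]


flagged = flag_outliers(values)
print(flagged)  # candidates to inspect, not an automatic removal list
```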

Fix structural errors

• Structural errors are strange naming conventions, typos, or incorrect
capitalization.
• For example, you may find “N/A” and “Not Applicable” both appear, but they
should be analyzed as the same category.
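
A minimal sketch of this kind of fix, assuming the categories arrive as free-text strings. The mapping table CANONICAL and the helper normalize_label are hypothetical names chosen for the example.

```python
from collections import Counter

# Assumed mapping of known variants to one canonical label.
CANONICAL = {"n/a": "N/A", "na": "N/A", "not applicable": "N/A"}


def normalize_label(value):
    """Collapse whitespace, ignore case, and map known variants to one label."""
    cleaned = " ".join(value.split()).lower()
    return CANONICAL.get(cleaned, cleaned.title())


raw = ["N/A", "Not Applicable", " not applicable ", "mumbai", "MUMBAI", "Mumbai "]
normalized = [normalize_label(v) for v in raw]
print(Counter(normalized))
```

After normalization the six raw labels fall into two categories instead of six.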

Handle missing data

• Many algorithms do not accept missing values.
• Observations that have missing values can be dropped, but this loses information
and should be done very carefully.
• Missing values can be imputed (filled in) based on other observations, but this
risks losing the integrity of the data, because the values are then assumptions
rather than actual observations.
• The way the data is used can be altered to effectively navigate null values.
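
The three options above can be sketched as follows. The rows, the age field and the choice of the median as the fill value are assumptions for illustration; the slides do not name a specific filling strategy.

```python
from statistics import mean, median

# Hypothetical rows; None marks a missing value.
rows = [
    {"name": "A", "age": 21},
    {"name": "B", "age": None},
    {"name": "C", "age": 25},
    {"name": "D", "age": 23},
]


def drop_missing(rows, field):
    """Option 1: drop observations where the field is missing."""
    return [r for r in rows if r[field] is not None]


def fill_with_median(rows, field):
    """Option 2: fill gaps with the median of known values and mark filled rows."""
    fill = median(r[field] for r in rows if r[field] is not None)
    flag = field + "_imputed"
    return [
        {**r, field: fill, flag: True} if r[field] is None else {**r, flag: False}
        for r in rows
    ]


def mean_ignoring_missing(rows, field):
    """Option 3: leave the data as-is and make the computation skip nulls."""
    return mean(r[field] for r in rows if r[field] is not None)


dropped = drop_missing(rows, "age")
filled = fill_with_median(rows, "age")
average_age = mean_ignoring_missing(rows, "age")
print(len(dropped), filled[1], average_age)
```

Keeping an explicit flag column for filled values is one way to limit the loss of integrity, since later analysis can still tell observed values from assumed ones.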

Validate and QA

• By the end of the process, you should be able to answer the following questions:
1) Does the data make sense?
2) Does the data follow the appropriate rules for its field?
3) Does it prove or disprove the working theory, or bring any insight to light?
4) Does the data show any trends that help form the next theory?
5) If not, is that because of a data quality issue?
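
Question 2 is the easiest one to automate. The sketch below checks each field against a rule and reports violations; the rules themselves (an age range of 0 to 120, exactly one "@" in an email) and the names FIELD_RULES and validate are assumptions for the example, not rules from the slides.

```python
# Assumed per-field rules; real rules depend on the domain of the dataset.
FIELD_RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "email": lambda v: isinstance(v, str) and v.count("@") == 1,
}


def validate(rows, rules):
    """Return (row index, field, offending value) for every rule violation."""
    problems = []
    for index, row in enumerate(rows):
        for field, is_valid in rules.items():
            value = row.get(field)
            if not is_valid(value):
                problems.append((index, field, value))
    return problems


cleaned_rows = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},
    {"age": 40, "email": "no-at-sign"},
]
problems = validate(cleaned_rows, FIELD_RULES)
print(problems)
```

An empty result does not answer the other questions; whether the data makes sense or supports a theory still needs human review.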

Advantages and benefits of data cleaning

• Removal of errors when multiple sources of data are at play.
• Fewer errors make for happier clients and less-frustrated employees.
• Ability to map the different functions
• Monitoring errors and better reporting to see where errors are coming from,
making it easier to fix incorrect or corrupt data for future applications.
• Using tools for data cleaning makes for more efficient business practices and
quicker decision-making.

Characteristics of quality data

• Validity (The degree to which the data conforms to defined rules or constraints)
• Accuracy (The data is close to the true values)
• Completeness (The degree to which all required data is known)
• Consistency (Data is consistent within the same dataset and/or across multiple
data sets)
• Uniformity (The degree to which the data is specified using the same unit of
measure)
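
Two of these characteristics are straightforward to measure or enforce. The sketch below computes completeness as the share of required values that are present and enforces uniformity by converting heights to a single unit. The sample rows, the height and unit fields and the conversion table are assumptions for illustration.

```python
# Hypothetical rows mixing centimetres and metres, with one missing height.
people = [
    {"height": 170, "unit": "cm"},
    {"height": 1.75, "unit": "m"},
    {"height": None, "unit": "cm"},
]

TO_CM = {"cm": 1.0, "m": 100.0}  # assumed conversion factors to centimetres


def completeness(rows, required):
    """Fraction of required values that are present (not None)."""
    total = len(rows) * len(required)
    present = sum(1 for r in rows for f in required if r.get(f) is not None)
    return present / total if total else 1.0


def heights_in_cm(rows):
    """Uniformity: express every known height in the same unit."""
    return [
        None if r["height"] is None else r["height"] * TO_CM[r["unit"]]
        for r in rows
    ]


score = completeness(people, ["height", "unit"])
uniform = heights_in_cm(people)
print(round(score, 2), uniform)
```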

Thank You!
(athangj@sies.edu.in)
