
DATA CLEANSING
-Vishal Kumar
07IT910
-Karishma Verma
07IT927
WHAT IS DATA CLEANSING?
Data cleansing, or data scrubbing, is the act of
detecting and correcting (or removing) corrupt or
inaccurate records from a record set, table, or
database. Used mainly in databases, the term
refers to identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and
then replacing, modifying, or deleting this dirty
data.
WHY DATA CLEANSING?
After cleansing, a data set will be consistent with
other similar data sets in the system. The
inconsistencies detected or removed may have been
caused by different data dictionary definitions of
similar entities in different stores, by user entry
errors, or by corruption in transmission or storage.
DATA QUALITY
High-quality data needs to pass a set of quality
criteria. These include:

Accuracy: An aggregated value over the criteria of integrity, consistency and density.
Integrity: An aggregated value over the criteria of completeness and validity.
Validity: Approximated by the amount of data satisfying integrity constraints.
Consistency: Concerns contradictions and syntactical anomalies.
Uniformity: Directly related to irregularities.
Density: The quotient of missing values in the data over the total number of values that ought to be known.
Uniqueness: Related to the number of duplicates in the data.
Completeness: Achieved by correcting data containing anomalies.
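
Two of these criteria are simple enough to compute directly. The following is a minimal sketch, assuming records are plain dictionaries and that a value counts as missing when it is None or an empty string; the function names density and duplicate_count are illustrative and not part of any standard library.

```python
def density(records, fields):
    """Quotient of missing values over the values that ought to be known."""
    total = len(records) * len(fields)
    missing = sum(
        1
        for record in records
        for field in fields
        if record.get(field) in (None, "")  # assumption: None or "" means missing
    )
    return missing / total if total else 0.0


def duplicate_count(records, fields):
    """Number of records that repeat an earlier record on the given fields."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Made-up sample data for illustration only.
records = [
    {"First Name": "Beth", "City": "Chicago"},
    {"First Name": "Beth", "City": ""},         # one missing value
    {"First Name": "Beth", "City": "Chicago"},  # repeats the first record
]
fields = ["First Name", "City"]
```

With this sample, one of the six expected values is missing and one record is an exact repeat, so a lower density value and a lower duplicate count both indicate cleaner data.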
CLEANSING STEPS
Parsing
Correcting
Standardizing
Matching
Consolidating
PARSING
Parsing in data cleansing is performed for the
detection of syntax errors. A parser decides
whether a string of data is acceptable within the
allowed data specification. This is similar to the
way a parser works with grammars and languages.
Input Data from Source File:
  Beth Christine Parker, SLS MGR
  Regional Port Authority
  Federal Building
  12800 Lake Calumet
  Hedgewisch, IL

Parsed Data in Target File:
  First Name: Beth
  Middle Name: Christine
  Last Name: Parker
  Title: SLS MGR
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: Lake Calumet
  City: Hedgewisch
  State: IL
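
The parsing step for the example above can be sketched with regular expressions. This is only an illustration, assuming every source record has exactly five lines in the order shown; the patterns and the parse_record function are hypothetical, and a real parser would need to accept many more layouts.

```python
import re

# Assumed line layouts, taken from the single example record.
NAME_LINE = re.compile(
    r"^(?P<first>[A-Za-z]+) (?P<middle>[A-Za-z]+) (?P<last>[A-Za-z-]+), (?P<title>.+)$"
)
STREET_LINE = re.compile(r"^(?P<number>\d+) (?P<street>.+)$")
CITY_LINE = re.compile(r"^(?P<city>[A-Za-z .]+), (?P<state>[A-Z]{2})$")


def parse_record(lines):
    """Split a five-line source record into named fields.

    Raises ValueError when a line does not fit the expected syntax,
    which is how this sketch reports a syntax error.
    """
    if len(lines) != 5:
        raise ValueError("expected exactly five lines")
    name = NAME_LINE.match(lines[0])
    street = STREET_LINE.match(lines[3])
    city = CITY_LINE.match(lines[4])
    if not (name and street and city):
        raise ValueError("record does not match the expected syntax")
    return {
        "First Name": name["first"],
        "Middle Name": name["middle"],
        "Last Name": name["last"],
        "Title": name["title"],
        "Firm": lines[1],
        "Location": lines[2],
        "Number": street["number"],
        "Street": street["street"],
        "City": city["city"],
        "State": city["state"],
    }


source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]
parsed = parse_record(source)
```

Note that parsing only checks syntax: "Hedgewisch" and "Lake Calumet" pass through unchanged, because deciding whether they are correct belongs to the next step.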
CORRECTING
Correcting fixes individual parsed data
components using sophisticated data algorithms and
secondary data sources.
Parsed Data:
  First Name: Beth
  Middle Name: Christine
  Last Name: Parker
  Title: SLS MGR
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: Lake Calumet
  City: Hedgewisch
  State: IL

Corrected Data:
  First Name: Beth
  Middle Name: Christine
  Last Name: Parker
  Title: SLS MGR
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: South Butler Drive
  City: Chicago
  State: IL
  Zip: 60633
  Zip+Four: 2398
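
Correction against a secondary data source can be pictured as a lookup. The sketch below assumes a tiny in-memory reference table keyed by firm, street number and state; the REFERENCE table and the correct function are hypothetical stand-ins for whatever reference data and algorithms a real cleansing tool would use, and the slides do not say which key is actually used.

```python
# Hypothetical secondary data source: trusted address details per
# (firm, street number, state). The single entry mirrors the example record.
REFERENCE = {
    ("Regional Port Authority", "12800", "IL"): {
        "Street": "South Butler Drive",
        "City": "Chicago",
        "Zip": "60633",
        "Zip+Four": "2398",
    },
}


def correct(record, reference=REFERENCE):
    """Return a copy of the record with fields overwritten from the reference.

    Records without a reference entry are returned unchanged.
    """
    key = (record.get("Firm"), record.get("Number"), record.get("State"))
    corrected = dict(record)
    corrected.update(reference.get(key, {}))
    return corrected


parsed = {
    "First Name": "Beth",
    "Middle Name": "Christine",
    "Last Name": "Parker",
    "Title": "SLS MGR",
    "Firm": "Regional Port Authority",
    "Location": "Federal Building",
    "Number": "12800",
    "Street": "Lake Calumet",
    "City": "Hedgewisch",
    "State": "IL",
}
corrected = correct(parsed)
```

The function copies the record instead of changing it in place, so the parsed version stays available for comparison or audit.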
STANDARDIZING
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
Corrected Data (input):
  First Name: Beth
  Middle Name: Christine
  Last Name: Parker
  Title: SLS MGR
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: South Butler Drive
  City: Chicago
  State: IL
  Zip: 60633
  Zip+Four: 2398

Corrected Data (standardized):
  Pre-name: Ms.
  First Name: Beth
  1st Name Match Standards: Elizabeth, Bethany, Bethel
  Middle Name: Christine
  Last Name: Parker
  Title: Sales Mgr.
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: S. Butler Dr.
  City: Chicago
  State: IL
  Zip: 60633
  Zip+Four: 2398
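
Standardizing rules of this kind are often plain lookup tables. The following sketch assumes three small rule tables built from the example record; TITLE_RULES, STREET_WORDS, NAME_STANDARDS and the standardize function are illustrative names, and the pre-name ("Ms.") is left out because the slides do not say how it is derived.

```python
# Assumed business rules, reconstructed from the single example record.
TITLE_RULES = {"SLS MGR": "Sales Mgr."}
STREET_WORDS = {
    "South": "S.", "North": "N.", "East": "E.", "West": "W.",
    "Drive": "Dr.", "Street": "St.", "Avenue": "Ave.",
}
NAME_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record):
    """Return a copy of the record with title, street and name variants standardized."""
    out = dict(record)
    title = record.get("Title", "")
    out["Title"] = TITLE_RULES.get(title, title)
    # Abbreviate known street words one by one, leaving unknown words as they are.
    out["Street"] = " ".join(
        STREET_WORDS.get(word, word) for word in record.get("Street", "").split()
    )
    # Alternative first names that later matching may treat as equivalent.
    out["Standards"] = list(NAME_STANDARDS.get(record.get("First Name", ""), []))
    return out


corrected = {
    "First Name": "Beth",
    "Last Name": "Parker",
    "Title": "SLS MGR",
    "Street": "South Butler Drive",
    "City": "Chicago",
}
standardized = standardize(corrected)
```

Keeping the rules in data rather than in code makes it easier to add custom business rules without touching the function itself.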
MATCHING
Matching searches for and matches records within
and across the parsed, corrected and standardized
data, based on predefined business rules, to
eliminate duplicates.
Corrected Data (Data Source #1):
  Pre-name: Ms.
  First Name: Beth
  1st Name Match Standards: Elizabeth, Bethany, Bethel
  Middle Name: Christine
  Last Name: Parker
  Title: Sales Mgr.
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: S. Butler Dr.
  City: Chicago
  State: IL
  Zip: 60633
  Zip+Four: 2398

Corrected Data (Data Source #2):
  Pre-name: Ms.
  First Name: Elizabeth
  1st Name Match Standards: Beth, Bethany, Bethel
  Middle Name: Christine
  Last Name: Parker-Lewis
  Title:
  Firm: Regional Port Authority
  Location: Federal Building
  Number: 12800
  Street: S. Butler Dr., Suite 2
  City: Chicago
  State: IL
  Zip: 60633
  Zip+Four: 2398
  Phone: 708-555-1234
  Fax: 708-555-5678
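
One way to express matching rules for the two records above is sketched below. The rules are assumptions chosen to fit this example: first names match when the names or their standard variants overlap, last names match when they share a hyphen-separated part, and firm, street number and zip must be identical. Real business rules would be defined by the organization.

```python
def first_names_match(a, b):
    """True when the first names or their standard variants overlap."""
    names_a = {a["First Name"], *a.get("Standards", [])}
    names_b = {b["First Name"], *b.get("Standards", [])}
    return bool(names_a & names_b)


def last_names_match(a, b):
    """True when the last names share a part, e.g. Parker and Parker-Lewis."""
    parts_a = set(a["Last Name"].split("-"))
    parts_b = set(b["Last Name"].split("-"))
    return bool(parts_a & parts_b)


def is_match(a, b):
    """Apply the assumed rules: same place and compatible names."""
    same_place = all(a.get(f) == b.get(f) for f in ("Firm", "Number", "Zip"))
    return same_place and first_names_match(a, b) and last_names_match(a, b)


source_1 = {
    "First Name": "Beth", "Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker", "Firm": "Regional Port Authority",
    "Number": "12800", "Street": "S. Butler Dr.", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth", "Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
    "Number": "12800", "Street": "S. Butler Dr., Suite 2", "Zip": "60633",
}
```

The street is deliberately not compared for equality here, since the suite number in the second source would otherwise prevent the match.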
CONSOLIDATING
Consolidating analyzes and identifies
relationships between matched records and
merges them into ONE representation.
Corrected Data (Data Source #1) and Corrected Data (Data Source #2) are merged into:

Consolidated Data:
  Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
  Title: Sales Mgr.
  Firm: Regional Port Authority
  Location: Federal Building
  Address: 12800 S. Butler Dr., Suite 2
           Chicago, IL 60633-2398
  Phone: 708-555-1234
  Fax: 708-555-5678
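
The merge shown above can be approximated with a simple survivorship rule. This sketch assumes that, for each field, the longer (more complete) value wins and that differing first names are both kept, as in "Beth (Elizabeth)"; the consolidate function and this rule are assumptions made to reproduce the example, not a documented method.

```python
def consolidate(a, b):
    """Merge two matched records into one, field by field.

    Assumed rule: keep the longer value, preferring record a on ties.
    """
    merged = {}
    for field in {**a, **b}:  # fields of a first, then fields only b has
        value_a, value_b = a.get(field, ""), b.get(field, "")
        merged[field] = value_a if len(value_a) >= len(value_b) else value_b
    # Keep both first names when the sources disagree.
    if a["First Name"] != b["First Name"]:
        merged["First Name"] = f'{a["First Name"]} ({b["First Name"]})'
    return merged


source_1 = {
    "Pre-name": "Ms.", "First Name": "Beth", "Middle Name": "Christine",
    "Last Name": "Parker", "Title": "Sales Mgr.",
    "Street": "S. Butler Dr.", "Zip": "60633",
}
source_2 = {
    "Pre-name": "Ms.", "First Name": "Elizabeth", "Middle Name": "Christine",
    "Last Name": "Parker-Lewis", "Title": "",
    "Street": "S. Butler Dr., Suite 2", "Zip": "60633",
    "Phone": "708-555-1234", "Fax": "708-555-5678",
}
merged = consolidate(source_1, source_2)
```

A "longest value wins" rule is easy to explain but can pick the wrong value, so production systems usually add per-field rules such as preferring the most recent or most trusted source.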
CHALLENGES AND PROBLEMS
Error Correction and Loss of Information:

The most challenging problem within data
cleansing remains the correction of values to
remove duplicates and invalid entries. In many
cases, the available information on such anomalies
is limited and insufficient to determine the
necessary transformations or corrections, leaving
the deletion of such entries as the only plausible
solution. The deletion of data, though, leads to a
loss of information, which can be particularly
costly if a large amount of data is deleted.
Maintenance of Cleansed Data:

Data cleansing is an expensive and time-consuming
process. After performing data cleansing and
achieving a data collection free of errors, one
would want to avoid re-cleansing the data in its
entirety when some values in the collection
change. The process should be repeated only on
values that have changed. This means that a
cleansing lineage needs to be kept, which in turn
requires efficient data collection and
management techniques.
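
One possible way to keep such a lineage is to store a fingerprint of every raw record and re-cleanse only records whose fingerprint has changed. The sketch below assumes records are JSON-serializable dictionaries identified by a key; fingerprint, recleanse_changed and the cleanse callback are hypothetical names used only for illustration.

```python
import hashlib
import json


def fingerprint(record):
    """Stable hash of a record's raw content."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def recleanse_changed(raw_records, lineage, cleaned, cleanse):
    """Cleanse only records that are new or changed since the last run.

    lineage maps record key -> fingerprint of the raw record last cleansed.
    cleaned maps record key -> cleansed record. Both are updated in place.
    Returns the keys that were (re-)cleansed.
    """
    recleansed = []
    for key, record in raw_records.items():
        digest = fingerprint(record)
        if lineage.get(key) != digest:
            cleaned[key] = cleanse(record)
            lineage[key] = digest
            recleansed.append(key)
    return recleansed
```

On a second run over unchanged data the function does no cleansing work beyond hashing, which is the saving this slide is after.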
Data Cleansing in Virtually Integrated Environments:

In virtually integrated sources like IBM’s
DiscoveryLink, the cleansing of data has to be
performed every time the data is accessed, which
considerably increases response time and
reduces efficiency.
CONCLUSION
Data cleansing works not only to make sure that
data is accurate, but also that it is consistent
between different records.
It thus helps to achieve high data quality,
thereby increasing the effectiveness of the
system.
THANK YOU
