Data Cleaning
While the techniques used for data cleaning may vary according to the types of data we have, we
can follow these basic steps to map out a framework for the organisation.
For example, you may find that “N/A” and “Not Applicable” both appear in the same field; they should be standardized and analyzed as the same category.
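Collapsing equivalent spellings into one category can be sketched as follows. This is a minimal Python example; the synonym set, canonical token, and sample values are hypothetical, not a fixed standard:

```python
# Hypothetical set of spellings that should all collapse to one category.
NA_SYNONYMS = {"n/a", "not applicable", "na", "none", ""}

def normalize_category(value: str) -> str:
    """Map every spelling of 'not applicable' to a single canonical token."""
    if value.strip().lower() in NA_SYNONYMS:
        return "N/A"
    return value.strip()

# Sample values: the first, second, and fourth all become "N/A".
records = ["N/A", "Not Applicable", "Approved", " n/a "]
print([normalize_category(v) for v in records])
```

Running the normalization before any grouping or counting ensures the two spellings are analyzed as one category.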
Removing irrelevant outliers can improve the performance of the data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
At the end of the process, validate the results by asking basic questions:
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help you form your next theory?
If not, is that because of a data quality issue?
Cleansing legacy data involves examining field level information for duplicate, corrupt,
incomplete or inaccurate data attributes and bringing them into compliance with the
requirements needed to support their related business process designs. Cleansing
should be accomplished in-flight using the ETL tool and, where possible, through
external database tools, for example Dun & Bradstreet® for Customer Master Data
cleansing. Where the data to be cleansed is more company- or system-specific, it
should be cleansed using rules determined by the legacy business system subject
matter experts, with additional input from the Global Process Owners when
necessary.
Rules-Based Data Profiling is a primary input to the development of Cleansing Rules.
Identify the requirements for Master and Transactional Data.
Data Cleansing of Legacy Data to:
o Standardize addresses based on Global Standards as well as
Country Standards.
o Identify missing data – based on data requirements.
o Identify misuse of data fields.
o Validate data based on field definitions, for example:
Phone number must be 10 digits.
Duns number must be 17 characters.
ID or PIN number must be x digits.
Postal Codes must be x-digits or y-digits.
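The field-definition rules above can be expressed as simple patterns and checked mechanically. A minimal sketch; the field names, the record, and the 5- or 9-digit postal-code lengths are hypothetical illustrations (the document leaves the exact digit counts unspecified):

```python
import re

# Each required field maps to a pattern expressing its definition.
RULES = {
    "phone": re.compile(r"^\d{10}$"),                # must be 10 digits
    "duns": re.compile(r"^.{17}$"),                  # must be 17 characters
    "postal_code": re.compile(r"^(\d{5}|\d{9})$"),   # assumed 5 or 9 digits
}

def validate(record: dict) -> list:
    """Return the names of fields present in the record that violate their rule."""
    return [field for field, rule in RULES.items()
            if field in record and not rule.match(str(record[field]))]

# Sample record: the phone value is only 8 digits, so it is reported.
print(validate({"phone": "55512340",
                "duns": "ABC00000000000123",
                "postal_code": "12345"}))
```

Violations found this way feed directly into the cleansing requirements for each field.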
The process of cleansing data should use the following approach:
o Clearly define the scope.
o The Statistical and Business Rule Data Profiling results should
serve as a guide in the development of data cleansing requirements.
Use the profiling results to develop an understanding of the data at a
detailed level.
o Know what data records are relevant to the current project phase.
o Maximize the use of “In-Flight” cleansing packages developed
within BODS for use with the Data Extract, Transform and Load (ETL)
tool. As data requirements and business rules change, the process of
cleansing the data may also need to change. It is necessary to monitor
what has been done and continue to correct the data.
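A profiling pass over the extracted records is one way to develop the detailed understanding the approach above calls for, and to monitor the data as rules change. A minimal sketch of a missing-value profile; the records and required fields are hypothetical:

```python
from collections import Counter

def profile_missing(records, required_fields):
    """Count, per required field, how many records leave it empty or absent."""
    missing = Counter()
    for rec in records:
        for field in required_fields:
            if not rec.get(field):
                missing[field] += 1
    return dict(missing)

# Sample extract: one record lacks a phone, another lacks an address.
rows = [
    {"name": "Acme", "address": "1 Main St", "phone": ""},
    {"name": "Globex", "address": "", "phone": "5551234567"},
]
print(profile_missing(rows, ["name", "address", "phone"]))
```

Re-running the profile after each cleansing cycle shows whether the corrections are taking hold, which supports the iterative nature of the process.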
Avoid or minimize cleansing data in the legacy environment for a country due for
go-live.
Data Cleansing is an iterative process, not an event.