
Data & Data Quality

Adapted from Chapman, Galhardas and Johnson slides


Recap

(1) Oceans of Data
(2) Rivers of Information
(3) Streams of Knowledge
(4) Drops of Wisdom

Adapted from Chapman 2005
Data Quality
► Data in the real world is dirty
► incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
► noisy: containing errors or outliers
► inconsistent: containing discrepancies in codes or names
► No quality data means no quality information and flawed decision-making!
► Quality decisions must be based on quality data
► Databases and data warehouses need consistent integration of quality data

Multi-Dimensional Measure of Data Quality
► A well-accepted multidimensional view:
► Accuracy
► Completeness
► Consistency
► Timeliness
► Believability
► Value added
► Interpretability
► Accessibility

Threats to Data Quality
Loss of data quality can occur at many stages:
► At the time of collection
► During digitisation
► During documentation
► During storage and archiving
► During analysis and manipulation
► At the time of presentation
► At the time of use
► Data are of high quality if they are fit for their intended use in
operations, decision-making, and planning (Juran 1964).
Principles of Good Data Management
Not reinventing information management wheels
Looking for efficiencies in data collection and quality control procedures
Sharing of data, information and tools
Using existing standards or developing new, robust standards
Fostering the development of networks and partnerships
Presenting a sound business case for data collection and management
Reducing duplication in data collection and data quality control
Looking beyond immediate use and examining requirements of users
Ensuring that good documentation and metadata procedures exist

Comparing Information Production to Manufacturing

Stage   | Product Manufacturing | Information Manufacturing
Input   | Raw materials         | Raw data
Process | Assembly line         | Information system
Output  | Physical products     | Information products
Wang 1998
Data Management Cycle
Based on the manufacturing cycle

Total Quality Management
The Data Management Cycle is driven by the Total Quality Management (TQM) cycle.
TQM – “an integrated management philosophy and set of practices that emphasize:
► continuous improvement,
► meeting customer requirements,
► reducing rework,
► long-range thinking,
► increasing employee involvement and teamwork,
► process redesign,
► competitive benchmarking,
► team-based problem solving,
► continuous measurement of results and
► closer relationships with suppliers”

Total Data Quality Management Cycle

The TDQM cycle continuously defines, measures, analyzes and improves the Information Product (Wang 1998).
Plan Do Check Act Cycle
The TQM cycle is derived from the Plan, Do, Check, Act (PDCA) cycle by Deming.

Plan. This first phase (the longest), aimed at improving operations and processes,
requires establishing the problems (what is going wrong), their extent and their
causes, and developing ideas to eliminate them.
Do. The identified solutions are tested on a small scale, thus minimizing
disruptions to routine activities.
Check. The testing activity requires identifying control indicators for key
activities to establish the quality of the output and identify any new problems
arising from the change.
Act. In this last phase, successful changes in activities and processes are
implemented on a large scale. If the test is unsuccessful, the group involved in
the study must repeat the cycle.

TDQM Cycle
► In applying the TDQM methodology, an organization must:
(1) clearly articulate the Information Product in business terms;
(2) establish an Information Product team consisting of a senior executive as the
TDQM champion, an Information Product engineer who is familiar with the TDQM
methodology, and members who are information suppliers, manufacturers,
consumers, and Information Product managers;
(3) teach Information Quality assessment and Information Quality management skills
to all the Information Product constituencies; and
(4) institutionalize continuous Information Product improvement.

Data Validation & Cleaning
► Validation - process used to determine if data are inaccurate, incomplete, or
unreasonable.
► Validation may include:
► format checks,
► completeness checks,
► reasonableness checks,
► limit checks,
► review of the data to identify outliers (geographic, statistical, temporal or environmental) or
other errors, and
► assessment of data by subject area experts (e.g. taxonomic specialists).
► Data cleaning (data cleansing) - process of “fixing” errors in the data that have been
identified during the validation process.
► It is important in the data cleaning process that data are not inadvertently lost, and that
changes to existing information are carried out very carefully; a sketch of some of the
validation checks above follows.
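As a minimal illustration, the sketch below applies format, completeness and limit checks to a handful of specimen records. The field names, record structure and thresholds are assumptions for illustration, not taken from the source.

```python
import re

# Hypothetical specimen records to be validated.
records = [
    {"id": "A001", "species": "Acacia dealbata", "latitude": -35.3,
     "longitude": 149.1, "date": "2003-11-02"},
    {"id": "A002", "species": "", "latitude": -95.0,
     "longitude": 149.1, "date": "02/11/2003"},
]

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # format check: ISO dates only

def validate(record):
    """Return a list of validation problems found in one record."""
    problems = []
    if not record.get("species"):                      # completeness check
        problems.append("missing species name")
    if not ISO_DATE.match(record.get("date", "")):     # format check
        problems.append("date not in YYYY-MM-DD format")
    if not -90 <= record.get("latitude", 0) <= 90:     # limit check
        problems.append("latitude outside [-90, 90]")
    if not -180 <= record.get("longitude", 0) <= 180:  # limit check
        problems.append("longitude outside [-180, 180]")
    return problems

for rec in records:
    for problem in validate(rec):
        print(f"{rec['id']}: {problem}")
```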
Data Cleaning Framework
► Define and determine error types
► Search and identify error instances
► Correct the errors
► Document error instances and error types (see the logging sketch below)
► Modify data entry procedures to reduce future errors

(Maletic and Marcus 2000)
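As an illustration of the documentation step of this framework, the sketch below logs each detected error instance with its type, so the log can drive both correction and later changes to data-entry procedures. The log structure and file name are assumptions, not from Maletic and Marcus.

```python
import csv
from collections import Counter
from datetime import date

# Hypothetical error log: one row per detected error instance.
error_log = [
    {"record_id": "A002", "error_type": "missing_species",
     "detected": date.today().isoformat(), "status": "open"},
    {"record_id": "A002", "error_type": "latitude_out_of_range",
     "detected": date.today().isoformat(), "status": "open"},
]

# Persist the log so corrections are documented and auditable.
with open("error_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["record_id", "error_type",
                                           "detected", "status"])
    writer.writeheader()
    writer.writerows(error_log)

# Counting instances per error type shows which data entry
# procedures are worth modifying first.
print(Counter(e["error_type"] for e in error_log))
```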

Data Cleaning Hints
Organising data improves efficiency
The organizing of data prior to data checking, validation and correction can improve
efficiency and considerably reduce the time and costs of data cleaning.
– For example, by sorting data on location, efficiency gains can be achieved
through checking all records pertaining to the one location at the same time,
rather than going back and forth to key references.
– Similarly, by sorting records by collector and date, it is possible to spot errors
where a record may be at an unlikely location for that collector on that day.
– Put mechanisms in place to prevent data errors, and correct errors early (see the sketch below).
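A minimal sketch of the sorting idea using pandas; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical collection records.
df = pd.DataFrame({
    "location":  ["Site B", "Site A", "Site B", "Site A"],
    "collector": ["Smith", "Jones", "Smith", "Smith"],
    "date":      pd.to_datetime(["2004-05-01", "2004-05-01",
                                 "2004-05-02", "2004-05-01"]),
    "species":   ["sp. 1", "sp. 2", "sp. 3", "sp. 4"],
})

# Sort by location so all records for one place are checked together,
# consulting the relevant map sheet or key reference only once.
by_location = df.sort_values("location")

# Sort by collector and date to spot records at an unlikely location
# for that collector on that day.
by_collector_day = df.sort_values(["collector", "date"])
print(by_collector_day)
```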

Data Cleaning Hints
► Partnerships improve efficiency
► By developing partnerships, many data validation processes won’t need to be
duplicated, errors will more likely be documented and corrected, and new
errors won’t be incorporated by inadvertent “correction” of suspect records that
are not in error.
► Partnerships with:
► Data collectors
► Other institutions with duplicate collections
► Like-minded institutions developing tools, standards and software
► Data users (good feedback mechanisms)
► Statisticians and data auditors

Data Cleaning Hints
► Prioritisation reduces duplication
► Prioritisation helps reduce costs and improves efficiency. It is often
of value to concentrate on those records where lots of data can be
cleaned at the lowest cost.
– For example, those that can be examined using batch
processing or automated methods, before working on the more
difficult records.
► By concentrating on those data that are of most value to users,
there is also a greater likelihood of errors being detected and
corrected

Adapted from Chapman


Data Cleaning Hints
► Focus on most critical data first
► Concentrate on discrete units (taxonomic, geographic, etc.)
► Ignore data that are not used or for which data quality
cannot be guaranteed
► Consider data that are of broadest value, are of greatest
benefit to the majority of users and are of value to the most
diverse of uses
► Work on those areas where lots of data can be cleaned
at the lowest cost (e.g. through use of batch processing).

Data Cleaning Hints
► Set targets and performance measures
► Performance measures are a valuable addition to quality
control procedures; they help an organization manage its
data cleaning processes.
► Performance measures may include statistical checks on the
data (for example, 95% of all records are within 1,000 meters
of their reported position) and on the level of quality control (for
example, 65% of all records have been checked by a qualified
taxonomist within the previous 5 years; 90% have been checked
within the previous 10 years). The sketch below computes
measures of this kind.
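A sketch, using pandas, of computing the two example performance measures above; the column names, dates and reference time are assumptions.

```python
import pandas as pd

# Hypothetical records with a positional error estimate (metres) and
# the date each was last checked by a qualified taxonomist.
df = pd.DataFrame({
    "position_error_m": [250, 800, 3000, 120],
    "last_checked": pd.to_datetime(["2022-03-01", "2010-06-15",
                                    "2019-01-20", "2023-09-30"]),
})

now = pd.Timestamp("2024-01-01")  # assumed reference date

pct_within_1km = (df["position_error_m"] <= 1000).mean() * 100
pct_checked_5y = ((now - df["last_checked"]).dt.days <= 5 * 365).mean() * 100

print(f"{pct_within_1km:.0f}% of records within 1,000 m of reported position")
print(f"{pct_checked_5y:.0f}% of records checked within the previous 5 years")
```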
Data Cleaning Hints

► Minimise duplication and re-working of data


► Duplication of effort is a major cost factor in data cleaning in most organizations.
– Many organizations add the geocode at the same time as they store the
record. As records are seldom sorted geographically, this means that the
same locations will be chased up a number of times.
– By carrying out the geo-referencing as a special operation, records from
similar locations can then be sorted and then the appropriate map-sheet
only has to be extracted once.
– Some institutions also use the database itself to help reduce duplication, by
searching to see if the location may already have been geo-referenced (see the sketch below).
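A sketch of that idea: before geo-referencing a location, look it up among locations already geo-referenced. `lookup_gazetteer` is a hypothetical stand-in for the expensive manual or automated geo-referencing step.

```python
# Locations already geo-referenced (in practice, a query on the database).
geocode_cache = {"Mount Kaputar summit": (-30.27, 150.16)}

def lookup_gazetteer(location):
    """Hypothetical stand-in for the expensive geo-referencing step."""
    print(f"geo-referencing '{location}' from map sheets...")
    return (0.0, 0.0)  # placeholder coordinates

def georeference(location):
    # Reuse an existing geocode so the same map sheet is never
    # chased up twice for the same location.
    if location not in geocode_cache:
        geocode_cache[location] = lookup_gazetteer(location)
    return geocode_cache[location]

for loc in ["Mount Kaputar summit", "Warrumbungle NP", "Warrumbungle NP"]:
    print(loc, georeference(loc))
```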

Data Cleaning Hints
► Feedback is a two-way street
► Perform data validation by exposing the data to suppliers and
users.
► It is essential that data custodians encourage feedback from
users of their data, and take the feedback that they receive
seriously.
► Data custodians also need to feed back information on errors
to the collectors and data suppliers where relevant.
► In this way there is a much higher likelihood that the incidence
of future errors will be reduced and the overall data quality
improved.
Data Cleaning Hints
Educate data users on good practices
► Train all data stakeholders on data handling techniques
► Education and training improves techniques
► Poor training, especially at the data collection and data
entry stages of the Information Quality Chain, is the cause
of a large proportion of the errors in primary species data.
► Good training of data entry operators can reduce the error
associated with data entry considerably, reduce data entry
costs and improve overall data quality

Data Cleaning Hints
► Uphold accountability, transparency and auditability
► Haphazard and unplanned data cleaning exercises are very inefficient and generally
unproductive.
► Within data quality policies and strategies – clear lines of accountability for data
cleaning need to be established.
► To improve the “fitness for use” of the data and thus their quality, data cleaning
processes need to be transparent and well documented, with a good audit trail, to
reduce duplication and to ensure that once corrected, errors do not recur.

Data Cleaning Hints
► Documentation is the key to good data quality
► Without good documentation, it is difficult for users to determine the
fitness for use of the data, and difficult for custodians to know what
data quality checks have been carried out, and by whom.
► Documentation is generally of two types.
– The first is tied to each record and records what data checks
have been done and what changes have been made and by
whom.
– The second is the metadata that records information at the
dataset level.
► Both are important, and without them, good data quality is
compromised. A sketch of a record-level audit entry follows.
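As an illustration of record-level documentation, the sketch below appends one audit entry per change, recording what was checked, what changed and by whom. The field names are assumptions, not a prescribed schema.

```python
from datetime import datetime, timezone

def log_change(audit_trail, record_id, field, old, new, who, check):
    """Append one record-level audit entry: what changed, by whom, and why."""
    audit_trail.append({
        "record_id": record_id,
        "field": field,
        "old_value": old,
        "new_value": new,
        "changed_by": who,
        "check_applied": check,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

audit_trail = []
log_change(audit_trail, "A002", "latitude", -95.0, -35.0,
           "j.smith", "limit check on latitude")
print(audit_trail[0])
```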

Data Cleaning Hints

Perform internal & external database checks


► Logical inconsistencies within the database
► Checking one field against another (see the sketch below)
► Text location vs geocode or District/State
► Checking one database against another
► Gazetteers
► Collectors

Dealing with Outliers
► The detection of outliers (geographic, statistical and environmental)
can provide one of the most useful tests for finding possible errors in
data.
► It is important, however, that validation tests do not uncritically delete
data because they are found to be statistical outliers.
► The uncritical exclusion of outliers can remove valuable records from
the data set and skew future analyses.
► Users, on the other hand, may decide to delete outliers from their
analysis if they are unsure of their validity as good records.
► Identification of outliers thus not only assists data custodians in
identifying possible errors, but can also aid users in determining whether
individual data records are fit for use in their analysis. A minimal
flagging sketch follows.
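A minimal sketch of statistical outlier detection using the interquartile range; consistent with the slide, outliers are flagged for expert review rather than deleted. The values are hypothetical.

```python
import statistics

values = [12.1, 11.8, 12.4, 12.0, 11.9, 48.5, 12.2]  # hypothetical measurements

q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, never silently delete: an outlier may be a valid extreme record.
flagged = [v for v in values if not lo <= v <= hi]
print("flagged for expert review:", flagged)
```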
Data Custodian Responsibility
Should ensure:-
► the data are transcribed into the database correctly and accurately from the collector’s notes
► quality control procedures are implemented and exercised during data capture,
► data and the data quality are adequately and accurately documented,
► validation checks are routinely carried out on the data,
► validation checks carried out are fully documented,
► the data are stored and archived in a suitable manner,
► earlier versions are systematically stored to allow comparisons and return to “uncleaned” data,
► data integrity is maintained,
► the data are made available in a timely and accurate manner with documentation that allows
users to determine “fitness for use”,
Data Custodian Responsibility
Should ensure:-
► custodian responsibilities as to privacy, intellectual property rights, copyright,
and sensitivities of traditional/indigenous owners are maintained,
► conditions of use of the data are maintained and made available along with any
restrictions on use and known areas of unsuitability of the data,
► all legal requirements with respect to the data are honoured and complied with,
► feedback from users on the data quality is dealt with in a timely manner,
► data quality is maintained to the highest level at all times,
► all known errors are fully documented and made known to users.
