BrightStar
TRAINING
Today
Lunch
Assessing Quality, processes, GIS upgrade, project examples
Tomorrow
Lunch
Data warehouse and ETL, feature maintenance
Overview
Build a data quality system
Avoid the worst traps
Be able to describe a project scope
Budget, timeline, priorities
ISBN 978-0-9771400-2-2
Completeness = Relevant / (Relevant + Missing)
Spatial Accuracy = (Relevant - Errors) / Relevant
Statistical Accuracy = (Relevant - Errors) / (Relevant + Missing)
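These ratios can be computed directly from record counts. A minimal Python sketch (the function and variable names are illustrative, and the pairing of metric names to fractions is an assumption from the slide):

```python
def quality_ratios(relevant, missing, errors):
    """Quality ratios from raw record counts (name-to-fraction pairing assumed)."""
    completeness = relevant / (relevant + missing)
    spatial_accuracy = (relevant - errors) / relevant
    statistical_accuracy = (relevant - errors) / (relevant + missing)
    return completeness, spatial_accuracy, statistical_accuracy

# Example: 900 relevant records, 100 missing, 90 in error
c, sp, st = quality_ratios(900, 100, 90)
print(f"completeness={c:.2f} spatial={sp:.2f} statistical={st:.2f}")
```

Note that the third ratio is the product of the first two: missing records and errors both erode it.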
Data Profiling
Find out what is there
Assess the risks
Understand data challenges early
Have an enterprise view of all data
Profile Metrics
Security
Consistency
Assess
Improve
Prevent
Recognise
Data Cleaning
Monitor
Course examples
LINZ coordinate upgrade 1998-2003
NSCC services upgrade 2008
Valuation roll structure and matching
ETL of utilities from SDE to AutoCAD
Address location issues (NAR, DRA)
Documents and examples on memory stick
Morning Tea
Assessing Quality
1. Project steps
2. Required roles
3. Defining the objectives
4. Designing rules
5. Scorecard and Metadata
6. Frequency of assessment
7. Common mistakes
System Consolidations
Manual Data Entry
Batch Feeds
Real-Time Interfaces
Database
Process Automation
Data processing
Data purging
Define data mapping
Extract, Transform, Load (ETL)
Drown in Data Problems
Find Scapegoat
Head-on two-car wreck
Square pegs into round holes
Winner-loser merging (50% wrong)
High error rate
Complex and poor entry forms
Users find ways around checks
Forcing non-blanks does not work
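Forced non-blank fields tend to fill up with placeholder values instead of real data. A simple frequency report exposes the workarounds; a sketch over made-up sample values:

```python
from collections import Counter

# Hypothetical sample of a 'phone' field where entry was forced non-blank
phones = ["555-0101", "N/A", ".", "555-0199", "N/A", "0", "N/A", "."]

# Frequency report: suspiciously common placeholder values are usually
# users working around the mandatory-field check
for value, count in Counter(phones).most_common():
    print(f"{value!r}: {count}")
```

In real entry data, a handful of values like "N/A", ".", or "0" dominating the frequency list is the signature of a bypassed check.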
Large volumes mean lots of errors
Source system subject to changes
Errors accumulate
Especially dangerous if triggers activated
Data between databases kept in synchronisation
Data in small packets, out of context
Too fast to validate
Rejection loses the record, so it is accepted
Faster or better, but not both!
Object changes are unnoticed by computers
Retroactive changes may not be propagated
The data is assumed to comply with the new requirements
Upgrades are tested against what the data is supposed to be, not what is actually there
Once upgrades are implemented, everything goes haywire
Fitness to the purpose of use may not apply
Acceptable error rates may now be an issue
Value granularity, map scale
Data retention policy
Meanings of codes may change over time in ways only experts know
Experts know when data looks wrong
Retirees rehired to work the systems
Auckland address points were entered on corners and the rest guessed; later used as exact.
Web 2.0 bots automate form filling
Transactions are generated without ever being checked by people
Customers given automated access are more sensitive to errors in their own data
Changes in the programs
Programs may not keep up with changes in data collection
Processing may be done at the wrong time
Coordinate data not usually readable
Data models: CAD vs GIS
Fuzzy matching is not Boolean (near)
Atomic objects harder to define
Features have 2, 3, 4, 5 dimensions
Projection systems are not exact
Topology requires special operators
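Because projections and floating-point coordinates are not exact, spatial matching asks "near?" rather than "equal?". A minimal sketch of tolerance-based point matching (the tolerance value and coordinates are illustrative):

```python
import math

def points_match(p, q, tolerance=0.01):
    """Fuzzy 'near' match: true when points fall within tolerance (map units)."""
    return math.dist(p, q) <= tolerance

# Two captures of the same node, differing by rounding/reprojection noise
a = (1571234.562, 5179876.231)
b = (1571234.565, 5179876.229)
print(points_match(a, b))   # near: matches within the 1 cm tolerance
print(a == b)               # Boolean equality fails on the same pair
```

Choosing the tolerance is itself a data quality decision: too tight and duplicates survive, too loose and distinct features are merged.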
Highly risky for data quality
Relevant data may be purged
Erroneous data may fit criteria
It may not work the next year
En masse processes may add errors
Cleaning processes may have bugs
Incomplete information about data
Data Gazing
Count the records
Just open the sources and scroll
Sort and look at the ends
Run some simple frequency reports
See if the field names make sense
What is missing that should be there
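Several of these data-gazing steps can be scripted in a few lines. A stdlib-only sketch over a hypothetical extract (the table and field names are invented for illustration):

```python
import csv
import io
from collections import Counter

# Hypothetical extract standing in for a real source table
raw = """parcel_id,suburb,area_m2
1001,Ponsonby,450
1002,,612
1003,Ponsonby,
1004,Grey Lynn,380
"""

rows = list(csv.DictReader(io.StringIO(raw)))

print("record count:", len(rows))               # count the records
for field in rows[0]:                           # simple frequency reports
    freq = Counter(r[field] or "<blank>" for r in rows)
    print(field, dict(freq))
areas = sorted(float(r["area_m2"]) for r in rows if r["area_m2"])
print("min/max area:", areas[0], areas[-1])     # sort and look at the ends
```

Even this much reveals the blanks in `suburb` and `area_m2` before any formal profiling tool is involved.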
Lunch
Data Cleaning
There are always lots of errors
It is too much to inspect all by hand
Data experts are rare and too busy
It does not fix process errors
You may make it worse
Automated Cleaning
The only practical method
Needs sophisticated pattern analysis
Allow for backtracking
Data quality rules are interdependent
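Automated cleaning leans on pattern analysis, and because rules interact, each fix should keep the original value so it can be backtracked. A minimal sketch of rule-based suffix standardisation (the patterns and replacements are illustrative, not from the course material):

```python
import re

# Illustrative street-suffix rules; order matters because rules interact
RULES = [
    (re.compile(r"\bSTR?\.?$", re.IGNORECASE), "Street"),
    (re.compile(r"\bRD\.?$", re.IGNORECASE), "Road"),
    (re.compile(r"\bAVE?\.?$", re.IGNORECASE), "Avenue"),
]

def clean_address(value):
    """Apply the first matching rule; return (cleaned, original-or-None)."""
    original = value.strip()
    for pattern, replacement in RULES:
        fixed = pattern.sub(replacement, original)
        if fixed != original:
            return fixed, original   # keep the original to allow backtracking
    return original, None            # no rule fired
```

For example, `clean_address("12 Queen St.")` standardises the suffix while retaining the raw value, so a later, interdependent rule (or a human reviewer) can undo the change.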
Common Mistakes
1. Inadequate Staffing of Data Quality Teams
2. Hoping That Data Will Get Better by Itself
3. Lack of Data Quality Assessment
4. Narrow Focus
5. Bad Metadata
6. Ignoring Data Quality During Data Conversions
7. Winner-Loser Approach in Data Consolidation
8. Inadequate Monitoring of Data Interfaces
9. Forgetting About Data Decay
10. Poor Organization of Data Quality Metadata
Metadata
Includes everything known about the data
Data model
Business rules, relations, state
Subclasses (lookup tables)
GIS metadata (NZGLS or ISO)
XML
Readme.txt
Data Exchange
Batch or interactive
ETL (Extract, Transform, Load)
Replication
Time differences in data
Integrates many different sources
Spatial patterns are revealed
Display thousands of records simultaneously, with direct access
Location now seen as important
Scorecard
Case Study
Outline a GIS data quality system
Measles Chart
Prioritise
Interview
Build up a scorecard
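One way to build up a scorecard is to roll individual rule scores into a weighted dataset score. A sketch with invented rule names, pass rates, and weights:

```python
# Hypothetical per-rule pass rates for one dataset, with invented weights
rule_scores = {
    "completeness": (0.95, 3),       # (pass rate, weight)
    "spatial accuracy": (0.88, 2),
    "valid codes": (0.99, 1),
}

total_weight = sum(w for _, w in rule_scores.values())
score = sum(rate * w for rate, w in rule_scores.values()) / total_weight
print(f"dataset score: {score:.2f}")
```

The weights encode the priorities agreed in the interviews; the same roll-up can then be repeated per dataset to fill out the scorecard.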
Afternoon Tea
Assessment Exercise
Split into pairs
Interview one person about their dataset
Collect basic information
Devise a strategy for a profile
Rotate pairs
Interview the other person
Verbal reports to class
References