
Data Preparation

IT 6713 BI
Overview

• What is data preparation, and what are the related concepts?
• What are the major issues and tasks in data preparation?
• Self-service data preparation

Data is never clean


You will spend most of your time
cleaning and preparing data
2
Concepts
• Data preparation is the process of collecting (gathering), cleaning,
consolidating, structuring, and organizing data into a refined data set,
primarily for data analysis. It emphasizes the user side.
– https://searchbusinessanalytics.techtarget.com/definition/data-preparation
• Data acquisition is the process of extracting the relevant business
information, transforming the data into the required business format, and
loading it into the target system (a data warehouse or operational data
store)
– https://www.tutorialkart.com/what-is-data-acquisition-data-warehousing/
• Data integration is the process of combining data from many different
sources into a unified view of that data
– https://www.techopedia.com/definition/28290/data-integration
• All of these commonly use ETL tasks to get data into a target data model
or storage

3
Major Data Problems

• Data volume
– Large volumes of data create processing-performance problems
• Data quality
– Data quality varies across sources, creating problems for
extraction and transformation
• Data location
– Data is stored in different places (internal or external) that are
not easy to access
• Data differences (heterogeneity), when data comes from multiple
sources or the source data set differs from the target data set
– See the lecture notes on “data differences”.

4
Data Preparation Tasks
• Similar to traditional ETL, data preparation consists of some
major steps to load the data for analysis.
• Extraction
– Accessing and extracting the data from source systems, including
databases, flat files, spreadsheets, etc.
– Data is usually cleaned during this process.
• Cleanse
– The process of identifying and correcting data issues for accuracy,
completeness, consistency, uniformity, and validity.
• Transformation
– Change the extracted data to a format and structure that conform to
the target data model.
– Additional data cleansing can also be part of the transformation
process.

5
Extraction
• The main objective of the extract step is to retrieve all of the required data from the
source system/format in as “atomic” a form as possible
• This involves
– Detecting the source data structure/format and converting it into an accessible format (model)
– Extracting target data from mixed content
– Identifying the correct subset of source data that has to be submitted to the ETL workflow for
further processing, e.g., extracting only the new data
• Common sources of data
– Flat files: usually a stream of text without heavy internal structure that can be easily read by a
general text processor http://searchsqlserver.techtarget.com/definition/flat-file
• CSV
– Pure text files with structural information
• XML, JSON
– Complex and binary files: need specific applications or modules
• Excel, PDF, Word, Image, etc.
– OLTP databases: these sources are fairly structured and provide rich tools and APIs for
exporting data
– APIs (returning standard data formats): the source systems may be unknown, but the outputs are
usually in a structured or semi-structured format.
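A minimal extraction sketch in Python (the course does not prescribe a tool; the file name, endpoint URL, and the "id" field below are hypothetical):

import csv
import json
import urllib.request

# Flat file: read a CSV export into a list of dictionaries, one per record.
with open("students.csv", newline="", encoding="utf-8") as f:
    csv_records = list(csv.DictReader(f))

# API: fetch a JSON payload; the endpoint is a placeholder, not a real system.
with urllib.request.urlopen("https://example.com/api/transactions") as resp:
    api_records = json.load(resp)

# Identify the correct subset to process, e.g., only the new records.
known_ids = {r["id"] for r in csv_records}
new_records = [r for r in api_records if str(r["id"]) not in known_ids]
print(len(new_records), "new records extracted")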

6
Data sourcing challenges in self-service

• Data may not already be in the data warehouse


• Data comes from public, non-controlled sources
– the Web, for example
• One needs to ask: is all the data available? Very
often, strategic data are not saved in any
corporate database; instead they are saved in
personal documents like spreadsheets and
presentations. This makes it difficult to load the
data into a dashboard.

7
CSV (Comma-Separated Values)
• CSV is a very popular text file format for data exporting and
exchange.
• The CSV file format is not standardized, so there are many
variations, which creates issues in data extraction.
• In practice the term "CSV" might refer to any file that:
– is plain text using a character set such as ASCII, various Unicode
character sets (e.g. UTF-8),
– consists of records (typically one record per line),
– with the records divided into fields separated by delimiters (typically a
single reserved character such as comma, semicolon, or tab;
sometimes the delimiter may include optional spaces),
– where every record has the same sequence of fields.
• Basic rules and examples
– https://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules
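Because the format is not standardized, a robust reader should not hard-code the delimiter. A small Python sketch (the file name is hypothetical; csv.Sniffer guesses the dialect from a sample of the file):

import csv

with open("export.csv", newline="", encoding="utf-8") as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")  # comma, semicolon, or tab
    f.seek(0)
    reader = csv.reader(f, dialect)
    header = next(reader)   # the first record often holds the field names
    rows = list(reader)     # every record should have the same number of fields
print(header, len(rows))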

8
Typical Extraction Examples
• Export student registration information from the Banner student
information system at the beginning of every semester.
– The source system can produce a report and save it as a CSV file.
• Extract data from quarterly industry survey reports.
– Survey data are downloaded from a website in PDF or Excel report
format.
• Import new transaction records from operational databases
every month.
– The source system provides APIs that can directly export data as high
level data objects.
• Get data from webpages by scraping (see the sketch after this list)
– Data are presented in HTML tables
– Data are not in HTML tables but embedded in structured HTML tags (e.g.,
extracting the KSU class schedule with jQuery)
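A scraping sketch for the HTML-table case, using pandas (an assumption, not a tool the course requires; pandas.read_html also needs an HTML parser such as lxml installed, and the URL is a placeholder):

import pandas as pd

# Each <table> element on the page becomes one DataFrame.
tables = pd.read_html("https://example.edu/class-schedule.html")
schedule = tables[0]
print(schedule.head())

Data buried in other structured HTML tags would instead require a parser such as BeautifulSoup (or client-side jQuery) to select the tags of interest.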

9
Common Extraction Issues
• Data access
– Data are in a proprietary format/system or in flat documents (PDF or scanned
reports) and cannot be extracted easily
– Data are located in a remote system that is difficult to connect to (no direct access)
– Outdated systems
– Lack of high level data parsers (e.g. JSON)
– Incompatible system (e.g. different version of Excel)
• Dirty data (low data quality)
– Poor (source) data model
– Data mixed with “noise,” like excessive formatting and spacing, without
consistent patterns, especially from web pages
– Improper encoding
• Poor data management
– No documentation of data definition, conflicting rules
– Autonomous sources: the source data format and structure may be changed without
notification
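Improper encoding is common enough to handle defensively. A Python sketch that tries UTF-8 first and falls back to legacy code pages (the candidate encodings and the file name are assumptions about the source):

def read_text(path):
    # latin-1 accepts any byte sequence, so it is the last resort.
    for enc in ("utf-8-sig", "cp1252", "latin-1"):
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue

text, used_encoding = read_text("legacy_report.txt")
print("decoded with", used_encoding)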

10
Data Cleanse
• Data cleanse
– The process of identifying and correcting data issues for accuracy,
completeness, consistency, uniformity, and validity.
• Data cleanse utilizes transformation techniques
• Also through data analysis and data profiling
– Identifying missing values
– Duplicates
– Lack of identifiers
– Validating spelling and encoding issues
– Correcting errors
– Removing noise (unwanted, irrelevant, or interfering data or text; empty
space; irregular characters; etc.)
• Documentation and tracing of the ETL process
• Data cleansing can happen during extraction or transformation
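A profiling and cleansing sketch with pandas (an assumption; the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("students.csv")

print(df.isna().sum())        # identify missing values per column
print(df.duplicated().sum())  # count duplicate records
df = df.drop_duplicates()

# Remove noise: trim whitespace and collapse irregular spacing in a text column.
df["name"] = df["name"].str.strip().str.replace(r"\s+", " ", regex=True)

# Validate: flag rows whose ZIP code does not match a simple 5-digit pattern.
bad_zip = ~df["zip"].astype(str).str.fullmatch(r"\d{5}")
print(df[bad_zip])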

11
Transform
• The transform step applies a set of rules to transform the data from the source to the target.
• Common problems and issues can be categorized as (Extraction, Transformation, and Loading, by
Panos Vassiliadis and Alkis Simitsis, book chapter from Encyclopedia of Database Systems, pp
1095-1101):
• Schema-level problems
– Naming conflicts, where the same name is used for different objects (homonyms)
or different names are used for the same object (synonyms)
– Structural conflicts: different representations of the same object in different
sources, or converting data types between sources and the warehouse
• Record-level problems
– Duplicated or contradicting records
– Consistency problems concerning the granularity or timeliness of data (e.g.,
sales per day vs. sales per year) or references to different points in time (e.g.,
current sales as of yesterday for a certain source vs. as of last month for another
source)
– Missing identifier/primary key
• Value-level problems
– Naming inconsistency: SPSU, Southern Poly, Southern Poly Technic
– Format masks, for example different value representations (e.g., for sex:
‘Male’, ‘M’, ‘1’) or different interpretations of the values (e.g., date/time formats:
American ‘mm/dd/yy’ vs. European ‘dd/mm/yy’)
– Missing values (no ZIP code), truncated values
– Other value-level problems include surrogate key management,
substituting constants, setting values to NULL or DEFAULT based on a condition,
or using frequent SQL operators like UPPER, TRUNC, and SUBSTR
• Common problems
– Source data format and structure are changed
– No documentation of data definition, conflicting rules
12
Common Cleanse and Transform Operations

• Common ones
– Projection (filtering of columns)
– Data type/format conversion
– Translating values
– Filtering
– Value encoding
– Derived (calculated) value
– Sorting
– Joining
– Aggregation
– Grouping
– Assigning identifiers or establishing relationships (foreign keys)
– Transposing and pivoting
– Splitting/combining columns
– Disaggregation or normalization
– Lookup and validate
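Several of these operations sketched with pandas (an assumption; the data are made up):

import pandas as pd

sales = pd.DataFrame({"store_id": [1, 1, 2], "product": ["A", "B", "A"],
                      "qty": [3, 1, 5], "price": [2.0, 10.0, 2.0]})
stores = pd.DataFrame({"store_id": [1, 2], "region": ["East", "West"]})

subset = sales[["store_id", "product", "qty", "price"]].copy()  # projection
subset = subset[subset["qty"] > 1]                              # filtering
subset["revenue"] = subset["qty"] * subset["price"]             # derived (calculated) value
joined = subset.merge(stores, on="store_id")                    # joining
summary = joined.groupby("region")["revenue"].sum()             # grouping and aggregation
pivoted = joined.pivot_table(index="region", columns="product",
                             values="revenue", aggfunc="sum")   # pivoting
print(summary)
print(pivoted)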

13
Self Service Data Preparation
• The real technical challenge to enabling “DIY” analytics for business users
resides in data integration and preparation, not visualization.
– http://www.dataversity.net/self-serve-data-preparation-can-support-business-users/
• Traditional ETL is very IT-centric. ETL is usually a separate process
that happens together with data management and storage. It is time
consuming and involves complex and technical tasks like
programming, data structures, models, protocols, APIs, etc.
• New self-service data preparation capabilities give users greater
agility to respond to new data sources and new business requirements.
Here, ETL usually happens together with data analysis. Self-serve data
prep provides business users with powerful capabilities to explore,
manipulate, and merge new data sources – all without the assistance of IT staff.
• Potential problem
– Without appropriate processes and governance, self-service capabilities can
introduce multiple versions of the truth, increase errors in reporting and leave
companies exposed to inconsistent information.

14
https://www.alteryx.com/day-in-the-life-of-an-analyst-without-self-service-analytics

15
Data Preparation Tools
• Data preparation can generally be achieved in two ways: hand-coded
(programmed) ETL and designer tools
• Hand-coded: building custom ETL using a combination of
– OS scripts, like PowerShell or shell scripts
– Procedural languages (for example, Python, Perl, C, JavaScript, or Excel
macros)
– Vendor-specific database languages (PL/SQL, T-SQL, and so on)
• Designer tools
– Enterprise level systems and services like SSDT
https://www.gartner.com/reviews/market/data-integration-tools
– Self service guided tools: like Excel, Power Query
https://www.gartner.com/reviews/market/data-preparation-tools
– These tools usually provide graphical user interfaces for designing and executing
workflows.
• Data preparation tools can be part of a larger tool/system, or can be
rather independent services (a minimal hand-coded sketch follows)
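A minimal hand-coded ETL sketch in Python (the file, table, and column names are illustrative; SQLite stands in for the target store):

import csv
import sqlite3

# Extract: read the source CSV.
with open("sales.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Transform/cleanse: drop rows missing a quantity, normalize the product code.
cleaned = [(r["store_id"], r["product"].strip().upper(), float(r["qty"]))
           for r in rows if r.get("qty")]

# Load: insert into the target table.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS sales (store_id TEXT, product TEXT, qty REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
con.commit()
con.close()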

16
Self-Service Tools

• Power BI/Power Query


– Power Query is part of Excel 2016 and later as
“Get & Transform” (add-in available for previous
Excel versions)
– Also part of Power BI Desktop
– https://www.gartner.com/reviews/market/data-preparation-tools/compare/tableau-vs-microsoft
• More tools
– https://www.gartner.com/reviews/market/data-preparation-tools/vendors
– https://solutionsreview.com/data-integration/the-6-major-players-in-data-preparation-tools/

17
Tool Capabilities

• General
– Connect to different data sources/destinations
– Parse different data formats
– Graphical/visual designer
– Provide common control flow and data transformation tasks
– Job automation and scheduling
– Operation monitoring and error reporting
• Advanced capabilities
– Automation, repeatable
– Smart: allows business users to perform advanced data
discovery; auto-suggests relationships, reveals the impact
and importance of key factors, and recommends data type casts,
data quality improvements, and more

18
Good Resources

• https://www.dataversity.net/self-serve-data-preparation-can-support-business-users/
• https://www.softwareadvice.com/resources/self-service-data-preparation-guide/

• https://searchbusinessanalytics.techtarget.com/definition/data-preparation

19
