Basics of Data Integration

The arguments put forth by him in favor of a library data warehouse are:

• Data from several heterogeneous data sources (MS Excel spreadsheets, MS Access, CSV files, etc.) can
be extracted and brought together in a data warehouse.
• Even when DIIT expands into several branches in multiple cities, it can still have one warehouse to
support the information needs of the institution.

• Data anomalies can be corrected through an ETL package.

• Missing or incomplete records can be detected and duly corrected.

• Uniformity can be maintained over each attribute of a table.

• Data can be conveniently retrieved for analysis and generating reports (like the report on spending
requested above).

• Fact-based decision making can be easily supported by a data warehouse.

• Ad hoc queries can be easily supported.


DATA MART
GOALS of DATA WAREHOUSE
BI-PROCESS
What are the problems faced in Data
Integration?
Challenges in Data Integration
Data Mapping
• Identification of data relationships
• Consolidation of multiple databases into a single database
• Data transformation between data source and data destination
DATA MAPPING
• The process of creating data element mappings between two distinct
data models
• It is used as the first step towards a wide variety of data integration
tasks (see the sketch below), such as:
  – Data transformation between data source and data destination
  – Identification of data relationships
  – Discovery of hidden data
  – Consolidation of multiple databases into a single database
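A minimal data-mapping sketch in Python: each source field is mapped to a destination field together with a transformation applied while moving the data. The source fields (cust_nm, dob, city_code), destination fields and city codes are hypothetical names used only for illustration.

```python
from datetime import datetime

# source field -> (destination field, transformation applied during the move)
FIELD_MAP = {
    "cust_nm":   ("customer_name", str.strip),
    "dob":       ("date_of_birth", lambda s: datetime.strptime(s, "%d/%m/%Y").date()),
    "city_code": ("city",          {"MUM": "Mumbai", "DEL": "Delhi"}.get),
}

def map_record(source_row: dict) -> dict:
    """Apply the field mapping to a single source record."""
    destination_row = {}
    for src_field, (dest_field, transform) in FIELD_MAP.items():
        destination_row[dest_field] = transform(source_row[src_field])
    return destination_row

print(map_record({"cust_nm": " Asha Rao ", "dob": "01/02/1990", "city_code": "MUM"}))
```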
Some Transformation Types
• Selecting only certain columns to load
• Translating a few coded values
• Encoding some free-form values
• Deriving a new calculated value
• Joining together data derived from multiple sources
• Summarizing multiple rows of data
• Splitting a column into multiple columns (several of these are illustrated in the sketch below)
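As a sketch only, the following pandas snippet applies several of these transformation types to a small, made-up orders table. The table and column names, the status codes and the 18% tax rate are assumptions for illustration, not taken from the text.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":  [1, 2, 3],
    "cust_id":   [10, 10, 20],
    "status":    ["S", "C", "S"],
    "amount":    [100.0, 250.0, 80.0],
    "full_name": ["Asha Rao", "Asha Rao", "Ravi Kumar"],
})
customers = pd.DataFrame({"cust_id": [10, 20], "city": ["Mumbai", "Delhi"]})

# Selecting only certain columns to load
staged = orders[["order_id", "cust_id", "status", "amount", "full_name"]].copy()

# Translating coded values
staged["status"] = staged["status"].map({"S": "Shipped", "C": "Cancelled"})

# Deriving a new calculated value (assumed 18% tax rate)
staged["amount_with_tax"] = staged["amount"] * 1.18

# Splitting a column into multiple columns
staged[["first_name", "last_name"]] = staged["full_name"].str.split(" ", n=1, expand=True)

# Joining together data derived from multiple sources
joined = staged.merge(customers, on="cust_id", how="left")

# Summarizing multiple rows of data
summary = joined.groupby("cust_id", as_index=False)["amount"].sum()
print(joined)
print(summary)
```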
Ralph Kimball’s Approach vs W. H. Inmon’s Approach
• Kimball: A data warehouse is made up of all the data marts in the enterprise – bottom-up approach. Faster, cheaper and less complex.
• W. H. Inmon: A DW is a subject-oriented, integrated, time-variant, non-volatile collection of data – top-down approach. Expensive and complex, but it achieves a “single version of the truth” for large organisations and is worth the investment.
Data Integration Technologies
• Data Interchange:
Structured electronic transmission of data between two or more organizations, or between two trading partners – not e-mail.
• Object Request Brokering:
Middleware that allows applications to make calls to one another and also perform in-process data transformation.
Modelling Techniques
Entity Relationship Modelling
• A logical design technique
• Focuses on reducing data redundancy
• Contributes in the initial stages of constructing a DW
• Poses the problem of creating a huge number of tables with dozens of joins – a massive spider web of joins between tables
Steps To Draw an ER Model
1. Identify entities
2. Identify relationships between entities
3. Identify key attributes
4. Identify other relevant attributes
5. Draw the ER diagram
6. Review with business users
Problems posed by ER Modeling
• End users find it difficult to comprehend and traverse
• Lack of software to query a general ER model
• ER modeling cannot be used for a DW, where high-performance access and ad hoc querying are required
Dimensional Modeling
• A logical design technique that focuses on presenting data in a standard format for end-user consumption
• Based on a schema – star or snowflake
• Consists of one large fact table and a number of small dimension tables
• The fact table has a multipart primary key, while each dimension table has a single-part primary key (a minimal sketch follows)
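A minimal star-schema sketch in pandas, assuming made-up sales fact and product/date dimension tables. The fact table is keyed by the combination of the dimension keys, and a typical analytical query joins the tables and aggregates.

```python
import pandas as pd

dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product":     ["Laptop", "Phone"],
    "category":    ["Computers", "Mobiles"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month":    ["Jan", "Jan"],
    "year":     [2024, 2024],
})
# The fact table's primary key is the combination of the dimension keys (multipart key).
fact_sales = pd.DataFrame({
    "product_key": [1, 2, 1],
    "date_key":    [20240101, 20240101, 20240102],
    "units_sold":  [3, 5, 2],
    "revenue":     [3000.0, 2500.0, 2000.0],
})

# A typical analytical query: revenue by category and month.
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["category", "month"], as_index=False)["revenue"].sum())
print(report)
```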
Basics of Data Integration (Extraction, Transformation, Loading)
Difference Between ER Modeling and Dimensional Modeling

ER Modeling                                      | Dimensional Modeling
Optimized for transactional data                 | Optimized for queryability and performance
Eliminates redundant data                        | Retains redundant data where appropriate
Highly normalized                                | Aggregates most of the attributes of a dimension into a single entity
A complex maze with hundreds of entities         | Logically grouped sets of schemas
Used for transactional systems                   | Used for analytical systems
Split as per entities                            | Split as per dimensions and facts


Data Quality
What is data quality?
In simple terms, data quality tells us how reliable a particular set of
data is and whether or not it will be good enough for a user to employ
in decision-making. This quality is often measured by degrees.
Data Quality Dimensions
There are six primary, or core, dimensions to data quality. These are
the metrics analysts use to determine the data’s viability and its
usefulness to the people who need it.

ACCURACY,
COMPLETENESS,
VALIDITY,
CONSISTENCY,
TIMELINESS,
UNIQUENESS
Accuracy
• The data must conform to actual, real-world scenarios and reflect real-world objects and events. Analysts should use verifiable sources to confirm the measure of accuracy, determined by how closely the values agree with verified, correct information sources.
E.g. The address of the customer in the database is the real address
The age of the patient in the hospital database is accurate
Completeness
• Completeness measures whether all the mandatory values are successfully available in the data.
E.g. Data of all students at the university is available
Data of all the patients of a hospital is available
Data of all employees is available
Consistency
• Data consistency describes the data’s uniformity as it moves across applications and networks and when it comes from multiple sources. Consistency also means that the same datasets stored in different locations should be the same and not conflict.
• E.g. A customer cancelled and surrendered his credit card, yet the status still reads “due”
• An employee has left, but his email ID is still active
Timeliness
• Timely data is information that is readily available whenever it is needed. This dimension also covers keeping the data current; data should undergo real-time updates to ensure that it is always available and accessible.
• E.g. Airlines need to provide timely flight information to passengers
• Enterprises need to publish their quarterly results on time
Uniqueness
• Uniqueness means that no duplicate or redundant information overlaps across the datasets: no record in the dataset exists multiple times. Analysts use data cleansing and deduplication to help address a low uniqueness score.
• For example, when reviewing customer data, you should expect each customer to have a unique customer ID (see the sketch below).
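A small sketch of a uniqueness check on a hypothetical customer table with a customer_id column: it flags duplicated IDs, computes a simple uniqueness score and then deduplicates.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name":        ["Asha", "Ravi", "Ravi", "Meena"],
})

# Detail list: records whose customer ID appears more than once
duplicates = customers[customers.duplicated(subset="customer_id", keep=False)]
print("Duplicate customer IDs:")
print(duplicates)

# A simple uniqueness score: distinct IDs as a share of all records
uniqueness_score = customers["customer_id"].nunique() / len(customers)
print(f"Uniqueness score: {uniqueness_score:.0%}")

# Deduplication keeps the first occurrence of each customer ID
deduplicated = customers.drop_duplicates(subset="customer_id", keep="first")
```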
Validity
• Data must be collected according to the organization’s defined business rules and parameters. The information should also conform to the correct, accepted formats, and all dataset values should fall within the proper range.
• Formatting usually includes metadata, such as valid data types, ranges, patterns, and more (a small validity-check sketch follows).
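A hedged sketch of simple validity checks in pandas; the patient table, the ID pattern, the 0–120 age range and the ISO date format are assumed business rules used only for illustration.

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": ["P001", "P2", "P003"],
    "age":        [34, 250, 61],
    "admit_date": ["2024-01-05", "05/01/2024", "2024-02-30"],
})

# Accepted format: 'P' followed by exactly three digits
valid_id = patients["patient_id"].str.match(r"^P\d{3}$")

# Proper range: ages between 0 and 120
valid_age = patients["age"].between(0, 120)

# Valid type/pattern: parseable ISO dates (invalid values become NaT)
valid_date = pd.to_datetime(patients["admit_date"], format="%Y-%m-%d",
                            errors="coerce").notna()

patients["is_valid"] = valid_id & valid_age & valid_date
print(patients)
```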
Data Profiling
• Data profiling is the process of examining, analyzing, and creating useful summaries of data. The process yields a high-level overview which aids in the discovery of data quality issues, risks, and overall trends. Data profiling produces critical insights into data that companies can then leverage to their advantage.
Two types of Data Profiling
• Data Quality Profiling
• Database Profiling
Data Quality Profiling
• Analysing the data from a data source or a database against the business requirements
• Enables you to identify issues in data quality
• The analysis may be represented as:
  – Summaries: counts and percentages on completeness of the dataset, uniqueness of columns, etc.
  – Details: lists containing information on data records or data problems in individual records (see the sketch below)
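A minimal data-quality-profiling sketch in pandas, assuming a hypothetical student table: it produces column-level summaries (completeness and uniqueness percentages) plus a detail list of the problem records.

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 2, 4],
    "email":      ["a@diit.edu", None, "c@diit.edu", "d@diit.edu"],
})

# Summaries: completeness and uniqueness per column
summary = pd.DataFrame({
    "completeness_%": students.notna().mean() * 100,
    "uniqueness_%":   students.nunique() / len(students) * 100,
})
print(summary)

# Details: individual records with problems (missing email or duplicate ID)
problems = students[students["email"].isna()
                    | students.duplicated(subset="student_id", keep=False)]
print(problems)
```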
Database Profiling
• Analysis of a database with respect to its schema, the relationships between tables, the columns used, the keys of the tables, etc. (a schema-inspection sketch follows)
• Database profiling is done initially, followed by data quality profiling
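A small database-profiling sketch using Python's standard sqlite3 module against an in-memory example schema (the department/employee tables are hypothetical); it lists each table's columns, primary keys and foreign-key relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)
    );
""")

tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
for table in tables:
    print(f"Table: {table}")
    # Columns and primary keys
    for cid, name, col_type, notnull, default, pk in conn.execute(f"PRAGMA table_info({table})"):
        print(f"  column {name} ({col_type}){' [primary key]' if pk else ''}")
    # Relationships between tables
    for fk in conn.execute(f"PRAGMA foreign_key_list({table})"):
        print(f"  foreign key {fk[3]} -> {fk[2]}({fk[4]})")
```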
When is it conducted?

At the discovery / requirements gathering phase
• As soon as the source data systems are identified
• Business requirements are laid out
• Data quality profiling needs to be done to avoid correction and re-work

Just before the dimensional modelling process
• Intensive data profiling is done
• More database profiling
• Analysis of schema designs for the DW
• Aims to identify the best method to convert the source data to the dimensional model
• To identify possible errors that may creep in during ETL

During ETL package design
• Helps in identifying what data to extract and what filters to apply
• More data quality profiling
Domino’s data avalanche

• With almost 14,000 locations, Domino’s was already the largest pizza
company in the world by 2015. But when the company launched
its AnyWare ordering system, it was suddenly faced with an avalanche of
data. Users could now place orders through virtually any type of device or
app, including smart watches, TVs, car entertainment systems, and social
media platforms.
• That meant Domino’s had data coming at it from all sides. By putting
reliable data profiling to work, Domino’s now collects and analyzes
data from all of the company’s point-of-sale systems in order to streamline
analysis and improve data quality. As a result, Domino’s has gained deeper
insights into its customer base, enhanced its fraud detection processes,
boosted operational efficiency, and increased sales.
