
Efficiency of Business Intelligence Replenishment

using the Extract, Transform and Load (ETL) Process and

Capture, Transform and Flow (CTF) Software in a Real-Time

Data Warehouse

By:

Christine Mae T. Cion

Cristha Joy A. Simbajon

BSCS-1A

Prof. Hazel Anuncio

June 16, 2020

Table of Contents

i. Introduction

ii. Body

a. Extract, Transform and Load (ETL)

b. Capture, Transform and

Flow (CTF)

iii. Conclusion

iv. Reference

i. Introduction

A Real Time Data Warehouse (RTDW) can be defined as a system that represents the characteristics and the actual situation of an organization. For instance, if we send a request to analyze a particular facet of an organization built on an RTDW, the answer reflects the real state of the organization at the time the request is sent. Unlike most traditional data warehouses, an RTDW contains current (real-time) data about the organization. Thus, the refresh frequency plays a predominant role in an RTDW. (S. Bouaziz, A. Nabli and F. Gargouri, 2007)

Data changes fast, so there is a great deal of data that needs to be replenished in the data warehouse. This is especially true for big companies that continuously push data that must be stored and kept up to date.

An RTDW has low-latency data and provides current (or real-time) data. It can replenish and easily update its data. It is also flexible: flexibility measures how easy it is for an analyst to break out of the standard representation.

The intelligent warehousing solution and framework can

commonly be divided into three fundamental tiers with

data flows between them. The three layers are

Presentation Layer, Architecture Layer, and Middleware

Layer. These tiers or layers must be seamlessly

integrated and function as one to ensure the immediate

success and long-term benefits of a data warehouse.

(J. Vandermay, 2001)

An RTDW uses Business Intelligence (BI) to replenish data. BI is a set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions, and BI systems help businesses to identify market trends and spot business problems that need to be addressed. BI technology can be used by data analysts, IT people, business users and heads of companies. A BI system helps an organization to improve visibility and productivity and fix accountability. The drawbacks of BI are that it is a time-consuming, costly and very complex process. The larger the data warehouse, the longer it takes to replenish.

ii. Body

Currently, the dominant method of replenishing data

warehouses and data marts is to use extraction,

transformation and load (ETL) tools that “pull” data

from source systems periodically – at the end of a day,

week, or month – and provide a “snapshot” of your

business data at a given moment in time. That batch

data is then loaded into a data warehouse table. During

each cycle, the warehouse table is completely refreshed

and the process is repeated no matter whether the data

has changed or not. (J. Vandermay, 2001)

Due to the growth in business and the related increases

in data, companies may find it difficult to fit their

“batch job” into a periodic time window of eight hours,

and may as a result cut into normal usage hours. In

contrast to standard ETL tools, consider an advanced

CTF solution that instead captures, transforms and

flows data in real-time into an efficient, continuously

replenished data warehouse.

Extract, Transform and Load (ETL)

ETL is a process that extracts the data from different

source systems, then transforms the data (like applying

calculations, concatenations, etc.) and finally loads

the data into the Data Warehouse system. The ETL

process requires active inputs from various

stakeholders including developers, analysts, testers,

top executives and is technically challenging.

In order to maintain its value as a tool for decision-makers, a data warehouse system needs to change with

business changes. ETL is a recurring activity (daily,

weekly, monthly) of a Data warehouse system and needs

to be agile, automated, and well documented.

ETL Process in Data Warehousing

ETL is a 3-step process

Step 1) Extraction

In this step, data is extracted from the source system into the staging area. Transformations, if any, are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data is copied directly from the source into the data warehouse database, rollback will be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the data warehouse.

A data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs, call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, amongst others. Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between source and target data.
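To illustrate, a logical data map can be sketched as a simple lookup that pairs each target column with its source field and the rule that connects them. The system names, field names and rules below are hypothetical:

```python
# A minimal, hypothetical logical data map: each entry records where a
# target column comes from and which rule (if any) transforms it.
logical_data_map = {
    "customer_name": {"source": "crm.cust.full_name", "rule": "pass-through"},
    "order_total":   {"source": "erp.orders.amount",  "rule": "convert to USD"},
    "order_date":    {"source": "pos.sales.ts",       "rule": "truncate to date"},
}

# Before extraction, the map tells us exactly which source systems are touched.
source_systems = {entry["source"].split(".")[0]
                  for entry in logical_data_map.values()}
```

In practice such a map is usually maintained as a spreadsheet or metadata table, but the idea is the same: every target column is traceable to a source and a rule before any physical extraction happens.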

Three Data Extraction methods:

1. Full Extraction

2. Partial Extraction - without update notification

3. Partial Extraction - with update notification

Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases. Any slowdown or locking could affect the company's bottom line.

Some validations are done during Extraction:

 Reconcile records with the source data

 Make sure that no spam/unwanted data is loaded

 Data type check

 Remove all types of duplicate/fragmented data

 Check whether all the keys are in place or not
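The extraction-stage checks above can be sketched in a few lines. The record layout and key field are hypothetical; a real pipeline would run these in the staging area:

```python
# Sketch of extraction-stage validation: key presence, data type check,
# and duplicate removal, run before anything enters the warehouse.
def validate_extracted(rows, key_field="id"):
    seen, clean = set(), []
    for row in rows:
        # Check whether the key is in place
        if row.get(key_field) is None:
            continue
        # Data type check: the key must be an integer
        if not isinstance(row[key_field], int):
            continue
        # Remove duplicate records
        if row[key_field] in seen:
            continue
        seen.add(row[key_field])
        clean.append(row)
    return clean

rows = [{"id": 1}, {"id": 1}, {"id": None}, {"id": "x"}, {"id": 2}]
validated = validate_extracted(rows)  # keeps the rows with ids 1 and 2
```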

Step 2) Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data such that insightful BI reports can be generated.

In this step, you apply a set of functions to the extracted data. Data that does not require any transformation is called direct move or pass-through data.

In the transformation step, you can perform customized operations on data. For instance, the user may want sum-of-sales revenue, which is not in the database, or the first name and the last name in a table may be in different columns; it is possible to concatenate them before loading.
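That example can be sketched as follows; the rows and column names are invented for illustration:

```python
# Hypothetical rows where first and last name sit in separate columns
# and sum-of-sales is not stored anywhere in the database.
rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "sales": 120.0},
    {"first_name": "Jon", "last_name": "Smith",    "sales": 80.0},
]

# Concatenate the name columns before loading...
for row in rows:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

# ...and derive the sum-of-sales revenue the user asked for.
sum_of_sales = sum(row["sales"] for row in rows)  # 200.0
```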

Following are Data Integrity Problems:

1. Different spelling of the same person like Jon, John,

etc.

2. There are multiple ways to denote company name like

Google, Google Inc.

3. Use of different names like Cleaveland, Cleveland.

4. There may be a case that different account numbers are

generated by various applications for the same

customer.

5. In some data, required fields remain blank.

6. Invalid product data collected at POS, as manual entry can lead to mistakes.

Validations are done during this stage:

 Filtering – Select only certain columns to load

 Using rules and lookup tables for Data standardization

 Character Set Conversion and encoding handling

 Conversion of Units of Measurements like Date Time

Conversion, currency conversions, numerical

conversions, etc.

 Data threshold validation check. For example, age cannot

be more than two digits.

 Data flow validation from the staging area to the

intermediate tables.

 Required fields should not be left blank.

 Cleaning (for example, mapping NULL to 0 or Gender Male

to "M" and Female to "F" etc.)

 Splitting a column into multiple columns and merging multiple columns into a single column.

 Transposing rows and columns.

 Use lookups to merge data

 Using any complex data validation (e.g., if the first

two columns in a row are empty then it automatically

rejects the row from processing)
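A few of the cleaning and validation rules above can be sketched as one row-level function. The column names, thresholds and sample rows are illustrative only:

```python
# Sketch of several transformation-stage rules: threshold validation,
# empty-column rejection, NULL-to-0 cleaning, and gender standardization.
def clean_row(row):
    # Threshold check: age cannot be more than two digits
    if row.get("age") is not None and row["age"] > 99:
        return None
    # Complex validation: reject the row if the first two columns are empty
    values = list(row.values())
    if values[0] in (None, "") and values[1] in (None, ""):
        return None
    # Cleaning: map NULL to 0 and gender words to single letters
    row["sales"] = row.get("sales") or 0
    row["gender"] = {"Male": "M", "Female": "F"}.get(row.get("gender"),
                                                     row.get("gender"))
    return row

cleaned = [r for r in map(clean_row, [
    {"name": "Jon", "gender": "Male", "sales": None, "age": 41},
    {"name": "",    "gender": None,   "sales": 5,    "age": 30},
]) if r is not None]  # the second row is rejected: first two columns empty
```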

Step 3) Loading

Loading data into the target data warehouse database is the last step of the ETL process. In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart from the point of failure without data integrity loss. Data warehouse admins need to monitor, resume, or cancel loads as per prevailing server performance.

Types of Loading:

 Initial Load — populating all the Data Warehouse tables

 Incremental Load — applying ongoing changes periodically as needed.

 Full Refresh — erasing the contents of one or more tables and reloading with fresh data.
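The three load styles can be sketched against a hypothetical in-memory table keyed by id; a real warehouse would perform the equivalent in SQL (for example with MERGE statements or bulk loaders):

```python
# A toy "warehouse table" keyed by primary key, used only to contrast
# the three loading styles described above.
table = {}

def initial_load(rows):
    # Populate the table from scratch.
    table.clear()
    table.update({r["id"]: r for r in rows})

def incremental_load(changed_rows):
    # Apply only the ongoing changes: insert or update by key.
    for r in changed_rows:
        table[r["id"]] = r

def full_refresh(rows):
    # Erase the contents and reload with fresh data.
    initial_load(rows)

initial_load([{"id": 1, "qty": 10}, {"id": 2, "qty": 5}])
incremental_load([{"id": 2, "qty": 7}, {"id": 3, "qty": 1}])
```

The point of the contrast: the incremental load touches only the two changed keys, while a full refresh would rewrite the entire table whether or not anything changed.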

Load verification

 Ensure that the key field data is neither missing nor

null.

 Test modeling views based on the target tables.

 Check combined values and calculated measures.

 Data checks in the dimension table as well as the history table.

 Check the BI reports on the loaded fact and dimension

table.

Capture, Transform and Flow (CTF)

Capture, transform and flow (CTF) software exists that can facilitate the real-time delivery of meaningful information to subscribed systems, movement among varied platforms and databases, and the selecting and filtering of the data transmitted. CTF tools enable you to capture raw data from multiple operational databases and flow the data in real time into data warehouse tables while transforming it into meaningful information as it flows between different systems.

Users are empowered through the means to translate values, derive new calculated fields, and reformat field sizes, table names and data types. CTF tools help

accelerate time to market while adding value to

business intelligence information by keeping the data

clean, current and in a format conducive to query and

analysis. Through a combination of best practices and

best-of-breed solutions such as capture, transform and

flow tools, companies can reasonably expect to have end

users querying the data warehouse within a short time

frame.

Change Data Capture

Today, more and more businesses using a data warehouse

are beginning to realize they cannot achieve point-in-

time consistency without continuous, real-time change

data capture. There are several techniques used by data

integration / replenishment software to move data.

Essentially, integration tools either push or pull data

on an event driven or polling basis.

Push integration is initiated at the source for each

subscribed target. This means that as changes occur,

they are captured and sent, or “pushed” across to each

target. Pull integration is initiated at the target by

each subscribed target. In other words, the target

system extracts the captured changes and “pulls” them

down to the local database.
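Push integration can be sketched as a minimal observer pattern; the Source class and the change format here are illustrative, not any real CTF product's API:

```python
# Minimal observer-style sketch of push integration: each change at the
# source is captured and immediately delivered to every subscribed target.
class Source:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, target):
        self.subscribers.append(target)

    def apply_change(self, change):
        # As the change occurs, it is captured and "pushed" to each target.
        for target in self.subscribers:
            target.append(change)

warehouse, data_mart = [], []
src = Source()
src.subscribe(warehouse)
src.subscribe(data_mart)
src.apply_change({"op": "add", "id": 1})
```

Pull integration would invert the control: each target would periodically call the source to extract whatever changes have accumulated since its last poll.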

But in order to compete with an information-driven

Internet era, organizations must employ solutions that

offer the option of updating databases as incremental

changes occur, reflecting those changes to subscribed

systems. With advanced CTF solutions, every time an

add, change or delete occurs in the production

environment, it is automatically captured and

integrated or “pushed” in real-time to the data

warehouse. By significantly reducing batch window

requirements and instead making incremental updates,

users regain computing time once lost.

Beyond real-time integration, change data capture can

also be done periodically. Data can be captured and

then stored until a predetermined integration time. For

example, an organization may schedule its


refreshes of full tables or changes to tables to be

integrated hourly or nightly. Only data that has

changed since the previous integration needs to be

transformed and transported to the subscriber. The data

warehouse can therefore be kept current and consistent

with the source databases. (S. Bouaziz, A. Nabli and F.

Gargouri,2007)

Transform

Transformational data integration software can conduct

individual tasks such as translating values, deriving

new calculated fields, joining tables at source,

converting date fields, and reformatting field sizes,

table names and data types. All of these functions

allow for code conversion, removal of ambiguity and

confusion associated with data, standardization,

measurement conversions, and consolidating dissimilar

data structures for data consistency.

Flow

Flow refers to replenishing the feed of transformed data in real time from multiple operational systems to one or more subscriber systems. Whether the target is a data warehouse or several

data marts, the flow process is a smooth, continuous

stream of bits of information as opposed to the batch

loading of data performed by ETL tools.
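The contrast can be sketched with a hypothetical example: a batch load materializes everything at once per cycle, while a flow yields each change as it arrives:

```python
# ETL-style batch: one big result produced once per cycle.
def batch_load(changes):
    return list(changes)

# CTF-style flow: a generator yields each change as a continuous stream.
def continuous_flow(changes):
    for change in changes:
        yield change

stream = continuous_flow([{"id": 1}, {"id": 2}])
first = next(stream)  # the subscriber sees the first change immediately
```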

iii. Conclusion

The competitive business economy is rapidly growing, and due to the updating and increase of data, companies may find it difficult to load data into a data warehouse for downstream data analytics. This problem can be avoided by selecting a replenishment solution that offers advanced capture, transform and flow (CTF). In contrast to standard ETL tools, an advanced CTF solution captures, transforms and flows data in real time into an efficient, continuously replenished data warehouse. A CTF tool can contribute to the simplicity and efficiency of a real-time data warehouse.

iv. Reference

Krishna (2020). "What is Business Intelligence?". https://www.guru99.com/business-intelligence-definition-example.html

J. Vandermay (2001). "Considerations for Building a Real-time Data Warehouse". https://www.semanticscholar.org/paper/Considerations-for-Building-a-Real-time-Data-Vandermay/9c9f28799b8280de9ac574a0e416a18089dd7eca

Krishna (2020). "What is ETL?". https://www.guru99.com/etl-extract-load-process.html
