Online Analytical Processing

OLAP (or Online Analytical Processing) has been growing in popularity due to the
increase in data volumes and the recognition of the business value of analytics. Until the mid-
nineties, performing OLAP analysis was an extremely costly process mainly restricted to
larger organizations.
The major OLAP vendors were Hyperion, Cognos, Business Objects and MicroStrategy. The cost per seat was in the range of $1,500 to $5,000 per annum. Setting up the environment to perform OLAP analysis also required substantial investments in time and money.
This has changed as the major database vendors have started to incorporate OLAP modules within their database offerings: Microsoft SQL Server 2000 with Analysis Services, Oracle with Express and Darwin, and IBM with DB2.
Examples of OLTP systems include ERP, CRM, SCM, point-of-sale applications and call centres.
OLTP systems are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transaction to occur instantaneously. With a database design (the data model) optimized for transactions, the record 'Consumer name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly in the database, and the results can be recalled by managers equally quickly if needed.
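As a minimal sketch of that idea (using Python's built-in sqlite3 module; the orders table and its columns are hypothetical), the record is created in one short transaction and recalled by order number:

import sqlite3

# Hypothetical schema for illustration; a real OLTP design would normalize
# customer and order data into separate tables.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_number   INTEGER PRIMARY KEY,
        consumer_name  TEXT NOT NULL,
        address        TEXT,
        telephone      TEXT,
        order_name     TEXT,
        price          REAL,
        payment_method TEXT
    )
""")

# A short transaction: the record is written and committed immediately.
with conn:
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?, ?)",
        (1001, "Jane Doe", "12 High St", "555-0100", "Blue Widget", 19.99, "card"),
    )

# The primary-key index makes the recall equally fast.
row = conn.execute(
    "SELECT consumer_name, price FROM orders WHERE order_number = ?", (1001,)
).fetchone()
print(row)  # ('Jane Doe', 19.99)
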

Online Transaction Processing


OLTP (or Online Transaction Processing) is mainly used in industries that rely
heavily on the efficient processing of a large number of client transactions, e.g., banks,
airlines and retailers. Database systems that support OLTP are usually decentralized to avoid
single points of failure and to spread the volume between multiple servers.
OLTP systems must provide atomicity: the ability to either fully process or completely undo a transaction. Partial processing is never an option. When airline passenger seats are booked,
atomicity combines the two system actions of reserving and paying for the seat. Both actions
must happen together or not at all.
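A minimal sketch of this all-or-nothing behaviour, again using Python's sqlite3 module with hypothetical seats and payments tables: both actions run in a single transaction, so a failure in the payment step also rolls back the seat reservation.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seats    (seat_no TEXT PRIMARY KEY, passenger TEXT);
    CREATE TABLE payments (seat_no TEXT, amount REAL CHECK (amount > 0));
    INSERT INTO seats VALUES ('12A', NULL);
""")

def book_seat(seat_no, passenger, amount):
    """Reserve a seat and record the payment as one atomic transaction."""
    try:
        with conn:  # one transaction: commit on success, rollback on any error
            conn.execute("UPDATE seats SET passenger = ? WHERE seat_no = ?",
                         (passenger, seat_no))
            conn.execute("INSERT INTO payments VALUES (?, ?)", (seat_no, amount))
    except sqlite3.Error:
        print("Booking failed; both actions were rolled back")

book_seat("12A", "A. Passenger", -10)  # the payment violates the CHECK constraint
print(conn.execute("SELECT passenger FROM seats WHERE seat_no = '12A'").fetchone())
# (None,) -- the seat reservation did not survive on its own
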
Typically, OLTP systems are used for order entry, financial transactions, customer
relationship management (CRM) and retail sales. Such systems have a large number of users
who conduct short transactions. Database queries are usually simple, require sub-second
response times and return relatively few records.
Another important attribute of an OLTP system is its ability to maintain concurrency: many users must be able to work with the data at the same time without interfering with one another.
IBM's CICS (Customer Information Control System) is a well-known OLTP product.
Difference Between OLTP And OLAP

OLTP: Stands for Online Transaction Processing.
OLAP: Stands for Online Analytical Processing.

OLTP: It is operational data.
OLAP: It is historical / consolidated data.

OLTP: It is used to control and run fundamental business tasks.
OLAP: It is used to help with planning, problem solving and decision support.

OLTP: It is the original source of the data.
OLAP: The OLAP data comes from the various OLTP databases.

OLTP: Processing speed is very fast.
OLAP: Processing speed is slow.

OLTP: The database design is highly normalized, with many tables.
OLAP: The database design is denormalized, with fewer tables, mostly using a star or snowflake schema.

OLTP: It is the business process engine.
OLAP: It is the reporting engine.

OLTP: It processes simple queries.
OLAP: It processes complex queries.

OLTP: It focuses on updating data.
OLAP: It focuses on reporting data.

OLTP: It is characterized by a large number of short online transactions.
OLAP: It is characterized by a low volume of transactions.
Data Warehouse
The purpose of the Data Warehouse in the overall Data Warehousing Architecture is to integrate corporate data. It contains the "single version of the truth" for the organization, carefully constructed from data stored in disparate internal and external operational databases.
The amount of data in the Data Warehouse is massive. Data is stored at a very granular level
of detail.
For example, every "sale" that has ever occurred in the organization is recorded and related to dimensions of interest. This allows data to be sliced and diced, summed and grouped in countless ways.
Typical Data Warehousing Environment
Contrary to popular opinion, the Data Warehouse does not contain all the data in the
organization. Its purpose is to provide key business metrics that are needed by the
organization for strategic and tactical decision making.
Decision makers don't access the Data Warehouse directly. Instead, they use various front-end tools that read data from subject-specific Data Marts.
The Data Warehouse can be either "relational" or "dimensional". This depends on how the
business intends to use the information.

Components of Business Intelligence Architecture


One mistake that top leaders of many organizations make is to think of their BI system as equivalent to the front-end BI tools being used. At the other extreme, technically minded staff often discuss business intelligence architecture in terms of fancy jargon without giving due importance to what exactly comprises a BI architecture.
The key elements of a business intelligence architecture are:
• Source systems
• ETL process
• Data modelling
• Data warehouse
• Enterprise information management (EIM)
• Appliance systems
• Tools and technologies
ETL (Extract, Transform, and Load) Process
ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts the data from different RDBMS source systems, transforms it (applying calculations, concatenations and so on), and then loads it into the Data Warehouse system.
It is tempting to think that creating a Data Warehouse is simply a matter of extracting data from multiple sources and loading it into a Data Warehouse database. This is far from the truth: it requires a complex ETL process. The ETL process demands active input from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.
In order to maintain its value as a tool for decision-makers, a Data Warehouse system needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data Warehouse system and needs to be agile, automated, and well documented.
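As a rough, self-contained sketch of this recurring cycle (the source rows, field names and rules below are made up for illustration), the whole pipeline reduces to three chained steps:

def extract(sources):
    """Pull raw rows from every source system into one staging list."""
    return [row for source in sources for row in source]

def transform(rows):
    """Cleanse staged rows: drop rows without a key, standardize names."""
    return [
        {**row, "customer": row["customer"].strip().title()}
        for row in rows
        if row.get("order_id") is not None
    ]

def load(rows, warehouse):
    """Append the transformed rows to the (here, in-memory) warehouse table."""
    warehouse.extend(rows)

crm = [{"order_id": 1, "customer": " jane doe "}]
erp = [{"order_id": None, "customer": "bad row"}]  # rejected during transform
warehouse = []

load(transform(extract([crm, erp])), warehouse)
print(warehouse)  # [{'order_id': 1, 'customer': 'Jane Doe'}]
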

ETL Process in Data Warehouses


ETL is a three-step process.
Step 1) Extraction
In this step, data is extracted from the source system into the staging area. Any transformations are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the Data Warehouse database, rollback would be a challenge. The staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse.
A Data Warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols. Sources could include legacy applications such as mainframes, customized applications, point-of-contact devices such as ATMs and call switches, text files, spreadsheets, ERP systems, and data from vendors and partners, amongst others.
Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between the sources and the target data.
Three Data Extraction methods:
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
Some validations are done during Extraction (a brief sketch follows this list):
• Reconcile records with the source data.
• Make sure that no spam/unwanted data is loaded.
• Data type check.
• Remove all types of duplicate/fragmented data.
• Check whether all the keys are in place or not.
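A hedged sketch of a few of these extraction-time checks in Python (the expected schema and staging rows are hypothetical): it applies a data type check, removes exact duplicates and verifies that the key is in place.

# Hypothetical staging rows and expected column types, for illustration only.
expected_types = {"order_id": int, "amount": float, "customer": str}

staged = [
    {"order_id": 1, "amount": 19.99, "customer": "Jane"},
    {"order_id": 1, "amount": 19.99, "customer": "Jane"},   # duplicate
    {"order_id": "x", "amount": 5.00, "customer": "Bob"},   # wrong data type
    {"order_id": None, "amount": 7.50, "customer": "Eve"},  # missing key
]

def validate(rows):
    clean, seen = [], set()
    for row in rows:
        if row["order_id"] is None:          # check that the key is in place
            continue
        if not all(isinstance(row[col], typ) for col, typ in expected_types.items()):
            continue                         # data type check
        key = tuple(sorted(row.items()))
        if key in seen:                      # remove exact duplicates
            continue
        seen.add(key)
        clean.append(row)
    return clean

print(validate(staged))  # only the first row survives
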
Step 2) Transformation
Data extracted from source server is raw and not usable in its original form. Therefore it
needs to be cleansed, mapped and transformed. In fact, this is the key step where ETL
process adds value and changes data such that insightful BI reports can be generated.
In this step, you apply a set of functions on extracted data. Data that does not require any
transformation is called as direct move or pass through data.
In transformation step, you can perform customized operations on data. For instance, if the
user wants sum-of-sales revenue which is not in the database. Or if the first name and the last
name in a table is in different columns. It is possible to concatenate them before loading.
Following are common data integrity problems:
1. Different spellings of the same person, such as Jon and John.
2. Multiple ways of denoting a company name, such as Google and Google Inc.
3. Use of different spellings of the same name, such as Cleaveland and Cleveland.
4. Different account numbers generated by various applications for the same customer.
5. Required fields left blank in some data.
6. Invalid product data collected at the POS, as manual entry can lead to mistakes.
Validations done during this stage include (a brief sketch follows this list):
• Filtering: select only certain columns to load.
• Using rules and lookup tables for data standardization.
• Character set conversion and encoding handling.
• Conversion of units of measurement, such as date/time conversion, currency conversions, numerical conversions, etc.
• Data threshold validation check; for example, age cannot be more than two digits.
• Data flow validation from the staging area to the intermediate tables.
• Required fields should not be left blank.
• Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F", etc.).
• Splitting a column into multiple columns and merging multiple columns into a single column.
• Transposing rows and columns.
• Using lookups to merge data.
• Applying any complex data validation (e.g., if the first two columns in a row are empty, the row is automatically rejected from processing).
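A brief sketch of a few of these transformation rules in Python (the field names are hypothetical): it concatenates first and last name, maps gender to "M"/"F", maps NULL to 0 and rejects rows whose age exceeds two digits.

def transform_row(row):
    """Apply illustrative cleansing rules; return None to reject the row."""
    if row.get("age") is not None and row["age"] > 99:       # threshold check
        return None
    return {
        "full_name": f'{row["first_name"]} {row["last_name"]}',  # concatenation
        "gender": {"Male": "M", "Female": "F"}.get(row.get("gender"), "U"),
        "sales": row.get("sales") or 0,                      # map NULL (None) to 0
        "age": row.get("age"),
    }

rows = [
    {"first_name": "Jon", "last_name": "Doe", "gender": "Male", "sales": None, "age": 34},
    {"first_name": "Ann", "last_name": "Lee", "gender": "Female", "sales": 120.0, "age": 340},
]
print([r for r in (transform_row(row) for row in rows) if r is not None])
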
Step 3) Loading
Loading data into the target datawarehouse database is the last step of the ETL process. In a
typical Data warehouse, huge volume of data needs to be loaded in a relatively short period
(nights). Hence, load process should be optimized for performance.
In case of load failure, recover mechanisms should be configured to restart from the point of
failure without data integrity loss. Data Warehouse admins need to monitor, resume, cancel
loads as per prevailing server performance.
Types of Loading (a brief sketch follows this list):
• Initial Load: populating all the Data Warehouse tables.
• Incremental Load: applying ongoing changes periodically, as needed.
• Full Refresh: erasing the contents of one or more tables and reloading them with fresh data.
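A rough illustration of the three load types using Python's sqlite3 module (the fact_sales table is hypothetical, and the incremental load assumes a SQLite version with upsert support):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

def initial_load(rows):
    """Populate the empty warehouse table in one pass."""
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

def incremental_load(rows):
    """Apply only the ongoing changes: insert new keys, update existing ones."""
    with conn:
        conn.executemany(
            "INSERT INTO fact_sales VALUES (?, ?) "
            "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
            rows,
        )

def full_refresh(rows):
    """Erase the table contents and reload them with fresh data."""
    with conn:
        conn.execute("DELETE FROM fact_sales")
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

initial_load([(1, 10.0), (2, 20.0)])
incremental_load([(2, 25.0), (3, 30.0)])  # one update, one new row
print(conn.execute("SELECT * FROM fact_sales ORDER BY sale_id").fetchall())
# [(1, 10.0), (2, 25.0), (3, 30.0)]
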
Load verification
• Ensure that the key field data is neither missing nor null.
• Test modelling views based on the target tables.
• Check the combined values and calculated measures.
• Check data in the dimension tables as well as the history tables.
• Check the BI reports on the loaded fact and dimension tables.
Data Quality
Data quality refers to the overall utility of a dataset as a function of its ability to be easily processed and analysed for other uses, usually by a database, data warehouse, or data analytics system. Quality data is useful data. To be of high quality, data must be consistent and unambiguous. Data quality issues are often the result of database merges or systems/cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies. Data that is not high quality can undergo data cleansing to raise its quality.

What activities are involved in data quality?


Data quality activities involve data rationalization and validation.
Data quality efforts are often needed when integrating the disparate applications brought together during merger and acquisition activities, but also when siloed data systems within a single organization are combined for the first time in a data warehouse or big data lake. Data quality is also critical to the efficiency of horizontal business applications such as enterprise resource planning (ERP) or customer relationship management (CRM).

What are the benefits of data quality?


When data is of excellent quality, it can be easily processed and analysed, leading to insights
that help the organization make better decisions. High-quality data is essential to business
intelligence efforts and other types of data analytics, as well as better operational efficiency.

Why data quality is important?


Poor-quality data is often pegged as the source of inaccurate reporting and ill-conceived
strategies in a variety of companies, and some have attempted to quantify the damage done.
Economic damage due to data quality problems can range from added miscellaneous
expenses when packages are shipped to wrong addresses, all the way to steep regulatory
compliance fines for improper financial reporting.
An oft-cited estimate originating from IBM suggests the yearly cost of data quality issues in the U.S. during 2016 alone was about $3.1 trillion. Lack of trust in data quality by business managers is commonly cited as one of the chief impediments to decision-making.
Data profiling
Data profiling, also called data archaeology, is the statistical analysis and assessment of data
values within a data set for consistency, uniqueness and logic.
The data profiling process cannot identify inaccurate data; it can only identify business rules
violations and anomalies. The insight gained by data profiling can be used to determine how
difficult it will be to use existing data for other purposes. It can also be used to provide
metrics to assess data quality and help determine whether or not metadata accurately
describes the source data.
Profiling tools evaluate the actual content, structure and quality of the data by exploring
relationships that exist between value collections both within and across data sets. For
example, by examining the frequency distribution of different values for each column in a
table, an analyst can gain insight into the type and use of each column. Cross-column analysis
can be used to expose embedded value dependencies and inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
Data profiling is a crucial part of:
• Data warehouse and business intelligence (DW/BI) projects: data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
• Data conversion and migration projects: data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
• Source system data quality projects: data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).
Data profiling involves (see the sketch after this list):
• Collecting descriptive statistics like min, max, count and sum.
• Collecting data types, lengths and recurring patterns.
• Tagging data with keywords, descriptions or categories.
• Performing data quality assessment and assessing the risk of performing joins on the data.
• Discovering metadata and assessing its accuracy.
• Identifying distributions, key candidates, foreign-key candidates, functional dependencies and embedded value dependencies, and performing inter-table analysis.
Conformed Dimension
A dimension occurring in multiple domains of an organization that has been harmonized for
coherent, consistent use. A conformed dimension can be used in different data warehouses in
the same organization. This supports expansion and growth of the warehouse.
Ex: A Time dimension with attributes day, month, year is conformed if it is defined once and
reused across multiple cubes.

Fact
An individual record of business activity that is stored in a data warehouse. Each fact
contains one or more measures (numbers, amounts, or prices) and a series of fields
(dimensions) by which the fact can later be analysed. Facts are the foundation of data
warehouse tables and OLAP data cubes.
Ex: In the area of Faculty Research, one fact is recorded per researcher, per project that
he/she proposes, per year. Each fact contains several measures (dollars requested, dollars
awarded, etc.) and several dimension fields (the researcher, sponsoring agency, type of
research, title of the project, etc.)

Data Cube
A database structure that forms the basis for analysis applications. The name "cube" suggests
that the data inside has multiple "dimensions" to it. That is, if a regular spreadsheet table has
two dimensions through which its data can be viewed or calculated (one set of labels going
across and one going down), a data cube has three or four or sometimes many more. These
dimensions can be stacked, combined, and drilled into in multiple ways, and are well suited
to browsing in pivot tables. Cubes often contain pre-calculated sub-totals, in myriad
combinations and at different levels of aggregation, to enhance speed and usability.
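A small sketch of the idea using a pandas pivot table over made-up sales data; the margins option adds the kind of pre-calculated sub-totals mentioned above.

import pandas as pd

# Hypothetical sales facts with three dimensions: year, region and product.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "West"],
    "product": ["A", "A", "B", "A", "B"],
    "amount":  [100, 150, 120, 90, 60],
})

# A simple cube-style view: amount aggregated by year, region and product,
# with pre-calculated sub-totals (margins) at each level of aggregation.
cube = pd.pivot_table(
    sales,
    values="amount",
    index=["year", "region"],
    columns="product",
    aggfunc="sum",
    margins=True,
    fill_value=0,
)
print(cube)
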

Attribute
A field that makes up part of a data dimension. Each attribute is represented by a column in a
table, a report, or a chart.
Ex: A Person dimension has attributes such as ID, name, age, and gender.

Dimensional Hierarchy
An arrangement of multiple levels of granularity within a single dimension. With a hierarchy
in place, data for a given dimension can be rolled up to aggregated totals, or drilled down into
for finer analysis. This can be represented in a data model by multiple columns within a
dimension table in standard star schemas called hierarchy columns.
Ex: Degrees are arranged in a hierarchy of Degree Level (Undergraduate, Graduate, etc.) with
individual Degrees (Bachelor of Arts, Master of Science, Doctor of Philosophy, etc.) below.
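A short sketch of rolling up along a Date hierarchy (day to month to year) with pandas, using made-up fact rows with hierarchy columns.

import pandas as pd

# Hypothetical fact rows with hierarchy columns on the Date dimension.
facts = pd.DataFrame({
    "year":  [2024, 2024, 2024, 2024],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "day":   [1, 2, 1, 2],
    "sales": [10, 20, 30, 40],
})

# Roll up: aggregate daily sales to the month level, then to the year level.
by_month = facts.groupby(["year", "month"], sort=False)["sales"].sum()
by_year = facts.groupby("year")["sales"].sum()

print(by_month)  # Jan 30, Feb 70 -- drill down to this level for finer analysis
print(by_year)   # 2024 100
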

Star Schema
A star schema is a logical model used to organize data in a data warehouse. The "star" is a
central fact table with an array of dimension tables organized around it. This is a natural fit
for cube data: it allows information to be viewed from many perspectives and facilitates
multidimensional querying.
Ex: A central fact table for enrolment can have dimensions for Term, Faculty, and
Citizenship.
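A minimal sketch of the enrolment star in pandas (the keys and figures are hypothetical): a central fact table is joined to Term and Faculty dimension tables and then queried across both dimensions.

import pandas as pd

# Dimension tables (hypothetical keys and attributes).
dim_term = pd.DataFrame({"term_id": [1, 2], "term": ["Fall 2024", "Winter 2025"]})
dim_faculty = pd.DataFrame({"faculty_id": [10, 20], "faculty": ["Science", "Arts"]})

# Central fact table: one enrolment measure per row, keyed to the dimensions.
fact_enrolment = pd.DataFrame({
    "term_id":    [1, 1, 2],
    "faculty_id": [10, 20, 10],
    "students":   [300, 120, 280],
})

# Multidimensional query: join the star and view enrolment by term and faculty.
star = (fact_enrolment
        .merge(dim_term, on="term_id")
        .merge(dim_faculty, on="faculty_id"))
print(star.groupby(["term", "faculty"])["students"].sum())
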
