
Business Analytics (EMIS 508)

BA at Data Warehouse Level

Dr. Md. Rakibul Hoque


University of Dhaka
Data for Business Analytics

 Metrics are used to quantify performance.


 Measures are numerical values of metrics.

 Discrete metrics involve counting

- number or proportion of on-time deliveries


 Continuous metrics are measured on a
continuum
- package weight
- purchase price
Data for Business Analytics

• When collecting or gathering data, we collect data
from individual cases on particular variables.
• A variable is a unit of data collection whose value
can vary.
• Variables can be defined into types according to the
level of mathematical scaling that can be carried out
on the data.
• There are four types of data or levels of
measurement:
1. Categorical (Nominal)
2. Ordinal
3. Interval
4. Ratio
Categorical (Nominal) data
• Nominal or categorical data comprises categories
that cannot be rank ordered – each category is just
different.
• The categories available cannot be placed in any order and
no judgement can be made about the relative size or distance
from one category to another.
Categories bear no quantitative relationship to one another
Examples:
- customer’s location (America, Europe, Asia)
• What does this mean? No mathematical operations can be
performed on the data relative to each other.
•Therefore, nominal data reflect qualitative differences rather
than quantitative ones.
Nominal data

Examples:

What is your gender? (please tick)

 Male
 Female

Did you enjoy the film? (please tick)

 Yes
 No
Nominal data

•Systems for measuring nominal data must


ensure that each category is mutually
exclusive and the system of measurement
needs to be exhaustive.
•Variables that have only two responses i.e.
Yes or No, are known as dichotomies.
Ordinal data

• Ordinal data comprises categories that can be
rank ordered.
• As with nominal data, the distance between each
category cannot be calculated, but the categories can
be ranked above or below each other.
No fixed units of measurement
Examples:
- college football rankings
- survey responses
(poor, average, good, very good, excellent)
• What does this mean? Can make statistical
judgements and perform limited maths.
Ordinal data

Example:
How satisfied are you
with the level of service
you have received?
(please tick)

Very satisfied
Somewhat satisfied
Neutral
Somewhat dissatisfied
Very dissatisfied
Interval and ratio data

• Both interval and ratio data are examples of scale


data.
• Scale data:
• data is in numeric format ($50, $100, $150)
•data that can be measured on a continuous
scale
• the distance between each can be observed and
as a result measured
• the data can be placed in rank order.
Interval data

•Ordinal data but with constant differences between


observations
•Ratios are not meaningful
•Examples:
•Time – moves along a continuous measure of
seconds, minutes and so on and is without a zero
point of time.
•Temperature – moves along a continuous
measure of degrees and is without a true zero.
•SAT scores
Ratio data

• Ratio data is measured on a continuous scale and
does have a natural zero point.
Ratios are meaningful
Examples:
• Monthly sales
• Weight
• Height
• Age
Data for Business Analytics
A Sales Transaction Database File

[Figure: a table of sales transactions; each record (row) represents an
entity, and each column is a field or attribute]
Data for Business Analytics

Classifying Data Elements in a Purchasing
Database

[Figure: each column of the purchasing database is labeled with its
measurement level: Categorical, Interval, or Ratio]
Data Warehouse

 A data warehouse is a large store of data


accumulated from a wide range of sources
within a company and used to guide
management decisions.
 A data warehouse is a collection of data drawn
from other databases used by the business.
 It is a database that stores current and
historical data of potential interest to decision
makers throughout the company.
Data Warehouse
 The purpose of a data warehouse is to give the
organization a common information platform, which
ensures consistent, integrated, and valid data across
source systems and business areas. This is
essential if a company wants to obtain the most
complete picture possible of its customers. The
business analytics (BA) function receives input
from different primary source systems and
combines and uses these in a different context
than initially intended.
Data Warehouse
In a data warehouse, we have to join information from a large
number of independent systems, such as:
Billing systems (systems printing bills)
Reminder systems (systems sending out reminders, if customers do not pay
on time, and credit scores)
Debt collection systems (status on cases that were outsourced for external
collection)
Customer relationship management (CRM) systems (systems for storing
history about customer meetings and calls)
Product and purchasing information (which products and services a customer
has purchased over time)
Data Warehouse
 Customer information (names, addresses, opening of accounts,
cancellations, special contracts, segmentations, etc.)
 Corporate information (industry codes, number of employees,
accounts figures)
 Campaign history (who received which campaigns and when)
 Web logs (information about customer behavior on our portals)
 Social network information (e.g., Facebook and Twitter)
 Various questionnaire surveys carried out over time
 Human resources (HR) information (information about employees,
time sheets, their competencies, and history)
Data Warehouse
 Production information (production processes, inventory management,
procurement)
 Generation of key performance indicators (KPIs; used for monitoring current
processes, but can be used to optimize processes at a later stage)
 Data mining results (segmentations, added sales models, loyalty segmentations,
up-sale models, and loyalty segmentations, all of which have their history added
when they are placed in a data warehouse)
A billing system, for instance, was built to send out bills, and when
they have been sent, it’s up to the reminder system to monitor
whether reminders should be sent out.
DATABASE TRENDS

Components of a Data Warehouse


Data Warehouse

 A data warehouse consists of a technical part


and a business part. The technical part must
ensure that the organization’s data is collected
from its source systems and that it is stored,
combined, structured, and cleansed
regardless of the source system platform. The
business content of a data warehouse must
ensure that the desired key figures and
reports can be created.
Data Warehouse

There are many good arguments for


integrating data into an overall data
warehouse, including:
To avoid information islands and manual processes in
connection with the organization’s primary systems
To avoid overloading of source systems with daily
reporting and analysis
To integrate data from many different source systems
Data Warehouse
 To create a historical data foundation for data that may be
changed or removed in source systems (e.g., saving the
orders historically, even if the enterprise resource planning
[ERP] system “deletes” open orders on invoicing)
 To aggregate performance and data for business needs
 To add new business terms, rules, and logic to data (e.g.,
rules that do not exist in source systems)
 To establish central reporting and analysis environments
Data Warehouse

 To hold documentation of metadata centrally upon


collection of data
 To secure scalability to ensure future handling of
increased data volumes
 To ensure consistency and valid data definitions
across business areas and countries (this principle
is called one version of the truth)
 To improve and ease access to information
 To enable modeling and remodeling of the data
Architecture and Processes
in a Data Warehouse
Data flows from source systems to the BA portal
Architecture and Processes
in a Data Warehouse

 Source Systems: ERP, CRM, external data


sources, time sheet data etc.
 ETL Process: ETL is a data warehouse
process that always includes these actions:
 Extract data from a source table.
 Transform data for business use.
 Load to target table in the data warehouse or
different locations outside the data warehouse.
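The extract, transform, and load steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the deck's own tooling: the table names (src_orders, dw_orders) and the cents-to-dollars conversion rule are assumptions invented for the example.

```python
# Minimal ETL sketch: extract rows from a source table, transform them
# with a simple business rule, and load them into a warehouse table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table standing in for an operational system (hypothetical)
cur.execute("CREATE TABLE src_orders (order_id INTEGER, amount_cents INTEGER)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, 1050), (2, 2500), (3, 999)])

# Target table standing in for the data warehouse (hypothetical)
cur.execute("CREATE TABLE dw_orders (order_id INTEGER, amount_usd REAL)")

# Extract data from the source table
rows = cur.execute("SELECT order_id, amount_cents FROM src_orders").fetchall()

# Transform for business use: convert cents to dollars (a stand-in rule)
transformed = [(oid, cents / 100.0) for oid, cents in rows]

# Load into the target table
cur.executemany("INSERT INTO dw_orders VALUES (?, ?)", transformed)
conn.commit()

print(cur.execute("SELECT * FROM dw_orders").fetchall())
# → [(1, 10.5), (2, 25.0), (3, 9.99)]
```

In a real warehouse the three steps would run as scheduled jobs against separate systems; here they share one in-memory database purely to keep the sketch self-contained.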
Architecture and Processes
in a Data Warehouse
 Staging Area: The staging area is the entry point of the data
warehouse and the repository of Operational Data Stores (ODS). The
staging area is a temporary storage facility placed before the
data warehouse. ETL processes transfer business source data
from the operational systems (e.g., the accounting system) to the
staging area, usually either raw and unprocessed or transformed
by means of simple business rules. Source systems use different
types of database formats (e.g., relational databases such as
Oracle, DB2, SQL Server, MySQL, SAS, or flat text files).
Architecture and Processes
in a Data Warehouse

Simple ETL Job

ETL Job with SQL Join


Architecture and Processes
in a Data Warehouse
 Data Mart Area: The data mart is a subset of the data
warehouse and is usually oriented to a specific business line
or team. Whereas data warehouses have an enterprise-wide
depth, the information in data marts pertains to a single
department.
 A data mart represents the specific data from a data
warehouse which a user needs.
 It is a subset of the data warehouse in which a summarized or
highly focused portion of the organization’s data is placed
in a separate database for a specific function or group of
users.
Architecture and Processes
in a Data Warehouse
 A data mart is a specialized version of a data warehouse.
Like a data warehouse, a data mart is a snapshot of
operational data to help business users make decisions or
make strategic analyses(e.g., based on historical trends). The
difference between a data mart and a data warehouse is that
data marts are created based on the particular reporting
needs of specific, well-defined user groups, and data marts
provide easy business access to relevant information. A data
mart is thus designed to answer the users’ specific
questions.
Architecture and Processes
in a Data Warehouse
 BA Portal: BA tools and portals aim to deliver
information to operational decision makers.
The BA portal constitutes a small part of the
overall process to deliver BA decision support
for the business. A rule of thumb is that the
portal part constitutes only 15 percent of the
work; 85 percent of the work lies in the data
collection and processing in the data
warehouse.
Architecture and Processes
in a Data Warehouse
 Explain what information the front end displays and how
business people can use it to improve decision
making and processes.
Causes and Effects of Poor
Data Quality
 Many companies still suffer from low data quality, which
makes the business reluctant to trust the data provided by its
data warehouse section. In addition, the business typically
does not realize that the data warehouse section only
stores the data on behalf of the business, and that the data
quality issue is hence a problem the business must solve
itself. The trend is, however, positive, and we see
more and more cases where the ownership of each
individual column in a data warehouse is assigned to a
named responsible business unit, based on who
will suffer the most if the data quality is low.
Causes and Effects of Poor
Data Quality
 Data quality is central in all data integration initiatives, too.
Data from a data warehouse can’t be used in an efficient
way until it has been analyzed and cleansed. In terms of
data warehouses, it’s becoming more and more common to
install an actual storage facility or a firewall, which ensures
quality when data is loaded from the staging area to the
actual data warehouse. To ensure that poor data quality
from external sources does not destroy or reduce the
quality of internal processes and applications, organizations
should establish this data quality firewall in their data
warehouse.
Causes and Effects of Poor
Data Quality
 Analogous to a network firewall, whose objective is to keep
hackers, viruses, and other undesirables out of the
organization’s network, the data quality firewall must keep
data of poor quality out of internal processes and
applications. The firewall can analyze incoming data as well
as cleanse data by means of known patterns of problems, so
that data will be of a certain quality, before it arrives in the
data warehouse. Poor data quality is very costly and can
cause breakdowns in the organization’s value chains (e.g., no
items in stock) and lead to impaired decision-making at
management and operational levels.
Causes and Effects of Poor
Data Quality
 The first step toward improved data quality in the data
warehouse will typically be the deployment of tools for
data profiling. By means of advanced software, basic
statistical analyses are performed to search for
frequencies and column widths on the data in the tables.
Based on the statistics, we can see, for example,
frequencies on nonexistent or missing postal codes as
well as the number of rows without a customer name.
Incorrect values of sales figures in transaction tables can
be identified by means of analyses of the numeric widths
of the columns.
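The kind of data profiling described above (frequencies, missing names, malformed codes) can be sketched with the Python standard library. The customer rows and the five-digit postal-code rule are invented for illustration; real profiling tools apply far richer checks.

```python
# Data profiling sketch: frequency counts plus simple checks for
# missing customer names and malformed postal codes.
from collections import Counter

# Hypothetical customer rows: (name, postal_code)
rows = [
    ("Alice Smith", "10001"),
    ("", "10001"),            # missing name
    ("Bob Jones", "ABCDE"),   # non-numeric postal code
    ("Carol White", "94105"),
]

# Frequencies on postal codes, as the profiling step would report
postal_freq = Counter(code for _, code in rows)

# Count rows without a customer name
missing_names = sum(1 for name, _ in rows if not name.strip())

# Count rows whose postal code is not a 5-digit number (assumed rule)
bad_postcodes = sum(1 for _, code in rows if not (code.isdigit() and len(code) == 5))

print(postal_freq)
print(missing_names)   # → 1
print(bad_postcodes)   # → 1
```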
Causes and Effects of Poor
Data Quality
 For example, “Mr. Thomas D. Marchand” could be the
same customer as “Thomas D. Marchand.” Is it the same
customer twice? Software packages can disclose whether
data fits valid patterns and formats. Phone numbers, for
instance, must have the format 311-555-1212 and not
3115551212 or 31 15 121 2.
 Data profiling can also identify superfluous data and
whether business rules are observed (e.g., whether two
fields contain the same data and whether sales and
distributions are calculated correctly in the source system).
Causes and Effects of Poor
Data Quality
 Poor data quality may also be a result of the BA function
introducing new requirements. If a source system registers
only the date of a business transaction (e.g., 12
April 2010), the BA initiative cannot analyze the sales
distribution over the hours of the working day. That initiative
will not be possible unless the source system is
reprogrammed to register business transactions with a
timestamp such as “12APR2010:12:40:31.” Data will now
show that the transaction took place 40 minutes and 31
seconds past 12, on 12 April 2010. The data quality is now
secured, and the BA initiative can be carried out.
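The timestamp format quoted above can be parsed with Python's standard datetime module; this is just a sketch showing that the new field carries the hour-level detail the BA initiative needs, not the deck's own tooling.

```python
# Parse the SAS-style timestamp "12APR2010:12:40:31" from the example.
# strptime matches month abbreviations case-insensitively, so "APR" works.
from datetime import datetime

ts = datetime.strptime("12APR2010:12:40:31", "%d%b%Y:%H:%M:%S")

# The hour-of-day detail the date-only source system could not provide:
print(ts.hour, ts.minute, ts.second)   # → 12 40 31
```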
Causes and Effects of Poor
Data Quality
 Data profiling is thus an analysis of the problems
we are facing. In the next phase, the
improvement of data quality, the process starts
with the development of better data. In other
words, this means correcting errors, securing
accuracy, and validating and standardizing
data with a view to increasing its reliability. Based
on data profiling, tools introduce intelligent
algorithms to cleanse and improve data.
Causes and Effects of Poor
Data Quality
 Fuzzy merge technology is frequently used
here. Using this technology means that
duplicate rows can often be removed, so that
customers appear only once in the system.
Rows without customer names can be
removed. Data with incorrect postal codes
can be corrected, or removed. Phone numbers
are adjusted to the desired format, such as
XXX-XXX-XXXX.
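The cleansing steps above (phone-number formatting, title stripping, duplicate removal) can be sketched in plain Python. The helper names and the title list are assumptions for illustration; real fuzzy-merge tools use much more sophisticated matching than the exact-key comparison shown here.

```python
# Data cleansing sketch: normalize phone numbers to XXX-XXX-XXXX and
# drop duplicate customers after stripping titles like "Mr." from names.
import re

def normalize_phone(raw):
    digits = re.sub(r"\D", "", raw)        # keep digits only
    if len(digits) != 10:
        return None                         # flag as uncleanable
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"

def normalize_name(name):
    # Strip common titles so "Mr. Thomas D. Marchand" matches
    # "Thomas D. Marchand" (assumed title list)
    return re.sub(r"^(Mr|Mrs|Ms|Dr)\.\s+", "", name.strip())

customers = [
    ("Mr. Thomas D. Marchand", "3115551212"),
    ("Thomas D. Marchand", "311-555-1212"),
]

# Remove duplicate rows so each customer appears only once
seen, cleaned = set(), []
for name, phone in customers:
    key = (normalize_name(name), normalize_phone(phone))
    if key not in seen:
        seen.add(key)
        cleaned.append(key)

print(cleaned)   # → [('Thomas D. Marchand', '311-555-1212')]
```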
Causes and Effects of Poor
Data Quality
 Data cleansing is a process that identifies
and corrects (or removes) corrupt or
incorrect rows in a table. After the cleansing,
the data set will be consistent with other
data sets elsewhere in the system.
– Software to detect and correct data that are incorrect,
incomplete, improperly formatted, or redundant
– Enforces consistency among different sets of data from
separate information systems
Causes and Effects of Poor Data
Quality
• Ensuring data quality
– More than 25 percent of critical data in Fortune 1000
company databases are inaccurate or incomplete
– Redundant data
– Inconsistent data
– Faulty input
– Before new database in place, need to:
• Identify and correct faulty data
• Establish better routines for editing data once database in
operation
The Data Warehouse: Functions,
Components, and Examples
 A modern data warehouse typically works as a storage area
for the organization’s dimensions as well as a metadata
repository. The simplest definition of metadata is data about
data. For example: for a camera, data is a digital photo;
metadata will typically contain information about the date the
photo was taken, the settings of the camera, name of
manufacturer, size, and resolution. Metadata facilitates the
understanding of data with a view to using and managing
data.
The Data Warehouse: Functions,
Components, and Examples
 From the staging area, the data sources are collected,
joined, and transformed in the actual data warehouse.
One of the most important processes is that the
business’s transactions (facts) are then enriched with
dimensions such as organizational relationship and
placed in the product hierarchy before data is sent on to
the data mart area. This will then enable analysts and
business users to prepare interactive reports via “slice
and dice” techniques (i.e., breaking down figures into
their components).
The Data Warehouse: Functions,
Components, and Examples
 An example of a meaningless statement for an analyst is
“Our sales were $25.5 million.” The business will
typically want answers to questions about when, for what,
where, by whom, for whom, in which currency? And
dimensions are exactly what enable business users or
the analyst to answer the following questions:
 When did it happen? Which year, quarter, month, week, day, time?
 Where and to whom did it happen? Which salesperson, which department, which
business area, which country?
 What happened? What did we make on which product and on which product group?
 All these questions are relevant to the analyst.
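Breaking a bare sales total down by dimensions, as described above, can be sketched in plain Python. The fact rows and the year/country/salesperson dimensions are invented for illustration; a real data mart would do this with dimensional tables and OLAP or SQL GROUP BY queries.

```python
# "Slice and dice" sketch: break a total sales figure down by the
# dimensions (when, where, who) attached to each fact row.
from collections import defaultdict

# Hypothetical fact rows: (year, country, salesperson, amount)
facts = [
    (2023, "USA", "Ann", 10.0),
    (2023, "UK",  "Bo",  5.5),
    (2024, "USA", "Ann", 10.0),
]

# Aggregate the same facts along two different dimensions
by_year = defaultdict(float)
by_country = defaultdict(float)
for year, country, _, amount in facts:
    by_year[year] += amount
    by_country[country] += amount

print(dict(by_year))     # → {2023: 15.5, 2024: 10.0}
print(dict(by_country))  # → {'USA': 20.0, 'UK': 5.5}
```

The same total ($25.5 in this toy data) answers different business questions depending on which dimension it is broken down by, which is exactly why a bare total is meaningless to an analyst.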
The Data Warehouse: Fact-
Based Transaction
TIPS AND TECHNIQUES IN
DATA WAREHOUSING

 Master Data Management


 Service-Oriented Architecture
 How Should Data Be Accessed?
 Access to Business Analytics Portals
 Access to Data Mart Areas
 Access to Data Warehouse Areas
 Access to Source Systems
SQL (Structured Query Language)
SQL is a widely used, special-purpose, domain-specific
language used in programming and designed for
managing data held in a relational database management
system (RDBMS), or for stream processing in a relational
data stream management system (RDSMS). SQL is the
most influential commercially marketed query language.
SQL uses a combination of relational-algebra and relational-
calculus constructs.
SQL is not just a query language; it can define the structure of
the data, modify data in the database, and specify security
constraints.
Categories of SQL

The four main categories of SQL statements


are as follows:
1. DDL (Data Definition Language)
2. DML (Data Manipulation Language)
3. DCL (Data Control Language)
4. TCL (Transaction Control Language)
Categories of SQL

 Data Definition Language (DDL) specifies the
content and structure of the database and
defines each data element as it appears before
that data element is translated into the forms
required by applications.
 DDL statements are used to define the
database structure or schema.
 DDL is also used to alter/modify a database or
table structure and schema.
Creation of Table

 create table employee


(employee_name varchar (25) not null,
street varchar(20) not null,
city varchar(20) not null,
primary key (employee_name));
Categories of SQL

 Data Manipulation Language (DML): DML


statements affect records in a table. These are
basic operations we perform on data such as
selecting a few records from a table, inserting
new records, deleting unnecessary records,
and updating/ modifying existing records.
 DML is a language that enables users to
access or manipulate data as organized by the
appropriate data model.
Database Languages

Data manipulation includes:


i) The retrieval of information stored in the
database
ii) The insertion of new information into the
database
iii) The deletion of information from the database
iv) The modification of information stored in the
database
Data insertion

 insert into employee


values ('A', 'Spring', 'Pittsfield');

 insert into employee


values ('B', 'Senator', 'Brooklyn');
Categories of SQL

DCL (Data Control Language): A data


control language (DCL) is a syntax similar to
a computer programming language used to
control access to data stored in a database
(Authorization). In particular, it is a
component of Structured Query Language
(SQL). DCL statements control the level of
access that users have on database objects.
Categories of SQL
TCL (Transaction Control Language): A Transaction
Control Language (TCL) is a computer language and a
subset of SQL, used to control transactional
processing in a database. TCL statements allow you
to control and manage transactions to maintain the
integrity of data within SQL statements.
TCL statements are used to manage the changes
made by DML statements. It allows statements to be
grouped together into logical transactions.
SQL Statement
SQL (Structured Query Language)

 Query: A query is a statement


requesting the retrieval of
information. The portion of a DML
that involves information retrieval
is called a query language.
Basic Structure

 The basic structure of an SQL expression


consists of three clauses:
 1. select – corresponds to the projection
operation of the relational algebra
expression. It is used to list the attributes
desired in the result of the query.
 2. from – corresponds to the Cartesian
product operation of the relational algebra.
It lists the relations to be scanned in the
evaluation of the expression.
Basic Structure

 3. where – corresponds to the


selection predicate of the
relational algebra. It consists of a
predicate involving attributes of
the relations that appear in the
from clause.
Database Languages

 Example. Find the name of the customer


with customer-id 192-83-7465

Select customer_name
from customer
where customer_id = '192-83-7465'
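The deck's CREATE TABLE, INSERT, and SELECT statements can be tried end-to-end with Python's built-in sqlite3 module. This uses the employee table defined earlier rather than the customer table of the last example, and SQLite's dialect differs slightly from other RDBMSs, so treat it as a sketch.

```python
# Run the deck's DDL, DML, and query statements against an
# in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the table structure (from the "Creation of Table" slide)
cur.execute("""create table employee
    (employee_name varchar(25) not null,
     street varchar(20) not null,
     city varchar(20) not null,
     primary key (employee_name))""")

# DML: insert rows (from the "Data insertion" slide)
cur.execute("insert into employee values ('A', 'Spring', 'Pittsfield')")
cur.execute("insert into employee values ('B', 'Senator', 'Brooklyn')")

# Query: select (projection), from (relation), where (selection predicate)
rows = cur.execute(
    "select employee_name from employee where city = 'Brooklyn'"
).fetchall()
print(rows)   # → [('B',)]
```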
Thank
You
