Business Analysis
Business Intelligence
Analytics
2/25/2021
Descriptive analytics
- uses data to understand past and present
Predictive analytics
- analyzes past performance
Prescriptive analytics
- uses optimization techniques
DATA
- collected facts and figures
DATABASE
- collection of computer files containing data
INFORMATION
- comes from analyzing data
Records
Figure 1.1
Types of Data
Classifying Data Elements in a Purchasing Database
Figure 1.2
Nominal data
Examples:
- Male / Female
- Yes / No
Ordinal data
• Ordinal data comprises categories that can be rank ordered.
• As with nominal data, the distance between categories cannot be
calculated, but the categories can be ranked above or below each other.
No fixed units of measurement
Examples:
- college football rankings
- survey responses
(poor, average, good, very good, excellent)
• What does this mean? We can make statistical judgements and perform limited
mathematics.
Ordinal data
Example:
How satisfied are you with the level of service you have
received? (please tick)
Very satisfied
Somewhat satisfied
Neutral
Somewhat dissatisfied
Very dissatisfied
Interval data
Ratio data
• Ratio data is measured on a continuous scale and has a natural zero
point.
Ratios are meaningful
Examples:
• monthly sales
• delivery times
• weight
• height
• age
Types of Analytics
Decision Models
Model:
An abstraction or representation of a real system,
idea, or object
Captures the most important features
Can be a written or verbal description, a visual
display, a mathematical formula, or a
spreadsheet representation
Decision Models
Figure 1.3
Decision Models
Decision Models
Descriptive Analytics
Decision Models
Decision Models
Predictive Analytics
Decision Models
Figure 1.8
Decision Models
Figure 1.9
Decision Models
Prescriptive Analytics
• For example, the use of mathematical programming for revenue management is common for
organizations that have “perishable” goods (e.g., rental cars, hotel rooms, airline seats).
• Harrah’s has been using revenue management for hotel room pricing for some time.
Organizational Transformation
Complex Systems
• Tackle complex problems and provide individualized solutions
• Products and services are organized around the needs of
individual customers
• Dollar value of interactions with each customer is high
• There is considerable interaction with each customer
• Examples: IBM, World Bank, Halliburton
Volume Operations
• Serves high-volume markets through standardized
products and services
• Each customer interaction has a low dollar value
• Customer interactions are generally conducted
through technology rather than person-to-person
• Are likely to be analytics-based
• Examples: Amazon.com, eBay, Hertz
Thank you for your attention!
3/4/2021
Decision Analysis
Decision Analysis
• Effective decision-making requires that we understand:
• The values, goals, and objectives that are relevant to the decision
problem
• The areas of uncertainty that affect the decision
• The consequences of each possible decision
A Decision-Making Model
Consider the following five-part decision-making model:
1. Identify Problem
2. Formulate and Implement Model
3. Analyze Model
4. Test Results
5. Implement Solution
Figure annotations: "Most important"; "Unsatisfactory Results" (feedback loop)
Anchoring
• A seemingly trivial factor serves as a starting point
(anchor) for estimations
• Decision makers adjust their estimates but remain too
close to the anchor.
Framing
• Affects how a decision maker perceives the alternatives in
a decision – often involves a win/loss perspective
• The way the problem is framed influences the choices
Framing Example
• You have been given $1,000 but must choose between the following
alternatives:
a) Receive an additional $500 with certainty or
b) Flip a coin and receive an additional $1,000 if heads or $0 if tails
➔ a) is a sure win and the choice most people prefer
• Now suppose you are given $2,000 and must choose between
a) Give back $500 immediately or
b) Flip a coin and give back $0 if heads or $1,000 if tails
➔ When framed this way, alternative a) is a “sure loss” and many people who
previously preferred alternative a) now opt for alternative b) (because it holds a
chance of avoiding a loss)
• However, in both framings alternative a) guarantees a total payoff of $1,500,
whereas b) offers a 50% chance of a $2,000 total payoff and a 50% chance of a
$1,000 total payoff.
• A rational decision maker should focus on the consequences of his/her
choices and consistently select the same alternative, regardless of how the
problem is framed
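The equivalence of the two framings can be checked with a few lines of arithmetic. The sketch below is only an illustration; the helper name `total_payoffs` is made up for this example and is not part of the lecture material.

```python
from fractions import Fraction


def total_payoffs(endowment, outcomes):
    """Return sorted (final wealth, probability) pairs for one alternative."""
    return sorted((endowment + change, prob) for change, prob in outcomes)


half = Fraction(1, 2)

# Framing 1: start with $1,000 and choose between gains.
gain_frame = {
    "a": total_payoffs(1000, [(500, Fraction(1))]),
    "b": total_payoffs(1000, [(1000, half), (0, half)]),
}

# Framing 2: start with $2,000 and choose between losses.
loss_frame = {
    "a": total_payoffs(2000, [(-500, Fraction(1))]),
    "b": total_payoffs(2000, [(0, half), (-1000, half)]),
}

# Both framings lead to exactly the same distribution of final wealth.
frames_equivalent = gain_frame == loss_frame
print(gain_frame["a"], gain_frame["b"], frames_equivalent)
```

Because the final-wealth distributions are identical, a decision maker who looks only at consequences has no reason to switch alternatives between the two framings.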
A Framing Example
“Careful analysis at a major U.S. steel company showed it
could save hundreds of thousands of dollars per year by
replacing its hot-metal mixing technology, which required that
metal be heated twice, with direct-pouring technology, in
which the metal was only heated once. But the move was
approved only after considerable delay because senior
engineers complained that the analysis did not include the cost
of the hot metal mixers that had been purchased for $3 million
just a few years previously.”
Decision Made
  Outcome A
    A.1 Sales Up 10%
    A.2 Sales Up 15%
  Outcome B
    B.1 Sales Up 5%
    B.2 Sales Even
Decision Made
  Outcome A (40%)
    A.1 Sales Up 10% (80%): joint probability 32%
    A.2 Sales Up 15% (20%): joint probability 8%
  Outcome B (60%)
    B.1 Sales Up 5% (70%): joint probability 42%
An Example
Evaluating Your
Start by assigning a cash value or score to each possible outcome. Estimate how much you think it would be worth to you if that outcome came about.
Next look at each circle (representing an uncertainty point) and estimate the probability of each outcome. If you use percentages, the total must come to 100% at each circle. If you use fractions, these must add up to 1. If you have data on past events, you may be able to make rigorous estimates of the probabilities. Otherwise write down your best guess.
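The steps above (value each outcome, check that the probabilities at every circle total 1, then weight values by probabilities) can be sketched as a small roll-back routine. The cash values and probabilities below are hypothetical and are not taken from the slides.

```python
def expected_value(node):
    """Roll back a tree.

    A leaf is a number (cash value or score). A chance node (a "circle")
    is a list of (probability, subtree) pairs.
    """
    if isinstance(node, (int, float)):
        return float(node)
    total_prob = sum(p for p, _ in node)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError(f"probabilities sum to {total_prob}, expected 1")
    return sum(p * expected_value(child) for p, child in node)


# Hypothetical alternatives, each ending in one uncertainty point.
option_new_product = [(0.4, 1_000_000), (0.4, 50_000), (0.2, 2_000)]
option_consolidate = [(0.3, 400_000), (0.7, 20_000)]

ev_new = expected_value(option_new_product)
ev_consolidate = expected_value(option_consolidate)
best = "new product" if ev_new > ev_consolidate else "consolidate"
print(round(ev_new), round(ev_consolidate), best)
```

Nested chance nodes work the same way, since a subtree can itself be a list of (probability, subtree) pairs.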
Result
Actions
Minimum charge X
Schedule A X X
Schedule A on first 99 kwh, X
Schedule B on kwh 100 +
Activity
Consider the following description of a company’s matching retirement
contribution plan:
http://www.catalyst.com/products/logicgem/overview.html
Conditions
<cond-1>    F T F T F T F T … T
<cond-2>    F F T T F F T T … T
<cond-3>    F F F F T T T T … T
…
<cond-n>    F F F F F F F F … T
Actions
<action-1>  X X X X
<action-2>  X X X X
<action-3>  X X X X
…
<action-m>  X X X
Example
• Policy for charging charter flight customers for certain
in-flight services (see footnote 2):
If the flight is more than half-full and costs more than $350 per
seat, we serve free cocktails unless it is a domestic flight. We
charge for cocktails on all domestic flights; that is, for all the ones
where we serve cocktails. (Cocktails are only served on flights that
are more than half-full.)
_____________________________________
2 Example taken from: Structured Analysis and System Specification, Tom DeMarco,
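The policy above has three yes/no conditions, so a full decision table has 2^3 = 8 columns. The sketch below enumerates them; it encodes one reading of the policy (cocktails are served exactly when the flight is more than half-full, and are free only on non-domestic flights costing more than $350 per seat). The function and variable names are illustrative.

```python
from itertools import product


def cocktail_actions(half_full, over_350, domestic):
    """Return (serve_cocktails, free) for one combination of conditions."""
    serve = half_full  # cocktails only on flights more than half-full
    free = serve and over_350 and not domestic  # free unless domestic
    return serve, free


# Enumerate all eight condition combinations (False = N, True = Y).
table = {
    conditions: cocktail_actions(*conditions)
    for conditions in product([False, True], repeat=3)
}

for conditions, (serve, free) in table.items():
    flags = " ".join("Y" if c else "N" for c in conditions)
    print(flags, "| serve" if serve else "| -", "free" if free else "-")
```

Columns whose actions do not depend on a condition can then be merged by replacing that condition with a don't-care entry ("-"), which is how a full table is reduced.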
Conditions Values
The flight more than half-full? Yes (Y), No (N)
domestic flight N Y N Y N Y N Y
ACTIONS
domestic flight N Y N Y N Y N Y
ACTIONS
serve cocktails X X X X
free X
POSSIBLE COMBINATIONS
more than half-
full
N N N N Y Y Y Y Note that some
columns are identical
CONDITIONS
domestic flight N Y N Y N Y N Y
ACTIONS
serve cocktails X X X X
free X
POSSIBLE RULES
more than half- Note that some
N N N N Y Y Y Y
full columns are identical
except for one condition.
CONDITIONS
domestic flight - N Y N Y N Y
ACTIONS
serve cocktails X X X X
free X
domestic flight - - N Y N Y
ACTIONS
serve cocktails X X X X
free X
serve cocktails X X X X
free X
domestic flight - - N Y
serve cocktails X X X
overlooked something?
free X
Final Solution
Rules
domestic flight - - N Y
serve cocktails X X X
ACTIONS
free X
Another Example:
“A marketing company wishes to construct a decision
table to decide how to treat clients according to three
characteristics:
Gender, City Dweller, and age group: A (under 30), B
(between 30 and 60), C (over 60).
The company has four products (W, X, Y and Z) to test
market.
Product W will appeal to female city dwellers.
Product X will appeal to young females.
Product Y will appeal to middle-aged male shoppers who
do not live in cities.
Product Z will appeal to all but older females.”
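Rule     1  2  3  4  5  6  7  8  9 10 11 12
Gender   F  M  F  M  F  M  F  M  F  M  F  M
City     Y  Y  N  N  Y  Y  N  N  Y  Y  N  N
Age      A  A  A  A  B  B  B  B  C  C  C  C
MarketW  X  .  .  .  X  .  .  .  X  .  .  .
MarketX  X  .  X  .  .  .  .  .  .  .  .  .
MarketY  .  .  .  .  .  .  .  X  .  .  .  .
MarketZ  X  X  X  X  X  X  X  X  .  X  .  X
(X marks follow the four product rules stated in the example; "." means no action.)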
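The twelve-rule table can be regenerated from the four product statements. The sketch below assumes that "young" means age group A, "middle aged" means group B, and "older" means group C; the function name is illustrative.

```python
from itertools import product


def products_for(gender, city, age):
    """Return the products to test-market for one client profile."""
    offers = []
    if gender == "F" and city == "Y":
        offers.append("W")  # female city dwellers
    if gender == "F" and age == "A":
        offers.append("X")  # young females (assumed: group A)
    if gender == "M" and age == "B" and city == "N":
        offers.append("Y")  # middle-aged males outside cities (assumed: group B)
    if not (gender == "F" and age == "C"):
        offers.append("Z")  # everyone except older females (assumed: group C)
    return offers


# Same column order as the table: age varies slowest, gender fastest.
rules = [(g, c, a) for a, c, g in product("ABC", "YN", "FM")]
table = {rule: products_for(*rule) for rule in rules}

counts = {p: sum(p in offers for offers in table.values()) for p in "WXYZ"}
print(counts)
```

The per-product counts (3, 2, 1, and 10 marks) match the number of X entries in each action row of the table.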
City Y Y N N Y N N Y N N
Age A A A B B B C C C
MarketW X X X
MarketX X X
MarketY X
MarketZ X X X X X X X X
Thank you for your attention!
3/11/2021
Introduction to Decision
Support Systems
Decision Making
• Business Environment Factors
• Markets: strong competition, global markets, market on
Internet
• Consumer demands: customization, quality, diversity,
delivery
• Technology: more innovations, faster obsolescence,
more information overload
• Societal: more regulation and deregulation, more
diversified workforce, more social responsibility.
Decision Making
• Business Pressure-Response-Support Model
Decision Making
• Process of Decision Making
• Define the problem (i.e., a decision situation that may
deal with some difficulty or with an opportunity).
• Construct a model that describes the real-world problem.
• Identify possible solutions to the modeled problem and
evaluate the solutions.
• Compare, choose, and recommend a potential solution to
the problem.
Data Models
Knowledge
User Interface
Figure: DSS components. Data (external and internal); Data management; Model management; External models; Knowledge-based subsystems; User interface; Organizational KB; Manager (user)
Figure: user interface subsystem. Natural language processor; Input (action language); Output (display language); PC display; Printers, plotters; Users
Modeling
Highlights
• Static and Dynamic Models
• Certainty, Uncertainty, and Risk
• Modeling with Spreadsheets
• Decision Tables and Decision Trees
• The Structure of Mathematical Models
• Mathematical Programming Optimization
• Multiple Goals, Sensitivity, What-if and Goal
Seeking
• Problem Solving Search Methods
Bonds 12 6 3 8.4
Stocks 15 3 -2 8.0
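The header of this table did not survive, so the following reading is an assumption: the first three numbers in each row are payoffs under three states of nature, and the last number is the expected value. With assumed state probabilities of 0.5, 0.3, and 0.2 the last column is reproduced exactly, which the sketch below checks.

```python
# Assumed state probabilities (not stated in the surviving rows).
probabilities = [0.5, 0.3, 0.2]

# Payoffs per alternative, copied from the two surviving rows.
payoffs = {
    "Bonds": [12, 6, 3],
    "Stocks": [15, 3, -2],
}

expected = {
    name: sum(p * v for p, v in zip(probabilities, row))
    for name, row in payoffs.items()
}
best = max(expected, key=expected.get)

for name, value in expected.items():
    print(f"{name}: {value:.1f}")
print("Highest expected value:", best)
```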
Uncontrollable
variables
Intermediate
variables
Mathematical Programming
Optimization
• Linear programming
• Product Mix
• Transportation Problem
• Non-Linear programming
• Travelling salesman
• Vehicle routing problem
11/03/64
Multiple Goals
• Managers want to attain simultaneous goals, some of which
may conflict.
• In addition to earning money, the company wants to grow,
develop its products and employees, provide job security to
its workers.
• Managers want to satisfy the shareholders and at the same
time enjoy high salaries and expense accounts, and
employees want to increase their take-home pay and
benefits.
• To solve this kind of problem, common methods are:
• Utility theory
• Goal programming
• Expression of goals as constraints, using LP
• A point system
Sensitivity Analysis
• Sensitivity analysis attempts to assess the impact of a change in
the input data or parameters on the proposed solution.
• Sensitivity allows flexibility and adaptation to changing
conditions and to the requirements of different decision-making
situations.
• Sensitivity analysis tests relationships such as the following:
• The impact of changes in external (uncontrollable) variables and
parameters on the outcome variable(s)
• The impact of changes in decision variables on the outcome variable(s).
• The effect of uncertainty in estimating external variables
• The effects of different dependent interactions among variables
• The robustness of decisions under changing conditions
What-if Analysis
• What-if analysis is structured as "What will happen
to the solution if an input variable, an assumption,
or a parameter value is changed?"
• For example, what will happen to the total inventory
cost if the cost of the carrying inventories increases
by 10 per cent?
• A spreadsheet tool is a good example. A manager
can analyze a cash flow problem by changing
parameters’ values and see the differences without
any involvement of computer programmers.
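The inventory question above can be sketched as a tiny model that is evaluated twice, once with the base inputs and once with the carrying cost raised by 10 per cent. The cost formula (ordering cost plus average-inventory carrying cost) and all numbers are illustrative assumptions, not figures from the lecture.

```python
def total_inventory_cost(annual_demand, order_qty, order_cost, carrying_cost):
    """Annual ordering cost plus annual carrying cost (a simple assumed model)."""
    ordering = annual_demand / order_qty * order_cost
    carrying = order_qty / 2 * carrying_cost  # average inventory is half an order
    return ordering + carrying


# Hypothetical base case.
base = dict(annual_demand=12_000, order_qty=600, order_cost=50.0, carrying_cost=4.0)
baseline = total_inventory_cost(**base)

# What if the carrying cost per unit increases by 10 per cent?
scenario = {**base, "carrying_cost": base["carrying_cost"] * 1.10}
what_if = total_inventory_cost(**scenario)

change_pct = (what_if - baseline) / baseline * 100
print(f"baseline={baseline:.0f}, what-if={what_if:.0f}, change={change_pct:.2f}%")
```

In this made-up case a 10 per cent rise in the carrying cost raises the total cost by less than 10 per cent, because the ordering cost is unaffected.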
Goal Seeking
• Goal seeking calculates the values of the inputs
necessary to achieve a desired level of an output
(goal). It represents a backward solution approach.
• For example, What annual R&D budget is needed
for an annual growth rate of 15 per cent by 2012?
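Goal seeking works backward from a target output to the input that produces it. A minimal sketch using bisection follows; the growth model linking R&D budget to growth rate is invented purely for illustration, and bisection is one possible search method, assumed here because the model is increasing in the budget.

```python
import math


def goal_seek(f, target, lo, hi, tol=1e-9, max_iter=200):
    """Find x in [lo, hi] with f(x) close to target, for an increasing f."""
    if not (f(lo) <= target <= f(hi)):
        raise ValueError("target is not bracketed by f(lo) and f(hi)")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


def growth_rate(budget_millions):
    """Hypothetical model: growth rises with the square root of the R&D budget."""
    return 0.05 + 0.02 * math.sqrt(budget_millions)


required_budget = goal_seek(growth_rate, target=0.15, lo=0.0, hi=100.0)
print(f"Budget needed for 15% growth: {required_budget:.2f} million")
```

Spreadsheet goal-seek tools perform the same kind of search, although the exact algorithm they use may differ.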
Heuristics
• Only promising solutions are considered.
• Stop when the solution is good enough.
Thank you for your attention!
Introduction to
Optimization
Learning Objectives
Real-World Examples
• Dynamic and Customized Pricing
• Product Mix
• Scheduling/Allocation
• Routing/Logistics
• Supply Chain Optimization
• Facility Location
• Financial Planning/Asset Management
• Etc.
Solver
• Text-Based Formulation
• Decision Variables: Number of camshafts to make, number of gears to make
• Objective Function: Maximize profit
• Constraints: Don’t exceed amounts available of steel, labor, and machine time.
Algebraic Formulation
• Decision Variables
• C = number of camshafts to make
• G = number of gears to make
• Objective Function
• Maximize 25C + 18G (profit in $)
• Constraints
• 5C + 8G <= 5000 (steel in lbs)
• 1C + 4G <= 1500 (labor in hours)
• 3C + 2G <= 1000 (machine time in hours)
• C >= 0, G >= 0 (non-negativity)
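For a two-variable LP like this one, an optimal solution (when one exists) lies at a corner point of the feasible region, so the formulation can be checked without Solver by intersecting the constraint lines pairwise. This is a sketch of that idea for this small problem only, not a general-purpose LP solver; all helper names are illustrative.

```python
from itertools import combinations

# Each constraint is (a, b, rhs), meaning a*C + b*G <= rhs.
constraints = [
    (5, 8, 5000),   # steel (lbs)
    (1, 4, 1500),   # labor (hours)
    (3, 2, 1000),   # machine time (hours)
    (-1, 0, 0),     # C >= 0
    (0, -1, 0),     # G >= 0
]


def profit(c, g):
    return 25 * c + 18 * g


def is_feasible(c, g, tol=1e-9):
    return all(a * c + b * g <= rhs + tol for a, b, rhs in constraints)


def corner_points():
    """Intersect every pair of constraint lines and keep the feasible points."""
    points = []
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel lines never intersect
        c = (r1 * b2 - r2 * b1) / det  # Cramer's rule
        g = (a1 * r2 - a2 * r1) / det
        if is_feasible(c, g):
            points.append((c, g))
    return points


best_c, best_g = max(corner_points(), key=lambda p: profit(*p))
print(best_c, best_g, profit(best_c, best_g))
```

The best corner is where the labor and machine-time constraints intersect; steel is left with slack.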
Important Concepts
• Linear Program: The objective function and constraints are linear
functions of the decision variables. Therefore, this is a Linear
Program.
• Feasibility
• Feasible Solution. A solution is feasible for an LP if all constraints are
satisfied.
• Infeasible Solution. A solution is infeasible if one or more constraints are
violated.
• Check the solutions C=75, G=200; and C=300, G=200 for feasibility.
• Optimal Solution. The optimal solution is the feasible solution
with the largest (for a max problem) objective value (smallest for
a min problem).
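The feasibility check suggested above amounts to plugging each candidate into every constraint of the camshaft and gear formulation. A short sketch follows; the function names are illustrative.

```python
LIMITS = {"steel (lbs)": 5000, "labor (hrs)": 1500, "machine time (hrs)": 1000}


def resource_usage(c, g):
    """Resources consumed by making c camshafts and g gears."""
    return {
        "steel (lbs)": 5 * c + 8 * g,
        "labor (hrs)": 1 * c + 4 * g,
        "machine time (hrs)": 3 * c + 2 * g,
    }


def violated_constraints(c, g):
    """Return the names of all violated constraints (empty list = feasible)."""
    violations = [
        name for name, used in resource_usage(c, g).items() if used > LIMITS[name]
    ]
    if c < 0:
        violations.append("C >= 0")
    if g < 0:
        violations.append("G >= 0")
    return violations


first = violated_constraints(75, 200)
second = violated_constraints(300, 200)
print("C=75, G=200:", first or "feasible")
print("C=300, G=200:", second or "feasible")
```

The first candidate satisfies every constraint; the second needs 1,300 machine hours against 1,000 available, so it is infeasible.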
Spreadsheet Model
Solver Basics
• Don’t even think about using Solver until you have a
working, flexible spreadsheet model that you can use as a
“what if” tool!
• Solver Settings
• Specify Objective Cell (objective function)
• Specify Changing Cells (decision variables)
• Specify Constraints
• Specify Solver Settings
• Solve Problem to find Optimal Solution
Solver Options
• Optimal Solution: Make 100 camshafts, 350 gears.
• Optimal Objective Value: $8800 profit.
• Both pieces of information are important. Knowing the optimal objective value is useless without knowing how that value can be attained.

Row  A                    B          C       D        E    F
1    Example B.1
2    DJJ Enterprises Production Planning
4    Decision Variables   Camshafts  Gears
5    Units to Make        100        350
7    Objective                               Total
8    Profit               $25        $18     $8,800
10   Constraints                             Used          Available
11   Steel (lbs)          5          8       3300     <=   5000
12   Labor (hrs)          1          4       1500     <=   1500
13   Machine Time (hrs)   3          2       1000     <=   1000
Solution Reports
• Solver can generate three solution reports
• Answer Report
• Sensitivity Report
• Limits Report: Not covered here
• The Answer Report presents in a standard format the Solver
Settings and the optimal solution.
• The Sensitivity Report shows what will happen if certain
problem parameters are changed from their current values.
• Also note that the Solver Settings for Objective Cell, Changing Cells, and Constraints are reported here. This can be a useful debugging tool.

Cell   Name                     Cell Value  Formula        Status       Slack
$D$11  Steel (lbs) Used         3300        $D$11<=$F$11   Not Binding  1700
$D$12  Labor (hrs) Used         1500        $D$12<=$F$12   Binding      0
$D$13  Machine Time (hrs) Used  1000        $D$13<=$F$13   Binding      0
Sensitivity Report
Variable Cells
Cell   Name                     Final Value  Reduced Cost  Objective Coefficient  Allowable Increase  Allowable Decrease
$B$5   Units to Make Camshafts  100          0             25                     2                   20.5
Constraints
Cell   Name                     Final Value  Shadow Price  Constraint R.H. Side   Allowable Increase  Allowable Decrease
$D$13  Machine Time (hrs) Used  1000         8.2           1000                   1416.666667         250
Highlights
• People use informal “optimization” to make decisions almost every
day.
• Organizations use formal optimization methods to address problems
across the organization, from optimal pricing to locating a new
facility.
• The algebraic formulation of an LP comprises the definitions of the
decision variables, an algebraic statement of the objective function,
and algebraic statements of the constraints.
• The spreadsheet model for an optimization problem should be
guided by the algebraic formulation.
• Solver, an Excel Add-In, is able to solve both linear and nonlinear
problems. This lecture focuses on solving linear problems.
Highlights
• After solving an LP, you must interpret the results to see if
they make sense, fix problems with the model, and find
the insights useful for management.
• Solver can generate the Answer and Sensitivity Reports.
The Sensitivity Report provides additional information
about what happens to the solution when certain
coefficients of the problem are changed.
Thank you for your attention!
Chapter 4:
Data Warehousing
Learning Objectives
• DW definition
• Characteristics of DW
• Data Marts
• ODS, EDW, Metadata
• DW Framework
• DW Architecture & ETL Process
• DW Development
• DW Issues
What is a Data Warehouse?
Characteristics of DW
• Subject oriented
• Integrated
• Time-variant (time series)
• Nonvolatile
• Summarized
• Not normalized
• Metadata
• Web based, relational/multi-dimensional
• Client/server
• Real-time and/or right-time (active)
Data Mart
DW Framework
Figure: DW framework. Data sources: Legacy, POS, Other OLTP/Web, External data. ETL: Extract, Transform, Integrate, Load. Enterprise data warehouse, with Metadata and Replication. Data marts: Engineering, Finance, (...). API / Middleware. Applications: Data/text mining; OLAP, Dashboard, Web; Custom built applications.
DW Architecture
• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data &
software
3. Client (front-end) software that allows users to
access and analyze data from the warehouse
• Two-tier architecture
The first two tiers of the three-tier architecture are combined into
one
Sometimes there is only one tier
DW Architectures
Tier 1: Tier 2:
Client workstation Application & database server
A Web-based DW Architecture
Figure: Web-based DW architecture. Client (Web browser); Internet/Intranet/Extranet; Web server (Web pages); Application server; Data warehouse
Data Warehousing Architectures
Alternative DW Architectures
(a) Independent Data Marts Architecture
Source Systems → Staging Area (ETL) → Independent data marts (atomic/summarized data) → End user access and applications

Source Systems → Staging Area (ETL) → Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) → End user access and applications

Source Systems → Staging Area (ETL) → Normalized relational warehouse (atomic data) → End user access and applications
Alternative DW Architectures
Source Systems → Staging Area (ETL) → Normalized relational warehouse (atomic/some summarized data) → End user access and applications
Alternative DW Architectures
Teradata Corp. DW Architecture
Data Warehousing Architectures
Ten factors that potentially affect the architecture selection decision:
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
• Data integration
Integration that comprises three major processes:
data access, data federation, and change capture
• Enterprise application integration (EAI)
A technology that provides a vehicle for pushing data
from source systems into a data warehouse
• Enterprise information integration (EII)
An evolving tool space that promises real-time data
integration from a variety of sources, such as
relational databases, Web services, and
multidimensional databases
Data Integration and the Extraction,
Transformation, and Load (ETL) Process
Figure: ETL process. Sources: Packaged application, Legacy system, Other internal applications, Transient data source. Steps: Extract → Transform → Cleanse → Load. Targets: Data warehouse, Data mart.
ETL
Data Warehouse Development
Representation of Data in DW
Multidimensionality
• Multidimensionality
The ability to organize, present, and analyze data by
several dimensions, such as sales by region, by
product, by salesperson, and by time (four
dimensions)
• Multidimensional presentation
– Dimensions: products, salespeople, market segments, business
units, geographical locations, distribution channels, country, or
industry
– Measures: money, sales volume, head count, inventory profit,
actual versus forecast
– Time: daily, weekly, monthly, quarterly, or yearly
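Organizing one measure by several dimensions can be sketched with plain records: each record carries dimension values plus a measure, and the same data can be aggregated by any subset of dimensions or restricted to a slice. The records and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions and one measure.
records = [
    {"region": "East", "product": "A", "quarter": "Q1", "sales": 100},
    {"region": "East", "product": "B", "quarter": "Q1", "sales": 150},
    {"region": "West", "product": "A", "quarter": "Q1", "sales": 200},
    {"region": "West", "product": "A", "quarter": "Q2", "sales": 50},
]


def roll_up(rows, *dimensions, measure="sales"):
    """Total the measure for every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


def slice_cube(rows, **fixed):
    """Keep only rows where each named dimension has the given value."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]


by_region = roll_up(records, "region")
q1_by_product = roll_up(slice_cube(records, quarter="Q1"), "product")
print(by_region)
print(q1_by_product)
```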
Star vs Snowflake Schema
Analysis of Data in DW
Analysis of Data Stored in DW
OLTP vs. OLAP
OLAP Operations
OLAP
Figure: Slicing Operations on a Simple Three-Dimensional Data Cube (dimensions: Time, Product, Geography)
- A 3-dimensional OLAP cube with slicing operations
- Slice: sales volumes of a specific Product on variable Time and Region
- Slice: sales volumes of a specific Time on variable Region and Products
Variations of OLAP
DW Implementation Issues
DW Implementation Guidelines
Successful DW Implementation
Things to Avoid
Successful DW Implementation
Things to Avoid - Cont.
Massive DW and Scalability
• Scalability
– The main issues pertaining to scalability:
• The amount of data in the warehouse
• How quickly the warehouse is expected to grow
• The number of concurrent users
• The complexity of user queries
– Good scalability means that queries and
other data-access functions will grow linearly
with the size of the warehouse
Real-time/Active DW/BI
Real-time/Active DW at Teradata
HADOOP DATA WAREHOUSE
ARCHITECTURE
DW Administration and Security
The Future of DW
• Sourcing…
– Open source software
– SaaS (software as a service)
– Cloud computing
– DW appliances
• Infrastructure…
– Real-time DW
– Data management practices/technologies
– In-memory processing (“super-computing”)
– New DBMS
– Advanced analytics
End of the Chapter
• Questions, comments
Data Mining for Business Intelligence
Learning Objectives
• Define data mining as an enabling technology for
business intelligence
• Understand the objectives and benefits of business
analytics and data mining
• Recognize the wide range of applications of data
mining
• Learn the standardized data mining processes
• CRISP-DM
• SEMMA
• KDD
Learning Objectives
• Understand the steps involved in data
preprocessing for data mining
• Learn different methods and algorithms of data
mining
• Build awareness of the existing data mining
software tools
• Commercial versus free/open source
• Understand the pitfalls and myths of data mining
Opening Vignette…
Opening Vignette:
Data Mining Goes to Hollywood!
Dependent Variable
Class No.        1       2     3     4     5     6      7      8      9
Range            < 1     > 1   > 10  > 20  > 40  > 65   > 100  > 150  > 200
(in $Millions)   (Flop)  < 10  < 20  < 40  < 65  < 100  < 150  < 200  (Blockbuster)

Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary

A Typical Classification
Opening Vignette:
Data Mining Goes to Hollywood!
Model
Development
process
The DM
Process
Map in Model
Modeler
Opening Vignette:
Data Mining Goes to Hollywood!
Prediction Models
Definition of Data Mining
• The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases
- Fayyad et al., (1996)
• Keywords in this definition: Process, nontrivial, valid,
novel, potentially useful, understandable
• Data mining: a misnomer?
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging
Figure: DATA MINING at the intersection of Statistics, Artificial Intelligence, Pattern Recognition, Machine Learning, Mathematical Modeling, and Databases
Data Mining Characteristics/Objectives
Data
Categorical, Numerical
- DM with different data types?
- Other data types?
What Does DM Do?
How Does it Work?
• DM extracts patterns from data
• Pattern?
A mathematical (numeric and/or symbolic) relationship among data items
• Types of patterns
• Association
• Prediction
• Cluster (segmentation)
• Sequential (or time series) relationships
Other Data Mining Tasks
• These are in addition to the primary DM tasks (prediction, association,
clustering)
• Time-series forecasting
• Part of sequence or link analysis?
• Visualization
• Another data mining task?
• Types of DM
• Hypothesis-driven data mining
• Discovery-driven data mining
Data Mining Applications (cont.)
• Insurance
• Forecast claim costs for better business planning
• Determine optimal rate plans
• Optimize marketing to specific customers
• Identify and prevent fraudulent claim activities
Data Mining Applications (cont.)
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel industry
• Healthcare
• Medicine
• Entertainment industry
• Sports
• Etc.
(Slide callout: Highly popular application areas for data mining)
Data Mining Process
Figure: six-step cycle around the Data Sources: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment
Data Mining Process: CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
(Slide callout beside the early steps: Accounts for ~85% of total project time)
Data Consolidation: collect data; select data; integrate data
Data Transformation: normalize data; discretize/aggregate data; construct new attributes
Result: Well-formed Data
Data Mining Process: SEMMA
Sample: Generate a representative sample of the data
Explore: Visualization and basic description of the data
Modify: Select variables, transform variable representations
Model: Use variety of statistical and machine learning models
Assess: Evaluate the accuracy and usefulness of the models
Assessment Methods for Classification
• Predictive accuracy
• Hit rate
• Speed
• Model building; predicting
• Robustness
• Scalability
• Interpretability
• Transparency; ease of understanding
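Confusion matrix (rows: predicted class; columns: true class)

                     True Positive                True Negative
Predicted Positive   True Positive Count (TP)     False Positive Count (FP)
Predicted Negative   False Negative Count (FN)    True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$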
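The measures above follow directly from the four confusion-matrix counts. A minimal sketch, with made-up counts for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # same quantity as the true positive rate
    }


# Hypothetical counts for a binary classifier evaluated on 100 records.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes every denominator is non-zero; a classifier that never predicts the positive class, for example, would need separate handling for precision.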
Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
• Split the data into two mutually exclusive sets: training
(~70%) and testing (~30%)
Figure: Preprocessed Data is split into Training Data (2/3), used for Model Development to produce a Classifier, and Testing Data (1/3), used for Model Assessment (scoring) to estimate Prediction Accuracy.
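A simple split can be sketched in a few lines: shuffle the records, then cut them into two mutually exclusive parts. The 70/30 ratio follows the bullet above; the fixed seed and the function name are illustrative choices.

```python
import random


def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


data = list(range(100))  # stand-in for 100 preprocessed records
training, testing = simple_split(data)
print(len(training), len(testing))
```

The model is then developed on `training` only, and its prediction accuracy is assessed on `testing`, which it has never seen.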
Estimation Methodologies for Classification –
ROC Curve
Figure: ROC curves for classifiers A, B, and C. Vertical axis: True Positive Rate (Sensitivity), from 0 to 1; horizontal axis ticks from 0 to 1.
Classification Techniques
Decision Trees
• Employs the divide and conquer method
• Recursively divides a training set until each
division consists of examples from one class
A general algorithm for decision tree building:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every
leaf node until the stopping criteria are reached.
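The four steps can be sketched as a short recursive routine for categorical attributes. The slides do not say how the "best" splitting attribute is chosen, so this sketch assumes the Gini index; the stopping criteria (a pure node, or no attributes left) and the toy data are also assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute whose split gives the lowest weighted Gini."""
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or an (attribute, branches) tuple (node)."""
    if len(set(labels)) == 1 or not attributes:  # stopping criteria
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {row[attr] for row in rows}:  # step 3: one branch per value
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (attr, branches)


def predict(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical training data: the class depends only on "windy".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "stay", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree, predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

This sketch does no pruning and raises a `KeyError` for attribute values that never appeared in training.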
Decision Trees
Decision Trees
Cluster Analysis for Data Mining
• Analysis methods
• Statistical methods (including both hierarchical and nonhierarchical), such as
k-means, k-modes, and so on.
• Neural networks (adaptive resonance theory [ART], self-organizing map
[SOM])
• Fuzzy logic (e.g., fuzzy c-means algorithm)
• Genetic algorithms
Cluster Analysis for Data Mining
Cluster Analysis for Data Mining -
k-Means Clustering Algorithm
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service plan 70
percent of the time”
• How do you use such a pattern/knowledge?
• Put the items next to each other for ease of finding
• Promote the items as a package (do not put one on sale if the other(s)
are on sale)
• Place items far apart from each other so that the customer has to
walk the aisles to search for them, and by doing so potentially see and buy
other items
Association Rule Mining
• Are all association rules interesting and useful?
A Generic Rule: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} ⇒
{Extended Service Plan} [30%, 70%]
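Support and confidence can be computed directly from transaction data. In the sketch below, support is the fraction of transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The five transactions are made up, so the resulting percentages differ from the 30% and 70% in the example.

```python
# Hypothetical point-of-sale transactions.
transactions = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop"},
    {"printer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


lhs = {"laptop", "antivirus"}
rhs = {"service plan"}
rule_support = support(lhs | rhs, transactions)
rule_confidence = confidence(lhs, rhs, transactions)
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
```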
Association Rule Mining
• Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the
itemsets
• Uses a bottom-up approach
• frequent subsets are extended one item at a time (the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item subsets, and so on)
• groups of candidates at each level are tested against the data for minimum support
(see the figure) →
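The bottom-up idea can be sketched compactly: find the frequent one-item sets, join them into larger candidates, discard any candidate with an infrequent subset, and test the survivors against the data. This is a simplified illustration of the Apriori approach on made-up transactions, not an optimized implementation.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent itemset: support count}, built level by level."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    level = {c for c in level if count(c) >= min_support_count}

    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = count(itemset)
        k += 1
        # Join step: combine frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Test the remaining candidates against the data.
        level = {c for c in candidates if count(c) >= min_support_count}
    return frequent


# Hypothetical transactions over items A, B, and C.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
frequent = apriori(transactions, min_support_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, every single item and every pair is frequent, but the three-item set appears in only two transactions and is dropped.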
Biological versus Artificial Neural Networks
Figure: a biological NN (dendrites, synapse, axon, neuron) beside an artificial NN in which inputs x1 ... xn, with weights w1 ... wn, feed a Processing Element (PE) that applies a summation and a transfer function to produce outputs Y1 ... Yn.

$$S = \sum_{i=1}^{n} X_i W_i, \qquad Y = f(S)$$

Biological     Artificial
Neuron         Node (or PE)
Dendrites      Input
Axon           Output
Synapse        Weight
Slow           Fast
Many (10^9)    Few (10^2)
Elements/Concepts of ANN
• Commercial
• IBM SPSS Modeler (formerly Clementine)
• … many more
Figure: bar chart of data mining tool usage, "Total (w/ others)" versus "Alone", scale 0 to 120. Tools labeled include SPSS PASW Modeler (formerly Clementine), KXEN, MATLAB, Other commercial tools, KNIME, Statsoft Statistica, Orange, Megaputer, Viscovery, Clario Analytics, Miner3D, Thinkanalytics.
Source: KDNuggets.com, May 2009
• Data mining …
• provides instant solutions/predictions.
• is not yet viable for business applications.
• requires a separate, dedicated database.
• can only be done by those with advanced degrees.
• is only for large firms that have lots of customer data.
• is another name for good-old statistics.
Common Data Mining Blunders
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
End of the Chapter
• Questions, comments
4/9/2021
Data Preparation
(Data pre-processing)
[Figure: knowledge discovery process, from databases (DB) through cleaning and integration into a data warehouse (DW), then selection and transformation, then data mining]
TYPES OF DATA
Types of Measurements
• Nominal scale
• Ordinal scale
• Ratio scale
(listed in order of increasing information content)
Discrete or Continuous
Data Conversion
• Some tools can deal with nominal values, but others need
fields to be numeric
Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Can be detected by standardizing observations and labeling the standardized
values outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning
• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
• Compute the mean $\bar{x}$ and standard deviation $s$. For k = 2 or 3, x is an
outlier if it lies outside the limits (normal distribution assumed)
$(\bar{x} - ks,\ \bar{x} + ks)$
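A minimal sketch of the mean and standard deviation rule follows. The data values and the function name `zscore_outliers` are hypothetical, and the sample standard deviation is an assumption since the slide does not say which variant is meant.

```python
from statistics import mean, stdev


def zscore_outliers(values, k=3):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumed)
    return [x for x in values if abs(x - m) > k * s]


# Hypothetical measurements with one suspicious value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
outliers = zscore_outliers(data, k=2)
print(outliers)
```

With k = 2 the value 50 is flagged. With k = 3 it is not, because the outlier itself inflates the mean and the standard deviation, which is one reason the boxplot rule is often preferred for small samples.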
Outlier detection
• Univariate
• Boxplot: An observation is an extreme outlier if it lies outside the interval
(Q1 − 3·IQR, Q3 + 3·IQR), where IQR = Q3 − Q1
(IQR = Inter Quartile Range)
http://www.physics.csbsju.edu/stats/box2.html
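The boxplot rule can be sketched as follows. The 3·IQR fence for extreme outliers comes from the slide; the 1.5·IQR fence for mild outliers is the usual boxplot convention and is an assumption here, as are the data values and the function name. Quartile conventions differ between tools, so Q1 and Q3 may vary slightly from other software.

```python
from statistics import quantiles


def boxplot_outliers(values):
    """Split values into (mild, extreme) outliers using IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    mild, extreme = [], []
    for x in values:
        if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
            extreme.append(x)  # beyond the 3*IQR fence
        elif x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
            mild.append(x)  # between the 1.5*IQR and 3*IQR fences
    return mild, extreme


# Hypothetical data
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 100]
mild, extreme = boxplot_outliers(data)
print("mild:", mild, "extreme:", extreme)
```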
[Figure: boxplot marking points more than 1.5 L and more than 3 L beyond the box]
Outlier detection
• Multivariate
• Clustering
• Very small clusters are outliers
http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/
Outlier detection
• Multivariate
• Distance based
• An instance with very few neighbors within D is regarded
as an outlier
k-NN algorithm
Recommended reading
DATA TRANSFORMATION
Normalization
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Normalization
• min-max normalization
$v' = \frac{v - min_v}{max_v - min_v}\,(new\_max_v - new\_min_v) + new\_min_v$
• z-score normalization
28 minimum
66 maximum
39.50 average
53
10.01 standard deviation
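Both normalizations can be applied to the statistics listed above. The sketch assumes that 53 is the value to be normalized (the slide does not say so explicitly) and that the z-score uses the standard form (v − mean) / standard deviation.

```python
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Min-max normalization of v into [new_min, new_max]."""
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std


# Statistics from the slide; 53 is assumed to be the value being normalized
mm = min_max(53, 28, 66)
z = z_score(53, 39.50, 10.01)
print(f"min-max: {mm:.3f}, z-score: {z:.3f}")
```

Under those assumptions the value 53 maps to about 0.658 on a [0, 1] scale and lies about 1.35 standard deviations above the average.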
Data Transformation
• It is the process to create new attributes
• Often called transforming the attributes or the attribute
set.
Data Transformation
Linear Transformations
• Normalizations may not be enough to adapt the data
to improve the generated model.
• Aggregating the information contained in various
attributes might be beneficial
• If B is an attribute subset of the complete set A, a
new attribute Z can be obtained by a linear
combination:
Data Transformation
Quadratic Transformations
• In quadratic transformations a new attribute is built
as follows
Data Transformation
Non-polynomial Approximations of Transformations
• Sometimes polynomial transformations are not
enough
• For example, determining whether a set of triangles are
congruent is not possible by simply observing the
coordinates of their vertices
• Computing the lengths of their sides easily solves
the problem → non-polynomial transformation
Data Transformation
Polynomial Approximations of Transformations
• We have observed that specific transformations may
be needed to extract knowledge
• But help from an expert is not always available
• When no knowledge is available, a transformation f
can be approximated via a polynomial
transformation using a brute-force search, one degree
at a time.
• Using the Weierstrass approximation, there is a
polynomial function f that takes the value Yi for each
instance Xi.
Data Transformation
Polynomial Approximations of Transformations
• There are as many polynomials verifying Y = f (X)
as we want
• As the number of instances in the data set increases,
the approximations will be better
• We can use computer assistance to approximate the
intrinsic transformation
Data Transformation
Polynomial Approximations of Transformations
• When the intrinsic transformation is polynomial we
need to add the cartesian product of the attributes
needed for the polynomial degree approximation.
• Sometimes the approximation obtained must be
rounded to avoid the limitations of the computer's
finite numerical precision.
Data Transformation
Rank Transformations
• A change in an attribute distribution can result in a
change of the model performance
• The simplest transformation to accomplish this in
numerical attributes is to replace the value of an
attribute with its rank
• The attribute will be transformed into a new
attribute containing integer values ranging from 1 to
m, where m is the number of instances in the data set.
Data Transformation
Rank Transformations
• Next we can transform the ranks to normal scores
representing their probabilities in the normal
distribution by spreading these values on the Gaussian
curve using a simple transformation given by:
Data Transformation
Box-Cox Transformations
• A problem when selecting the optimal transformation for an
attribute is that we do not know in advance which
transformation will be the best
• The Box-Cox transformation aims to transform a
continuous variable into an almost normal
distribution
Data Transformation
Box-Cox Transformations
• This can be achieved by mapping the values using
the following set of transformations:
Data Transformation
Box-Cox Transformations
• Please note that all the values of variable x in the
previous slide must be positive. If we have negative
values in the attribute we must add a parameter c to
offset such negative values:
Data Transformation
Box-Cox Transformations
• The value of λ is iteratively found by testing
different values in the range from −3.0 to 3.0 in
small steps until the resulting attribute is as close as
possible to the normal distribution.
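The iterative search for λ can be sketched as a simple grid search. The sketch assumes the standard Box-Cox form, y = (x^λ − 1)/λ for λ ≠ 0 and y = log(x) for λ = 0, with all x positive. It also uses absolute skewness as a crude stand-in for "closeness to normal"; that criterion is an assumption made for brevity (statistical packages typically maximize a log-likelihood instead). All function names and data are hypothetical.

```python
import math


def box_cox(x, lam):
    """Standard Box-Cox transform of positive values for a given lambda."""
    if lam == 0:
        return [math.log(v) for v in x]
    return [(v ** lam - 1) / lam for v in x]


def skewness(x):
    """Population skewness: third central moment over variance^1.5."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def best_lambda(x, lo=-3.0, hi=3.0, step=0.1):
    """Try lambda from lo to hi in small steps; keep the least-skewed result."""
    steps = int(round((hi - lo) / step))
    candidates = [round(lo + i * step, 10) for i in range(steps + 1)]
    return min(candidates, key=lambda lam: abs(skewness(box_cox(x, lam))))


# Hypothetical right-skewed data whose logarithms are symmetric
data = [math.exp(v) for v in [-2, -1, 0, 1, 2]]
lam = best_lambda(data)
print("chosen lambda:", lam)
```

For this data the search settles on λ = 0, i.e. the logarithm, because the logged values are symmetric around zero.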
Data Transformation
Spreading the Histogram
• Spreading the histogram is a special case of Box-Cox
transformations
• As Box-Cox transforms the data to resemble a normal
distribution, the histogram is thus spread as shown
here
Data Transformation
Spreading the Histogram
• When the user is not interested in converting the
distribution to a normal one, but just spreading it,
we can use two special cases of Box-Cox
transformations
1. The logarithm (with an offset if necessary) can be
used to spread the right side of the histogram: y =
log(x)
2. If we are interested in spreading the left side of the
histogram we can simply use the power transformation
y = x^g
Data Transformation
Nominal to Binary Transformation
• The presence of nominal attributes in the data set can be
problematic, especially if the Data Mining (DM)
algorithm used cannot correctly handle them
• The first option is to transform the nominal variable to a
numeric one
• Although simple, this approach has two big drawbacks
that discourage it:
• With this transformation we assume an ordering of the
attribute values
• The integer values can be used in operations as numbers,
whereas the nominal values cannot
Data Transformation
Nominal to Binary Transformation
• In order to avoid the aforementioned problems, a very
typical transformation used for DM methods is to map
each nominal attribute to a set of newly generated
attributes.
• If N is the number of different values the nominal
attribute has, we will substitute the nominal variable
with a new set of binary attributes, each one
representing one of the N possible values.
• For each instance, only one of the N newly created
attributes will have a value of 1, while the rest will have
the value of 0
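The mapping described above can be sketched as follows. The function name `one_to_n` and the sample values are hypothetical; the new attributes are ordered alphabetically only to make the output deterministic.

```python
def one_to_n(values):
    """Map a nominal attribute to N binary attributes, one per distinct value."""
    categories = sorted(set(values))
    # For each instance, exactly one of the N new attributes is 1
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical nominal attribute
cats, rows = one_to_n(["Single", "Married", "Divorced", "Single"])
print(cats)
for r in rows:
    print(r)
```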
Data Transformation
Nominal to Binary Transformation
• This transformation is also referred to in the literature
as 1-to-N transformation.
• A problem with this kind of transformation appears
when the original nominal attribute has a large
cardinality
• The number of attributes generated will be large as well,
resulting in a very sparse data set which will lead to
numerical and performance problems.
Data Transformation
Transformations via Data Reduction
• When the data set is very large, performing complex
analysis and DM can take a long computing time
• Data reduction techniques are applied in these
domains to reduce the size of the data set while
trying to maintain the integrity and the information
of the original data set as much as possible
• Mining on the reduced data set will be much more
efficient and it will also resemble the results that
would have been obtained using the original data
set.
Data Transformation
Transformations via Data Reduction
• The main strategies to perform data reduction are
Dimensionality Reduction (DR) techniques
• They aim to reduce the number of attributes or
instances available in the data set
• Well known attribute reduction techniques are Wavelet
transforms or Principal Component Analysis (PCA).
Data Transformation
Transformations via Data Reduction
• Many techniques can be found for reducing the
dimensionality in the number of instances, like the
use of clustering techniques, parametric methods
and so on
Data Transformation
Transformations via Data Reduction
• The use of binning and discretization techniques is
also useful to reduce the dimensionality and
complexity of the data set.
• They convert numerical attributes into nominal
ones, thus drastically reducing the cardinality of the
attributes involved
MISSING DATA
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• failure to register history or changes of the data
• Missing data may need to be inferred.
Missing Values
• There are always MVs in a real dataset
• tedious + infeasible?
• Nearest-Neighbour estimator
• Find the k neighbours nearest to the point and fill in the most
frequent value or the average value
• Finding neighbours in a large dataset may be slow
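A minimal sketch of the nearest-neighbour estimator follows. It assumes Euclidean distance on the non-missing numeric attributes and a brute-force search over all complete records, which is exactly why it becomes slow on large data sets. The function name `knn_impute` and the records are hypothetical.

```python
import math
from collections import Counter


def knn_impute(complete_rows, target, col, k=3):
    """Estimate target[col] from the k complete rows closest to target.

    Distance uses every column except col; those columns are assumed numeric.
    Returns the average for numeric values, else the most frequent value.
    """
    others = [i for i in range(len(target)) if i != col]

    def dist(row):
        return math.sqrt(sum((row[i] - target[i]) ** 2 for i in others))

    neighbours = sorted(complete_rows, key=dist)[:k]
    values = [row[col] for row in neighbours]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    return Counter(values).most_common(1)[0][0]


# Hypothetical records: the third attribute of the target is missing
rows = [[1, 1, 10], [2, 1, 12], [1, 2, 11], [9, 9, 50], [8, 9, 52]]
estimate = knn_impute(rows, [1.5, 1.5, None], col=2, k=3)
print(estimate)

# Hypothetical records with a nominal attribute to fill in
labelled = [[1, "a"], [2, "a"], [3, "b"], [10, "b"]]
label = knn_impute(labelled, [1.5, None], col=1, k=3)
print(label)
```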
Nearest-Neighbour
Summary
• Every real world data set needs some kind of data
pre-processing
• Deal with missing values
• Correct erroneous values
• Select relevant attributes
• Adapt data set format to the software tool to be used
References
• ‘Data preparation for data mining’, Dorian Pyle, 1999
• ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000
• ‘Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations’, Ian H. Witten and Eibe Frank, 1999
Thank you for your attention!
Data Mining
Classification: Basic Concepts and Techniques
Classification: Definition
• Task:
– Learn a model that maps each attribute set x into one of the predefined class
labels y
Examples of Classification Task
Classification Techniques
• Base Classifiers
• Decision Tree based Methods
• Rule-based Methods
• Nearest-neighbor
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Neural Networks, Deep Neural Nets
• Ensemble Classifiers
• Boosting, Bagging, Random Forests
Training data:
ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes
Decision tree model (splitting attributes: Home Owner, MarSt, Income):
Home Owner?
– Yes: NO
– No: MarSt?
  – Married: NO
  – Single, Divorced: Income?
    – < 80K: NO
    – > 80K: YES
Apply Model to Test Data
Test Data:
Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?
Start from the root of tree.
Home Owner?
– Yes: NO
– No: MarSt?
  – Married: NO
  – Single, Divorced: Income?
    – < 80K: NO
    – > 80K: YES
Following the tree: Home Owner = No leads to MarSt, and MarSt = Married reaches the leaf NO.
Assign Defaulted to “No”
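Applying the tree to a record amounts to a chain of nested tests. The sketch below hard-codes the tree from the slide; the function name `classify` is hypothetical, and because the slide labels the income branches "< 80K" and "> 80K", treating exactly 80K as the upper branch is an assumption.

```python
def classify(home_owner, marital_status, annual_income_k):
    """Walk the decision tree from the root and return the predicted class."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced, not a home owner: decide on income (in thousands)
    return "No" if annual_income_k < 80 else "Yes"


# Test record from the slide: Home Owner = No, Married, 80K
print(classify("No", "Married", 80))

# The ten training records: (home owner, marital status, income in K, class)
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
errors = sum(1 for h, m, inc, y in training if classify(h, m, inc) != y)
print("training errors:", errors)
```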
Another Example of Decision Tree
Training data: the same ten records (IDs 1 to 10) with attributes Home Owner, Marital Status, Annual Income and class Defaulted Borrower.
MarSt?
– Married: NO
– Single, Divorced: Home Owner?
  – Yes: NO
  – No: Income?
    – < 80K: NO
    – > 80K: YES
There could be more than one tree that fits the same data!
[Figure: a decision tree model is learned from the Training Set (columns Tid, Attrib1, Attrib2, Attrib3, Class) and applied to the Test Set, whose Class values are unknown (?)]
Decision Tree Induction
• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
Hunt’s Algorithm
Training data: the ten loan records with attributes Home Owner, Marital Status, Annual Income and class Defaulted Borrower. Class counts at a node are written as (number with Defaulted = No, number with Defaulted = Yes).
(a) Start with a single leaf: Defaulted = No (7,3)
(b) Split on Home Owner:
– Yes: Defaulted = No (3,0)
– No: Defaulted = No (4,3)
(c) Split the Home Owner = No branch on Marital Status:
– Married: Defaulted = No (3,0)
– Single, Divorced: Defaulted = Yes
(d) Split the Single, Divorced branch on Annual Income:
– < 80K: Defaulted = No
– >= 80K: Defaulted = Yes
Design Issues of Decision Tree Induction
Test Condition for Nominal Attributes
Multi-way split:
– Use as many partitions as distinct values, e.g., one partition per Marital Status value.
subsets
– Preserve order property among attribute values
Example (Shirt Size):
– {Small, Medium} versus {Large, Extra Large}: preserves order
– {Small} versus {Medium, Large, Extra Large}: preserves order
– {Small, Large} versus {Medium, Extra Large}: this grouping violates order property
Introduction to Data Mining - Classification
Test Condition for Continuous Attributes
(i) Binary split: Annual Income > 80K? (Yes / No)
(ii) Multi-way split: Annual Income? (ranges from < 10K up to > 80K)
How to determine the Best Split
• Greedy approach:
– Nodes with purer class distribution are preferred
– Example: a node with C0: 9, C1: 1 is purer than a node with C0: 5, C1: 5
Measures of Node Impurity
Gain = P - M
Finding the Best Split
Before Splitting: the node has class counts C0: N00 and C1: N01, with impurity P
Split on A? (Yes / No): impurity of the children is M1
Split on B? (Yes / No): impurity of the children is M2
Gain = P – M1 vs P – M2
$Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total
number of classes
• Maximum of 1 − 1/𝑐 when records are equally distributed
among all classes, implying the least beneficial situation for
classification
• Minimum of 0 when all records belong to one class, implying
the most beneficial situation for classification
• Gini index is used in decision tree algorithms such as CART,
SLIQ, SPRINT
Measure of Impurity: GINI
• Gini Index for a given node t:
$Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
C1 count   0       1       2       3
C2 count   6       5       4       3
Gini       0.000   0.278   0.444   0.500
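The four Gini values above can be reproduced with a few lines of code. The function name `gini` is a hypothetical helper; it takes the class counts at a node.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


# (C1, C2) counts for the four example nodes
nodes = [(0, 6), (1, 5), (2, 4), (3, 3)]
values = [round(gini(node), 3) for node in nodes]
print(values)
```

The index is 0 for a pure node and reaches its two-class maximum of 0.5 when the classes are evenly split.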
Computing Gini Index for a Collection of Nodes
• When a node $p$ is split into $k$ partitions (children)
$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
where, 𝑛𝑖 = number of records at child 𝑖,
𝑛 = number of records at parent node 𝑝.
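The weighted Gini of a split can be sketched as below. The helper names are hypothetical; the example split (children with class counts 3/0 and 4/3 under a 7/3 parent) is the one used later in this deck to compare Gini with misclassification error.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(children):
    """Weighted average of the children's Gini indices (weights n_i / n)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = (7, 3)
children = [(3, 0), (4, 3)]
g_parent = gini(parent)
g_children = gini_split(children)
print(f"parent={g_parent:.3f}, children={g_children:.3f}, gain={g_parent - g_children:.3f}")
```

The children's weighted Gini is about 0.343 (the slides truncate it to 0.342), down from 0.42 at the parent.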
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the
dataset
• Use the count matrix to make decisions
Continuous Attributes: Computing Gini Index...
Class counts and Gini index at each of the 11 candidate split positions v, in sorted order of the attribute (counts given as ≤ v, > v):
No     0,7     1,6     2,5     3,4     3,4     3,4     3,4     4,3     5,2     6,1     7,0
Gini   0.420   0.400   0.375   0.343   0.417   0.400   0.300   0.343   0.375   0.400   0.420
The lowest Gini (0.300) identifies the best split position.
Computing Entropy of a Single Node
$Entropy = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$
• Information Gain:
$Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$
Parent Node, 𝑝 is split into 𝑘 partitions (children)
𝑛𝑖 is number of records in child node 𝑖
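Entropy and information gain can be computed from class counts as sketched below. The sketch assumes the usual entropy definition, −Σ p log2 p with 0·log 0 taken as 0; the function names are hypothetical, and the example split reuses the 7/3 parent with children 3/0 and 4/3 from the Gini comparison in this deck.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


gain = information_gain((7, 3), [(3, 0), (4, 3)])
print(f"entropy(parent)={entropy((7, 3)):.4f}, gain={gain:.4f}")
```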
Problem with large number of partitions
Node impurity measures tend to prefer splits that result in large number
of partitions, each being small but pure
Gain Ratio
• Gain Ratio:
$Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$, where $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
Parent Node, 𝑝 is split into 𝑘 partitions (children)
𝑛𝑖 is number of records in child node 𝑖
– Adjusts Information Gain by the entropy of the partitioning
(𝑆𝑝𝑙𝑖𝑡 𝐼𝑛𝑓𝑜).
◆Higher entropy partitioning (large number of small partitions) is penalized!
– Used in C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain
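The penalty that Split Info applies can be seen in a short sketch. Function names are hypothetical, entropy is assumed to be −Σ p log2 p, and the example counts are illustrative.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def split_info(child_sizes):
    """Entropy of the partitioning itself, based only on child sizes."""
    n = sum(child_sizes)
    return -sum((s / n) * math.log2(s / n) for s in child_sizes if s > 0)


def gain_ratio(parent, children):
    """Information gain divided by Split Info."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    si = split_info([sum(c) for c in children])
    return gain / si if si > 0 else 0.0


ratio = gain_ratio((7, 3), [(3, 0), (4, 3)])
print(f"gain ratio = {ratio:.4f}")
# Splitting 10 records into 10 singleton partitions has a much larger Split Info
print(f"split info, two children (3, 7): {split_info([3, 7]):.3f}")
print(f"split info, ten singletons:      {split_info([1] * 10):.3f}")
```

A split into many small partitions has a large Split Info, so its gain is divided by a larger number.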
Computing Error of a Single Node
$Error(t) = 1 - \max_i [p_i(t)]$
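A short sketch of the classification error, using hypothetical helper names. The example reuses the 7/3 parent with children 3/0 and 4/3 from this deck's comparison with the Gini index.

```python
def classification_error(counts):
    """Misclassification error of a node: 1 minus the largest class fraction."""
    n = sum(counts)
    return 1 - max(counts) / n


def weighted_error(children):
    """Error of a split: children's errors weighted by their share of records."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * classification_error(child) for child in children)


parent_error = classification_error((7, 3))
children_error = weighted_error([(3, 0), (4, 3)])
print(f"parent={parent_error:.3f}, children={children_error:.3f}")
```

Both values come out at 0.3, so this split shows no improvement under classification error even though the Gini index does improve.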
Misclassification Error vs Gini Index
Split on A? (Yes: Node N1, No: Node N2)
Parent: C1 = 7, C2 = 3, Gini = 0.42
      N1   N2
C1    3    4
C2    0    3
Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
Gini improves but error remains the same!!
Comparing two candidate splits of the same parent (C1 = 7, C2 = 3, Gini = 0.42):
Split 1: N1 (C1 = 3, C2 = 0), N2 (C1 = 4, C2 = 3), Gini = 0.342
Split 2: N1 (C1 = 3, C2 = 1), N2 (C1 = 4, C2 = 2), Gini = 0.416
Decision Tree Based Classification
• Advantages:
– Relatively inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant attributes
– Can easily handle irrelevant attributes (unless the attributes are interacting)
• Disadvantages:
– Due to the greedy nature of the splitting criterion, interacting attributes (that can
distinguish between classes together but not individually) may be passed over in
favor of other attributes that are less discriminating.
– Each decision boundary involves only a single attribute
Handling interactions
Limitations of single attribute-based decision boundaries
1/11/2023
Data Mining:
Cluster Analysis: Basic
Concepts and Methods
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary
■ Proximity measure
■ Similarity of two feature vectors
■ Clustering criterion
■ Expressed via a cost function or some rules
■ Clustering algorithms
■ Choice of algorithms
■ Constraint-based clustering
■ User may give inputs on constraints
■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
■ Discovery of clusters with arbitrary shape
■ High dimensionality
■ Partitioning approach:
■ Construct various partitions and then evaluate them by some
■ Hierarchical approach:
■ Create a hierarchical decomposition of the set of data (or objects)
■ Density-based approach:
■ Based on connectivity and density functions
■ Grid-based approach:
■ based on a multiple-level granularity structure
■ Frequent pattern-based:
■ Based on the analysis of frequent patterns
■ User-guided or constraint-based:
■ Clustering by considering user-specified or application-specific
constraints
■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
■ Objects are often linked together in various ways
■ Dissimilarity calculations
■ Strategies to calculate cluster means
■ Handling categorical data: k-modes
■ Replacing means of clusters with modes
■ Using new dissimilarity measures to deal with categorical objects
■ Using a frequency-based method to update modes of clusters
■ A mixture of categorical and numerical data: k-prototype method
[Figure: k-medoids clustering on a 2-D scatter plot]
1. Arbitrarily choose k objects as initial medoids
2. Assign each remaining object to the nearest medoid
3. Do loop, until no change:
– Compute the total cost of swapping a medoid O with a randomly selected object O_random
– Swap O and O_random if quality is improved
Hierarchical Clustering
■ Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
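[Figure: hierarchical clustering of objects a, b, c, d, e]
Agglomerative (AGNES), Step 0 to Step 4: a and b merge into ab; d and e merge into de; c and de merge into cde; ab and cde merge into abcde
Divisive (DIANA), Step 0 to Step 4: the same hierarchy traversed in the opposite direction, splitting abcde down to a, b, c, d, e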
Distance between Clusters
■ Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min dist(tip, tjq) over tip in Ki and tjq in Kj
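The single-link distance can be sketched directly from its definition. The sketch assumes Euclidean distance between points; the function name `single_link` and the clusters are hypothetical.

```python
import math


def single_link(cluster_a, cluster_b):
    """Smallest pairwise distance between a point in each cluster."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


# Hypothetical 2-D clusters
k_i = [(0, 0), (1, 0)]
k_j = [(4, 0), (5, 4)]
d = single_link(k_i, k_j)
print(d)
```

Here the closest pair is (1, 0) and (4, 0), so the single-link distance is 3.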
■ Density-reachable:
■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there
is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly
density-reachable from pi
■ Density-connected
■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there
is a point o such that both p and q are density-reachable from o
w.r.t. Eps and MinPts
[Figure: core, border, and outlier points for Eps = 1cm, MinPts = 5]
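The core, border, and outlier labels can be assigned with a brute-force neighbourhood count. The sketch assumes the usual density-based definitions: a core point has at least MinPts points (itself included) within Eps; a border point is not core but lies within Eps of a core point; everything else is an outlier. The function name, the points, and the parameter values are hypothetical.

```python
import math


def label_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'outlier'."""
    # Eps-neighbourhood of every point (the point itself is included)
    neighbours = {
        i: [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for i, p in enumerate(points)
    }
    core = {i for i, nb in neighbours.items() if len(nb) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours[i]):
            labels.append("border")
        else:
            labels.append("outlier")
    return labels


# Hypothetical points
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = label_points(pts, eps=1.0, min_pts=4)
print(list(zip(pts, labels)))
```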
■ Partition the data space and find the number of points that
lie inside each cell of the partition.
■ Identify the subspaces that contain clusters using the
Apriori principle
■ Identify clusters
■ Determine dense units in all subspaces of interests
■ Determine connected dense units in all subspaces of
interests.
■ Generate minimal description for the clusters
■ Determine maximal regions that cover a cluster of
connected dense units for each cluster
■ Determination of minimal cover for each cluster
■ Elbow method
■ Use the turning point in the curve of sum of within cluster variance
■ E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test set
and the closest centroids to measure how well the model fits the
test set
■ For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
■ Matching-based measures
■ Purity, maximum matching, F-measure
■ Entropy-Based Measures
■ Conditional entropy, normalized mutual
Mallow measure
■ Correlation measures
■ Discretized Hubert statistic, normalized discretized
Hubert statistic
[Figure: ground truth partitioning T1, T2 compared with Cluster C1, Cluster C2]
11/21/2023
Learning Objectives
• Describe what forecasting is
• Explain time series & its components
• Smooth a data series
• Moving average
• Exponential smoothing
• Forecast using trend models
• Simple linear regression
• Autoregressive
What Is Forecasting?
• Process of predicting a
future event
• Underlying basis of
all business decisions
• Production
• Inventory
• Personnel
• Facilities
Forecasting Approaches
Qualitative Methods
• Used when situation is vague & little data exist
• New products
• New technology
• Involve intuition, experience
• e.g., forecasting sales on Internet
Quantitative Methods
• Used when situation is ‘stable’ & historical data exist
• Existing products
• Current technology
• Involve mathematical techniques
• e.g., forecasting sales of color televisions
Quantitative Forecasting
• Select several forecasting methods
• ‘Forecast’ the past
• Evaluate forecasts
• Select best method
• Forecast the future
• Continuously monitor forecast accuracy
Time Series
Models
Time series is
dynamic, it does
change over time.
Time-series components: Trend, Cyclical, Seasonal, Irregular
Trend Component
• Persistent, overall upward or downward pattern
• Due to population, technology etc.
• Several years duration
Trend Component
• Overall Upward or Downward Movement
• Data Taken Over a Period of Years
Cyclical Component
• Repeating up & down movements
• Due to interactions of factors influencing economy
• Usually 2-10 years duration
Cyclical Component
• Upward or Downward Swings
• May Vary in Length
• Usually Lasts 2 - 10 Years
Seasonal Component
• Regular pattern of up & down fluctuations
• Due to weather, customs etc.
• Occurs within one year
Seasonal Component
• Upward or Downward Swings
• Regular Patterns
• Observed Within One Year
Irregular Component
• Erratic, unsystematic, ‘residual’ fluctuations
• Due to random variation or unforeseen events
• Union strike
• War
Random or Irregular Component
• Erratic, Nonsystematic, Random, ‘Residual’ Fluctuations
• Due to Random Variations of
• Nature
• Accidents
[Figure: choosing a time-series forecasting method. No trend: smoothing methods (moving average, exponential smoothing). Trend: linear, quadratic, exponential, or autoregressive models]
[Figure: Number of Passengers by Month/Year]
Moving Average
[An Example]
You work for Firestone Tire. You
want to smooth random
fluctuations using a 3-period
moving average.
1995 20,000
1996 24,000
1997 22,000
1998 26,000
1999 25,000
Moving Average
[Solution]
Year   Sales    MA(3) in 1,000
1995   20,000   NA
1996   24,000   (20+24+22)/3 = 22
1997   22,000   (24+22+26)/3 = 24
1998   26,000   (22+26+25)/3 = 24.33
1999   25,000   NA
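The 3-period moving average in the solution is centered: each value averages a year with its two neighbours, so the first and last years have no value. A minimal sketch, with a hypothetical function name and an odd window size assumed:

```python
def centered_moving_average(series, window=3):
    """Centered moving average; positions without a full window give None."""
    half = window // 2
    result = []
    for i in range(len(series)):
        if i < half or i >= len(series) - half:
            result.append(None)  # not enough neighbours on one side
        else:
            result.append(sum(series[i - half:i + half + 1]) / window)
    return result


# Sales in thousands, 1995 to 1999
sales = [20, 24, 22, 26, 25]
ma = centered_moving_average(sales)
print(ma)
```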
Moving Average
Year   Response   Moving Ave
1994   2          NA
1995   5          3
1996   2          3
1997   2          3.67
1998   7          5
1999   6          NA
[Figure: Sales (0 to 8) plotted for years 94 to 99]
Exponential Smoothing
Method
• Form of weighted moving average
• Weights decline exponentially
• Most recent data weighted most
• Requires smoothing constant (W)
• Ranges from 0 to 1
• Subjectively chosen
• Involves little record keeping of past data
Exponential Smoothing
[An Example]
Exponential Smoothing
$E_i = W \cdot Y_i + (1 - W) \cdot E_{i-1}$
Time   Y_i   Smoothed Value, E_i (W = .2)    Forecast, $\hat{Y}_{i+1}$
1995   4     4.0                             NA
1996   6     (.2)(6) + (1-.2)(4.0) = 4.4     4.0
1997   5     (.2)(5) + (1-.2)(4.4) = 4.5     4.4
1998   3     (.2)(3) + (1-.2)(4.5) = 4.2     4.5
1999   7     (.2)(7) + (1-.2)(4.2) = 4.8     4.2
2000   NA    NA                              4.8
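The recursion can be sketched in a few lines. The first smoothed value is set equal to the first observation, as in the table; the function name is hypothetical. Because the code carries full precision from one step to the next, its values (4.52, 4.216, 4.7728) agree with the table only after rounding to one decimal.

```python
def exponential_smoothing(series, w):
    """Return smoothed values E_i = w*Y_i + (1 - w)*E_{i-1}, with E_1 = Y_1."""
    smoothed = [float(series[0])]
    for y in series[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed


# Observations for 1995 to 1999, W = .2
y = [4, 6, 5, 3, 7]
e = exponential_smoothing(y, 0.2)
print([round(v, 1) for v in e])
# The forecast for the next period (2000) is the last smoothed value
print("forecast for 2000:", round(e[-1], 1))
```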
[Figure: linear trend line with slope b1 < 0, plotted against Time, X1]
Year   X   Sales
94     0   2
95     1   5
96     2   2
97     3   2
98     4   7
99     5   6
Excel Output
               Coefficients
Intercept      2.14285714
X Variable 1   0.74285714
[Figure: sales plotted for 1993 to 2000 with the fitted trend line projected to year 2000]
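The Excel coefficients can be reproduced with ordinary least squares. The sketch uses the 1994 to 1999 sales values (2, 5, 2, 2, 7, 6) with the year coded as X = 0 for 1994; that coding is an assumption, adopted because it reproduces the intercept and slope shown in the Excel output. The function name is hypothetical.

```python
def linear_trend(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


xs = [0, 1, 2, 3, 4, 5]   # 1994 .. 1999 coded as 0 .. 5 (assumed coding)
ys = [2, 5, 2, 2, 7, 6]   # sales
b0, b1 = linear_trend(xs, ys)
forecast_2000 = b0 + b1 * 6  # year 2000 corresponds to X = 6
print(f"intercept={b0:.8f}, slope={b1:.8f}, forecast 2000={forecast_2000:.2f}")
```

Under that coding the trend line projects sales of about 6.6 for the year 2000.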
[Figure: two plots of Number of Surgeries by Month/Year]
[Figure: Seasonality Plot, Revised Surgery Data (Seasonal Decomposition): Monthly Index (about 99.7 to 100.5) by Month, Jan to Dec]
[Figure: Trend Analysis, Revised Surgery Data: Number of Surgeries (X 1000, about 18 to 19.5) by Month/Year]
Quadratic Time-Series
Forecasting Model
[Figure: four quadratic trend curves plotted against Year, X1]
Exponential Time-Series
Forecasting Model
• Used for forecasting trend
• Relationship is an exponential function
• Series increases (decreases) at increasing (decreasing) rate
[Figure: exponential trend curves for b1 > 1 and 0 < b1 < 1, plotted against Year, X1]
Autoregressive Modeling
• Used for forecasting trend
• Like regression model
• Independent variables are lagged response variables Yi-1, Yi-2, Yi-3 etc.
• Assumes data are correlated with past data values
• 1st Order: Correlated with prior period
• Estimate with ordinary least squares
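The bullets above can be sketched as a small routine: build a regression whose predictors are the lagged values of the series, then estimate it with ordinary least squares. The function name, the synthetic series, and the coefficients used to generate it (1.0, 0.5, -0.6) are illustrative assumptions, not values from the slides.

```python
import numpy as np


def fit_autoregressive(y, order):
    """Estimate Y_i = a0 + a1*Y_{i-1} + ... + ap*Y_{i-p} by least squares.

    Returns the coefficients as [a0, a1, ..., ap].
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column k holds the series lagged by k periods, aligned with y[order:].
    lags = [y[order - k : n - k] for k in range(1, order + 1)]
    design = np.column_stack([np.ones(n - order)] + lags)
    coef, *_ = np.linalg.lstsq(design, y[order:], rcond=None)
    return coef


# Synthetic, noise-free second-order series (hypothetical coefficients).
series = [2.0, 3.0]
for _ in range(10):
    series.append(1.0 + 0.5 * series[-1] - 0.6 * series[-2])

coef = fit_autoregressive(series, order=2)
next_value = coef[0] + coef[1] * series[-1] + coef[2] * series[-2]
print("estimated coefficients:", np.round(coef, 4))
```

Because the synthetic series has no noise, the estimates recover the generating coefficients; with real data they would only approximate the underlying relationship.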
[Figure: Number of Passengers by Month/Year]
Auto-correlation Plot
Intra-Campus Bus Passengers
[Figure: autocorrelation function plot; vertical axis -1 to 1, horizontal axis: Lag, 0 to 25]
Y_i = 3.5 + 0.8125 Y_{i-1} - 0.9375 Y_{i-2}
Evaluating Forecasts
Quantitative Forecasting Steps
• Select several forecasting methods
• ‘Forecast’ the past
• Evaluate forecasts
• Select best method
• Forecast the future
• Continuously monitor forecast accuracy
Forecasting Guidelines
• No pattern or direction in forecast error
  • e_i = Actual Y_i - Forecast Y_i
  • Seen in plots of errors over time
• Smallest forecast error
  • Measured by mean absolute deviation
• Simplest model
  • Called principle of parsimony
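To make the "smallest forecast error" guideline concrete, the sketch below computes the errors e_i = Actual Y_i - Forecast Y_i and the mean absolute deviation (MAD) for two candidate methods. All numbers and the method names are hypothetical, chosen only to illustrate the comparison.

```python
def forecast_errors(actual, forecast):
    """Return e_i = actual_i - forecast_i for each period."""
    return [a - f for a, f in zip(actual, forecast)]


def mean_absolute_deviation(actual, forecast):
    """Average of |e_i| over all periods."""
    errors = forecast_errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical actual values and 'forecasts of the past' from two methods.
actual = [5, 2, 2, 7, 6]
candidates = {
    "method_a": [4, 3, 3, 6, 6],
    "method_b": [5, 5, 5, 5, 5],
}

mad_by_method = {
    name: mean_absolute_deviation(actual, forecast)
    for name, forecast in candidates.items()
}
best_method = min(mad_by_method, key=mad_by_method.get)
print(mad_by_method, "->", best_method)
```

MAD alone does not reveal patterns in the errors, so it complements rather than replaces a plot of the errors over time.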
Residual Analysis
[Figure: four plots of error e against time T:
  random errors;
  cyclical effects not accounted for;
  trend not accounted for;
  seasonal effects not accounted for]
Principle of Parsimony
• Suppose two or more models provide good fit for data
• Select the simplest model
• Simplest model types:
  • least-squares linear
  • least-squares quadratic
  • 1st order autoregressive
• More complex types:
  • 2nd and 3rd order autoregressive
  • least-squares exponential
Summary
• Described what forecasting is
• Explained time series & its components
• Smoothed a data series
• Moving average
• Exponential smoothing
• Forecasted using trend models
Questions?
ANOVA
End of Chapter
Business Intelligence:
Analytics
Business Performance
Management (BPM)
Learning Objectives
Opening Vignette…
• Company background
• Problem description
• Proposed solution
• Results
• Answer & discuss the case questions.
Business Performance Management (BPM) Overview
BPM versus BI
• Process Steps
1. Strategize
2. Plan
3. Monitor/analyze
4. Act/adjust
Strategize: Where Do We Want to Go?
• Strategic planning
• Common tasks for the strategic planning process:
1. Conduct a current situation analysis
2. Determine the planning horizon
3. Conduct an environment scan
4. Identify critical success factors
5. Complete a gap analysis
6. Create a strategic vision
7. Develop a business strategy
8. Identify strategic objectives and goals
• Strategic objective
A broad statement or general course of action
prescribing targeted directions for an organization
• Strategic goal
A quantified objective with a designated time period
• Strategic vision
A picture or mental image of what the organization
should look like in the future
• Critical success factors (CSF)
Key factors that delineate the things that an
organization must excel at to be successful
• Operational planning
• Operational plan: plan that translates an
organization’s strategic objectives and goals into a
set of well-defined tactics and initiatives, resource
requirements, and expected results for some future
time period (usually a year).
• Operational planning can be
• Tactic-centric (operationally focused)
• Budget-centric (financially focused)
Plan: How Do We Get There?
Monitor: How Are We Doing?
Act and Adjust: What Do We Need to Do Differently?
Harrah’s Closed-Loop
Marketing Model
Performance Measurement
BPM Methodologies
BPM Methodologies Balanced Scorecard
Strategy map
A visual display
that delineates
the relationships
among the key
organizational
objectives for all
four BSC
perspectives
BPM Methodologies
• Six Sigma
A performance management methodology
aimed at reducing the number of defects in a
business process to as close to zero defects
per million opportunities (DPMO) as
possible
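The DPMO measure named in the definition above is commonly computed as defects divided by total opportunities, scaled to one million. The sketch below uses that standard formula; the function name and the sample figures are illustrative and do not come from the slides.

```python
def defects_per_million_opportunities(defects, units, opportunities_per_unit):
    """DPMO = defects * 1,000,000 / (units * opportunities per unit)."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("total opportunities must be positive")
    return defects * 1_000_000 / total_opportunities


# Hypothetical process: 1,000 orders, 10 ways each order can go wrong,
# and 34 defects observed in total.
dpmo = defects_per_million_opportunities(
    defects=34, units=1000, opportunities_per_unit=10
)
print(f"DPMO = {dpmo:.0f}")
```

Tracking this number over time shows whether a process is moving toward the near-zero defect level the definition describes.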
• The DMAIC performance model
A closed-loop business improvement model that
encompasses the steps of defining, measuring,
analyzing, improving, and controlling a process
• Lean Six Sigma
• Lean manufacturing / lean production
• Lean production versus Six Sigma
• BPM architecture
• The logical and physical design of a system
• BPM systems consist of three logical parts:
1. BPM Applications
2. Information Hub
3. Source Systems
• BPM systems consist of three physical parts:
1. Database tier
2. Application tier
3. Client or user interface
BPM Architecture and Applications
• BPM applications
1. Strategy management
2. Budgeting, planning, and
forecasting
3. Financial consolidation
4. Profitability modeling
and optimization
5. Financial, statutory, and
management reporting
Performance Dashboards
• Dashboard design
• “The fundamental challenge of dashboard design is
to display all the required information on a single
screen, clearly and without distraction, in a manner
that can be assimilated quickly”
(Few, 2005)
End of the Chapter
• Questions, comments
Information Visualization:
Principles, Promise, and Pragmatics
Agenda
• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up
Information Visualization
• Problem:
– HUGE Datasets: How to understand them?
• Solution
– Take better advantage of human perceptual system
– Convert information into a graphical representation.
• Issues
– How to convert abstract information into graphical form?
– Do visualizations do a better job than other methods?
[Figure: horizontal lines indicate location of deaths. From Visual Explanations by Edward Tufte, Graphics Press, 1997]
To help:
Explore
Calculate
Communicate
Decorate
Explore/Calculate
Analyze
Reason about Information
Communicate
Explain
Make Decisions
Reason about Information
Why Visualization?
Use the eye for pattern recognition; people are good at
scanning
recognizing
remembering images
Case Study:
The Journey of the TreeMap
Treemap Problems
• Too disorderly
– What does adjacency mean?
– Uncontrolled aspect ratios lead to lots of skinny boxes
that clutter
• Color not used appropriately
– In fact, is meaningless here
• Wrong application
– Don’t need all this to just see the largest files in the OS
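The aspect-ratio problem listed above can be illustrated with a deliberately simplified, single-level "slice" layout that divides a rectangle into strips proportional to item sizes. This is only a sketch of the general idea, not the layout used by any particular treemap tool; the function names and file sizes are hypothetical.

```python
def slice_layout(sizes, x, y, width, height):
    """Split a rectangle into vertical strips with areas proportional to sizes.

    Returns a list of (x, y, width, height) tuples, one per item.
    """
    total = sum(sizes)
    rects = []
    offset = x
    for size in sizes:
        strip_width = width * size / total
        rects.append((offset, y, strip_width, height))
        offset += strip_width
    return rects


def aspect_ratio(rect):
    """Long side divided by short side; 1.0 is a perfect square."""
    _, _, w, h = rect
    return max(w / h, h / w)


# Hypothetical file sizes laid out in a 10 x 10 area.
file_sizes = [6, 3, 1]
rects = slice_layout(file_sizes, 0.0, 0.0, 10.0, 10.0)
ratios = [aspect_ratio(r) for r in rects]
print(ratios)
```

The smallest item ends up as a strip ten times taller than it is wide, which is the kind of skinny box the slide complains about.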
TreeMaps in Action
http://www.smartmoney.com/maps
http://www.peets.com/tast/11/coffee_selector.asp
www.smartmoney.com/marketmap
Visual Principles
• Graphs
• Charts
• Maps
• Diagrams
Examples:
family tree
flow chart
network diagram
Examples:
map of census data
topographic maps
From www.thehighsierra.com
• Framework
– sets the stage
– kinds of measurements, scale, ...
• Content
– marks
– point symbols, lines, areas, bars, …
• Labels
– title, axes, tic marks, ...
• Nominal (qualitative)
– (no inherent order)
– city names, types of diseases, ...
• Ordinal (qualitative)
– (ordered, but not at measurable intervals)
– first, second, third, …
– cold, warm, hot
• Interval (quantitative)
– list of integers or reals
[Figure: example charts relating URL, length of page, length of access (short, medium, long, very long), # of accesses, and days]
Nominal Nominal
Nominal Ordinal
Nominal Interval
Ordinal Ordinal
Ordinal Interval
Interval Interval
Interesting Findings
Lohse et al.
• Preattentive Processing
• Accuracy of Interpretation of Visual Properties
• Illusions and the Relation to Graphical
Integrity
Preattentive Processing
Example: Conjunction of
Features
• Law of Proximity
– Stimulus elements that are close together will be
perceived as a group
• Law of Similarity
– like the preattentive processing examples
• Law of Common Fate
– like preattentive motion property
• move a subset of objects among similar ones and they
will be perceived as a group
Color Purposes
Visual Illusions
• People don’t perceive length, area, angle,
brightness the way they “should”.
• Some illusions have been reclassified as
systematic perceptual errors
– e.g., brightness contrasts (grey square on white
background vs. on black background)
– partly due to increase in our understanding of the
relevant parts of the visual system
• Nevertheless, the visual system does some
really unexpected things.
• Horizontal-Vertical
Illusions of Area
• Delboeuf Illusion
Tufte
Tufte Principle
Maximize the data-ink ratio:
Data-ink ratio = (data ink) / (total ink used in graphic)
Error: Shrinking along both dimensions
Howard Wainer
How to Display Data Badly
(Video)
http://www.dartmouth.edu/~chance/ChanceLecture/AudioVideo.html
Promising Techniques
• Perceptual Techniques
– Animation
– Grouping / Gestalt principles
– Using size to indicate quantity
– Color for Accent, Distinction, Selection
• NOT FOR QUANTITY!!!!
• General Approaches
– Standard Techniques
• Graphs, bar charts, tables
– Brushing and Linking
– Providing Multiple Views and Models
– Aesthetics!
Standard Techniques
• It’s often hard to beat:
– Line graphs, bar charts
– Scatterplots (or Scatterplot Matrix)
– Tables
• A Darwinian view of visualizations:
– Only the fittest survive
– We are in a period of great experimentation; eventually it
will be clear what works and what dies out.
• A bright spot:
– Enhancing the old techniques with interactivity
– Example: Spotfire
• Adds interactivity, color highlighting, zooming to scatterplots
– Example: TableLens / Eureka
• Adds interactivity and length cues to tables
Spotfire/IVEE: Integrating
Interaction with Scatterplots
• Interactive technique
– Highlighting
– Brushing and Linking
• At least two things must be linked together to
allow for brushing
– select a subset of points
– see the role played by this subset of points in one or
more other views
• Example systems
– Graham Wills’s EDV system
– Ahlberg & Shneiderman’s IVEE (Spotfire)
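Brushing and linking, as described above, can be sketched without any graphics: a selection made in one view is stored as a set of record indices, and every other view uses that set to decide what to highlight. The player records, field names, and function names below are hypothetical.

```python
# Hypothetical player records shared by all views.
players = [
    {"name": "A", "salary": 2100, "years": 14, "career_hits": 2300},
    {"name": "B", "salary": 300, "years": 3, "career_hits": 250},
    {"name": "C", "salary": 1800, "years": 11, "career_hits": 1700},
    {"name": "D", "salary": 450, "years": 9, "career_hits": 900},
]


def brush(records, field, low, high):
    """Select record indices whose field value lies in [low, high]."""
    return {i for i, r in enumerate(records) if low <= r[field] <= high}


def linked_view(records, x_field, y_field, selected):
    """Return (x, y, highlighted) points for another view of the same records."""
    return [
        (r[x_field], r[y_field], i in selected) for i, r in enumerate(records)
    ]


# Brush the high salaries, then see where those players fall in another view.
selected = brush(players, "salary", 1500, 3000)
view = linked_view(players, "years", "career_hits", selected)
for point in view:
    print(point)
```

The key design point is that views share record identities rather than coordinates, so any number of views can respond to the same brush.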
[Figure: linked views of baseball data with annotations:
  select high salaries;
  how long in majors;
  avg career HRs vs avg career hits (batting ability);
  avg assists vs avg putouts (fielding ability);
  distribution of positions played]
Layout - Illustration
• Transition Paths
– Linear interpolation of polar coordinates
– Node moves in arc not straight line
– Moves along circle if not changing levels (like great
circles on earth)
– Spirals in or out to next ring
Animation (continued)
• Transition constraints
– Orientation of transition to minimize rotational
travel
– (Move former parent away from new focus in same
orientation)
– Avoid cross-over of edges
– (to allow users to keep track of which is which)
• Animation timing
– Slow in Slow out timing (allows users to better
track movement)
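The transition path and timing ideas above can be sketched as follows: a node's polar coordinates (radius, angle) are interpolated rather than its x/y position, so it moves along an arc, and an easing function slows the start and end of the motion. The smoothstep easing used here is one common choice and is an assumption, as are the function names; the sketch does not implement the transition constraints on rotation direction.

```python
import math


def slow_in_slow_out(t):
    """Smoothstep easing: maps t in [0, 1] to [0, 1], slow at both ends."""
    return t * t * (3.0 - 2.0 * t)


def polar_transition(start, end, t, ease=slow_in_slow_out):
    """Interpolate (radius, angle) linearly in polar space at time t in [0, 1].

    Returns the node's Cartesian (x, y) position at that moment.
    """
    s = ease(t)
    radius = start[0] + (end[0] - start[0]) * s
    angle = start[1] + (end[1] - start[1]) * s
    return radius * math.cos(angle), radius * math.sin(angle)


# A node moving from ring 1 at angle 0 to ring 2 at angle pi/2 spirals outward.
start = (1.0, 0.0)
end = (2.0, math.pi / 2)
path = [polar_transition(start, end, i / 4) for i in range(5)]
for x, y in path:
    print(f"({x:.3f}, {y:.3f})")
```

If the start and end radii are equal, the same code moves the node along a circle, matching the "not changing levels" case.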
Hyperbolic Tree
• Displaying text
– The size of the text
• Works well for small things like directories
• Not so good for URLs
• Only a portion of the data can be seen in the
focus at one time
• Only works for certain types of data -
Hierarchical
• Not clear if it is actually useful for anything.
Animating Algorithms
• Kehoe, Stasko, and Taylor, “Rethinking Evaluation of
Algorithm Animations as Learning Aids”
Findings
– Pan-and-Zoom
– 3D Navigation
– Node-and-link representations of concept spaces
Overview + Detail
• K. Hornbaek et al., Navigation patterns and Usability of Zoomable
User Interfaces with and without an Overview, ACM TOCHI, 9(4),
December 2002.
Problem Solving
Multidimensional Detective
A. Inselberg, Multidimensional Detective, Proceedings of IEEE
Symposium on Information Visualization (InfoVis '97), 1997.
A Detective Story
A. Inselberg, Multidimensional Detective, Proceedings of IEEE Symposium on Information
Visualization (InfoVis '97), 1997
• The Dataset:
– Production data for 473 batches of a VLSI chip
– 16 process parameters:
• X1: the yield (% of produced chips that are useful)
• X2: the quality of the produced chips (speed)
• X3 … X12: 10 types of defects (zero defects shown at top)
• X13 … X16: 4 physical parameters
• The Objective:
– Raise the yield (X1) and maintain high quality (X2)
Multidimensional Detective
• Each line represents the values for one batch of chips
• This figure shows what happens when only those
batches with both high X1 and high X2 are chosen
• Notice the separation in values at X15
• Also, some batches with few X3 defects are not in this
high-yield/high-quality group.
Multidimensional Detective
• Fig 5 and 6 show that high yield batches don’t have non-zero values
for defects of type X3 and X6
– Don’t believe your assumptions …
• Looking now at X15 we see the separation is important
– Lower values of this property end up in the better yield batches
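The selection step in this analysis can be mimicked in a few lines: keep only the batches with high X1 and high X2, then compare the range of another parameter (here X15) between the selected batches and the rest. The batch values, thresholds, and function names below are invented for illustration and are not the data from the paper.

```python
# Hypothetical batches with values scaled to the range 0..1.
batches = [
    {"X1": 0.90, "X2": 0.80, "X3": 0.0, "X15": 0.2},
    {"X1": 0.85, "X2": 0.90, "X3": 0.0, "X15": 0.3},
    {"X1": 0.40, "X2": 0.70, "X3": 0.0, "X15": 0.8},
    {"X1": 0.30, "X2": 0.20, "X3": 0.5, "X15": 0.7},
]


def select(records, **minimums):
    """Keep records whose value for every named parameter meets its minimum."""
    return [r for r in records if all(r[k] >= v for k, v in minimums.items())]


def value_range(records, key):
    """Return (min, max) of one parameter across the given records."""
    values = [r[key] for r in records]
    return min(values), max(values)


good = select(batches, X1=0.8, X2=0.8)
rest = [b for b in batches if b not in good]

print("X15 range, high yield and quality:", value_range(good, "X15"))
print("X15 range, remaining batches:     ", value_range(rest, "X15"))
```

Non-overlapping ranges for a parameter, as in this toy data, are the numeric counterpart of the visual separation seen on a parallel-coordinates axis.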
• Collect
– Clickstream
– Purchase history
– Demographic data
• Associates customer data with their
clickstream
• Different color for each customer segment
Layout
• Aggregation based on file system path
• Gender shopping
differences
Directed graph
Findings
• WebQuilt methodology is promising for uncovering site
design related issues.
• 1/3 of the issues were device or browser related.
• Browser and device issues cannot be captured
automatically with WebQuilt unless they cause an
interaction with the server
• They can be revealed via the questionnaire data.
Galaxy of News
Rennison 95
Example: Themescapes
(Wise et al. 95)
ScatterPlot of Clusters
(Chen et al. 97)
Cutting, Pedersen, Tukey & Karger 92, 93; Hearst & Pedersen 95
• Participants liked:
– Correspondence of region size to # documents
– Overview (but also wanted zoom)
– Ease of jumping from one topic to another
– Multiple routes to topics
– Use of category and subcategory labels
Study 2 (cont.)
• Participants wanted:
– hierarchical organization
– other ordering of concepts (alphabetical)
– integration of browsing and search
– correspondence of color to meaning
– more meaningful labels
– labels at same level of abstraction
– fit more labels in the given space
– combined keyword and category search
– multiple category assignment (sports+entertain)
Study 3
Visualization of search results: a comparative evaluation of text, 2D,
and 3D interfaces Sebrechts, Cugini, Laskowski, Vasilakis and Miller,
Proceedings of SIGIR 99, Berkeley, CA, 1999.
• This study compared:
– 3D graphical clusters
– 2D graphical clusters
– textual clusters
• 15 participants, between-subject design
• Tasks
– Locate a particular document
– Locate and mark a particular document
– Locate a previously marked document
– Locate all clusters that discuss some topic
– List more frequently represented topics
IR Infovis Meta-Analysis
(Empirical studies of information visualization:
a meta-analysis, Chen & Yu IJHCS 53(5),2000)
• Conclusions:
– IR Infoviz studies not reported in a standard format
– Individual cognitive differences had the largest effect
• Especially on accuracy
• Somewhat on efficiency
– Holding cognitive abilities constant, users did better
with simpler visual-spatial interfaces
– The combined effect of visualization is not
statistically significant
Cha-Cha
• Chen, M., Hearst, M., Hong, J., and Lin, J. Cha-Cha: A System for
Organizing Intranet Search Results. In Proceedings of the 2nd USENIX
Symposium on Internet Technologies and Systems (USITS), Boulder, CO,
October 11-14, 1999
Flamenco
Design Goals
– Enhance features that help the user decide whether
document is relevant to their query
• Emphasize text that is relevant to query
– Text callouts
• Enlarge (make readable) text that might be
helpful in assessing page
– Enlarge headers
• Text summaries
– Lots of abstract, semantic information
• Image summaries (plain thumbnails)
– Layout, genre information
– Gist extraction faster than with text
• Benefits are complementary
• Create textually-enhanced thumbnails that
leverage the advantages of both text
summaries and plain thumbnails
Design Issues:
• Color Management
– Problems: Callouts need to be both readable and
draw attention
– Solution: Desaturate the background image, and use
a visual search model to choose appropriate colors
– Colors look like those in highlighter pens
• Resizing of Text
– Problem: We want to make certain text elements
readable, but not necessarily draw attention to them
– Solution: Modify the HTML before rendering the
thumbnail
Tasks
• Text summary
– Page title
– Extracted text with
query terms in bold
– URL
• Plain thumbnail
• Enhanced thumbnail
– Readable H1, H2 tags
– Highlighted callouts of
query terms
– Reduced contrast level
in thumbnail
Collections of Summaries
• Procedure
– 6 practice tasks
– 3 questions for each of the 4 task types
• e.g., each participant would do one E-commerce
question using text, one E-commerce question using
plain thumbnails, and one E-commerce question using
enhanced thumbnails
– Questions blocked by type of summary
– WebLogger recorded user actions during browsing
– Semi-structured interview
• Participants
– 12 members of the PARC community
Results
• Average total search times, by task:
– Picture: 61 secs
– Homepage: 80 secs
– E-commerce: 64 secs
– Side effects: 128 secs
• Results pooled across all tasks:
– Subjects searched 20 seconds faster with enhanced
thumbnails than with plain
– Subjects searched 30 seconds faster with enhanced
thumbnails than with text summaries
– Mean search time overall was 83 seconds
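As a quick arithmetic check, the overall mean reported above is consistent with the four per-task averages if each task type carries equal weight. Equal weighting is an assumption here, though the procedure's three questions per task type makes it plausible.

```python
# Average total search times by task, in seconds, as listed above.
task_means = {
    "Picture": 61,
    "Homepage": 80,
    "E-commerce": 64,
    "Side effects": 128,
}

# Unweighted mean across the four task types (equal weighting assumed).
overall_mean = sum(task_means.values()) / len(task_means)
print(f"Overall mean: {overall_mean:.2f} s (rounds to {round(overall_mean)} s)")
```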
Comparing Approaches
Eureka (InXight)
SpotFire
Datasets
Experiment Design
• Answer correctness:
• Infozoom users: 68%
• Spotfire users: 75%
• Eureka users: 71%
• Not a time-error tradeoff
• Spotfire more accurate on only 6 questions
Eureka - problems
Infozoom - problems
• Erroneous Correlations
• Overview mode has all attributes sorted
independent of each other
Discussion
Information Exploration
“Shootout”
• http://ivpr.cs.uml.edu/shootout/about.html
• Data Mining Applications
• One component focuses on visualization
T. Barlow and P. Neville, Comparison of 2D Visualizations of Hierarchies, INFOVIS’01
Usability Test 1:
• Users:
– 15 colleagues familiar with org chart but not others
• Tasks
– Is the tree binary or n-ary?
– Is the tree balanced or unbalanced?
– Find deepest common ancestor of two nodes
– Number of levels?
– Find three largest leaves (excluding org chart)
• Response Time
– TreeMap slowest; no statistical difference between
others
• Response Accuracy
– No significant difference
• User Preference
– Prefer icicle map and org chart (faster)
– Dislike tree map
Discussion
• Three views:
– TreeMap eliminated from this round
• Tasks
– Node Description
• Four versions – select those nodes or leaves that meet
certain criteria
– Node Analysis:
• Memorize a highlighted node – find again after tree
redrawn in different position
Results
Visualizing Conversations
Chat Circles
Chat Circles –
Conversational Groupings
Design Exercise
• BreakingStory
(Reffel, Fitzpatrick, Ayedelott SIMS final project, at CHI 2003)
– Create an application that supplies a visualization for
trends over time in web-based news. The primary
purpose is to provide an overview, but it should also
be possible to view text from individual news sources
on specific days. Its goal is to inform, inspire, and
enlighten, and also to make people want to look
more deeply at the news.
Oracle BI Success
Agenda
Oracle’s EPM Vision: Extend Operational
Excellence to Management Excellence
Competitive
Advantage
MANAGEMENT EXCELLENCE
OPERATIONAL EXCELLENCE
Time
SMART
AGILE
ALIGNED
Oracle’s EPM System
EPM Workspace
Performance Management
BI Applications
Applications
Fusion Middleware
OLTP & ODS Data Warehouse OLAP SAP, Oracle, Siebel, Excel Business
Systems Data Mart PeopleSoft, Custom XML Process
Oracle BI Applications
Overview
What Gartner Is Saying
“The majority of customers are purchasing and
implementing BI and CPM as disparate point solutions,
which weaken their ability to achieve pervasive BI or to
link BI platform and CPM suites capabilities into an
integrated continuum to drive business transformation
from the strategic level to the process level”
Source: Employ a Coordinated Approach to BI and CPM, April 2007
Comprehensive BI Applications
EPM Workspace
BI APPLICATIONS
Performance Management
Sales Contact Center Procurement & Spend Finance
Applications Service Marketing Supply Chain & Order Mgmt HR
Fusion Middleware
OLTP & ODS Data Warehouse OLAP SAP, Oracle, Siebel, Excel Business
Systems Data Mart PeopleSoft, Custom XML Process
Enabling the Insight-Driven Enterprise
Oracle BI Applications
Complete, Pre-built, Best Practice Analytics
Industries: Auto; Comms & Media; Complex Mfg; Consumer Sector; Energy; Financial Services; High Tech; Insurance & Health; Life Sciences; Public Sector; Travel & Trans
Up-sell/ Service Rep Loyalty & Supplier Customer Cash Flow Workforce
Cross-sell Efficiency Attrition Performance Status Profile
Cycle Times Service Cost Market Basket Purchase Order Profitability Retention
Analysis Cycle Time Cycle Time Analysis
Oracle Marketing Analytics Provides Insight
to Optimize Spending and Drive Demand
ANALYSIS & METRICS
Marketing Planning
• Sales alignment
• Executive scorecard report
• Competitor pipeline
• Expense analysis by time
• Forecast & Actual expenses by time
• Financial information on marketing tactics
Marketing Performance
• Campaign scorecard
• Cumulative revenue trend
• Campaign trends
• Oppty revenue by product
• Campaign pipeline
• Demographics profile of responders
• Cross sell analysis
Customer Insight (B2B)
• Account revenue
• Market basket analysis
• Revenue growth
• Account attrition
• Account status
• Next product purchased
• # of new accounts
• Over promoted customers
Customer Insight (B2C)
• Income/Age range
• # of customer interactions
• Customer counts
• # of new contacts
• Contact attrition
Events
• Top events ranking
• Events by region/type
• Event expenses
• Events lead generation
• Event scorecard
• Opportunity revenue
BENEFITS
• Monitor campaign performance to take timely corrective action to improve efficiencies
• Make intelligent resource allocations based on effectiveness of tactics
• Track expenses and reduce wasted spend
• Increase customer profitability with better buyer behavior insight
• Improve cross-selling
Campaign Performance
Provides Campaign Results data by Offer,
Segment, Agent performance. Manager can
monitor a campaign scorecard and identify root
causes for shortfalls in meeting predicted goals
Customer Insight
Provides product affinity, market basket and next product purchased
analysis. Provides demographic information and information on
impact of customer behavior due to marketing activities.
Marketing Events Analytics
Provides Analytics related to management of trade
shows, customer events etc. Marketing Events
Analytics can show analysis of Event registrations,
expenses on supplies by vendor, region, event etc.,
Event ROI analysis that is fully integrated with
Marketing Planning Analytics.
Complementary BI Applications
Complete Solution for entire Campaign-to-Cash Process
Sales Analysis
• Analyze pipeline opportunities and forecasts to
determine actions required to meet sales targets.
• Determine which products and customer segments
generate the most revenue and how to effectively
cross-sell and up-sell.
• Understand which competitors are faced most often
and how to win against them.
• Leverage
Role-Based Best Practices Provide Relevant
and Actionable Insight for Everyone
Marketing Analytics – Key Objectives and Questions by Role
Optimizing Marketing Performance for Competitive Advantage
VP Marketing / CMO
• How is the marketing budget being consumed?
• How should I allocate the marketing budget to generate the best results?
• What areas / programs are trending to go over budget?
• What areas historically have yielded the best results?
MARCOM / DM Manager
• Do purchased lists perform better than our house list and why?
• Which marketing campaigns generated the most qualified leads?
• Is the sales organization picking up the leads in a timely manner?
• Which programs / campaigns yield the highest take rate?
Campaign
Sales Execution
Management
Core
Event Management
Management
Processes
Planning &
Budgeting
Processes
Support
Needs
Analysis
Example Response and Lead Management Process
Marketing
Track,
measure
Marketing/Sales results
Operations
Cleanse,
Capture, Assign
enrich, Assign
load oppty,
score leads
responses notify rep
responses
Sales Development
Call Center
Qualify
leads,
create
opptys
Campaign to Cash Flow Example Decision Flow
Marketing Executive Role
Business Marketing Planning &
Objective Execution
Are we on target
to meet our goals?
Gain
Insights
Are we generating quality
leads from these
campaigns?
Campaign to Cash Flow Example, MARCOM Manager
Gain
If the campaign flow
Insights was phased, would it For this product and
be more effective? target, which channel
When is best time to works best?
launch marketing
events?
What is inferred
leads generated?
Take Design the flow Build program flow
Action
Identify similar
tactics. Which
customers did I
target in those Exclude people from
Gain What is the
campaigns? the list based on the
Insights demographic profile
contact frequency
of those customers /
and their preferences
prospects?
Campaign to Cash Flow Example, MARCOM Manager
When is last
campaign Have all treatment
completing? / media been
approved?
Campaign to Cash Flow Example, MARCOM Manager
• Insight
Typical Operational Challenges
Analyses, Reports
Executives
Data
IT Warehouse
Customers Suppliers
Procurement
Distribution
Operations
Marketing
Finance
Service
Sales
Customers Suppliers
Customers Suppliers
Maximizing Customer Value
Key Objectives in Sales, Marketing, Customer Service
SALES
• How are actual sales tracking against forecast and plan by region?
• What are the best products to cross and up-sell?
• Why are sales opportunities being lost?
Deeper Insight within Business
Functions
SALES ANALYTICS CONTACT CENTER ANALYTICS
• Improve pipeline visibility • Understand service cost drivers
• Forecast with confidence • Optimize staffing for call volumes
• Increase cross & up-selling • Monitor CSR performance & drivers
• Quickly spot opportunities/threats • Detect defects early
• Devise marketing
programs to deflect
product availability or
MARKETING ANALYTICS quality issues SERVICE ANALYTICS
• Understand how
marketing promotions
impact service centers
Alignment across the Enterprise
Sales Financial
• Impact of product mix and discounts
Analytics Analytics
on revenue and margins
Key Benefits of Oracle BI Applications
• Insight
• Alignment
Marketing Analytics Components
1 Pre-built warehouse with 15 star-schemas designed for analysis and reporting on Marketing data
2 Pre-built ETL to extract data from over 1,000 operational tables and load it into the DW, sourced from CRM systems and other sources
3 Pre-mapped metadata defining real-time access to analytical and operational sources, best practice calculations, and metrics for marketing.
➢ Presentation Layer
➢ Logical Business Model
➢ Physical Sources
4 A “best practice” library of over 500 pre-built metrics, Intelligence Dashboards, Reports and alerts for marketing analysts, managers and executives.
DASHBOARDS & REPORTS
• Prebuilt best
practice library
• “One size does
NOT fit all”
SUBJECT AREAS
• Many metrics and dimensional
attributes not surfaced by prebuilt
dashboards and reports
• Possibilities are endless
• Incremental work to build tons
more content from this foundation
Unrivaled Integration with Oracle Apps
Extends BI Value. Lowers TCO.
ACTION LINKS – “INSIGHT TO ACTION” INTEGRATED SECURITY
Seamless navigation from analytical information One login. Right content for each user.
to transactional detail
Unrivaled Integration with Oracle Apps
Deeply Integrated into Siebel CRM
log in
Oracle BI
user show data based on
security group filters
3
CONDITIONAL NAVIGATION
Speeds Time To Value and Lowers TCO
Build from Scratch
Oracle BI Applications
with Traditional BI Tools
Training / Roll-out
Results
Define Metrics • Faster time to value
& Dashboards
• Lower TCO
• Assured business value
DW Design
Degree of Level of
Customization Effort
The Value is Below the Surface
Oracle BI Applications
• Dashboards
Selected Key Entities of Business Analytics
Warehouse
Sales
▪ Opportunities
▪ Quotes
▪ Pipeline
Order Management
▪ Sales Order Lines
▪ Sales Schedule Lines
▪ Bookings
▪ Pick Lines
▪ Billings
▪ Backlogs
Marketing
▪ Campaigns
▪ Responses
▪ Marketing Costs
Supply Chain
▪ Purchase Order Lines
▪ Purchase Requisition Lines
▪ Purchase Order Receipts
▪ Inventory Balance
▪ Inventory Transactions
Finance
▪ Receivables
▪ Payables
▪ General Ledger
▪ COGS
Call Center
▪ ACD Events
▪ Rep Activities
▪ Contact-Rep Snapshot
▪ Targets and Benchmark
▪ IVR Navigation History
Service
▪ Service Requests
▪ Activities
▪ Agreements
Workforce
▪ Compensation
▪ Employee Profile
▪ Employee Events
Pharma
▪ Prescriptions
▪ Syndicated Market Data
Financials
▪ Financial Assets
▪ Insurance Claims
Public Sector
▪ Benefits
▪ Cases
▪ Incidents
▪ Leads
Conformed Dimensions
▪ Customer
▪ Products
▪ Suppliers
▪ Internal Organizations
▪ Customer Locations
▪ Customer Contacts
▪ GL Accounts
▪ Employee
▪ Sales Reps
▪ Service Reps
▪ Partners
▪ Campaign
▪ Offers
▪ Cost Centers
▪ Profit Centers
Modular DW Data Warehouse Data Model includes:
~350 Fact Tables
~550 Dimension Tables
~3,500 prebuilt Metrics (2,000+ are derived metrics)
~15,000 Data Elements
© 2008 Oracle Corporation – Proprietary and Confidential
Rapid Deployments
Oracle BI Applications
6 weeks
9 weeks
10 weeks
12 weeks
3 months
3½ months
100 days
Customer Success
Holistic View of Customer Information Enables
Alignment of Marketing and Service
World’s leading manufacturer and marketer of major home
appliances. Deployed Oracle BI Suite EE and Oracle Marketing
Analytics integrated with Siebel CRM Call Center Application.
Before
▪ No centralized customer view
▪ Multiple siloed customer data sources hampered marketing abilities
▪ Slow time-to-market with marketing campaigns despite millions spent on outside vendors
▪ Call center unable to effectively use customer data to enhance service or capitalize on sales opportunities
After
▪ Companywide, holistic view of information by customer, household and asset
▪ Consolidated 3 customer databases into 1
▪ Accelerated marketing campaign introductions to capitalize on trends
▪ Provided call centers with information and tools to up-sell customers and establish “closed loop” marketing capabilities
“With Oracle, Whirlpool business units are capitalizing on the integration between our
business intelligence, call center, and marketing solutions to drive revenue creation
and customer loyalty incentives.” - Thomas Mender, Whirlpool Corporation
Before After
“One of the most important values of Oracle’s BI solution is its TCO. We created 400
reports used by 1,250 users with a staff of one within a few months—that is very
cost effective.” – William Duffy, Data Warehousing Project Manager
Q&A