
2/25/2021

Business Intelligent
Analytics


What is Data Analytics?

Analytics is the use of:


data,
information technology,
statistical analysis,
quantitative methods, and
mathematical or computer-based models
to help managers gain improved insight about their
business operations and make better, fact-based
decisions.
Business Analytics (BA) is a subset of Data Analytics

What is Business Analytics?

Business Analytics Applications


Management of customer relationships
Financial and marketing activities
Supply chain management
Human resource planning
Pricing decisions
Sport team game strategies


What is Business Analytics?

Importance of Business Analytics


There is a strong relationship of BA with:
- profitability of businesses
- revenue of businesses
- shareholder return
BA enhances understanding of data
BA is vital for businesses to remain competitive
BA enables creation of informative reports

Scope of Business Analytics

Descriptive analytics
- uses data to understand past and present
Predictive analytics
- analyzes past performance to predict the future
Prescriptive analytics
- uses optimization techniques


Scope of Business Analytics

Retail Markdown Decisions


Most department stores clear seasonal inventory by
reducing prices.
The question is:
When to reduce the price and by how much?
Descriptive analytics: examine historical data for
similar products (prices, units sold, advertising, …)
Predictive analytics: predict sales based on price
Prescriptive analytics: find the best sets of pricing and
advertising to maximize sales revenue

Data for Business Analytics

DATA
- collected facts and figures
DATABASE
- collection of computer files containing data
INFORMATION
- comes from analyzing data


Data for Business Analytics

Metrics are used to quantify performance.


Measures are numerical values of metrics.
Discrete metrics involve counting
- on time or not on time
- number or proportion of on time deliveries
Continuous metrics are measured on a continuum
- delivery time
- package weight
- purchase price

Data for Business Analytics

A Sales Transaction Database File

Records

Figure 1.1

Entities Fields or Attributes


What is Big Data?


• Information from multiple internal and external sources:
• Transactions
• Social media
• Enterprise content
• Sensors
• Mobile devices
• Companies leverage data to adapt products and services to:
• Meet customer needs
• Optimize operations
• Optimize infrastructure
• Find new sources of revenue
• Can reveal more patterns and anomalies
• IBM estimates that by 2015 4.4 million jobs will be created globally to
support big data
• 1.9 million of these jobs will be in the United States


Types of Data


• When gathering data, we collect it from individual cases on particular variables.
• A variable is a unit of data collection whose value
can vary.
• Variables can be defined into types according to the
level of mathematical scaling that can be carried out
on the data.
• There are four types of data or levels of
measurement:
1. Categorical (Nominal) 2. Ordinal
3. Interval 4. Ratio


Data for Business Analytics

Classifying Data Elements in a Purchasing Database


Data for Business Analytics

(continued)
Classifying Data Elements in a Purchasing Database

Figure 1.2


Categorical (Nominal) data


• Nominal or categorical data comprises categories that cannot be rank ordered – each category is just different.
• The categories cannot be placed in any order, and no judgement can be made about the relative size or distance from one category to another.
• Categories bear no quantitative relationship to one another
• Examples:
- customer’s location (America, Europe, Asia)
- employee classification (manager, supervisor, associate)
• What does this mean? No mathematical operations can be performed on the data relative to each other.
• Therefore, nominal data reflect qualitative differences rather than quantitative ones.


Nominal data
Examples:

What is your gender? (please tick)
Male
Female

Did you enjoy the film? (please tick)
Yes
No


Nominal data

• Systems for measuring nominal data must ensure that each category is mutually exclusive and that the system of measurement is exhaustive.
• Variables that have only two responses, e.g. Yes or No, are known as dichotomies.


Ordinal data

• Ordinal data comprises categories that can be rank ordered.
• As with nominal data, the distance between categories cannot be calculated, but the categories can be ranked above or below each other.
• No fixed units of measurement
• Examples:
- college football rankings
- survey responses (poor, average, good, very good, excellent)
• What does this mean? Can make statistical judgements and perform limited maths.


Ordinal data
Example:
How satisfied are you with the level of service you have
received? (please tick)

Very satisfied

Somewhat satisfied

Neutral

Somewhat dissatisfied

Very dissatisfied


Interval and ratio data


• Both interval and ratio data are examples of scale data.
• Scale data:
• data is in numeric format ($50, $100, $150)
• data that can be measured on a continuous scale
• the distance between values can be observed and therefore measured
• the data can be placed in rank order.


Interval data

• Ordinal data but with constant differences between observations
• Ratios are not meaningful
• Examples:
• Time – moves along a continuous measure of seconds, minutes and so on and is without a zero point of time.
• Temperature – moves along a continuous measure of degrees and is without a true zero.
• SAT scores


Ratio data
• Ratio data is measured on a continuous scale and does have a natural zero point.
• Ratios are meaningful
• Examples:
• monthly sales
• delivery times
• weight
• height
• age


Types of Analytics


Decision Models

Model:
An abstraction or representation of a real system,
idea, or object
Captures the most important features
Can be a written or verbal description, a visual
display, a mathematical formula, or a
spreadsheet representation


Decision Models

Figure 1.3


Decision Models

A decision model is a model used to understand, analyze, or facilitate decision making.
Types of model input
- data
- uncontrollable variables
- decision variables (controllable)


Decision Models

Descriptive Decision Models

• Simply tell “what is” and describe relationships
• Do not tell managers what to do
An Influence Diagram for Total Cost


Descriptive Analytics

• Descriptive analytics, such as reporting/OLAP, dashboards, and data visualization, have been widely used for some time.
• They are the core of traditional BI.

What has occurred?

Descriptive analytics, such as data visualization, is important in helping users interpret the output from predictive and prescriptive analytics.


Decision Models

A Break-even Decision Model


TC(manufacturing) = $50,000 + $125*Q
TC(outsourcing) = $175*Q
Breakeven Point:
Set TC(manufacturing) = TC(outsourcing)
$50,000 + $125*Q = $175*Q
Solve for Q = 1000 units
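
The break-even calculation can be checked with a short Python sketch. The cost figures come from the slide; the constant and function names are illustrative assumptions, not part of the original material.

```python
# Cost figures are from the slide; all names here are illustrative.
FIXED_COST = 50_000      # fixed cost of in-house manufacturing ($)
UNIT_COST_MAKE = 125     # variable cost per unit when manufacturing ($)
UNIT_COST_BUY = 175      # cost per unit when outsourcing ($)


def total_cost_manufacturing(q):
    """TC(manufacturing) = $50,000 + $125*Q."""
    return FIXED_COST + UNIT_COST_MAKE * q


def total_cost_outsourcing(q):
    """TC(outsourcing) = $175*Q."""
    return UNIT_COST_BUY * q


# Setting the two totals equal: 50,000 + 125*Q = 175*Q, so Q = 50,000 / (175 - 125).
breakeven_q = FIXED_COST / (UNIT_COST_BUY - UNIT_COST_MAKE)
print(breakeven_q)  # 1000.0
```

Below 1000 units outsourcing has the lower total cost in this model; above 1000 units manufacturing does.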


Decision Models

• Predictive Decision Models often incorporate uncertainty to help managers analyze risk.
• Aim to predict what will happen in the future.
• Uncertainty is imperfect knowledge of what will
happen in the future.
• Risk is associated with the consequences of what
happens.


Predictive Analytics

• Algorithms for predictive analytics, such as regression analysis, machine learning, and neural networks, have also been around for some time.
• Predictive analytics are often referred to as advanced analytics.

What will occur?


• Marketing is the target for many predictive analytics applications.
• Descriptive analytics, such as data visualization, is important in helping users
interpret the output from predictive and prescriptive analytics.


Decision Models

A Linear Demand Prediction Model


As price increases, demand falls.

Figure 1.8


Decision Models

A Nonlinear Demand Prediction Model


Assumes price elasticity (constant ratio of % change in demand to % change
in price)

Figure 1.9


Decision Models

Prescriptive Decision Models help decision makers identify the best solution.
• Optimization - finding values of decision variables that minimize (or maximize) something such as cost (or profit).
• Objective function - the equation to be minimized (or maximized), expressing the quantity of interest.
• Constraints - limitations or restrictions.
• Optimal solution - values of the decision variables at the minimum (or maximum) point.


Prescriptive Analytics

• Prescriptive analytics are often referred to as advanced analytics.


• Regression analysis, machine learning, and neural networks
• Often for the allocation of scarce resources

What should occur?

• For example, the use of mathematical programming for revenue management is common for
organizations that have “perishable” goods (e.g., rental cars, hotel rooms, airline seats).
• Harrah’s has been using revenue management for hotel room pricing for some time.


Organizational Transformation

• Brought about by opportunity or necessity


• The firm adopts a new business model
enabled by analytics
• Analytics are a competitive requirement


2013 Academic Research

• A 2011 TDWI report on Big Data Analytics found that 85% of respondents indicated that their firms would be using advanced analytics within three years

• A 2011 IBM/MIT Sloan Management Review research study found that top performing companies in their industry are much more likely to use analytics rather than intuition across the widest range of possible decisions.


Conditions that Lead to Analytics-based Organizations

• The nature of the industry
• Seizing an opportunity
• Responding to a problem


Complex Systems
• Tackle complex problems and provide individualized solutions
• Products and services are organized around the needs of
individual customers
• Dollar value of interactions with each customer is high
• There is considerable interaction with each customer
• Examples: IBM, World Bank, Halliburton


Volume Operations
• Serves high-volume markets through standardized
products and services
• Each customer interaction has a low dollar value
• Customer interactions are generally conducted
through technology rather than person-to-person
• Are likely to be analytics-based
• Examples: Amazon.com, eBay, Hertz


The Nature of the Industry: Online Retailers
BI Applications
• Analysis of clickstream data
• Customer profitability analysis
• Customer segmentation analysis
• Product recommendations
• Campaign management
• Pricing
• Forecasting
• Dashboards


The Nature of the Industry


• Online retailers like Amazon.com and Overstock.com are high-volume operations that rely on analytics to compete.
• When you enter their sites a cookie is placed on your PC and
all clicks are recorded.
• Based on your clicks and any search terms, recommendation
engines decide what products to display.
• After you purchase an item, they have additional information
that is used in marketing campaigns.
• Customer segmentation analysis is used in deciding what
promotions to send you.
• How profitable you are influences how the customer care
center treats you.
• A pricing team helps set prices and decides what prices are
needed to clear out merchandise.
• Forecasting models are used to decide how many items to
order for inventory.
• Dashboards monitor all aspects of organizational performance


Analytics Help the Cincinnati Zoo Know Its Customers


• What management, organization, and technology factors were behind
the Cincinnati Zoo losing opportunities to increase revenue?
• Why was replacing legacy point-of-sale systems and implementing a
data warehouse essential to an information system solution?
• How did the Cincinnati Zoo benefit from business intelligence? How
did it enhance operational performance and decision making? What
role was played by predictive analytics?
• Visit the IBM Cognos Web site and describe the business intelligence
tools that would be the most useful for the Cincinnati Zoo.


Thank you for your attention!

3/4/2021

Decision Analysis


Decision Analysis
• Effective decision-making requires that we understand:
• The values, goals, and objectives that are relevant to the decision
problem
• The areas of uncertainty that affect the decision
• The consequences of each possible decision

A Decision-Making Model
Consider the following five-part decision-making model:

Identify Problem → Formulate and Implement Model → Analyze Model → Test Results → Implement Solution

Most important

Unsatisfactory Results

Adapted from Winning Decisions, by Russo and Schoemaker


Anchoring and Framing


• Errors in judgment arise due to what psychologists term
anchoring and framing

Anchoring
• A seemingly trivial factor serves as a starting point
(anchor) for estimations
• Decision makers adjust their estimates but remain too
close to the anchor.


Framing
• Affects how a decision maker perceives the alternatives in
a decision – often involves a win/loss perspective
• The way the problem is framed influences the choices

Framing Example
• You have been given $1,000 but must choose between the following
alternatives:
a) Receive an additional $500 with certainty or
b) Flip a coin and receive an additional $1,000 if heads or $0 if tails
➔ a) is a sure win and the choice most people prefer
• Now suppose you are given $2,000 and must choose between
a) Give back $500 immediately or
b) Flip a coin and give back $0 if heads or $1,000 if tails
➔ When framed this way, alternative a) is a “sure loss” and many people who
previously preferred alternative a) now opt for alternative b) (because it holds a
chance of avoiding a loss)
• However, in both framings alternative a) guarantees a total payoff of $1,500, whereas alternative b) offers a 50% chance of a $2,000 total payoff and a 50% chance of a $1,000 total payoff.
• A rational decision maker should focus on the consequences of his/her
choices and consistently select the same alternative, regardless of how the
problem is framed
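
The equivalence of the two framings can be verified with a small Python sketch. The dollar amounts come from the example above; the function name and its parameters are hypothetical, introduced only for illustration.

```python
def total_payoffs(endowment, sure_change, gamble_changes):
    """Return (total for the sure alternative, sorted totals for the coin flip).

    The two coin-flip outcomes are assumed to be equally likely (a fair coin).
    """
    sure_total = endowment + sure_change
    gamble_totals = sorted(endowment + change for change in gamble_changes)
    return sure_total, gamble_totals


# "Gain" framing: given $1,000, then +$500 for sure, or +$1,000 / +$0 on a coin flip.
gain_frame = total_payoffs(1000, +500, [+1000, 0])
# "Loss" framing: given $2,000, then -$500 for sure, or -$0 / -$1,000 on a coin flip.
loss_frame = total_payoffs(2000, -500, [0, -1000])

print(gain_frame)  # (1500, [1000, 2000])
print(loss_frame)  # (1500, [1000, 2000])
```

Both framings lead to identical total payoffs, which is why the choice should not depend on how the problem is worded.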


A Framing Example
“Careful analysis at a major U.S. steel company showed it
could save hundreds of thousands of dollars per year by
replacing its hot-metal mixing technology, which required that
metal be heated twice, with direct-pouring technology, in
which the metal was only heated once. But the move was
approved only after considerable delay because senior
engineers complained that the analysis did not include the cost
of the hot metal mixers that had been purchased for $3 million
just a few years previously.”

How would you critique this decision?

Adapted from Winning Decisions, by Russo and Schoemaker

Decision Trees and Decision Tables
• Often our problem solutions require decisions to be made
according to two or more conditions or combinations of
conditions
• Decision trees represent such decisions as a sequence of steps
• Decision tables describe all possible combinations of
conditions and the decision appropriate to each combination
• Levels of uncertainty can also be built into decision trees to
account for the relative probabilities of the various outcomes


Decision Tree Showing Possible Outcomes

Columns: Outcomes | Projected Sales Results

Decision Made
    Outcome A
        A.1 Sales Up 10%
        A.2 Sales Up 15%
    Outcome B
        B.1 Sales Up 5%
        B.2 Sales Even


Decision Tree of Outcomes -- Quantifying Uncertainties


Decision Made
    Outcome A (40%)
        A.1 Sales Up 10% (80%)    Projected Sales Result: 32%
        A.2 Sales Up 15% (20%)    Projected Sales Result: 8%
    Outcome B (60%)
        B.1 Sales Up 5% (70%)     Projected Sales Result: 42%
        B.2 Sales Even (30%)      Projected Sales Result: 18%
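
The percentages in the right-hand column are the products of the probabilities along each path. A minimal Python sketch, using the figures from the slide (the dictionary layout and variable names are assumptions made for illustration):

```python
# Probabilities are taken from the slide; the data structure is illustrative.
tree = {
    "Outcome A": (0.40, {"A.1 Sales Up 10%": 0.80, "A.2 Sales Up 15%": 0.20}),
    "Outcome B": (0.60, {"B.1 Sales Up 5%": 0.70, "B.2 Sales Even": 0.30}),
}

joint = {}
for outcome, (p_outcome, results) in tree.items():
    for result, p_result in results.items():
        # Probability of reaching this leaf = P(outcome) * P(result | outcome).
        joint[result] = round(p_outcome * p_result, 4)

for result, p in joint.items():
    print(f"{result}: {p:.0%}")
```

The four leaf probabilities add up to 100%, which is a quick sanity check on any tree of this kind.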


An Example


Evaluating Your Decision Tree
Start by assigning a cash value or score to each possible outcome. Estimate how much you think it would be worth to you if that outcome came about.
Next look at each circle (representing an uncertainty point) and estimate the probability of each outcome. If you use percentages, the total must come to 100% at each circle. If you use fractions, these must add up to 1. If you have data on past events, you may be able to make rigorous estimates of the probabilities. Otherwise write down your best guess.


Calculating Tree Values


• Once you have worked out the value of the outcomes, and have assessed the probability of the outcomes of uncertainty, it is time to start calculating the values that will help you make your decision.
• Start on the right hand side of the decision tree, and work back towards the left. As you complete a set of calculations on a node (decision square or uncertainty circle), all you need to do is to record the result. You can ignore all the calculations that lead to that result from then on.


Calculating The Value of Uncertain Outcome Nodes

Calculating the value of uncertain outcomes (circles on the diagram): Multiply the value of each outcome by its probability. The total for that node of the tree is the sum of these values. In the example in Figure 2, the value for 'new product, thorough development' is:
0.4 (probability good outcome) x $1,000,000 (value) = $400,000
0.4 (probability moderate outcome) x $50,000 (value) = $20,000
0.2 (probability poor outcome) x $2,000 (value) = $400
Total = $420,400
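
The same calculation as a short Python sketch. The probabilities and values are those quoted for 'new product, thorough development'; the variable names are illustrative.

```python
# (probability, value in $) pairs for 'new product, thorough development',
# as quoted in the text: good, moderate, and poor outcomes.
outcomes = [
    (0.4, 1_000_000),
    (0.4, 50_000),
    (0.2, 2_000),
]

# The probabilities at an uncertainty node must add up to 1.
total_probability = sum(p for p, _ in outcomes)

# Node value = sum of probability * value over all outcomes.
node_value = sum(p * value for p, value in outcomes)
print(round(node_value))  # 420400
```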


Calculating the Value of Decision Nodes

When evaluating a decision node, write down the cost of each option along each decision line. Then subtract the cost from the outcome value that you have already calculated.
This will give you a value that represents the benefit of that decision. Note that amounts already spent do not count for this analysis – these are 'sunk costs' and (despite emotional counter-arguments) should not be factored into the decision.
When you have calculated these decision benefits, choose the option that has the largest benefit, and take that as the decision made. This is the value of that decision node.
Figure 4 shows this calculation of decision nodes in our example:
In this example, the benefit we previously calculated for 'new product, thorough development' was $420,400. We estimate the future cost of this approach as $150,000. This gives a net benefit of $270,400. The net benefit of 'new product, rapid development' was $31,400. On this branch we therefore choose the most valuable option, 'new product, thorough development', and allocate this value to the decision node.
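
A minimal Python sketch of the decision-node step, using only the figures stated above. For 'new product, rapid development' the text gives just the net benefit, so that number is used directly; the variable names are illustrative.

```python
# Net benefit = outcome value already calculated - future cost of the option.
net_benefit = {
    "new product, thorough development": 420_400 - 150_000,
    # Only the net figure is stated in the text for this option.
    "new product, rapid development": 31_400,
}

# The decision node takes the option with the largest net benefit.
best_option = max(net_benefit, key=net_benefit.get)
decision_node_value = net_benefit[best_option]

print(best_option, decision_node_value)
```

Sunk costs never appear in this dictionary, which mirrors the rule that amounts already spent are left out of the comparison.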


Result

By applying this technique, we can see that the best option is to develop a new product. It is worth much more to us to take our time and get the product right than to rush the product to market.
However, if we decide to consolidate, it is better just to improve our existing products than to botch a new product.


Example of Using a Decision Tree or Table to Capture Complex Business Logic
Consider the following excerpt from an actual business
document:

If the customer account is billed using a fixed rate method, a minimum monthly charge is assessed for consumption of less than 100 kwh. Otherwise, apply a schedule A rate structure.
However, if the account is billed using a variable rate method, a schedule A
rate structure will apply to consumption below 100 kwh, with additional
consumption billed according to schedule B.


Articulating Complex Business Rules


◼ Complex business rules, such as our example, can become
rather confusing
◼ Capturing such rules in text form alone can lead to ambiguity
and misinterpretation
◼ As an alternative, it is often wise to capture such rules in decision trees or decision tables
◼ The examples on the following slides will illustrate this
technique


Decision Tree for this Example


fixed rate billing
    < 100 kwh     minimum charge
    >= 100 kwh    schedule A
variable rate billing
    < 100 kwh     schedule A
    >= 100 kwh    ?

Decision Tree for this Example


fixed rate billing
    < 100 kwh     minimum charge
    >= 100 kwh    schedule A
variable rate billing
    < 100 kwh     schedule A
    >= 100 kwh    schedule A on first 99 kwh, schedule B on kwh 100 and above

Decision Table for Example – Version 1

                               Rules
Conditions                     1  2  3  4  5
Fixed rate act                 T  T  F  F  F
Variable rate act              F  F  T  T  F
Consumption < 100 kwh          T  F  T  F
Consumption >= 100 kwh         F  T  F  T
Actions
Minimum charge                 X
Schedule A                        X  X
Schedule A on first 99 kwh,             X
Schedule B on kwh 100 +

Note on rule 5: Is this a valid business case? Did we miss something?

Decision Table for Example – Version 2

                               Rules
Conditions                     1       2       3         4
Account type                   fixed   fixed   variable  variable
Consumption                    < 100   >= 100  < 100     >= 100

Actions
Minimum charge                 X
Schedule A                             X       X
Schedule A on first 99 kwh,                              X
Schedule B on kwh 100 +
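
One benefit of a decision table is that it maps almost directly onto code. The sketch below encodes the four rules of Version 2 as a lookup keyed on the two conditions. The function name `billing_action` and the action strings are illustrative assumptions, not part of the original business document.

```python
def billing_action(account_type, consumption_kwh):
    """Look up the billing action for an account, following the Version 2 table."""
    if account_type not in ("fixed", "variable"):
        # Corresponds to the "did we miss something?" case in Version 1.
        raise ValueError(f"unknown account type: {account_type!r}")

    below_100 = consumption_kwh < 100
    table = {
        ("fixed", True): "minimum charge",                        # rule 1
        ("fixed", False): "schedule A",                           # rule 2
        ("variable", True): "schedule A",                         # rule 3
        ("variable", False): "schedule A on first 99 kwh, "
                             "schedule B on kwh 100 +",           # rule 4
    }
    return table[(account_type, below_100)]


print(billing_action("fixed", 80))
print(billing_action("variable", 250))
```

Every combination of conditions has exactly one entry, so a missing or contradictory rule shows up as a missing or duplicate key rather than as ambiguous prose.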


Activity
Consider the following description of a company’s matching retirement
contribution plan:

Acme Widgets wants to encourage its employees to save for retirement.
To promote this goal, Acme will match an employee’s contribution to the
approved retirement plan by 50% provided the employee keeps the
money in the retirement plan at least two years. However, the company
limits its matching contributions depending on the employee’s salary and
time of service as follows. Acme will match five, six, or seven percent of
the first $30,000 of an employee's salary if he or she has been with the
company for at least two, five, or ten years respectively. If the employee
has been with the company for at least five years, the company will
match up to four percent of the next $25,000 in salary and three percent
of any excess. Ten-year plus workers get a five percent match from
$30,000 to $55,000. Long-term service employees (fifteen years or
more) get seven percent on the first $30,000 and five percent after that.


Team Activity (cont’d)


1) Do one of the following tasks according to your instructor’s directions:
a) Create a decision tree that captures the business rules in this
policy.
b) Create a decision table that captures the business rules in this
policy.
2) Did your analysis uncover any questions, ambiguities, or missing
rules?
3) If so, do you think these would be as easy to spot and to analyze
using only the narrative description of this policy?


Developing a More Complex Decision Table

Developing More Complex Decision Tables

In the 1950s, General Electric, the Sutherland Corporation, and the United States Air Force worked on a complex file maintenance project. Using flowcharts and traditional narratives, they spent six labor-years of effort but failed to define the problem.

It was not until 1958 that four analysts, using decision tables, successfully defined the problem in less than four weeks.1
_____________________________________
1 Taken from “A History of Decision Tables” located at http://www.catalyst.com/products/logicgem/overview.html

Steps to create a decision table


1. List all the conditions which determine which action to take.
2. Calculate the number of rules required.
• Multiply together the number of values for each condition.
• Example: Condition 1 has 2 values, Condition 2 has 2 values,
Condition 3 has 2 values. Thus 2 X 2 X 2 = 8 rules
3. Fill all combinations in the table.
4. Define the action for each rule
5. Analyze column by column to determine which actions are
appropriate for each rule.
6. Reduce the table by eliminating redundant columns.
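
Steps 2 and 3 can be sketched in Python: the rule count is the product of the number of values per condition, and the rule columns are the Cartesian product of those values. The condition names below are the placeholders from the example; `all_rules` is a hypothetical helper name.

```python
from itertools import product
from math import prod


def all_rules(conditions):
    """Return one dict per rule, covering every combination of condition values."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]


# Three two-valued conditions, as in the example: 2 x 2 x 2 = 8 rules.
conditions = {
    "Condition 1": ["Y", "N"],
    "Condition 2": ["Y", "N"],
    "Condition 3": ["Y", "N"],
}

rule_count = prod(len(values) for values in conditions.values())
rules = all_rules(conditions)

print(rule_count)   # 8
print(rules[0])     # {'Condition 1': 'Y', 'Condition 2': 'Y', 'Condition 3': 'Y'}
```

Generating the columns mechanically guarantees that no combination is skipped before the actions are filled in.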


All possible combinations

Conditions
  <cond-1>    F T F T F T F T … T
  <cond-2>    F F T T F F T T … T
  <cond-3>    F F F F T T T T … T
  …           …
  <cond-n>    F F F F F F F F … T
Actions
  <action-1>  X X X X
  <action-2>  X X X X
  <action-3>  X X X X
  …           …
  <action-m>  X X X

Actions per combination (each column represents a different situation)

Example
• Policy for charging charter flight customers for certain in-flight services:2
If the flight is more than half-full and costs more than $350 per
seat, we serve free cocktails unless it is a domestic flight. We
charge for cocktails on all domestic flights; that is, for all the ones
where we serve cocktails. (Cocktails are only served on flights that
are more than half-full.)
_____________________________________
2 Example taken from: Structured Analysis and System Specification, Tom de Marco, Yourdon Inc., New York,

List all the conditions that determine which action to take.

Conditions                            Values
Is the flight more than half-full?    Yes (Y), No (N)
Cost is more than $350?               Y, N
Is it a domestic flight?              Y, N

Calculate the space of combinations

Conditions   Number of Combinations   Possible Combinations/Rules
1            2                        Y N
2            4                        Y N Y N
                                      Y Y N N
3            8                        Y N Y N Y N Y N
                                      Y Y N N Y Y N N
                                      Y Y Y Y N N N N
…            …
n            2^n

Calculate the Number of Rules in Table

• The example has 3 conditions, all of them two-valued, hence we have:

All combinations are 2^3 = 8 rules OR
2 X 2 X 2 = 8 rules

Fill all rules in the table.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  N  Y  Y  N  N  Y  Y
  domestic flight          N  Y  N  Y  N  Y  N  Y
ACTIONS

Analyze column by column to determine which actions are appropriate for each combination

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  N  Y  Y  N  N  Y  Y
  domestic flight          N  Y  N  Y  N  Y  N  Y
ACTIONS
  serve cocktails                      X  X  X  X
  free                                       X

Reduce the table by eliminating redundant columns.

                           POSSIBLE COMBINATIONS
CONDITIONS
  more than half-full      N  N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  N  Y  Y  N  N  Y  Y
  domestic flight          N  Y  N  Y  N  Y  N  Y
ACTIONS
  serve cocktails                      X  X  X  X
  free                                       X

Note that some columns are identical except for one condition.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  N  Y  Y  N  N  Y  Y
  domestic flight          N  Y  N  Y  N  Y  N  Y
ACTIONS
  serve cocktails                      X  X  X  X
  free                                       X

Note that some columns are identical except for one condition.
Which means that actions are independent from the value of that particular condition.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  N  Y  Y  N  N  Y  Y
  domestic flight          N  Y  N  Y  N  Y  N  Y
ACTIONS
  serve cocktails                      X  X  X  X
  free                                       X

Note that some columns are identical except for one condition.
Which means that actions are independent from the value of that particular condition.
Hence, the table can be simplified.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  N  Y  Y  Y  Y
  more than $350 per seat  N  Y  Y  N  N  Y  Y
  domestic flight          -  N  Y  N  Y  N  Y
ACTIONS
  serve cocktails                   X  X  X  X
  free                                    X

First we combine the yellow ones nullifying the condition.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  Y  Y  Y  Y
  more than $350 per seat  N  Y  N  N  Y  Y
  domestic flight          -  -  N  Y  N  Y
ACTIONS
  serve cocktails                X  X  X  X
  free                                 X

First we combine the yellow ones nullifying the condition.
Then the red ones.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  N  Y  Y  Y  Y
  more than $350 per seat  N  Y  N  N  Y  Y
  domestic flight          -  -  N  Y  N  Y
ACTIONS
  serve cocktails                X  X  X  X
  free                                 X

First we combine the yellow ones nullifying the condition.
Then the red ones.
Notice that yellow and red columns are identical except for one condition.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  Y  Y  Y  Y
  more than $350 per seat  -  N  N  Y  Y
  domestic flight          -  N  Y  N  Y
ACTIONS
  serve cocktails             X  X  X  X
  free                              X

First we combine the yellow ones nullifying the condition.
Then the red ones.
Notice that yellow and red columns are identical except for one condition.
So, we combine them.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  Y  Y  Y
  more than $350 per seat  -  N  Y  Y
  domestic flight          -  -  N  Y
ACTIONS
  serve cocktails             X  X  X
  free                           X

First we combine the yellow ones nullifying the condition.
Then the red ones.
Notice that yellow and red columns are identical except for one condition.
So, we combine them.
Then we combine the violet colored ones.

Reduce the table by eliminating redundant columns.

                           POSSIBLE RULES
CONDITIONS
  more than half-full      N  Y  Y  Y
  more than $350 per seat  -  N  Y  Y
  domestic flight          -  -  N  Y
ACTIONS
  serve cocktails             X  X  X
  free                           X

Notice that even when we observe that the green columns are identical except for one condition we do not combine them:
A “NULLIFIED” condition is not the same as a valued one.
What about this rule? Have we overlooked something?

Final Solution

                           Rules
CONDITIONS
  more than half-full      N  Y  Y  Y
  more than $350 per seat  -  N  Y  Y
  domestic flight          -  -  N  Y
ACTIONS
  serve cocktails             X  X  X
  free                           X
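
A reduced table should give the same actions as the full eight-column table for every combination of conditions. The Python sketch below checks this exhaustively. Both functions are illustrative encodings of the tables shown on the slides, and their names are assumptions.

```python
from itertools import product


def full_table(half_full, over_350, domestic):
    """Eight-column table: returns (serve cocktails, free) for one combination."""
    serve = half_full                                  # X in every half-full column
    free = half_full and over_350 and not domestic     # X only in the Y, Y, N column
    return serve, free


def reduced_table(half_full, over_350, domestic):
    """Final four-rule table; '-' entries are conditions that are not examined."""
    if not half_full:
        return (False, False)   # rule 1: N, -, -
    if not over_350:
        return (True, False)    # rule 2: Y, N, -
    if not domestic:
        return (True, True)     # rule 3: Y, Y, N
    return (True, False)        # rule 4: Y, Y, Y


# Compare the two tables on all 2^3 = 8 combinations.
mismatches = [
    combo
    for combo in product([True, False], repeat=3)
    if full_table(*combo) != reduced_table(*combo)
]
print(mismatches)  # []
```

An empty list of mismatches confirms that no behaviour was lost when the redundant columns were merged.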


Complex Decision Table Exercise


A company is trying to maintain a meaningful list of customers. The objective is
to send out only the catalogs from which customers will buy merchandise.
The company realizes that certain loyal customers order from every catalog and
some people on the mailing list never order. These customers are easy to identify.
Deciding which catalogs to send to customers who order from only selected
catalogs is a more difficult decision. Once these decisions have been made by the
marketing department, you as the analyst have been asked to develop a decision
table for the three conditions described below. Each condition has two
alternatives (Y or N):
1. Customer ordered from Fall catalog
2. Customer ordered from Christmas catalog
3. Customer ordered from Specialty catalog

The actions for these conditions, as determined by marketing, are described on the next slide.


Complex Decision Table Exercise


Customers who ordered from all three catalogs will get the Christmas and
Special catalogs. Customers who ordered from the Fall and Christmas catalogs
but not the Special catalog will get the Christmas catalog. Customers who
ordered from the Fall catalog and the Special catalog but not the Christmas
catalog will get the Special catalog. Customers who ordered only from the
Fall catalog but no other catalog will get the Christmas catalog. Customers
who ordered from the Christmas and Special catalogs but not the Fall catalog
will get both catalogs. Customers who ordered only from the Christmas
catalog or only from the Special catalog will get only the Christmas or Special
catalogs respectively. Customers who ordered from no catalog will get the
Christmas catalog.
1. Create a simplified decision table (or tree) based on the above decision
logic.


Solution to Decision Table Exercise

Conditions and Actions         1  2  3  4  5  6  7  8
Order from Fall Catalog        Y  Y  Y  Y  N  N  N  N
Order from Christmas Catalog   Y  Y  N  N  Y  Y  N  N
Order from Special Catalog     Y  N  Y  N  Y  N  Y  N
Mail Christmas Catalog            X     X     X     X
Mail Special Catalog                 X           X
Mail Both Catalogs             X           X

The four gray columns can be combined into a single rule. Note that for each, there was NO order placed from the Special Catalog.

In addition, Rules 1 & 5 and Rules 3 & 7 can be combined. Each pair produces the same action and each pair shares two common conditions.

Solution to Decision Table Exercise ~ Final Version

Conditions and Actions         1   2   3
Order from Fall Catalog        --  --  --
Order from Christmas Catalog   Y   --  N
Order from Special Catalog     Y   N   Y
Mail Christmas Catalog             X
Mail Special Catalog                   X
Mail Both Catalogs             X
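
The simplified table can be checked against the eight original rules with a short Python sketch. The full table is transcribed from the exercise text; the function name `simplified` and the action labels are illustrative.

```python
# Full table keyed by (Fall, Christmas, Special) orders, transcribed from the exercise.
FULL_TABLE = {
    ("Y", "Y", "Y"): "Both",
    ("Y", "Y", "N"): "Christmas",
    ("Y", "N", "Y"): "Special",
    ("Y", "N", "N"): "Christmas",
    ("N", "Y", "Y"): "Both",
    ("N", "Y", "N"): "Christmas",
    ("N", "N", "Y"): "Special",
    ("N", "N", "N"): "Christmas",
}


def simplified(fall, christmas, special):
    """Three-rule final table; the Fall catalog condition is never examined."""
    if special == "N":
        return "Christmas"                          # rule 2: --, --, N
    return "Both" if christmas == "Y" else "Special"  # rules 1 and 3


mismatches = [key for key, action in FULL_TABLE.items() if simplified(*key) != action]
print(mismatches)  # []
```

The Fall catalog condition drops out entirely, which is why the first condition row of the final table contains only "--" entries.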


Another Example:
“A marketing company wishes to construct a decision
table to decide how to treat clients according to three
characteristics:
Gender, City Dweller, and age group: A (under 30), B
(between 30 and 60), C (over 60).
The company has four products (W, X, Y and Z) to test
market.
Product W will appeal to female city dwellers.
Product X will appeal to young females.
Product Y will appeal to Male middle aged shoppers who
do not live in cities.
Product Z will appeal to all but older females.”


The process used to create this decision table is the following:


1. Identify conditions and their alternative values.
There are 3 conditions: gender, city dweller, and age group. Put these into table as 3
rows in upper left side.
Gender’s alternative values are: F and M.
City dweller’s alternative values are: Y and N
Age group’s alternative values are: A, B, and C
2. Compute max. number of rules.
Determine the product of number of alternative values for each condition.
2 x 2 x 3 = 12.
Fill table on upper right side with one column for each unique combination of these
alternative values. Label each column using increasing numbers 1-12 corresponding to
the 12 rules. For example, the first column (rule 1) corresponds to F, Y, and A. Rule 2
corresponds to M, Y, and A. Rule 3 corresponds to F, N, and A. Rule 4 corresponds to M,
N, and A. Rule 5 corresponds to F, Y, and B. Rule 6 corresponds to M, Y, and B and so on.
3. Identify possible actions
Market product W, X, Y, or Z. Put these into table as 4 rows in lower left side.
4. Define each of the actions to take given each rule.
For example, for rule 1 where it is F, Y, and A; we see from the above example scenario
that products W, X, and Z will appeal. Therefore, we put an ‘X’ into the table’s
intersection of column 1 and the rows that correspond to the actions: market product W,
market product X, and market product Z.


Rule       1  2  3  4  5  6  7  8  9  10 11 12
Gender     F  M  F  M  F  M  F  M  F  M  F  M
City       Y  Y  N  N  Y  Y  N  N  Y  Y  N  N
Age        A  A  A  A  B  B  B  B  C  C  C  C
MarketW    X           X           X
MarketX    X     X
MarketY                         X
MarketZ    X  X  X  X  X  X  X  X     X     X
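Steps 1 to 4 can be sketched in code: enumerate the 12 rules in the slide's column order and derive the actions from the four product descriptions. This is a minimal sketch that assumes "young" means age group A, "middle aged" means B, and "older" means C. The function and variable names are illustrative.

```python
GENDERS = ["F", "M"]
CITY = ["Y", "N"]
AGES = ["A", "B", "C"]  # A: under 30, B: 30 to 60, C: over 60

def products_for(gender, city, age):
    """Return the set of products to market for one rule (column)."""
    chosen = set()
    if gender == "F" and city == "Y":
        chosen.add("W")  # female city dwellers
    if gender == "F" and age == "A":
        chosen.add("X")  # young females
    if gender == "M" and age == "B" and city == "N":
        chosen.add("Y")  # middle-aged males who do not live in cities
    if not (gender == "F" and age == "C"):
        chosen.add("Z")  # everyone except older females
    return chosen

# Column order on the slide: gender varies fastest, then city, then age.
rules = [(g, c, a) for a in AGES for c in CITY for g in GENDERS]
table = {number + 1: (rule, products_for(*rule))
         for number, rule in enumerate(rules)}

for number, (rule, chosen) in table.items():
    print(number, rule, sorted(chosen))
```

Comparing the printed sets with the X marks in the table is one way to carry out the verification in step 5.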


5. Verify that the actions given to each rule are correct.


6. Simplify the table.
Determine if there are rules (columns) that represent impossible situations.
If so, remove those columns. There are no impossible situations in this
example.
Determine if there are rules (columns) that have the same actions. If so,
determine if these are rules that are identical except for one condition and
for that one condition, all possible values of this condition are present in the
rules in these columns. In the example scenario, columns 2, 4, 6, 7, 10, and
12 have the same action. Of these columns: 2, 6, and 10 are identical except
for one condition: age group. The gender is M and they are city dwellers. The
age group is A for rule 2, B for rule 6, and C for rule 10. Therefore, all possible
values of condition ‘age group’ are present. For rules 2, 6, and 10; the age
group is a “don’t care”. These 3 columns can be collapsed into one column
and a hyphen is put into the age group location to signify that we don’t care
what the value of the age group is, we will treat all male city dwellers the
same: market product Z.
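The "don't care" test described above can be automated. The sketch below, with illustrative names, fixes all conditions but one and reports a candidate collapse whenever every value of the remaining condition leads to the same actions. It lists candidates only. Candidates can overlap, so applying one may rule out another, and choosing among them is left to the analyst.

```python
from itertools import product

DOMAINS = {"gender": ["F", "M"], "city": ["Y", "N"], "age": ["A", "B", "C"]}

# Actions per rule, keyed by (gender, city, age), copied from the 12-rule table.
ACTIONS = {
    ("F", "Y", "A"): "WXZ", ("M", "Y", "A"): "Z",
    ("F", "N", "A"): "XZ",  ("M", "N", "A"): "Z",
    ("F", "Y", "B"): "WZ",  ("M", "Y", "B"): "Z",
    ("F", "N", "B"): "Z",   ("M", "N", "B"): "YZ",
    ("F", "Y", "C"): "W",   ("M", "Y", "C"): "Z",
    ("F", "N", "C"): "",    ("M", "N", "C"): "Z",
}

def collapsible_groups():
    """Find patterns where one condition can become a don't-care ('-')."""
    names = list(DOMAINS)
    found = []
    for position, name in enumerate(names):
        other_names = [n for n in names if n != name]
        for fixed in product(*(DOMAINS[n] for n in other_names)):
            # Build every rule that shares the fixed values and varies `name`.
            rules = []
            for value in DOMAINS[name]:
                rule = list(fixed)
                rule.insert(position, value)
                rules.append(tuple(rule))
            # Collapse only if all values of `name` give identical actions.
            if len({ACTIONS[r] for r in rules}) == 1:
                pattern = list(fixed)
                pattern.insert(position, "-")
                found.append((tuple(pattern), ACTIONS[rules[0]]))
    return found

for pattern, action in collapsible_groups():
    print(pattern, "->", action)
```

For this example the output includes ('M', 'Y', '-') -> Z, the collapse of rules 2, 6, and 10 described above. It also reports ('M', '-', 'A') and ('M', '-', 'C'), which overlap with that collapse.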


Final Decision Table


Rule       1  2  3  4  5  6  7  8  9  10
Gender     F  M  F  M  F  F  M  F  F  M
City       Y  Y  N  N  Y  N  N  Y  N  N
Age        A  -  A  A  B  B  B  C  C  C
MarketW    X           X        X
MarketX    X     X
MarketY                      X
MarketZ    X  X  X  X  X  X  X        X


Thank you
for your
attention!

3/11/2021

Introduction to Decision
Support Systems


Decision Making
• Business Environment Factors
• Markets: strong competition, global markets, markets on the
Internet
• Consumer demands: customization, quality, diversity,
delivery
• Technology: more innovations, higher obsolescence rate,
more information overload
• Societal: more regulation and deregulation, more
diversified workforce, more social responsibility.

Decision Making
• Business Pressure-Response-Support Model

Business Environment Factors: globalization, consumer demands, government regulations, markets and competition, etc.
→ Pressures and Opportunities
→ Organization Response: strategy, partners’ collaboration, real-time response, agility, increased productivity, new vendors, new business models, etc.
→ Decisions and Support: analysis, decisions, predictions
  • Integrated computerized decision support
  • Business Intelligence


Decision Making
• Process of Decision Making
• Define the problem (i.e., a decision situation that may
deal with some difficulty or with an opportunity).
• Construct a model that describes the real-world problem.
• Identify possible solutions to the modeled problem and
evaluate the solutions.
• Compare, choose, and recommend a potential solution to
the problem.

Decision Support Systems


• Definition I (Keen and Scott-Morton):
• Decision support systems couple the intellectual
resources of individuals with the capabilities of the
computer to improve the quality of decisions. It is a
computer-based support system for management decision
makers who deal with semistructured problems.


Decision Support Systems


• High-level Architecture of DSS

Data Models

Knowledge

User Interface

Decision Support Systems


• Multitiered architecture for incorporating
optimization, simulation, and other models into
web-based DSS
Tiers shown in the diagram:
• Web browser
• Web server
• Application server
• Optimization/simulation/etc. server
• Data server
• Data warehouse or legacy DBMS


Decision Support Systems


• Key characteristics and capabilities of DSS
1. Semistructured and unstructured problems
2. Support managers at all levels
3. Support individuals and groups
4. Interdependent or sequential decisions
5. Support intelligence, design, choice, implementation
6. Support variety of decision processes and styles
7. Adaptable and flexible
8. Interactive ease of use
9. Effectiveness, not efficiency
10. Human controls the process
11. Ease of development by end users
12. Modelling and analysis
13. Data access
14. Stand alone, integration and web-based

Decision Support Systems


• Components of DSS
• Data Management Subsystems: include the database
management system (DBMS) and the data warehouse
• Model Management Subsystems: include financial,
statistical, management science, or other quantitative
models that provide the analytical capabilities (also
called the model base management system, MBMS)
• User Interface Subsystems: include graphical user
interface (GUI) that allows users to communicate with
the system.


Decision Support Systems


• Schematic View of DSS
Components shown in the diagram:
• Data: external and internal
• Other component-based systems
• Internet, intranets, extranets
• Data management
• Model management
• External models
• Knowledge-based subsystems
• User interface
• Organizational KB
• Manager (user)


Decision Support Systems


• The Structure of the DMS
Components shown in the diagram:
• Internal data sources: Finance, Marketing, Production, Personnel, Other
• External data sources
• Private personal data
• Extraction
• Organizational knowledge base
• Corporate data warehouse
• Decision support database
• Database management system: retrieval, inquiry, update, report generation, delete
• Query facility
• Data directory
• Links to interface management, model management, and the knowledge-based subsystem


Decision Support Systems


• The Structure of MMS
Models (Model Base)
• Strategic, tactical, operational
• Statistical, financial, marketing, management science, accounting, engineering, etc.
• Model building block

Model Directory

Model Base Management
• Modeling commands: creation
• Maintenance: update
• Database interface
• Modeling language

Model execution, integration, and command processor

Links to data management, interface management, and the knowledge-based subsystem


Decision Support Systems


• Schematic View of the User Interface Systems
Components shown in the diagram:
• Connected subsystems: data management and DBMS, knowledge-based subsystems, model management and MBMS
• User interface management system (UIMS)
• Natural language processor
• Input: action language
• Output: display language
• Devices: PC display, printers, plotters
• Users


Modeling


Highlights
• Static and Dynamic Models
• Certainty, Uncertainty, and Risk
• Modeling with Spreadsheets
• Decision Tables and Decision Trees
• The Structure of Mathematical Models
• Mathematical Programming Optimization
• Multiple Goals, Sensitivity, What-if and Goal
Seeking
• Problem Solving Search Methods


Static and Dynamic Models


• A static model takes a single snapshot of the system.
The decision is made on that snapshot.
• A decision on whether to buy a product, a quarterly or
annual income statement, and a decision to invest are
all static.
• A dynamic model is time dependent.
• Determining how many checkout points should be
open in a supermarket—this needs to take into
account the time of day because different numbers
of customers arrive during each hour.

Certainty, Uncertainty, and Risk


• In decision making under certainty, it is assumed
that complete knowledge is available so that the
decision maker knows exactly what the outcome of
each course of action will be.
• Certainty models are relatively easy to develop and
solve, and they can yield optimal solutions.


Certainty, Uncertainty, and Risk


• In decision making under uncertainty, the decision
maker considers situations in which several
outcomes are possible for each course of action.
• The decision maker does not know, or cannot
estimate, the probability of occurrence of the
possible outcomes.
• It is more difficult than making decisions under
certainty because there is insufficient information.


Certainty, Uncertainty, and Risk


• A decision made under risk (also known as a
probabilistic or stochastic decision) is one in which the
decision maker must consider several possible
outcomes for each alternative, each with a given
probability of occurrence.
• Risk analysis is a decision-making method that
analyzes the risk (based on assumed known
probabilities) associated with different alternatives.


Modeling with Spreadsheets


• Spreadsheet packages were quickly recognized as easy-to-
use implementation software for the development of a wide
range of applications in business, engineering, mathematics,
and science.
• Spreadsheets include extensive statistical, forecasting, and
other modeling and database management capabilities,
functions, and routines.
• DSS-related spreadsheet tools include Solver (solver.com),
What’s Best (lindo.com), Braincel (promland.com),
NeuralTools, Evolver, @RISK (palisade.com), and GRG-2
(MS Excel).


Decision Tables and Trees


• Decision tables conventionally organize information and
knowledge in a systematic tabular manner to prepare it for
analysis.
• Decision under certainty:

              State of Nature (Uncontrollable Variables)
Alternative   Solid Growth (%)   Stagnation (%)   Inflation (%)
Bonds         12                 6                3
Stocks        15                 3                -2
CDs           6.5                6.5              6.5


Decision Tables and Trees


• Decision under risk

              Solid Growth   Stagnation   Inflation   Expected
Alternative   .5 (%)         .3 (%)       .2 (%)      Value (%)
Bonds         12             6            3           8.4
Stocks        15             3            -2          8.0
CDs           6.5            6.5          6.5         6.5
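The expected values in the last column are probability-weighted averages of each row. A minimal sketch of that calculation follows, using the figures from the table. The dictionary layout and names are illustrative.

```python
# Probabilities of each state of nature, from the table header.
PROBABILITIES = {"solid_growth": 0.5, "stagnation": 0.3, "inflation": 0.2}

# Yield (%) of each alternative under each state of nature.
YIELDS = {
    "Bonds":  {"solid_growth": 12.0, "stagnation": 6.0, "inflation": 3.0},
    "Stocks": {"solid_growth": 15.0, "stagnation": 3.0, "inflation": -2.0},
    "CDs":    {"solid_growth": 6.5,  "stagnation": 6.5, "inflation": 6.5},
}

def expected_value(alternative):
    """Sum of probability times yield over all states of nature."""
    return sum(PROBABILITIES[state] * YIELDS[alternative][state]
               for state in PROBABILITIES)

expected = {name: round(expected_value(name), 2) for name in YIELDS}
best = max(expected, key=expected.get)
print(expected, best)
```

For bonds this is 0.5 × 12 + 0.3 × 6 + 0.2 × 3 = 8.4, the highest expected value of the three alternatives.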


Decision Tables and Trees


• A decision tree shows the relationships of the
problem graphically and can handle complex
situations in a compact form.
• Tools include TreeAge Pro (treeage.com), PrecisionTree
(palisade.com), psychwww.com/mtsite/dectree.html,
and Mind Tools (mindtools.com).


Structure of Mathematical Model

Inputs: decision variables and uncontrollable variables
→ Mathematical relationships (with intermediate variables)
→ Output: result variables


Example of the Components of Models


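Financial investment
• Decision variables: investment alternatives and amounts
• Result variables: total profit, risk; rate of return on investment (ROI); earnings per share; liquidity level
• Uncontrollable variables and parameters: inflation rate; prime rate; competition

Marketing
• Decision variables: advertising budget; where to advertise
• Result variables: market share; customer satisfaction
• Uncontrollable variables and parameters: customer’s income; competitor’s action

Manufacturing
• Decision variables: what and how much to produce; inventory levels; compensation programs
• Result variables: total cost; quantity level; employee satisfaction
• Uncontrollable variables and parameters: machine capacity; technology; material prices

Accounting
• Decision variables: use of computers; audit schedule
• Result variables: data processing cost; error rate
• Uncontrollable variables and parameters: computer technology; tax rates; legal requirements

Transportation
• Decision variables: shipments schedule; use of smart cards
• Result variables: total transport cost; payment float time
• Uncontrollable variables and parameters: delivery distance; regulations

Services
• Decision variables: staffing levels
• Result variables: customer satisfaction
• Uncontrollable variables and parameters: demand for services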


Mathematical Programming
Optimization
• Linear programming
• Product Mix
• Transportation Problem
• Non-Linear programming
• Travelling salesman
• Vehicle routing problem


Multiple Goals
• Managers want to attain simultaneous goals, some of which
may conflict.
• In addition to earning money, the company wants to grow,
develop its products and employees, and provide job security to
its workers.
• Managers want to satisfy the shareholders and at the same
time enjoy high salaries and expense accounts, and
employees want to increase their take-home pay and
benefits.
• To solve this kind of problem, common methods are:
• Utility theory
• Goal programming
• Expression of goals as constraints, using LP
• A point system


Sensitivity Analysis
• Sensitivity analysis attempts to assess the impact of a change in
the input data or parameters on the proposed solution.
• Sensitivity analysis allows flexibility and adaptation to changing
conditions and to the requirements of different decision-making
situations.
• Sensitivity analysis tests relationships such as the following:
• The impact of changes in external (uncontrollable) variables and
parameters on the outcome variable(s)
• The impact of changes in decision variables on the outcome variable(s).
• The effect of uncertainty in estimating external variables
• The effects of different dependent interactions among variables
• The robustness of decisions under changing conditions


What-if Analysis
• What-if analysis is structured as: “What will happen
to the solution if an input variable, an assumption,
or a parameter value is changed?”
• For example, what will happen to the total inventory
cost if the cost of carrying inventories increases
by 10 percent?
• A spreadsheet tool is a good example. A manager
can analyze a cash flow problem by changing
parameters’ values and see the differences without
any involvement of computer programmers.
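The inventory question can be sketched as a small what-if model. The cost formula and every figure below are hypothetical placeholders chosen for illustration, not data from the slides. The point is only that changing one input and recomputing gives the answer.

```python
def total_inventory_cost(demand, order_qty, order_cost, carrying_cost):
    """Hypothetical annual cost model: ordering cost plus carrying cost."""
    ordering = (demand / order_qty) * order_cost   # orders per year * cost per order
    carrying = (order_qty / 2) * carrying_cost     # average stock * cost per unit held
    return ordering + carrying

# Base case (all values are assumptions made for this example).
base = dict(demand=1200, order_qty=100, order_cost=50.0, carrying_cost=4.0)
base_cost = total_inventory_cost(**base)

# What-if scenario: carrying cost rises by 10 percent, everything else unchanged.
scenario = {**base, "carrying_cost": base["carrying_cost"] * 1.10}
scenario_cost = total_inventory_cost(**scenario)

print(round(base_cost, 2), round(scenario_cost, 2),
      round(scenario_cost - base_cost, 2))
```

In a spreadsheet the same experiment amounts to editing the carrying-cost cell and reading the recalculated total.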

Goal Seeking
• Goal seeking calculates the values of the inputs
necessary to achieve a desired level of an output
(goal). It represents a backward solution approach.
• For example, What annual R&D budget is needed
for an annual growth rate of 15 per cent by 2012?
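Goal seeking works backward from a target output to the required input. The sketch below does this by bisection. The growth_rate model is a made-up placeholder (the slides give no such formula), and goal_seek assumes the model increases over the search interval.

```python
import math

def growth_rate(rd_budget_millions):
    """Hypothetical model: growth rises with the square root of R&D spending."""
    return 0.03 + 0.02 * math.sqrt(rd_budget_millions)

def goal_seek(model, target, low, high, tolerance=1e-9):
    """Bisection search for x in [low, high] such that model(x) equals target.

    Assumes model is increasing on the interval.
    """
    if not (model(low) <= target <= model(high)):
        raise ValueError("target is not bracketed by [low, high]")
    while high - low > tolerance:
        mid = (low + high) / 2
        if model(mid) < target:
            low = mid    # need a larger input
        else:
            high = mid   # input is large enough, tighten from above
    return (low + high) / 2

budget = goal_seek(growth_rate, target=0.15, low=0.0, high=100.0)
print(round(budget, 4))
```

Under this placeholder model, a 15 percent growth target corresponds to a budget of about 36 (in millions), since 0.03 + 0.02 × √36 = 0.15.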


Problem Solving Search Methods


Search approaches:

• Optimization (analytical)
  Process: generate improved solutions or get the best solution directly
  Stop: when no improvement is possible
  Result: optimal (best)
• Blind search, complete enumeration (exhaustive)
  Process: all possible solutions are checked
  Stop (comparisons): when all alternatives are checked
  Result: optimal (best)
• Blind search, partial search
  Process: check only some alternatives; systematically drop inferior solutions
  Stop (comparisons, simulation): when solution is good enough
  Result: best among alternatives checked
• Heuristics
  Process: only promising solutions are considered
  Stop: when solution is good enough
  Result: good enough


Thank you
for your
attention!


Introduction to
Optimization


Learning Objectives

• Recognize decision-making situations that may benefit from
an optimization modeling approach.
• Formulate algebraic models for linear programming problems.
• Develop spreadsheet models for linear programming problems.
• Use Excel’s Solver Add-In to solve linear programming problems.
• Interpret the results of models and perform basic sensitivity
analysis.

Optimization: Basic Ideas


• Major field within the discipline of Data Analytics,
Operations Research and Management Science
• Optimization Problem Components
• Decision Variables
• Objective Function (to maximize or minimize)
• Constraints (requirements or limitations)
• Basic Idea
• Find the values of the decision variables that maximize
(minimize) the objective function value, while staying
within the constraints.


Linear Programming (LP)


• If the objective function and all constraints are linear
functions of the decision variables (e.g., no squared
terms, trigonometric functions, ratios of variables), then
the problem is called a Linear Programming (LP)
problem. LPs are much easier to solve by computer than
problems involving nonlinear functions.
• Real-world LPs with hundreds of thousands to millions
of variables are solved (with specialized software!).
Our problems are obviously much smaller,
but the basic concepts are much the same.

Real-World Examples
• Dynamic and Customized Pricing
• Product Mix
• Scheduling/Allocation
• Routing/Logistics
• Supply Chain Optimization
• Facility Location
• Financial Planning/Asset Management
• Etc.


Solving Optimization Problems

• Understand the problem; draw a diagram


• Write a problem formulation in words
• Write the algebraic formulation of the problem
• Define the decision variables
• Write the objective function in terms of the decision variables
• Write the constraints in terms of the decision variables

Solving Optimization Problems


(continued)
• Develop Spreadsheet Model ☺
• Set up the Solver settings and solve the problem
• Examine results, make corrections to model and/or
Solver settings
• Interpret the results and draw insights


Solver

• We will use Solver, an Excel add-in, as a simple tool to
solve the Linear Programming problems in this example.
• Add-In: An additional piece of software that Excel loads
into memory when needed.

Example: Product Mix Decision

• DJJ Enterprises makes automotive parts, Camshafts & Gears


• Unit Profit: Camshafts $25/unit, Gears $18/unit
• Resources needed: Steel, Labor, Machine Time. In total, 5000 lbs
steel available, 1500 hours labor, and 1000 hours machine time.
• Camshafts need 5 lbs steel, 1 hour labor, 3 hours machine time.
• Gears need 8 lbs steel, 4 hours labor, 2 hours machine time.
• How many camshafts & gears to make in order to maximize profit?


Understanding the Problem

• Text-Based Formulation
• Decision Variables: Number of camshafts to make, number of gears to make
• Objective Function: Maximize profit
• Constraints: Don’t exceed amounts available of steel, labor, and machine time.


Algebraic Formulation
• Decision Variables
• C = number of camshafts to make
• G = number of gears to make
• Objective Function
• Maximize 25C + 18G (profit in $)
• Constraints
• 5C + 8G <= 5000 (steel in lbs)
• 1C + 4G <= 1500 (labor in hours)
• 3C + 2G <= 1000 (machine time in hours)
• C >= 0, G >= 0 (non-negativity)
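For a two-variable LP like this one, the formulation can be checked without a spreadsheet by enumerating corner points: an optimal solution of a bounded, feasible LP lies at a corner of the feasible region. The sketch below uses that approach, not the Simplex Method or Solver, and it relies on exact fractions to avoid rounding. Function and variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a, b, rhs), meaning a*C + b*G <= rhs.
CONSTRAINTS = [
    (5, 8, 5000),   # steel (lbs)
    (1, 4, 1500),   # labor (hours)
    (3, 2, 1000),   # machine time (hours)
    (-1, 0, 0),     # C >= 0
    (0, -1, 0),     # G >= 0
]
PROFIT = (25, 18)   # $ per camshaft, $ per gear

def corner_points(constraints):
    """Intersect every pair of constraint boundaries; keep feasible points."""
    points = set()
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries do not meet in a single point
        c = Fraction(r1 * b2 - r2 * b1, det)   # Cramer's rule
        g = Fraction(a1 * r2 - a2 * r1, det)
        if all(a * c + b * g <= r for a, b, r in constraints):
            points.add((c, g))
    return points

def solve(constraints, profit):
    """Return the corner point with the largest profit and that profit."""
    def value(point):
        return profit[0] * point[0] + profit[1] * point[1]
    best = max(corner_points(constraints), key=value)
    return best, value(best)

(c_opt, g_opt), best_profit = solve(CONSTRAINTS, PROFIT)
print(c_opt, g_opt, best_profit)  # prints: 100 350 8800
```

Enumeration is only practical for very small problems, because the number of boundary pairs grows quickly with the number of constraints and variables.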


Important Concepts
• Linear Program: The objective function and constraints are linear
functions of the decision variables. Therefore, this is a Linear
Program.
• Feasibility
• Feasible Solution. A solution is feasible for an LP if all constraints are
satisfied.
• Infeasible Solution. A solution is infeasible if one or more constraints are
violated.
• Check the solutions C=75, G=200; and C=300, G=200 for feasibility.
• Optimal Solution. The optimal solution is the feasible solution
with the largest (for a max problem) objective value (smallest for
a min problem).
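A feasibility check simply evaluates the left-hand side of each constraint and compares it with the right-hand side. The short sketch below applies this to the two plans mentioned above. The helper name violated is illustrative.

```python
# Resource constraints as (coefficient of C, coefficient of G, amount available).
CONSTRAINTS = {
    "steel (lbs)": (5, 8, 5000),
    "labor (hours)": (1, 4, 1500),
    "machine time (hours)": (3, 2, 1000),
}

def violated(c, g):
    """Return the names of the constraints that the plan (c, g) breaks."""
    broken = [name for name, (a, b, rhs) in CONSTRAINTS.items()
              if a * c + b * g > rhs]
    if c < 0:
        broken.append("C >= 0")
    if g < 0:
        broken.append("G >= 0")
    return broken

for plan in [(75, 200), (300, 200)]:
    problems = violated(*plan)
    print(plan, "feasible" if not problems else f"infeasible: {problems}")
```

A plan is feasible only when the returned list is empty. A single violated constraint is enough to make it infeasible.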


Solving Linear Programming Problems
• Trial and error: possible for very small problems; virtually
impossible for large problems.
• Graphical approach: It is possible to solve a 2-variable problem
graphically to find the optimal solution (not shown).
• Simplex Method. This is a mathematical approach developed by
George Dantzig. Can solve small problems by hand.
• Computer Software. Most optimization software uses the Simplex
Method to solve the problems. Excel’s Solver Add-In is an example
of such software.
• Solver can solve LPs of up to 200 variables.


Spreadsheet Model Development


• Develop correct, flexible, documented model.
• Sections for decision variables, objective function, and
constraints.
• Use algebraic formulation and natural structure of the problem to
guide structure of the spreadsheet.
• Use one cell for each decision variable
• Store objective function coefficients in separate cells and use
another cell to compute the objective function value.
• Store constraint coefficients in cells, compute the LHS value of
each constraint, for comparison to the RHS value.

15

Spreadsheet Model

• C=75, G=200 (cells B5:C5) entered as trial values. This is a feasible, but not the optimal, solution.
• Note close relationship to algebraic formulation.
• Note that only one distinct formula needed to be entered; once entered in Cell D8, it was copied to Cells D11:D13.
• This is possible because the coefficients were stored separately in a specific structure.

     A                     B           C       D        E    F
1    Example B.1
2    DJJ Enterprises Production Planning
3
4    Decision Variables    Camshafts   Gears
5    Units to Make         75          200
6
7    Objective                                 Total
8    Profit                $25         $18     $5,475
9
10   Constraints                               Used          Available
11   Steel (lbs)           5           8       1975     <=   5000
12   Labor (hrs)           1           4       875      <=   1500
13   Machine Time (hrs)    3           2       625      <=   1000

D8: =B8*B$5+C8*C$5 (copied to D11:D13)


Solver Basics
• Don’t even think about using Solver until you have a
working, flexible spreadsheet model that you can use as a
“what if” tool!
• Solver Settings
• Specify Objective Cell (objective function)
• Specify Changing Cells (decision variables)
• Specify Constraints
• Specify Solver Settings
• Solve Problem to find Optimal Solution


Solver Settings: Target Cell and Changing Cells


• Specify Target Cell: D8
• Equal to: Max
• Changing Cells (decision variables)
• B5:C5


Solver Settings: Constraints

• Click “Add” to add constraints


• Select LHS Cell (D11), relationship (<=), and RHS Cell
(F11).
• LHS Cell should contain a formula which computes the LHS Value
of the constraint.
• Typically, RHS Cell should contain a fixed value, but this is not
absolutely required.
• Repeat for the other two constraints (for labor and machine
time).
• OR use the range for the LHS and the RHS (see next
slide)


Solver Options

• Select “Simplex LP”
• Tells Solver to use the Simplex Method, which is faster and is Solver’s
default optimization method.
• Check “Make Unconstrained Variables Non-Negative”
• Tells Solver that the decision variables (B5:C5, representing the number
of Camshafts and Gears) must be >= 0 in any feasible solution.
• Leave other settings at their defaults.


Completed Solver Box

• Click “Solve” to tell Solver to find the Optimal Solution.


Solved Spreadsheet (Optimal Solution)

• Optimal Solution: Make 100 camshafts, 350 gears.
• Optimal Objective Value: $8800 profit.
• Both pieces of information are important. Knowing the optimal objective value is useless without knowing how that value can be attained.

     A                     B           C       D        E    F
1    Example B.1
2    DJJ Enterprises Production Planning
3
4    Decision Variables    Camshafts   Gears
5    Units to Make         100         350
6
7    Objective                                 Total
8    Profit                $25         $18     $8,800
9
10   Constraints                               Used          Available
11   Steel (lbs)           5           8       3300     <=   5000
12   Labor (hrs)           1           4       1500     <=   1500
13   Machine Time (hrs)    3           2       1000     <=   1000


Interpreting the Solution


• Which constraints are actively holding us back from making even
more profit?
• Binding Constraints: Constraints at the optimal solution with the LHS equal
to the RHS; equivalently for a resource, a binding constraint is one in
which all the resource is used up.
• Labor (all 1500 hours used), Machine Time (all 1000 hours used).
• Non-Binding Constraints
• Steel: Have 5000 lbs available, but only need 3300 lbs for this solution.
• What-If Analysis. Once the Spreadsheet and Solver model is set
up, it is easy to change one or more input values and re-solve the
problem.
• This interactive use can be a very powerful way to use optimization.


Solution Reports
• Solver can generate three solution reports
• Answer Report
• Sensitivity Report
• Limits Report: Not covered here
• The Answer Report presents in a standard format the Solver
Settings and the optimal solution.
• The Sensitivity Report shows what will happen if certain
problem parameters are changed from their current values.


Answer Report

Tightening a binding constraint can only worsen the objective function value, and loosening a binding constraint can only improve the objective function value. As such, once an optimal solution is found, managers can seek to improve that solution by finding ways to relax binding constraints.

• Three sections: Objective Cell, Changing Cells, Constraints
• “Final Value” indicates optimal solution
• For each constraint, binding/non-binding status and “slack” (difference between LHS and RHS); slack is zero for binding constraints.
• Also note that the Solver Settings for Objective Cell, Changing Cells, and Constraints are reported here. This can be a useful debugging tool.

Objective Cell (Max)
Cell    Name            Original Value   Final Value
$D$8    Profit Total    $0               $8,800

Variable Cells
Cell    Name                       Original Value   Final Value   Integer
$B$5    Units to Make Camshafts    0                100           Contin
$C$5    Units to Make Gears        0                350           Contin

Constraints
Cell     Name                       Cell Value   Formula         Status        Slack
$D$11    Steel (lbs) Used           3300         $D$11<=$F$11    Not Binding   1700
$D$12    Labor (hrs) Used           1500         $D$12<=$F$12    Binding       0
$D$13    Machine Time (hrs) Used    1000         $D$13<=$F$13    Binding       0


Sensitivity Report

Variable Cells
Cell    Name                       Final Value   Reduced Cost   Objective Coefficient   Allowable Increase   Allowable Decrease
$B$5    Units to Make Camshafts    100           0              25                      2                    20.5
$C$5    Units to Make Gears        350           0              18                      82                   1.333333333

Constraints
Cell     Name                       Final Value   Shadow Price   Constraint R.H. Side   Allowable Increase   Allowable Decrease
$D$11    Steel (lbs) Used           3300          0              5000                   1E+30                1700
$D$12    Labor (hrs) Used           1500          0.4            1500                   500                  1166.666667
$D$13    Machine Time (hrs) Used    1000          8.2            1000                   1416.666667          250

• Note two sections: Variable Cells and Constraints


Sensitivity Report (continued)


• Constraint Section
• Shadow Price: This is the amount the optimal objective value will change by if the
RHS of the constraint is increased by one unit.
• Units of shadow price: (objective function units/constraint units). Example: for the
steel constraint, the units are $/lb; for the labor constraint, the units are $/hour.
• Allowable Increase/Decrease: Provides the increase/decrease of the RHS of the
constraint for which the shadow price stays the same; that is, the effect on the
objective value stays the same in this range.
• Example
• Shadow Price of Machine Time is $8.20/hour, valid for an increase of 1416 hours or
a decrease of 250 hours.
• If we make 200 additional machine time hours available, profit can be increased by
(200 hours)($8.20/hour) = $1640.
• What is the shadow price of a non-binding constraint? Why? Will this always be
the case?
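A shadow price can be checked numerically by adding one unit to a constraint's RHS, re-solving, and measuring the change in the optimal profit. The sketch below does this for the DJJ model using corner-point enumeration with exact fractions. This is a stand-in for Solver, chosen only because the problem has two variables, and the function name max_profit is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def max_profit(steel=5000, labor=1500, machine=1000):
    """Optimal DJJ profit for the given resource amounts (corner-point search)."""
    rows = [(5, 8, steel), (1, 4, labor), (3, 2, machine),
            (-1, 0, 0), (0, -1, 0)]          # last two rows: C >= 0, G >= 0
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(rows, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue                          # parallel boundaries
        c = Fraction(r1 * b2 - r2 * b1, det)  # Cramer's rule
        g = Fraction(a1 * r2 - a2 * r1, det)
        if all(a * c + b * g <= r for a, b, r in rows):
            value = 25 * c + 18 * g
            best = value if best is None else max(best, value)
    return best

base = max_profit()
# Change in optimal profit when each RHS is increased by one unit.
shadow = {
    "steel": max_profit(steel=5001) - base,
    "labor": max_profit(labor=1501) - base,
    "machine": max_profit(machine=1001) - base,
}
gain_200_hours = max_profit(machine=1200) - base

print({name: float(price) for name, price in shadow.items()})
print(float(gain_200_hours))
```

The differences reproduce the Shadow Price column of the Sensitivity Report (0, 0.4, and 8.2), and 200 extra machine hours add $1640, which lies inside the allowable increase.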


Sensitivity Report (continued)


• Variable Cells Section
• Allowable Increase/Decrease: This is the amount the objective coefficient for a
decision variable can be increased/decreased without changing the optimal solution.
• Example
• Gears currently have a profit of $18/unit (objective coefficient).
• Allowable Increase/Decrease is $82 and $1.33, respectively.
• Interpretation: As long as the unit profit for gears is between $16.67 and $100, the
optimal solution (production plan) will be to make 100 camshafts and 350 gears.
• Note: The optimal solution (production plan) stays the same within this range, but
the optimal objective value (profit) changes since the unit profit is changing. So, if
the profit on gears goes up to $20/unit, profit will increase by ($20-$18)(350) =
$700.


Sensitivity Report: Caveat!

• Effects identified in the Sensitivity Report need to be carefully
interpreted.
• Specifically, the effect of the change of one parameter (e.g., a RHS
value or an objective coefficient) assumes that all other parameters
of the model stay at their base case values.
• For example, the Sensitivity Report does not tell us what happens
if additional quantities of both Labor and Machine Time become
available. In this scenario, we would need to enter the new values
and re-solve the model.


Highlights
• People use informal “optimization” to make decisions almost every
day.
• Organizations use formal optimization methods to address problems
across the organization, from optimal pricing to locating a new
facility.
• The algebraic formulation of an LP comprises the definitions of the
decision variables, an algebraic statement of the objective function,
and algebraic statements of the constraints.
• The spreadsheet model for an optimization problem should be
guided by the algebraic formulation.
• Solver, an Excel Add-In, is able to solve both linear and nonlinear
problems. This lecture focuses on solving linear problems.


Highlights
• After solving an LP, you must interpret the results to see if
they make sense, fix problems with the model, and find
the insights useful for management.
• Solver can generate the Answer and Sensitivity Reports.
The Sensitivity Report provides additional information
about what happens to the solution when certain
coefficients of the problem are changed.


Thank you
for your
attention!

Chapter 4:
Data Warehousing

Assoc. Prof. Nguyen Binh Minh Ph.D.

Learning Objectives

• Understand the basic definitions and concepts of
data warehouses
• Learn different types of data warehousing
architectures; their comparative advantages and
disadvantages
• Describe the processes used in developing and
managing data warehouses
• Explain data warehousing operations
• Explain the role of data warehouses in decision
support

Learning Objectives

• Explain data integration and the extraction,
transformation, and load (ETL) processes
• Describe real-time (a.k.a. right-time and/or active)
data warehousing
• Understand data warehouse administration and
security issues

Main Data Warehousing Topics

• DW definition
• Characteristics of DW
• Data Marts
• ODS, EDW, Metadata
• DW Framework
• DW Architecture & ETL Process
• DW Development
• DW Issues

What is a Data Warehouse?

• A physical repository where relational data are
specially organized to provide enterprise-wide,
cleansed data in a standardized format

• “The data warehouse is a collection of integrated,
subject-oriented databases designed to support
DSS functions, where each unit of data is non-volatile
and relevant to some moment in time”

Characteristics of DW
• Subject oriented
• Integrated
• Time-variant (time series)
• Nonvolatile
• Summarized
• Not normalized
• Metadata
• Web based, relational/multi-dimensional
• Client/server
• Real-time and/or right-time (active)

Data Mart

A departmental data warehouse that stores only relevant data

– Dependent data mart
A subset that is created directly from a data warehouse

– Independent data mart
A small data warehouse designed for a strategic business unit or a department

Data Warehousing Definitions

• Operational data stores (ODS)
A type of database often used as an interim area for a
data warehouse
• Oper marts
An operational data mart
• Enterprise data warehouse (EDW)
A data warehouse for the enterprise
• Metadata
Data about data. In a data warehouse, metadata
describe the contents of a data warehouse and the
manner of its acquisition and use

DW Framework

Components shown in the diagram:
• Data sources: ERP, Legacy, POS, Other OLTP/Web, External data
• ETL process: Select, Extract, Transform, Integrate, Load
• Enterprise data warehouse, with Metadata
• Replication to data marts: Marketing, Engineering, Finance, ... (or the "No data marts" option)
• Access: API / Middleware
• Applications (visualization): routine business reporting, data/text mining, OLAP, dashboard, Web, custom-built applications

DW Architecture

• Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data &
software
3. Client (front-end) software that allows users to
access and analyze data from the warehouse
• Two-tier architecture
The first two tiers of the three-tier architecture are combined into
one
Sometimes there is only one tier

DW Architectures

Three-tier: Tier 1: Client workstation | Tier 2: Application server | Tier 3: Database server

Two-tier: Tier 1: Client workstation | Tier 2: Application & database server


A Web-based DW Architecture

[Figure: a client (Web browser) connects over the Internet/intranet/extranet to a Web server, which serves Web pages and works with an application server and the data warehouse.]

Data Warehousing Architectures

• Issues to consider when deciding which architecture to use:
– Which database management system (DBMS)
should be used?
– Will parallel processing and/or partitioning be
used?
– Will data migration tools be used to load the
data warehouse?
– What tools will be used to support data
retrieval and analysis?


Alternative DW Architectures

(a) Independent Data Marts Architecture
ETL: Source systems → Staging area → Independent data marts (atomic/summarized data) → End-user access and applications

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts
ETL: Source systems → Staging area → Dimensionalized data marts linked by conformed dimensions (atomic/summarized data) → End-user access and applications

(c) Hub-and-Spoke Architecture (Corporate Information Factory)
ETL: Source systems → Staging area → Normalized relational warehouse (atomic data), feeding dependent data marts (summarized/some atomic data) → End-user access and applications

Alternative DW Architectures

(d) Centralized Data Warehouse Architecture
ETL: Source systems → Staging area → Normalized relational warehouse (atomic/some summarized data) → End-user access and applications

(e) Federated Architecture
Existing data warehouses, data marts, and legacy systems → Data mapping/metadata: logical/physical integration of common data elements → End-user access and applications


Alternative DW Architectures

1. Independent Data Marts


2. Data Mart Bus Architecture
3. Hub-and-Spoke Architecture
4. Centralized Data Warehouse
5. Federated Data Warehouse

• Each has pros and cons!

Teradata Corp. DW Architecture


Bank DataHub Conceptual Diagram

Data Warehousing Architectures
Ten factors that potentially affect the
architecture selection decision:
1. Information interdependence between organizational units
2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors


Data Integration and the Extraction,
Transformation, and Load (ETL) Process

• Data integration
Integration that comprises three major processes:
data access, data federation, and change capture
• Enterprise application integration (EAI)
A technology that provides a vehicle for pushing data
from source systems into a data warehouse
• Enterprise information integration (EII)
An evolving tool space that promises real-time data
integration from a variety of sources, such as
relational databases, Web services, and
multidimensional databases

Data Integration and the Extraction,
Transformation, and Load (ETL) Process

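Extraction, transformation, and load (ETL)

[Figure: the ETL process. Elements shown: a packaged application, a legacy system, and other internal applications as sources; a transient data source; the steps Extract → Transform → Cleanse → Load; and the targets, a data warehouse and a data mart.]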
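
The Extract → Transform → Cleanse → Load flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the `sales` table, and every function name are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records standing in for rows pulled from a legacy system.
RAW_ROWS = [
    {"id": "1", "region": " north ", "amount": "120.50"},
    {"id": "2", "region": "SOUTH", "amount": "80"},
    {"id": "3", "region": "north", "amount": "n/a"},  # dirty value
]


def extract():
    """Extract: read rows from the source (here, an in-memory list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: convert types and standardize formats."""
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            amount = None  # leave unparseable values for the cleanse step
        out.append({
            "id": int(r["id"]),
            "region": r["region"].strip().title(),
            "amount": amount,
        })
    return out


def cleanse(rows):
    """Cleanse: drop rows whose amount could not be parsed."""
    return [r for r in rows if r["amount"] is not None]


def load(rows, conn):
    """Load: write the cleaned rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:id, :region, :amount)", rows
    )
    conn.commit()


def run_etl(conn):
    """Run the four steps in order and return total sales per region."""
    load(cleanse(transform(extract())), conn)
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
```

Calling `run_etl(sqlite3.connect(":memory:"))` loads the two clean rows and skips the one with the unparseable amount.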


ETL

• Issues affecting the purchase of an ETL tool
– Data transformation tools are expensive
– Data transformation tools may have a long learning
curve
• Important criteria in selecting an ETL tool
– Ability to read from and write to an unlimited number
of data sources/architectures
– Automatic capturing and delivery of metadata
– A history of conforming to open standards
– An easy-to-use interface for the developer and the
functional user
19 Best ETL Tools for 2023
https://blog.hubspot.com/website/etl-tools

Data Warehouse Development

• Data warehouse development approaches


– Inmon Model: EDW approach (top-down)
– Kimball Model: Data mart approach (bottom-up)
– Which model is best?
• There is no one-size-fits-all strategy to DW
– One alternative is the hosted warehouse
• Data warehouse structure:
– The Star Schema vs. Relational
• Real-time data warehousing?


Hosted Data Warehouses


• Benefits:
– Requires minimal investment in infrastructure
– Frees up capacity on in-house systems
– Frees up cash flow
– Makes powerful solutions affordable
– Enables powerful solutions that provide for growth
– Offers better quality equipment and software
– Provides faster connections
– Enables users to access data remotely
– Allows a company to focus on core business
– Meets storage needs for large volumes of data

Representation of Data in DW

• Dimensional Modeling – a retrieval-based system that
supports high-volume query access
• Star schema – the most commonly used and the simplest
style of dimensional modeling
– Contains a fact table surrounded by and connected to several
dimension tables
– Fact table contains the measures (numerical values)
needed to perform decision analysis and query reporting
– Dimension tables contain classification and aggregation
information about the values in the fact table
• Snowflake schema – an extension of star schema where
the diagram resembles a snowflake in shape
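
A small star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_time`, `dim_product`, `units_sold`) and the sample rows are assumptions made for illustration; they are not taken from any particular warehouse.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table connected to two dimension tables."""
    conn.executescript("""
        CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, quarter TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, brand TEXT);
        CREATE TABLE fact_sales  (
            time_id    INTEGER REFERENCES dim_time(time_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            units_sold INTEGER
        );
        INSERT INTO dim_time    VALUES (1, 'Q1'), (2, 'Q2');
        INSERT INTO dim_product VALUES (1, 'BrandA'), (2, 'BrandB');
        INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 2, 3);
    """)


def units_by_quarter(conn):
    """Aggregate the fact table's measure by a time-dimension attribute."""
    return conn.execute("""
        SELECT t.quarter, SUM(f.units_sold)
        FROM fact_sales AS f
        JOIN dim_time AS t ON f.time_id = t.time_id
        GROUP BY t.quarter
        ORDER BY t.quarter
    """).fetchall()
```

The numeric measure lives in the fact table, while the dimension tables supply the labels used for grouping. A snowflake variant would split a dimension further, for example moving `brand` into its own table.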


Multidimensionality

• Multidimensionality
The ability to organize, present, and analyze data by
several dimensions, such as sales by region, by
product, by salesperson, and by time (four
dimensions)
• Multidimensional presentation
– Dimensions: products, salespeople, market segments, business
units, geographical locations, distribution channels, country, or
industry
– Measures: money, sales volume, head count, inventory profit,
actual versus forecast
– Time: daily, weekly, monthly, quarterly, or yearly

Star vs Snowflake Schema

[Figure: Star schema: fact table SALES (UnitsSold, ...) linked to dimensions TIME (Quarter, ...), PRODUCT (Brand, ...), PEOPLE (Division, ...), and GEOGRAPHY (Country, ...).
Snowflake schema: fact table SALES (UnitsSold, ...) linked to dimensions DATE (Date, ...), PRODUCT (LineItem, ...), PEOPLE (Division, ...), and STORE (LocID, ...), with further dimensions MONTH (M_Name, ...), QUARTER (Q_Name, ...), BRAND (Brand, ...), CATEGORY (Category, ...), and LOCATION (State, ...).]


Analysis of Data in DW

• Online analytical processing (OLAP)


– Data driven activities performed by end users to
query the online system and to conduct analyses
– Data cubes, drill-down / rollup, slice & dice, …
• OLAP Activities
– Generating queries (query tools)
– Requesting ad hoc reports
– Conducting statistical and other analyses
– Developing multimedia-based applications

Analysis of Data Stored in DW
OLTP vs. OLAP

• OLTP (online transaction processing)
– A system that is primarily responsible for capturing
and storing data related to day-to-day business
functions such as ERP, CRM, SCM, and POS
– The main focus is on efficiency of routine tasks
• OLAP (online analytical processing)
– A system designed to address the need for
information extraction by providing effective and
efficient ad hoc analysis of organizational data
– The main focus is on effectiveness


OLAP vs. OLTP

OLAP Operations

• Slice – a subset of a multidimensional array


• Dice – a slice on more than two dimensions
• Drill Down/Up – navigating among levels of data
ranging from the most summarized (up) to the
most detailed (down)
• Roll Up – computing all of the data relationships
for one or more dimensions
• Pivot – used to change the dimensional
orientation of a report or an ad hoc query-page
display
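
The slice, dice, and roll-up operations can be illustrated on a tiny cube held in a Python dictionary. The cube contents, the dimension names, and the helper functions are hypothetical; roll-up is shown here as simple summation over the dimensions that are dropped, which is one common reading of the operation.

```python
from collections import defaultdict

# Hypothetical cube: (product, region, quarter) -> sales volume.
DIMS = ("product", "region", "quarter")
CUBE = {
    ("Laptop", "East", "Q1"): 10, ("Laptop", "East", "Q2"): 12,
    ("Laptop", "West", "Q1"): 7,  ("Laptop", "West", "Q2"): 9,
    ("Phone", "East", "Q1"): 20,  ("Phone", "East", "Q2"): 18,
    ("Phone", "West", "Q1"): 15,  ("Phone", "West", "Q2"): 11,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: restrict several dimensions at once to sets of allowed values."""
    def keep(key):
        return all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    return {k: v for k, v in cube.items() if keep(k)}


def roll_up(cube, keep_dims):
    """Roll up: sum the measure over every dimension not in keep_dims."""
    idx = [DIMS.index(d) for d in keep_dims]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)
```

For example, `roll_up(CUBE, ["region"])` collapses product and quarter, leaving one total per region; drilling down is the reverse move, back to the detailed cells.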


OLAP

[Figure: slicing operations on a simple three-dimensional data cube with dimensions Time, Product, and Geography. Cells are filled with numbers representing sales volumes. Three slices are shown: sales volumes of a specific product on variable time and region; sales volumes of a specific region on variable time and products; and sales volumes of a specific time on variable region and products.]

Variations of OLAP

• Multidimensional OLAP (MOLAP)


OLAP implemented via a specialized
multidimensional database (or data store) that
summarizes transactions into multidimensional
views ahead of time
• Relational OLAP (ROLAP)
The implementation of an OLAP database on
top of an existing relational database
• Database OLAP and Web OLAP (DOLAP and
WOLAP); Desktop OLAP,…


DW Implementation Issues

• 11 tasks for successful DW implementation


– Establishment of service-level agreements and data-refresh
requirements
– Identification of data sources and their governance policies
– Data quality planning
– Data model design
– ETL tool selection
– Relational database software and platform selection
– Data transport
– Data conversion
– Reconciliation process
– Purge and archive planning
– End-user support

DW Implementation Guidelines

• Project must fit with corporate strategy & business objectives


• There must be complete buy-in to the project by executives,
managers, and users
• It is important to manage user expectations about the completed
project
• The data warehouse must be built incrementally
• Build in adaptability, flexibility and scalability
• The project must be managed by both IT and business
professionals
• Only load data that have been cleansed and are of a quality
understood by the organization
• Do not overlook training requirements
• Be politically aware


Successful DW Implementation
Things to Avoid

• Starting with the wrong sponsorship


• Setting expectations that you cannot meet
• Engaging in politically naive behavior
• Loading the data warehouse with information
just because it is available
• Believing that data warehousing database design
is the same as transactional database design
• Choosing a data warehouse manager who is
technology oriented rather than user oriented

Successful DW Implementation
Things to Avoid - Cont.

• Focusing on traditional internal record-oriented


data and ignoring the value of external data and
of text, images, etc.
• Delivering data with confusing definitions
• Believing promises of performance, capacity, and
scalability
• Believing that your problems are over when the
data warehouse is up and running
• Focusing on ad hoc data mining and periodic
reporting instead of alerts


Failure Factors in DW Projects

• Unclear business objectives


• Cultural issues being ignored
– Change management
• Unrealistic expectations
• Inappropriate architecture
• Low data quality / missing information
• Loading data just because it is available

Massive DW and Scalability

• Scalability
– The main issues pertaining to scalability:
• The amount of data in the warehouse
• How quickly the warehouse is expected to grow
• The number of concurrent users
• The complexity of user queries
– Good scalability means that the cost of queries and
other data-access functions grows (at most) linearly
with the size of the warehouse


Real-time/Active DW/BI

• Enabling real-time data updates for real-time analysis
and real-time decision making is growing rapidly
– Push vs. Pull (of data)
• Concerns about real-time BI
– Not all data should be updated continuously
– Mismatch of reports generated minutes apart
– May be cost prohibitive
– May also be infeasible

Real-time/Active DW at Teradata


Enterprise Decision Evolution and DW

HADOOP DATA WAREHOUSE
ARCHITECTURE


Traditional vs Active DW Environment

DW Administration and Security

• Data warehouse administrator (DWA)


– DWA should…
• have the knowledge of high-performance software, hardware
and networking technologies.
• possess solid business knowledge and insight.
• be familiar with the decision-making processes so as to
suitably design/maintain the data warehouse structure.
• possess excellent communications skills.
• Security and privacy are pressing issues in DW
– Safeguarding the most valuable assets
– Government regulations (HIPAA, etc.)
– Must be explicitly planned and executed


The Future of DW

• Sourcing…
– Open source software
– SaaS (software as a service)
– Cloud computing
– DW appliances
• Infrastructure…
– Real-time DW
– Data management practices/technologies
– In-memory processing (“super-computing”)
– New DBMS
– Advanced analytics

End of the Chapter

• Questions, comments

Data Mining for Business Intelligence

Assoc. Prof. Nguyen Binh Minh

Learning Objectives
• Define data mining as an enabling technology for
business intelligence
• Understand the objectives and benefits of business
analytics and data mining
• Recognize the wide range of applications of data
mining
• Learn the standardized data mining processes
• CRISP-DM
• SEMMA
• KDD

Learning Objectives
• Understand the steps involved in data
preprocessing for data mining
• Learn different methods and algorithms of data
mining
• Build awareness of the existing data mining
software tools
• Commercial versus free/open source
• Understand the pitfalls and myths of data mining

Opening Vignette…

“Data Mining Goes to Hollywood!”


• Decision situation
• Problem
• Proposed solution
• Results
• Answer & discuss the case questions

Opening Vignette:
Data Mining Goes to Hollywood!

A Typical Classification Problem

Dependent Variable
Class No.   Range (in $ millions)
1           < 1 (Flop)
2           > 1, < 10
3           > 10, < 20
4           > 20, < 40
5           > 40, < 65
6           > 65, < 100
7           > 100, < 150
8           > 150, < 200
9           > 200 (Blockbuster)

Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                  High, Medium, Low
Sequel                 1                  Yes, No
Number of screens      1                  Positive integer

Opening Vignette:
Data Mining Goes to Hollywood!

[Figure: the DM process map in IBM SPSS Modeler, showing the model development process and the model assessment process.]

Opening Vignette:
Data Mining Goes to Hollywood!

Prediction Models
(Individual models: SVM, ANN, C&RT. Ensemble models: Random Forest, Boosted Tree, Fusion (Average).)

Performance Measure     SVM      ANN      C&RT     Random Forest   Boosted Tree   Fusion (Average)
Count (Bingo)           192      182      140      189             187            194
Count (1-Away)          104      120      126      121             104            120
Accuracy (% Bingo)      55.49%   52.60%   40.46%   54.62%          54.05%         56.07%
Accuracy (% 1-Away)     85.55%   87.28%   76.88%   89.60%          84.10%         90.75%
Standard deviation      0.93     0.87     1.05     0.76            0.84           0.63

* Training set: 1998 – 2005 movies; Test set: 2006 movies

Data Mining Concepts and Definitions


Why Data Mining?
• More intense competition at the global scale
• Recognition of the value in data sources
• Availability of quality data on customers, vendors,
transactions, Web, etc.
• Consolidation and integration of data repositories
into data warehouses
• The exponential increase in data processing and
storage capabilities; and decrease in cost
• Movement toward conversion of information
resources into nonphysical form

Definition of Data Mining
• The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases
- Fayyad et al., (1996)
• Keywords in this definition: Process, nontrivial, valid,
novel, potentially useful, understandable
• Data mining: a misnomer?
• Other names: knowledge extraction, pattern
analysis, knowledge discovery, information
harvesting, pattern searching, data dredging

Data Mining at the Intersection of Many Disciplines

[Figure: data mining at the center of overlapping disciplines: statistics, artificial intelligence, pattern recognition, machine learning, mathematical modeling, databases, and management science & information systems.]

Data Mining Characteristics/Objectives

• Source of data for DM is often a consolidated data
warehouse (not always!).
• DM environment is usually a client-server or a Web-
based information systems architecture.
• Data is the most critical ingredient for DM which
may include soft/unstructured data.
• The miner is often an end user.
• Striking it rich requires creative thinking.
• Data mining tools’ capabilities and ease of use are
essential (Web, Parallel processing, etc.).


Data in Data Mining


• Data: a collection of facts usually obtained as the result of
experiences, observations, or experiments
• Data may consist of numbers, words, and images
• Data: lowest level of abstraction (from which information and
knowledge are derived)

Data
– Categorical: Nominal, Ordinal
– Numerical: Interval, Ratio

- DM with different data types?
- Other data types?

What Does DM Do?
How Does it Work?
• DM extracts patterns from data
• Pattern?
A mathematical (numeric and/or symbolic) relationship among data items
• Types of patterns
• Association
• Prediction
• Cluster (segmentation)
• Sequential (or time series) relationships


A Taxonomy for Data Mining Tasks

Data Mining Task    Learning Method   Popular Algorithms
Prediction          Supervised        Classification and Regression Trees, ANN, SVM, Genetic Algorithms
Classification      Supervised        Decision trees, ANN/MLP, SVM, Rough sets, Genetic Algorithms
Regression          Supervised        Linear/Nonlinear Regression, Regression trees, ANN/MLP, SVM
Association         Unsupervised      Apriori, OneR, ZeroR, Eclat
Link analysis       Unsupervised      Expectation Maximization, Apriori Algorithm, Graph-based Matching
Sequence analysis   Unsupervised      Apriori Algorithm, FP-Growth technique
Clustering          Unsupervised      K-means, ANN/SOM
Outlier analysis    Unsupervised      K-means, Expectation Maximization (EM)

Other Data Mining Tasks
• These are in addition to the primary DM tasks (prediction, association,
clustering)
• Time-series forecasting
• Part of sequence or link analysis?
• Visualization
• Another data mining task?

• Types of DM
• Hypothesis-driven data mining
• Discovery-driven data mining


Data Mining Applications

• Customer Relationship Management


• Maximize return on marketing campaigns
• Improve customer retention (churn analysis)
• Maximize customer value (cross- or up-selling)
• Identify and treat most valued customers

• Banking & Other Financial


• Automate the loan application process
• Detect fraudulent transactions
• Maximize customer value (cross- and up-selling)
• Optimize cash reserves with forecasting

Data Mining Applications (cont.)

• Retailing and Logistics


• Optimize inventory levels at different locations
• Improve the store layout and sales promotions
• Optimize logistics by predicting seasonal effects
• Minimize losses due to limited shelf life

• Manufacturing and Maintenance


• Predict/prevent machinery failures
• Identify anomalies in production systems to optimize
manufacturing capacity
• Discover novel patterns to improve product quality


Data Mining Applications (cont.)


• Brokerage and Securities Trading
• Predict changes on certain bond prices
• Forecast the direction of stock fluctuations
• Assess the effect of events on market movements
• Identify and prevent fraudulent activities in trading

• Insurance
• Forecast claim costs for better business planning
• Determine optimal rate plans
• Optimize marketing to specific customers
• Identify and prevent fraudulent claim activities

Data Mining Applications (cont.)
Highly popular application areas for data mining:
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel industry
• Healthcare
• Medicine
• Entertainment industry
• Sports
• Etc.


Data Mining Process


• A manifestation of best practices
• A systematic way to conduct DM projects
• Different groups have different versions
• Most common standard processes:
• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• SEMMA (Sample, Explore, Modify, Model, and Assess)
• KDD (Knowledge Discovery in Databases)

Data Mining Process

Source: KDNuggets.com, August 2007



Data Mining Process: CRISP-DM

[Figure: the CRISP-DM cycle arranged around the data sources: 1 Business Understanding → 2 Data Understanding → 3 Data Preparation → 4 Model Building → 5 Testing and Evaluation → 6 Deployment.]

Data Mining Process: CRISP-DM
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation (!)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment

Callout beside Steps 2 and 3: accounts for ~85% of total project time

• The process is highly repetitive and experimental (DM: art versus science?)


Data Preparation – A Critical DM Task

Real-world data
→ Data Consolidation: collect data; select data; integrate data
→ Data Cleaning: impute missing values; reduce noise in data; eliminate inconsistencies
→ Data Transformation: normalize data; discretize/aggregate data; construct new attributes
→ Data Reduction: reduce number of variables; reduce number of cases; balance skewed data
→ Well-formed data
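
Two of the preparation tasks, imputing missing values and normalizing data, can be sketched as follows. Mean imputation and min-max scaling are assumed here as simple representative choices; the slide does not prescribe a specific technique, and the function names are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale, return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize(impute_mean([2, None, 6]))` first fills the gap with 4.0 and then rescales the column to the range 0 to 1.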

Data Mining Process: SEMMA

A cycle of five steps:
1. Sample (generate a representative sample of the data)
2. Explore (visualization and basic description of the data)
3. Modify (select variables, transform variable representations)
4. Model (use a variety of statistical and machine learning models)
5. Assess (evaluate the accuracy and usefulness of the models)


Data Mining Methods: Classification


• Most frequently used DM method
• Part of the machine-learning family
• Employ supervised learning
• Learn from past data, classify new data
• The output variable is categorical (nominal or ordinal) in nature
• Classification versus regression?
• Classification versus clustering?

Assessment Methods for Classification

• Predictive accuracy
• Hit rate
• Speed
• Model building; predicting
• Robustness
• Scalability
• Interpretability
• Transparency; ease of understanding


Accuracy of Classification Models


• In classification problems, the primary source for
accuracy estimation is the confusion matrix

                     True Class
Predicted Class      Positive                      Negative
Positive             True Positive Count (TP)      False Positive Count (FP)
Negative             False Negative Count (FN)     True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{True Positive Rate} = \frac{TP}{TP + FN} \qquad \text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
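
The confusion-matrix formulas translate directly into code. The function name and the dictionary keys below are hypothetical; the arithmetic follows the definitions on this slide.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),   # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }
```

For instance, with TP = 40, FP = 10, FN = 20, and TN = 30, accuracy is 70/100, precision is 40/50, and recall is 40/60. The sketch assumes none of the denominators is zero.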

Estimation Methodologies for Classification
• Simple split (or holdout or test sample estimation)
• Split the data into 2 mutually exclusive sets: training
(~70%) and testing (~30%)

[Figure: preprocessed data is split into training data (2/3), used for model development to produce a classifier, and testing data (1/3), used for model assessment (scoring) to estimate prediction accuracy.]

• For ANN, the data is split into three subsets (training
[~60%], validation [~20%], testing [~20%])
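
A simple split can be sketched with the standard library alone. The function name, the default 30% test fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into disjoint training and testing sets."""
    rng = random.Random(seed)        # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)
```

Calling `holdout_split(range(10))` returns 7 training records and 3 testing records with no overlap. A three-way split for ANN could apply the same function twice.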


Estimation Methodologies for Classification


• k-Fold Cross Validation (rotation estimation)
• Split the data into k mutually exclusive subsets
• Use each subset as testing while using the rest of the
subsets as training
• Repeat the experimentation for k times
• Aggregate the test results for true estimation of
prediction accuracy
• Other estimation methodologies
• Leave-one-out, bootstrapping, jackknifing
• Area under the ROC curve
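
The k-fold procedure can be sketched as index bookkeeping plus an averaging step. Both function names are hypothetical, and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the testing indices.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k mutually exclusive folds over range(n)."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(n, k, evaluate):
    """Run evaluate(train_idx, test_idx) once per fold and average the scores."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```

Each record is used for testing exactly once and for training k − 1 times. In practice the data would usually be shuffled before the folds are formed.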

Estimation Methodologies for Classification –
ROC Curve

[Figure: ROC curves for three classifiers A, B, and C. x-axis: False Positive Rate (1 - Specificity), 0 to 1; y-axis: True Positive Rate (Sensitivity), 0 to 1.]


Classification Techniques

• Decision tree analysis


• Statistical analysis
• Neural networks
• Support vector machines
• Case-based reasoning
• Bayesian classifiers
• Genetic algorithms
• Rough sets

Decision Trees
• Employs the divide and conquer method
• Recursively divides a training set until each
division consists of examples from one class

A general algorithm for decision tree building:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each and every
leaf node until the stopping criterion is reached.


Decision Trees

• DT algorithms mainly differ on


• Splitting criteria
• Which variable to split first?
• What values to use to split?
• How many splits to form for each node?
• Stopping criteria
• When to stop building the tree
• Pruning (generalization method)
• Pre-pruning versus post-pruning
• Most popular DT algorithms include
• ID3, C4.5, C5; CART; CHAID; M5

Decision Trees

• Alternative splitting criteria


• Gini index determines the purity of a specific class as a result of a decision to
branch along a particular attribute/value
• Used in CART
• Information gain uses entropy to measure the extent of uncertainty or
randomness of a particular attribute/value split
• Used in ID3, C4.5, C5
• Chi-square statistics (used in CHAID)
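
The Gini index and entropy-based information gain can be computed from class labels alone. This sketch uses the standard textbook definitions (Gini impurity and Shannon entropy in bits); the function names are hypothetical, and real DT tools add details such as gain ratio or handling of numeric splits.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted
```

A split that separates two equally frequent classes perfectly has an information gain of 1 bit, the largest possible for a two-class node.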


Cluster Analysis for Data Mining

• Used for automatic identification of natural groupings of things


• Part of the machine-learning family
• Employ unsupervised learning
• Learns the clusters of things from past data, then assigns new
instances
• There is no output variable
• Also known as segmentation

Cluster Analysis for Data Mining

• Clustering results may be used to


• Identify natural groupings of customers
• Identify rules for assigning new cases to classes for targeting/diagnostic
purposes
• Provide characterization, definition, labeling of populations
• Decrease the size and complexity of problems for other data mining methods
• Identify outliers in a specific domain (e.g., rare-event detection)


Cluster Analysis for Data Mining

• Analysis methods
• Statistical methods (including both hierarchical and nonhierarchical), such as
k-means, k-modes, and so on.
• Neural networks (adaptive resonance theory [ART], self-organizing map
[SOM])
• Fuzzy logic (e.g., fuzzy c-means algorithm)
• Genetic algorithms

• Divisive versus Agglomerative methods

Cluster Analysis for Data Mining

• How many clusters?


• There is no “truly optimal” way to calculate it
• Heuristics are often used
• Look at the sparseness of clusters
• Number of clusters = (n/2)^(1/2) (n: number of data points)
• Use Akaike information criterion (AIC)
• Use Bayesian information criterion (BIC)
• Most cluster analysis methods involve the use of a
distance measure to calculate the closeness
between pairs of items.
• Euclidean versus Manhattan (rectilinear) distance
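
The two distance measures and the square-root rule of thumb for the number of clusters can be written out as follows. The function names are hypothetical, and rounding the heuristic to the nearest integer is an assumption.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Rectilinear distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rule_of_thumb_k(n):
    """Heuristic number of clusters, (n/2)^(1/2), rounded to an integer of at least 1."""
    return max(1, round(math.sqrt(n / 2)))
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7.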


Cluster Analysis for Data Mining

• k-Means Clustering Algorithm


• k : pre-determined number of clusters
• Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k points as initial
cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repeat steps 2 and 3 until some convergence criterion is
met (usually that the assignment of points to clusters
becomes stable).
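
A minimal k-means sketch in plain Python follows. One assumption differs from the slide's wording: the initial centers are drawn from the data points themselves rather than generated as arbitrary random points. The function names, the fixed seed, and the iteration cap are also illustrative choices.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups; return (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # Step 1 (assumed variant): sample k data points
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break                     # convergence: assignments are stable
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:               # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment
```

The result can depend on the initial centers, so in practice the algorithm is often run several times with different seeds.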

Cluster Analysis for Data Mining -
k-Means Clustering Algorithm

[Figure: illustration of Step 1, Step 2, and Step 3 of the k-means algorithm.]


Association Rule Mining


• A very popular DM method in business
• Finds interesting relationships (affinities) between
variables (items or events)
• Part of machine learning family
• Employs unsupervised learning
• There is no output variable
• Also known as market basket analysis
• Often used as an example to describe DM to
ordinary people, such as the famous “relationship
between diapers and beers!”

Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service plan 70
percent of the time”
• How do you use such a pattern/knowledge?
• Put the items next to each other for ease of finding
• Promote the items as a package (do not put one on sale if the other(s)
are on sale)
• Place items far apart from each other so that the customer has to
walk the aisles to search for them, and by doing so potentially see and buy
other items


Association Rule Mining


• Representative applications of association rule
mining include
• In business: cross-marketing, cross-selling, store design,
catalog design, e-commerce site design, optimization of
online advertising, product pricing, and sales/promotion
configuration
• In medicine: relationships between symptoms and
illnesses; diagnosis and patient characteristics and
treatments (to be used in medical DSS); and genes and
their functions (to be used in genomics projects)

Association Rule Mining
• Are all association rules interesting and useful?
A Generic Rule: X ⇒ Y [S%, C%]
X, Y: products and/or services
X: Left-hand-side (LHS)
Y: Right-hand-side (RHS)
S: Support: how often X and Y go together
C: Confidence: how often Y goes together with X
Example: {Laptop Computer, Antivirus Software} ⇒
{Extended Service Plan} [30%, 70%]
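
Support and confidence can be computed directly from a list of transactions. The ten sample baskets below are made up for illustration and give 30% support and 75% confidence, not the 70% quoted in the slide's example; the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical point-of-sale baskets (10 transactions).
BASKETS = (
    [{"laptop", "antivirus", "service_plan"}] * 3
    + [{"laptop", "antivirus"}]
    + [{"laptop"}] * 2
    + [{"phone"}] * 4
)
```

Here the rule {laptop, antivirus} ⇒ {service_plan} holds in 3 of 10 baskets (support) and in 3 of the 4 baskets that contain the left-hand side (confidence).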


Association Rule Mining

• Algorithms are available for generating association rules


• Apriori
• Eclat
• FP-Growth
• + Derivatives and hybrids of the three
• The algorithms help identify the frequent item sets, which are then
converted to association rules

Association Rule Mining

• Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the
itemsets
• Uses a bottom-up approach
• frequent subsets are extended one item at a time (the size of frequent subsets increases
from one-item subsets to two-item subsets, then three-item subsets, and so on)
• groups of candidates at each level are tested against the data for minimum support
(see the figure) →
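
The bottom-up, level-by-level search can be sketched as below. This is a compact illustration of the Apriori idea under simplifying assumptions (minimum support given as an absolute count, no rule generation, no efficiency optimizations); the function name and sample data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): count} for all itemsets meeting the minimum count."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # level 1: one-item candidates
    frequent = {}
    k = 1
    while current:
        # Test each candidate at this level against the data.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        k += 1
        # Join: extend frequent (k-1)-itemsets by one item to get size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent
```

With the baskets {a,b,c}, {a,b}, {a,c}, {b,c}, {a,b,c} and a minimum count of 3, every single item and every pair is frequent, but the triple {a,b,c} appears only twice and is rejected.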


Artificial Neural Networks for Data Mining
• An artificial neural network (ANN or NN) is a brain
metaphor for information processing
• a.k.a. Neural Computing
• Very good at capturing highly complex non-linear
functions!
• Many uses – prediction (regression, classification),
clustering/segmentation
• Many application areas – finance, medicine, marketing,
manufacturing, service operations, information systems, etc.


24
Biological versus Artificial Neural Networks

[Figure: a biological NN (neurons connected through dendrites, axons, and synapses) beside an artificial NN, in which inputs x1, x2, ..., xn with weights w1, w2, ..., wn feed a processing element (PE) that computes a summation S and applies a transfer function f(S) to produce outputs Y1, Y2, ..., Yn.]

$$S = \sum_{i=1}^{n} X_i W_i, \qquad Y = f(S)$$

Biological      Artificial
Neuron          Node (or PE)
Dendrites       Input
Axon            Output
Synapse         Weight
Slow            Fast
Many (10^9)     Few (10^2)
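
A single processing element, a weighted sum followed by a transfer function, can be sketched as follows. The sigmoid default is an assumption made for illustration (the slide does not name a particular transfer function), and the function name is hypothetical.

```python
import math


def processing_element(inputs, weights, transfer=None):
    """Compute Y = f(S), where S is the weighted sum of the inputs."""
    if transfer is None:
        # Assumed default: the logistic (sigmoid) transfer function.
        transfer = lambda s: 1.0 / (1.0 + math.exp(-s))
    s = sum(x * w for x, w in zip(inputs, weights))   # summation function
    return transfer(s)
```

With inputs (1, 2) and weights (0.5, -0.25) the weighted sum is 0, so the sigmoid output is 0.5. Passing `transfer=lambda s: s` gives a purely linear node.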


Elements/Concepts of ANN

• Processing element (PE)


• Information processing
• Network structure
• Feedforward vs. recurrent vs. multi-layer…
• Learning parameters
• Supervised/unsupervised, backpropagation, learning rate, momentum
• ANN Software – NN shells, integrated modules in
comprehensive DM software, …

Data Mining Software

• Commercial
• IBM SPSS Modeler (formerly Clementine)
• SAS – Enterprise Miner
• IBM – Intelligent Miner
• StatSoft – Statistica Data Miner
• … many more
• Free and/or Open Source
• RapidMiner
• Weka
• … many more

[Figure: bar chart of data mining tools (horizontal axis 0 to 120), with two series, "Total (w/ others)" and "Alone". Tools in the order shown: SPSS PASW Modeler (formerly Clementine); RapidMiner; SAS / SAS Enterprise Miner; Microsoft Excel; Your own code; Weka (now Pentaho); KXEN; MATLAB; Other commercial tools; KNIME; Microsoft SQL Server; Other free tools; Zementis; Oracle DM; Statsoft Statistica; Salford CART, Mars, other; Orange; Angoss; C4.5, C5.0, See5; Bayesia; Insightful Miner/S-Plus (now TIBCO); Megaputer; Viscovery; Clario Analytics; Miner3D; Thinkanalytics.]
Source: KDNuggets.com, May 2009


Data Mining Myths

• Data mining …
• provides instant solutions/predictions.
• is not yet viable for business applications.
• requires a separate, dedicated database.
• can only be done by those with advanced degrees.
• is only for large firms that have lots of customer data.
• is another name for good-old statistics.

Common Data Mining Blunders
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is
and what it really can/cannot do
3. Not leaving sufficient time for data acquisition,
selection and preparation
4. Looking only at aggregated results and not at
individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results


Common Data Mining Mistakes


6. Ignoring suspicious (good or bad) findings and
quickly moving on
7. Running mining algorithms repeatedly and blindly,
without thinking about the next stage
8. Naively believing everything you are told about
the data
9. Naively believing everything you are told about
your own data mining analysis
10. Measuring your results differently from the way
your sponsor measures them

End of the Chapter

• Questions, comments

4/9/2021

Data Preparation
(Data pre-processing)


Data Preparation

• Introduction to Data Preparation


• Types of Data
• Outliers
• Data Transformation
• Missing Data

Why Prepare Data?


• Some data preparation is needed for all mining tools
• The purpose of preparation is to transform data sets so that their
information content is best exposed to the mining tool
• The prediction error rate should be lower (or the same) after the
preparation than before it

Why Prepare Data?


• Preparing data also prepares the miner so that when
using prepared data the miner produces better models,
faster
• Good data is a prerequisite for producing effective
models of any type

Why Prepare Data?


• Data need to be formatted for a given software tool
• Data need to be made adequate for a given method
• Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or
  containing only aggregate data
    • e.g., occupation=“”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records
    • e.g., Address: travessa da Igreja de Nevogilde, Parish: Paranhos

Major Tasks in Data Preparation


• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results

Data Preparation as a step in the Knowledge Discovery Process

[Figure: the knowledge discovery process as a staircase; labels from bottom to top: DB, DW, Cleaning and Integration, Selection and Transformation, Data Mining, Evaluation and Presentation, Knowledge]

TYPES OF DATA

Types of Measurements

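• Nominal scale (qualitative)
• Categorical scale (qualitative)
• Ordinal scale (qualitative)
• Interval scale (quantitative; discrete or continuous)
• Ratio scale (quantitative; discrete or continuous)

Information content increases from the nominal scale to the ratio scale.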

Types of Measurements: Examples


• Nominal:
• ID numbers, Names of people
• Categorical:
• eye color, zip codes
• Ordinal:
• rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
• Interval:
• calendar dates, temperatures in Celsius or Fahrenheit, GRE
(Graduate Record Examination) and IQ scores
• Ratio:
• temperature in Kelvin, length, time, counts


Data Conversion
• Some tools can deal with nominal values but others need
fields to be numeric
• Convert ordinal fields to numeric to be able to use “>”
and “<” comparisons on such fields.
  • A → 4.0
  • A- → 3.7
  • B+ → 3.3
  • B → 3.0
• Multi-valued, unordered attributes with a small number of values
  • e.g. Color=Red, Orange, Yellow, …, Violet
  • for each value v create a binary “flag” variable C_v, which is 1 if
  Color=v, 0 otherwise
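The ordinal conversion above can be sketched as a simple lookup. The grade points come from the slide; the names `GRADE_POINTS` and `grade_to_number` and the sample list of grades are hypothetical.

```python
# Grade-to-number mapping taken from the slide.
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

def grade_to_number(grade):
    """Map an ordinal letter grade to its numeric value."""
    return GRADE_POINTS[grade]

grades = ["B+", "A", "B", "A-"]  # illustrative sample
numeric = [grade_to_number(g) for g in grades]

# Once numeric, ">" and "<" comparisons become meaningful.
best = max(grades, key=grade_to_number)
above_b = [g for g in grades if grade_to_number(g) > 3.0]
```

A grade missing from the mapping raises `KeyError`, which is usually preferable to silently assigning it a number.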

Conversion: Nominal, Many Values


• Examples:
• US State Code (50 values)
• Profession Code (7,000 values, but only few frequent)

• Ignore ID-like fields whose values are unique for each record
• For other fields, group values “naturally”:
  • e.g. 50 US States => 3 or 5 regions
  • Profession – select most frequent ones, group the rest
• Create binary flag-fields for selected values


Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Can be detected by standardizing observations and labelling the standardized
values outside a predetermined bound as outliers
• Outlier detection can be used for fraud detection or data cleaning

• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem


Outlier detection
• Univariate
• Compute the mean $\bar{x}$ and the standard deviation $s$. For k = 2 or 3, x is an
outlier if it lies outside the limits (normal distribution assumed)

$(\bar{x} - ks,\ \bar{x} + ks)$
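A minimal sketch of this univariate rule, assuming the sample standard deviation is intended. The function name `zscore_outliers` is hypothetical and the ages are illustrative values.

```python
import statistics

def zscore_outliers(values, k=3.0):
    """Return the values lying outside (mean - k*s, mean + k*s)."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumption)
    lower, upper = mean - k * s, mean + k * s
    return [v for v in values if v < lower or v > upper]

ages = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
        38, 36, 42, 35, 33, 45, 34, 65, 66, 38]

flagged_k2 = zscore_outliers(ages, k=2)  # limits are roughly (19.5, 59.5)
flagged_k3 = zscore_outliers(ages, k=3)  # limits are roughly (9.5, 69.5)
```

The choice of k matters: the two largest ages are flagged with k = 2 but not with k = 3, partly because the outliers themselves inflate the mean and the standard deviation.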

Outlier detection
• Univariate
• Boxplot: An observation is declared an extreme outlier if it lies
outside of the interval
$(Q_1 - 3\,IQR,\ Q_3 + 3\,IQR)$, where $IQR = Q_3 - Q_1$
(IQR = Inter Quartile Range)
and a mild outlier if it lies outside of the interval
$(Q_1 - 1.5\,IQR,\ Q_3 + 1.5\,IQR)$.

http://www.physics.csbsju.edu/stats/box2.html

[Figure: boxplot marking points more than 1.5 L and more than 3 L away from the box]
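The boxplot rule can be sketched with the standard library. Quartile definitions differ between tools; the "inclusive" method of `statistics.quantiles` is an assumption here, as are the function name `boxplot_outliers` and the sample values.

```python
import statistics

def boxplot_outliers(values):
    """Split values into mild and extreme outliers using the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        if v < q1 - 3 * iqr or v > q3 + 3 * iqr:
            extreme.append(v)   # outside (Q1 - 3 IQR, Q3 + 3 IQR)
        elif v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr:
            mild.append(v)      # outside the 1.5 IQR fences only
    return {"mild": mild, "extreme": extreme}

values = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
          38, 36, 42, 35, 33, 45, 34, 65, 66, 38, 100]
result = boxplot_outliers(values)  # Q1 = 34, Q3 = 42, IQR = 8
```

Unlike the mean and standard deviation rule, the quartiles are barely affected by the extreme values themselves, so the fences stay stable.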

Outlier detection
• Multivariate

• Clustering
• Very small clusters are outliers

http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/

Outlier detection
• Multivariate

• Distance based
• An instance with very few neighbors within D is regarded
as an outlier

Knn algorithm
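A brute-force sketch of the distance-based rule, assuming Euclidean distance. The parameter `min_neighbors` (what counts as "very few") and the sample points are hypothetical.

```python
import math

def distance_outliers(points, d, min_neighbors):
    """Flag points with fewer than min_neighbors other points within distance d."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= d
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (5.0, 5.0)]
flagged = distance_outliers(points, d=1.0, min_neighbors=2)
```

This compares every pair of points, so the cost grows quadratically with the number of instances; attributes should also be on comparable scales before distances are computed.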


A bi-dimensional outlier that is not an outlier in either of its projections.

Recommended reading

Only with hard work and a favorable context will you have the
chance to become an outlier!!!

DATA TRANSFORMATION


Normalization

• For distance-based methods, normalization helps to prevent attributes
with large ranges from outweighing attributes with small ranges
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling

Normalization
• min-max normalization

$v' = \frac{v - min_v}{max_v - min_v}\,(new\_max_v - new\_min_v) + new\_min_v$

• z-score normalization (does not eliminate outliers)

$v' = \frac{v - \bar{v}}{\sigma_v}$

• normalization by decimal scaling

$v' = \frac{v}{10^{j}}$, where j is the smallest integer such that $\max(|v'|) < 1$

e.g. range: -986 to 917 => j=3, so -986 -> -0.986 and 917 -> 0.917

Age   min-max (0-1)   z-score   dec. scaling
44    0.421            0.450    0.44
35    0.184           -0.450    0.35
34    0.158           -0.550    0.34
34    0.158           -0.550    0.34
39    0.289           -0.050    0.39
41    0.342            0.150    0.41
42    0.368            0.250    0.42
31    0.079           -0.849    0.31
28    0.000           -1.149    0.28
30    0.053           -0.949    0.30
38    0.263           -0.150    0.38
36    0.211           -0.350    0.36
42    0.368            0.250    0.42
35    0.184           -0.450    0.35
33    0.132           -0.649    0.33
45    0.447            0.550    0.45
34    0.158           -0.550    0.34
65    0.974            2.548    0.65
66    1.000            2.648    0.66
38    0.263           -0.150    0.38

minimum: 28
maximum: 66
average: 39.50
standard deviation: 10.01
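The three normalizations can be sketched as follows and checked against the Age table. The function names are hypothetical, and the z-score uses the sample standard deviation, which is what reproduces the 10.01 reported for the Age column.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

ages = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
        38, 36, 42, 35, 33, 45, 34, 65, 66, 38]
```

Note that `min_max` divides by zero for a constant attribute; a production version would need to handle that case explicitly.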

Data Transformation
• It is the process of creating new attributes
  • Often called transforming the attributes or the attribute set.
• Data transformation usually combines the original raw attributes using
different mathematical formulas originated in business models or pure
mathematical formulas.


Data Transformation
Linear Transformations
• Normalizations may not be enough to adapt the data
to improve the generated model.
• Aggregating the information contained in various
attributes might be beneficial
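• If B is an attribute subset of the complete set A, a
new attribute Z can be obtained by a linear
combination:

$Z = r_1 B_1 + r_2 B_2 + \dots + r_k B_k$

where each $r_i$ is a real number and $B_1, \dots, B_k$ are the attributes in B.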

Data Transformation
Quadratic Transformations
• In quadratic transformations a new attribute is built
as follows

$Z = r_{1,1} B_1^2 + r_{1,2} B_1 B_2 + \dots + r_{k,k} B_k^2 + r_1 B_1 + r_2 B_2 + \dots + r_k B_k + r_0$

• where $r_{i,j}$ is a real number.
• These kinds of transformations have been
thoroughly studied and can help to transform data to
make it separable.
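As a small illustration of how a quadratic attribute can make data separable, consider points inside and outside a circle. No single threshold on either original coordinate separates them, but a threshold on the derived attribute does. The function name, the coefficient names and the sample points are hypothetical, and only the second-order terms are used.

```python
def quadratic_feature(b1, b2, r11=1.0, r12=0.0, r22=1.0):
    """Z = r11*B1^2 + r12*B1*B2 + r22*B2^2 (second-order terms only)."""
    return r11 * b1 * b1 + r12 * b1 * b2 + r22 * b2 * b2

inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]   # class "inside"
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]   # class "outside"

z_inner = [quadratic_feature(x, y) for x, y in inner]
z_outer = [quadratic_feature(x, y) for x, y in outer]

# On the new attribute Z, the single test "Z < 1" separates the classes.
separable = max(z_inner) < 1.0 < min(z_outer)
```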

Data Transformation
Non-polynomial Approximations of Transformations
• Sometimes polynomial transformations are not
enough
• For example, guessing whether a set of triangles are
congruent is not possible by simply observing their
vertex coordinates
  • Computing the length of their segments will easily solve
  the problem → non-polynomial transformation
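The triangle example can be made concrete: side lengths involve a square root, so they are not a polynomial function of the coordinates, yet they settle congruence directly (side-side-side). The function names and the sample triangles are hypothetical.

```python
import math

def side_lengths(triangle):
    """Return the three side lengths of a triangle, sorted ascending."""
    a, b, c = triangle
    return sorted([math.dist(a, b), math.dist(b, c), math.dist(c, a)])

def congruent(t1, t2, tol=1e-9):
    """Two triangles are congruent when their sorted side lengths match."""
    return all(abs(x - y) <= tol
               for x, y in zip(side_lengths(t1), side_lengths(t2)))

t1 = [(0, 0), (3, 0), (0, 4)]
t2 = [(10, 10), (10, 14), (7, 10)]   # same shape, moved and flipped
t3 = [(0, 0), (3, 0), (0, 5)]        # different shape
```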

Data Transformation
Polynomial Approximations of Transformations
• We have observed that specific transformations may
be needed to extract knowledge
• But help from an expert is not always available
• When no knowledge is available, a transformation f
can be approximated via a polynomial
transformation using a brute-force search with one degree
at a time.
• Using the Weierstrass approximation, there is a
polynomial function f that takes the value Yi for each
instance Xi.

Data Transformation
Polynomial Approximations of Transformations
• There are as many polynomials verifying Y = f (X)
as we want
• As the number of instances in the data set increases,
the approximations will be better
• We can use computer assistance to approximate the
intrinsic transformation


Data Transformation
Polynomial Approximations of Transformations
• When the intrinsic transformation is polynomial we
need to add the cartesian product of the attributes
needed for the polynomial degree approximation.
• Sometimes the approximation obtained must be
rounded to avoid the limitations of the computer
digital precision.


Data Transformation
Rank Transformations
• A change in an attribute distribution can result in a
change of the model performance
• The simplest transformation to accomplish this in
numerical attributes is to replace the value of an
attribute with its rank
• The attribute will be transformed into a new
attribute containing integer values ranging from 1 to
m, where m is the number of instances in the data set.

Data Transformation
Rank Transformations
• Next we can transform the ranks to normal scores
representing their probabilities in the normal
distribution by spreading these values on the Gaussian
curve using a simple transformation given by:

$y_i = \Phi^{-1}\left(\frac{r_i - 3/8}{m + 1/4}\right)$

• where $r_i$ is the rank of observation i and Φ is the
cumulative normal function
• Note: this transformation cannot be applied separately
to the training and test partitions
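A minimal sketch of the rank and normal-score transformations. The constants 3/8 and 1/4 (Blom-style plotting positions) are an assumption, ties are broken by position rather than averaged, and the function names are hypothetical.

```python
from statistics import NormalDist

def ranks(values):
    """Rank 1 for the smallest value up to m for the largest (ties by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def normal_scores(values):
    """Map ranks to normal scores through the inverse cumulative normal."""
    m = len(values)
    phi_inv = NormalDist().inv_cdf
    # Assumed constants: (r - 3/8) / (m + 1/4)
    return [phi_inv((r - 3 / 8) / (m + 1 / 4)) for r in ranks(values)]

scores = normal_scores([10, 30, 20])
```

Because the score of an instance depends on its rank among all instances, the result changes if the data are split first, which is why the note above warns against transforming the partitions separately.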

Data Transformation
Box-Cox Transformations
• The problem when selecting the optimal transformation for an
attribute is that we do not know in advance which
transformation will be the best
• The Box-Cox transformation aims to transform a
continuous variable into an almost normal
distribution

Data Transformation
Box-Cox Transformations
• This can be achieved by mapping the values using
the following set of transformations:

$y = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases}$

• All linear, inverse, quadratic and similar
transformations are special cases of the Box-Cox
transformations.

Data Transformation
Box-Cox Transformations
• Please note that all the values of variable x in the
previous slide must be positive. If we have negative
values in the attribute we must add a parameter c to
offset such negative values:

$y = \begin{cases} \dfrac{(x + c)^{\lambda} - 1}{g\lambda} & \lambda \neq 0 \\ \dfrac{\log(x + c)}{g} & \lambda = 0 \end{cases}$

• The parameter g is used to scale the resulting
values, and it is often considered as the geometric
mean of the data

Data Transformation
Box-Cox Transformations
• The value of λ is iteratively found by testing
different values in the range from −3.0 to 3.0 in
small steps until the resulting attribute is as close as
possible to the normal distribution.
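The search for λ can be sketched as a simple grid search. The slides do not say how closeness to the normal distribution is measured; using the absolute skewness as the criterion is an assumption made here (a likelihood-based criterion is another common choice), and all function names are hypothetical.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive; add an offset c first")
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def best_lambda(values, lo=-3.0, hi=3.0, step=0.1):
    """Try lambdas in [lo, hi] and keep the one with the least skewed result."""
    steps = round((hi - lo) / step)
    candidates = [round(lo + i * step, 10) for i in range(steps + 1)]
    return min(candidates, key=lambda lam: abs(skewness(box_cox(values, lam))))

# Illustrative data whose logarithm is symmetric, so lambda = 0 is expected.
data = [math.exp(z) for z in (-2, -1, 0, 1, 2)]
chosen = best_lambda(data)
```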


Data Transformation
Spreading the Histogram
• Spreading the histogram is a special case of Box-Cox
transformations
• As Box-Cox transforms the data to resemble a normal
distribution, the histogram is thus spread as shown
here


Data Transformation
Spreading the Histogram
• When the user is not interested in converting the
distribution to a normal one, but just spreading it,
we can use two special cases of Box-Cox
transformations
1. The logarithm (with an offset if necessary) can be
used to spread the right side of the histogram:
$y = \log(x)$
2. If we are interested in spreading the left side of the
histogram we can simply use the power transformation
$y = x^{g}$

Data Transformation
Nominal to Binary Transformation
• The presence of nominal attributes in the data set can be
problematic, especially if the Data Mining (DM)
algorithm used cannot correctly handle them
• The first option is to transform the nominal variable to a
numeric one
• Although simple, this approach has two big drawbacks
that discourage it:
• With this transformation we assume an ordering of the
attribute values
• The integer values can be used in operations as numbers,
whereas the nominal values cannot


Data Transformation
Nominal to Binary Transformation
• In order to avoid the aforementioned problems, a very
typical transformation used for DM methods is to map
each nominal attribute to a set of newly generated
attributes.
• If N is the number of different values the nominal
attribute has, we will substitute the nominal variable
with a new set of binary attributes, each one
representing one of the N possible values.
• For each instance, only one of the N newly created
attributes will have a value of 1, while the rest will have
the value of 0


Data Transformation
Nominal to Binary Transformation
• This transformation is also referred to in the literature
as the 1-to-N transformation.
• A problem with this kind of transformation appears
when the original nominal attribute has a large
cardinality
• The number of attributes generated will be large as well,
resulting in a very sparse data set which will lead to
numerical and performance problems.
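A minimal sketch of the 1-to-N transformation. The function name `one_to_n`, the `is_` prefix for the new attribute names and the sample colors are hypothetical.

```python
def one_to_n(values):
    """Replace a nominal attribute by N binary attributes, one per distinct value."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows

colors = ["Red", "Violet", "Red", "Yellow"]
categories, rows = one_to_n(colors)
# Each row has exactly one attribute set to 1 and the rest set to 0.
```

With N distinct values every instance gains N columns, N - 1 of them zero, which is the sparsity problem described above for attributes with large cardinality.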


Data Transformation
Transformations via Data Reduction
• When the data set is very large, performing complex
analysis and DM can take a long computing time
• Data reduction techniques are applied in these
domains to reduce the size of the data set while
trying to maintain the integrity and the information
of the original data set as much as possible
• Mining on the reduced data set will be much more
efficient and it will also resemble the results that
would have been obtained using the original data
set.


Data Transformation
Transformations via Data Reduction
• The main strategies to perform data reduction are
Dimensionality Reduction (DR) techniques
• They aim to reduce the number of attributes or
instances available in the data set
• Well known attribute reduction techniques are Wavelet
transforms or Principal Component Analysis (PCA).


Data Transformation
Transformations via Data Reduction
• Many techniques can be found for reducing the
dimensionality in the number of instances, like the
use of clustering techniques, parametric methods
and so on


Data Transformation
Transformations via Data Reduction
• The use of binning and discretization techniques is
also useful to reduce the dimensionality and
complexity of the data set.
• They convert numerical attributes into nominal
ones, thus drastically reducing the cardinality of the
attributes involved
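Two simple binning schemes, equal width and equal frequency, can be sketched as below. Both function names and the sample incomes (in thousands) are hypothetical, and the bins are returned as integer labels rather than named intervals.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins holding (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, idx in enumerate(order):
        bins[idx] = position * n_bins // len(values)
    return bins

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
by_width = equal_width_bins(incomes, 4)       # bins are 40 wide, starting at 60
by_frequency = equal_frequency_bins(incomes, 2)
```

The example shows a typical trade-off: with equal width one large value (220) leaves a bin empty and crowds the first one, while equal frequency balances the counts. This sketch does not handle a constant attribute (zero width), and equal-frequency binning may place tied values in different bins.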


MISSING DATA


Missing Data
• Data is not always available
  • E.g., many tuples have no recorded value for several attributes,
  such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • data that were inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • history or changes of the data not being registered
• Missing data may need to be inferred.
• Missing values may carry some information content: e.g. a credit
application may carry information by noting which field the applicant
did not complete

Missing Values
• There are always missing values (MVs) in a real dataset
• MVs may have an impact on modelling; in fact, they can destroy it!
• Some tools ignore missing values, others use some
metric to fill in replacements
  • The modeller should avoid default automated replacement
  techniques
  • Difficult to know limitations, problems and introduced bias
• Replacing missing values without elsewhere capturing that
information removes information from the dataset

How to Handle Missing Data?


• Ignore records (use only cases with all values)
  • Usually done when class label is missing as most prediction methods do not
  handle missing data well
  • Not effective when the percentage of missing values per attribute varies
  considerably as it can lead to insufficient and/or biased sample sizes
• Ignore attributes with missing values
  • Use only features (attributes) with all values (may leave out important
  features)
• Fill in the missing value manually
  • tedious + infeasible?

How to Handle Missing Data?


• Use a global constant to fill in the missing value
  • e.g., “unknown”. (May create a new class!)
• Use the attribute mean to fill in the missing value
  • It will do the least harm to the mean of existing data
    • If the mean is to be unbiased
    • What if the standard deviation is to be unbiased?
• Use the attribute mean for all samples belonging to the same
class to fill in the missing value
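Mean imputation and its class-conditional variant can be sketched as follows. Missing values are represented by `None`, the function names and sample data are hypothetical, and the first function also returns a flag per value so that the fact that a value was missing is not lost.

```python
from statistics import mean

def impute_mean(values):
    """Fill None with the mean of observed values; also return missing flags."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    filled = [m if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

def impute_class_mean(values, labels):
    """Fill None with the mean of the observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: mean(vs) for c, vs in by_class.items()}
    return [means[c] if v is None else v for v, c in zip(values, labels)]

income = [100, None, 60, None, 90, 70]
defaulted = ["Yes", "Yes", "No", "No", "Yes", "No"]

filled, was_missing = impute_mean(income)
filled_by_class = impute_class_mean(income, defaulted)
```

Filling with the mean leaves the attribute mean unchanged but shrinks its standard deviation, since every filled value sits exactly at the centre. `impute_class_mean` raises `KeyError` when a class has no observed value at all.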

How to Handle Missing Data?


• Use the most probable value to fill in the missing value
• Inference-based such as Bayesian formula or decision tree
• Identify relationships among variables
• Linear regression, Multiple linear regression, Nonlinear regression

• Nearest-Neighbour estimator
• Finding the k neighbours nearest to the point and fill in the most
frequent value or the average value
• Finding neighbours in a large dataset may be slow
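A brute-force sketch of the nearest-neighbour estimator for a numeric attribute, assuming Euclidean distance over the remaining attributes and that those attributes are complete and on comparable scales. The function name `knn_impute` and the sample rows are hypothetical.

```python
import math

def knn_impute(rows, target, k=3):
    """Fill missing rows[i][target] with the average over the k nearest complete rows."""
    features = [f for f in rows[0] if f != target]
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        nearest = sorted(
            complete,
            key=lambda c: math.dist([r[f] for f in features],
                                    [c[f] for f in features]),
        )[:k]
        value = sum(n[target] for n in nearest) / len(nearest)
        filled.append({**r, target: value})
    return filled

rows = [
    {"age": 25, "income": 30}, {"age": 27, "income": 34},
    {"age": 26, "income": None},
    {"age": 60, "income": 90}, {"age": 62, "income": 94},
]
result = knn_impute(rows, target="income", k=2)
```

For a nominal attribute the average would be replaced by the most frequent value among the neighbours. Sorting all complete rows for each missing value is what makes the approach slow on large datasets.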

Nearest-Neighbour

How to Handle Missing Data?


• Note that it is as important to avoid adding bias and distortion
to the data as it is to make the information available.
  • bias is added when a wrong value is filled in
• No matter what techniques you use to conquer the problem, it
comes at a price. The more guessing you have to do, the
further away from the real data the database becomes. This,
in turn, can affect the accuracy and validity of the mining
results.

Summary
• Every real world data set needs some kind of data
pre-processing
• Deal with missing values
• Correct erroneous values
• Select relevant attributes
• Adapt data set format to the software tool to be used

• In general, data pre-processing consumes more than
60% of a data mining project effort

References
• ‘Data preparation for data mining’, Dorian Pyle, 1999
• ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000
• ‘Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations’, Ian H. Witten and Eibe Frank, 1999
• ‘Data Mining: Practical Machine Learning Tools and Techniques second
edition’, Ian H. Witten and Eibe Frank, 2005
• DM: Introduction: Machine Learning and Data Mining, Gregory Piatetsky-Shapiro
and Gary Parker
(http://www.kdnuggets.com/data_mining_course/dm1-introduction-ml-data-mining.ppt)
• ESMA 6835 Mineria de Datos (http://math.uprm.edu/~edgar/dm8.ppt)

Thank you
for your
attention!

Data Mining
Classification: Basic Concepts and Techniques

Lecture Notes for Business Intelligent Analytics
Introduction to Data Mining - Classification

Classification: Definition

• Given a collection of records (training set)
  – Each record is characterized by a tuple (x, y), where x is the attribute set
  and y is the class label
    ◆ x: attribute, predictor, independent variable, input
    ◆ y: class, response, dependent variable, output
• Task:
  – Learn a model that maps each attribute set x into one of the predefined class
  labels y

Examples of Classification Task

Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | Features extracted from x-rays or MRI scans | malignant or benign cells
Cataloging galaxies | Features extracted from telescope images | Elliptical, spiral, or irregular-shaped galaxies

General Approach for Building Classification Model

Classification Techniques

• Base Classifiers
• Decision Tree based Methods
• Rule-based Methods
• Nearest-neighbor
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
• Neural Networks, Deep Neural Nets

• Ensemble Classifiers
• Boosting, Bagging, Random Forests


Example of a Decision Tree

Training Data

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)

Home Owner?
├─ Yes: NO
└─ No: MarSt?
   ├─ Married: NO
   └─ Single, Divorced: Income?
      ├─ < 80K: NO
      └─ > 80K: YES

Apply Model to Test Data

Test Data

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Start from the root of the tree.

Home Owner?
├─ Yes: NO
└─ No: MarSt?
   ├─ Married: NO
   └─ Single, Divorced: Income?
      ├─ < 80K: NO
      └─ > 80K: YES


Apply Model to Test Data


Test Data

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Following the tree from the root: Home Owner = No → MarSt = Married → leaf NO.
Assign Defaulted to “No”

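Applying the example tree to a record amounts to a chain of attribute tests. The sketch below hard-codes the tree from the slides; the dictionary keys and the function name `classify` are hypothetical, and incomes are given in thousands.

```python
def classify(record):
    """Walk the example tree: Home Owner, then Marital Status, then Annual Income."""
    if record["home_owner"] == "Yes":
        return "No"
    if record["marital_status"] == "Married":
        return "No"
    # The slides label these branches "< 80K" and "> 80K"; sending exactly
    # 80K to the "Yes" side is an assumption made for this sketch.
    return "No" if record["annual_income"] < 80 else "Yes"

test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80}
prediction = classify(test_record)

training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
correct = sum(
    classify({"home_owner": h, "marital_status": m, "annual_income": inc}) == d
    for h, m, inc, d in training
)
```

The test record never reaches the income test, so the 80K boundary case does not affect its prediction.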
Another Example of Decision Tree

MarSt?
├─ Married: NO
└─ Single, Divorced: Home Owner?
   ├─ Yes: NO
   └─ No: Income?
      ├─ < 80K: NO
      └─ > 80K: YES

[Table: the same ten training records (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) as in Example of a Decision Tree]

There could be more than one tree that fits the same data!


Decision Tree Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Induction: the training set is fed to a tree induction algorithm, which learns the model (a decision tree). Deduction: the model is applied to the test set.]

Decision Tree Induction

• Many Algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ,SPRINT


General Structure of Hunt’s Algorithm

Let Dt be the set of training records that reach a node t

General Procedure:
– If Dt contains records that belong to the same class yt, then t
is a leaf node labeled as yt
– If Dt contains records that belong to more than one class,
use an attribute test to split the data into smaller subsets.
Recursively apply the procedure to each subset.

[Figure: the ten training records (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower) shown as the set Dt reaching a node marked "?"]

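The recursive procedure can be sketched on the ten training records. Several simplifications are assumptions of this sketch rather than part of the algorithm as stated: attributes are tested in a fixed given order, every split is multi-way, and Annual Income is pre-discretized into a hypothetical flag `income_80k_plus`.

```python
from collections import Counter

def hunt(records, attributes, label="defaulted"):
    """Return a leaf label, or (attribute, {value: subtree}) for an internal node."""
    classes = Counter(r[label] for r in records)
    # Leaf: all records share one class, or no attribute is left to test.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]
    attr, remaining = attributes[0], attributes[1:]
    children = {}
    for value in sorted({r[attr] for r in records}):
        subset = [r for r in records if r[attr] == value]
        children[value] = hunt(subset, remaining, label)  # recurse on each subset
    return (attr, children)

rows = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
records = [
    {"home_owner": h, "marital_status": m, "income_80k_plus": inc >= 80, "defaulted": d}
    for h, m, inc, d in rows
]
tree = hunt(records, ["home_owner", "marital_status", "income_80k_plus"])
```

A practical implementation would choose the attribute test at each node with an impurity measure instead of a fixed order.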
Hunt’s Algorithm
[Table: the ten training records (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower)]

(a) Single leaf: Defaulted = No (7,3)

(b) Home Owner?
    ├─ Yes: Defaulted = No (3,0)
    └─ No: Defaulted = No (4,3)

(c) Home Owner?
    ├─ Yes: Defaulted = No (3,0)
    └─ No: Marital Status?
       ├─ Single, Divorced: Defaulted = Yes (1,3)
       └─ Married: Defaulted = No (3,0)

(d) Home Owner?
    ├─ Yes: Defaulted = No (3,0)
    └─ No: Marital Status?
       ├─ Single, Divorced: Annual Income?
       │  ├─ < 80K: Defaulted = No (1,0)
       │  └─ >= 80K: Defaulted = Yes (0,3)
       └─ Married: Defaulted = No (3,0)

The pair in parentheses is (number of Defaulted = No records, number of Defaulted = Yes records) at the node.

Design Issues of Decision Tree Induction

How should training records be split?
– Method for expressing test condition
  ◆ depending on attribute types
– Measure for evaluating the goodness of a test condition

How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical
attribute values
– Early termination

Methods for Expressing Test Conditions

Depends on attribute types


– Binary
– Nominal
– Ordinal
– Continuous

Test Condition for Nominal Attributes

Multi-way split:
– Use as many partitions as distinct values.
  Marital Status → {Single}, {Divorced}, {Married}

Binary split:
– Divides values into two subsets
  Marital Status → {Married} vs {Single, Divorced}
  OR Marital Status → {Single} vs {Married, Divorced}
  OR Marital Status → {Single, Married} vs {Divorced}

Test Condition for Ordinal Attributes

• Multi-way split:
  – Use as many partitions as distinct values
    Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

• Binary split:
  – Divides values into two subsets
  – Preserve order property among attribute values
    Shirt Size → {Small, Medium} vs {Large, Extra Large}
    Shirt Size → {Small} vs {Medium, Large, Extra Large}
    Shirt Size → {Small, Large} vs {Medium, Extra Large} (this grouping violates the order property)

Test Condition for Continuous Attributes

(i) Binary split: Annual Income > 80K? → Yes / No

(ii) Multi-way split: Annual Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

Splitting Based on Continuous Attributes

• Different ways of handling
  • Discretization to form an ordinal categorical attribute
  Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles),
  or clustering.
    • Static – discretize once at the beginning
    • Dynamic – repeat at each node
  • Binary Decision: (A < v) or (A ≥ v)
    • considers all possible splits and finds the best cut
    • can be more compute intensive

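The binary decision on a continuous attribute can be sketched as an exhaustive search over candidate cut points. The assumptions here are that candidates are midpoints between adjacent distinct values and that the weighted Gini index is the goodness measure; the function names are hypothetical. The data are the Annual Income values (in thousands) and Defaulted labels of the ten training records.

```python
def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Return (v, weighted Gini) for the best test (A < v) versus (A >= v)."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2  # candidate cut between two adjacent values
        left = [c for x, c in pairs if x < v]
        right = [c for x, c in pairs if x >= v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (v, weighted)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
cut, impurity = best_binary_split(income, defaulted)
```

This version rescans all records for every candidate, which illustrates why the approach is compute intensive; sorting once and updating class counts incrementally is the usual optimization.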
How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Gender?       Yes: C0: 6, C1: 4 | No: C0: 4, C1: 6
Car Type?     Family: C0: 1, C1: 3 | Sports: C0: 8, C1: 0 | Luxury: C0: 1, C1: 7
Customer ID?  c1: C0: 1, C1: 0 | ... | c10: C0: 1, C1: 0 | c11: C0: 0, C1: 1 | ... | c20: C0: 0, C1: 1

Which test condition is the best?

How to determine the Best Split

• Greedy approach:
  – Nodes with purer class distribution are preferred

• Need a measure of node impurity:
  C0: 5, C1: 5 → High degree of impurity
  C0: 9, C1: 1 → Low degree of impurity

Measures of Node Impurity

• Gini Index

$Gini\ Index = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

• Entropy

$Entropy = -\sum_{i=0}^{c-1} p_i(t)\,\log_2 p_i(t)$

• Misclassification error

$Classification\ error = 1 - \max_i [p_i(t)]$

where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
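The three impurity measures can be sketched from class counts at a node. The function names are hypothetical, and the entropy uses the usual convention that a class with zero frequency contributes nothing.

```python
import math

def class_frequencies(counts):
    """Turn class counts at a node into relative frequencies p_i(t)."""
    total = sum(counts)
    return [c / total for c in counts]

def gini_index(counts):
    return 1.0 - sum(p * p for p in class_frequencies(counts))

def entropy(counts):
    # Classes with p = 0 are skipped (0 * log 0 is taken as 0).
    return -sum(p * math.log2(p) for p in class_frequencies(counts) if p > 0)

def classification_error(counts):
    return 1.0 - max(class_frequencies(counts))

pure = [0, 6]     # all records in one class
mixed = [3, 3]    # records equally distributed over two classes
```

All three measures are 0 for the pure node and reach their two-class maximum for the evenly mixed node (0.5 for Gini and classification error, 1 for entropy).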

Finding the Best Split

1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
   • Compute impurity measure of each child node
   • M is the weighted impurity of child nodes
3. Choose the attribute test condition that produces the highest gain

Gain = P - M

or equivalently, lowest impurity measure after splitting (M)

Finding the Best Split
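Before Splitting: class counts C0: N00, C1: N01, with impurity P

Split on A?
  Node N1 (Yes): C0: N10, C1: N11, impurity M11
  Node N2 (No):  C0: N20, C1: N21, impurity M12
  Weighted impurity of N1 and N2: M1

Split on B?
  Node N3 (Yes): C0: N30, C1: N31, impurity M21
  Node N4 (No):  C0: N40, C1: N41, impurity M22
  Weighted impurity of N3 and N4: M2

Gain = P – M1 vs P – M2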

Measure of Impurity: GINI


• Gini Index for a given node $t$:
  $\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
  where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
• Maximum of 1 − 1/𝑐 when records are equally distributed
among all classes, implying the least beneficial situation for
classification
• Minimum of 0 when all records belong to one class, implying
the most beneficial situation for classification
• Gini index is used in decision tree algorithms such as CART,
SLIQ, SPRINT

Measure of Impurity: GINI
• Gini Index for a given node $t$:
  $\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

• For a 2-class problem (p, 1 – p):
  $\text{GINI} = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$

 C1   | 0     | 1     | 2     | 3
 C2   | 6     | 5     | 4     | 3
 Gini | 0.000 | 0.278 | 0.444 | 0.500


Computing Gini Index of a Single Node

$\text{Gini Index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
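The single-node calculations above can be checked with a few lines of Python. This is a minimal sketch: the function name `gini` and the choice to treat an empty node as pure are assumptions made for illustration, not part of the slides.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    return 1.0 - sum((c / n) ** 2 for c in counts)


# Class counts (C1, C2) for the example nodes.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
```

The printed values can be compared with the 0.000, 0.278, 0.444 and 0.500 quoted for these nodes.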

Computing Gini Index for a Collection of Nodes
• When a node $p$ is split into $k$ partitions (children):
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$
  where $n_i$ = number of records at child $i$, and $n$ = number of records at parent node $p$.


Binary Attributes: Computing GINI Index

Splits into two partitions (child nodes)

Effect of weighing partitions:
 – Larger and purer partitions are sought

Parent: C1 = 7, C2 = 5, Gini = 0.486

Split on B?:
      | N1 (Yes) | N2 (No)
 C1   | 5        | 2
 C2   | 1        | 4

Gini(N1) = 1 – (5/6)^2 – (1/6)^2 = 0.278
Gini(N2) = 1 – (2/6)^2 – (4/6)^2 = 0.444
Weighted Gini of N1, N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 – 0.361 = 0.125
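As a rough check of the weighted Gini and the gain for the split on B, the sketch below recomputes both from the class counts. The helper names `gini` and `gini_split` are illustrative assumptions.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def gini_split(children):
    """Weighted Gini of the child nodes: sum of (n_i / n) * Gini(i)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = [7, 5]              # (C1, C2) counts at the parent
children = [[5, 1], [2, 4]]  # (C1, C2) counts at N1 and N2

m = gini_split(children)
gain = gini(parent) - m
print(round(gini(parent), 3), round(m, 3), round(gain, 3))
```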
Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split on CarType (Gini = 0.163):
      | Family | Sports | Luxury
 C1   | 1      | 8      | 1
 C2   | 3      | 0      | 7

Two-way split (find best partition of values):

 CarType (Gini = 0.468):
      | {Sports, Luxury} | {Family}
 C1   | 9                | 1
 C2   | 7                | 3

 CarType (Gini = 0.167):
      | {Sports} | {Family, Luxury}
 C1   | 8        | 2
 C2   | 0        | 10

Which of these is the best?
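One way to "find the best partition of values" is to enumerate every two-way grouping of the attribute values and keep the one with the lowest weighted Gini. The sketch below does this for the CarType count matrix. The helper names (`merge`, `two_way_splits`) are hypothetical, and exhaustive enumeration is only practical for attributes with few distinct values.

```python
from itertools import combinations


def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def gini_split(children):
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# Class counts (C1, C2) per CarType value, taken from the count matrix.
counts = {"Family": (1, 3), "Sports": (8, 0), "Luxury": (1, 7)}


def merge(values):
    """Sum the class counts of several attribute values into one node."""
    return [sum(counts[v][k] for v in values) for k in range(2)]


def two_way_splits(values):
    """Yield each two-way grouping of the values exactly once."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    for r in range(len(rest)):  # r < len(rest) keeps the right side non-empty
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(v for v in rest if v not in extra)
            yield left, right


multiway = gini_split([list(c) for c in counts.values()])
best = min(two_way_splits(counts),
           key=lambda s: gini_split([merge(s[0]), merge(s[1])]))
best_gini = gini_split([merge(best[0]), merge(best[1])])
print(f"multi-way: {multiway:.4f}")
print(f"best two-way: {best} with Gini {best_gini:.4f}")
```

Under these counts the multi-way split has the lowest Gini overall, and the best two-way grouping isolates Sports from the other two values.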


Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value
• Several choices for the splitting value
 – Number of possible splitting values = number of distinct values
• Each splitting value has a count matrix associated with it
 – Class counts in each of the partitions, A ≤ v and A > v
• Simple method to choose best v
 – For each v, scan the database to gather count matrix and compute its Gini index
 – Computationally inefficient! Repetition of work.

 ID | Home Owner | Marital Status | Annual Income | Defaulted
 1  | Yes        | Single         | 125K          | No
 2  | No         | Married        | 100K          | No
 3  | No         | Single         | 70K           | No
 4  | Yes        | Married        | 120K          | No
 5  | No         | Divorced       | 95K           | Yes
 6  | No         | Married        | 60K           | No
 7  | Yes        | Divorced       | 220K          | No
 8  | No         | Single         | 85K           | Yes
 9  | No         | Married        | 75K           | No
 10 | No         | Single         | 90K           | Yes

Count matrix for the test condition Annual Income ≤ 80 vs > 80:

               | ≤ 80 | > 80
 Defaulted Yes | 0    | 3
 Defaulted No  | 3    | 4

Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
 – Sort the attribute on values
 – Linearly scan these values, each time updating the count matrix and computing Gini index
 – Choose the split position that has the least Gini index

Annual Income, sorted, with class label (Cheat):

 Sorted value | 60 | 70 | 75 | 85  | 90  | 95  | 100 | 120 | 125 | 220
 Cheat        | No | No | No | Yes | Yes | Yes | No  | No  | No  | No

Count matrix and Gini index at each candidate split position:

 Split position | Yes (≤, >) | No (≤, >) | Gini
 55             | 0, 3       | 0, 7      | 0.420
 65             | 0, 3       | 1, 6      | 0.400
 72             | 0, 3       | 2, 5      | 0.375
 80             | 0, 3       | 3, 4      | 0.343
 87             | 1, 2       | 3, 4      | 0.417
 92             | 2, 1       | 3, 4      | 0.400
 97             | 3, 0       | 3, 4      | 0.300
 110            | 3, 0       | 4, 3      | 0.343
 122            | 3, 0       | 5, 2      | 0.375
 172            | 3, 0       | 6, 1      | 0.400
 230            | 3, 0       | 7, 0      | 0.420
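The sort-and-scan procedure can be sketched as follows. This is an illustrative implementation, not the slides' own code: the function name `best_split` is an assumption, candidate thresholds are taken as exact midpoints between adjacent sorted values (so 97.5 where the table lists 97), and the two out-of-range positions (55 and 230) are not evaluated because they leave one partition empty.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def best_split(values, labels):
    """Sort once, then sweep left to right, moving one record at a time
    from the right partition to the left and re-scoring the split."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n = len(pairs)
    best_v, best_g = None, float("inf")
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1   # incremental update of the count matrix
        right[label] -= 1
        if value == pairs[i + 1][0]:
            continue       # no threshold separates equal values
        n_left = i + 1
        g = (n_left / n) * gini(list(left.values())) \
            + ((n - n_left) / n) * gini(list(right.values()))
        if g < best_g:
            best_v, best_g = (value + pairs[i + 1][0]) / 2, g
    return best_v, best_g


# Annual Income (in thousands) and Defaulted label for records 1 to 10.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

v, g = best_split(income, defaulted)
print(v, round(g, 3))
```

Each record is touched once after sorting, so the scan itself is linear in the number of records rather than repeating a full pass per candidate value.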


Measure of Impurity: Entropy

• Entropy at a given node $t$:
  $\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
  where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
 ◆ Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
 ◆ Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
 – Entropy based computations are quite similar to the GINI index computations

Computing Entropy of a Single Node

$\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0 (taking 0 log2 0 = 0)

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
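The entropy values for these three nodes can be reproduced with a short function. This is a sketch under the usual convention that a class with zero records contributes nothing to the sum; the name `entropy` is illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node, given its per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0  # assumption: an empty node is treated as pure
    # Classes with zero records are skipped, i.e. 0 * log2(0) is taken as 0.
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)


for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))
```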


Computing Information Gain After Splitting

• Information Gain:
  $Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
  Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$, and $n$ is the number of records in the parent node
 – Choose the split that achieves most reduction (maximizes GAIN)
 – Used in the ID3 and C4.5 decision tree algorithms
 – Information gain is the mutual information between the class variable and the splitting variable

Problem with large number of partitions
Node impurity measures tend to prefer splits that result in large number of partitions, each being small but pure

Gender:       Yes (C0: 6, C1: 4) | No (C0: 4, C1: 6)
Car Type:     Family (C0: 1, C1: 3) | Sports (C0: 8, C1: 0) | Luxury (C0: 1, C1: 7)
Customer ID:  c1 (C0: 1, C1: 0) ... c10 (C0: 1, C1: 0) | c11 (C0: 0, C1: 1) ... c20 (C0: 0, C1: 1)

– Customer ID has highest information gain because entropy for all the children is zero
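To see the bias numerically, the sketch below computes the information gain of the three test conditions from their class counts. The function names are illustrative assumptions; the counts are those listed for Gender, Car Type and Customer ID.

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0) if n else 0.0


def info_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


parent = [10, 10]  # (C0, C1) counts before splitting
splits = {
    "Gender": [[6, 4], [4, 6]],
    "Car Type": [[1, 3], [8, 0], [1, 7]],
    "Customer ID": [[1, 0]] * 10 + [[0, 1]] * 10,  # one record per child
}

gains = {name: info_gain(parent, ch) for name, ch in splits.items()}
for name, value in gains.items():
    print(f"{name}: {value:.3f}")
```

Customer ID reaches the maximum possible gain even though an identifier has no predictive value for new records, which is the problem the slide describes.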


Gain Ratio

• Gain Ratio:
  $\text{Gain Ratio} = \frac{Gain_{split}}{\text{Split Info}}$, $\text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
  Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$
 – Adjusts Information Gain by the entropy of the partitioning (Split Info).
  ◆ Higher entropy partitioning (large number of small partitions) is penalized!
 – Used in C4.5 algorithm
 – Designed to overcome the disadvantage of Information Gain

Gain Ratio

• Gain Ratio:
  $\text{Gain Ratio} = \frac{Gain_{split}}{\text{Split Info}}$, $\text{Split Info} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
  Parent node $p$ is split into $k$ partitions (children); $n_i$ is the number of records in child node $i$

CarType, multi-way split (Gini = 0.163, SplitINFO = 1.52):
      | Family | Sports | Luxury
 C1   | 1      | 8      | 1
 C2   | 3      | 0      | 7

CarType, two-way split (Gini = 0.468, SplitINFO = 0.72):
      | {Sports, Luxury} | {Family}
 C1   | 9                | 1
 C2   | 7                | 3

CarType, two-way split (Gini = 0.167, SplitINFO = 0.97):
      | {Sports} | {Family, Luxury}
 C1   | 8        | 2
 C2   | 0        | 10
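The SplitINFO values can be recomputed from the partition sizes alone, since Split Info is simply the entropy of the proportions n_i / n. The sketch below also computes a gain ratio; the function names, and the choice to return 0 when Split Info is 0, are assumptions made for illustration. The 20-way Customer ID split is included as a comparison.

```python
from math import log2


def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0) if n else 0.0


def split_info(children):
    """Entropy of the partition sizes n_i / n."""
    return entropy([sum(ch) for ch in children])


def gain_ratio(parent, children):
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    si = split_info(children)
    return gain / si if si > 0 else 0.0  # assumption: a trivial split scores 0


parent = [10, 10]                          # (C1, C2) counts before splitting
multiway = [[1, 3], [8, 0], [1, 7]]        # Family, Sports, Luxury
sports_luxury = [[9, 7], [1, 3]]           # {Sports, Luxury}, {Family}
sports_only = [[8, 0], [2, 10]]            # {Sports}, {Family, Luxury}
customer_id = [[1, 0]] * 10 + [[0, 1]] * 10

for name, split in [("multi-way", multiway), ("{Sports, Luxury}", sports_luxury),
                    ("{Sports}", sports_only), ("Customer ID", customer_id)]:
    print(f"{name}: SplitINFO = {split_info(split):.2f}, "
          f"gain ratio = {gain_ratio(parent, split):.3f}")
```

Under these counts the Customer ID split, which had the highest information gain, ends up with a lower gain ratio than the CarType multi-way split because its Split Info is large.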


Measure of Impurity: Classification Error

• Classification error at a node $t$:
  $Error(t) = 1 - \max_i [p_i(t)]$
 – Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least interesting situation
 – Minimum of 0 when all records belong to one class, implying the most interesting situation

Computing Error of a Single Node

$Error(t) = 1 - \max_i [p_i(t)]$

C1: 0, C2: 6   P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
               Error = 1 – max (0, 1) = 1 – 1 = 0

C1: 1, C2: 5   P(C1) = 1/6, P(C2) = 5/6
               Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4   P(C1) = 2/6, P(C2) = 4/6
               Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3


Comparison among Impurity Measures

For a 2-class problem:

Misclassification Error vs Gini Index

Parent: C1 = 7, C2 = 3, Gini = 0.42

Split on A?:
      | N1 (Yes) | N2 (No)
 C1   | 3        | 4
 C2   | 0        | 3

Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.490
Gini(Children) = 3/10 * 0 + 7/10 * 0.490 = 0.343

Gini improves but error remains the same!!

Misclassification Error vs Gini Index

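Parent: C1 = 7, C2 = 3, Gini = 0.42

Two candidate splits on A?:

 First split:  N1 (C1: 3, C2: 0), N2 (C1: 4, C2: 3), Gini = 0.343
 Second split: N1 (C1: 3, C2: 1), N2 (C1: 4, C2: 2), Gini = 0.417

Misclassification error for all three cases (the parent and both splits) = 0.3!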
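The comparison can be reproduced by scoring the parent and both candidate splits with each measure. This is a small sketch; `error`, `gini` and `weighted` are illustrative names.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def error(counts):
    """Misclassification error: 1 minus the share of the majority class."""
    return 1.0 - max(counts) / sum(counts)


def weighted(measure, children):
    """Impurity of a split: size-weighted average over the child nodes."""
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * measure(ch) for ch in children)


parent = [7, 3]               # (C1, C2)
split_a = [[3, 0], [4, 3]]    # first candidate split
split_b = [[3, 1], [4, 2]]    # second candidate split

print("parent ", round(gini(parent), 3), round(error(parent), 3))
for name, split in (("split a", split_a), ("split b", split_b)):
    print(name, round(weighted(gini, split), 3), round(weighted(error, split), 3))
```

The error column is identical in all three rows, while Gini distinguishes the splits, which is why error is a poor criterion for choosing among them.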

Decision Tree Based Classification

• Advantages:
 – Relatively inexpensive to construct
 – Extremely fast at classifying unknown records
 – Easy to interpret for small-sized trees
 – Robust to noise (especially when methods to avoid overfitting are employed)
 – Can easily handle redundant attributes
 – Can easily handle irrelevant attributes (unless the attributes are interacting)
• Disadvantages:
 – Due to the greedy nature of the splitting criterion, interacting attributes (that can distinguish between classes together but not individually) may be passed over in favor of other attributes that are less discriminating.
 – Each decision boundary involves only a single attribute


Handling interactions

[Figure: scatter plot of the two classes over attributes X and Y]
+ : 1000 instances
o : 1000 instances
Entropy (X) : 0.99
Entropy (Y) : 0.99


Handling interactions given irrelevant attributes

[Figure: scatter plot of the two classes over attributes X and Y]
+ : 1000 instances
o : 1000 instances
Adding Z as a noisy attribute generated from a uniform distribution:
Entropy (X) : 0.99
Entropy (Y) : 0.99
Entropy (Z) : 0.98
Attribute Z will be chosen for splitting!

Limitations of single attribute-based decision boundaries

Both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.

1/11/2023

Data Mining: Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods

■ Cluster Analysis: Basic Concepts

■ Partitioning Methods

■ Hierarchical Methods

■ Density-Based Methods

■ Grid-Based Methods

■ Evaluation of Clustering

■ Summary

What is Cluster Analysis?


■ Cluster: A collection of data objects
 ■ similar (or related) to one another within the same group
 ■ dissimilar (or unrelated) to the objects in other groups
■ Cluster analysis (or clustering, data segmentation, …)
 ■ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
■ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)
■ Typical applications
 ■ As a stand-alone tool to get insight into data distribution
 ■ As a preprocessing step for other algorithms

Applications of Cluster Analysis


■ Data reduction
 ■ Summarization: Preprocessing for regression, PCA, classification, and association analysis
 ■ Compression: Image processing: vector quantization
■ Hypothesis generation and testing
■ Prediction based on groups
 ■ Cluster & find characteristics/patterns for each group
■ Finding K-nearest Neighbors
 ■ Localizing search to one or a small number of clusters
■ Outlier detection: Outliers are often viewed as those “far away” from any cluster


Clustering: Application Examples


■ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
■ Information retrieval: document clustering
■ Land use: Identification of areas of similar land use in an earth observation database
■ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
■ City-planning: Identifying groups of houses according to their house type, value, and geographical location
■ Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
■ Climate: understanding earth climate, find patterns of atmospheric and ocean
■ Economic Science: market research

Basic Steps to Develop a Clustering Task


■ Feature selection
■ Select info concerning the task of interest

■ Minimal information redundancy

■ Proximity measure
■ Similarity of two feature vectors

■ Clustering criterion
■ Expressed via a cost function or some rules

■ Clustering algorithms
■ Choice of algorithms

■ Validation of the results


■ Validation test (also, clustering tendency test)

■ Interpretation of the results


■ Integration with applications


Quality: What Is Good Clustering?

■ A good clustering method will produce high quality clusters
 ■ high intra-class similarity: cohesive within clusters
 ■ low inter-class similarity: distinctive between clusters
■ The quality of a clustering method depends on
 ■ the similarity measure used by the method
 ■ its implementation, and
 ■ its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering


■ Dissimilarity/Similarity metric
 ■ Similarity is expressed in terms of a distance function, typically metric: d(i, j)
 ■ The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
 ■ Weights should be associated with different variables based on applications and data semantics
■ Quality of clustering:
 ■ There is usually a separate “quality” function that measures the “goodness” of a cluster.
 ■ It is hard to define “similar enough” or “good enough”
 ■ The answer is typically highly subjective

Considerations for Cluster Analysis


■ Partitioning criteria
■ Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
■ Separation of clusters
■ Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
■ Similarity measure
■ Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
■ Clustering space
■ Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)

Requirements and Challenges


■ Scalability
 ■ Clustering all the data instead of only on samples
■ Ability to deal with different types of attributes
 ■ Numerical, binary, categorical, ordinal, linked, and mixture of these
■ Constraint-based clustering
 ■ User may give inputs on constraints
 ■ Use domain knowledge to determine input parameters
■ Interpretability and usability
■ Others
 ■ Discovery of clusters with arbitrary shape
 ■ Ability to deal with noisy data
 ■ Incremental clustering and insensitivity to input order
 ■ High dimensionality

Major Clustering Approaches (I)

■ Partitioning approach:
 ■ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
 ■ Typical methods: k-means, k-medoids, CLARANS
■ Hierarchical approach:
 ■ Create a hierarchical decomposition of the set of data (or objects) using some criterion
 ■ Typical methods: Diana, Agnes, BIRCH, CHAMELEON
■ Density-based approach:
 ■ Based on connectivity and density functions
 ■ Typical methods: DBSCAN, OPTICS, DenClue
■ Grid-based approach:
 ■ Based on a multiple-level granularity structure
 ■ Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)


■ Model-based:
 ■ A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
 ■ Typical methods: EM, SOM, COBWEB
■ Frequent pattern-based:
 ■ Based on the analysis of frequent patterns
 ■ Typical methods: p-Cluster
■ User-guided or constraint-based:
 ■ Clustering by considering user-specified or application-specific constraints
 ■ Typical methods: COD (obstacles), constrained clustering
■ Link-based clustering:
 ■ Objects are often linked together in various ways
 ■ Massive links can be used to cluster objects: SimRank, LinkClus


Partitioning Algorithms: Basic Concept

■ Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
  $E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$
■ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
 ■ Global optimal: exhaustively enumerate all partitions
 ■ Heuristic methods: k-means and k-medoids algorithms
 ■ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster
 ■ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

■ Given k, the k-means algorithm is implemented in four steps:
 1. Partition objects into k nonempty subsets
 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
 3. Assign each object to the cluster with the nearest seed point
 4. Go back to Step 2, stop when the assignment does not change
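The four steps can be sketched in plain Python as below. This is an illustrative implementation, not a reference one: the function and parameter names are assumptions, the initial partition is a shuffled round-robin assignment, squared Euclidean distance is used, and a cluster that becomes empty is re-seeded with a random point (a choice the slides do not specify).

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Mean point of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: arbitrary partition into k non-empty subsets.
    order = list(range(len(points)))
    rng.shuffle(order)
    assign = [0] * len(points)
    for rank, idx in enumerate(order):
        assign[idx] = rank % k
    for _ in range(max_iter):
        # Step 2: centroids of the current partition.
        clusters = [[p for p, a in zip(points, assign) if a == j] for j in range(k)]
        # Assumption: an emptied cluster is re-seeded with a random point.
        centers = [centroid(c) if c else rng.choice(points) for c in clusters]
        # Step 3: assign each object to the nearest centroid.
        new_assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
        # Step 4: stop when the assignment does not change.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centers


data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
labels, centers = kmeans(data, k=2)
print(labels, centers)
```

On this small, well-separated data set the two groups end up in different clusters; on harder data the result depends on the initial partition, since k-means can stop at a local optimum.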


An Example of K-Means Clustering

K=2

[Figure: the initial data set → arbitrarily partition objects into k groups → update the cluster centroids → reassign objects → update the cluster centroids → loop if needed]

■ Partition objects into k nonempty subsets
■ Repeat
 ■ Compute centroid (i.e., mean point) for each partition
 ■ Assign each object to the cluster of its nearest centroid
■ Until no change

Comments on the K-Means Method

■ Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 ■ Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
■ Comment: Often terminates at a local optimum
■ Weakness
 ■ Applicable only to objects in a continuous n-dimensional space
  ■ Using the k-modes method for categorical data
  ■ In comparison, k-medoids can be applied to a wide range of data
 ■ Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
 ■ Sensitive to noisy data and outliers
 ■ Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method

■ Most of the variants of k-means differ in
 ■ Selection of the initial k means
 ■ Dissimilarity calculations
 ■ Strategies to calculate cluster means
■ Handling categorical data: k-modes
 ■ Replacing means of clusters with modes
 ■ Using new dissimilarity measures to deal with categorical objects
 ■ Using a frequency-based method to update modes of clusters
 ■ A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

■ The k-means algorithm is sensitive to outliers!
 ■ An object with an extremely large value may substantially distort the distribution of the data
■ K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster

[Figure: two scatter plots on 0 to 10 axes]

PAM: A Typical K-Medoids Algorithm


[Figure: PAM with K=2 on a 2-D data set, axes 0 to 10]

1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid (Total Cost = 20 in the example)
3. Do loop, until no change:
 – Randomly select a nonmedoid object, O_random
 – Compute the total cost of swapping a medoid O with O_random (Total Cost = 26 for the candidate swap shown)
 – Swap O and O_random if quality is improved
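A swap-based k-medoids loop can be sketched as follows. Note the assumptions: the first k points serve as the arbitrary initial medoids, Manhattan distance is used, and every (medoid, non-medoid) swap is tried in turn rather than picking the non-medoid at random as in the figure. The names `total_cost` and `pam` are illustrative.

```python
from itertools import product


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist=manhattan):
    medoids = list(points[:k])  # assumption: first k points as initial medoids
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:             # loop until no swap improves the cost
        improved = False
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            # Swap medoid i with the candidate and re-evaluate the total cost.
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial, dist)
            if cost < best:     # keep the swap only if quality improves
                medoids, best, improved = trial, cost, True
    return medoids, best


data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, cost = pam(data, k=2)
print(medoids, cost)
```

Because medoids are always actual data objects and the cost uses distances rather than means, a single extreme value shifts the result far less than it shifts a centroid.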


Hierarchical Clustering
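■ Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: objects a, b, c, d, e clustered over Step 0 to Step 4]
 Agglomerative (AGNES), Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} → {a, b} → {d, e} → {c, d, e} → {a, b, c, d, e}
 Divisive (DIANA), Step 0 to Step 4: the same sequence in reverse order, from one cluster {a, b, c, d, e} down to single objects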

AGNES (Agglomerative Nesting)


■ Introduced in Kaufman and Rousseeuw (1990)
■ Implemented in statistical packages, e.g., Splus
■ Use the single-link method and the dissimilarity matrix
■ Merge nodes that have the least dissimilarity
■ Go on in a non-descending fashion
■ Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster

DIANA (Divisive Analysis)

■ Introduced in Kaufman and Rousseeuw (1990)
■ Implemented in statistical analysis packages, e.g., Splus
■ Inverse order of AGNES
■ Eventually each node forms a cluster on its own

Distance between Clusters

In the definitions below, tip and tjq range over the elements of clusters Ki and Kj, respectively.

■ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min d(tip, tjq)
■ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max d(tip, tjq)
■ Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg d(tip, tjq)
■ Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
■ Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
 ■ Medoid: a chosen, centrally located object in the cluster
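The first four inter-cluster distances can be written directly from their definitions. The sketch below assumes Euclidean distance between points (via `math.dist`, available from Python 3.8); the function names are illustrative.

```python
from itertools import product
from math import dist


def single_link(a, b):
    """Smallest distance between an element of a and an element of b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Largest distance between an element of a and an element of b."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_link(a, b):
    """Distance between the centroids (mean points) of the two clusters."""
    ca = tuple(sum(c) / len(a) for c in zip(*a))
    cb = tuple(sum(c) / len(b) for c in zip(*b))
    return dist(ca, cb)


ki = [(0, 0), (0, 1)]
kj = [(3, 0), (4, 0)]
for f in (single_link, complete_link, average_link, centroid_link):
    print(f.__name__, round(f(ki, kj), 3))
```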

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

■ Centroid: the “middle” of a cluster
■ Radius: square root of the average squared distance from any point of the cluster to its centroid
■ Diameter: square root of the average squared distance between all pairs of points in the cluster
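These three quantities can be sketched as below. The code assumes the common textbook forms: the radius averages squared distances to the centroid over the N points, and the diameter averages squared distances over the N(N-1) ordered pairs of distinct points. The function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """The "middle" of the cluster: the coordinate-wise mean."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def radius(points):
    """Square root of the mean squared distance to the centroid."""
    c = centroid(points)
    return sqrt(sum(sq_dist(p, c) for p in points) / len(points))


def diameter(points):
    """Square root of the mean squared distance over all pairs of points."""
    n = len(points)
    # Each unordered pair appears twice among the N(N-1) ordered pairs.
    total = 2 * sum(sq_dist(p, q) for p, q in combinations(points, 2))
    return sqrt(total / (n * (n - 1)))


cluster = [(0, 0), (2, 0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```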


Density-Based Clustering Methods

■ Clustering based on density (local cluster criterion), such as density-connected points
■ Major features:
 ■ Discover clusters of arbitrary shape
 ■ Handle noise
 ■ One scan
 ■ Need density parameters as termination condition
■ Several interesting studies:
 ■ DBSCAN: Ester, et al. (KDD’96)
 ■ OPTICS: Ankerst, et al. (SIGMOD’99)
 ■ DENCLUE: Hinneburg & Keim (KDD’98)
 ■ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering: Basic Concepts

■ Two parameters:
 ■ Eps: Maximum radius of the neighbourhood
 ■ MinPts: Minimum number of points in an Eps-neighbourhood of that point
■ NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
■ Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
 ■ p belongs to NEps(q)
 ■ core point condition: |NEps(q)| ≥ MinPts

[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]

Density-Reachable and Density-Connected

■ Density-reachable:
 ■ A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
■ Density-connected
 ■ A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

■ Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
■ Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
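A compact DBSCAN can be sketched from the definitions of Eps-neighbourhood, core point and density-reachability. This is an illustrative version with a brute-force neighbourhood query (quadratic in the number of points) and Euclidean distance; the names `region_query`, `dbscan` and `NOISE` are assumptions, and real implementations typically use a spatial index.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighbourhood of point i (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE      # may later turn out to be a border point
            continue
        labels[i] = cluster        # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point
                queue.extend(j_neighbors)
        cluster += 1
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
        (10, 10), (10, 11), (11, 10), (11, 11)]
print(dbscan(data, eps=1.5, min_pts=3))
```

In this toy data set the isolated point at (5, 5) has too few neighbours to be a core point and is not reachable from any core point, so it is reported as noise.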


DBSCAN: Sensitive to Parameters


OPTICS: A Cluster-Ordering Method (1999)

■ OPTICS: Ordering Points To Identify the Clustering Structure
 ■ Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
 ■ Produces a special order of the database wrt its density-based clustering structure
 ■ This cluster-ordering contains info equiv to the density-based clusterings corresponding to a broad range of parameter settings
 ■ Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
 ■ Can be represented graphically or using visualization techniques


Chapter 10. Cluster Analysis: Basic Concepts and Methods
■ Cluster Analysis: Basic Concepts
■ Partitioning Methods
■ Hierarchical Methods
■ Density-Based Methods
■ Grid-Based Methods
■ Evaluation of Clustering
■ Summary


Grid-Based Clustering Method

■ Using multi-resolution grid data structure
■ Several interesting methods
 ■ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
 ■ CLIQUE: Agrawal, et al. (SIGMOD’98)
  ■ Both grid-based and subspace clustering
 ■ WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
  ■ A multi-resolution clustering approach using wavelet method

STING: A Statistical Information Grid Approach

■ Wang, Yang and Muntz (VLDB’97)
■ The spatial area is divided into rectangular cells
■ There are several levels of cells corresponding to different levels of resolution

The STING Clustering Method


■ Each cell at a high level is partitioned into a number of smaller cells in the next lower level
■ Statistical info of each cell is calculated and stored beforehand and is used to answer queries
■ Parameters of higher level cells can be easily calculated from parameters of lower level cells
 ■ count, mean, s (standard deviation), min, max
 ■ type of distribution: normal, uniform, etc.
■ Use a top-down approach to answer spatial data queries
■ Start from a pre-selected layer, typically with a small number of cells
■ For each cell in the current level compute the confidence interval

STING Algorithm and Its Analysis

■ Remove the irrelevant cells from further consideration
■ When finished examining the current layer, proceed to the next lower level
■ Repeat this process until the bottom layer is reached
■ Advantages:
 ■ Query-independent, easy to parallelize, incremental update
 ■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
 ■ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

CLIQUE (Clustering In QUEst)

■ Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
■ Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space
■ CLIQUE can be considered as both density-based and grid-based
 ■ It partitions each dimension into the same number of equal-length intervals
 ■ It partitions an m-dimensional data space into non-overlapping rectangular units
 ■ A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
 ■ A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

■ Partition the data space and find the number of points that
lie inside each cell of the partition.
■ Identify the subspaces that contain clusters using the
Apriori principle
■ Identify clusters
■ Determine dense units in all subspaces of interests
■ Determine connected dense units in all subspaces of
interests.
■ Generate minimal description for the clusters
■ Determine maximal regions that cover a cluster of
connected dense units for each cluster
■ Determination of minimal cover for each cluster


Determine the Number of Clusters


■ Empirical method
 ■ # of clusters: k ≈ √(n/2) for a dataset of n points, e.g., n = 200, k = 10
■ Elbow method
 ■ Use the turning point in the curve of sum of within cluster variance w.r.t. the # of clusters
■ Cross validation method
 ■ Divide a given data set into m parts
 ■ Use m – 1 parts to obtain a clustering model
 ■ Use the remaining part to test the quality of the clustering
  ■ E.g., for each point in the test set, find the closest centroid, and use the sum of squared distance between all points in the test set and the closest centroids to measure how well the model fits the test set
 ■ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find # of clusters that fits the data the best

Measuring Clustering Quality


■ 3 kinds of measures: External, internal and relative
■ External: supervised, employ criteria not inherent to the dataset
■ Compare a clustering against prior or expert-specified
knowledge (i.e., the ground truth) using certain clustering
quality measure
■ Internal: unsupervised, criteria derived from data itself
■ Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are, e.g., Silhouette coefficient
■ Relative: directly compare different clusterings, usually those
obtained via different parameter settings for the same algorithm


Some Commonly Used External Measures

■ Matching-based measures
■ Purity, maximum matching, F-measure
■ Entropy-based measures
■ Conditional entropy, normalized mutual information (NMI), variation of information
■ Pair-wise measures
■ Four possibilities: True positive (TP), FN, FP, TN
■ Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
■ Correlation measures
■ Discretized Hubert statistic, normalized discretized Hubert statistic
[Figure: ground truth partitioning T1, T2 compared with clusters C1, C2]
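
A minimal sketch of the pair-wise measures, assuming the usual pair definitions: a pair of points is a TP if it shares both ground-truth partition and cluster, an FN if it shares only the partition, an FP if it shares only the cluster, and a TN otherwise. The label vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(truth, clusters):
    """Count TP, FN, FP, TN over all unordered pairs of points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_cluster = clusters[i] == clusters[j]
        if same_truth and same_cluster:
            tp += 1
        elif same_truth:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical ground-truth partitions and cluster assignments for six points.
truth = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]

tp, fn, fp, tn = pair_counts(truth, clusters)
jaccard = tp / (tp + fn + fp)
rand = (tp + tn) / (tp + fn + fp + tn)
fowlkes_mallows = tp / sqrt((tp + fn) * (tp + fp))
print(tp, fn, fp, tn, jaccard, rand, fowlkes_mallows)
```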
11/21/2023

Time Series Data Analysis
Assoc. Prof. Nguyen Binh Minh Ph.D.

Learning Objectives
• Describe what forecasting is
• Explain time series & its components
• Smooth a data series
• Moving average
• Exponential smoothing
• Forecast using trend models
• Simple linear regression
• Autoregressive

What Is Forecasting?
• Process of predicting a
future event
• Underlying basis of
all business decisions
• Production
• Inventory
• Personnel
• Facilities


Forecasting Approaches
Qualitative Methods
• Used when situation is vague & little data exist
• New products
• New technology
• Involve intuition, experience
• e.g., forecasting sales on Internet
Quantitative Methods
• Used when situation is ‘stable’ & historical data exist
• Existing products
• Current technology
• Involve mathematical techniques
• e.g., forecasting sales of color televisions

Quantitative Forecasting
• Select several forecasting methods
• ‘Forecast’ the past
• Evaluate forecasts
• Select best method
• Forecast the future
• Continuously monitor forecast accuracy


Quantitative Forecasting Methods

Quantitative Forecasting
• Time Series Models: Moving Average, Exponential Smoothing, Trend Models
• Causal Models: Regression

What is a Time Series?


• Set of evenly spaced numerical data
• Obtained by observing response variable at regular time periods
• Forecast based only on past values
• Assumes that the factors influencing the past and present will continue into the future
• Example
• Year: 1995 1996 1997 1998 1999
• Sales: 78.7 63.5 89.7 93.2 92.1

Time Series vs. Cross Sectional Data

Time series data is a sequence of observations
• collected from a process
• with equally spaced periods of time.

Time Series vs. Cross Sectional Data
Contrary to restrictions placed on
cross-sectional data, the major purpose
of forecasting with time series is to
extrapolate beyond the range of
the explanatory variables.

Time Series vs. Cross Sectional Data

Time series is dynamic; it does change over time.

Time Series vs. Cross Sectional Data
When working with time series data, it
is paramount that the data is plotted so
the researcher can view the data.


Time Series Components
• Trend
• Cyclical
• Seasonal
• Irregular

Trend Component
• Persistent, overall upward or downward pattern
• Due to population, technology etc.
• Several years duration

[Figure: Response plotted against time (Mo., Qtr., Yr.), showing a trend; clip art © 1984-1994 T/Maker Co.]

Trend Component
• Overall Upward or Downward Movement
• Data Taken Over a Period of Years

[Figure: Sales plotted against Time]

Cyclical Component
• Repeating up & down movements
• Due to interactions of factors influencing economy
• Usually 2-10 years duration

[Figure: Response plotted against time (Mo., Qtr., Yr.), with one Cycle marked]

Cyclical Component
• Upward or Downward Swings
• May Vary in Length
• Usually Lasts 2 - 10 Years

[Figure: Sales plotted against Time]

Seasonal Component
• Regular pattern of up & down fluctuations
• Due to weather, customs etc.
• Occurs within one year

[Figure: Response plotted against time (Mo., Qtr.), with Summer marked; clip art © 1984-1994 T/Maker Co.]

Seasonal Component
• Upward or Downward Swings
• Regular Patterns
• Observed Within One Year

[Figure: Sales plotted against Time (Monthly or Quarterly)]

Irregular Component
• Erratic, unsystematic, ‘residual’ fluctuations
• Due to random variation or unforeseen events
• Union strike
• War
• Short duration & nonrepeating
[Clip art © 1984-1994 T/Maker Co.]

Random or Irregular Component
• Erratic, Nonsystematic, Random, ‘Residual’ Fluctuations
• Due to Random Variations of
• Nature
• Accidents

• Short Duration and Non-repeating


Time Series Forecasting

Time Series: Trend?
• No: Smoothing Methods (Moving Average, Exponential Smoothing)
• Yes: Trend Models (Linear, Quadratic, Exponential, Auto-Regressive)

Time Series Analysis

Plotting Time Series Data

[Figure: Intra-Campus Bus Passengers. Number of Passengers (X 1000) by Month/Year, 09/83 to 01/95. Data collected by Coop Student (10/6/95)]

Moving Average Method


• Series of arithmetic means
• Used only for smoothing
• Provides overall impression of data over time

Used for elementary forecasting

Moving Average Graph


[Figure: Sales by Year, 93 to 98, showing the Actual series]

Moving Average
[An Example]
You work for Firestone Tire. You want to smooth random
fluctuations using a 3-period moving average.
Year Sales
1995 20,000
1996 24,000
1997 22,000
1998 26,000
1999 25,000

Moving Average
[Solution]
Year Sales MA(3) in 1,000
1995 20,000 NA
1996 24,000 (20+24+22)/3 = 22
1997 22,000 (24+22+26)/3 = 24
1998 26,000 (22+26+25)/3 = 24.33
1999 25,000 NA
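
The solution table can be reproduced with a short Python sketch. It assumes a centered moving average with an odd window, which matches the NA entries at both ends of the table; the function name is illustrative.

```python
def centered_moving_average(values, window=3):
    """Centered moving average; positions without a full window are None."""
    if window % 2 == 0:
        raise ValueError("this sketch only handles odd window lengths")
    half = window // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        result[i] = sum(values[i - half:i + half + 1]) / window
    return result


# Firestone sales from the example, in thousands (1995 to 1999).
sales = [20, 24, 22, 26, 25]
ma3 = centered_moving_average(sales, window=3)
print(ma3)
```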

Moving Average
Year Response Moving Ave
1994 2 NA
1995 5 3
1996 2 3
1997 2 3.67
1998 7 5
1999 6 NA
[Figure: Sales by year, 94 to 99]


Exponential Smoothing Method
• Form of weighted moving average
• Weights decline exponentially
• Most recent data weighted most
• Requires smoothing constant (W)
• Ranges from 0 to 1
• Subjectively chosen
• Involves little record keeping of past data

Exponential Smoothing
[An Example]

You’re organizing a Kwanza meeting. You want to forecast attendance for 2000 using
exponential smoothing (W = .20). Past attendance (in hundreds) is:
Year Attendance
1995 4
1996 6
1997 5
1998 3
1999 7
[Clip art © 1995 Corel Corp.]

Exponential Smoothing
E_i = W·Y_i + (1 - W)·E_{i-1}
Time | Y_i | Smoothed Value, E_i (W = .2) | Forecast Ŷ_{i+1}
1995 | 4 | 4.0 | NA
1996 | 6 | (.2)(6) + (1-.2)(4.0) = 4.4 | 4.0
1997 | 5 | (.2)(5) + (1-.2)(4.4) = 4.5 | 4.4
1998 | 3 | (.2)(3) + (1-.2)(4.5) = 4.2 | 4.5
1999 | 7 | (.2)(7) + (1-.2)(4.2) = 4.8 | 4.2
2000 | NA | NA | 4.8
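
A minimal Python sketch of the same recursion, assuming (as in the table) that the first smoothed value equals the first observation. Unlike the slide, the code does not round at each step, so intermediate values differ slightly but agree to one decimal place.

```python
def exponential_smoothing(values, w):
    """Apply E_i = w * Y_i + (1 - w) * E_{i-1}, starting from E_1 = Y_1."""
    smoothed = [float(values[0])]
    for y in values[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed


# Attendance (in hundreds) for 1995 to 1999, from the example.
attendance = [4, 6, 5, 3, 7]
smoothed = exponential_smoothing(attendance, w=0.2)

# The forecast for the next period is the last smoothed value.
forecast_2000 = smoothed[-1]
print([round(e, 1) for e in smoothed], round(forecast_2000, 1))
```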

Exponential Smoothing [Graph]

[Figure: Attendance by Year, showing the Actual series]

Forecast Effect of Smoothing Coefficient (W)

Ŷ_{i+1} = W·Y_i + W·(1-W)·Y_{i-1} + W·(1-W)^2·Y_{i-2} + ...

Weight
W is... | Prior Period: W | 2 Periods Ago: W(1-W) | 3 Periods Ago: W(1-W)^2
0.10 | 10% | 9% | 8.1%
0.90 | 90% | 9% | 0.9%
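
The weights in the table follow directly from the expansion above: the observation j periods before the most recent one gets weight W·(1-W)^j. A small sketch (the function name is illustrative):

```python
def smoothing_weights(w, periods):
    """Weights given to the most recent `periods` observations."""
    return [w * (1 - w) ** j for j in range(periods)]


for w in (0.10, 0.90):
    as_percent = [round(100 * x, 1) for x in smoothing_weights(w, 3)]
    print(w, as_percent)
```

A small W spreads the weight over many past periods, while a large W concentrates almost all of it on the prior period.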


Linear Time-Series Forecasting Model
• Used for forecasting trend
• Relationship between response variable Y & time X is a linear function
• Coded X values used often
• Year X: 1995 1996 1997 1998 1999
• Coded year: 0 1 2 3 4
• Sales Y: 78.7 63.5 89.7 93.2 92.1

Linear Time-Series Model

Y_i = b_0 + b_1·X_{1i}
[Figure: Y plotted against Time, X1, with one line for b1 > 0 and one for b1 < 0]

Linear Time-Series Model [An Example]

You’re a marketing analyst for Hasbro Toys. Using coded years, you find Ŷ_i = .6 + .7·X_i.
Year Sales (Units)
1995 1
1996 1
1997 2
1998 2
1999 4
Forecast 2000 sales.

Linear Time-Series [Example]


Year Coded Year Sales (Units)
1995 0 1
1996 1 1
1997 2 2
1998 3 2
1999 4 4
2000 5 ?

2000 forecast sales: Ŷ_i = .6 + .7·(5) = 4.1
The equation would be different if ‘Year’ were used instead of the coded year.

The Linear Trend Model


Ŷ_i = b_0 + b_1·X_i = 2.143 + .743·X_i
Year Coded Sales
94 0 2
95 1 5
96 2 2
97 3 2
98 4 7
99 5 6
Excel Output (Coefficients)
Intercept 2.14285714
X Variable 1 0.74285714
[Figure: Sales by year, 1993 to 2000, with the fitted line projected to year 2000]
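
The Excel coefficients can be checked with the closed-form least-squares formulas for a single predictor. This is a sketch assuming coded years 0, 1, 2, ... as in the table; the function name is illustrative.

```python
def fit_linear_trend(y):
    """Least-squares intercept b0 and slope b1 for y against coded time 0..n-1."""
    n = len(y)
    x = list(range(n))
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Sales for 94 to 99 from the table above.
sales = [2, 5, 2, 2, 7, 6]
b0, b1 = fit_linear_trend(sales)

# Year 2000 is coded year 6 in this series.
forecast_2000 = b0 + b1 * 6
print(round(b0, 3), round(b1, 3), round(forecast_2000, 1))
```

The same function applied to the Hasbro data (1, 1, 2, 2, 4) gives the coefficients .6 and .7 used in the earlier example.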

Time Series Plot
[Figure: Surgery Data (Time Sequence Plot). Number of Surgeries (X 1000) by Month/Year, 01/93 to 01/97. Source: General Hospital, Metropolis]

Time Series Plot [Revised]
[Figure: Revised Surgery Data (Time Sequence Plot). Number of Surgeries (X 100) by Month/Year, 01/93 to 01/97. Source: General Hospital, Metropolis]

Seasonality Plot
[Figure: Revised Surgery Data (Seasonal Decomposition). Monthly Index by Month, Jan to Dec. Source: General Hospital, Metropolis]

Trend Analysis
[Figure: Revised Surgery Data (Trend Analysis). Number of Surgeries (X 1000) by Month/Year, 12/92 to 12/97. Source: General Hospital, Metropolis]


Quadratic Time-Series Forecasting Model
• Used for forecasting trend
• Relationship between response variable Y & time X is a quadratic
function
• Coded years used
• Quadratic model

Y_i = b_0 + b_1·X_{1i} + b_{11}·X_{1i}^2

Quadratic Time-Series Model Relationships
[Figure: four panels of Y against Year, X1; two with b11 > 0 and two with b11 < 0]

Quadratic Trend Model


Ŷ_i = b_0 + b_1·X_i + b_2·X_i^2
Year Coded Sales
94 0 2
95 1 5
96 2 2
97 3 2
98 4 7
99 5 6
Excel Output (Coefficients)
Intercept 2.85714286
X Variable 1 -0.3285714
X Variable 2 0.21428571
Ŷ_i = 2.857 - 0.33·X_i + .214·X_i^2


Exponential Time-Series Forecasting Model
• Used for forecasting trend
• Relationship is an exponential function
• Series increases (decreases) at increasing (decreasing) rate

Exponential Time-Series Model Relationships
[Figure: Y plotted against Year, X1, with one curve for b1 > 1 and one for 0 < b1 < 1]

Exponential Weight [Example Graph]
[Figure: Sales by Year, 94 to 99, showing the Data and Smoothed series]

Exponential Trend Model


Ŷ_i = b_0·b_1^{X_i}  or  log Ŷ_i = log b_0 + X_i·log b_1
Year Coded Sales
94 0 2
95 1 5
96 2 2
97 3 2
98 4 7
99 5 6
Excel Output of Values in logs (Coefficients)
Intercept 0.33583795
X Variable 1 0.08068544
antilog(.33583795) = 2.17
antilog(.08068544) = 1.2
Ŷ_i = (2.17)(1.2)^{X_i}
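
The log transformation above can be sketched as a linear fit on base-10 logs followed by antilogs. Base 10 is an assumption, but it reproduces the Excel intercept and slope shown in the table; the function name is illustrative.

```python
import math


def fit_exponential_trend(y):
    """Fit log10(y) = log10(b0) + x * log10(b1) on coded time 0..n-1; return b0, b1."""
    n = len(y)
    x = range(n)
    logs = [math.log10(v) for v in y]
    x_mean = sum(x) / n
    log_mean = sum(logs) / n
    sxy = sum((xi - x_mean) * (li - log_mean) for xi, li in zip(x, logs))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = log_mean - slope * x_mean
    # Take antilogs to return to the original scale.
    return 10 ** intercept, 10 ** slope


# Sales for 94 to 99 from the table above.
sales = [2, 5, 2, 2, 7, 6]
b0, b1 = fit_exponential_trend(sales)
print(round(b0, 2), round(b1, 2))
```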


Autoregressive Modeling
• Used for forecasting trend
• Like regression model
• Independent variables are lagged response variables Y_{i-1}, Y_{i-2}, Y_{i-3}, etc.
• Assumes data are correlated with past data values
• 1st Order: Correlated with prior period
• Estimate with ordinary least squares

Time Series Data Plot
[Figure: Intra-Campus Bus Passengers. Number of Passengers (X 1000) by Month/Year, 09/83 to 01/95. Data collected by Coop Student (10/6/95)]

Auto-correlation Plot
[Figure: Intra-Campus Bus Passengers (Auto Correlation Function Plot). Autocorrelation from -1 to 1 by Lag, 0 to 25]

Autoregressive Model [An Example]
The Office Concept Corp. has acquired a number of office
units (in thousands of square feet) over the last 8 years.
Develop the 2nd order autoregressive model.
Year Units
92 4
93 3
94 2
95 3
96 2
97 2
98 4
99 6

Autoregressive Model [Example Solution]
• Develop the 2nd order table
• Use Excel to run a regression model
Year Y_i Y_{i-1} Y_{i-2}
92 4 --- ---
93 3 4 ---
94 2 3 4
95 3 2 3
96 2 3 2
97 2 2 3
98 4 2 2
99 6 4 2
Excel Output (Coefficients)
Intercept 3.5
X Variable 1 0.8125
X Variable 2 -0.9375
Ŷ_i = 3.5 + .8125·Y_{i-1} - .9375·Y_{i-2}
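
The same regression can be sketched without Excel by building the lagged table and solving the least-squares normal equations. This is an illustration, not the slides' method: the helper names are hypothetical, and the small Gaussian-elimination solver stands in for any linear-algebra routine.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_autoregressive(y, order):
    """Ordinary least squares of Y_i on an intercept and Y_{i-1}, ..., Y_{i-order}."""
    rows = [[1.0] + [y[i - lag] for lag in range(1, order + 1)]
            for i in range(order, len(y))]
    targets = y[order:]
    k = order + 1
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(k)]
    return solve_linear_system(xtx, xty)


# Office units for 92 to 99 from the example.
units = [4, 3, 2, 3, 2, 2, 4, 6]
a0, a1, a2 = fit_autoregressive(units, order=2)

# One-step-ahead forecast for 2000 from the two most recent observations.
forecast_2000 = a0 + a1 * units[-1] + a2 * units[-2]
print(round(a0, 4), round(a1, 4), round(a2, 4), round(forecast_2000, 3))
```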


Evaluating Forecasts


Forecasting Guidelines
• No pattern or direction in forecast error
• ei = (Actual Yi - Forecast Yi)
• Seen in plots of errors over time
• Smallest forecast error
• Measured by mean absolute deviation
• Simplest model
• Called principle of parsimony
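
As a small illustration of the error measures above, the sketch below computes e_i and the mean absolute deviation, taken here as the average of |e_i|. The inputs reuse the exponential smoothing example from earlier in the deck (actual attendance for 1996 to 1999 and the one-step forecasts made for those years); the function names are illustrative.

```python
def forecast_errors(actual, forecast):
    """e_i = actual Y_i minus forecast Y_i."""
    return [a - f for a, f in zip(actual, forecast)]


def mean_absolute_deviation(actual, forecast):
    """Average absolute forecast error."""
    errors = forecast_errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)


actual = [6, 5, 3, 7]            # attendance, 1996 to 1999
forecast = [4.0, 4.4, 4.5, 4.2]  # exponential smoothing forecasts, W = .2

errors = forecast_errors(actual, forecast)
mad = mean_absolute_deviation(actual, forecast)
print([round(e, 1) for e in errors], round(mad, 3))
```

When comparing candidate models, the one with the smaller MAD and no visible pattern in its errors would be preferred.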

Pattern of Forecast Error
[Figure: two plots of Error against Time (Years): one where the trend is not fully accounted for, and one showing the desired pattern around 0]

Residual Analysis
[Figure: four plots of residuals e against time T: random errors; cyclical effects not accounted for; trend not accounted for; seasonal effects not accounted for]

Principle of Parsimony
• Suppose two or more models provide good fit for data
• Select the Simplest Model
• Simplest model types:
• least-squares linear
• least-squares quadratic
• 1st order autoregressive
• More complex types:
• 2nd and 3rd order autoregressive
• least-squares exponential


Summary
• Described what forecasting is
• Explained time series & its components
• Smoothed a data series
• Moving average
• Exponential smoothing
• Forecasted using trend models

You and StatGraphics
• Specification [Know assumptions underlying various models.]
• Estimation [Know mechanics of StatGraphics Plus Win.]
• Diagnostic checking

Questions?

Source of Elaborate Slides
Prentice Hall, Inc
Levine et al., First Edition

ANOVA


End of Chapter

Business Intelligence: Analytics
Business Performance Management (BPM)

Learning Objectives

• Understand the all-encompassing nature of
performance management (BPM)
• Understand the closed-loop processes linking
strategy to execution
• Strategize: Where Do We Want to Go?
• Plan: How Do We Get There?
• Monitor: How Are We Doing?
• Act /Adjust: What Do We Need to Do Differently?
• Describe some of the best practices in planning
and management reporting

TRƯỜNG CÔNG NGHỆ THÔNG TIN VÀ TRUYỀN THÔNG


School of Information and Communication Technology

Learning Objectives

• Describe the difference between performance
management and measurement
• Understand the role of methodologies in BPM
• Describe the basic elements of the balanced
scorecard and Six Sigma methodologies
• Describe the differences between scorecards and
dashboards
• Understand some of the basic concepts of
dashboards and dashboard design

Opening Vignette…

“Double Down at Harrah's”

• Company background
• Problem description
• Proposed solution
• Results
• Answer & discuss the case questions.

Business Performance Management (BPM) Overview

• Business Performance Management (BPM) is…
A real-time system that alerts managers to
potential opportunities, impending problems, and
threats, and then empowers them to react through
models and collaboration.
• Also called corporate performance management
(CPM by Gartner Group), enterprise performance
management (EPM by Oracle), strategic enterprise
management (SEM by SAP)

Business Performance Management (BPM) Overview

• BPM refers to the business processes,
methodologies, metrics, and technologies used by
enterprises to measure, monitor, and manage
business performance.
• BPM encompasses three key components
• A set of integrated, closed-loop management and
analytic processes, supported by technology
• Tools for businesses to define strategic goals and then
measure/manage performance against them
• Methods and tools for monitoring key performance
indicators (KPIs), linked to organizational strategy

BPM versus BI

• BPM is an outgrowth of BI and incorporates many
of its technologies, applications, and techniques.
• BI is a crucial element of BPM.

• BPM = BI + Planning (a unified solution)

A Closed-loop Process to Optimize Business Performance

• Process Steps
1. Strategize
2. Plan
3. Monitor/analyze
4. Act/adjust

Each with its own process steps…

Strategize: Where Do We Want to Go?

• Strategic planning
• Common tasks for the strategic planning process:
1. Conduct a current situation analysis
2. Determine the planning horizon
3. Conduct an environment scan
4. Identify critical success factors
5. Complete a gap analysis
6. Create a strategic vision
7. Develop a business strategy
8. Identify strategic objectives and goals

Strategize: Where Do We Want to Go?

• Strategic objective
A broad statement or general course of action
prescribing targeted directions for an organization
• Strategic goal
A quantified objective with a designated time period
• Strategic vision
A picture or mental image of what the organization
should look like in the future
• Critical success factors (CSF)
Key factors that delineate the things that an
organization must excel at to be successful

Strategize: Where Do We Want to Go?

“90 percent of organizations fail to execute their
strategies”
• The strategy gap
• Four sources for the gap between strategy and
execution:
1. Communication (enterprise-wide)
2. Alignment of rewards and incentives
3. Focus (concentrating on the core elements)
4. Resources

Plan: How Do We Get There?

• Operational planning
• Operational plan: plan that translates an
organization’s strategic objectives and goals into a
set of well-defined tactics and initiatives, resource
requirements, and expected results for some future
time period (usually a year).
• Operational planning can be
• Tactic-centric (operationally focused)
• Budget-centric (financially focused)

Plan: How Do We Get There?

• Financial planning and budgeting
• An organization’s strategic objectives and key
metrics should serve as top-down drivers for the
allocation of an organization’s tangible and
intangible assets
• Resource allocations should be carefully aligned
with the organization’s strategic objectives and
tactics in order to achieve strategic success

Monitor: How Are We Doing?

• A comprehensive framework for monitoring
performance should address two key issues:
• What to monitor
• Critical success factors
• Strategic goals and targets
• How to monitor

Monitor: How Are We Doing?

• Diagnostic control system
A cybernetic system that has inputs, a
process for transforming the inputs into
outputs, a standard or benchmark against
which to compare the outputs, and a
feedback channel to allow information on
variances between the outputs and the
standard to be communicated and acted
upon

Monitor: How Are We Doing?

• Pitfalls of variance analysis
• Most of the exception analysis focuses on negative
variances when functional groups or departments fail to
meet their targets
• Rarely are positive variances reviewed for potential
opportunities, and rarely does the analysis focus on
assumptions underlying the variance patterns

Monitor: How Are We Doing?

What if strategic assumptions (not
the operations) are wrong?

Act and Adjust: What Do We Need to Do Differently?

• Success (or mere survival) depends on new
projects: creating new products, entering new
markets, acquiring new customers (or
businesses), or streamlining some process.
• Most new projects and ventures fail!
• Hollywood movies: 60% chance of failure
• Mergers and acquisitions: 60%
• IT projects (large-scale): 70%
• New food products: 80%
• New pharmaceutical products: 90% …

Act and Adjust: What Do We Need to Do Differently?

[Figure: Harrah’s Closed-Loop Marketing Model]

Act and Adjust: What Do We Need to Do Differently?

• Saxon Group’s findings:
• Only 20 percent of the organizations utilized an
integrated performance management system
• Fewer than 3 out of 10 companies developed plans
that clearly identified the expected results of major
projects or initiatives
• More than 75 percent of the information reported
to management was historic and internally focused;
less than 25 percent was predictive of the future
• The average knowledge worker spent less than 20
percent of his or her time focused on the so-called
higher-value analytical and decision support tasks

Performance Measurement

• Performance measurement system
A system that assists managers in tracking the
implementation of business strategy by
comparing actual results against strategic goals
and objectives
• Comprises systematic comparative methods that
indicate progress (or lack thereof) against goals

Performance Measurement KPIs and Operational Metrics

• Key performance indicator (KPI)
A KPI represents a strategic objective and
metric that measures performance against a
goal
• Distinguishing features of KPIs
◼ Strategy
◼ Targets
◼ Ranges
◼ Encodings
◼ Time frames
◼ Benchmarks

Performance Measurement

• Key performance indicator (KPI)
Outcome KPIs (lagging indicators, e.g., revenues) vs.
Driver KPIs (leading indicators, e.g., sales leads)
• Operational areas covered by driver KPIs
• Customer performance
• Service performance
• Sales operations
• Sales plan/forecast

Performance Measurement

• Problems with existing performance
measurement systems
• The most popular system in use is some variant of
the balanced scorecard (BSC)
• 50-90% of all companies implemented BSC
• BSC methodology is a holistic vision of a
measurement system tied to the strategic direction
of the organization and based on a four-perspective
view of the world:
• Financial measures supported by customer, internal
process, and learning and growth metrics

Performance Measurement

• The drawbacks of using financial data as the
core of a performance measurement:
• Financial measures are usually reported by
organizational structures and not by the processes
that produced them
• Financial measures are lagging indicators, telling us
what happened, not why it happened or what is
likely to happen in the future
• Financial measures are often the product of
allocations that are not related to the underlying
processes that generated them
• Financial measures are focused on short-term
returns

Performance Measurement

• Good performance measures should:
• Be focused on key factors.
• Be a mix of past, present, and future.
• Balance the needs of all stakeholders
(shareholders, employees, partners, suppliers, etc.).
• Start at the top and trickle down to the bottom.
• Have targets that are based on research and reality
rather than be arbitrary.

BPM Methodologies

• An effective performance measurement system should
help:
• Align top-level strategic objectives and bottom-level initiatives.
• Identify opportunities and problems in a timely fashion.
• Determine priorities and allocate resources accordingly.
• Change measurements when the underlying processes and
strategies change.
• Delineate responsibilities, understand actual performance
relative to responsibilities, and reward and recognize
accomplishments.
• Take action to improve processes and procedures when the
data warrant it.
• Plan and forecast in a more reliable and timely fashion.

BPM Methodologies

• Balanced scorecard (BSC)
A performance measurement and management
methodology that helps translate an
organization’s financial, customer, internal
process, and learning and growth objectives
and targets into a set of actionable initiatives
• “The Balanced Scorecard: Measures That Drive
Performance” (HBR, 1992)

BPM Methodologies Balanced Scorecard

BPM Methodologies

• The meaning of “balance”
• BSC is designed to overcome the limitations of
systems that are financially focused
• Nonfinancial objectives fall into one of three
perspectives:
1. Customer
2. Internal business process
3. Learning and growth

BPM Methodologies

• In BSC, the term “balance” arises because the
combined set of measures is supposed to
encompass indicators that are:
• Financial and nonfinancial
• Leading and lagging
• Internal and external
• Quantitative and qualitative
• Short term and long term

BPM Methodologies

• Aligning strategies and actions
• A six-step process
1. Developing and formulating a strategy
2. Planning the strategy
3. Aligning the organization
4. Planning the operations
5. Monitoring and learning
6. Testing and adapting the strategy

BPM Methodologies

• Strategy map
A visual display that delineates the relationships
among the key organizational objectives for all
four BSC perspectives

BPM Methodologies

• Six Sigma
A performance management methodology
aimed at reducing the number of defects in a
business process to as close to zero defects
per million opportunities (DPMO) as
possible
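
As a rough illustration of the DPMO metric, the sketch below uses the standard definition: defects divided by total opportunities (units times opportunities per unit), scaled to one million. The inspection counts are hypothetical and the function name is illustrative.

```python
def dpmo(defects, units, opportunities_per_unit):
    """Defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    return defects * 1_000_000 / total_opportunities


# Hypothetical inspection: 15 defects in 1,000 units, 5 defect opportunities per unit.
rate = dpmo(defects=15, units=1000, opportunities_per_unit=5)
print(rate)
```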

BPM Methodologies

• Six Sigma
• The DMAIC performance model
A closed-loop business improvement model that
encompasses the steps of defining, measuring,
analyzing, improving, and controlling a process
• Lean Six Sigma
• Lean manufacturing / lean production
• Lean production versus Six Sigma

BPM Methodologies

• How to Succeed in Six Sigma
• Six Sigma is integrated with business strategy
• Six Sigma supports business objectives
• Key executives are engaged in the process
• Project selection is based on value potential
• There is a critical mass of projects and resources
• Projects-in-process are actively managed
• Team leadership skills are emphasized
• Results are rigorously tracked
• BSC + Six Sigma = Success

BPM Methodologies

• Integrating Six Sigma with BSC by
• Translating their strategy into quantifiable
objectives
• Cascading objectives through the organization
• Setting targets based on the voice of the customer
• Implementing strategic projects using Six Sigma
• Executing processes in a consistent fashion to
deliver business results
• See Table 3.3 for a comparison of balanced
scorecard and Six Sigma

BPM Technologies and Applications

• BPM architecture
• The logical and physical design of a system
• BPM systems consist of three logical parts:
1. BPM Applications
2. Information Hub
3. Source Systems
• BPM systems consist of three physical parts:
1. Database tier
2. Application tier
3. Client or user interface

BPM Architecture and Applications

• BPM applications
1. Strategy management
2. Budgeting, planning, and forecasting
3. Financial consolidation
4. Profitability modeling and optimization
5. Financial, statutory, and management reporting

BPM Architecture and Applications

• Leading BPM Application Suites/Vendors
• SAP Business Objects Enterprise Performance Management
• Oracle Hyperion Performance Management
• IBM Cognos BI and Financial Performance Management
• MicroStrategy, Microsoft

Performance Dashboards

• Dashboards and scorecards both provide visual displays of important information that is consolidated and arranged on a single screen so that information can be digested at a single glance and easily explored

Performance Dashboards

• Dashboards versus scorecards


• Performance dashboards
Visual display used to monitor operational
performance (free form)
• Performance scorecards
Visual display used to chart progress against
strategic and tactical goals and targets
(predetermined measures)

Performance Dashboards

• Dashboards versus scorecards


• Performance dashboard is a multilayered application
built on a business intelligence and data integration
infrastructure that enables organizations to
measure, monitor, and manage business
performance more effectively
- Eckerson
• Three types of performance dashboards:
1. Operational dashboards
2. Tactical dashboards
3. Strategic dashboards

Performance Dashboards

• Dashboard design
• “The fundamental challenge of dashboard design is
to display all the required information on a single
screen, clearly and without distraction, in a manner
that can be assimilated quickly”
(Few, 2005)

Performance Dashboards

• What to look for in a dashboard


• Use of visual components (e.g., charts, performance bars, sparklines,
gauges, meters, stoplights) to highlight, at a glance, the
data and exceptions that require action
• Transparent to the user, meaning that it requires minimal
training and is extremely easy to use
• Combines data from a variety of systems into a single,
summarized, unified view of the business
• Enables drill-down or drill-through to underlying data sources
or reports
• Presents a dynamic, real-world view with timely data updates
• Requires little, if any, customized coding to implement, deploy,
and maintain

End of the Chapter

• Questions, comments

Information Visualization:
Principles, Promise, and Pragmatics

Assoc. Prof. Nguyen Binh Minh Ph.D.

Agenda

• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up

CHI 2003 Tutorial 1 Marti Hearst


Introduction

• Goals of Information Visualization


• Case Study: The Journey of the TreeMap
• Key Questions

What is Information Visualization?

• Visualize: to form a mental image or vision of …

• Visualize: to imagine or remember as if actually seeing.
American Heritage dictionary, Concise Oxford dictionary


What is Information Visualization?
“Transformation of the symbolic into the geometric”
(McCormick et al., 1987)

“... finding the artificial memory that best supports our natural means of perception.”
(Bertin, 1983)

The depiction of information using spatial or graphical representations, to facilitate comparison, pattern recognition, change detection, and other cognitive skills by making use of the visual system.

Information Visualization
• Problem:
– HUGE Datasets: How to understand them?
• Solution
– Take better advantage of human perceptual system
– Convert information into a graphical representation.
• Issues
– How to convert abstract information into graphical form?
– Do visualizations do a better job than other methods?


Visualization
Success Stories

Images from yahoo.com

The Power of Visualization


1. Start out going Southwest on ELLSWORTH AVE
Towards BROADWAY by turning right.
2. Turn RIGHT onto BROADWAY.
3. Turn RIGHT onto QUINCY ST.
4. Turn LEFT onto CAMBRIDGE ST.
5. Turn SLIGHT RIGHT onto MASSACHUSETTS AVE.
6. Turn RIGHT onto RUSSELL ST.

Image from mapquest.com

Visualization Success Story

Mystery: what is causing a cholera epidemic in London in 1854?


Visualization Success Story
Illustration of John Snow’s deduction that a cholera epidemic was caused by a bad water pump, circa 1854.

Horizontal lines indicate location of deaths.

From Visual Explanations by Edward Tufte, Graphics Press, 1997


Purposes of Information Visualization

To help:
Explore
Calculate
Communicate
Decorate


Two Different Primary Goals: Two Different Types of Viz

Explore/Calculate
– Analyze
– Reason about Information
Communicate
– Explain
– Make Decisions
– Reason about Information


Goals of Information Visualization

More specifically, visualization should:

– Make large datasets coherent
(Present huge amounts of information compactly)
– Present information from various viewpoints
– Present information at several levels of detail
(from overviews to fine structure)
– Support visual comparisons
– Tell stories about the data

Why Visualization?
Use the eye for pattern recognition; people are good at
– scanning
– recognizing
– remembering images

Graphical elements facilitate comparisons via
– length
– shape
– orientation
– texture
Animation shows changes across time
Color helps make distinctions
Aesthetics make the process appealing


The Need for Critical Analysis
• We see many creative ideas, but they often fail in practice
• The hard part: how to apply it judiciously
– Inventors usually do not accurately predict how their invention will be used
• This tutorial will emphasize
– Getting past the coolness factor
– Examining usability studies

Case Study:
The Journey of the TreeMap

• The TreeMap (Johnson & Shneiderman ‘91)


• Idea:
– Show a hierarchy as a 2D layout
– Fill up the space with rectangles representing objects
– Size on screen indicates relative size of underlying
objects.
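
As a rough illustration of the idea above, here is a minimal Python sketch of a slice-and-dice layout, which fills a rectangle by alternating the split direction at each level of the hierarchy. The `Node` class, the `layout` function, and the toy sizes are assumptions made for this example; this is not code from the TreeMap authors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    size: float = 0.0                      # only meaningful for leaves
    children: list[Node] = field(default_factory=list)

    def total(self) -> float:
        """Size of a leaf, or the summed size of all descendants."""
        if not self.children:
            return self.size
        return sum(child.total() for child in self.children)


def layout(node, x, y, w, h, depth=0, out=None):
    """Return {leaf name: (x, y, width, height)} for a slice-and-dice layout."""
    if out is None:
        out = {}
    if not node.children:
        out[node.name] = (x, y, w, h)
        return out
    total = node.total()
    offset = 0.0                           # fraction of the parent already used
    for child in node.children:
        share = child.total() / total if total else 0.0
        if depth % 2 == 0:                 # even levels: place children side by side
            layout(child, x + offset * w, y, w * share, h, depth + 1, out)
        else:                              # odd levels: stack children vertically
            layout(child, x, y + offset * h, w, h * share, depth + 1, out)
        offset += share
    return out


# Toy hierarchy (hypothetical sizes): one file and one directory with two files.
root = Node("root", children=[
    Node("a", 2.0),
    Node("dir", children=[Node("b", 1.0), Node("c", 1.0)]),
])
for name, rect in layout(root, 0.0, 0.0, 4.0, 4.0).items():
    print(name, rect)
```

Each leaf's area is proportional to its size, which is the property the slide describes. Nothing in this sketch controls aspect ratios, so deep or unbalanced hierarchies can produce long, thin rectangles.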


Early Treemap Applied to File System


Treemap Problems

• Too disorderly
– What does adjacency mean?
– Uncontrolled aspect ratios lead to many skinny boxes that clutter the display
• Color not used appropriately
– In fact, it is meaningless here
• Wrong application
– Don’t need all this just to see the largest files in the OS


Successful Application of Treemaps
• Think more about the use
– Break into meaningful groups
– Fix these into a useful aspect ratio
• Use visual properties properly
– Use color to distinguish meaningfully
• Use only two colors:
– Can then distinguish one thing from another
• When exact numbers aren’t very important
• Provide excellent interactivity
– Access to the real data
– Makes it into a useful tool


TreeMaps in Action

http://www.smartmoney.com/maps

http://www.peets.com/tast/11/coffee_selector.asp


A Good Use of TreeMaps and Interactivity

www.smartmoney.com/marketmap

Treemaps in Peets site


Analysis vs. Communication

• MarketMap’s use of TreeMaps allows for sophisticated analysis
• Peets’ use of TreeMaps is more for presentation and communication
• This is a key contrast

Key Questions to Ask about a Viz

1. What does it teach/show/elucidate?
2. What is the key contribution?
3. What are some compelling, useful examples?
4. Could it have been done more simply?
5. Have there been usability studies done?
What do they show?



Visual Principles
– Types of Graphs
– Pre-attentive Properties
– Relative Expressiveness of Visual Cues
– Visual Illusions
– Tufte’s notions
• Graphical Excellence
• Data-Ink Ratio Maximization
• How to Lie with Visualization


References for Visual Principles

• Kosslyn: Types of Visual Representations
• Lohse et al.: How do people perceive common graphic displays
• Bertin, Mackinlay: Perceptual properties and visual features
• Tufte/Wainer: How to mislead with graphs

A Graph is: (Kosslyn)

• A visual display that illustrates one or more relationships among entities
• A shorthand way to present information
• Allows a trend, pattern, or comparison to be easily apprehended

Types of Symbolic Displays (Kosslyn)

• Graphs
• Charts
• Maps
• Diagrams

Types of Symbolic Displays
• Graphs
– at least two scales required
– values associated by a symmetric “paired with”
relation
• Examples: scatter-plot, bar-chart, layer-graph

Types of Symbolic Displays
• Charts
– discrete relations among discrete entities
– structure relates entities to one another
– lines and relative position serve as links

Examples:
family tree
flow chart
network diagram

Types of Symbolic Displays
• Maps
– internal relations determined (in part) by the spatial
relations of what is pictured
– labels paired with locations

Examples:
map of census data
topographic maps
From www.thehighsierra.com

Types of Symbolic Displays
• Diagrams
– schematic pictures of objects or entities
– parts are symbolic (unlike photographs)
• how-to illustrations
• figures in a manual

From Gleitman, Henry. Psychology. W.W. Norton and Company, Inc., New York, 1995

Anatomy of a Graph (Kosslyn)

• Framework
– sets the stage
– kinds of measurements, scale, ...
• Content
– marks
– point symbols, lines, areas, bars, …
• Labels
– title, axes, tic marks, ...

Basic Types of Data

• Nominal (qualitative)
– (no inherent order)
– city names, types of diseases, ...
• Ordinal (qualitative)
– (ordered, but not at measurable intervals)
– first, second, third, …
– cold, warm, hot
• Interval (quantitative)
– list of integers or reals


Common Graph Types

[Figure: examples of common graph types, plotting # of accesses against URL, length of access (short, medium, long, very long), length of page, and days]

Combining Data Types in Graphs

Nominal Nominal

Nominal Ordinal

Nominal Interval

Ordinal Ordinal

Ordinal Interval

Interval Interval


Scatter Plots

• Qualitatively determine if variables


– are highly correlated
• linear mapping between horizontal & vertical axes
– have low correlation
• spherical, rectangular, or irregular distributions
– have a nonlinear relationship
• a curvature in the pattern of plotted points
• Place points of interest in context
– color representing special entities
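
The qualitative reading described above can be complemented by a number. Below is a minimal Python sketch of the Pearson correlation coefficient; the function name and the sample points are assumptions made for illustration. It also shows why the plot still matters: a clearly curved pattern can have a coefficient of zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up points: a straight line, and a symmetric parabola.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))           # linear mapping
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))   # curvature, no linear trend
```

The second call illustrates the "nonlinear relationship" case: the coefficient alone would suggest no relationship, while the scatter plot would show the curve immediately.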

When to use which type?


• Line graph
– x-axis requires quantitative variable
– Variables have contiguous values
– familiar/conventional ordering among ordinals
• Bar graph
– comparison of relative point values
• Scatter plot
– convey overall impression of relationship between two
variables
• Pie Chart?
– Emphasizing differences in proportion among a few
numbers


Subset of Example Visual Representations
From Lohse et al. 94


Experimentally Motivated
Classification (Lohse et al.)
• Graphs
• Tables (numerical)
• Tables (graphical)
• Charts (time)
• Charts (network)
• Diagrams (structure)
• Diagrams (network)
• Maps
• Cartograms
• Icons
• Pictures

Interesting Findings
Lohse et al.

• Photorealistic images were least informative
– Echoes results in icon studies: better to use less complex, more schematic images
• Graphs and tables are the most self-similar categories
– Results in the literature comparing these are inconclusive
• Cartograms were hard to understand
– Echoes other results: better to put points into a framed rectangle to aid spatial perception
• Temporal data more difficult to show than cyclic data
– Recommend using animation for temporal data

Visual Properties

• Preattentive Processing
• Accuracy of Interpretation of Visual Properties
• Illusions and the Relation to Graphical
Integrity

All Preattentive Processing figures from Healey 97
http://www.csc.ncsu.edu/faculty/healey/PP/PP.html
Preattentive Processing

• A limited set of visual properties are processed preattentively
– (without need for focusing attention).
• This is important for design of visualizations
– what can be perceived immediately
– what properties are good discriminators
– what can mislead viewers


Example: Color Selection

Viewer can rapidly and accurately determine whether the target (red circle) is present or absent.
Difference detected in color.
Example: Shape Selection

Viewer can rapidly and accurately determine whether the target (red circle) is present or absent.
Difference detected in form (curvature)

Pre-attentive Processing

• Tasks completed in under 200–250 ms qualify as pre-attentive
– eye movements take at least 200 ms
– yet certain processing can be done very quickly, implying low-level processing in parallel
• If a decision takes a fixed amount of time regardless of the number of distractors, it is considered to be preattentive.
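
One way to check the "fixed amount of time regardless of the number of distractors" criterion is to fit a line to response time against distractor count and look at the slope. The Python sketch below assumes this approach; the function name and all trial numbers are invented for illustration and are not measurements from any study.

```python
def search_slope(trials):
    """Least-squares slope of response time (ms) versus number of distractors.

    `trials` is a list of (distractor_count, response_time_ms) pairs.
    """
    n = len(trials)
    if n < 2:
        raise ValueError("need at least two trials")
    mean_d = sum(d for d, _ in trials) / n
    mean_t = sum(t for _, t in trials) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in trials)
    if sxx == 0:
        raise ValueError("distractor counts must vary")
    return sum((d - mean_d) * (t - mean_t) for d, t in trials) / sxx


# Made-up trials: flat times (parallel search) versus times that grow
# with every added distractor (sequential search).
flat = [(4, 180), (8, 180), (16, 180)]
growing = [(4, 300), (8, 500), (16, 900)]
print(search_slope(flat), "ms per distractor")
print(search_slope(growing), "ms per distractor")
```

A slope near zero is consistent with preattentive detection; a clearly positive slope suggests the viewer is searching item by item.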

Example: Conjunction of
Features

Viewer cannot rapidly and accurately determine whether the target (red circle) is present or absent when the target has two or more features, each of which is present in the distractors. Viewer must search sequentially.

Example: Emergent Features

Target has a unique feature with respect to distractors (open sides) and so the group can be detected preattentively.
Example: Emergent Features

Target does not have a unique feature with respect to distractors and so the group cannot be detected preattentively.

Asymmetric and Graded
Preattentive Properties
• Some properties are asymmetric
– a sloped line among vertical lines is preattentive
– a vertical line among sloped ones is not
• Some properties have a gradation
– some more easily discriminated among than others

Use Grouping of Well-Chosen Shapes for Displaying Multivariate Data

Text NOT Preattentive

SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC


Gestalt (shape) Properties
• Gestalt: form or configuration
• Idea: forms or patterns transcend the stimuli
used to create them.
– Why do patterns emerge?
– Under what circumstances?

Why perceive pairs vs. triplets?

Gestalt Laws of Perceptual Organization (Kaufman)
• Figure and Ground
– Escher illustrations are good examples
– Vase/Face contrast
• Subjective Contour


More Gestalt Laws

• Law of Proximity
– Stimulus elements that are close together will be
perceived as a group
• Law of Similarity
– like the preattentive processing examples
• Law of Common Fate
– like preattentive motion property
• move a subset of objects among similar ones and they
will be perceived as a group

Which Properties are Appropriate for Which Information Types?

Accuracy Ranking of Quantitative Perceptual Tasks
Estimated; only pairwise comparisons have been validated

Interpretations of Visual Properties

Some properties can be discriminated more accurately but don’t have intrinsic meaning
– Density (Greyscale)
Darker -> More
– Size / Length / Area
Larger -> More
– Position
Leftmost -> first, Topmost -> first
– Hue
??? no intrinsic meaning
– Slope
??? no intrinsic meaning


Ranking of Applicability of Properties for
Different Data Types

QUANTITATIVE        ORDINAL             NOMINAL

Position            Position            Position
Length              Density             Color Hue
Angle               Color Saturation    Texture
Slope               Color Hue           Connection
Area                Texture             Containment
Volume              Connection          Density
Density             Containment         Color Saturation
Color Saturation    Length              Shape
Color Hue           Angle               Length
Color Purposes

• Call attention to specific items


• Distinguish between classes of items
– Increases the number of dimensions for encoding
• Increase the appeal of the visualization


Using Color
• Proceed with caution
– Less is more
– Representing magnitude is tricky
• Examples
– Red-orange-yellow-white
• Works for costs
• Maybe because people are very experienced at
reasoning shrewdly according to cost
– Green, light green, light brown, dark brown, grey, white works for atlases
– Grayscale is unambiguous but has limited range

Visual Illusions
• People don’t perceive length, area, angle, brightness the way they “should”.
• Some illusions have been reclassified as
systematic perceptual errors
– e.g., brightness contrasts (grey square on white
background vs. on black background)
– partly due to increase in our understanding of the
relevant parts of the visual system
• Nevertheless, the visual system does some
really unexpected things.


Illusions of Linear Extent
• Müller-Lyer (off by 25-30%)

• Horizontal-Vertical

Illusions of Area

• Delboeuf Illusion

• Height of 4-story building overestimated by approximately 25%

What are good guidelines for Infoviz?

• Use graphics appropriately


– Don’t use images gratuitously
– Don’t lie with graphics!
• Link to original data
– Don’t conflate area with other information
• E.g., use area in map to imply amount
• Make it interactive (feedback)
– Brushing and linking
– Multiple views
– Overview + details
• Match mental models

Tufte

• Principles of Graphical Excellence
– Graphical excellence is
• the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
• consists of complex ideas communicated with clarity, precision and efficiency
• is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
• requires telling the truth about the data.

Tufte’s Notion of Data Ink
Maximization
• What is the main idea?
– draw the viewer’s attention to the substance of the graphic
– the role of redundancy
– principles of editing and redesign
• What’s wrong with this? What is he really
getting at?

Tufte Principle
Maximize the data-ink ratio:

Data-ink ratio = (data ink) / (total ink used in graphic)

Avoid “chart junk”


Tufte Principles

• Use multifunctioning graphical elements


• Use small multiples
• Show mechanism, process, dynamics, and
causality
• High data density
– Number of items/area of graphic
– This is controversial
• White space thought to contribute to good visual
design
• Tufte’s book itself has lots of white space

Tufte’s Graphical Integrity

• Some lapses intentional, some not


• Lie Factor = (size of effect in graph) / (size of effect in data)
• Misleading uses of area
• Misleading uses of perspective
• Leaving out important context
• Lack of taste and aesthetics
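
The Lie Factor above can be worked through numerically. This Python sketch assumes that the "size of effect" is measured as relative change, (second − first) / first, which is how Tufte applies it; the function names and the example numbers are illustrative assumptions.

```python
def effect_size(first: float, second: float) -> float:
    """Relative change from the first value to the second."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Effect shown in the graphic divided by the effect present in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Hypothetical case: a quantity doubles (100 -> 200), and the designer doubles
# both the width and the height of an icon, so its area grows from 1 to 4.
print(lie_factor(graphic_first=1.0, graphic_second=4.0,
                 data_first=100.0, data_second=200.0))
```

A value of 1 means the graphic shows the change faithfully. In the hypothetical case the data change by 100% but the drawn area changes by 300%, which is the kind of exaggeration that arises when a one-dimensional quantity is scaled along both dimensions.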


From Tim Craven’s LIS 504 course
http://instruct.uwo.ca/fim-lis/504/504gra.htm#data-ink_ratio

How to Exaggerate with Graphs


from Tufte ’83

“Lie factor” = 2.8


How to Exaggerate with Graphs
from Tufte ’83

Error: Shrinking along both dimensions
Howard Wainer
How to Display Data Badly
(Video)
http://www.dartmouth.edu/~chance/ChanceLecture/AudioVideo.html


Promising Techniques


Promising Techniques & Approaches

• Perceptual Techniques
– Animation
– Grouping / Gestalt principles
– Using size to indicate quantity
– Color for Accent, Distinction, Selection
• NOT FOR QUANTITY!!!!

• General Approaches
– Standard Techniques
• Graphs, bar charts, tables
– Brushing and Linking
– Providing Multiple Views and Models
– Aesthetics!

Standard Techniques
• It’s often hard to beat:
– Line graphs, bar charts
– Scatterplots (or Scatterplot Matrix)
– Tables
• A Darwinian view of visualizations:
– Only the fittest survive
– We are in a period of great experimentation; eventually it
will be clear what works and what dies out.
• A bright spot:
– Enhancing the old techniques with interactivity
– Example: Spotfire
• Adds interactivity, color highlighting, zooming to scatterplots
– Example: TableLens / Eureka
• Adds interactivity and length cues to tables


Spotfire: Integrating Interaction with Scatterplots

Spotfire/IVEE: Integrating Interaction with Scatterplots

Brushing and Linking

• Interactive technique
– Highlighting
– Brushing and Linking
• At least two things must be linked together to
allow for brushing
– select a subset of points
– see the role played by this subset of points in one or
more other views
• Example systems
– Graham Wills’s EDV system
– Ahlberg & Shneiderman’s IVEE (Spotfire)
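
The mechanism behind brushing and linking can be sketched in a few lines of Python: a selection made in one view is stored as a set of record indices, and every other view uses that set to decide what to highlight. The function names and the tiny player table below are hypothetical and are not taken from EDV or IVEE.

```python
from collections import Counter

# Made-up records; in a real tool each view would draw from the same table.
players = [
    {"name": "A", "salary": 900, "position": "1B"},
    {"name": "B", "salary": 150, "position": "SS"},
    {"name": "C", "salary": 800, "position": "1B"},
    {"name": "D", "salary": 200, "position": "C"},
]


def brush(records, field, low, high):
    """Select the indices of records whose `field` lies in [low, high]."""
    return {i for i, rec in enumerate(records) if low <= rec[field] <= high}


def linked_histogram(records, field, selected):
    """Per category of `field`, return (highlighted count, total count)."""
    total = Counter(rec[field] for rec in records)
    hit = Counter(rec[field] for i, rec in enumerate(records) if i in selected)
    return {category: (hit[category], total[category]) for category in total}


# Brush the high salaries in one view, then see the same subset in another view.
selected = brush(players, "salary", 700, 1000)
print(linked_histogram(players, "position", selected))
```

The key design point is that views share record identities rather than pixels, so any number of views can respond to the same brush.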

Linking types of assist behavior to position played (from Eick & Wills 95)

Baseball data:
Scatterplots and histograms and
bars (from Eick & Wills 95)

Panels:
– how long in majors
– select high salaries
– avg career HRs vs avg career hits (batting ability)
– avg assists vs avg putouts (fielding ability)
– distribution of positions played
What was learned from interaction with this baseball data?

– Seems impossible to earn a high salary in the first three years
– High salaried players have a bimodal distribution
(peaking around 7 & 13 yrs)
– Hits/Year a better indicator of salary than HR/Year
– High paid outlier with low HR and medium hits/year.
Reason: person is player-coach
– There seem to be two differentiated groups in the
put-outs/assists category (but not correlated with
salary) Why?


Animation
• “The quality or condition of being alive, active, spirited,
or vigorous” (dictionary.com)

• “A dynamic visual statement that evolves through movement or change in the display”

• “… creating the illusion of change by rapidly displaying a series of single frames” (Roncarelli 1988).

Slide by Saifon Obromsook & Linda Harjono
We Use Animation to…

• Tell stories / scenarios: cartoons


• Illustrate dynamic process / simulation
• Create a character / an agent
• Navigate through virtual spaces
• Draw attention
• Delight

Slide by Saifon Obromsook & Linda Harjono

Cartoon Animation Principles
• Chang & Ungar ‘93
• Solidity (squash and stretch)
– Solid drawing
– Motion blur
– Dissolves
• Exaggeration
– Anticipation
– Follow through
• Reinforcement
– Slow in and slow out
– Arcs
– Follow through

Slide by Saifon Obromsook & Linda Harjono
Why Cartoon-Style Animation?

• Cartoons’ theatricality is powerful in communicating to the user.
• Cartoons can make UI engage the user into its world.
• The medium of cartoon animation is like that of graphic computers.

Slide by Saifon Obromsook & Linda Harjono

Application using Animation:
Gnutellavision

• Visualization of Peer-to-Peer Network


– Hosts (with color for status and size for number of files)
– Nodes with closer network distance from focus on inner
rings
– Queries shown; can trace queries
• Gnutellavision as exploratory tool
– Very few hosts share many files
– Uneven propagation of queries
– Qualitative assessment of queries (simple)

Layout - Illustration


Animation in Gnutellavision

Goal of animation is to help maintain context of nodes and general orientation of user during refocus

• Transition Paths
– Linear interpolation of polar coordinates
– Node moves in arc not straight line
– Moves along circle if not changing levels (like great
circles on earth)
– Spirals in or out to next ring
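
The transition path described above can be sketched as follows. This minimal Python example interpolates radius and angle separately and converts back to screen coordinates; the function name is hypothetical, and taking the shorter angular direction is an assumption of this sketch rather than something stated for Gnutellavision.

```python
import math


def interpolate_polar(start, end, t):
    """Position at fraction t (0..1) between two (radius, angle) points.

    Radius and angle are interpolated linearly, so the node travels along an
    arc (or a spiral when the radius changes) instead of a straight line.
    Returns Cartesian (x, y).
    """
    r0, a0 = start
    r1, a1 = end
    # Assumption: rotate through the shorter of the two angular directions.
    delta = (a1 - a0 + math.pi) % (2 * math.pi) - math.pi
    radius = r0 + (r1 - r0) * t
    angle = a0 + delta * t
    return radius * math.cos(angle), radius * math.sin(angle)


# Same ring (radius 1), a quarter turn apart: the midpoint stays on the circle.
x, y = interpolate_polar((1.0, 0.0), (1.0, math.pi / 2), 0.5)
print(x, y, math.hypot(x, y))
```

With straight-line interpolation the midpoint would cut across the inside of the ring; here the node keeps its distance from the focus while moving between positions on the same level.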

Animation (continued)

• Transition constraints
– Orientation of transition to minimize rotational
travel
– (Move former parent away from new focus in same
orientation)
– Avoid cross-over of edges
– (to allow users to keep track of which is which)

• Animation timing
– Slow in Slow out timing (allows users to better
track movement)
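
Slow-in slow-out timing is usually implemented by passing the linear animation fraction through an easing function. The Python sketch below uses the smoothstep polynomial, which is an assumption chosen for illustration; the source does not say which easing curve was used.

```python
def slow_in_slow_out(t: float) -> float:
    """Map linear time t in [0, 1] to eased progress (smoothstep: 3t^2 - 2t^3)."""
    t = min(1.0, max(0.0, t))          # clamp to the valid range
    return t * t * (3.0 - 2.0 * t)


# Progress lags behind linear time at the start and catches up at the end.
for step in range(0, 11, 2):
    t = step / 10
    print(f"t={t:.1f} progress={slow_in_slow_out(t):.3f}")
```

The eased value would replace the raw fraction when computing node positions, so nodes start and stop gently while covering most of the distance mid-transition.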


Transition Constraint - Orientation

Transition Constraint - Order


Usability Testing
• In general, users appreciated the subtleties added to the general method when the number of nodes increased.

• Perhaps the most interesting result is that most people preferred rectangular movement for the small graph and polar coordinate movement for the large one.

Overall Preference of Users

               No Features    All Features
Small Graph         5              5
Large Graph         1              9
Hyperbolic Tree

• A Focus+Context Technique Based on Hyperbolic Geometry for Visualizing Large Hierarchies (1995), John Lamping, Ramana Rao, Peter Pirolli. Proc. ACM Conf. Human Factors in Computing Systems, CHI

• Also uses animation

• Tree-based layout; leaves stretch to infinity

• Only a few labels can be seen at a time



Issues

• Displaying text
– The size of the text
• Works well for small things like directories
• Not so good for URLs
• Only a portion of the data can be seen in the
focus at one time
• Only works for certain types of data -
Hierarchical
• Not clear if it is actually useful for anything.

Animating Algorithms
• Kehoe, Stasko, and Taylor, “Rethinking Evaluation of
Algorithm Animations as Learning Aids”

• Why previous studies present no benefits:


– No or limited benefits from particular animations
– Benefits are not captured in measurements
– Design of experiments hides the benefits

• Methods for this study:


– Combination of qualitative & quantitative
– More flexible setting
– Metrics: score for each type of questions, time used,
usage of materials, qualitative data from observations &
interviews

Slide by Saifon Obromsook & Linda Harjono

Findings

• Value of animation is more apparent in interactive situations
• Most useful to learn procedural operations
• Makes subject more accessible & less intimidating → increase motivation

Slide by Saifon Obromsook & Linda Harjono

What Isn’t Working?

The existing studies indicate that we don’t yet know how to make the following work well for every-day tasks:

– Pan-and-Zoom
– 3D Navigation
– Node-and-link representations of concept spaces

Zoom, Overview + Detail


• An exception, possibly:
– Benjamin B. Bederson: PhotoMesa: a zoomable image browser using
quantum treemaps and bubblemaps. UIST 2001: 71-80


Overview + Detail
• K. Hornbaek et al., Navigation patterns and Usability of Zoomable
User Interfaces with and without an Overview, ACM TOCHI, 9(4),
December 2002.

• A study on integrating Overview + Detail on a Map search task
– Incorporating panning & zooming as well.
– They note that panning & zooming does not do well in
most studies.
• Results seem to be
– Subjectively, users prefer to have a linked overview
– But they aren’t necessarily faster or more effective using it
– Well-constructed representation of the underlying data
may be more important.
• More research needed as each study seems to turn up
different results, sensitive to underlying test set.

122



Agenda

• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up

123

123

Problem Solving

124



Problem Solving

• A Detective Tool for Multidimensional Data


– Inselberg on using Parallel Coordinates

• Analyzing Web Clickstream Data


– Brainerd & Becker, Waterson et al.

• Information Visualization for Pattern Detection


– Carlis & Konstan on Periodic Data

• Visualization vs. Analysis


– Comments by Wesley Johnson of Chevron

125

125

Multidimensional Detective
A. Inselberg, Multidimensional Detective, Proceedings of IEEE
Symposium on Information Visualization (InfoVis '97), 1997.

126



A Detective Story
A. Inselberg, Multidimensional Detective, Proceedings of IEEE Symposium on
Information Visualization (InfoVis '97), 1997

Inselberg’s Principles for analysis using visualizations:


1. Do not let the picture scare you
2. Understand your objectives
– Use them to obtain visual cues
3. Carefully scrutinize the picture
4. Test your assumptions, especially the “I am really sure of’s”
5. You can’t be unlucky all the time!

127

127

A Detective Story
A. Inselberg, Multidimensional Detective, Proceedings of IEEE Symposium on Information
Visualization (InfoVis '97), 1997

• The Dataset:
– Production data for 473 batches of a VLSI chip
– 16 process parameters
• X1: the yield (% of produced chips that are useful)
• X2: the quality of the produced chips (speed)
• X3 … X12: 10 types of defects (zero defects shown at top)
• X13 … X16: 4 physical parameters
• The Objective:
– Raise the yield (X1) and maintain high quality (X2)

128



Multidimensional Detective
A. Inselberg, Multidimensional Detective, Proceedings of IEEE Symposium on
Information Visualization (InfoVis '97), 1997.

Do Not Let the Picture Scare You!!

129

129

Multidimensional Detective
• Each line represents the values for one batch of chips
• This figure shows what happens when only those
batches with both high X1 and high X2 are chosen
• Notice the separation in values at X15
• Also, some batches with few X3 defects are not in this
high-yield/high-quality group.
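
The selection described above is essentially a brushing query on two axes followed by a look at a third. A minimal sketch, assuming made-up batch values and an arbitrary 0.7 threshold (neither comes from Inselberg's dataset; the `brush` helper is hypothetical):

```python
# Hypothetical batches: each dict holds a few of the 16 parameters.
# The values and the 0.7 thresholds are invented for illustration only.
batches = [
    {"id": 1, "X1": 0.90, "X2": 0.85, "X15": 0.20},
    {"id": 2, "X1": 0.80, "X2": 0.90, "X15": 0.25},
    {"id": 3, "X1": 0.85, "X2": 0.75, "X15": 0.80},
    {"id": 4, "X1": 0.30, "X2": 0.40, "X15": 0.50},
    {"id": 5, "X1": 0.20, "X2": 0.90, "X15": 0.55},
]


def brush(rows, **minimums):
    """Keep rows whose value on every named axis is at least the given minimum."""
    return [r for r in rows if all(r[axis] >= low for axis, low in minimums.items())]


# Select batches that are high on both X1 (yield) and X2 (quality).
selected = brush(batches, X1=0.7, X2=0.7)
x15_values = sorted(r["X15"] for r in selected)

# The largest gap between consecutive X15 values hints at a separation on that axis.
gaps = [(b - a, a, b) for a, b in zip(x15_values, x15_values[1:])]
largest_gap = max(gaps)
print([r["id"] for r in selected], x15_values, largest_gap)
```

In a parallel-coordinates display the same gap shows up visually as two bundles of lines crossing the X15 axis at clearly different heights.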

130



Multidimensional Detective

• Now look for batches which have nearly zero defects.


– For 9 out of 10 defect categories
• Most of these have low yields
• Surprising because we know from first diagram that some
defects are ok.
• Go back to first diagram, looking at defect categories
• Notice that X6 behaves differently from the rest
• Allow two defects, where one defect is in X6
• This results in the very best batch appearing

131

131

Multidimensional Detective
• Fig 5 and 6 show that high-yield batches do not have zero values
for defects of type X3 and X6
– Don’t believe your assumptions …
• Looking now at X15 we see the separation is important
– Lower values of this property end up in the better yield batches

132



Automated Analysis
A. Inselberg, Automated Knowledge Discovery using Parallel
Coordinates, INFOVIS ‘99

133

133

Case Study: E-Commerce


Clickstream Visualization

• Brainerd & Becker, IEEE


Infovis 2001
• Aggregate nodes
using an icon (e.g. all
the checkout pages)
• Edges represent
transitions
– Wider means more
transitions

Slide by Wayne Kao 134



Customer Segments

• Collect
– Clickstream
– Purchase history
– Demographic data
• Associates customer data with their
clickstream
• Different color for each customer segment

Slide by Wayne Kao 135

135

Layout
• Aggregation based on file system path

Slide by Wayne Kao 136



Initial Findings

• Gender shopping
differences

Slide by Wayne Kao 137

137

Initial Findings (cont)

• Checkout process analysis


• Newsletter hurting sales

Slide by Wayne Kao 138



WebQuilt

Interactive, zoomable directed graph


• Nodes = web pages
• Edges = aggregate traffic between
pages
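
The node/edge model above can be sketched as a simple transition count. This is an illustrative sketch only, not WebQuilt's implementation; the session data and the `aggregate_edges` helper are hypothetical:

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor requested.
sessions = [
    ["/home", "/search", "/product", "/checkout"],
    ["/home", "/product", "/checkout"],
    ["/home", "/search", "/home", "/search", "/product"],
]


def aggregate_edges(sessions):
    """Count page-to-page transitions across all sessions."""
    edges = Counter()
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            edges[(src, dst)] += 1
    return edges


edges = aggregate_edges(sessions)
for (src, dst), count in edges.most_common():
    # A renderer could map `count` to arrow thickness.
    print(f"{src} -> {dst}: {count}")
```

Each distinct page becomes a node, and the count on each (source, destination) pair is the aggregate traffic that the edge thickness would encode.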

Waterson et al., “What Did They Do? Understanding Clickstreams with the
WebQuilt Visualization System,” in AVI 2002.

Slide by Wayne Kao 139

139

Directed graph

• Nodes: visited pages


– Color marks entry and exit nodes
• Arrows: traversed links
– Thicker: more heavily traversed
– Color
• Red/yellow: Time spent before
clicking
• Blue: optimal path chosen by
designer

Slide by Wayne Kao 140



Slide by Wayne Kao 141

141

Pilot Usability Study

• Edmunds.com PDA web site


• Handspring Visor equipped with an OmniSky
wireless modem
• 10 users asked to find…
– Anti-lock brake information on the latest Nissan
Sentra model
– The Nissan dealer closest to them.

Slide by Wayne Kao 142



In the Lab vs. Out in the Wild

Comparing in-lab usability testing with WebQuilt remote
usability testing
• 5 users were tested in the lab
• 5 were given the device and asked to perform the task
at their convenience
• All task directions, demographic data, and follow-up
questionnaire data were presented and collected in web
forms as part of the WebQuilt testing framework.

Slide by Wayne Kao 143

143

Slide by Wayne Kao 144



Slide by Wayne Kao 145

145

Slide by Wayne Kao 146



Findings
Browser
• Interact before load (3)
• No forward button (2)
Device
• Difficulty with input in questionnaire (3)
• Difficulty scrolling (2)
• Device errors unrelated to testing (1)
• Tried writing on screen (0)
Site Design
• Falsely completed task (4)
• Long download times (4)
• Ping-pong behavior (3)
• Interact before load (3)
• Too much scrolling (2)
• Save address functionality not clear (1)
• Back button navigation (0)
• Would like more features (0)
• Finds site useful (0)
Test Design
• Falsely completed task (4)
• Difficulty remembering task description (3)
• Difficulty with input in questionnaire (3)
• Questionnaire wording problems (3)
• Forgot how to end task (1)
• Confusing task description (1)

Slide by Wayne Kao 147

147

Findings
• WebQuilt methodology is promising for uncovering site
design related issues.
• 1/3 of the issues were device or browser related.
• Browser and device issues cannot be captured
automatically with WebQuilt unless they cause an
interaction with the server
• They can be revealed via the questionnaire data.

Slide by Wayne Kao 148



Visualization for Analysis
• Carlis & Konstan, UIST 1998

• Problem: data that is both periodic and serial


– Time students spend on different activities
– Tree growth patterns
• Time: which year
• Period: yearly
– Multi-day races such as the Tour de France
– Calendars arbitrarily wrap around at end of month
– Octaves in music
• How to find patterns along both dimensions?
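
One way to show both dimensions at once is to wrap the series around a spiral with one lap per period. The sketch below uses a plain Archimedean spiral, which is an assumption for illustration and not necessarily the exact layout used by Carlis & Konstan; `spiral_position` and `spacing` are hypothetical names:

```python
import math


def spiral_position(t, period, spacing=1.0):
    """Place observation t on a spiral with one lap per period.

    Points one period apart share an angle, so they line up along a spoke
    (the periodic pattern), while consecutive points follow the curve (the
    serial pattern). `spacing` is the hypothetical distance between laps.
    """
    laps = t / period
    angle = 2 * math.pi * laps
    radius = spacing * laps
    return radius * math.cos(angle), radius * math.sin(angle)


# Monthly observations over three years, period = 12 months.
points = [spiral_position(month, period=12) for month in range(36)]
```

With this layout, the same month of successive years lies on the same spoke, so a seasonal pattern appears as a radial stripe while a trend appears along the curve.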

149

149

Analyzing Complex Periodic Data

Carlis & Konstan, UIST 1998.

150



Analyzing Complex Periodic Data
•Consumption values for
each month appear as spikes
•Each food has its own color
•Boundary line (in black)
shows when season
begins/ends

Carlis & Konstan, UIST 1998.

151

151

Carlis & Konstan, UIST 1998.

152



Visualization vs. Analysis?
• Applications to data mining and data discovery.
• Wesley Johnson ’02:
– Visualization tools are helpful for exploring hunches and
presenting results
• Examples: scatterplots
– They are the WRONG primary tool when the goal is to find
a good classifier model in a complex situation.
– Need:
• Solid insight into the domain and problem
• Tools that visualize several alternative models.
• Emphasize “model visualization” rather than “data
visualization”

153

153

Agenda

• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up

154



Visualizing Documents and
Search

155

155

Documents and Search

• Why Visualize Text?


• Why Text is Tough
• Visualizing Concept Spaces
– Clusters
– Category Hierarchies
• Visualizing Retrieval Results
• Usability Study Meta-Analysis

156



Why Visualize Text?
• To help with Information Retrieval
– give an overview of a collection
– show user what aspects of their interests are
present in a collection
– help user understand why documents retrieved as a
result of a query
• Text Data Mining
– Mainly clustering & nodes-and-links
• Software Engineering
– not really text, but has some similar properties

157

157

Why Text is Tough


• Text is not pre-attentive
• Text consists of abstract concepts
– which are difficult to visualize
• Text represents similar concepts in many
different ways
– space ship, flying saucer, UFO, figment of imagination

• Text has very high dimensionality


– Tens or hundreds of thousands of features
– Many subsets can be combined together

158



Why Text is Tough

As the man walks the cavorting dog, thoughts
arrive unbidden of the previous spring, so unlike
this one, in which walking was marching and
dogs were baleful sentinels outside unjust halls.

How do we visualize this?

159

Why Text is Tough

• Abstract concepts are difficult to visualize


• Combinations of abstract concepts are even
more difficult to visualize
– time
– shades of meaning
– social and psychological concepts
– causal relationships

160



Why Text is Tough

• Language only hints at meaning


• Most meaning of text lies within our minds and
common understanding
– “How much is that doggy in the window?”
• how much: social system of barter and trade (not the
size of the dog)
• “doggy” implies childlike, plaintive, probably cannot do
the purchasing on their own
• “in the window” implies behind a store window, not
really inside a window, requires notion of window
shopping

161

161

Why Text is Easy


• Text is highly redundant
– When you have lots of it
– Pretty much any simple technique can pull out phrases that
seem to characterize a document
• Instant summary:
– Extract the most frequent words from a text
– Remove the most common English words
• People are very good at attributing meaning to lists
of otherwise unrelated words
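
The "instant summary" recipe above can be sketched in a few lines. The stop list here is a tiny hypothetical one (a real list would hold several hundred words), and the sample text is invented:

```python
import re
from collections import Counter

# A tiny, hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "on", "it"}


def instant_summary(text, top_n=5):
    """Return the top_n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = "The cat sat on the mat. The cat saw the dog. The dog saw the cat."
print(instant_summary(sample, top_n=3))
```

The output is just a ranked word list; as the slide notes, readers tend to supply the meaning themselves.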

162



Guess the Text:

10 PEOPLE
10 ALL
9 STATES
9 LAWS
8 NEW
7 RIGHT
7 GEORGE
6 WILLIAM
6 THOMAS
6 JOHN
6 GOVERNMENT
5 TIME
5 POWERS
5 COLONIES
4 LARGE
4 INDEPENDENT
4 FREE
4 DECLARATION
4 ASSENT
3 WORLD
3 WAR
3 USURPATIONS
3 UNITED
3 SEAS
3 RIGHTS

163

163

Visualization of Text Collections

• How to summarize the contents of hundreds,
thousands, tens of thousands of texts?
• Many have proposed clustering the words and
showing points of light in a 2D or 3D space.
• Examples
– Showing docs/collections as a word space
– Showing retrieval results as points in word space

166



TextArc.org (Bradford Paley)

167

167

TextArc.org (Bradford Paley)

168



Galaxy of News
Rennison 95
169

169

Galaxy of News
Rennison 95
170



Themescapes (Wise et al. 95)

Example: Themescapes
(Wise et al. 95)

171

171
ScatterPlot of Clusters
(Chen et al. 97)

172



Kohonen Feature Maps
(Lin 92, Chen et al. 97)

(594 docs)


173

Clustering for Collection Overviews
• Two main steps
– cluster the documents according to the words they
have in common
– map the cluster representation onto an (interactive)
2D or 3D representation
• Since text has tens of thousands of features
– the mapping to 2D loses a tremendous amount of
information
– only very coarse themes are detected
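
The first of the two steps (grouping documents by shared words) can be sketched as follows. This uses a greedy single-pass grouping by cosine similarity over raw word counts, a deliberately simple stand-in for the clustering algorithms in the cited systems; the 0.3 threshold, the sample documents, and all function names are assumptions. The second step (projection to 2D or 3D) is not shown.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_clusters(docs, threshold=0.3):
    """Assign each document to the first cluster whose seed document it resembles."""
    clusters = []  # list of (seed_vector, [document indices])
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


docs = [
    "stock market prices fall",
    "market prices rise as stock rallies",
    "team wins football match",
    "football team loses match",
]
print(greedy_clusters(docs))
```

Even in this toy case each document is a vector with one dimension per distinct word, which is why any later mapping to two or three dimensions discards so much information.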

175



Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95

177

How Useful is Collection Cluster Visualization for Search?

Three studies find negative results

181



Study 1
Kleiboemer, Lazear, and Pedersen. Tailoring a retrieval system
for naive users. In Proc. of the 5th Annual Symposium on
Document Analysis and Information Retrieval, 1996

• This study compared


– a system with 2D graphical clusters
– a system with 3D graphical clusters
– a system that shows textual clusters
• Novice users
• Only textual clusters were helpful (and they
were difficult to use well)

182

182

Study 2: Kohonen Feature Maps

H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)


• Comparison: Kohonen Map and Yahoo
• Task:
– “Window shop” for interesting home page
– Repeat with other interface
• Results:
– Starting with map could repeat in Yahoo (8/11)
– Starting with Yahoo unable to repeat in map (2/14)

183



Study 2 (cont.)

• Participants liked:
– Correspondence of region size to # documents
– Overview (but also wanted zoom)
– Ease of jumping from one topic to another
– Multiple routes to topics
– Use of category and subcategory labels

184

184

Study 2 (cont.)
• Participants wanted:
– hierarchical organization
– other ordering of concepts (alphabetical)
– integration of browsing and search
– correspondence of color to meaning
– more meaningful labels
– labels at same level of abstraction
– fit more labels in the given space
– combined keyword and category search
– multiple category assignment (sports+entertain)

185



Study 3: NIRVE
NIRVE Interface by Cugini et al. 96. Each rectangle is a cluster. Larger clusters closer to the
“pole”. Similar clusters near one another. Opening a cluster causes a projection that shows
the titles.

186

186

Study 3
Visualization of search results: a comparative evaluation of text, 2D,
and 3D interfaces Sebrechts, Cugini, Laskowski, Vasilakis and Miller,
Proceedings of SIGIR 99, Berkeley, CA, 1999.
• This study compared:
– 3D graphical clusters
– 2D graphical clusters
– textual clusters
• 15 participants, between-subject design
• Tasks
– Locate a particular document
– Locate and mark a particular document
– Locate a previously marked document
– Locate all clusters that discuss some topic
– List more frequently represented topics

187



Study 3
• Results (time to locate targets)
– Text clusters fastest
– 2D next
– 3D last
– With practice (6 sessions) 2D neared text results; 3D still
slower
– Computer experts were just as fast with 3D
• Certain tasks equally fast with 2D & text
– Find particular cluster
– Find an already-marked document
• But anything involving text (e.g., find title) much faster
with text.
– Spatial location rotated, so users lost context
• Helpful viz features
– Color coding (helped text too)
– Relative vertical locations
188

188

Summary: Visualizing Clusters


• Huge 2D maps may be inappropriate focus for
information retrieval
– cannot see what the documents are about
– space is difficult to browse for IR purposes
– (tough to visualize abstract concepts)
• Perhaps more suited for pattern discovery and
gist-like overviews

189



IR Infovis Meta-Analysis
(Empirical studies of information visualization:
a meta-analysis, Chen & Yu, IJHCS 53(5), 2000)
• Goal
– Find invariant underlying relations suggested
collectively by empirical findings from many different
studies
• Procedure
– Examine the literature of empirical infoviz studies
• 35 studies between 1991 and 2000
• 27 focused on information retrieval tasks
• But due to wide differences in the conduct of the
studies and the reporting of statistics, could use only 6
studies

190

190

IR Infovis Meta-Analysis
(Empirical studies of information visualization:
a meta-analysis, Chen & Yu, IJHCS 53(5), 2000)
• Conclusions:
– IR Infoviz studies not reported in a standard format
– Individual cognitive differences had the largest effect
• Especially on accuracy
• Somewhat on efficiency
– Holding cognitive abilities constant, users did better
with simpler visual-spatial interfaces
– The combined effect of visualization is not
statistically significant

191



So What Works?
• Yee, K-P et al., Faceted Metadata for Image Search and Browsing, to appear
in CHI 2003. Hearst, M., et al., Chapter 10 of Modern Information
Retrieval, Baeza-Yates & Ribeiro-Neto (Eds).

• Color highlighting of query terms in results listings


• Sorting of search results according to important criteria
(date, author)
• Grouping of results according to well-organized category
labels.
– Cha-cha
– Flamenco
• Only if highly accurate:
– Spelling correction/suggestions
– Simple relevance feedback (more-like-this)
– Certain types of term expansion
• Note: most don’t benefit from visualization!

192

192

Cha-Cha
• Chen, M., Hearst, M., Hong, J., and Lin, J., Cha-Cha: A System
for Organizing Intranet Search Results, in Proceedings of
the 2nd USENIX Symposium on Internet Technologies and
Systems (USITS), Boulder, CO, October 11-14, 1999

193



Teoma: appears to combine
categories and clusters
(this version before it was bought by askjeeves)

194

194

Teoma: Now in prime time

195



Cat-a-Cone

Marti Hearst and Chandu Karadi, Cat-a-Cone: An Interactive Interface for
Specifying Searches and Viewing Retrieval Results using a Large Category
Hierarchy, Proceedings of the 20th Annual International ACM/SIGIR Conference,
Philadelphia, PA, July 1997

196

196

Better to reduce the viz

• Flamenco – allows users to steer through the
category space
• Uses
– Dynamically-generated hypertext
– Color for distinguishing and grouping
– Careful layout and font choices
• Focused first on the users’ needs

197



Flamenco

198

198

Flamenco

199



Using Thumbnails to Search the Web
A. Woodruff, R. Rosenholtz, J. Morrison, A. Faulring, & P. Pirolli, A
comparison of the use of text summaries, plain thumbnails, and
enhanced thumbnails for web search tasks. JASIST, 53(2), 172-185,
2002; A. Woodruff, A. Faulring, R. Rosenholtz, J. Morrison, & P.
Pirolli, Using thumbnails to search the web. SIGCHI 2001

Design Goals
– Enhance features that help the user decide whether
document is relevant to their query
• Emphasize text that is relevant to query
– Text callouts
• Enlarge (make readable) text that might be
helpful in assessing page
– Enlarge headers

Slide by Woodruff & Rosenholtz 200

200

Text and Image Summaries

• Text summaries
– Lots of abstract, semantic information
• Image summaries (plain thumbnails)
– Layout, genre information
– Gist extraction faster than with text
• Benefits are complementary
• Create textually-enhanced thumbnails that
leverage the advantages of both text
summaries and plain thumbnails

Slide by Woodruff & Rosenholtz 201



Putting Callouts in a Separate
Visual Layer
• Transparency
• Occlusion

Junctions indicate the occurrence of these events.

Slide by Woodruff & Rosenholtz 202

202

Design Issues:

• Color Management
– Problems: Callouts need to be both readable and
draw attention
– Solution: Desaturate the background image, and use
a visual search model to choose appropriate colors
– Colors look like those in highlighter pens
• Resizing of Text
– Problem: We want to make certain text elements
readable, but not necessarily draw attention to them
– Solution: Modify the HTML before rendering the
thumbnail

Slide by Woodruff & Rosenholtz 203



Examples

Slide by Woodruff & Rosenholtz 204

204

Tasks

• Criteria: tasks that…


– Are representative of common queries
– Have result sets with different characteristics
– Vary in the number of correct answers
• 4 types of tasks
Picture: “Find a picture of a giraffe in the wild.”
Homepage: “Find Kern Holoman’s homepage.”
Side-effects: “Find at least three side effects of halcion.”
E-commerce: “Find an e-commerce site where you can buy
a DVD player. Identify the price in dollars.”

Slide by Woodruff & Rosenholtz 205



Conditions

• Text summary
– Page title
– Extracted text with
query terms in bold
– URL
• Plain thumbnail
• Enhanced thumbnail
– Readable H1, H2 tags
– Highlighted callouts of
query terms
– Reduced contrast level
in thumbnail

Slide by Woodruff & Rosenholtz 206

206

Collections of Summaries

• 100 results in random order


Approximately same number of each
summary type on a page

Slide by Woodruff & Rosenholtz 207



Method
18 questions, with 100 query results each
Entire process took about 75 minutes

• Procedure
– 6 practice tasks
– 3 questions for each of the 4 task types
• e.g., each participant would do one E-commerce
question using text, one E-commerce question using
plain thumbnails, and one E-commerce question using
enhanced thumbnails
– Questions blocked by type of summary
– WebLogger recorded user actions during browsing
– Semi-structured interview
• Participants
– 12 members of the PARC community

Slide by Woodruff & Rosenholtz 208

208

Results
• Average total search times, by task:
– Picture: 61 secs
– Homepage: 80 secs
– E-commerce: 64 secs
– Side effects: 128 secs
• Results pooled across all tasks:
– Subjects searched 20 seconds faster with enhanced
thumbnails than with plain
– Subjects searched 30 seconds faster with enhanced
thumbnails than with text summaries
– Mean search time overall was 83 seconds
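
As a quick consistency check, the overall mean follows from the four per-task averages if each task type contributed the same number of trials (an assumption, though the design of 3 questions per task type suggests it):

```python
# Per-task average search times reported on the slide, in seconds.
task_means = {"Picture": 61, "Homepage": 80, "E-commerce": 64, "Side effects": 128}

# Unweighted mean; assumes an equal number of trials per task type.
overall = sum(task_means.values()) / len(task_means)
print(overall)         # 83.25
print(round(overall))  # 83
```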

Slide by Woodruff & Rosenholtz 209



Results
Normalized total search time (s)

Slide by Woodruff & Rosenholtz 210

210

Results: User Responses

• Participants preferred enhanced thumbnails


– 7/12 preferred overall
– 5/12 preferred for certain task types

• Enhanced thumbnails are intuitive and less
work than text or plain thumbnails
– One subject said searching for information with text
summaries did not seem hard until he used the
enhanced thumbnails.
• Many participants reported using genre
information, cues from the callouts, the
relationship between search terms, etc.

Slide by Woodruff & Rosenholtz 211



Agenda

• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up

214

214

Comparing Approaches

215



Comparing 3 Commercial Systems
Alfred Kobsa, An Empirical Comparison of Three Commercial
Information Visualization Systems, INFOVIS'01.

216

216

Comparing 3 Commercial Systems

Eureka (InXight)

217



Comparing 3 Commercial Systems
InfoZoom (HumanIT)

218

218

Comparing 3 Commercial Systems

SpotFire

219



InfoZoom Overview
•Presents data in three different views.
•Wide view shows data set in a table format.
•Compressed view packs the data set horizontally
to fit the window width.
•Overview mode has all attributes in ascending or
descending order and independent of each other.

Slide by Alfred Kobsa 220

220

InfoZoom Overview View

221



InfoZoom Overview View

Slide by Alfred Kobsa 222

222

InfoZoom Compressed Table View

223



InfoZoom Wide Table View

224

224

Datasets

•Multidimensional data: three databases were used


•Anonymized data from a web based dating
service (60 records, 27 variables)
•Technical data of cars sold in 1970 – 82
(406 records, 10 variables)
•Data on the concentration of heavy metals in
Sweden (2298 records, 14 variables)

Slide by Kunal Garach 225



Sample Questions

• Do more women than men want their partners
to have a higher education?
• What proportion of the men live in California?
• Do all people who think the bar is a good place
to meet a mate also believe in love at first
sight?
• Do heavier cars have more horsepower?
• Which manufacturer produced the most cars in
1980?
• Is there a relationship between the
displacement and acceleration of a vehicle?

226

226

Experiment Design

• The experimenters generated 26 tasks from all
three data sets.
• 83 participants. Between-subjects design.
• Each was given one visualization system and all three
data sets.
• Type of visualization system was the independent
variable between them.
• 30 mins were given to solve the tasks of each data
set, i.e., 26 tasks in 90 mins.

Slide by Kunal Garach 227



Overall Results
• Mean task completion times:
• Infozoom users: 80 secs
• Spotfire users: 107 secs
• Eureka users: 110 secs

• Answer correctness:
• Infozoom users: 68%
• Spotfire users: 75%
• Eureka users: 71%
•Not a time-error tradeoff
•Spotfire more accurate on only 6 questions

Slide by Kunal Garach 228

228

Eureka - problems

• Hidden labels: Labels are vertically aligned,
max 20 dimensions
• 3+ Attributes: Problems with queries
involving three or more attributes
• Correlation problems: Some participants had
trouble answering questions correctly that
involved correlations between two attributes.

Slide by Kunal Garach 229



Spotfire - problems
• Cognitive setup costs: Takes participants
considerable time to decide on the right
representation and to correctly set the coordinates
and parameters.

• Biased by scatterplot default: Though powerful,
many problems cannot be solved (well) with it.

Slide by Kunal Garach 230

230

Infozoom - problems

• Erroneous Correlations
• Overview mode has all attributes sorted
independent of each other

• Narrow row height in compressed view

• Participants did not use row expansion and


scatterplot charting function which shows
correlations more accurately
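
Why independently sorted columns suggest erroneous correlations can be shown with a small sketch. The records below are invented, and `pearson` is a hand-written helper rather than anything from InfoZoom:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical records (row i describes one car): heavier cars accelerate worse.
weight = [1.0, 2.0, 3.0, 4.0, 5.0]
acceleration = [5.0, 4.0, 3.0, 2.0, 1.0]

# Correlation with rows kept intact.
true_r = pearson(weight, acceleration)

# Each column sorted on its own, as in an overview with independent ordering.
overview_r = pearson(sorted(weight), sorted(acceleration))
print(true_r, overview_r)
```

Once every column is sorted on its own, any two numeric columns rise together, so a reader comparing them side by side sees a positive relationship regardless of the real one.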

Slide by Kunal Garach 231



Geographic Questions

• Spotfire should have done better on these
•Which part of the country has the most copper?
•Is there a relationship between the
concentration of vanadium and that of zinc?
•Is there a low-level chrome area that is high in
vanadium?
•Spotfire was better only for the last question
(out of 6 geographic ones)

232

232

Discussion

•Many studies of this kind use relatively simple
tasks that mirror the strengths of the system
•Find the one object with the maximum value for
a property
•Count how many of certain attributes there are
•This study looked at more complex, realistic, and
varied questions.

233



Discussion
•Success of a visualization system depends on many
factors:
• Properties supplied
•Spotfire doesn’t visualize as many dimensions
simultaneously
•Operations
•Zooming easy in InfoZoom; allows for drill-down
as well
•Zooming in Eureka causes context to be lost
•Column view in Eureka makes labels hard to see

234

234

Information Exploration
“Shootout”
• http://ivpr.cs.uml.edu/shootout/about.html
• Data Mining Applications
• One component focuses on visualization

235



Comparing Tree Views
• T. Barlow and P. Neville, Comparison of 2D Visualizations of
Hierarchies, INFOVIS’01.
• Problem
– Organization Chart is de facto standard for
visualizing decision trees. Is there a better compact
view of the tree for the overview window?
• Solution
– Two usability studies to determine which tree works
best.

Slide by Craig Rixford 236

236

Goal: Compact View of Tools

T. Barlow and P. Neville, Comparison of 2D Visualizations of Hierarchies, INFOVIS’01
237



Decision Trees

• Each split constitutes a rule
or variable in predictive
model
• Begin Splitting into nodes
• Often hundreds of leaves

Slide by Craig Rixford 238

238

Decision Trees – What makes a good visualization?
• Uses
– For novices: helps them understand models
– For experts: initial evaluation of decision trees without
looking at models
• Criteria for usability in study
– Ease of Interpretation of Topology (Parent Child
Sibling relations)
– Comparison of Node Size
– User preference

Slide by Craig Rixford 239



Different views examined in study

Org Chart Tree Ring Icicle Plot TreeMap

Slide by Craig Rixford 240

240

Usability Test 1:
• Users:
– 15 colleagues familiar with org chart but not others
• Tasks
– Is the tree binary or n-ary?
– Is the tree balanced or unbalanced?
– Find deepest common ancestor of two nodes
– Number of levels?
– Find three largest leaves (excluding org chart)

• Data: Created 8 trees for analysis


• Study Design
– Randomized order of tasks
– 4X5 design (almost)
– Timed task from appearance on screen until spacebar tap
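
The topology questions posed to participants (arity, balance, deepest common ancestor, number of levels) have direct computational counterparts. A minimal sketch on a hypothetical tree; the `Node` class, the helper names, and the definition of "balanced" (leaf depths differ by at most one) are assumptions, not taken from the study:

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def max_arity(node):
    """Largest number of children at any node (2 means the tree is binary)."""
    return max([len(node.children)] + [max_arity(c) for c in node.children])


def leaf_depths(node, depth=1):
    """Depth of every leaf, counting the root as level 1."""
    if not node.children:
        return [depth]
    return [d for c in node.children for d in leaf_depths(c, depth + 1)]


def path_to(node, target, path=()):
    """Names on the path from this node down to the target, or None."""
    path = path + (node.name,)
    if node.name == target:
        return path
    for child in node.children:
        found = path_to(child, target, path)
        if found:
            return found
    return None


def deepest_common_ancestor(root, a, b):
    """Last node shared by the root-to-a and root-to-b paths."""
    common = None
    for x, y in zip(path_to(root, a), path_to(root, b)):
        if x != y:
            break
        common = x
    return common


tree = Node("root", [
    Node("A", [Node("A1"), Node("A2", [Node("A2a"), Node("A2b")])]),
    Node("B"),
])

depths = leaf_depths(tree)
levels = max(depths)
is_balanced = max(depths) - min(depths) <= 1
print(max_arity(tree), levels, is_balanced, deepest_common_ancestor(tree, "A1", "A2b"))
```

The study asks people to read these properties off a picture; the code only makes precise what a correct answer is for each task.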

Slide by Craig Rixford 241



Results

• Response Time
– TreeMap slowest; no statistical difference between
others
• Response Accuracy
– No significant difference
• User Preference
– Prefer icicle map and org chart (faster)
– Dislike tree map

Slide by Craig Rixford 242

242

Discussion

• Org chart served as benchmark


• Icicle plot favored amongst others
– Hypothesis: Same left to right / top to bottom
structure
• TreeRing did well
• TreeMap suffered from poor accuracy
– Offset of rectangles required because of off (which is
needed for selection)

Slide by Craig Rixford 243



Usability Test II: Tree implementation

• Three views:
– TreeMap eliminated from this round
• Tasks
– Node Description
• Four versions – select those nodes or leaves that meet
certain criteria
– Node Analysis:
• Memorize a highlighted node – find again after tree
redrawn in different position

Slide by Craig Rixford 244

244

Results

• Tree rings slower for description but fast and
accurate for memory tasks
• Perhaps due to unique geometric forms /
spatial clues

Slide by Craig Rixford 245



Conclusions

• TreeMap not useful for this type of task


• Org Chart/Icicle seem to be best overall
• TreeRing has merits for certain tasks

• Icicle chosen for implementation


– Best design considering Org Chart could not be used
for node size tasks
• However:
– Didn’t seem to actually do tests on trees as large as
the ones they describe as typical of datamining

Slide by Craig Rixford 246

246

Visualizing Conversations

247



Text-Based Chat

Slide by Maggie Law & Vivien Petras 248

248

Chat Circles

Fernanda Viegas and Judith Donath, Chat Circles,
Proceedings of CHI'99.

Slide by Maggie Law & Vivien Petras 249



Chat Circles

• “Chat Circles is a graphical interface for synchronous
communication that uses abstract shapes to convey identity and
activity.”
• Each participant appears as a colored circle, which is
accompanied by the user name
• Location of circles will also identify participants (important for
many users having similar colors associated)
• Participants’ circles become larger when posting occurs (circle
adapts to text length)
• Circle appears bright when posting occurs
• Circles of inactive users fade in the background
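
The size and brightness rules above can be sketched as a small function. Every constant and formula here is hypothetical; the slides only state that circles grow with text length, brighten when a message is posted, and fade for inactive users:

```python
import math


def circle_state(message, seconds_since_post, base_radius=10.0, fade_seconds=30.0):
    """Return (radius, brightness) for one participant's circle.

    All constants are invented for illustration: the radius grows with the
    square root of the message length, and brightness decays linearly from
    1.0 at posting time down to a floor of 0.2 for inactive users.
    """
    radius = base_radius + 2.0 * math.sqrt(len(message))
    brightness = max(0.2, 1.0 - seconds_since_post / fade_seconds)
    return radius, brightness


print(circle_state("x" * 16, seconds_since_post=0))
print(circle_state("", seconds_since_post=60))
```

Using the square root keeps long posts from swamping the room, which is one plausible reading of "circle adapts to text length"; a linear rule would be an equally valid assumption.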

Slide by Maggie Law & Vivien Petras 250

250

Chat Circles –
Conversational Groupings

• There is only ONE room in Chat Circles
• Groupings are achieved by moving closer to other
participants
• At any time, a participant can view all other
participants
• A participant can also detect interesting
conversations in different areas of the room by
looking at how many circles are gathered and how
often circles become larger
• Overview panel in Chat Circles II is a nice example of
focus + context

Slide by Maggie Law & Vivien Petras 251



Chat Circles History

Slide by Maggie Law & Vivien Petras 252

252

History Log Patterns

+ Easy to see “lurkers”
+ Sequence and size of
messages quickly visible

- Not very scalable

Slide by Maggie Law & Vivien Petras 253



History Log Patterns

+/- User-centric: only 1 point
of view represented
- Impossible to see all the
text at once – requires
individual mouse rollovers
- Easy to see “out of range”
conversations – but why
would you want to?

Slide by Maggie Law & Vivien Petras 254

254

Agenda

• Introduction
• Visual Principles
• What Works?
• Visualization in Analysis & Problem Solving
• Visualizing Documents & Search
• Comparing Visualization Techniques
• Design Exercise
• Wrap-Up

255



Design Exercise

256

256

Design Exercise

• BreakingStory
(Reffel, Fitzpatrick, Ayedelott SIMS final project, at CHI 2003)
– Create an application that supplies a visualization for
trends over time in web-based news. The primary
purpose is to provide an overview, but it should also
be possible to view text from individual news sources
on specific days. Its goal is to inform, inspire, and
enlighten, and also to make people want to look
more deeply at the news.

257



Sample Solution

258



Another Approach: ThemeRiver
• S. Havre, B. Hetzler, L. Nowell, "ThemeRiver: Visualizing Theme Changes over
Time," Proc. IEEE Symposium on Information Visualization, 2000

262

262

Wrap-up: Guidelines for Success

263



Key Questions to Ask about a Viz

1. Is it for analysis or presentation?
2. What does it teach/show/elucidate?
3. What is the key contribution?
4. What are some compelling, useful examples?
5. Could it have been done more simply?
6. Have usability studies been done?
What do they show?

264

264

Holistic Design Goals for
Information Visualization

– Tailor to the application and the domain
– Create highly interactive and integrated
systems
– Embed the visualization within a larger
application
– Provide alternative views

265



Visualization with a Light Touch: Orbitz.com

266


For more information


• My course:
• http://www.sims.berkeley.edu/courses/is247/s02/Lectures.html
• Atlas of Cyberspaces:
• http://www.geog.ucl.ac.uk/casa/martin/atlas/atlas.html
• Gallery of Data Visualization; The Best and Worst of Statistical
Graphics
• http://www.math.yorku.ca/SCS/Gallery/
• Tamara Munzner’s collection:
• http://graphics.stanford.edu/courses/cs348c-96-fall/resources.html

271



Thank you!

272



© 2008 Oracle Corporation – Proprietary and Confidential 1


Oracle Marketing Analytics Overview



Oracle BI Success

Agenda

• Oracle Enterprise Performance Management System
• Oracle BI Applications Overview
• Oracle Marketing Analytics
• Value of Prebuilt Oracle BI Applications
• Customer Success
• Demonstration
• Q&A


Oracle Enterprise Performance
Management System

Oracle’s EPM Vision: Extend Operational
Excellence to Management Excellence

[Figure: Competitive Advantage plotted against Time, with OPERATIONAL EXCELLENCE and MANAGEMENT EXCELLENCE levels]


Enabling Management Excellence

Traditional Performance Management

SMART

AGILE

ALIGNED

Oracle’s EPM System

EPM Workspace

Performance Management Applications | BI Applications

Business Intelligence Foundation

Fusion Middleware

OLTP & ODS Systems | Data Warehouse, Data Mart | OLAP | SAP, Oracle, Siebel, PeopleSoft, Custom | Excel, XML | Business Process


Oracle BI Applications
Overview

What Gartner Is Saying
“The majority of customers are purchasing and
implementing BI and CPM as disparate point solutions,
which weaken their ability to achieve pervasive BI or to
link BI platform and CPM suites capabilities into an
integrated continuum to drive business transformation
from the strategic level to the process level”
Source: Employ a Coordinated Approach to BI and CPM, April 2007


Comprehensive BI Applications

EPM Workspace

Performance Management Applications | BI APPLICATIONS
BI APPLICATIONS: Sales | Service | Contact Center | Marketing | Procurement & Spend | Supply Chain & Order Mgmt | Finance | HR

Business Intelligence Foundation

Fusion Middleware

OLTP & ODS Systems | Data Warehouse, Data Mart | OLAP | SAP, Oracle, Siebel, PeopleSoft, Custom | Excel, XML | Business Process

Enabling the Insight-Driven Enterprise

1. Empower Everyone – Every person is provided with relevant,
complete and consistent information tailored to their function
and role.

2. Provide Real-time Intelligence – Deliver insight that predicts
the best next step, and deliver it in time to influence the
business outcome

3. Use Insight to Guide Actions – Lead people to take action
based on facts to optimize decisions, actions and customer
interactions

Becoming an insight-driven enterprise will drive the next level of
value creation and competitive advantage for organizations.


Oracle BI Applications
Complete, Pre-built, Best Practice Analytics

Industries: Auto | Comms & Media | Complex Mfg | Consumer Sector | Energy | Financial Services | High Tech | Insurance & Health | Life Sciences | Public Sector | Travel & Trans

Sales | Service & Contact Center | Marketing | Procurement & Spend | Supply Chain & Order Management | Financials | Human Resources
Pipeline Analysis | Service Effectiveness | Campaign Effectiveness | Direct / Indirect Spend | Revenue and Backlog | General Ledger | Employee Productivity
Forecast Accuracy | Customer Satisfaction | Customer Insight | Buyer Productivity | Inventory | Accounts Receivable | Compensation
Sales Team Effectiveness | Resolution Rates | Product Propensity | Off Contract Purchases | Fulfillment Status | Accounts Payable | Compliance Reporting
Up-sell / Cross-sell | Service Rep Efficiency | Loyalty & Attrition | Supplier Performance | Customer Status | Cash Flow | Workforce Profile
Cycle Times | Service Cost | Market Basket Analysis | Purchase Cycle Time | Order Cycle Time | Profitability Analysis | Retention
Lead Conversion | Churn & Service Trends | Campaign ROI | Employee Expenses | BOM Analysis | Expense Management | Return on Human Capital

Source adapters: [logos] and Other Operational & Analytic Sources
Oracle BI Suite Enterprise Edition Plus


Oracle Marketing Analytics


Marketing Organizations Struggle to Use Data
and Intelligence to Increase Performance

KEY CHALLENGES and EXAMPLES

Lack of Campaign Insight for Successful Lead Generation
• Unable to link vehicle, target list, offer and message mix with
campaign success
• Lack of complete visibility into campaign effectiveness and
downstream sales conversion rates
• Limited understanding of campaign response rates

Limited Visibility into Marketing Performance & Accountability
• Unable to determine campaign ROI
• No means to assess segment penetration effectiveness and perform
cross sell analysis
• No knowledge of effectiveness of marketing funds in generating sales

Lack of Customer Insight into Buying Behavior
• Lack of visibility into common customer-preferred product and service
bundles
• Inability to establish correlation between customer buying patterns
and behavioral attributes

Unable to Control and Manage Marketing Spend
• Limited information to effectively control marketing expenses
• No means to know marketing cost distribution across customer
profiles
• Inability to make fact based resource allocations

Oracle Marketing Analytics Provides Insight
to Optimize Spending and Drive Demand

ANALYSIS & METRICS

Marketing Planning
• Sales alignment
• Competitor pipeline
• Forecast & Actual expenses by time
• Executive scorecard report
• Expense analysis by time
• Financial information on marketing tactics

Marketing Performance
• Campaign scorecard
• Campaign trends
• Campaign pipeline
• Cross sell analysis
• Cumulative revenue trend
• Oppty revenue by product
• Demographics profile of responders

Customer Insight (B2B)
• Account revenue
• Revenue growth
• Account status
• # of new accounts
• Market basket analysis
• Account attrition
• Next product purchased
• Over promoted customers

Customer Insight (B2C)
• Income/Age range
• Customer counts
• Contact attrition
• # of customer interactions
• # of new contacts

Events
• Top events ranking
• Event expenses
• Event scorecard
• Events by region/type
• Events lead generation
• Opportunity revenue

BENEFITS
• Monitor campaign performance to take timely corrective action to improve
efficiencies
• Make intelligent resource allocations based on effectiveness of tactics
• Track expenses and reduce wasted spend
• Increase customer profitability with better buyer behavior insight
• Improve cross-selling


Oracle Marketing Analytics
Complete solution for entire marketing organization
Marketing Planning
Provides Marketing Planning related information. The information is
organized for different roles like Marketing Executive, Director, Finance
Director. The dashboard also has a Sales Alignment page to allow Sales
and Marketing Executives to co-ordinate activities

Campaign Performance
Provides Campaign Results data by Offer,
Segment, and Agent performance. Managers can
monitor a campaign scorecard and identify root
causes for shortfalls in meeting predicted goals

Customer Insight
Provides product affinity, market basket and next product purchased
analysis. Provides demographic information and information on the
impact of marketing activities on customer behavior.
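
As a rough illustration of what a market basket analysis computes, the sketch below counts how often pairs of products are bought together and derives support and confidence for each pair. It is a generic, minimal version of the technique and not Oracle's implementation; the function name `pair_stats` and the sample baskets are made up for the example.

```python
from collections import Counter
from itertools import combinations


def pair_stats(baskets):
    """Return support and confidence for every product pair seen together."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    n = len(baskets)
    stats = {}
    for (a, b), together in pair_counts.items():
        stats[(a, b)] = {
            "support": together / n,                         # share of baskets with both
            "confidence_a_to_b": together / item_counts[a],  # P(b in basket | a in basket)
            "confidence_b_to_a": together / item_counts[b],  # P(a in basket | b in basket)
        }
    return stats


# Hypothetical purchase data
baskets = [
    ["phone", "case"],
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case"],
]
stats = pair_stats(baskets)
charger_phone = stats[("charger", "phone")]
```

In this sample every basket containing a charger also contains a phone, so `charger_phone["confidence_a_to_b"]` is 1.0, the kind of signal a "next product purchased" report would surface.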
Marketing Events Analytics
Provides Analytics related to management of trade
shows, customer events etc. Marketing Events
Analytics can show analysis of Event registrations,
expenses on supplies by vendor, region, event etc.,
Event ROI analysis that is fully integrated with
Marketing Planning Analytics.

Complementary BI Applications
Complete Solution for entire Campaign-to-Cash Process
Sales Analysis
• Analyze pipeline opportunities and forecasts to
determine actions required to meet sales targets.
• Determine which products and customer segments
generate the most revenue and how to effectively
cross-sell and up-sell.
• Understand which competitors are faced most often
and how to win against them.

Supply Chain & Order Management Analytics


• Provides insight into critical Order Management
business processes and key information, including
Orders, Order Fulfillment, Invoices, G/L Revenue,
sales effectiveness and customer scorecards.
• The delivered analysis of every step in the back-office
sales processes, from Order to Cash, enables
companies to respond more quickly to customer issues
and resolve them before they become problems.


Complementary Oracle Application
Marketing Segmentation

• Highly Interactive Interface
• Drag and drop criteria definition and
grouping, across multiple customer levels
• Simplified query terminology (‘Start with’,
‘Keep’, ‘Add’, ‘Exclude’ customers)
• “Waterfall” style display of counts
• Sample counts for large data sets
• Personal and shared subject areas

• Fully Integrated on Analytics Platform
• Queries across many different stars and
subject areas, allowing complex queries
• Shields the marketer from underlying data
complexity and performance optimization
• Uses same metadata as reporting tools;
leverages all available calculations and
metrics, plus data mining models

• Enforcement of Global Rules
• Consistently apply governance rules (such
as profiling, privacy, contact frequency)
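
The ‘Start with’, ‘Keep’, ‘Add’, ‘Exclude’ terminology maps naturally onto set operations, and the “waterfall” display is the running count after each step. The sketch below illustrates that idea only; it is not the product's query engine, and the function `run_segment`, the operation names, and the customer IDs are assumptions made for the example.

```python
def run_segment(steps):
    """Apply (operation, customer_ids) steps in order.

    Returns the final customer set and the waterfall of counts after each step.
    """
    current = set()
    waterfall = []
    for operation, members in steps:
        if operation == "start_with":
            current = set(members)
        elif operation == "keep":       # intersection: keep only matching customers
            current &= set(members)
        elif operation == "add":        # union: bring in additional customers
            current |= set(members)
        elif operation == "exclude":    # difference: drop matching customers
            current -= set(members)
        else:
            raise ValueError(f"unknown operation: {operation}")
        waterfall.append((operation, len(current)))
    return current, waterfall


# Hypothetical customer ID sets
bought_product_a = {1, 2, 3, 4, 5, 6}
in_west_region = {2, 3, 4, 7, 8}
event_attendees = {9, 10}
opted_out = {3, 10}   # e.g. a global privacy or contact-frequency rule

segment, waterfall = run_segment([
    ("start_with", bought_product_a),
    ("keep", in_west_region),
    ("add", event_attendees),
    ("exclude", opted_out),
])
```

The final `exclude` step shows where a global governance rule could be applied consistently to every segment.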


Value of Prebuilt Oracle BI
Applications


Key Benefits of Oracle BI Applications

• Insight
➢ Gain visibility and insight into business
performance, processes, and customers
➢ Better decisions, actions, control at all levels
➢ Respond faster to opportunities and threats
➢ Identify and replicate best practices

• Alignment

• Leverage

Role-Based Best Practices Provide Relevant
and Actionable Insight for Everyone
Marketing Analytics – Key Objectives and Questions by Role

VP Marketing / CMO: Optimizing Marketing Performance for Competitive Advantage
• How is the marketing budget being consumed?
• What areas / programs are trending to go over budget?
• How should I allocate the marketing budget to generate the best results?
• What areas historically have yielded the best results?

Marketing Director: Deeper Insights for Marketing Strategy
• Which customer segments are unprofitable and why?
• What is our most profitable acquisition method?
• What trade shows generated a positive ROI?
• How can we increase revenue through more effective cross-selling/up-selling?
• What can we do to increase customer satisfaction and loyalty?
• What types of promotions deliver the most revenue lift?

MARCOM / DM Manager: Better Manage Acquisition & Campaign Performance
• Do purchased lists perform better than our house list and why?
• Is the sales organization picking up the leads in a timely manner?
• Which marketing campaigns generated the most qualified leads?
• Which programs / campaigns yield the highest take rate?


Marketing Process Relationship Map

[Diagram: Marketing and Sales processes arranged in three tiers]
Core Processes: Campaign Management, Event Management, Sales Execution
Management Processes: Planning & Budgeting
Support Processes: Needs Analysis

Example Response and Lead Management Process

Marketing
• Plan, budget
• Execute campaigns
• Nurture prospects
• Track, measure results

Marketing/Sales Operations
• Capture, load responses
• Cleanse, enrich, score responses
• Assign leads
• Assign oppty, notify rep

Sales Development Call Center
• Qualify leads, create opptys

Field Sales and Channel Sales
• Set objectives, define “lead”
• Accept or reject opptys
• Work opptys


End-to-End Campaign to Cash Flow
Quality Information Is Needed During All Stages

Business Process: Plan, Design, Target, Launch, Track, Analyze

Key Questions

Plan
• How do I set some realistic goals for this campaign?
• How much shall I spend?
• How to allocate the budget?

Design
• What should the campaign flow look like? Any related events?
• Which Product offer and through which channel?

Target
• Who is the target audience?
• Who is more likely to respond?
• How to segment them?

Launch
• When should the campaign be launched?
• Shall I launch it in waves?

Track
• Are the responses, leads, opportunities, orders etc. matching the plan?

Analyze
• How was campaign performance?
• How much did I spend?
• What is the revenue? ROI?

Maximizing Campaign Effectiveness Will Enable
High Return on Marketing Investment and Drive Sales Revenues

Campaign to Cash Flow Example Decision Flow
Marketing Executive Role

Business Objective: Marketing Planning & Execution

Gain Insights
• Are we on target to meet our goals?
• What campaigns are under / over performing?
• Are we generating quality leads from these campaigns?
• How are these leads converting to orders?

Take Action
• Drill to Campaign to modify


Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Set campaign goals

Gain Insights
• What is historical campaign performance for similar tactics?
• What was the plan for those campaigns? Was the actual in line with
the goal for various measures?
• Did they get anticipated response? How was the lead generation? How
many orders? How was the revenue?
• What was the plan on spending? Did the expense stay within planned
budget?
• Are there any patterns with large variance? Analyze variance by
Time, Region, Organization etc.

Take Action
• Plan the campaign with these data points

Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Design the Campaign / program

Gain Insights
• Analyze similar campaigns / programs which provided strong ROI
• How was the response rate in the past for similar product launch /
offers?
• If the campaign flow was phased, would it be more effective?
• For this product and target, which channel works best?
• When is the best time to launch marketing events?
• What are the inferred leads generated?

Take Action
• Design the flow
• Build program flow


Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Identify the target audience

Gain Insights
• Identify similar tactics. Which customers did I target in those
campaigns?
• What is the purchasing behavior of those potential targets?
• What is the demographic profile of those customers / prospects?
• Have they been contacted through other campaigns in the recent past?
• Exclude people from the list based on the contact frequency
and their preferences

Take Action
• Conduct campaign history analysis to identify potential contact lists
• Identify target market segments using past purchase behavior
and likelihood of responding
• Finalize the target

Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Start campaign

Gain Insights
• When is the last campaign completing?
• Are there other ongoing campaigns?
• Are all channels capable of handling increased volume of interactions?
• Have all treatment / media been approved?
• Have response assignment rules been prepared?

Take Action
• Execute launch of campaign across all channels


Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Track campaign / program results

Gain Insights
• Are the actuals in line with the goal for each metric?
• How was the campaign response? What is the number of bounced
interactions?
• Is marketing generating enough quality leads for the sales force?
• Is the sales force converting those leads to opportunities in a
timely manner?
• Are these Opportunities leading to Quotes?
• What is Actual Order Revenue from the campaigns?

Take Action
• Review response accuracy
• Review response to lead to order conversion rates and cycle times

Campaign to Cash Flow Example, MARCOM Manager

Plan Design Target Launch Track Analyze

Business Objective: Analyze campaign execution

Gain Insights
• Did the marketing message get successfully delivered?
• Have we updated contact information based on success/failure
of delivery?
• What percentage of the people I contacted responded?
• How quick was response qualification?

Take Action
• Campaign delivery metrics
• Review campaign execution cycle times

Business Objective: Analyze campaign performance

Gain Insights
• What is the response rate? Which channel generated more responses?
• What is the actual expense and cost involved?
• What is the lead generation rate? Analyze leads by geography, marketing
source, time etc.
• How is the ROI? How does it compare against other campaigns?

Take Action
• Campaign response and lead metrics
• Develop Campaign ROI Analysis
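
The questions in the Analyze stage reduce to a handful of ratios. The slides name the metrics but do not define them, so the formulas below are common textbook definitions used here as assumptions, and the function `campaign_metrics` and the sample figures are hypothetical.

```python
def ratio(numerator, denominator):
    """Divide safely; return None when the denominator is zero."""
    return numerator / denominator if denominator else None


def campaign_metrics(contacted, responses, leads, orders, revenue, cost):
    # Assumed definitions, not taken from the slides:
    return {
        "response_rate": ratio(responses, contacted),     # responders / people contacted
        "lead_generation_rate": ratio(leads, responses),  # leads / responses
        "lead_to_order_rate": ratio(orders, leads),       # orders / leads
        "roi": ratio(revenue - cost, cost),               # (revenue - cost) / cost
    }


# Hypothetical campaign figures
example = campaign_metrics(
    contacted=10_000,
    responses=500,
    leads=100,
    orders=20,
    revenue=50_000.0,
    cost=20_000.0,
)
```

Computing the same dictionary for several campaigns gives a direct way to answer "How does it compare against other campaigns?".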


Key Benefits of Oracle BI Applications

• Insight

• Alignment
➢ Gain a single, consistent view of enterprise
information across functions & data sources
➢ Align strategy and execution across functions
➢ Use guided analytics and best practice analytic
workflows to drive the best actions

• Leverage

Typical Operational Challenges

[Diagram: Executives receive analyses and reports from Sales, Marketing, Operations and Finance; each function draws on its own data (Sales Data, Marketing Data, Operations Data 1 to N, Finance Data 1 to N); IT maintains a Data Warehouse]

• Delayed, inaccurate reporting
• Conflicting, departmentally-biased results
• Cross-functional analysis only by IT
• Sub-optimal enterprise performance


Valuable Insights Often Require Data from
Multiple Departments and Sources

[Diagram: Customers and Suppliers linked through HR/Workforce, Procurement, Distribution, Operations, Marketing, Finance, Service and Sales]

How does lead quality affect conversion rates, pipeline build-up
and revenue?

How does sales performance of sales reps relate to their tenure
in the team/organization?

Does on-time shipment relate positively to repeat purchases?

Maximizing Customer Value
Key Objectives in Sales, Marketing, Customer Service

FRONT OFFICE
• Primary role: identify, acquire & support
customers
• Key objective: grow revenues and
profit by maximizing customer value
• Primary functions: Sales, Marketing,
Service, Contact Center

BACK OFFICE
• Primary role: buy, make, and deliver
products, support the workforce, and
manage finances
• Key objective: maximize operational
efficiency, quality & accuracy while
controlling costs
• Primary functions: Finance, HR, Supply
Chain Operations, Procurement, Order
Management

Customers Front Office Back Office Suppliers


Typical Business Challenges

SALES
• How are actual sales tracking against forecast and plan by region?
• What are the best products to cross and up-sell?
• Why are sales opportunities being lost?

MARKETING
• Which campaign tactics are most effective?
• Which offers are succeeding with different customer segments?
• What is the product mix compared to plan?

CONTACT CENTER
• What are average handle times and abandonment rates?
• Which are the most productive and efficient CSRs and why?
• What are the best cross-sell and up-sell offers for each segment?

SERVICE
• How do I reduce costs while maintaining high customer satisfaction?
• What are quality levels and component failure rates by product?
• How long is it taking to resolve new service requests?

Deeper Insight within Business
Functions

SALES ANALYTICS
• Improve pipeline visibility
• Forecast with confidence
• Increase cross & up-selling
• Quickly spot opportunities/threats

CONTACT CENTER ANALYTICS
• Understand service cost drivers
• Optimize staffing for call volumes
• Monitor CSR performance & drivers
• Detect defects early

MARKETING ANALYTICS
• Identify high potential segments
• Maximize return on marketing spend
• Track campaign results & impact

SERVICE ANALYTICS
• Improve customer service
• Drive efficiency, lower costs
• Provide single view of customer


Alignment across Functions

SALES ANALYTICS and CONTACT CENTER ANALYTICS
• Best align switch workflows to increase customer satisfaction
• Convert inbound service calls to sales

SALES ANALYTICS and MARKETING ANALYTICS
• Improve lead follow-up and conversion
• Understand campaign impact on revenue
• Improve customer product and service offerings

CONTACT CENTER ANALYTICS and SERVICE ANALYTICS
• Monitor and manage service channel usage/mix
• Understand compliance of call handling to SLAs
• Monitor customer satisfaction vs. service cost

MARKETING ANALYTICS and SERVICE ANALYTICS
• Devise marketing programs to deflect product availability or
quality issues
• Understand how marketing promotions impact service centers

All functions
• Understand customer profitability and tailor customer experience

Alignment across the Enterprise

Front Office: Sales Analytics, Marketing Analytics, Contact Center Analytics, Service Analytics
Back Office: Financial Analytics, HR Analytics, Supply Chain & Order Mgmt Analytics, Procurement & Spend Analytics

• Impact of product mix and discounts
on revenue and margins
• Correlation between training &
compensation and worker productivity
• Visibility into supply chain enabling
delivery of the perfect order
• Complete visibility across value chain
to better manage supply and demand
fluctuations


Oracle BI Applications Provide a Single
Integrated View of Enterprise Information

INTEGRATED DATA WAREHOUSE
• Integrated enterprise-wide intelligence
• Summary level to lowest level of detail
• Data warehousing best practices –
conformed dimensions, lowest level of
granularity, full change histories for time
comparisons, built for speed, extensible

DATA INTEGRATION FROM MULTIPLE SOURCES
• Call center telephony (IVR, ACD, CTI)
• Syndicated data
• Universal Adapters

Key Benefits of Oracle BI Applications

• Insight

• Alignment

• Leverage
➢ Do more with less - deploy BI more broadly
with fewer IT resources than custom-build
➢ Accelerate time-to-value, lower TCO and risk
➢ Increase the value of existing data and
applications, including CRM and ERP


Building BI Solutions is Challenging
Investment, Skills and Time Required

• Develop detailed understanding of operational data sources
• Design a data warehouse by subject area
• License an ETL tool to move data from operational systems
to this DW
• Build ETL programs for every data source
• License interactive user access tools
• Research/understand analytic needs of each user community
• Build analytics for each audience
• License/create information delivery tools
• Set up user security & visibility rules
• Perform QA & performance testing
• Manage on-going changes/upgrades

INVESTMENTS: These steps require multiple different BI and DW technology investments
SKILLS: These steps require IT or BI staff resources with specialized skills
TIME: These steps take time to understand and perfect as knowledge of best practices is learned

Marketing Analytics Components
1 Pre-built warehouse with 15 star-schemas
designed for analysis and reporting on
Marketing data

2 Pre-built ETL to extract data from over 1,000
operational tables and load it into the DW,
sourced from CRM systems and other
sources

3 Pre-mapped metadata defining real-time
access to analytical and operational sources,
best practice calculations, and metrics for
marketing.
➢ Presentation Layer
➢ Logical Business Model
➢ Physical Sources

4 A “best practice” library of over 500 pre-built
metrics, Intelligence Dashboards, Reports
and alerts for marketing analysts, managers
and executives.


More than just dashboards and reports
Value of BI Apps lies under the surface

DASHBOARDS &
REPORTS
• Prebuilt best
practice library
• “One size does
NOT fit all”

SUBJECT AREAS
• Many metrics and dimensional
attributes not surfaced by prebuilt
dashboards and reports
• Possibilities are endless
• Incremental work to build tons
more content from this foundation

Unrivaled Integration with Oracle Apps
Extends BI Value. Lowers TCO.
ACTION LINKS – “INSIGHT TO ACTION”
Seamless navigation from analytical information
to transactional detail

INTEGRATED SECURITY
One login. Right content for each user.
Data Security | User Security | Object Security

INTEGRATED WORKFLOW
Intelligence-driven business processes
[Diagram labels: BPEL, ESB, Oracle BI]

INTEGRATED WITH PLANNING AND EPMS
View performance “in-context” with budgets
and plans


Unrivaled Integration with Oracle Apps
Deeply Integrated into Siebel CRM

• Single user interface - essential for driving user adoption
• Action Links - direct navigation from record to
transactional while maintaining context
• Take action immediately without navigating to a different
screen

Unrivaled Integration with Oracle Apps
Deeply Integrated into Siebel CRM

• Integrated Data Security Visibility
• One login. Right content for each user based on
• Position
• Owner
• Organization

[Diagram: ETL between Siebel CRM and the Oracle Business Analytics Warehouse; Data Security flow through Oracle BI]
1 User logs in to Oracle BI
2a Retrieve position hierarchy from the warehouse
2b Fetch Owner ID and Organization(s) via session init block
3 Show data based on security group filters
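
The login flow above can be sketched as a row-level filter. This is a simplified illustration and not how Oracle BI or Siebel implement security: the function names, the `reports_to` mapping, the record layout, and the rule that a row is visible when any one of position, owner, or organization matches are all assumptions made for the example.

```python
def positions_under(position, reports_to):
    """Return the given position plus every position below it in the hierarchy."""
    children = {}
    for child, parent in reports_to.items():
        children.setdefault(parent, []).append(child)
    seen, stack = set(), [position]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, []))
    return seen


def visible_rows(rows, session):
    # Assumed rule: a row is shown if it matches the user's position
    # hierarchy, owner ID, or one of the user's organizations.
    return [
        row for row in rows
        if row["position"] in session["positions"]
        or row["owner_id"] == session["owner_id"]
        or row["organization"] in session["organizations"]
    ]


# Hypothetical position hierarchy (child -> parent) and data rows
reports_to = {"rep_east": "mgr_east", "rep_west": "mgr_west",
              "mgr_east": "vp", "mgr_west": "vp"}
rows = [
    {"id": 1, "position": "rep_east", "owner_id": "u1", "organization": "east"},
    {"id": 2, "position": "rep_west", "owner_id": "u2", "organization": "west"},
    {"id": 3, "position": "vp", "owner_id": "u9", "organization": "hq"},
]

# Steps 2a and 2b: build the session context once at login
session = {
    "positions": positions_under("mgr_east", reports_to),
    "owner_id": "u5",
    "organizations": {"east"},
}
# Step 3: filter what the user sees
shown_ids = [row["id"] for row in visible_rows(rows, session)]
```

With this sample data the east manager sees only the east rep's record, while a session built for `"vp"` would see every row.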


Align Actions with Best Practices
Guided & Conditional Navigation Helps Novice Users

GUIDED NAVIGATION
• Enables users to quickly navigate a standard path of analytical
discovery specific to their function and role
• Enhances usability and lowers learning curve for new users

CONDITIONAL NAVIGATION
• Appears only when conditions are met and alerts users to
potential out of ordinary conditions that require attention
• Guides users to next logical step of analytical discovery
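
The difference between the two kinds of navigation can be shown with a small sketch: a guided link is always offered, while a conditional link appears only when its condition holds for the current metrics. The function `visible_links`, the link labels, and the thresholds are hypothetical and are not part of the product.

```python
def visible_links(metrics, links):
    """Return the labels of links whose condition holds for these metrics."""
    return [link["label"] for link in links if link["show_if"](metrics)]


links = [
    # Guided navigation: always part of the standard path
    {"label": "Campaign scorecard", "show_if": lambda m: True},
    # Conditional navigation: shown only for out-of-ordinary conditions
    {"label": "Campaigns over budget",
     "show_if": lambda m: m["actual_spend"] > m["budget"]},
    {"label": "Low response rate detail",
     "show_if": lambda m: m["response_rate"] < 0.02},  # assumed threshold
]

on_track = visible_links(
    {"actual_spend": 80, "budget": 100, "response_rate": 0.05}, links)
over_budget = visible_links(
    {"actual_spend": 120, "budget": 100, "response_rate": 0.05}, links)
```

A user whose campaigns are on track sees only the standard path; once spend exceeds budget, the extra link appears and points to the next logical step.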

Speeds Time To Value and Lowers TCO

Build from Scratch with Traditional BI Tools (Months or Years)
• Back-end ETL and Mapping
• DW Design
• Define Metrics & Dashboards
• Training / Roll-out

Oracle BI Applications (Weeks or Months)
• Back-end ETL and Mapping: Prebuilt Business Adapters for
Oracle, PeopleSoft, Siebel, SAP, others
• DW Design: Prebuilt DW design, adapts to your EDW
• Define Metrics & Dashboards: Role-based dashboards and
thousands of pre-defined metrics
• Training / Rollout: Easy to use, easy to adapt

Results
• Faster time to value
• Lower TCO
• Assured business value

Source: Patricia Seybold Research, Gartner, Merrill Lynch, Oracle Analysis


Typical Effort & Customization balance

Layer | Level of Effort | Degree of Customization
Dashboards & Reports | Easy | Additional dashboards and reports, guided and conditional navigations, iBots, etc.
OBIEE Metadata | Moderate | Additional derived metrics, custom drill paths, exposing extensions in physical, logical and presentation layer, etc.
DW Schema | Intermediate | Extension of DW Schema for extension columns, additional tables, external sources, aggregates, indices, etc.
ETL | Involved | Extension of ETL for extension columns, descriptive flexfields, additional tables, external sources, etc.

The Value is Below the Surface
Oracle BI Applications

• Dashboards
• Pre-built ETL across multiple applications and sources
• Pre-mapped metadata
• Pre-built metrics
• Pre-built data model


BI Applications - Business Content

Over 5,000 pre-defined assets

Application                     Dashboards  Dashboard Pages  Reports  Metrics
Sales                                   14               33      620      555
Marketing                                5               27      124      501
Service                                  8               15      102      465
Contact Center                           5               17       72      448
Finance                                  4               30      205      360
HR                                       4               16       76      138
Supply Chain & Order Mgmt.               2               18      157      388
Procurement & Spend                      2               14      103      161
All Industry Apps                       44              147     1117      508
Total                                   88              317     2576     3524
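
As a quick arithmetic check, the Total row and the "over 5,000" headline can be recomputed from the per-application rows. The snippet below is only a sketch: the figures are copied from the table on this slide, and the names `CONTENT` and `column_totals` are introduced here for illustration.

```python
# Per-application counts copied from the slide:
# (dashboards, dashboard pages, reports, metrics)
CONTENT = {
    "Sales": (14, 33, 620, 555),
    "Marketing": (5, 27, 124, 501),
    "Service": (8, 15, 102, 465),
    "Contact Center": (5, 17, 72, 448),
    "Finance": (4, 30, 205, 360),
    "HR": (4, 16, 76, 138),
    "Supply Chain & Order Mgmt.": (2, 18, 157, 388),
    "Procurement & Spend": (2, 14, 103, 161),
    "All Industry Apps": (44, 147, 1117, 508),
}


def column_totals(content):
    """Sum each of the four columns across all applications."""
    return tuple(sum(column) for column in zip(*content.values()))


totals = column_totals(CONTENT)
grand_total = sum(totals)

print(totals)       # (88, 317, 2576, 3524), matching the Total row
print(grand_total)  # 6505, consistent with "over 5,000 pre-defined assets"
```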

Selected Key Entities of Business Analytics Warehouse

Sales
▪ Opportunities
▪ Quotes
▪ Pipeline

Order Management
▪ Sales Order Lines
▪ Sales Schedule Lines
▪ Bookings
▪ Pick Lines
▪ Billings
▪ Backlogs

Marketing
▪ Campaigns
▪ Responses
▪ Marketing Costs

Supply Chain
▪ Purchase Order Lines
▪ Purchase Requisition Lines
▪ Purchase Order Receipts
▪ Inventory Balance
▪ Inventory Transactions

Finance
▪ Receivables
▪ Payables
▪ General Ledger
▪ COGS

Call Center
▪ ACD Events
▪ Rep Activities
▪ Contact-Rep Snapshot
▪ Targets and Benchmark
▪ IVR Navigation History

Service
▪ Service Requests
▪ Activities
▪ Agreements

Workforce
▪ Compensation
▪ Employee Profile
▪ Employee Events

Pharma
▪ Prescriptions
▪ Syndicated Market Data

Financials
▪ Financial Assets
▪ Insurance Claims

Public Sector
▪ Benefits
▪ Cases
▪ Incidents
▪ Leads

Conformed Dimensions
▪ Customer
▪ Products
▪ Suppliers
▪ Internal Organizations
▪ Customer Locations
▪ Customer Contacts
▪ GL Accounts
▪ Employee
▪ Sales Reps
▪ Service Reps
▪ Partners
▪ Campaign
▪ Offers
▪ Cost Centers
▪ Profit Centers

Modular DW Data Warehouse Data Model includes:
▪ ~350 Fact Tables
▪ ~550 Dimension Tables
▪ ~3,500 prebuilt Metrics (2,000+ are derived metrics)
▪ ~15,000 Data Elements

Rapid Deployments
Oracle BI Applications

6 weeks

9 weeks

10 weeks

12 weeks

3 months

3½ months

100 days


Customer Success


Business Intelligence Customers


Select References
[Customer logos grouped by industry: Communications, Automotive, Finance / Banking, Consumer Goods, High Tech, Media / Energy, Aero / Industrial, Insurance / Health, Life Sciences, Other]

Holistic View of Customer Information Enables Alignment of Marketing and Service

World’s leading manufacturer and marketer of major home appliances. Deployed Oracle BI Suite EE and Oracle Marketing Analytics integrated with Siebel CRM Call Center Application.

Before
▪ No centralized customer view
▪ Multiple siloed customer data sources hampered marketing abilities
▪ Slow time-to-market with marketing campaigns despite millions spent on outside vendors
▪ Call center unable to effectively use customer data to enhance service or capitalize on sales opportunities

After
▪ Companywide, holistic view of information by customer, household and asset
▪ Consolidated 3 customer databases into 1
▪ Accelerated marketing campaign introductions to capitalize on trends
▪ Provided call centers with information and tools to up-sell customers and establish “closed loop” marketing capabilities

“With Oracle, Whirlpool business units are capitalizing on the integration between our
business intelligence, call center, and marketing solutions to drive revenue creation
and customer loyalty incentives.” - Thomas Mender, Whirlpool Corporation


Optimize CRM Processes and Performance


1,250 Users, 400 Reports, 3 Months, 1 IT Admin
Pitney Bowes is the world’s largest producer of postage
meters. Implemented Oracle BI Applications (Sales, Service,
and Marketing Analytics) to over 1,250 employees.

Before
▪ Poor measurement of employee performance in sales and service
▪ Lack of customer insight - no consistent, real-time view
▪ Slow “Customer Inquiry Response Time”
▪ No single source of customer data for segmentation
▪ High reliance on IT for information

After
▪ “Turned the tides” in sales force with better insight into performance
▪ Enhanced sales productivity with 360° view of customer relationship
▪ Increased customer responsiveness, leading to greater satisfaction / retention
▪ Unified customer data for better marketing segmentation and targeting
▪ Customer-facing employees empowered with the information they need

“One of the most important values of Oracle’s BI solution is its TCO. We created 400
reports used by 1,250 users with a staff of one within a few months—that is very
cost effective.” – William Duffy, Data Warehousing Project Manager

Q&A