Lecture08 1

Business Intelligence III
Damianos Chatziantoniou (damianos@aueb.gr)

Department of Management Science and Technology
Athens University of Economics and Business
Topics
 Common Issues in Star Schemas
 SQL Analytic Functions and MDX
 Introduction to Data Mining
11/29/2020 Data Management, Business Intelligence and Visualization 2

Common Issues in Star Schemas
Slowly Changing Dimensions (1)
 So far:
 New rows in dimension tables can be inserted
 Existing rows do not change
 This is not a realistic assumption: dimensions do change
 “Slowly changing dimensions” (SCD) phenomenon
 Dimension information changes (e.g. an attribute value)
 The dimension’s schema changes (a separate problem)

 Attribute values in dimensions vary
over time
 A store changes Size
 A product changes Description
 A customer changes Address
 Handling changes
 If dimensions are not updated,
the DW is not up-to-date
 If dimensions are updated in a
straightforward way (i.e. values are replaced),
historical data carries incorrect information

Source: Introduction to Data Warehousing and
Business Intelligence, by Christian Jensen
One day… the store expands to 450 sq. meters

 Solution 1: do nothing or overwrite
 Consequences:
 Old facts point to rows in the dimension tables with incorrect information!
 New facts point to rows with correct information
 Pros:
 Easy to implement
 Useful if the updated attribute is not significant, or the old value should be
updated for error correction
 Cons:
 Old facts may point to “incorrect” rows in dimensions

 Solution 2: Create new record with start/end
timestamp columns (versioning)
 The key that links dimension and fact table,
identifies a version of a row, not just a “row”
 Surrogate keys make this easier to implement (i.e.
the primary key of a dimension is a randomly
generated or an increasing integer and not the
“actual” id)

 From/To columns provide time range
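A minimal sketch of Solution 2, using Python's built-in sqlite3 module. The table and column names (store_dim, sales_fact, valid_from, valid_to), the store id, the dates and the original size of 250 sq. meters are assumptions made for illustration; only the expansion to 450 sq. meters comes from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (
    store_key  INTEGER PRIMARY KEY,  -- surrogate key: identifies one version of a store
    store_id   TEXT,                 -- the "actual" (business) id
    size_sqm   INTEGER,
    valid_from TEXT,
    valid_to   TEXT                  -- NULL marks the current version
);
CREATE TABLE sales_fact (store_key INTEGER, sale_date TEXT, amount REAL);

INSERT INTO store_dim VALUES (1, 'S001', 250, '2019-01-01', NULL);
INSERT INTO sales_fact VALUES (1, '2019-06-15', 100.0);
""")

def change_store_size(conn, store_id, new_size, change_date):
    """Close the current version of the store and insert a new version."""
    conn.execute(
        "UPDATE store_dim SET valid_to = ? WHERE store_id = ? AND valid_to IS NULL",
        (change_date, store_id),
    )
    cur = conn.execute(
        "INSERT INTO store_dim (store_id, size_sqm, valid_from, valid_to) "
        "VALUES (?, ?, ?, NULL)",
        (store_id, new_size, change_date),
    )
    return cur.lastrowid  # surrogate key of the new version

new_key = change_store_size(conn, "S001", 450, "2020-03-01")
conn.execute("INSERT INTO sales_fact VALUES (?, '2020-04-10', 80.0)", (new_key,))

# Each fact joins to the store version that was valid when the fact was recorded.
rows = conn.execute("""
    SELECT f.sale_date, d.size_sqm
    FROM sales_fact f JOIN store_dim d ON f.store_key = d.store_key
    ORDER BY f.sale_date
""").fetchall()
print(rows)
```

In this sketch the old fact keeps pointing to the 250 sq. meter version, while the new fact points to the 450 sq. meter version.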

Common Kinds of Dimensions
 Different kinds of dimensions:
 Minidimensions
 Outrigger dimensions
 Junk dimensions
 Time dimensions
 Data Quality dimensions

Minidimensions
 Some dimensions have many attributes that
change too often, for example a customer’s age, salary
range, etc. In such cases, one can split the
dimension in two, placing all the frequently changing
attributes together in a separate minidimension.
 The minidimension can hold all possible combinations
of the values of the frequently changing attributes (the
Cartesian product of the attribute values), or only the
combinations that actually occur in real life.
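As a rough illustration of the Cartesian-product variant, the sketch below builds a minidimension from two banded attributes. The band values and the names demo_key and customer_demographics are hypothetical.

```python
from itertools import product

# Hypothetical value bands for the frequently changing attributes.
age_bands = ["<25", "25-44", "45-64", "65+"]
salary_bands = ["low", "medium", "high"]

# Cartesian product: one minidimension row per combination, keyed by a surrogate id.
customer_demographics = [
    {"demo_key": key, "age_band": age, "salary_band": salary}
    for key, (age, salary) in enumerate(product(age_bands, salary_bands), start=1)
]

print(len(customer_demographics))  # 12
print(customer_demographics[0])
```

A fact row would then store one demo_key, so a change in a customer's age band does not require a new row in the main customer dimension.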

Outrigger Dimensions
 A dimension that is referenced from another dimension
is called an outrigger.
 Outriggers are used in relational OLAP environments
where the outrigger’s dimension table is referenced by a
foreign key in another dimension table. To use the
outrigger, a join between the dimension table and the
outrigger is performed.
 A dimension can be used both as an ordinary dimension
and as an outrigger at the same time (e.g. Time
dimension, linked both to the fact table and the
Customer dimension through date-of-birth).
Junk Dimensions
 Some dimensions have only two or three possible values
 e.g. an apartment has a view (yes/no), has a walk-in closet (yes/no), is
renovated (yes/no)
 Instead of keeping three different dimensions, create
one, e.g. called amenities, with all possible combinations
 Such a dimension, holding combinations of unrelated values,
is called a junk dimension.
 While querying can become slightly harder with a
junk dimension, this arrangement reduces the
dimensionality of the cube.
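A small sketch of the amenities junk dimension described above. The flag names and the helper amenities_key are hypothetical; the point is that a fact row stores a single key instead of three.

```python
from itertools import product

flags = ("has_view", "walk_in_closet", "renovated")

# One row per yes/no combination: 2 * 2 * 2 = 8 rows.
amenities_dim = {
    combo: key
    for key, combo in enumerate(product(("yes", "no"), repeat=len(flags)), start=1)
}

def amenities_key(has_view, walk_in_closet, renovated):
    """Return the single foreign key that a fact row stores for these three flags."""
    return amenities_dim[(has_view, walk_in_closet, renovated)]

print(len(amenities_dim))                 # 8
print(amenities_key("yes", "no", "yes"))  # 3
```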

Time Dimension
 There are two kinds of time dimensions:
 a date dimension represents the date (“3/18/2009”)
 a time-of-day dimension represents the clock time (“11:47 am”)
 It is not advisable to combine the two, both for space efficiency
and analysis flexibility
 It is good to make time an explicit dimension
 Sometimes, for space efficiency, ids are assigned to dates
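One possible way to generate an explicit date dimension with integer ids is sketched below. The yyyymmdd encoding of date_id and the chosen columns are assumptions, not something the slides prescribe.

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Return one row per calendar day from start to end (inclusive)."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_id": day.year * 10000 + day.month * 100 + day.day,  # e.g. 20090318
            "date": day.isoformat(),
            "day": day.day,
            "month": day.month,
            "year": day.year,
            "weekday": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2009, 3, 1), date(2009, 3, 31))
print(len(date_dim))  # 31
print(date_dim[17])   # the row for 2009-03-18
```

The fact table then stores only date_id, and attributes such as month or weekday are looked up in the dimension.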

Data Quality Dimensions
 A dimension that rates the reliability of each fact in the
fact table.
 values can be a real value from 0 to 1
 values can be a label such as “Normal value”, “Out-of-bounds
value”, “Unlikely value”, “Verified value”, “Unverified value”,
“Uncertain value”, etc.
 Analysts can easily restrict their analysis to facts of
specific reliability levels

Simple Hierarchies
 The hierarchies considered so far were simple
hierarchies.
 balanced: in any given instance of the hierarchy, all leaves
belong to the schema’s lowest level
 covering: in any given instance, each path starts at the root
and then goes to the level immediately below in the schema
and then to the next level immediately below and so on,
without skipping any level
 strict: in any given instance, no dimension value has more than
one parent
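The strictness property can be checked mechanically. The sketch below assumes a hierarchy instance is given as a list of (child, parent) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def non_strict_members(edges):
    """Return the members that have more than one parent, with their parents."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}

edges = [("Ann", "Sales"), ("Bob", "Sales"), ("Bob", "Marketing")]
print(non_strict_members(edges))  # {'Bob': ['Marketing', 'Sales']}
```

An empty result means the instance is strict.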

Unbalanced Hierarchies
 Special kinds of hierarchies:
 Unbalanced hierarchies
 child-parent
 Difficult to implement

Non-covering Hierarchies
 Non-covering hierarchies (ragged hierarchies)
 A non-covering hierarchy allows instances to skip levels
between the leaves and the root.
 A placeholder can be put in the immediate upper level

Non-Strict Hierarchies
 Non-strict hierarchies: a child can have more than
one parent

Modeling – Hierarchies
 Non-strict hierarchies (cont.)
 Frequent cases
 Many-to-many relationship
 Use of a “bridge” table

Modeling – Hierarchies
 Non-strict hierarchies (cont.)
 Many-to-many relationships cause problems with
respect to aggregations. Assume that a book B is
written by two authors and we want to compute the
total sales for each author… Problem?
 Bridge tables usually have properties
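The aggregation problem raised above can be seen on a tiny example: joining sales to authors through the bridge counts the sales of book B once per author, so the per-author totals no longer add up to the overall total. One common remedy, assumed here rather than taken from the slides, is to store a weight as a property of each bridge row. All table names, column names and figures below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book_sales (book TEXT, amount REAL);
CREATE TABLE book_author (book TEXT, author TEXT, weight REAL);  -- the bridge table

INSERT INTO book_sales VALUES ('B', 100.0), ('C', 50.0);
INSERT INTO book_author VALUES ('B', 'author1', 0.5), ('B', 'author2', 0.5),
                               ('C', 'author1', 1.0);
""")

# Naive join: the 100.0 of book B is credited in full to both authors.
naive = conn.execute("""
    SELECT a.author, SUM(s.amount)
    FROM book_sales s JOIN book_author a ON s.book = a.book
    GROUP BY a.author ORDER BY a.author
""").fetchall()

# Weighted join: each author is credited with a share of the sale.
weighted = conn.execute("""
    SELECT a.author, SUM(s.amount * a.weight)
    FROM book_sales s JOIN book_author a ON s.book = a.book
    GROUP BY a.author ORDER BY a.author
""").fetchall()

print(naive)     # totals add up to 250.0, more than the 150.0 actually sold
print(weighted)  # totals add up to 150.0
```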

SQL Analytic Functions and MDX
Analytic Functions – Motivation (1)
 Relation example:
 Sales (cust, prod, day, month, year, state, sale)
 The Sales table stores the purchases of a product (prod) by a
customer (cust) on a date (day, month, year), in a state, for a sale amount (sale)
 Query examples:
 Q1: for each customer (cust), show the total of their purchases in
‘NY’, ‘NJ’ and ‘CA’ in 2016 (as three additional columns)
 Q2: for each product and for sales of 2016, show each month’s
total sales as percentage of the year-long total sales
 Q3: for each product and month of 2016, show the month’s
cumulative total sales

Analytic Functions – Motivation (2)
 All examples require a different kind of grouping
 Q1: GROUP BY cust, and then define subsets for each
group (sales in NY, sales in NJ, sales in CA) and
compute aggregates for these subsets
 Q2: Compare aggregates of GROUP BYs defined at
different hierarchy levels: (prod) and (prod, month)
 Q3: As the month moves, we define a set of rows that
include all sales before or on this month
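Q1 is not worked out in these slides. One way to express it in plain SQL is conditional aggregation, sketched below with Python's sqlite3 module; this is a standard SQL technique, not necessarily the formulation the lecture has in mind. The sample rows and the column aliases are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (cust TEXT, prod TEXT, day INT, month INT, "
             "year INT, state TEXT, sale REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Ann", "p1", 1, 1, 2016, "NY", 10.0),
    ("Ann", "p2", 2, 1, 2016, "NJ", 20.0),
    ("Ann", "p1", 3, 2, 2016, "NY", 5.0),
    ("Bob", "p1", 1, 1, 2016, "CA", 7.0),
    ("Bob", "p1", 1, 1, 2015, "NY", 100.0),  # not in 2016, so filtered out
])

# One group per customer; each CASE picks out the subset of the group for one state.
q1 = conn.execute("""
    SELECT cust,
           SUM(CASE WHEN state = 'NY' THEN sale ELSE 0.0 END) AS ny_total,
           SUM(CASE WHEN state = 'NJ' THEN sale ELSE 0.0 END) AS nj_total,
           SUM(CASE WHEN state = 'CA' THEN sale ELSE 0.0 END) AS ca_total
    FROM Sales
    WHERE year = 2016
    GROUP BY cust
    ORDER BY cust
""").fetchall()
print(q1)
```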

Analytic Functions – History
 SQL Extensions for Analytics (EMF SQL):
 Chatziantoniou et al., VLDB96, EDBT98, SSDBM99, KDD99
 Applications: Telcos (AT&T), Medical Informatics (Columbia
Presbyterian Hospital, Philips North America), Direct marketing
data (D&B), Bioinformatics, Finance
 Adopted by and basis for Oracle’s Analytic Functions
 Implemented in Oracle 8i - 1999
 Proposed to SQL Standardization Committee
 Analytic Functions part of standard SQL
 Most SQL systems implement Analytic Functions

SQL Analytic Functions – Main Idea
SELECT A,B,C,…
FROM T1,T2,…
WHERE condition

• We want to extend table T with a column D, where the values of D are determined as follows:
• for each row i = (ai, bi, ci):
• a subset S of T is defined (a window)
• an aggregated value is computed over S and becomes the value of D in row i
• This window S can be defined as part of the entire table T, or T can be partitioned based on ai and/or bi and/or ci and S can be defined within a partition

T:
A   B   C   D
a1  b1  c1
a2  b2  c2  d2=avg(S.C)
a3  b3  c3
a4  b4  c4
…   …   …   …
S: a subset of the rows of T (the window used for row 2)

SQL Analytic Functions - Steps
 Query processing using analytic functions takes place in
three stages
 First, WHERE, GROUP BY, and HAVING clauses are performed
 Second, the result set is made available to the analytic
functions, and all their calculations take place
 Third, if the query has an ORDER BY clause at its end, the
ORDER BY is processed to allow for precise output ordering

SQL Analytic Functions – Main Constructs
 Partition: The analytic functions allow users to divide
query result sets into ordered groups of rows called
partitions
 Window: For each partition, a window of data is
defined. The window determines the range of rows
used to perform the calculations for the “current row”
 Current row: Each calculation performed with an
analytic function is based on a current row within a
window

SQL Analytic Functions – Big Picture

SQL Analytic Functions – Example #1
Sales (cust, prod, day, month, year, state, sale)
 Q2: for each product and for sales of 2016, show each month’s
total sales as percentage of the year-long total sales
SELECT DISTINCT prod, month,
       sum(sale) OVER (PARTITION BY prod, month) /
       sum(sale) OVER (PARTITION BY prod)
FROM Sales
WHERE year=2016
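To try Q2 out, the sketch below runs the query with Python's sqlite3 module (window functions need SQLite 3.25 or later). It sums the sale column of the Sales schema, adds DISTINCT so that each (prod, month) pair appears once, and multiplies by 100.0 to get a percentage. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (cust TEXT, prod TEXT, day INT, month INT, "
             "year INT, state TEXT, sale REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Ann", "p1", 1, 1, 2016, "NY", 10.0),
    ("Bob", "p1", 9, 1, 2016, "NJ", 30.0),
    ("Ann", "p1", 5, 2, 2016, "NY", 60.0),
    ("Bob", "p2", 2, 1, 2016, "CA", 25.0),
    ("Ann", "p2", 7, 3, 2016, "CA", 75.0),
    ("Ann", "p1", 1, 1, 2015, "NY", 999.0),  # not in 2016, so filtered out
])

q2 = conn.execute("""
    SELECT DISTINCT prod, month,
           100.0 * SUM(sale) OVER (PARTITION BY prod, month)
                 / SUM(sale) OVER (PARTITION BY prod) AS pct_of_year
    FROM Sales
    WHERE year = 2016
    ORDER BY prod, month
""").fetchall()
print(q2)
```

The two windows correspond to the two grouping levels, (prod, month) and (prod), compared in a single pass over the filtered rows.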

SQL Analytic Functions – Example #2
Sales (cust, prod, day, month, year, state, sale)
 Q3: for each product and month of 2016, show the month’s
cumulative total sales
SELECT DISTINCT prod, month,
       sum(sale) OVER (PARTITION BY prod
                       ORDER BY month)
FROM Sales
WHERE year=2016
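A runnable sketch of Q3 with Python's sqlite3 module (SQLite 3.25 or later), again summing the sale column and using made-up rows. With ORDER BY month and no explicit frame, the default window runs from the start of the partition up to the current row and its peers, so all rows of the same month get the same running total and DISTINCT collapses them to one row per month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (cust TEXT, prod TEXT, day INT, month INT, "
             "year INT, state TEXT, sale REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Ann", "p1", 1, 1, 2016, "NY", 10.0),
    ("Bob", "p1", 9, 1, 2016, "NJ", 30.0),
    ("Ann", "p1", 5, 2, 2016, "NY", 60.0),
    ("Bob", "p2", 2, 1, 2016, "CA", 25.0),
    ("Ann", "p2", 7, 3, 2016, "CA", 75.0),
])

q3 = conn.execute("""
    SELECT DISTINCT prod, month,
           SUM(sale) OVER (PARTITION BY prod ORDER BY month) AS cumulative_sales
    FROM Sales
    WHERE year = 2016
    ORDER BY prod, month
""").fetchall()
print(q3)
```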

Microsoft – MDX Language
 With relational dbs, SQL assembles sets of data (tables)
 MDX assumes an n-dimensional space
 MDX language assembles tuples of data points in this n-
dimensional cube
 measures are also in a dimension
 MDX is not exclusive to SQL Server Analysis Services
 it is part of a vendor-neutral specification (XML for Analysis)
 A good and step-by-step introduction to MDX:
 SQL Server 2008 MDX, Step by Step

Microsoft – MDX, An Example

Data Mining
Data Mining – What is it?
 Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
 Not OLAP, not SQL, in general not Data Analysis! Or
(for statisticians) Data Analysis!

Classification
 Classification Problem:
 Given a set of known classes
 Given a training set whose members belong to these classes
 Given a classification method (e.g. neural networks)
 Classify a new data item into one of the classes

Classification Examples
 Banking - Identify individuals with credit risks
 Given a new loan application, what is the probability to default?
 Telecom - Retention
 Given a new customer, what is the retention probability?
 Finance – Predicting stock behavior
 Given a set of parameters, will a stock go up or down?
 Speech Recognition/Optical Character Recognition
 Patient Outcome Analysis

Classification – Building a Model
 Classes are predefined (e.g. 0,1, 2)
 Training dataset: a set of data items – each
data item has some description (a set of
attributes, e.g. age, gender, income) + the
class label (to which class it belongs).
 Using the training dataset a model is built that
has as input the description of a data item and
as output a class label (classifier)
Classification Techniques
 Regression
 Bayesian classification
 K-Nearest Neighbors (KNN)
 Decision trees
 Neural networks

Bayesian Classification
Training data:
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Counts and relative frequencies per class (Play = Yes / No):
Attribute    Value     Yes  No   Yes    No
Outlook      Sunny     2    3    2/9    3/5
Outlook      Overcast  4    0    4/9    0/5
Outlook      Rainy     3    2    3/9    2/5
Temperature  Hot       2    2    2/9    2/5
Temperature  Mild      4    2    4/9    2/5
Temperature  Cool      3    1    3/9    1/5
Humidity     High      3    4    3/9    4/5
Humidity     Normal    6    1    6/9    1/5
Windy        False     6    2    6/9    2/5
Windy        True      3    3    3/9    3/5
Play                   9    5    9/14   5/14

New data item to classify:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
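The classification of the new item (Sunny, Cool, High, True) can be worked through with a few lines of Python. This is a sketch of the naive Bayes calculation, which multiplies the class prior by the per-attribute frequencies and assumes the attributes are independent given the class. It uses no smoothing, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# The 14 training rows: (outlook, temperature, humidity, windy, play).
data = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def naive_bayes_scores(rows, item):
    """Return, per class, prior * product of P(attribute value | class)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = Fraction(n_label, len(rows))  # prior, e.g. 9/14 for Yes
        for i, value in enumerate(item):
            n_match = sum(1 for r in rows if r[-1] == label and r[i] == value)
            score *= Fraction(n_match, n_label)  # e.g. 2/9 for Sunny given Yes
        scores[label] = score
    return scores

scores = naive_bayes_scores(data, ("Sunny", "Cool", "High", "True"))
total = sum(scores.values())
probabilities = {label: float(score / total) for label, score in scores.items()}
print(scores)         # Yes: 1/189, No: 18/875
print(probabilities)  # Yes is about 0.205, No is about 0.795
```

Since the score for No is higher, the new item is classified as Play = No.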

KNN Classification
 Assume we can measure
the distance between two
data items, i.e. there exists
a distance function f(d1,d2)
 Find k nearest items to the
new data item to be
classified
 The new data item is assigned to
the class that is most common
among its k nearest items
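The steps above can be sketched as follows. Euclidean distance stands in for the distance function f(d1,d2); the function names and the sample points are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """One possible distance function between two numeric data items."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, new_item, k, distance=euclidean):
    """training is a list of (item, class_label) pairs."""
    # Find the k items nearest to the new item.
    neighbours = sorted(training, key=lambda pair: distance(pair[0], new_item))[:k]
    # The class with the most items among those neighbours wins.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
print(knn_classify(training, (2, 2), k=3))  # A
print(knn_classify(training, (6, 5), k=3))  # B
```

An odd k avoids tied votes when there are two classes.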

Decision Trees
 Each internal node represents
an attribute A of the data
items’ description
 Each branch is labeled with a
partitioning condition on A
 Each leaf (where a path from
the root ends) corresponds to
a class
 Each step partitions the
training dataset

Neural Networks
 Typical NN structure:
 one input node per attribute
 one output node per class
 For each data item in training
dataset, propagate values
through NN (Σ fi wik). Adjust
weights on edges to improve
accuracy of classification
 error ( w ) = (expected output -
actual output)
 goal: minimize error
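To make the weight adjustment concrete, here is a deliberately simplified sketch: a single neuron (a perceptron) rather than a full network, with a step activation and a learning rate of 1, both of which are assumptions. The error is expected output minus actual output, as above, and each weight is nudged in proportion to that error and to its input. The training data (logical AND) is made up.

```python
def step(z):
    """Step activation: the neuron fires (1) only for a positive weighted sum."""
    return 1 if z > 0 else 0

def predict(features, weights, bias):
    # Weighted sum of the inputs (the sum of f_i * w_i), then the activation.
    return step(sum(f * w for f, w in zip(features, weights)) + bias)

def train_perceptron(samples, epochs=20, rate=1):
    """samples is a list of (features, expected_output) pairs."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for features, expected in samples:
            error = expected - predict(features, weights, bias)
            # Adjust each weight in the direction that reduces the error.
            weights = [w + rate * error * f for w, f in zip(weights, features)]
            bias += rate * error
    return weights, bias

and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(f, weights, bias) for f, _ in and_samples])  # [0, 0, 0, 1]
```

Multi-layer networks adjust their weights with the same goal of minimizing the error, but propagate the error back through the layers.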

Clustering
 Given a database D of data items, a similarity function
sim(di, dj), and an integer value k, the clustering
problem is to define a mapping f: D → {1,..,k}, i.e. each
data item di is assigned to one cluster Kj, 1<=j<=k.
 In general, the number of clusters is not known a priori.
 Goal: find an assignment of the data items to k clusters,
in order to minimize distance between cluster’s
members, while maximizing distance between clusters

Clustering – Example (1)
 Clustering houses based on geographic distance

Clustering – Example (2)
 Clustering houses based on size
Clustering – Cases
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs.
 Land use: Identification of areas of similar land use in
an earth observation database.
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost.
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location.
 Earthquake studies: Observed earthquake epicenters
should be clustered along continental faults.

Clustering – K-Means Algorithm
 Initial set of clusters randomly chosen – by
randomly choosing the center of the clusters.
 Iteratively, items are moved among sets of
clusters until the desired set is reached.
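A compact sketch of the algorithm. It assumes the initial centers are picked at random among the data points, squared Euclidean distance is used, and "the desired set is reached" means that the centers no longer move. The function names and sample points are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centers
    clusters = []
    for _ in range(max_iterations):
        # Assign each point to the closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (an empty cluster keeps its center).
        new_centers = [mean_point(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:  # nothing moved: stop
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

On data like this, with two well-separated groups, the result does not depend on the random start. In general k-means can end in different clusterings for different initial centers.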

K-means example, initialization
 Pick 3 initial cluster centers k1, k2, k3 (randomly)
K-means example, 1st repetition
 Assign each point to the closest cluster center
 Move each cluster center to the mean of each cluster
K-means example, 2nd repetition
 Reassign points closest to a different new cluster center
 Q: Which points are reassigned? A: three points
 Re-compute cluster means
 Move cluster centers to cluster means
Association Rules
 Example: baskets in a supermarket containing products
 {Beer, Bread} => Diapers | support=5%, confidence=30%
 support 5% → 5% of baskets contain Beer, Bread and Diapers
 confidence 30% → 30% of baskets that contain Beer and Bread
also contain Diapers
Association Rules Problem
 Given
 a set of items I={I1,I2,…,Im} (e.g. products, books, views), and
 a set of baskets (or transactions) T={t1,t2, …, tn}
where each transaction contains items, ti = {Ii1,Ii2, …, Iik}
the goal is to identify all association rules X => Y with a
minimum support and confidence
 The A-priori algorithm is mainly used to compute these
association rules
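Support and confidence, as defined in the supermarket example, can be computed directly from a list of baskets. The sketch below only evaluates a given rule; it does not implement the A-priori search for all rules. The baskets and function names are made up.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Among baskets containing lhs, the fraction that also contain rhs."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

baskets = [
    {"Beer", "Bread", "Diapers"},
    {"Beer", "Bread"},
    {"Beer", "Diapers"},
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
]

# Rule {Beer, Bread} => Diapers
print(support(baskets, {"Beer", "Bread", "Diapers"}))       # 0.4
print(confidence(baskets, {"Beer", "Bread"}, {"Diapers"}))  # about 0.667
```

Here 2 of the 5 baskets contain all three items (support 40%), and 2 of the 3 baskets with Beer and Bread also contain Diapers (confidence about 67%).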

Association Rules – Cases
 Market-basket Analysis
 Placement
 Advertising
 Sales
 Coupons
