Lecture08 1


Business Intelligence III

Damianos Chatziantoniou (damianos@aueb.gr)

Department of Management Science and Technology
Athens University of Economics and Business
Topics
 Common Issues in Star Schemas
 SQL Analytic Functions and MDX
 Introduction to Data Mining

11/29/2020 Data Management, Business Intelligence and Visualization 2


Common Issues in Star Schemas
Slowly Changing Dimensions (1)
 So far:
 New rows in dimension tables can be inserted
 Existing rows do not change
 This is not a realistic assumption: dimensions do change
 “Slowly changing dimensions” (SCD) phenomenon
 Dimension information changes (e.g. an attribute value)
 A dimension’s schema may also change, which is a separate problem



Slowly Changing Dimensions (2)
 Attribute values in dimensions vary
over time
 A store changes Size
 A product changes Description
 A customer changes Address
 Handling changes
 Dimensions are not updated
 DW is not up-to-date
 Dimensions updated in a
straightforward way (i.e. replace)
 incorrect information in historical data



Slowly Changing Dimensions (3)
Source: Introduction to Data Warehousing and
Business Intelligence, by Christian Jensen

One day… the store expands to 450 sq. meters



Slowly Changing Dimensions (4)
 Solution 1: Do nothing or overwrite



Slowly Changing Dimensions (5)
 Solution 1: do nothing or overwrite
 Consequences:
 Old facts point to rows in the dimension tables with incorrect information!
 New facts point to rows with correct information
 Pros:
 Easy to implement
 Useful if the updated attribute is not significant, or the old value should be
updated for error correction
 Cons:
 Old facts may point to “incorrect” rows in dimensions



Slowly Changing Dimensions (5)
 Solution 2: Create a new record with start/end
timestamp columns (versioning)
 The key that links the dimension and the fact table
identifies a version of a row, not just a “row”
 Surrogate keys make this easier to implement (i.e.
the primary key of a dimension is a randomly
generated or an increasing integer and not the
“actual” id)



Slowly Changing Dimensions (6)
 From/To columns provide time range

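The versioning idea of Solution 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `StoreVersion` and `StoreDimension`, the method `upsert`, the store id "S1" and the initial size of 250 sq. meters are all hypothetical; only the expansion to 450 sq. meters comes from the slides. Each change closes the current version (its To date) and opens a new one with a fresh surrogate key.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StoreVersion:
    store_key: int                   # surrogate key: identifies ONE version of the store
    store_id: str                    # the "actual" id, shared by all versions
    size_sqm: int
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class StoreDimension:
    """Hypothetical in-memory dimension table with From/To versioning."""

    def __init__(self):
        self.rows = []
        self._next_key = 1           # increasing integer used as surrogate key

    def current(self, store_id):
        for row in self.rows:
            if row.store_id == store_id and row.valid_to is None:
                return row
        return None

    def upsert(self, store_id, size_sqm, change_date):
        old = self.current(store_id)
        if old is not None:
            if old.size_sqm == size_sqm:
                return old               # nothing changed, keep the current version
            old.valid_to = change_date   # close the old version instead of overwriting it
        new = StoreVersion(self._next_key, store_id, size_sqm, change_date)
        self._next_key += 1
        self.rows.append(new)
        return new


dim = StoreDimension()
v1 = dim.upsert("S1", 250, date(2019, 1, 1))      # 250 is a made-up initial size
old_fact = {"store_key": v1.store_key, "sale": 100}
v2 = dim.upsert("S1", 450, date(2020, 6, 1))      # the store expands
new_fact = {"store_key": v2.store_key, "sale": 180}

by_key = {row.store_key: row for row in dim.rows}
print(by_key[old_fact["store_key"]].size_sqm)     # size at the time of the old sale
print(by_key[new_fact["store_key"]].size_sqm)     # size at the time of the new sale
```

Because facts store the surrogate key, old facts keep pointing to the old version and historical reports stay correct.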


Common Kinds of Dimensions
 Different kinds of dimensions:
 Minidimensions
 Outrigger dimensions
 Junk dimensions
 Time dimensions
 Data Quality dimensions



Minidimensions
 Some dimensions may have many attributes that
change too often, for example a customer’s age, salary
range, etc. In such cases, one can split the
dimension into two, moving the frequently changing
attributes together into a separate minidimension.
 The minidimension can hold either all possible combinations
of the values of the often-changing attributes (the Cartesian
product of the attribute values) or only the
combinations that actually occur in real life.



Outrigger Dimensions
 A dimension that is referenced from another dimension
is called an outrigger.
 Outriggers are used in relational OLAP environments
where the outrigger’s dimension table is referenced by a
foreign key in another dimension table. To use the
outrigger, a join between the dimension table and the
outrigger is performed.
 A dimension can be used both as an ordinary dimension
and as an outrigger at the same time (e.g. Time
dimension, linked both to the fact table and the
Customer dimension through date-of-birth).
Junk Dimensions
 Some dimensions have only two/three possible values
 an apartment has views (yes/no), walk-in closet (yes/no), is
renovated (yes/no)
 Instead of keeping three different dimensions, create
one, e.g. called amenities, with all possible combinations
 Such a dimension with combinations of unrelated values,
is called a junk dimension.
 While querying can become slightly harder with a
junk dimension, this arrangement reduces the
dimensionality of the cube.
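A minimal sketch of how the amenities junk dimension could be populated, assuming the three yes/no flags of the apartment example. The names `amenities_key`, `flags` and `lookup` are illustrative, not part of the lecture. The dimension simply holds the Cartesian product of the flag values, and each fact stores a single key instead of three.

```python
from itertools import product

# The three yes/no flags from the apartment example.
flags = ["view", "walk_in_closet", "renovated"]

# One row per combination of flag values (2 * 2 * 2 = 8 rows).
amenities = [
    {"amenities_key": key, **dict(zip(flags, combo))}
    for key, combo in enumerate(product(["yes", "no"], repeat=len(flags)), start=1)
]

# A fact row references the combination through one foreign key.
lookup = {tuple(row[f] for f in flags): row["amenities_key"] for row in amenities}
fact_key = lookup[("yes", "no", "yes")]   # view, no walk-in closet, renovated

print(len(amenities), fact_key)
```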



Time Dimension
 There are two kinds of time dimensions:
 a date dimension represents the date (“3/18/2009”)
 a time-of-day dimension represents the clock time (“11:47 am”)
 It is not advisable to combine the two, for reasons of both
space efficiency and analysis flexibility
 It is good to make time an explicit dimension
 Sometimes, for space efficiency, ids are assigned to dates



Data Quality Dimensions
 A dimension that rates the reliability of each fact in the
fact table.
 values can be a real value from 0 to 1
 values can be a label such as “Normal value”, “Out-of-bounds
value”, “Unlikely value”, “Verified value”, “Unverified value”,
“Uncertain value”, etc.
 Analysts can easily restrict their analysis to facts of
specific reliability levels



Simple Hierarchies
 The hierarchies considered so far were simple
hierarchies.
 balanced: in any given instance of the hierarchy, all leaves
belong to the schema’s lowest level
 covering: in any given instance, each path starts at the root
and then goes to the level immediately below in the schema
and then to the next level immediately below and so on,
without skipping any level
 strict: in any given instance, no dimension value has more than
one parent



Unbalanced Hierarchies
 Special kinds of hierarchies:
 Unbalanced hierarchies
 child-parent
 Implementing is difficult



Non-covering Hierarchies
 Special kinds of hierarchies:
 Non-covering hierarchies (ragged hierarchies)
 A non-covering hierarchy allows instances to skip levels
between the leaves and the root.
 A placeholder can be put in the immediate upper level



Non-Strict Hierarchies
 Special kinds of hierarchies:
 Non-strict hierarchies: a child can have more than
one parent



Modeling – Hierarchies
 Non-strict hierarchies (cont.)
 Frequent cases
 Many-to-many relationship
 Use of a “bridge” table



Modeling – Hierarchies
 Non-strict hierarchies (cont.)
 Many-to-many relationships cause problems with
respect to aggregations. Assume that a book B is
written by two authors and we want to compute the
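total sales for each author… Problem? The sales of B are
counted once for each author, so the per-author totals add up
to more than the actual total.
 Bridge tables usually have properties of their own (e.g. a
weight that splits a book’s sales among its authors)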
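The aggregation problem with many-to-many relationships can be illustrated with a small sketch. The data, the author names and the `weight` property on the bridge rows are assumptions made for illustration; a weighting factor is one common choice of bridge-table property, not necessarily the one intended in the lecture.

```python
# Sales per book (the facts) and a book-author bridge table.
book_sales = {"B": 100.0, "C": 40.0}
bridge = [
    {"book": "B", "author": "Ann", "weight": 0.5},   # B has two authors
    {"book": "B", "author": "Bob", "weight": 0.5},
    {"book": "C", "author": "Ann", "weight": 1.0},
]


def sales_per_author(weighted):
    """Total sales per author, optionally allocating sales by bridge weight."""
    totals = {}
    for row in bridge:
        share = row["weight"] if weighted else 1.0
        amount = share * book_sales[row["book"]]
        totals[row["author"]] = totals.get(row["author"], 0.0) + amount
    return totals


naive = sales_per_author(weighted=False)      # B's sales counted twice
allocated = sales_per_author(weighted=True)   # B's sales split between its authors

print(naive, sum(naive.values()))
print(allocated, sum(allocated.values()))
```

Without weights the per-author totals sum to 240 although only 140 was sold; with weights the grand total is preserved.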


SQL Analytic Functions and MDX
Analytic Functions – Motivation (1)
 Relation example:
 Sales (cust, prod, day, month, year, state, sale)
 Sales table stores the purchases of a product (prod) by a
customer (cust) on a date and state for a sale amount (sale)
 Query examples:
 Q1: for each customer (cust), show the total of his purchases in
‘NY’, ‘NJ’ and ‘CA’ in 2016 (as three additional columns)
 Q2: for each product and for sales of 2016, show each month’s
total sales as percentage of the year-long total sales
 Q3: for each product and month of 2016, show the month’s
cumulative total sales



Analytic Functions – Motivation (2)
 All examples require a different kind of grouping
 Q1: GROUP BY cust, and then define subsets for each
group (sales in NY, sales in NJ, sales in CA) and
compute aggregates for these subsets
 Q2: Compare aggregates of GROUP BYs defined at
different hierarchy levels: (prod) and (prod, month)
 Q3: As the month moves, we define a set of rows that
include all sales before or on this month



Analytic Functions – History
 SQL Extensions for Analytics (EMF SQL):
 Chatziantoniou et al., VLDB96, EDBT98, SSDBM99, KDD99
 Applications: Telcos (AT&T), Medical Informatics (Columbia
Presbyterian Hospital, Philips North America), Direct marketing
data (D&B), Bioinformatics, Finance
 Adopted by and basis for Oracle’s Analytic Functions
 Implemented in Oracle 8i - 1999
 Proposed to SQL Standardization Committee
 Analytic Functions part of standard SQL
 Most SQL systems implement Analytic Functions



SQL Analytic Functions – Main Idea
SELECT A,B,C,…
FROM T1,T2,…
WHERE condition

• we want to extend table T with a column D,
where the values of D are determined as:
• for each row i = (ai, bi, ci):
• a subset S of T is defined (a window)
• an aggregated value is computed over S and
becomes the value of D in row i
• This window S can be defined as
part of the entire table T, or T can
be partitioned based on ai and/or
bi and/or ci and S can be defined
within a partition

T:
A   B   C   D
a1  b1  c1
a2  b2  c2  d2=avg(S.C)
a3  b3  c3
a4  b4  c4
…   …   …   …
(S is a subset of the rows of T)



SQL Analytic Functions - Steps
 Query processing using analytic functions takes place in
three stages
 First, WHERE, GROUP BY, and HAVING clauses are performed
 Second, the result set is made available to the analytic
functions, and all their calculations take place
 Third, if the query has an ORDER BY clause at its end, the
ORDER BY is processed to allow for precise output ordering



SQL Analytic Functions – Main Constructs
 Partition: The analytic functions allow users to divide
query result sets into ordered groups of rows called
partitions
 Window: For each partition, a window of data is
defined. The window determines the range of rows
used to perform the calculations for the “current row”
 Current row: Each calculation performed with an
analytic function is based on a current row within a
window



SQL Analytic Functions – Big Picture



SQL Analytic Functions – Example #1
Sales (cust, prod, day, month, year, state, sale)

 Q2: for each product and for sales of 2016, show each month’s
total sales as percentage of the year-long total sales

-- DISTINCT keeps one row per (prod, month)
SELECT DISTINCT prod, month,
       -- month total divided by year-long total of the product
       sum(sale) OVER (PARTITION BY prod, month) /
       sum(sale) OVER (PARTITION BY prod)
FROM Sales
WHERE year=2016
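To try Q2 on a few rows, the query can be run with Python's built-in sqlite3 module. This is a sketch under assumptions: the sample rows are made up, the bundled SQLite must be version 3.25 or later (the first release with window functions), the measure column is taken to be `sale` as in the Sales schema, and the ratio is multiplied by 100.0 both to show a percentage and to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (cust, prod, day, month, year, state, sale)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [   # made-up sample data
        ("c1", "p1", 5, 1, 2016, "NY", 10),
        ("c2", "p1", 9, 1, 2016, "NJ", 30),
        ("c1", "p1", 2, 2, 2016, "NY", 60),
        ("c3", "p2", 7, 1, 2016, "CA", 50),
        ("c1", "p1", 3, 1, 2015, "NY", 99),   # filtered out by WHERE year = 2016
    ],
)

q2 = """
SELECT DISTINCT prod, month,
       100.0 * sum(sale) OVER (PARTITION BY prod, month)
             / sum(sale) OVER (PARTITION BY prod) AS pct_of_year
FROM Sales
WHERE year = 2016
ORDER BY prod, month
"""
rows = conn.execute(q2).fetchall()
for row in rows:
    print(row)
```

The two windows correspond to the two grouping levels, (prod, month) and (prod), evaluated side by side on every row.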



SQL Analytic Functions – Example #2
Sales (cust, prod, day, month, year, state, sale)

 Q3: for each product and month of 2016, show the month’s
cumulative total sales

-- DISTINCT keeps one row per (prod, month)
SELECT DISTINCT prod, month,
       -- running total of the product up to and including this month
       sum(sale) OVER (PARTITION BY prod
                       ORDER BY month)
FROM Sales
WHERE year=2016
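The cumulative query Q3 can be checked the same way. Again a sketch with made-up rows, assuming SQLite 3.25 or later and the `sale` column of the Sales schema. With ORDER BY inside OVER and no explicit frame, the default window runs from the start of the partition to the current row and includes all rows of the same month.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (cust, prod, day, month, year, state, sale)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [   # made-up sample data
        ("c1", "p1", 5, 1, 2016, "NY", 10),
        ("c2", "p1", 9, 1, 2016, "NJ", 30),
        ("c1", "p1", 2, 2, 2016, "NY", 60),
        ("c2", "p1", 4, 3, 2016, "NJ", 5),
        ("c3", "p2", 7, 1, 2016, "CA", 50),
        ("c3", "p2", 8, 2, 2016, "CA", 20),
    ],
)

q3 = """
SELECT DISTINCT prod, month,
       sum(sale) OVER (PARTITION BY prod ORDER BY month) AS running_total
FROM Sales
WHERE year = 2016
ORDER BY prod, month
"""
rows = conn.execute(q3).fetchall()
for row in rows:
    print(row)
```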



Microsoft – MDX Language
 With relational dbs, SQL assembles sets of data (tables)
 MDX assumes an n-dimensional space
 MDX language assembles tuples of data points in this n-
dimensional cube
 measures are also in a dimension
 MDX is not exclusive to SQL Server Analysis Services
 it is part of a vendor-neutral specification (XML for Analysis)
 A good and step-by-step introduction to MDX:
 SQL Server 2008 MDX, Step by Step



Microsoft – MDX, An Example



Data Mining
Data Mining – What is it?
 Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns
from data in large databases
 Not OLAP, not SQL, in general not Data Analysis! Or
(for statisticians) Data Analysis!



Classification
 Classification Problem:
 Given a set of known classes
 Given a training set, whose members belong to these classes
 Given a classification method (e.g. neural networks)
 Classify a new data item into one of the classes



Classification Examples
 Banking - Identify individuals with credit risks
 Given a new loan application, what is the probability of default?
 Telecom - Retention
 Given a new customer, what is the retention probability?
 Finance – Predicting stock behavior
 Given a set of parameters, will a stock go up or down?
 Speech Recognition/Optical Character Recognition
 Patient Outcome Analysis



Classification – Building a Model
 Classes are predefined (e.g. 0,1, 2)
 Training dataset: a set of data items – each
data item has some description (a set of
attributes, e.g. age, gender, income) + the
class label (to which class it belongs).
 Using the training dataset a model is built that
has as input the description of a data item and
as output a class label (classifier)

Classification Techniques
 Regression
 Bayesian classification
 K-Nearest Neighbors (KNN)
 Decision trees
 Neural networks



Bayesian Classification
Training data:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Counts per attribute value (Play = Yes / Play = No):

Outlook    Yes  No   Temperature  Yes  No   Humidity  Yes  No   Windy  Yes  No   Play  Yes  No
Sunny      2    3    Hot          2    2    High      3    4    False  6    2          9    5
Overcast   4    0    Mild         4    2    Normal    6    1    True   3    3
Rainy      3    2    Cool         3    1

Relative frequencies (Play = Yes / Play = No):

Outlook    Yes  No   Temperature  Yes  No   Humidity  Yes  No   Windy  Yes  No   Play  Yes   No
Sunny      2/9  3/5  Hot          2/9  2/5  High      3/9  4/5  False  6/9  2/5        9/14  5/14
Overcast   4/9  0/5  Mild         4/9  2/5  Normal    6/9  1/5  True   3/9  3/5
Rainy      3/9  2/5  Cool         3/9  1/5

New data item to classify:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
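The classification of the new item (Sunny, Cool, High, True) can be worked out from the relative frequencies on this slide. The sketch below assumes the usual naive Bayes independence assumption, which the slide does not state explicitly: for each class, multiply the class prior by the per-attribute frequencies, then normalize. The variable names are illustrative.

```python
from fractions import Fraction as F

# Relative frequencies taken from the slide, per class, for the new item's values.
likelihood = {
    "yes": {"Sunny": F(2, 9), "Cool": F(3, 9), "High": F(3, 9), "True": F(3, 9)},
    "no":  {"Sunny": F(3, 5), "Cool": F(1, 5), "High": F(4, 5), "True": F(3, 5)},
}
prior = {"yes": F(9, 14), "no": F(5, 14)}
new_item = ["Sunny", "Cool", "High", "True"]

# Unnormalized score per class: prior times the product of the likelihoods.
score = {}
for label in prior:
    s = prior[label]
    for value in new_item:
        s *= likelihood[label][value]
    score[label] = s

# Normalize so that the two scores sum to 1.
total = sum(score.values())
posterior = {label: s / total for label, s in score.items()}
prediction = max(posterior, key=posterior.get)

print({label: float(p) for label, p in posterior.items()}, prediction)
```

Under these assumptions the item is classified as Play = No, with a probability of roughly 0.80.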



KNN Classification
 Assume we can measure
the distance between two
data items, i.e. there exists
a distance function f(d1,d2)
 Find k nearest items to the
new data item to be
classified
 The new data item is placed in
the class that holds the majority
of these k nearest items
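A minimal KNN sketch, assuming Euclidean distance as the distance function f(d1, d2) and a simple majority vote. The function name `knn_classify` and the sample points are made up for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(training, new_item, k):
    """Return the majority class among the k training items nearest to new_item."""
    nearest = sorted(training, key=lambda pair: dist(pair[0], new_item))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: (description, class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]
label = knn_classify(training, (2.0, 2.0), k=3)
print(label)
```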



Decision Trees
 Each internal node represents
an attribute A of the data
items’ description
 Each branch is labeled with a
partitioning condition on A
 Each leaf (where a path from
the root ends) corresponds to
a class
 Each step partitions the
training dataset
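How a finished decision tree classifies an item can be sketched as follows. The tree is hand-written for the weather data of the Bayesian Classification slide and is consistent with its 14 training rows; it is not produced here by a learning algorithm, and the nested-tuple representation is just one possible encoding.

```python
# Internal node: (attribute, {branch value: subtree}); leaf: class label.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rainy": ("Windy", {"False": "Yes", "True": "No"}),
})


def classify(node, item):
    """Follow the branches that match the item's attribute values down to a leaf."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[item[attribute]]
    return node


item = {"Outlook": "Sunny", "Temp": "Cool", "Humidity": "High", "Windy": "True"}
result = classify(tree, item)
print(result)
```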



Neural Networks
 Typical NN structure:
 one input node per attribute
 one output node per class
 For each data item in training
dataset, propagate values
through NN (Σ fi wik). Adjust
weights on edges to improve
accuracy of classification
 error(w) = expected output - actual output
 goal: minimize error

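The propagate-and-adjust loop can be sketched for the simplest case: one layer of weights between input and output nodes, with no hidden layer and no activation function. The update rule used here (add rate × error × input to each weight) and the learning rate of 0.1 are assumptions for illustration; the slide only states that weights are adjusted to reduce the error.

```python
def forward(features, weights):
    """Output k is the weighted sum over inputs: sum_i f_i * w_ik."""
    n_out = len(weights[0])
    return [sum(f * weights[i][k] for i, f in enumerate(features)) for k in range(n_out)]


def train_step(features, expected, weights, rate=0.1):
    """Propagate one data item, then nudge each weight against its error."""
    actual = forward(features, weights)
    errors = [e - a for e, a in zip(expected, actual)]   # expected - actual output
    for i, f in enumerate(features):
        for k, err in enumerate(errors):
            weights[i][k] += rate * err * f
    return errors


# Two input nodes (attributes), two output nodes (classes), weights start at zero.
weights = [[0.0, 0.0], [0.0, 0.0]]
features, expected = [1.0, 2.0], [1.0, 0.0]

first = train_step(features, expected, weights)
for _ in range(50):
    last = train_step(features, expected, weights)

print(first, last)
```

Repeating the step on the same item shrinks the error towards zero; a real network would cycle over the whole training dataset.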


Clustering
 Given a database D of data items, a similarity function
sim(di, dj), and an integer value k, the clustering
problem is to define a mapping f: D → {1,..,k}, i.e. each
data item di is assigned to one cluster Kj, 1<=j<=k.
 The number of clusters is, in general, not known a priori.
 Goal: find an assignment of the data items to k clusters
that minimizes the distance between the members of each
cluster while maximizing the distance between clusters



Clustering – Example (1)

 Clustering houses based on geographic distance


Clustering – Example (2)

 Clustering houses based on size

Clustering – Cases
 Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs.
 Land use: Identification of areas of similar land use in
an earth observation database.
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost.
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location.
 Earthquake studies: Observed earthquake epicenters
should be clustered along continental faults.



Clustering – K-Means Algorithm
 Initial set of clusters randomly chosen – by
randomly choosing the center of the clusters.
 Iteratively, items are moved among sets of
clusters until the desired set is reached.
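A compact K-means sketch following the two steps above, assuming Euclidean distance and stopping when the centers no longer move. The function name `k_means`, the `seed` and `max_iter` parameters and the sample points are illustrative choices.

```python
import random
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, clusters) after iterating assignment and re-centering."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial cluster centers
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # no center moved: stop
            break
        centers = new_centers
    return centers, clusters


# Made-up data: two well-separated groups of points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```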



K-means example, initialization
[Figure in each of the following slides: points plotted on X and Y axes with cluster centers k1, k2, k3]
 Pick 3 initial cluster centers (randomly)

K-means example, 1st repetition
 Assign each point to the closest cluster center

K-means example, 1st repetition
 Move each cluster center to the mean of each cluster

K-means example, 2nd repetition
 Reassign points closest to a different new cluster center
 Q: Which points are reassigned?

K-means example, 2nd repetition
 A: three points with animation

K-means example, 2nd repetition
 re-compute cluster means

K-means example, 2nd repetition
 move cluster centers to cluster means

Association Rules
 Example: baskets in a supermarket containing products
 {Beer, Bread} => Diapers | support=5%, confidence=30%
 support 5% → 5% of baskets contain Beer, Bread and Diapers
 confidence 30% → 30% of baskets that contain Beer and Bread
also contain Diapers

Association Rules Problem
 Given
 a set of items I={I1,I2,…,Im} (e.g. products, books, views), and
 a set of baskets (or transactions) T={t1,t2, …, tn},
where each transaction contains items, ti = {Ii1,Ii2, …, Iik},
the goal is to identify all association rules X → Y with a
minimum support and confidence
 The A-priori algorithm is commonly used to compute these
association rules
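Support and confidence can be computed directly from their definitions. The sketch below is a brute-force enumeration over all itemsets, not the A-priori algorithm, and is only practical for a handful of items; the baskets and the function names are made up for illustration.

```python
from itertools import combinations

# Made-up baskets (transactions).
baskets = [
    {"Beer", "Bread", "Diapers"},
    {"Beer", "Bread"},
    {"Bread", "Milk"},
    {"Beer", "Diapers"},
    {"Beer", "Bread", "Diapers", "Milk"},
]


def support(itemset):
    """Fraction of baskets that contain every item of itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(lhs, rhs):
    """Among baskets containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)


def rules(min_support, min_confidence):
    """Enumerate all rules X -> Y that meet both thresholds (brute force)."""
    items = sorted(set().union(*baskets))
    found = []
    for size in range(2, len(items) + 1):
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            if support(itemset) < min_support:
                continue
            for n in range(1, size):
                for left in combinations(sorted(itemset), n):
                    lhs = frozenset(left)
                    rhs = itemset - lhs
                    if confidence(lhs, rhs) >= min_confidence:
                        found.append((set(lhs), set(rhs)))
    return found


print(support({"Beer", "Bread", "Diapers"}))
print(confidence({"Beer", "Bread"}, {"Diapers"}))
print(len(rules(0.4, 0.6)))
```

A-priori avoids most of this enumeration by extending only itemsets whose subsets already meet the minimum support.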



Association Rules – Cases
 Market-basket Analysis
 Placement
 Advertising
 Sales
 Coupons
