Day 2
Data Cleaning and Integration
Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
J. Pei: Big Data Analytics -- Data Cleaning and Integration 3 J. Pei: Big Data Analytics -- Data Cleaning and Integration 4
2014-05-06
• Examples
– “0” in “blood pressure”
– “21” in “age”
[Figure: a random subset of the whole database]
Binning Regression
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-depth bins:
Bin1: 4, 8, 15
Bin2: 21, 21, 24
Bin3: 25, 28, 34
Smoothing by bin means:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
Smoothing by bin boundaries:
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
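The binning example above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the slides: the function names are illustrative, the bins are assumed to be equal-depth (three values each), and ties in boundary smoothing are assumed to go to the lower boundary.

```python
def equal_depth_bins(sorted_values, depth):
    """Split an already sorted list into consecutive bins of `depth` values."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        # Assumption: a tie goes to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With the price list from the slide, `by_means` and `by_boundaries` match the second and third groups of bins shown above.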
http://en.wikipedia.org/wiki/File:Dataintegration.png
Multidimensional Analysis
Outline
• Why multidimensional analysis?
• Multidimensional analysis principle
• OLAP
• OLAP indexes
Jian Pei: Big Data Analytics -- Multidimensional Analysis 3 Jian Pei: Big Data Analytics -- Multidimensional Analysis 4
Other Operations
• Dice: pick specific values or ranges on some dimensions
• Pivot: “rotate” a cube, changing the order of dimensions in visual analysis
Relational Representation
• If there are n dimensions, there are 2^n possible aggregation columns
• Roll up by model by year by color in a table
http://en.wikipedia.org/wiki/File:OLAP_pivoting.png
DATA CUBE
SALES (base table)
Model  Year  Color  Sales
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  blue   62
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  blue   49
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  blue   71
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  blue   63
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  blue   55
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  blue   39
CUBE query
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
DATA CUBE (query result)
Model  Year  Color  Sales
Chevy  1990  blue   62
Chevy  1990  red    5
Chevy  1990  white  87
Chevy  1990  ALL    154
Chevy  1991  blue   49
Chevy  1991  red    54
Chevy  1991  white  95
Chevy  1991  ALL    198
Chevy  1992  blue   71
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  ALL    156
Chevy  ALL   blue   182
Chevy  ALL   red    90
Chevy  ALL   white  236
Chevy  ALL   ALL    508
Ford   1990  blue   63
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  ALL    189
Ford   1991  blue   55
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  ALL    116
Ford   1992  blue   39
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  ALL    128
Ford   ALL   blue   157
Ford   ALL   red    143
Ford   ALL   white  133
Ford   ALL   ALL    433
ALL    1990  blue   125
ALL    1990  red    69
ALL    1990  white  149
ALL    1990  ALL    343
ALL    1991  blue   104
ALL    1991  red    106
ALL    1991  white  104
ALL    1991  ALL    314
ALL    1992  blue   110
ALL    1992  red    58
ALL    1992  white  116
ALL    1992  ALL    284
ALL    ALL   blue   339
ALL    ALL   red    233
ALL    ALL   white  369
ALL    ALL   ALL    941
• ALL is a set
– Year.ALL = ALL(Year) = {1990,1991,1992}
– Color.ALL = ALL(Color) = {red,white,blue}
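The CUBE operator can be imitated in plain Python to check individual cells of the result. The sketch below is only an illustration under stated assumptions: the `SALES` list is transcribed from the base table on the slide, and `data_cube` is a hypothetical helper name, not part of any SQL engine or library.

```python
from collections import defaultdict
from itertools import product

# Base SALES rows (Model, Year, Color, Sales), transcribed from the slide.
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def data_cube(rows):
    """SUM(sales) for all 2^n group-bys; 'ALL' marks a rolled-up dimension."""
    cube = defaultdict(int)
    for *dims, sales in rows:
        # Each base row contributes to 2^n cells: keep or roll up each dimension.
        for mask in product((False, True), repeat=len(dims)):
            cell = tuple("ALL" if rolled else d for d, rolled in zip(dims, mask))
            cube[cell] += sales
    return dict(cube)

cube = data_cube(SALES)
```

A lookup such as `cube[("Ford", "ALL", "ALL")]` then plays the role of one row of the DATA CUBE table. A real system would not enumerate 2^n cells per row this way; the loop is chosen for readability.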
OLTP Versus OLAP
                    OLTP                             OLAP
users               clerk, IT professional           knowledge worker
function            day to day operations            decision support
DB design           application-oriented             subject-oriented
data                current, up-to-date, detailed,   historical, summarized, multidimensional,
                    flat relational, isolated        integrated, consolidated
usage               repetitive                       ad-hoc
access              read/write, index/hash on        lots of scans
                    prim. key
unit of work        short, simple transaction        complex query
# records accessed  tens                             millions
# users             thousands                        hundreds
DB size             100MB-GB                         100GB-TB
metric              transaction throughput           query throughput, response
What Is a Data Warehouse?
• A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.
– W. H. Inmon
• Data warehousing: the process of constructing and using data warehouses
Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Integrated
• Integrating multiple, heterogeneous data sources
– Relational databases, flat files, on-line transaction records
• Data cleaning and data integration
– Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted
Why Separate Data Warehouse?
• High performance for both
– Operational DBMS: tuned for OLTP
– Warehouse: tuned for OLAP
• Different functions and different data
Star Schema
[Figure: schema diagrams; the recoverable tables and attributes are listed below]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
Variant with normalized dimension tables:
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Variant with a second fact table (shipping):
time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
shipper: shipper_key, shipper_name, location_key, shipper_type
Index Requirements in OLAP
• Data is read only
– (Almost) no insertion or deletion
• Query types
– Point query: looking up one specific tuple (rare)
– Range query: returning the aggregate of a (large) set of tuples, with group by
– Complex queries: need specific algorithms and index structures, will be discussed later
OLAP Query Example
• In table (cust, gender, …), find the total number of male customers
• Method 1: scan the table once
• Method 2: build a B+ tree index on attribute gender, still need to access all tuples of male customers
• Can we get the count without scanning many tuples, or even all the tuples of male customers?
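One way to answer the last question is a bitmap index, which the later slides refer to as a bitmap B. The sketch below assumes one bitmap per distinct value of gender, stored as a Python integer; the column data and the name `build_bitmap_index` are hypothetical, since the slide gives no rows.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is 1 iff row i has that value."""
    index = {}
    for row_id, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row_id)
    return index

# Hypothetical toy column (cust, gender, ...) reduced to the gender attribute.
gender = ["M", "F", "M", "M", "F", "F", "M"]
index = build_bitmap_index(gender)

# The count is the number of 1 bits in the bitmap; no tuple is fetched.
male_count = bin(index["M"]).count("1")
```

Once the index is built, the count touches only the bitmap for "M", not the customer tuples themselves.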
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
– Tuples satisfying C are identified by a bitmap B
• Direct access to rows to calculate SUM: scan the whole table once
• B+ tree: find the tuples from the tree
• Projection index: only scan attribute sales
• Bit-sliced index: get the sum from ∑_k COUNT(B AND B_k) * 2^k, where B_k is the bitmap of the k-th bit of sales
Cost Comparison
• Traditional value-list index (B+ tree) is costly in both I/O and CPU time
– Not good for OLAP
• Bit-sliced index is efficient in I/O
• Other case studies in [O'Neil and Quass, SIGMOD 97]
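The bit-sliced computation of SUM(sales) can be sketched as follows. This is an illustrative toy under stated assumptions: bitmaps are Python integers (bit i stands for row i), the sales values and the selection bitmap are made up, and the helper names are hypothetical.

```python
def build_bit_slices(values, n_bits):
    """Slice k is a bitmap whose bit i is set iff bit k of values[i] is 1."""
    slices = [0] * n_bits
    for row_id, v in enumerate(values):
        for k in range(n_bits):
            if (v >> k) & 1:
                slices[k] |= 1 << row_id
    return slices

def popcount(bitmap):
    """Number of 1 bits, i.e. COUNT of the rows in the bitmap."""
    return bin(bitmap).count("1")

def sum_with_bit_slices(slices, selection):
    """SUM over selected rows = sum over k of COUNT(selection AND slice_k) * 2^k."""
    return sum(popcount(selection & s) << k for k, s in enumerate(slices))

sales = [5, 87, 62, 54, 95, 49]   # hypothetical sales column
selection = 0b101101              # bitmap B: rows 0, 2, 3 and 5 satisfy C
slices = build_bit_slices(sales, 7)  # 7 bits suffice for values below 128
total = sum_with_bit_slices(slices, selection)
```

The result equals the direct sum 5 + 62 + 54 + 49 over the selected rows, but it is obtained from one AND and one count per bit slice rather than from the rows.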
CUBE
[Figure: the SALES table, the GROUP BY CUBE(Model, Year, Color) query, and the resulting DATA CUBE table, repeated from the earlier slide]
MOLAP
[Figure: a three-dimensional cube over Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum)]
Iceberg Cube
• In a data cube, many aggregate cells are trivial
– Having an aggregate too small
• Iceberg query
Monotonic Iceberg Condition
• If COUNT(a, b, *) < 100, then COUNT(a, b, c) < 100 for any c
• For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
– (a,b,*) is an ancestor of (a,b,c)
• An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can satisfy P
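Monotonicity is what makes pruning possible: once a cell fails the condition, none of its descendants needs to be computed. The following is a minimal sketch of that idea for the condition COUNT >= min_count. It is a simple recursive enumeration written for illustration, not an algorithm taken from the slides, and the rows are hypothetical.

```python
def iceberg_cube(rows, min_count):
    """Return {cell: count} for every cell with COUNT >= min_count.

    '*' marks an aggregated dimension. Because the condition is monotonic,
    a failing cell is never specialized further.
    """
    n_dims = len(rows[0])
    result = {}

    def expand(cell, covered, first_dim):
        if len(covered) < min_count:
            return  # prune: no descendant of this cell can reach min_count
        result[cell] = len(covered)
        # Specialize only dimensions to the right, so each cell is visited once.
        for d in range(first_dim, n_dims):
            groups = {}
            for r in covered:
                groups.setdefault(r[d], []).append(r)
            for value, group in groups.items():
                child = cell[:d] + (value,) + cell[d + 1:]
                expand(child, group, d + 1)

    expand(("*",) * n_dims, list(rows), 0)
    return result

# Hypothetical base table with three dimensions.
rows = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a1", "b2", "c1"), ("a2", "b1", "c1")]
cells = iceberg_cube(rows, min_count=2)
```

With these four rows and `min_count=2`, the cell `("a2", "*", "*")` has count 1 and is pruned, so none of its descendants is ever examined.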
A Data Cube Is Often Huge
• 10 dimensions, cardinality 20 for each dimension → 21^10 = 16,679,880,978,201 possible tuples in the cube
• Even if only 1/1,000 of the possible tuples are non-empty, there are still more than 16 billion tuples
Compression of Data Cubes
• Traditional compression methods, e.g., zip
– High compression ratio
– The compression cannot be queried directly
• Requirements for data cube compression
– The compression can be queried efficiently
– High compression ratio
• Lossless compression and lossy compression
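The size estimate in "A Data Cube Is Often Huge" follows from letting every dimension take either one of its 20 values or the aggregated value ALL, which gives 21 choices per dimension. A short check, with `cube_cells` as a hypothetical helper name:

```python
def cube_cells(cardinalities):
    """Possible cells: each dimension takes one of its values or ALL (*)."""
    total = 1
    for c in cardinalities:
        total *= c + 1
    return total

possible = cube_cells([20] * 10)   # 21 ** 10
non_empty = possible // 1000       # if only 1/1,000 of the cells are non-empty
```

Here `possible` is 16,679,880,978,201 and `non_empty` is still above 16 billion, which is the motivation for compressing the cube.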
A Naïve Attempt
• Put all cells of same agg values into a class
• The result is not a lattice anymore!
– Anomaly: the rollup/drilldown semantics is lost
A Better Partitioning
• Quotient cube: partitioning preserving the rollup/drilldown semantics
• OLAP browsing
[Figure: cube lattice over the base tuples (S1,P1,s):6, (S1,P2,s):12, (S2,P1,f):9, with aggregate cells (S1,*,*):9, (*,*,s):9, (*,P1,*):7.5, (*,P2,*):12, (*,*,f):9, (S2,*,*):9, (*,*,*):9 grouped into classes C1, C2, C4, C5]
• Given a cube, characterize a good way (the quotient cube way) of partitioning its cells into classes such that
– The partition generates a reduced lattice as small as possible
• Compute, index and store quotient cubes
• Two cells have equivalent aggregate values if they cover the same set of tuples in the base table
[Figure: tuples in base table (S1,P1,s):6, (S1,P2,s):12, (S2,P1,f):9; aggregate cells (S1,*,*):9, (*,*,s):9, (*,P1,*):7.5, (*,P2,*):12, (*,*,f):9, (S2,*,*):9]
Cover Partition
• For a cell c, a tuple t in base table is in c's cover if t can be rolled up to c
– E.g., Cov(S1,*,spring)={(S1,P1,spring), (S1,P2,spring)}
Dimensions               Measure
Store  Product  Season   Sales
S1     P1       Spring   6
S1     P2       Spring   12
S2     P1       Fall     9
Cover Partitions & Aggregates
• All cells in a cover partition carry the same aggregate value with respect to any aggregate function
– But cells in a class of MIN() may have different covers
• For COUNT() and SUM() (positive), cover equivalence coincides with aggregate equivalence
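The cover of a cell and the resulting cover partition can be computed directly for the three-tuple base table above. This is a brute-force sketch for illustration only: it enumerates every cell, which is feasible only for tiny tables, and the names `cover` and `cover_partition` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Base table from the slide: (Store, Product, Season, Sales).
BASE = [("S1", "P1", "Spring", 6), ("S1", "P2", "Spring", 12), ("S2", "P1", "Fall", 9)]

def cover(cell, base):
    """Base tuples that can be rolled up to `cell` ('*' matches any value)."""
    return frozenset(
        t for t in base if all(c == "*" or c == v for c, v in zip(cell, t))
    )

def cover_partition(base):
    """Group all non-empty cells by their cover; each group is one class."""
    n_dims = len(base[0]) - 1  # the last column is the measure
    domains = [sorted({t[d] for t in base}) + ["*"] for d in range(n_dims)]
    classes = defaultdict(list)
    for cell in product(*domains):
        cov = cover(cell, base)
        if cov:  # skip cells that cover no base tuple
            classes[cov].append(cell)
    return classes

classes = cover_partition(BASE)
example_cover = cover(("S1", "*", "Spring"), BASE)
```

Cells in the same class, for example (S1,*,*), (*,*,Spring) and (S1,*,Spring), share one cover and therefore one value for any aggregate function, as the slide states.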
Example Skyline Computation
Redundancy in Sky Cube Mining Decisive Subspaces
Different customers may have different preferences
Favorable Facet Mining
• A set of points in a multidimensional space
– Fully ordered attributes: the preference orders are fixed, e.g., price, star-level, and quality
– (Categorical) Partially ordered attributes: the preference orders are not fully determined, e.g., airlines, hotel groups, and property types
• Some templates may apply, e.g., single houses > semi-detached houses
• Favorable facets of a point p: the partial orders that make p in the skyline
Monotonicity of Partial Orders
• If p is not in the skyline with respect to a partial order R, p is not in the skyline with respect to any partial order stronger than R
Learning Methods Multidimensional Analysis of Logs