
dbis.uni-konstanz.de

Discovering OLAP Dimensions in Semi-Structured Data
Svetlana Mansmann, Nafees Ur Rehman, Andreas Weiler, Marc H. Scholl
Database & Information System (DBIS)
Dept. of Computer Science, University of Konstanz, Germany

02-Nov-12   Discovering OLAP Dimensions in Semi-Structured Data – DOLAP'12, Hawaii, USA
Outline

• Introduction & Motivation
  – Social Networks and Big Data
  – OLAP and Data Mining for “Big Data”
• Acquiring Facts and Dimensions
  – Data Transformation
  – Discovering New Elements
• Modeling Discovered Elements
• Usage & Maintenance of Dynamic Elements
• Conclusion

Introduction & Motivation

• Social Networks
  – Growing popularity
  – Huge data volumes
  – High data generation rate
  – Heterogeneity
• “Big Data”

Introduction & Motivation

• Data Warehouse vs. NoSQL
  – Established and mature technology
  – Standardized for interchangeability
  – Integration with Data Mining
  – Abundance of tools for various tasks
• Challenges
  – Heterogeneous and semi-structured content
  – Dynamic data, changing dimensions
  – High data arrival rate
  – Non-trivial analysis tasks

Twitter: A motivational scenario

• Why Twitter?
  – News broadcast & information exchange platform
  – Active users: > 140 million
  – Daily tweets: > 340 million
  – Set of configurable APIs: Search, REST, Streaming

Twitter: Output Data Format

• Twitter APIs output semi-structured data as JSON (JavaScript Object Notation) objects:
  – User data
  – Status (tweet) data
  – Timeline data
• Over 67 metadata fields
• 10% of the public stream is available for free

Example tweet (shown in XML-style notation):

<tweet>
  <text>
    If you havent read about Mario Balotelli yet,
    you MUST before todays #EURO2012 final:
    http://t.co/2aFDjnsD
  </text>
  <truncated>true</truncated>
  <date>2012-01-07 18:36:05.000</date>
  <source>web</source>
  <retweeted>true</retweeted>
  <user>
    <name>Marcel***</name>
    <date>2011-08-01 06:06:34:12.000</date>
    <utc-offset>-18000</utc-offset>
    <language>en</language>
    <geo-enabled>False</geo-enabled>
    <statuses_count>1521</statuses_count>
    <followers_count>121</followers_count>
  </user>
</tweet>
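
The step from a nested status object to a flat record that can feed a fact or dimension table could look roughly like the sketch below. It is only an illustration: the field names are borrowed from the simplified example on this slide rather than from the real API payload, and `flatten` is a hypothetical helper, not part of any Twitter library.

```python
import json

# Hypothetical, heavily trimmed status object; field names follow the
# simplified slide example, not the actual Twitter API schema.
RAW = """{
  "text": "you MUST read this before todays #EURO2012 final",
  "truncated": true,
  "source": "web",
  "user": {"name": "Marcel", "language": "en",
           "statuses_count": 1521, "followers_count": 121}
}"""


def flatten(obj, prefix=""):
    """Turn a nested JSON object into a flat dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested objects such as the embedded user.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


record = flatten(json.loads(RAW))
print(sorted(record))
```

Dotted keys such as `user.followers_count` keep the origin of each field visible, which helps when grouping related fields into classes later.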

Multi-Layered Architecture for Twitter Data Warehouse

5th layer: PRESENTATION
  OLAP frontend, Data Mining tool, DSS frontend, spreadsheet, web frontend
4th layer: ANALYSIS
  OLAP, Data Mining, DSS methods
3rd layer: DATA WAREHOUSE
  Media Mart, Tweet Mart, User Mart, Microsoft SQL Server
  Archiving system, Monitoring, Metadata, Administration
2nd layer: ETL
  BaseX XML storage, Extractor, Enrichment, Staging area
1st layer: DATA SOURCES
  REST API, Search API, Streaming API, external sources

Twitter stream - a Structured View

• Twitter data model:
  – Original model is not available
  – Streamed data is poorly documented
  – Relationships between fields are not obvious
• Reverse engineering of the data model
  – Related fields are grouped into classes
  – Relationships between classes are specified
  – Constraints are defined


Acquiring Facts and Dimensions

• Cube candidates:
  – user-related data
  – tweet-related data
  – content elements
• Granularity levels:
  – user statistics
  – messaging statistics
  – topics & terms

Acquiring Facts and Dimensions

• Simple derivation / computation
• Including external data sources
  – geo-information, vocabularies
• Applying external functions (APIs)
  – language detection and translation
  – sentiment analysis
  – spam detection
  – …
• Data mining
  – hidden relationships, clustering, ranking

Discovered Hierarchy - Example

• Conceptual modeling of new elements
• Consider the user dimension in the TweetCount fact type
• Including external data sources
  – geo-information, vocabularies
• Applying external functions (APIs)
  – language detection and translation
• What about a hierarchy of user popularity?


Discovered Hierarchy - Example

• Adopt some ranking function (e.g., based on the number of followers)
• Define higher-level groupings (e.g., based on percentages or thresholds)
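
A minimal sketch of these two steps, assuming the follower count is the ranking function and fixed thresholds define the groups. The threshold values, group labels, and user names are made up for illustration; the slides do not specify them.

```python
# Hypothetical thresholds, ordered from the highest lower bound to the lowest.
THRESHOLDS = [(100_000, "celebrity"), (1_000, "popular"), (0, "regular")]


def popularity_group(followers, thresholds=THRESHOLDS):
    """Map a follower count to the first group whose lower bound it reaches."""
    for lower_bound, label in thresholds:
        if followers >= lower_bound:
            return label
    return "unranked"  # only reached for negative or otherwise invalid counts


# Hypothetical users with their follower counts.
users = {"carol": 121, "alice": 250_000, "bob": 1_521}

# Step 1: rank users by the chosen function (here: number of followers).
ranked = sorted(users, key=users.get, reverse=True)

# Step 2: assign each ranked user to a higher-level group.
groups = {name: popularity_group(users[name]) for name in ranked}
print(groups)
```

Percentage-based groupings would follow the same shape, with the position in `ranked` compared against percentile cut-offs instead of absolute counts.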

Discovered Hierarchy - Example

• Add new aggregation path to the fact schema
• Specify the computation formula for added elements
• Problem: the added hierarchy is dynamic
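
To make the problem concrete, the sketch below rolls a TweetCount measure up along a user-to-popularity-group path twice, once with an earlier group assignment and once with a later one. All names and numbers are hypothetical; the point is only that the same facts aggregate differently once a user changes group.

```python
from collections import Counter

# Hypothetical facts: (user, tweet_count).
facts = [("alice", 40), ("bob", 12), ("carol", 3), ("bob", 5)]


def roll_up(facts, user_group):
    """Aggregate tweet counts from the user level to the popularity-group level."""
    totals = Counter()
    for user, count in facts:
        totals[user_group[user]] += count
    return dict(totals)


# Group assignment at an earlier point in time (hypothetical).
groups_then = {"alice": "celebrity", "bob": "popular", "carol": "regular"}
# Later, bob has gained followers and moved up a group.
groups_now = {"alice": "celebrity", "bob": "celebrity", "carol": "regular"}

print(roll_up(facts, groups_then))
print(roll_up(facts, groups_now))
```

Because the mapping itself changes over time, the warehouse has to record which version of the hierarchy a result was computed with.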


Discovered Facts and Dimensions

Example tweet: "Watching #Euro final at British pub in Capitola while staring at the beach. Not what I expected but I'll take it! Viva Italia"

• Simple Derivation
  – Fact/measure extraction
    » Length of tweet: 64
    » Number of hashtags: 1
  – Dimension
    » Source: Web, App, Phone
  – Hierarchy
    » Source

[Figure: hierarchy diagram with the nodes ALL, Source, Brand, User]
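
Simple derivation of measures from the tweet text could be sketched as follows. The function name and the hashtag pattern are assumptions made for this example; the slides do not state how the length or the hashtag count is actually computed.

```python
import re

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def derive_measures(text):
    """Compute simple measures directly from the tweet text."""
    return {
        "length": len(text),                     # number of characters
        "hashtags": len(HASHTAG.findall(text)),  # number of hashtags
    }


tweet = ("Watching #Euro final at British pub in Capitola while staring "
         "at the beach. Not what I expected but I'll take it! Viva Italia")
print(derive_measures(tweet)["hashtags"])
```

Measures of this kind need no external source, so they can be computed during ETL and stored with the fact.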

Discovered Facts and Dimensions

Example tweet: "Watching #Euro final at British pub in Capitola while staring at the beach. Not what I expected but I'll take it! Viva Italia"

• External Data Sources & APIs
  – Language: English
  – Entity detection
    » Event: Euro (Championship)
    » Facility: British pub
    » Country: Italia
  – Topic: Recreation
  – Tags: Sports, Fun, eurocup, etc.
  – Sentiment: Positive

Discovered Facts and Dimensions

• Data Mining
  – Clusters of users: Trending, Spam, Lifestyle, etc.
  – Clusters of tweets: Popularity
  – Non-trivial relationships: what contributes to the popularity & trending of users and tweets

Discovered Facts and Dimensions

• Tweet Popularity Classifier
• User Popularity Classifier

Modeling Discovered Elements


Maintenance of Dynamic Elements

• Similar to Slowly Changing Dimensions
• Multi-versioning / historization
  – Current version in the dimension table
  – Previous versions in history table(s)
  – Temporal constraints for historical records
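
One possible shape for this historization is sketched below with Python's built-in `sqlite3` module. The table and column names (`dim_user`, `dim_user_history`, `valid_from`, `valid_to`) and the helper `set_popularity` are assumptions for illustration; the slides name SQL Server as the warehouse but do not show a schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user (             -- holds the current version only
    user_id    INTEGER PRIMARY KEY,
    popularity TEXT NOT NULL,
    valid_from TEXT NOT NULL
);
CREATE TABLE dim_user_history (     -- holds closed, earlier versions
    user_id    INTEGER NOT NULL,
    popularity TEXT NOT NULL,
    valid_from TEXT NOT NULL,
    valid_to   TEXT NOT NULL,
    CHECK (valid_from < valid_to)   -- temporal constraint on each record
);
""")


def set_popularity(con, user_id, popularity, as_of):
    """Store a new current version and move the replaced one to history."""
    row = con.execute(
        "SELECT popularity, valid_from FROM dim_user WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        con.execute("INSERT INTO dim_user VALUES (?, ?, ?)",
                    (user_id, popularity, as_of))
        return
    if row[0] == popularity:
        return  # value unchanged, so no new version is needed
    con.execute("INSERT INTO dim_user_history VALUES (?, ?, ?, ?)",
                (user_id, row[0], row[1], as_of))
    con.execute(
        "UPDATE dim_user SET popularity = ?, valid_from = ? WHERE user_id = ?",
        (popularity, as_of, user_id),
    )


set_popularity(con, 1, "regular", "2009-01-01")
set_popularity(con, 1, "popular", "2012-06-01")
print(con.execute("SELECT * FROM dim_user_history").fetchall())
```

ISO-formatted date strings compare correctly as text, which is why the `CHECK` constraint works without a dedicated date type.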

Querying along Dynamic Elements

• OLAP queries with multi-versioned dimensions
  – correct aggregation by joining the fact entries with the matching versions of the dimension
  – “playing” with different versions for what-if analysis
• Examples
  – “retrieve the messages tweeted in 2009 by those users who are popular now (and not in 2009!)”
  – “retrieve recent tweets containing the hashtags which were in TOP 20 in 2008”
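
The two query styles can be sketched with `sqlite3` as below. For brevity the example assumes a single `dim_user_versions` table in which the current version carries an open-ended `valid_to` of `9999-12-31`; this stands in for a view over the current and history tables. All table names, column names, and rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_tweet (tweet_id INTEGER, user_id INTEGER, created TEXT);
CREATE TABLE dim_user_versions (user_id INTEGER, popularity TEXT,
                                valid_from TEXT, valid_to TEXT);
INSERT INTO fact_tweet VALUES
    (10, 1, '2009-03-01'), (11, 2, '2009-05-01'), (12, 1, '2012-07-01');
INSERT INTO dim_user_versions VALUES
    (1, 'regular', '2008-01-01', '2012-06-01'),
    (1, 'popular', '2012-06-01', '9999-12-31'),
    (2, 'popular', '2008-01-01', '9999-12-31');
""")

# Version-correct join: each fact meets the version valid when it was created.
AS_POSTED = """
SELECT f.tweet_id, v.popularity
FROM fact_tweet f
JOIN dim_user_versions v
  ON v.user_id = f.user_id
 AND f.created >= v.valid_from AND f.created < v.valid_to
ORDER BY f.tweet_id
"""

# What-if: tweets from 2009 by users popular now but not popular back then.
POPULAR_NOW_NOT_THEN = """
SELECT f.tweet_id
FROM fact_tweet f
JOIN dim_user_versions cur
  ON cur.user_id = f.user_id
 AND cur.valid_to = '9999-12-31' AND cur.popularity = 'popular'
JOIN dim_user_versions past
  ON past.user_id = f.user_id
 AND f.created >= past.valid_from AND f.created < past.valid_to
 AND past.popularity <> 'popular'
WHERE f.created >= '2009-01-01' AND f.created < '2010-01-01'
ORDER BY f.tweet_id
"""

as_posted = con.execute(AS_POSTED).fetchall()
popular_now = con.execute(POPULAR_NOW_NOT_THEN).fetchall()
print(as_posted, popular_now)
```

The second query joins the dimension twice, once for the current version and once for the version valid at posting time, which is what lets it compare "now" against "then".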

Conclusion

• Proposed extraction of a multi-dimensional data cube from semi-structured data.
• Extended the underlying dataset and model using DM and semantic enrichment methods.
• Adapted the DWH to deal with changing/dynamic data using the concept of SCD.
• Enabled OLAP for recent and historic data analysis.

Thank You

Questions
