
End-to-End Hadoop Development using

OBIEE, ODI and Oracle Big Data Discovery


Mark Rittman, CTO, Rittman Mead
March 2015
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)

E : info@rittmanmead.com
W : www.rittmanmead.com

About the Speaker


Mark Rittman, Co-Founder of Rittman Mead
Oracle ACE Director, specialising in Oracle BI&DW
14 Years Experience with Oracle Technology
Regular columnist for Oracle Magazine
Author of two Oracle Press Oracle BI books
Oracle Business Intelligence Developer's Guide
Oracle Exalytics Revealed
Writer for Rittman Mead Blog :
http://www.rittmanmead.com/blog
Email : mark.rittman@rittmanmead.com
Twitter : @markrittman


About Rittman Mead


Oracle BI and DW Gold partner
Winner of five UKOUG Partner of the Year awards in 2013 - including BI
World leading specialist partner for technical excellence,
solutions delivery and innovation in Oracle BI
Approximately 80 consultants worldwide
All expert in Oracle BI and DW
Offices in US (Atlanta), Europe, Australia and India
Skills in broad range of supporting Oracle tools:
OBIEE, OBIA, ODI
Big Data, Hadoop, NoSQL & Big Data Discovery
Essbase, Oracle OLAP
GoldenGate
Endeca

End-to-End Oracle Big Data Example


Rittman Mead want to understand drivers and audience for their website
What is our most popular content? Who are the most in-demand blog authors?
Who are the influencers? What do they read?
Three data sources in scope:
RM Website Logs

Twitter Stream

Website Posts, Comments etc

Two Analysis Scenarios : Reporting, and Data Discovery


Initial task will be to ingest data from webserver logs, Twitter firehose, site content + ref data
Land in Hadoop cluster, basic transform, format, store; then, analyse the data:

Combine with Oracle Big Data SQL


for structured OBIEE dashboard analysis

Why is some content more popular?


Does sentiment affect viewership?
What content is popular, where?

What pages are people visiting?


Who is referring to us on Twitter?
What content has the most reach?


Combine with site content, semantics, text enrichment


Catalog and explore using Oracle Big Data Discovery


Step 1
Ingesting Basic Dataset into Hadoop


Data Loading into Hadoop


Default load type is real-time, streaming loads
Batch / bulk loads only typically used to seed system
Variety of sources including web log activity, event streams
Target is typically HDFS (Hive) or HBase
Data typically lands in raw state
Lots of files and events, need to be filtered/aggregated
Typically semi-structured (JSON, logs etc)
High volume, high velocity
- Which is why we use Hadoop rather than an
RDBMS (speed vs. ACID trade-off)
Economics of Hadoop means it's often possible to
archive all incoming data at detail level


[Diagram: loading stage — real-time logs/events and file/unstructured imports]

Data Sources used for Project : Stage 1


[Diagram: Ingest → Process → Publish across a Cloudera CDH5.3 BDA Hadoop cluster.
Flume agents and MongoDB feed the ingest stage; Spark, Hive and HDFS handle processing;
Big Data SQL executes on the BDA, returning filtered and projected rows/columns to
Exadata (dimension attributes) and to OBIEE on Exalytics, cached in TimesTen or the
Oracle Database 12c In-Memory Option.]

Apache Flume : Distributed Transport for Log Activity


Apache Flume is the standard way to transport log files from source through to target
Initial use-case was webserver log files, but can transport any file from A to B
Does not do data transformation, but can send to multiple targets / target types
Mechanisms and checks to ensure successful transport of entries
Has a concept of agents, sinks and channels
Agents collect and forward log data
Sinks store it in final destination
Channels store log data en-route
Simple configuration through INI files
Handled outside of ODI12c


Flume Source / Target Configuration


Conf file for source system agent
TCP port, channel size+type, source type


Conf file for target system agent


TCP port, channel size+type, sink type
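The conf-file screenshots are not reproduced in this text version; a minimal pair of agent configurations, with hypothetical agent, host and path names, might look like this:

```ini
# Source-side agent (hypothetical names): tails the Apache access log
# and forwards events over Avro/TCP to the target server
source_agent.sources  = apache_log
source_agent.channels = memoryChannel
source_agent.sinks    = avro_sink

source_agent.sources.apache_log.type     = exec
source_agent.sources.apache_log.command  = tail -f /var/log/httpd/access_log
source_agent.sources.apache_log.channels = memoryChannel

source_agent.channels.memoryChannel.type     = memory
source_agent.channels.memoryChannel.capacity = 1000

source_agent.sinks.avro_sink.type     = avro
source_agent.sinks.avro_sink.hostname = bda-node1
source_agent.sinks.avro_sink.port     = 4545
source_agent.sinks.avro_sink.channel  = memoryChannel

# Target-side agent: receives Avro events and writes them to HDFS
target_agent.sources  = avro_source
target_agent.channels = memoryChannel
target_agent.sinks    = hdfs_sink

target_agent.sources.avro_source.type     = avro
target_agent.sources.avro_source.bind     = 0.0.0.0
target_agent.sources.avro_source.port     = 4545
target_agent.sources.avro_source.channels = memoryChannel

target_agent.channels.memoryChannel.type     = memory
target_agent.channels.memoryChannel.capacity = 1000

target_agent.sinks.hdfs_sink.type      = hdfs
target_agent.sinks.hdfs_sink.hdfs.path = /user/flume/rm_logs
target_agent.sinks.hdfs_sink.channel   = memoryChannel
```

Each agent names its sources, channels and sinks, then sets per-component properties; the channel capacity here is what bounds event buffering between source and sink.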


Also - Apache Kafka : Reliable, Message-Based


Developed by LinkedIn, designed to address Flume issues around reliability, throughput
(though many of those issues have been addressed since)
Designed for persistent messages as the common use case
Website messages, events etc vs. log file entries
Consumer (pull) rather than Producer (push) model
Supports multiple consumers per message queue
More complex to set up than Flume, and can use
Flume as a consumer of messages
But gaining popularity, especially
alongside Spark Streaming
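As a sketch of the moving parts (hypothetical topic and host names, Kafka 0.8-era CLI):

```shell
# Create a topic for incoming log events (3 partitions, replicated twice)
kafka-topics.sh --create --zookeeper cdh51-node1:2181 \
  --topic rm_logs --partitions 3 --replication-factor 2

# Any number of consumers can independently pull from the same topic
kafka-console-consumer.sh --zookeeper cdh51-node1:2181 \
  --topic rm_logs --from-beginning
```

The pull model is the key contrast with Flume: consumers read at their own pace from the persisted message log, rather than having events pushed at them.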


Starting Flume Agents, Check Files Landing in HDFS Directory


Start the Flume agents on source and target (BDA) servers
Check that incoming file data starts appearing in HDFS
Note - files will be continuously written to as
entries are added to the source log files
Channel size for source, target agents
determines max no. of events buffered
If buffer exceeded, new events dropped
until buffer < channel size
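A hedged example of the commands involved (agent and conf-file names are hypothetical, and must match whatever names the conf files define):

```shell
# On the source webserver
flume-ng agent -n source_agent -c conf -f apache_source.conf

# On the target BDA node
flume-ng agent -n target_agent -c conf -f bda_target.conf

# Verify events are landing in HDFS (file names are Flume-generated)
hadoop fs -ls /user/flume/rm_logs
```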


Adding Social Media Datasources to the Hadoop Dataset


The log activity from the Rittman Mead website tells us what happened, but not why
Common customer requirement now is to get a 360 degree view of their activity
Understand what's being said about them
External drivers for interest, activity
Understand more about customer intent, opinions
One example is to add details of social media mentions,
likes, tweets and retweets etc to the transactional dataset
Correlate twitter activity with sales increases, drops
Measure impact of social media strategy
Gather and include textual, sentiment, contextual
data from surveys, media etc


Example : Supplement Webserver Log Activity with Twitter Data


Datasift provide access to the Twitter firehose along with Facebook data, Tumblr etc
Developer-friendly APIs and ability to define search terms, keywords etc
Pull (historical data) or Push (real-time) delivery using many formats / end-points
Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc


What is MongoDB?
Open-source document-store NoSQL database
Flexible data model, each document (record)
can have its own JSON schema
Highly-scalable across multiple nodes (shards)
MongoDB databases made up of
collections of documents
Add new attributes to a document just by using it
Single table (collection) design, no joins etc
Very useful for holding JSON output from web apps
- for example, twitter data from Datasift
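For illustration, a Datasift tweet document in the collection might be shaped like this (values are invented; the nested paths are the ones the Hive MongoDB storage handler mapping later references with dot notation):

```json
{
  "interactionId": "a1b2c3d4e5f6",
  "interaction": {
    "interaction": {
      "author": { "username": "markrittman" },
      "content": "New post on the Rittman Mead blog"
    },
    "twitter": {
      "user": { "followers_count": 9500 }
    }
  }
}
```

A new attribute can be added to any single document without altering the others - there is no fixed schema to migrate.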


Where Are We Now : Data Ingestion Phase


Log data streamed into HDFS via Flume collector / sink
Twitter data pushed into MongoDB collection
Data is in raw form
Now it needs to be refined, combined,
transformed and processed into
useful data


Step 2
Processing, Enhancing & Transforming Data


Core Apache Hadoop Tools


Apache Hadoop, including MapReduce and HDFS
Scaleable, fault-tolerant file storage for Hadoop
Parallel programming framework for Hadoop
Apache Hive
SQL abstraction layer over HDFS
Perform set-based ETL within Hadoop
Apache Pig, Spark
Dataflow-type languages over HDFS, Hive etc
Extensible through UDFs, streaming etc
Apache Flume, Apache Sqoop, Apache Kafka
Real-time and batch loading into HDFS
Modular, fault-tolerant, wide source/target coverage


Other Tools Typically Used


Python, Scala, Java and other programming languages
For more complex and procedural transformations
Shell scripts, sed, awk, regexes etc
R and R-on-Hadoop
Typically at the discovery phase
And down the line - ETL tools to automate the process
ODI, Pentaho Data Integrator etc


ODI on Hadoop - Enterprise Data Integration for Big Data


ODI provides an excellent framework for loading and transforming Hadoop data
ELT approach pushes transformations down to Hadoop - leveraging power of cluster
Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation
Whilst still preserving RDBMS push-down
Extensible to cover Pig, Spark etc
Process orchestration
Data quality / error handling
Metadata and model-driven


Oracle Big Data Connectors


Oracle-licensed utilities to connect Hadoop to Oracle RDBMS
Bulk-extract data from Hadoop to Oracle, or expose HDFS / Hive data as external tables
Run R analysis and processing on Hadoop
Leverage Hadoop compute resources to offload ETL and other work from Oracle RDBMS
Enable Oracle SQL to access and load Hadoop data


Hive SerDes & Storage Handlers


Plug-in technologies that extend Hive to handle new data formats and semi-structured sources
Typically distributed as JAR files, hosted on sites such as GitHub
Can be used to parse log files, access data in NoSQL databases, Amazon S3 etc
CREATE EXTERNAL TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/root/logs';


CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE
  'com.mongodb.hadoop.hive.BSONSerDe'
STORED BY
  'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping'='{"interactionId":"interactionId",
    "username":"interaction.interaction.author.username",
    "content":"interaction.interaction.content",
    "author_followers":"interaction.twitter.user.followers_count"}'
)
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
)


Configuring ODI12c Topology and Models


HDFS data servers (source) defined using generic File technology
Workaround to support IKM Hive Control Append
Leave JDBC driver blank, put HDFS URL in JDBC URL field


Defining Topology and Model for Hive Sources


Hive supported out-of-the-box with ODI12c (but requires ODIAAH license for KMs)
Most recent Hadoop distributions use HiveServer2 rather than HiveServer
Need to ensure JDBC drivers support Hive version
Use correct JDBC URL format (jdbc:hive2://)


HDFS & Hive Model and Datastore Definitions


HDFS files for incoming log data, and any other input data
Hive tables for ETL targets and downstream processing
Use RKM Hive to reverse-engineer column definition from Hive


Hive and MongoDB


MongoDB Hadoop connector provides a storage handler for Hive tables
Rather than store its data in HDFS, the Hive table uses MongoDB for storage instead
Define in SerDe properties the Collection elements you want to access, using dot notation
https://github.com/mongodb/mongo-hadoop
CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE
  'com.mongodb.hadoop.hive.BSONSerDe'
STORED BY
  'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping'='{"interactionId":"interactionId",
    "username":"interaction.interaction.author.username",
    "content":"interaction.interaction.content",
    "author_followers":"interaction.twitter.user.followers_count"}'
)
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
)


Adding MongoDB Datasets into the ODI Repository


Define Hive table outside of ODI, using MongoDB storage handler
Select the document elements of interest, project into Hive columns
Add Hive source to Topology if needed, then use Hive RKM to bring in column metadata


1 Load Incoming Log Files into Hive Table


First step in process is to load the incoming log files into a Hive table
Also need to parse the log entries to extract request, date, IP address etc columns
Hive table can then easily be used in
downstream transformations
Use IKM File to Hive (LOAD DATA) KM
Source can be local files or HDFS
Either load file into Hive HDFS area,
or leave as external Hive table
Ability to use SerDe to parse file data


Specifying a SerDe to Parse Incoming Hive Data


SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
Distributed as JAR file, gives Hive ability to parse semi-structured formats
We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns
Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option


2 Join to Additional Hive Tables, Transform using HiveQL


IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, agg. etc.
INSERT only, no DELETE, UPDATE etc
Not all ODI12c mapping operators supported, but basic functionality works OK
Use this KM to join to other Hive tables,
adding more details on post, title etc
Perform DISTINCT on join output, load
into summary Hive table


Joining Hive Tables


Only equi-joins supported
Must use ANSI syntax
More complex joins may not produce
valid HiveQL (subqueries etc)
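As a sketch of the HiveQL the KM needs to generate (table and column names are illustrative), an ANSI-syntax equi-join looks like:

```sql
-- Illustrative only: join page-view rows to post reference data
SELECT v.host, v.request, p.post_title, p.post_author
FROM   pageviews v
JOIN   posts p
  ON   v.post_id = p.post_id;
```

A range predicate such as BETWEEN in the ON clause would not be valid HiveQL, which is one reason the GeoIP lookup later in the deck is pushed to Big Data SQL instead.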


Filtering, Aggregating and Transforming Within Hive


Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping
operators can be added to mapping to manipulate data
Generates HiveQL functions, clauses etc


Loading Hive Tables - Streaming CDC using GoldenGate


Hive tables in Hadoop cluster can be loaded from file and other sources
For streaming updates from RDBMS sources, Oracle GoldenGate for Big Data can be used
RDBMS to Hive
RDBMS to HDFS
RDBMS to Flume
Can combine with Flume,
Kafka etc for real-time
Hive/HDFS data sync


3 Bring in Reference Data from Oracle Database


In this third step, additional reference data from Oracle Database needs to be added
In theory, should be able to add Oracle-sourced datastores to mapping and join as usual
But Oracle and generic JDBC LKMs don't work with Hive


Options for Importing Oracle / RDBMS Data into Hadoop


Could export RDBMS data to file, and load using IKM File to Hive
Oracle Big Data Connectors only export to Oracle, not import to Hadoop
Best option is to use Apache Sqoop, and new
IKM SQL to Hive-HBase-File knowledge module
Hadoop-native, automatically runs in parallel
Uses native JDBC drivers, or OraOop (for example)
Bi-directional in-and-out of Hadoop to RDBMS
Run from OS command-line
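Run from the OS command line, the equivalent Sqoop call might look like this (connection details, credentials and table names are hypothetical):

```shell
# Hypothetical example: parallel import of an Oracle reference table into Hive
sqoop import \
  --connect jdbc:oracle:thin:@exadata1:1521/orcl \
  --username BLOG_REFDATA --password welcome1 \
  --table POST_DETAILS \
  --hive-import --hive-table blog_refdata.post_details \
  --num-mappers 4
```

ODI generates the equivalent invocation when the IKM SQL to Hive-HBase-File (Sqoop) module runs; --num-mappers controls the degree of parallel extract.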


Loading RDBMS Data into Hive using Sqoop


First step is to stage Oracle data into equivalent Hive table
Use special LKM SQL Multi-Connect Global load knowledge module for Oracle source
Passes responsibility for load (extract) to following IKM
Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table


Join Oracle-Sourced Hive Table to Existing Hive Table


Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal
Filters, aggregation operators etc can be added to mapping if required
Use IKM Hive Control Append as integration KM


Adding the Twitter Data from MongoDB


Previous steps exposed the Twitter data, in MongoDB, through a Hive table
RM_RELATED_TWEETS Hive table now included in ODI Topology + Model

hive> describe rm_related_tweets;
OK
id                      string    from deserializer
interactionid           string    from deserializer
username                string    from deserializer
author_name             string    from deserializer
created_at              string    from deserializer
content                 string    from deserializer
twitter_link            string    from deserializer
language                string    from deserializer
sentiment               int       from deserializer
author_tweet_count      int       from deserializer
author_followers_co     int       from deserializer
author_profile_img_u    string    from deserializer
author_timezone         string    from deserializer
normalized_url          string    from deserializer
mentions                string    from deserializer
hashtags                string    from deserializer
Time taken: 0.109 seconds, Fetched: 16 row(s)


Filter Log Entries to Only Leave Blog Post Views


We're only interested in Twitter activity around blog posts
Create an additional Hive table to contain just blog post views, to then join to tweets
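The generated HiveQL is roughly of this shape (column names follow the apachelog table defined earlier; the target table name and filter predicate are illustrative):

```sql
-- Keep only successful requests for blog post pages
INSERT OVERWRITE TABLE blog_post_views
SELECT host, time, request, status, referer, agent
FROM   apachelog
WHERE  status = '200'
AND    request LIKE '%/blog/%';
```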


Filter Tweets Down to Just Those Linking to RM Website


Filter the list of tweets down to just those that link to RM blog


Join Twitter Data to Page View Data


Create summary table showing twitter activity per page on the blog


Create ODI Package for Processing Steps, and Execute


Create ODI Package or Load Plan to run steps in sequence
With load plan, can also add exceptions and recoverability
Execute package to load data into final Hive tables


Summary : Data Processing Phase


We've now processed the incoming data, filtering it and transforming it to the required state
Joined (mashed-up) datasets from website activity, and social media mentions
Ingestion and the load/processing stages are now complete
Now we want to make the Hadoop
output available to a wider,
non-technical audience


Step 3
Publishing & Analyzing Data


New in OBIEE 11.1.1.7 : Hadoop Connectivity through Hive


MapReduce jobs are typically written in Java, but Hive can make this simpler
Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically
creates MapReduce jobs against data previously loaded into the Hive HDFS tables
Approach used by ODI and OBIEE to gain access to Hadoop data
Allows Hadoop data to be accessed just like any other data source


Importing Hadoop/Hive Metadata into RPD


HiveODBC driver has to be installed into Windows environment, so that
BI Administration tool can connect to Hive and return table metadata
Import as ODBC datasource, change physical DB type to Apache Hadoop afterwards
Note that OBIEE queries cannot span >1 Hive schema (no table prefixes)


Set up ODBC Connection at the OBIEE Server


OBIEE 11.1.1.7+ ships with HiveODBC drivers, but you need to use the 7.x versions
(only Linux is supported)
Configure the ODBC connection in odbc.ini, name needs to match RPD ODBC name
BI Server should then be able to connect to the Hive server, and Hadoop/MapReduce
[ODBC Data Sources]
AnalyticsWeb=Oracle BI Server
Cluster=Oracle BI Server
SSL_Sample=Oracle BI Server
bigdatalite=Oracle 7.1 Apache Hive Wire Protocol
[bigdatalite]
Driver=/u01/app/Middleware/Oracle_BI1/common/ODBC/Merant/7.0.1/lib/ARhive27.so
Description=Oracle 7.1 Apache Hive Wire Protocol
ArraySize=16384
Database=default
DefaultLongDataBuffLen=1024
EnableLongDataBuffLen=1024
EnableDescribeParam=0
Hostname=bigdatalite
LoginTimeout=30
MaxVarcharSize=2000
PortNumber=10000
RemoveColumnQualifiers=0
StringDescribeType=12
TransactionMode=0
UseCurrentSchema=0


OBIEE 11.1.1.7 / HiveServer2 ODBC Driver Issue


Most customers using BDAs are using CDH4 or CDH5 - which uses HiveServer2
OBIEE 11.1.1.7 only ships/supports HiveServer1 ODBC drivers
But OBIEE 11.1.1.7 on Windows can use the Cloudera HiveServer2 ODBC drivers
which aren't supported by Oracle
but work!


Using Big Data SQL with OBIEE11g


Preferred Hadoop access path for customers with Oracle Big Data Appliance is Big Data SQL
Oracle SQL Access to both relational, and Hive/NoSQL data sources
Exadata-type SmartScan against Hadoop datasets
Response-time equivalent to Impala or Hive on Tez
No issues around HiveQL limitations
Insulates end-users around differences
between Oracle and Hive datasets


Using Oracle Big Data SQL to Federate Hive with Oracle


Big Data SQL can also be used to add, on-the-fly, additional Oracle reference data to Hive
For example : geoip country lookup
requiring non-equi join
Additional dimensions not present in Hive
Calculated measures, derived attributes
using Oracle SQL
Hive Weblog Activity table

Oracle Dimension lookup tables

Combined output
in report form


Big Data SQL Example : Add Geolookup Data to Hive


Hive ACCESS_PER_POST_CATEGORIES table contains Host IP addresses
Ideally would like to translate into country, city etc
See readership by country
Match topics of interest to country, city


How GeoIP Geocoding Works


Uses free Geocoding API and database from Maxmind
Convert IP address to an integer
Find which integer range our IP address sits within
But Hive can't use BETWEEN in a join
Solution : Expose ACCESS_PER_POST_CATEGORIES using Big Data SQL, then join
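The conversion itself is simple positional arithmetic — each octet is weighted by a power of 256. For example, in Oracle SQL, with an arbitrary sample address:

```sql
-- 81.2.69.160 -> (81 * 256^3) + (2 * 256^2) + (69 * 256) + 160
SELECT (81 * 16777216) + (2 * 65536) + (69 * 256) + 160 AS ip_integer
FROM dual;
-- returns 1359103392
```

The Maxmind ranges are held as start/end integers, so the lookup becomes a BETWEEN predicate on this value.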


Create ORACLE_HIVE version of Hive Table, Join to GeoIP

access_per_post_categories.ip_integer
BETWEEN geoip_country.start_ip_int
AND geoip_country.end_ip_int
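A sketch of the external table DDL (the column list is hypothetical; the ACCESS PARAMETERS clause follows the standard Big Data SQL ORACLE_HIVE syntax, and the geoip_country country column name is assumed), followed by the range join Hive itself could not perform:

```sql
CREATE TABLE access_per_post_categories (
  hostname     VARCHAR2(100),
  request_date VARCHAR2(40),
  post_title   VARCHAR2(200),
  ip_integer   NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename=default.access_per_post_categories
  )
);

SELECT a.post_title, g.country_name, COUNT(*) AS page_views
FROM   access_per_post_categories a
JOIN   geoip_country g
  ON   a.ip_integer BETWEEN g.start_ip_int AND g.end_ip_int
GROUP BY a.post_title, g.country_name;
```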


Import Oracle Tables, Create RPD joining Tables Together


No need to use Hive ODBC drivers - Oracle OCI connection instead
No issue around HiveServer1 vs HiveServer2; also Big Data SQL handles authentication
with Hadoop cluster in background, Kerberos etc
Transparent to OBIEE - all appear as Oracle tables
Join across schemas if required


Create Physical Data Model from Imported Table Metadata


Join ORACLE_HIVE external table containing log data, to reference tables from Oracle DB


Create Business Model and Presentation Layers


Map incoming physical tables into a star schema
Add aggregation method for fact measures
Add logical keys for logical dimension tables
Aggregate (Count) on page ID fact column
Remove columns from fact table
that aren't measures


Create Initial Analyses Against Combined Dataset


Create analyses using
full SQL features
Access to Oracle RDBMS
Advanced Analytics functions
through EVALUATE,
EVALUATE_AGGR etc
Big Data SQL SmartScan feature
provides fast, ad-hoc access
to Hive data, avoiding MapReduce


Create Views over Fact Table To Support Time-Series Analysis


Time field comes in from Hive as string (date and time, plus timezone)
Create view over base ORACLE_HIVE external table to break-out date, time etc
Import time dimension table from Oracle to support date/time analysis, forward forecasting
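Assuming the Hive time field arrives in Apache log format (e.g. 21/Mar/2015:06:25:31 +0000), a view along these lines breaks it out (view and column names are illustrative):

```sql
CREATE OR REPLACE VIEW v_access_per_post AS
SELECT t.*,
       TO_DATE(SUBSTR(request_date, 1, 11), 'DD/MON/YYYY') AS request_day,
       SUBSTR(request_date, 13, 8)                         AS request_time
FROM   access_per_post_categories t;
```

The derived request_day column can then join to the imported Oracle time dimension for time-series and forecasting analyses.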


Now Enables Time-Series Reporting Incl. Country Lookups


Use Exalytics In-Memory Aggregate Cache if Required


If further query acceleration is required, Exalytics In-Memory Cache can be used
Enabled through Summary Advisor, caches commonly-used aggregates in in-memory cache
Options for TimesTen or Oracle Database 12c In-Memory Option
Returns aggregated data at the speed of thought


Step 4
Enabling for Data Discovery


Enable Incoming Site Activity Data for Data Discovery


Another use-case for Hadoop data is data discovery
Load data into the data reservoir
Catalog and understand separate datasets
Enrich data using graphical tools
Join separate datasets together
Present textual data alongside measures
and key attributes
Explore and analyse using faceted search


Combine with site content, semantics, text enrichment


Catalog and explore using Oracle Big Data Discovery

Why is some content more popular?


Does sentiment affect viewership?
What content is popular, where?

Oracle Big Data Discovery


The Visual Face of Hadoop - cataloging, analysis and discovery for the data reservoir
Runs on Cloudera CDH5.3+ (Hortonworks support coming soon)
Combines Endeca Server + Studio technology with Hadoop-native (Spark) transformations


Data Sources used for Project : Stage 2


(Architecture diagram) Cloudera CDH5.3 BDA Hadoop cluster: three nodes, each running BDD Data Processing over Spark, Hive and HDFS; plus a separate BDD node running the Studio web UI and the DGraph Gateway, with HDFS and Hive clients.
Data flows:
Ingest semi-processed logs (1m rows)
Ingest processed Twitter activity
Upload site page and comment contents; persist uploaded content in Hive / HDFS
Write-back transformations to full datasets
Data discovery using Studio web-based app


Oracle Big Data Discovery Architecture


Adds additional nodes into the CDH5.3 cluster, for running DGraph and Studio
DGraph engine based on Endeca Server technology, can also be clustered
Hive (HCatalog) used for reading table metadata, mapping back to underlying HDFS files
Apache Spark then used to upload (ingest) data into DGraph, typically a 1m row sample
Data then held for online analysis in DGraph
Option to write back transformations to underlying Hive/HDFS files using Apache Spark
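Drawing a fixed-size random sample in a single pass over data of unknown total size is a standard technique (reservoir sampling, Algorithm R). BDD's actual Spark sampling implementation is not documented here, so the sketch below is illustrative only:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Draw a uniform random sample of up to k rows in one pass,
    without knowing the total row count in advance (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # each later row replaces with prob k/(i+1)
            if j < k:
                reservoir[j] = row
    return reservoir

# e.g. capping an arbitrarily large iterator at a fixed sample size:
sample = reservoir_sample(range(10_000_000), k=5, seed=42)
print(sample)
```

The key property: memory use depends only on the sample size k (here, 1m rows), never on the size of the full dataset being scanned.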


Ingesting & Sampling Datasets for the DGraph Engine


Datasets in Hive have to be ingested into DGraph engine before analysis, transformation
Can either define an automatic Hive table detector process, or manually upload
Typically ingests 1m row random sample
1m row sample provides > 99% confidence that answer is within 2% of value shown, no matter how big the full dataset (1m, 1b, 1q+)
Makes interactivity cheap - representative dataset
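A back-of-envelope check of that claim, using the normal approximation for a proportion estimated from a simple random sample (the slide's exact statistical basis isn't stated, so treat this as illustrative):

```python
import math

def ci_half_width(p, n, z=2.576):
    """99% confidence-interval half-width for a proportion p estimated
    from a simple random sample of n rows (normal approximation).
    Note it depends only on the sample size n, not on the size of the
    full dataset - which is why a fixed 1m-row sample stays
    representative at any scale."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) with a 1m-row sample:
print(ci_half_width(0.5, 1_000_000))  # ~0.0013, i.e. about 0.13 percentage points
```

Even in the worst case the absolute error is well inside the 2% bound quoted, which is what makes the sample-based DGraph interactivity cheap.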
(Chart: cost vs. accuracy as the amount of data queried grows - the '100%' premium)


Ingesting Site Activity and Tweet Data into DGraph


Two output datasets from ODI process have to be ingested into DGraph engine
Upload triggered by manual call to BDD Data Processing CLI
Runs Oozie job in the background to profile, enrich and then ingest data into DGraph

Pipeline: Hive (pageviews, >1m rows) -> Apache Spark (Profiling -> Enrichment) -> ingest into BDD

{
  "@class" : "com.oracle.endeca.pdi.client.config.workflow.ProvisionDataSetFromHiveConfig",
  "hiveTableName" : "rm_linked_tweets",
  "hiveDatabaseName" : "default",
  "newCollectionName" : "edp_cli_edp_a5dbdb38-b065",
  "runEnrichment" : true,
  "maxRecordsForNewDataSet" : 1000000,
  "languageOverride" : "unknown"
}

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t pageviews
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t rm_linked_tweets



View Ingested Datasets, Create New Project


Ingested datasets are now visible in Big Data Discovery Studio
Create new project from first dataset, then add second


Automatic Enrichment of Ingested Datasets


Ingestion process has automatically geo-coded host IP addresses
Other automatic enrichments run after initial discovery step, based on datatypes, content


Datatype Conversion Example : String to Date / Time


Datatypes can be converted into other datatypes, with data transformed if required
Example : convert Apache Combined Format Log date/time to Java date/time


Other Transformation Types, Filtering and Grouping


Group and bin attribute values; filter on attribute values, etc
Use Transformation Editor for custom transformations (Groovy, incl. enrichment functions)
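The Transformation Editor itself uses Groovy; as a language-neutral sketch of the same group/bin/filter idea, here it is in Python (the attribute names `views` and `agent` are hypothetical):

```python
def bin_page_views(records):
    """Bin a numeric attribute into named bands and filter on another
    attribute's value - the kind of group/bin/filter step done in the
    BDD Transformation Editor. Field names are hypothetical."""
    def bucket(v):
        if v < 100:
            return "low"
        elif v < 1000:
            return "medium"
        return "high"

    return [
        {**r, "views_band": bucket(r["views"])}   # derived, binned attribute
        for r in records
        if "bot" not in r.get("agent", "").lower()  # filter out bot traffic
    ]

rows = [
    {"page": "/blog/a", "views": 50,   "agent": "Mozilla"},
    {"page": "/blog/b", "views": 5000, "agent": "Googlebot"},
    {"page": "/blog/c", "views": 300,  "agent": "Safari"},
]
print(bin_page_views(rows))
```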


Commit Transforms to DGraph, or Create New Hive Table


Transformation changes have to be committed to DGraph sample of dataset
Project transformations kept separate from other project copies of dataset
Transformations can also be applied to full dataset, using Apache Spark
Creates new Hive table of complete dataset


Upload Additional Datasets


Users can upload their own datasets into BDD, from MS Excel or CSV file
Uploaded data is first loaded into Hive table, then sampled/ingested as normal



Join Datasets On Common Attributes


Creates left-outer join between primary and secondary dataset
Add metrics (sum of page visits, for example) to activity dataset (e.g. tweets)
Join more attributes to main dataset
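The join behaviour can be sketched as a plain left-outer join over record lists — a minimal illustration, with hypothetical attribute names, not BDD's actual implementation:

```python
def left_outer_join(primary, secondary, key):
    """Left-outer join two datasets on a common attribute, as when joining
    a secondary dataset onto the primary one in BDD Studio. Unmatched
    primary rows are kept, with no secondary attributes added."""
    lookup = {row[key]: row for row in secondary}
    joined = []
    for row in primary:
        match = lookup.get(row[key], {})
        joined.append({**match, **row})   # primary values win on name clashes
    return joined

tweets = [{"author": "mark", "tweets": 12}, {"author": "jon", "tweets": 4}]
visits = [{"author": "mark", "page_visits": 900}]
print(left_outer_join(tweets, visits, "author"))
```

Because it is a left-outer join, every row of the primary (tweet activity) dataset survives, and page-visit metrics appear only where the author matched.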


Create Discovery Pages for Dataset Analysis


Select from palette of visualisation components
Select measures, attributes for display


Select From Multiple Visualisation Types


Faceted Search Across Project


Search on attribute values, text in attributes across all datasets
Extracted keywords, free text field search


Key BDD Studio Differentiator : Faceted Search Across Hadoop


BDD Studio dashboards support faceted search across all attributes, refinements
Auto-filter dashboard contents on selected attribute values
Fast analysis and summarisation through Endeca Server technology

(Screenshot callout: Mark Rittman selected from Post Authors)


(Screenshot callout: Results filtered on selected refinement)

Key BDD Studio Differentiator : Faceted Search Across Hadoop


BDD Studio dashboards support faceted search across all attributes, refinements
Auto-filter dashboard contents on selected attribute values - for data discovery
Fast analysis and summarisation through Endeca Server technology


(Screenshot callout: Further refinement on OBIEE in post keywords)


(Screenshot callout: Results now filtered on two refinements)

Summary
Oracle Big Data, together with OBIEE, ODI and Oracle Big Data Discovery
Complete end-to-end solution with engineered hardware, and Hadoop-native tooling

Combine with Oracle Big Data SQL for structured OBIEE dashboard analysis


Combine with site content, semantics, text enrichment - catalog and explore using Oracle Big Data Discovery

Thank You for Attending!


Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com
Look out for our book, Oracle Business Intelligence Developer's Guide - out now!
Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)

