Big Data Essentials: Activity Guide
Table of Contents
Practices for Lesson 1: Introduction
Practices for Lesson 1
Practices for Lesson 2: Overview of Big Data
Practice 2-1: Overview of Big Data
Solution 2-1: Overview of Big Data
Practices for Lesson 3: Understanding the Oracle Big Data Solution
Practice 3-1: Understanding the Oracle Big Data Solution
Solution 3-1: Understanding the Oracle Big Data Solution
Practices for Lesson 4: Using Oracle Big Data Appliance
Practice 4-1: Using Oracle Big Data Appliance
Solution 4-1: Using Oracle Big Data Appliance
Practices for Lesson 5: Data Acquisition Options in BDA
Practice 5-1: Data Acquisition Options in BDA
Solution 5-1: Data Acquisition Options in BDA
Practices for Lesson 6: Using the Hadoop Distributed File System
Practice 6-1: Loading JSON Data into HDFS
Solution 6-1: Loading JSON Data into HDFS
Practices for Lesson 7: Using Flume in HDFS
Practice 7-1: Introduction to Flume
Solution 7-1: Introduction to Flume
Practices for Lesson 8: Using Oracle NoSQL Database
Practice 8-1: Using KVLite
Solution 8-1: Using KVLite
Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Solution 8-2: Loading Movie Data into Oracle NoSQL Database
Practices for Lesson 9: Using Hive
Practice 9-1: Manipulating Data with Hive
Solution 9-1: Manipulating Data with Hive
Practice 9-2: Extracting Facts by Using Hive
Solution 9-2: Extracting Facts by Using Hive
Practices for Lesson 10: Introduction to Oracle Big Data Connectors
Practices for Lesson 10
Practices for Lesson 11: Using Oracle Loader for Hadoop
Practice 11-1: Loading Session Data into Oracle Database
Solution 11-1: Loading Session Data into Oracle Database
Practices for Lesson 12: Using Oracle SQL Connector for HDFS
Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Practices for Lesson 13
Practices for Lesson 14: Using Oracle R Connector for Hadoop
Practice 14-1: Working with Data in HDFS and Oracle Database
Solution 14-1: Working with Data in HDFS and Oracle Database
Practice 2-1: Overview of Big Data
Questions
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are ___________, ___________, __________, and
__________.
c. Fill in the blank column:

   Today's challenge                     New data        What's possible?
   Retail: One-size-fits-all marketing   Social media    ?

d. Will Big Data replace data warehouses and databases?
e. If you are a business analyst, what kinds of skills do you need if you want to use Big Data?
Solution 2-1: Overview of Big Data
Answers
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are volume, value, variety, and velocity.
c. Fill in the blank column:

   Today's challenge                     New data        What's possible?
   Retail: One-size-fits-all marketing   Social media    Sentiment analysis and segmentation

Note: Sentiment analysis is a statistical model that is used to gauge the sentiment toward a brand, a service, or any other subject of interest by analyzing the semantics of lexical or nonlexical text. It is a weighted, score-based analysis.
d. Will Big Data replace data warehouses and databases?
No.
Big Data needs data warehouses and databases to store structured data for further
analysis. Big Data should be combined with traditional systems to obtain a better
solution.
e. If you are a business analyst, what kinds of skills do you need if you want to use Big
Data?
• Ability to map Big Data with organizational goals
• Skills to extract massive amounts of data
• Good leadership skills to make business decisions
Practice 3-1: Understanding the Oracle Big Data Solution
Questions
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False
c. Oracle Database is used in the Acquire phase.
• True
• False
Match the following:
   Big Data             Storage System Used
   Unstructured data    Oracle NoSQL Database
   Schema-based data    HDFS
   Schema-less data     Oracle Database
Solution 3-1: Understanding the Oracle Big Data Solution
Answers
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above (correct)
Explanation
1) Deep analytics is a fully parallel, extensive, and extensible toolbox full of advanced
and novel statistical and data-mining capabilities.
2) High agility is the ability to create temporary analytics environments in an end-
user-driven yet secure and scalable environment to deliver new and novel insights
into the operational business.
3) Massive scalability is the ability to scale analytics and sandboxes to previously
unknown sizes while leveraging previously untapped data potential.
4) Low latency is the ability to instantly act based on these advanced analytics in your
operational production environments.
All these features are integrated in the Oracle Big Data solution.
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False (correct)
Explanation
Oracle R Connector is the only connector that is used in the analysis phase of the
Oracle Big Data solution.
c. Oracle Database is used in the Acquire phase.
• True
• False
Practice 4-1: Using Oracle Big Data Appliance
Questions
a. How much usable storage is provided per rack?
b. How much usable storage is provided per rack for Oracle NoSQL Database?
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
d. How is InfiniBand used?
e. There are 18 nodes in the machine. How are the nodes used?
f. What is the difference between Big Data Appliance and an Exadata system/expansion
rack?
g. How does an administrator monitor Big Data Appliance?
h. Why should you buy this machine from Oracle? How does its cost compare to the cost
of building other systems?
Solution 4-1: Using Oracle Big Data Appliance
Answers
a. How much usable storage is provided per rack?
This depends on the software configuration.
Raw storage = 648 TB (18 nodes × 12 disks × 3 TB). By default, Hadoop runs on a
triple-replication scheme, so 648 TB of raw storage delivers over 200 TB of usable
space.
b. How much usable storage is provided per rack for Oracle NoSQL Database?
This depends on the software configuration. For NoSQL Database, we preconfigure
the set of masters and replicas on BDA. A single-rack BDA has 6 master nodes and 12
replicas. Each master has a 6 TB maximum.
So there can be 6 masters × 6 TB = 36 TB of storage in NoSQL Database on a single
rack. Each additional rack adds 36 TB of storage.
Because of triple replication, a NoSQL Database gets 3 * 36 TB = 108 TB of data
storage on BDA at full capacity.
Note that these numbers will change when Oracle NoSQL Database v.2 is released.
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
You cannot add SAN or NAS devices to scale storage. Hadoop is designed for directly
attached disks, and you scale both compute and storage at the same time by adding
nodes to the Hadoop cluster.
d. How is InfiniBand used?
Hadoop uses an enormous amount of network bandwidth by design because all nodes
continuously talk to each other and data is moved among nodes in the shuffle phase of
running a MapReduce program. The most common scaling problem in Hadoop clusters
is network saturation.
For large clusters, 1 Gb Ethernet requires difficult tuning (data locality is extremely
important to avoid excessive shuffling of data) and configuration. At 40 Gb/sec,
InfiniBand provides unprecedented scaling and performance for Hadoop, and it
provides extremely fast connectivity to other Oracle machines such as Exadata via Big
Data Connectors.
e. There are 18 nodes in the machine. How are the nodes used?
Node 1 is the Hadoop Name Node and the HBase master.
Node 2 runs ZooKeeper and Cloudera Manager.
Node 3 runs the JobTracker, a MySQL database for Cloudera Manager and Hive (and
an ODI Agent). These three nodes are also configured as data nodes by default; this
feature can be turned off if you add additional racks.
The remaining 15 nodes are data nodes for Hadoop.
If you hook up another rack, the 18 new nodes are data nodes that join the cluster.
Because Hadoop is designed for scale out, adding nodes is very easy. Data
automatically rebalances itself in Hadoop.
Practice 5-1: Data Acquisition Options in BDA
Questions
a. A NoSQL database typically does not support traditional SQL database queries.
• True
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
NoSQL Database RDBMS
Solution 5-1: Data Acquisition Options in BDA
Answers
a. A NoSQL database typically does not support traditional SQL database queries.
• True (correct)
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False (correct)
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
NoSQL: customer profiles, time stamp
HDFS: weblogs, reports, logs, time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
NoSQL Database:
2) Schema is application-defined.
4) Stores low-value data
5) Data can be structured or unstructured.

RDBMS:
1) Stores high-value data
3) Data is highly structured.
6) Schema is self-describing.
Practice 6-1: Loading JSON Data into HDFS
Tasks
Note: Run the commands from the /home/oracle/movie/moviework/mapreduce directory.
You may see a difference in output with respect to the number of files/items displayed.
1. Start the terminal window, and run the reset_mapreduce.sh script to reset the practice
directory.
cd /home/oracle/movie/moviework/reset
./reset_mapreduce.sh
2. Review the commands that are available for the Hadoop Distributed File System. You will
find that its composition is similar to your local Linux file system. You will use the hadoop
fs command when interacting with HDFS.
cd /home/oracle/movie/moviework/mapreduce
hadoop fs
3. List the contents of /user/oracle.
hadoop fs -ls /user/oracle
4. Create a subdirectory called my_stuff in the /user/oracle folder, and then ensure that
the directory has been created.
hadoop fs -mkdir /user/oracle/my_stuff
hadoop fs -ls /user/oracle
5. Remove the my_stuff directory and then ensure that it has been removed.
hadoop fs -rmr my_stuff
hadoop fs -ls
6. Inspect the compressed JSON application log.
cd /home/oracle/movie/moviework/mapreduce
zcat movieapp_3months.log.gz | head
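Each line of the log is a single JSON click record, similar in shape to the following (the values here are illustrative only; the field names match the columns used in the Lesson 9 Hive practices):
{"custid":1185972,"movieid":256,"genreid":6,"time":"2012-07-01:00:00:07","recommended":"Y","activity":2,"price":2.99}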
7. Load a file into the HDFS from the local file system.
Specifically, you load a JSON log file that tracked activity in an online movie application. The
JSON data represents individual clicks from an online movie rental site. You use the basic
put commands for moving data into the HDFS.
Review the commands available for the HDFS, and then copy the gzipped file to the HDFS.
hadoop fs
hadoop fs -put movieapp_3months.log.gz /user/oracle/moviework/applog
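You can confirm the upload by listing the target directory and checking that the file appears:
hadoop fs -ls /user/oracle/moviework/applog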
Practice 7-1: Introduction to Flume
Tasks
1. Start the Firefox browser and open Cloudera Manager (which is bookmarked for you).
2. Log in to Cloudera Manager:
Username: admin
Password: welcome1
3. You can view the status of the Hadoop services running on BDA.
Right-click and start Flume if it is in the stopped state.
4. Open a terminal window and view the command options in Flume.
$ flume-ng help
5. Review the configuration file for the MoviePlex application.
$ cd /home/oracle/movie/moviedemo/scripts
$ more flume.conf
6. Review the agent file for the MoviePlex application.
$ more flume_movieagent.sh
Practice 8-1: Using KVLite
Tasks
1. Open a terminal window.
2. There are three Oracle NoSQL Database–specific environment variables. KVHOME is
where binaries are installed, KVROOT is where data files and config files are saved, and
KVDEMOHOME is where the source of the practice project is saved.
echo $KVROOT
echo $KVHOME
echo $KVDEMOHOME
3. Make sure $KVROOT does not exist already.
rm -rf $KVROOT
4. Start KVLite from the current working directory:
java -jar $KVHOME/lib/kvstore.jar kvlite -host localhost -root $KVROOT
Look for a response similar to what is listed below. Minimize the window (leave it running).
java -jar $KVHOME/lib/kvstore-2.*.jar kvlite -root $KVROOT
Created new kvlite store with args: -root /u02/kvroot -store kvstore -host localhost -port 5000 -admin 5001
5. Open a new tab in the terminal window and start an admin session:
java -jar $KVHOME/lib/kvstore.jar runadmin -host localhost -port 5000
You should be logged in to the KV shell.
6. Register the customer schema:
kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/customer.avsc
7. Register the remaining five schemas from the same schemas directory in the same way.
8. Run the show schemas command to make sure all six schemas are registered.
kv-> show schemas
Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Tasks
1. Open a new terminal window and explore the resetDemo.sh script in the scripts folder:
cd /home/oracle/movie/moviedemo/nosqldb/scripts
more resetDemo.sh
2. Run the resetDemo.sh script to reset the MoviePlex demo.
./resetDemo.sh
This script deletes the existing KVROOT directory and log files, and stops the running server processes.
3. Explore the startDemo.sh script in the scripts folder:
more startDemo.sh
4. Run startDemo.sh to start the MoviePlex demo:
./startDemo.sh
This script starts Oracle WebLogic Server and the Oracle NoSQL Server, and loads the
movies into the database.
5. Open the Firefox browser and connect to the Oracle MoviePlex demo application by using
this URL:
http://localhost:7001/bigdatademo-UI-context-root/login.jsp
Note: You can also click the bookmarked Oracle MoviePlex login page.
6. Log in with one of the guest accounts:
Username: guest1/guest2/.../guest100
Password: Welcome1
Note: The home page content varies based on the guestXX login that you enter.
7. You should now be able to view the movies that the startDemo script loaded.
You can click a movie title, view its description, and personalize your favorite movies.
Note: You might not be able to view the movies in the classroom environment because of the
firewall settings.
Practice 9-1: Manipulating Data with Hive
Assumptions
You have successfully completed Practice 6.
Tasks
1. Access the Hive command line by typing hive at the Linux prompt.
$ hive
2. Create a new Hive database called moviework. Ensure that the database has been
successfully created.
hive> create database moviework;
hive> show databases;
3. Specify that you want all DDL and DML operations to apply to a specific database. For
simplicity, you apply subsequent operations to the moviework database:
hive> use moviework;
4. In the moviework database, create a simple external table called movieapp_log_json.
This table will contain a single column called the_record.
hive> CREATE EXTERNAL TABLE movieapp_log_json (
the_record STRING
)
LOCATION '/user/oracle/moviework/applog/';
5. Write a query in the Hive command line that returns the first five rows from the table. After
reviewing the results, drop the table.
hive> SELECT * FROM movieapp_log_json LIMIT 5;
hive> drop table movieapp_log_json;
6. Define a more sophisticated table that parses the JSON file and maps its fields to columns.
To process the JSON fields, you use a popular serializer/deserializer (or SerDes) called
org.apache.hadoop.hive.contrib.serde2.JsonSerde. After creating the table,
review the results by selecting the first 20 rows:
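A sketch of the statement, assuming the column names used by the queries in Practice 9-2 (the course's actual script may differ in column names and types; rating is included here because the next practice works with movie ratings):
hive> CREATE EXTERNAL TABLE movieapp_log_json (
custid INT,
movieid INT,
genreid INT,
time STRING,
recommended STRING,
activity INT,
rating INT,
price FLOAT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/moviework/applog/';
hive> SELECT * FROM movieapp_log_json LIMIT 20;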
Practice 9-2: Extracting Facts by Using Hive
Assumptions
You have successfully completed Practice 9-1.
Tasks
1. Write a query to select only those clicks that correspond to starting, browsing, completing, or
purchasing movies. Use a CASE statement to transform the RECOMMENDED column into
integers, where ‘Y’ is 1 and ‘N’ is 0. Also, ensure that GENREID is not null. Include only the
first 25 rows.
hive> SELECT custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
time,
CASE recommended WHEN 'Y' THEN 1 ELSE 0 END recommended,
activity,
price
FROM movieapp_log_json
WHERE activity IN (2,4,5,11) LIMIT 25;
2. Select the movie ratings made by a user. Also consider the following: What happens if a user rates the same movie multiple times? In this scenario, you should load only the user's most recent movie rating.
In Oracle Database 11g, you can use a windowing function. However, HiveQL does not
provide sophisticated analytic functions. Instead, you must use an inner join to compute the
result.
Note: Joins occur before WHERE clauses. To restrict the output of a join, a requirement
should be in the WHERE clause. Otherwise, it should be in the JOIN clause.
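One plausible formulation of this inner-join approach, assuming (hypothetically) that activity code 1 marks a rating event and that the most recent time value identifies the latest rating:
hive> SELECT m1.custid, m1.movieid, m1.rating
FROM movieapp_log_json m1
JOIN (SELECT custid, movieid, MAX(time) AS latest_time
FROM movieapp_log_json
WHERE activity = 1
GROUP BY custid, movieid) m2
ON (m1.custid = m2.custid
AND m1.movieid = m2.movieid
AND m1.time = m2.latest_time)
WHERE m1.activity = 1;
Note how this follows the join note above: the filter that restricts the subquery lives inside it, and the final restriction on m1 is in the WHERE clause.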
Practice 11-1: Loading Session Data into Oracle Database
Tasks
1. Open a terminal window and clean up the existing HDFS files.
source /home/oracle/movie/moviework/reset/reset_conn.sh
The reset_conn.sh script cleans up any directories and files, and it creates a new HDFS
directory for storing output files.
2. Examine the following scripts:
a. moviesession.xml
b. loaderMap_moviesession.xml
c. runolh_session.sh
cd /home/oracle/movie/moviework/olh
more moviesession.xml
more loaderMap_moviesession.xml
more runolh_session.sh
Note: Press Enter to scroll through the larger scripts.
3. Create the target table in the database where the data needs to be loaded. This table is
hash-partitioned on the cust_id column.
sqlplus moviedemo/welcome1
@moviesession.sql
exit
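For reference, a hash-partitioned table of this kind is declared as in the following sketch. The column list here is hypothetical; the real definition comes from moviesession.sql, and only the PARTITION BY clause is the point:
-- Illustrative only: columns are assumptions, not the course schema
CREATE TABLE movie_sessions_tab (
cust_id       NUMBER,
session_start TIMESTAMP,
session_end   TIMESTAMP,
num_clicks    NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 4;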
4. Run the runolh_session.sh script to invoke the OLH job to load the session data. This starts the MapReduce job, which loads the data into the target table.
sh runolh_session.sh
5. After the MapReduce job is completed, check to see if the rows are loaded.
sqlplus moviedemo/welcome1
select count(*) from movie_sessions_tab;
Note that one row had a parse error. The .bad file containing the row and the error are
logged in the _olh directory under the directory specified in mapred.output.dir.
In this example, the date value is invalid because it is a time between 2:00 AM and 3:00 AM on the day that Daylight Saving Time begins.
Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Overview
In this practice, you use Oracle SQL Connector for HDFS to access the data in:
• Hive tables
• HDFS files
Assumptions
The reset_conn.sh script has already been run to clean up the directories.
Tasks
Accessing Hive Tables by Using OSCH
1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the genloc_moviefact_hive.sh and moviefact_hive.xml files in the /home/oracle/movie/moviework/osch directory.
b. Run the script (type welcome1 when prompted for the password):
sh genloc_moviefact_hive.sh
c. In SQL*Plus, describe and query the new external table movie_fact_ext_tab_hive.
Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml
b. Run the following script.
sh genloc_moviefact_text.sh
You are prompted for the password, which is welcome1.
From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_file.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;
describe movie_fact_ext_tab_file;
d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;
e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here.
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;
Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Accessing Hive Tables by Using OSCH
1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the following files located in the OSCH directory:
cd /home/oracle/movie/moviework/osch
more genloc_moviefact_hive.sh
more moviefact_hive.xml
b. Run the following script.
When prompted for a password, type welcome1.
sh genloc_moviefact_hive.sh
c. Take a look at the external table definition in SQL*Plus:
sqlplus moviedemo/welcome1
describe movie_fact_ext_tab_hive;
d. Try the following queries on this external table, which will query data in the Hive
table:
select count(*) from movie_fact_ext_tab_hive;
select custid from movie_fact_ext_tab_hive where rownum < 10;
e. You can also join the external table with the table in Oracle Database to list movie
titles by custid.
select custid, title from movie_fact_ext_tab_hive p, movie q
where p.movieid = q.movie_id and rownum < 10;
f. The data in the Hive table can be inserted into a database table by using SQL.
create table movie_fact_local as select * from
movie_fact_ext_tab_hive;
exit;
Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml
b. Run the following script.
sh genloc_moviefact_text.sh
You will be prompted for the password, which is welcome1.
From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_file.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;
describe movie_fact_ext_tab_file;
d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;
e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here:
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;
Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Practices for Lesson 14: Using Oracle R Connector for Hadoop
Practice 14-1: Working with Data in HDFS and Oracle Database
Tasks
1. Change directory and start R:
$ cd /home/oracle/movie/moviework/advancedanalytics
$ R
2. Load the Oracle R Enterprise (ORE) library and connect to the Oracle database, and then
list the contents of the database to test the connection.
Note: If a table contains columns with unsupported data types, a warning message is
returned. If you are connected, you can just invoke ore.ls().
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
ore.ls()
3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory,
and list the directory contents in Hadoop Distributed File System (HDFS). Change directory
in HDFS and view the contents there:
library(ORCH)
hdfs.pwd()
hdfs.ls()
hdfs.cd("/user/oracle/moviework/advancedanalytics/data")
hdfs.ls()
4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look
at the first few rows of each table, and get the table dimensions:
ore.sync("MOVIEDEMO","MOVIE_FACT")
MF <- MOVIE_FACT
names(MF)
head(MF,3)
dim(MF)
names(MOVIE_GENRE)
head(MOVIE_GENRE,3)
dim(MOVIE_GENRE)
Assumptions
None
Tasks
1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session:
mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
mg.dfs
hdfs.describe(mg.dfs)
2. Specify dry run mode, and then execute the MapReduce job that partitions the data based
on genre_id and counts the number of movies in each genre.
Note: You receive debug output while in dry run mode.
orch.dryrun(T)
res.dryrun <- NULL
res.dryrun <- hadoop.run(mg.dfs, mapper = function(key, val) {
orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
},
reducer = function(key, vals) {
count <- nrow(vals)
orch.keyval(NULL, key, count)
} ,
config = new("mapred.config", map.output = data.frame(key=0, val=0),
reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)
3. Retrieve the result of the Hadoop job, which is stored as an HDFS file.
Note: Because this is dry run mode, not all data may be used. As a result, only a subset of
results may be returned.
hdfs.get(res.dryrun)
Tasks
1. Start the R Console.
$ R
2. Load the ORE packages and connect to the moviedemo schema in the database with the
SID orcl on localhost and the password welcome1. Specify that table and view
metadata should be synchronized, and specify that tables are accessible in the current R
environment.
Note: You receive one or more warning messages if tables contain data types that are not
recognized by ORE.
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
3. Perform the following steps:
a. View the contents of the database schema:
ore.ls()
b. Determine if the CUSTOMER_V table exists:
ore.exists("CUSTOMER_V")
c. See that the table is an ore.frame (an R object proxy for the table in the
database):
class(CUSTOMER_V)
d. Determine the table’s dimensions and summary statistics. The dimensions and
summary are computed in the database with only the results being retrieved:
dim(CUSTOMER_V)
names(CUSTOMER_V)
e. Answer the following questions about our customers:
i. Which gender (male or female) is better represented?
ii. Is the customer base skewed toward young customers or old customers?
iii. Are customers highly educated?
summary(CUSTOMER_V[,c("GENDER","INCOME","AGE","EDUCATION")])
7. For customers, are there correlations between age, income, and the number of years since first becoming a customer? In this example, we produce a “pairs” plot, which shows a scatterplot for each pair of columns. From the CUSTOMER_V table, we sample 20% of the customers and select the columns AGE, INCOME, and YRS_CUSTOMER. With the pairs function, we not only produce the scatterplots but also draw a regression line (in red) and a lowess curve (in blue). Along the diagonal, a histogram of each column’s data is plotted. You will notice that age correlates strongly with the number of years as a customer. Income and age show a mild correlation. A sketch of this step appears below.
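A minimal open-source R sketch of this plot, assuming the sample is pulled into local memory with ore.pull (the course code may instead sample in the database):
# Pull the three columns of interest from the database into a local data frame
df <- ore.pull(CUSTOMER_V[, c("AGE", "INCOME", "YRS_CUSTOMER")])
# Sample roughly 20% of the customers
set.seed(42)
samp <- df[sample(nrow(df), floor(0.2 * nrow(df))), ]
# Diagonal panel: histogram of each column
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1], h$counts / max(h$counts))
}
# Off-diagonal panel: scatterplot with regression line (red) and lowess curve (blue)
panel.fit <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "red")
  lines(lowess(x, y), col = "blue")
}
pairs(samp, diag.panel = panel.hist, panel = panel.fit)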
Case Scenario
XYZ Telecom is one of the largest communications service providers (CSPs) in Asia-Pacific. The company has recently been facing increasing competition from social media sites and from other providers (IP providers, WiFi, WiMAX, and so on), as well as declining customer acquisition. All of these are challenges to future business growth.
To help resolve these issues, XYZ Telecom has decided to focus on the following areas this
year:
• Increase its customers’ adoption of smartphones
• Optimize its network and increase its data services revenue
• Accelerate adoption of its new mobile Internet
Analytical Questions
1. How do you think XYZ Telecom can achieve its three key goals?
2. In which of the following ways can XYZ Telecom learn what people are saying about its
products and services?
a. Lengthy surveys
b. Real-time analysis of data
c. Analysis of large volumes of structured and nonstructured data
d. Analysis of nonstructured content, IP network traffic, and Web proxy information
3. To improve the XYZ Telecom customer experience, which of the following should the
company analyze proactively?
a. Audio conversations
b. CRM service records
c. Wait for customer complaints
d. Social media comments
e. Network traffic information
4. Suggest two ways for XYZ Telecom to ensure better security.
•
•
5. Now that you have analyzed the key areas that XYZ Telecom needs to focus on, how and
why would you recommend Oracle Big Data as a solution?
Hint: Consider the four Vs of Big Data.