Big Data Essentials: Activity Guide


Big Data Essentials

Activity Guide
Table of Contents
Practices for Lesson 1: Introduction
Practices for Lesson 1
Practices for Lesson 2: Overview of Big Data
Practice 2-1: Overview of Big Data
Solution 2-1: Overview of Big Data
Practices for Lesson 3: Understanding the Oracle Big Data Solution
Practice 3-1: Understanding the Oracle Big Data Solution
Solution 3-1: Understanding the Oracle Big Data Solution
Practices for Lesson 4: Using Oracle Big Data Appliance
Practice 4-1: Using Oracle Big Data Appliance
Solution 4-1: Using Oracle Big Data Appliance
Practices for Lesson 5: Data Acquisition Options in BDA
Practice 5-1: Data Acquisition Options in BDA
Solution 5-1: Data Acquisition Options in BDA
Practices for Lesson 6: Using the Hadoop Distributed File System
Practice 6-1: Loading JSON Data into HDFS
Solution 6-1: Loading JSON Data into HDFS
Practices for Lesson 7: Using Flume in HDFS
Practice 7-1: Introduction to Flume
Solution 7-1: Introduction to Flume
Practices for Lesson 8: Using Oracle NoSQL Database
Practice 8-1: Using KVLite
Solution 8-1: Using KVLite
Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Solution 8-2: Loading Movie Data into Oracle NoSQL Database
Practices for Lesson 9: Using Hive
Practice 9-1: Manipulating Data with Hive
Solution 9-1: Manipulating Data with Hive
Practice 9-2: Extracting Facts by Using Hive
Solution 9-2: Extracting Facts by Using Hive
Practices for Lesson 10: Introduction to Oracle Big Data Connectors
Practices for Lesson 10
Practices for Lesson 11: Using Oracle Loader for Hadoop
Practice 11-1: Loading Session Data into Oracle Database
Solution 11-1: Loading Session Data into Oracle Database
Practices for Lesson 12: Using Oracle SQL Connector for HDFS
Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Practices for Lesson 13
Practices for Lesson 14: Using Oracle R Connector for Hadoop
Practice 14-1: Working with Data in HDFS and Oracle Database
Solution 14-1: Working with Data in HDFS and Oracle Database

Practice 14-2: Executing a Simple MapReduce Job by Using Oracle R Connector for Hadoop
Solution 14-2: Executing a Simple MapReduce Job by Using Oracle R Connector for Hadoop
Practices for Lesson 15: Using In-Database Analytics
Practice 15-1: Getting Started with Oracle R Enterprise
Solution 15-1: Getting Started with Oracle R Enterprise
Practices for Lesson 16: Oracle Big Data Integration Options
Practices for Lesson 16
Practices for Lesson 17: Oracle Big Data Use Cases
Practices for Lesson 17: Overview
Practice 17-1: Case Study
Solution 17-1: Case Study

Practices for Lesson 2: Overview of Big Data
Chapter 2

Practice 2-1: Overview of Big Data
Overview
In this practice, you answer questions to check your understanding of the concepts that you
learned in the lesson.

Questions
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are ___________, ___________, __________, and
__________.
c. Fill in the blank column:
   Today’s challenge                     New data        What’s possible?
   Retail: One-size-fits-all marketing   Social media    ?

d. Will Big Data replace data warehouses and databases?

e. If you are a business analyst, what kinds of skills do you need if you want to use Big
Data?

Solution 2-1: Overview of Big Data
Overview
The answers are formatted in bold.

Answers
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are volume, value, variety, and velocity.
c. Fill in the blank column:
   Today’s challenge                     New data        What’s possible?
   Retail: One-size-fits-all marketing   Social media    Sentiment analysis and segmentation
Note: Sentiment analysis is a statistical model that is used to gauge the
sentimental outlook for a brand, a service, or any other subject of interest by
analyzing the semantics of lexical or nonlexical text. It is a weighted, score-based
analysis. (A small illustrative scoring sketch appears after answer e.)
d. Will Big Data replace data warehouses and databases?
No.
Big Data needs data warehouses and databases to store structured data for further
analysis. Big Data should be combined with traditional systems to obtain a better
solution.
e. If you are a business analyst, what kinds of skills do you need if you want to use Big
Data?
• Ability to map Big Data with organizational goals
• Skills to extract massive amounts of data
• Good leadership skills to make business decisions
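
The sentiment-analysis note in answer c describes a weighted, score-based approach to text. As a purely illustrative sketch (the tiny lexicon, weights, and sample text below are made up and are not part of the course environment), a minimal lexicon-based scorer can be run in the shell:

echo "great cast but a terrible ending" | tr ' ' '\n' | awk '
  BEGIN { w["great"]=1; w["loved"]=2; w["bad"]=-1; w["terrible"]=-2 }   # toy example lexicon with weights
  { total += w[$1] }                                                    # words not in the lexicon score 0
  END { printf "weighted sentiment score: %d\n", total }'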

Practices for Lesson 3: Understanding the Oracle Big Data Solution
Chapter 3

Practice 3-1: Understanding the Oracle Big Data Solution
Overview
In this practice, you answer questions to check your understanding of the concepts that you
learned in the lesson.

Questions
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False
c. Oracle Database is used in the Acquire phase.
• True
• False
d. Match the following:
   Big Data            Storage System Used
   Unstructured data   Oracle NoSQL Database
   Schema-based data   HDFS
   Schema-less data    Oracle Database

Solution 3-1: Understanding the Oracle Big Data Solution
Overview
The answers to Practice 3-1 are formatted in bold.

Answers
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above
Explanation
1) Deep analytics is a fully parallel, extensive, and extensible toolbox full of advanced
and novel statistical and data-mining capabilities.
2) High agility is the ability to create temporary analytics environments in an end-
user-driven yet secure and scalable environment to deliver new and novel insights
into the operational business.
3) Massive scalability is the ability to scale analytics and sandboxes to previously
unknown sizes while leveraging previously untapped data potential.
4) Low latency is the ability to instantly act based on these advanced analytics in your
operational production environments.
All these features are integrated in the Oracle Big Data solution.
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False
Explanation
Oracle R Connector is the only connector that is used in the analysis phase of the
Oracle Big Data solution.
c. Oracle Database is used in the Acquire phase.
• True
• False

Explanation
Oracle Database can be used in the acquire phase if the incoming data is structured.
Oracle Database is widely used to store the user profiles and the location data.
d. Match the following:
   Big Data            Storage System Used
   Unstructured data   HDFS
   Schema-based data   Oracle Database
   Schema-less data    Oracle NoSQL Database

Practices for Lesson 4: Using Oracle Big Data Appliance
Chapter 4

Practice 4-1: Using Oracle Big Data Appliance
Overview
In this practice, you gain a better understanding of Oracle Big Data Appliance by answering
questions about its administration.

Questions
a. How much usable storage is provided per rack?
b. How much usable storage is provided per rack for Oracle NoSQL Database?
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
d. How is InfiniBand used?
e. There are 18 nodes in the machine. How are the nodes used?
f. What is the difference between Big Data Appliance and an Exadata system/expansion
rack?
g. How does an administrator monitor Big Data Appliance?
h. Why should you buy this machine from Oracle? How does its cost compare to the cost
of building other systems?

Solution 4-1: Using Oracle Big Data Appliance
Overview
The answers to Practice 4-1 are added in this section.

Answers
a. How much usable storage is provided per rack?
This depends on the software configuration.
Raw storage = 648 TB (18 nodes × 12 disks × 3 TB). By default, Hadoop runs on a
triple-replication scheme, so 648 TB of raw storage delivers over 200 TB of usable
space.
b. How much usable storage is provided per rack for Oracle NoSQL Database?
This depends on the software configuration. For NoSQL Database, we preconfigure
the set of masters and replicas on BDA. A single-rack BDA has 6 master nodes and 12
replicas. Each master has a 6 TB maximum.
So there can be 6 masters × 6 TB = 36 TB of storage in NoSQL Database on a single
rack. Each additional rack adds 36 TB of storage.
Because of triple replication, a NoSQL Database gets 3 * 36 TB = 108 TB of data
storage on BDA at full capacity.
Note that these numbers will change when Oracle NoSQL Database v.2 is released.
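
The figures in answers a and b can be sanity-checked with simple shell arithmetic (the node, disk, and replication counts are the ones quoted above):

echo "Raw HDFS storage (TB):              $((18 * 12 * 3))"      # 648 TB raw
echo "Usable HDFS storage (TB), approx.:  $((18 * 12 * 3 / 3))"  # ~216 TB with triple replication
echo "NoSQL master storage (TB):          $((6 * 6))"            # 36 TB (6 masters x 6 TB)
echo "NoSQL raw with 3x replication (TB): $((3 * 36))"           # 108 TB at full capacity
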
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
You cannot add SAN or NAS devices to scale storage. Hadoop is designed for directly
attached disks, and you scale both compute and storage at the same time by adding
nodes to the Hadoop cluster.
d. How is InfiniBand used?
Hadoop uses an enormous amount of network bandwidth by design because all nodes
continuously talk to each other and data is moved among nodes in the shuffle phase of
running a MapReduce program. The most common scaling problem in Hadoop clusters
is network saturation.
For large clusters, 1 Gb Ethernet requires difficult tuning (data locality is extremely
important to avoid excessive shuffling of data) and configuration. At 40 Gb/sec,
InfiniBand provides unprecedented scaling and performance for Hadoop, and it
provides extremely fast connectivity to other Oracle machines such as Exadata via Big
Data Connectors.
e. There are 18 nodes in the machine. How are the nodes used?
Node 1 is the Hadoop Name Node and the HBase master.
Node 2 runs Zookeeper and the Cloudera Manager.
Node 3 runs the JobTracker, a MySQL database for Cloudera Manager and Hive (and
an ODI Agent). These three nodes are also configured as data nodes by default; this
feature can be turned off if you add additional racks.
The remaining 15 nodes are data nodes for Hadoop.
If you hook up another rack, the 18 new nodes are data nodes that join the cluster.
Because Hadoop is designed for scale out, adding nodes is very easy. Data
automatically rebalances itself in Hadoop.
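
To see the data-node layout and rebalancing described above from the command line, the standard Hadoop administration commands can be used (shown only as an illustration; the exact command names and output depend on the Hadoop release installed on BDA):

hdfs dfsadmin -report        # lists live data nodes with per-node capacity and usage
hdfs balancer -threshold 10  # redistributes blocks when node utilization differs by more than 10%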

f. What is the difference between Big Data Appliance and an Exadata
system/expansion rack?
Oracle Exadata is optimized for Oracle Database processing. An Exadata Storage
expansion rack runs Oracle Exadata Storage Server software and expands the
capacity of Oracle Exadata. It therefore enables more dedicated high-performance
storage of database data sets.
Oracle Big Data Appliance does not run Oracle Database. It is intended to leverage
Oracle NoSQL Database and Hadoop to ingest and preprocess large volumes of data.
After the data is processed to some meaningful data set, BDA enables fast loading or
fast connections with the data in Exadata.
The end-to-end solution for big data therefore involves both Big Data Appliance and an
Oracle Database/Exadata system.
g. How does an administrator monitor Big Data Appliance?
Hadoop is monitored with the Hadoop management framework. Oracle NoSQL
Database is monitored by using a simple web interface. Enterprise Manager Support
may be added at a later time.
h. Why should you buy this machine from Oracle? How does its cost compare to
the cost of building other systems?
There are several advantages to purchasing Oracle Big Data Appliance (BDA).
Time to value: The configuration and management of Hadoop clusters require deep
expertise in Linux, networking, Java, and Hadoop configuration. Very few Hadoop
experts are available on the open job market. BDA is preconfigured and pretuned by
the top Hadoop experts in the industry. Actual customer deployments have shown that
Oracle Corporation can deploy a fully functional, production-grade Hadoop system in
two days.
TCO: The cost of building a system with equivalent hardware and software is actually
higher than buying the preconfigured and preinstalled BDA system from Oracle. For full
details, see:
https://blogs.oracle.com/datawarehousing/entry/price_comparison_for_big_data.
Maintenance: Oracle is automatically notified if a component in the cluster (for
example, a disk) fails, and Oracle Support comes to your site to replace it. Large
home-grown clusters often require several full-time administrators just to take care of the
hardware, which is not necessary with BDA.
InfiniBand: InfiniBand provides unprecedented performance, scaling, and connectivity
to other Oracle machines such as Exadata. Home-grown InfiniBand and 10 Gb
Ethernet are very difficult to configure and tune, and the Linux drivers have bugs that
Oracle has fixed.

Summary of Advantages of BDA
• Supported by a single vendor
• Hardware and software optimized throughout the architecture
• Already integrated with Exadata
• Delivers a complete solution
• Easiest and fastest way to deploy Hadoop
• Competitive pricing compared to other solutions

Practices for Lesson 5: Data Acquisition Options in BDA
Chapter 5

Practice 5-1: Data Acquisition Options in BDA
Overview
In this practice, you answer questions to check your understanding of the concepts that you
learned in the lesson.

Questions
a. A NoSQL database typically does not support traditional SQL database queries.
• True
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
NoSQL Database RDBMS

Solution 5-1: Data Acquisition Options in BDA
Overview
The answers to Practice 5-1 are formatted in bold.

Answers
a. A NoSQL database typically does not support traditional SQL database queries.
• True
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
NoSQL: customer profiles, time stamp
HDFS: weblogs, reports, logs, time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
   NoSQL Database                               RDBMS
   2) Schema is application-defined.            1) Stores high-value data
   4) Stores low-value data                     3) Data is highly structured.
   5) Data can be structured or unstructured.   6) Schema is self-describing.

Practices for Lesson 6: Using the Hadoop Distributed File System
Chapter 6

Practice 6-1: Loading JSON Data into HDFS
Overview
In this practice, you review HDFS commands. You also load into HDFS a JSON log file that
tracked activity in an online movie application.

Tasks
Note: Run the commands from the /home/oracle/movie/moviework/mapreduce
directory.
You may see a difference in output with respect to the number of files/items displayed.
1. Start the terminal window, and run the reset_mapreduce.sh script to reset the practice
directory.
cd /home/oracle/movie/moviework/reset
./reset_mapreduce.sh
2. Review the commands that are available for the Hadoop Distributed File System. You will
find that its composition is similar to your local Linux file system. You will use the hadoop
fs command when interacting with HDFS.
cd /home/oracle/movie/moviework/mapreduce
hadoop fs
3. List the contents of /user/oracle.
hadoop fs -ls /user/oracle
4. Create a subdirectory called my_stuff in the /user/oracle folder, and then ensure that
the directory has been created.
hadoop fs -mkdir /user/oracle/my_stuff
hadoop fs -ls /user/oracle
5. Remove the my_stuff directory and then ensure that it has been removed.
hadoop fs -rmr my_stuff
hadoop fs -ls
6. Inspect the compressed JSON application log.
cd /home/oracle/movie/moviework/mapreduce
zcat movieapp_3months.log.gz|head

7. Load a file into the HDFS from the local file system. Specifically, you load a JSON log file
that tracked activity in an online movie application. The JSON data represents individual
clicks from an online movie rental site. You use the basic put commands for moving data
into the HDFS.
Review the commands that are available for the HDFS, and then copy the gzipped file to the
HDFS.
hadoop fs
hadoop fs -put movieapp_3months.log.gz /user/oracle/moviework/applog
8. Verify the copy by listing the directory contents in HDFS.
hadoop fs -ls /user/oracle/moviework/applog
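
Optionally, you can confirm the size of the uploaded file and peek at its contents directly from HDFS; because the file is gzipped, it can be streamed through zcat (an extra verification step, not required by the practice):

hadoop fs -du /user/oracle/moviework/applog
hadoop fs -cat /user/oracle/moviework/applog/movieapp_3months.log.gz | zcat | head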

Solution 6-1: Loading JSON Data into HDFS
1. Start the terminal window, and run the reset_mapreduce.sh script to reset the practice
directory.
cd /home/oracle/movie/moviework/reset
./reset_mapreduce.sh

Note: The output may vary if the files already exist.


2. Review the commands that are available for the Hadoop Distributed File System. You will
find that its composition is similar to your local Linux file system. You use the hadoop fs
command when interacting with HDFS:
cd /home/oracle/movie/moviework/mapreduce
hadoop fs

3. List the contents of /user/oracle.
hadoop fs -ls /user/oracle

4. Create a subdirectory called my_stuff in the /user/oracle folder, and then ensure that
the directory has been created:
hadoop fs -mkdir /user/oracle/my_stuff
hadoop fs -ls /user/oracle

5. Remove the my_stuff directory and then ensure that it has been removed:
hadoop fs -rmr my_stuff
hadoop fs -ls

6. Inspect the compressed JSON application log:
cd /home/oracle/movie/moviework/mapreduce
zcat movieapp_3months.log.gz|head

7. Load a file into the HDFS from the local file system.
Specifically, you load a JSON log file that tracked activity in an online movie application. The
JSON data represents individual clicks from an online movie rental site. You use the basic
put commands for moving data into the HDFS.
Review the commands available for the HDFS, and then copy the gzipped file to the HDFS.
hadoop fs
hadoop fs -put movieapp_3months.log.gz /user/oracle/moviework/applog

8. Verify the copy by listing the directory contents in HDFS:


hadoop fs -ls /user/oracle/moviework/applog

Note: The output may vary based on the existing files.

Practices for Lesson 7: Using Flume in HDFS
Chapter 7

Practice 7-1: Introduction to Flume
Overview
In this practice, you start flume-ng and review the commands and configuration options that
are available in Flume.

Tasks
1. Start the Firefox browser and open Cloudera Manager (which is bookmarked for you).
2. Log in to Cloudera Manager:
Username: admin
Password: welcome1
3. You can view the status of the Hadoop services running on BDA.
Right-click and start Flume if it is in the stopped state.
4. Open a terminal window and view the command options in Flume.
$flume-ng help
5. Review the configuration file for the MoviePlex application (a generic sketch of a Flume configuration appears after step 6).
$cd /home/oracle/movie/moviedemo/scripts
$more flume.conf
6. Review the agent file for the MoviePlex application.
$more flume_movieagent.sh
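
The flume.conf file reviewed in step 5 wires a source, a channel, and an HDFS sink together. The following is only a rough, hypothetical sketch of that shape; the agent name, source command, file path, and HDFS path are illustrative and are not the actual MoviePlex configuration:

cat > /tmp/example_flume.conf <<'EOF'
# one agent with one source, one in-memory channel, and one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/movieapp/activity.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = /user/oracle/moviework/applog/flume
agent1.sinks.sink1.channel   = ch1
EOF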

Solution 7-1: Introduction to Flume
1. Start the Firefox browser and open Cloudera Manager (which is bookmarked for you).

2. Log in to Cloudera Manager:


Username: admin
Password: welcome1

3. You can view the status of the Hadoop services running on the BDA.
Right-click and start Flume if it is in the stopped state.

4. Open a terminal window and view the command options in Flume.
$flume-ng help

5. Review the configuration file for the MoviePlex application.
$cd /home/oracle/movie/moviedemo/scripts
$ls
$more flume.conf

6. Review the agent file for the MoviePlex application.


$more flume_movieagent.sh

Practices for Lesson 8: Using Oracle NoSQL Database
Chapter 8

Practice 8-1: Using KVLite
Overview
In this practice, you create an Oracle NoSQL Database instance and register the schemas that
are used for the MoviePlex application.

Tasks
1. Open a terminal window:
2. There are three Oracle NoSQL Database–specific environment variables. KVHOME is
where binaries are installed, KVROOT is where data files and config files are saved, and
KVDEMOHOME is where the source of the practice project is saved.
echo $KVROOT
echo $KVHOME
echo $KVDEMOHOME
3. Make sure $KVROOT does not exist already.
rm -rf $KVROOT
4. Start KVLite from the current working directory:
java -jar $KVHOME/lib/kvstore.jar kvlite -host localhost -root $KVROOT
Look for a response similar to what is listed below. Minimize the window (leave it running).
java -jar $KVHOME/lib/kvstore-2.*.jar kvlite -root $KVROOT
Created new kvlite store with args: -root /u02/kvroot -store kvstore -host localhost -port 5000 -admin 5001
5. Open a new tab in the terminal window and start an admin session:
java -jar $KVHOME/lib/kvstore.jar runadmin -host localhost -port 5000
You should be logged in to the KV shell.
6. Register the customer schema:
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/customer.avsc

7. Register the following schemas:
• movie.avsc
• cast.avsc
• crew.avsc
• genre.avsc
• activity.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/movie.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/cast.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/crew.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/genre.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/activity.avsc
Note: Review the schema files located in the
/home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas
directory. (A generic example of the Avro schema format appears after step 8.)
8. Run the show schemas command to make sure all six schemas are registered.
Kv-> show schemas
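
The .avsc files registered above are Avro schema definitions written in JSON. The following is a generic, hypothetical example only: the record and field names below are illustrative, and the real customer.avsc in the schemas directory defines its own fields. The snippet writes an example schema to a temporary file so you can compare its shape with the shipped files:

cat > /tmp/example_schema.avsc <<'EOF'
{
  "type"   : "record",
  "name"   : "customer",
  "fields" : [
    { "name" : "custId",    "type" : "int"    },
    { "name" : "firstName", "type" : "string" },
    { "name" : "lastName",  "type" : "string" }
  ]
}
EOF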

Solution 8-1: Using KVLite
1. Open a terminal window.
2. There are three Oracle NoSQL Database–specific environment variables. KVHOME is where
binaries are installed, KVROOT is where data files and config files are saved, and
KVDEMOHOME is where the source of the practice project is saved.
echo $KVROOT
echo $KVHOME
echo $KVDEMOHOME

3. Make sure $KVROOT does not exist already.


rm -rf $KVROOT
4. Start KVLite from the current working directory:
java -jar $KVHOME/lib/kvstore.jar kvlite -host localhost -root $KVROOT

Minimize the window (leave it running).


5. Open a new tab in the terminal window and start an admin session:
java -jar $KVHOME/lib/kvstore.jar runadmin -host localhost -port 5000

You should be logged in to the KV shell.

6. Register the customer schema:
The customer schema looks like this:

Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/customer.avsc

This is what you should see after you successfully register the customer schema:

7. Register the following schemas:


• movie.avsc
• cast.avsc
• crew.avsc
• genre.avsc
• activity.avsc

Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/movie.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/cast.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/crew.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/genre.avsc
Kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/activity.avsc

8. Run the show schemas command to make sure all six schemas are registered.
Kv-> show schemas

Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Overview
In this practice, you start the MoviePlex demo and load the movie data into the Oracle NoSQL
Database.

Tasks
1. Open a new terminal window and explore the resetDemo.sh script in the scripts folder:
cd /home/oracle/movie/moviedemo/nosqldb/scripts
more resetDemo.sh
2. Run the resetDemo.sh script to reset the MoviePlex demo.
./resetDemo.sh
This script deletes the existing KVRoot and log files, and stops the running server processes.
3. Explore the startDemo.sh script in the scripts folder:
more startDemo.sh
4. Run startDemo.sh to start the MoviePlex demo:
./startDemo.sh
This script starts Oracle WebLogic Server and the Oracle NoSQL Server, and loads the
movies into the database.
5. Open the Firefox browser and connect to the Oracle MoviePlex demo application by using
this URL:
http://localhost:7001/bigdatademo-UI-context-root/login.jsp
Note: You can also click the bookmarked Oracle MoviePlex login page.
6. Enter the credentials as shown here:
Username: guest1/guest2/.../guest100
Password: Welcome1
Note: The home page content varies based on the guestXX login that you enter.
7. After the startDemo script completes, you should be able to view the movies that are loaded.
You can click a movie title, view its description, and personalize your favorite movies.

Note: You might not be able to view the movies in the classroom environment because of the
firewall settings.

Solution 8-2: Loading Movie Data into Oracle NoSQL Database
1. Open a new terminal window and explore the resetDemo.sh script in the scripts folder:
cd /home/oracle/movie/moviedemo/nosqldb/scripts
more resetDemo.sh

2. Run the resetDemo.sh script to reset the MoviePlex demo.


./resetDemo.sh

This script deletes the existing KVRoot and log files, and stops the running server processes.

3. Explore the startDemo.sh script in the scripts folder:
more startDemo.sh

4. Run startDemo.sh to start the MoviePlex demo:


./startDemo.sh

This script starts Oracle WebLogic Server and the Oracle NoSQL Server, and loads the
movies into the database.

5. Open the Firefox browser and connect to the Oracle MoviePlex demo application by using
this URL:
http://localhost:7001/bigdatademo-UI-context-root/login.jsp
Note: You can also click the bookmarked Oracle MoviePlex login page.

6. Enter the credentials as shown here:


Username: guest1/guest2/.../guest100
Password: Welcome1
Note: The home page content varies based on the guestXX login that you enter.

7. After the startDemo script completes, you should be able to view the movies that are loaded.

You can click a movie title, view its description, and personalize your favorite movies.

Note: You might not be able to view the movies in the classroom environment because of the
firewall settings.

Practices for Lesson 9: Using Hive
Chapter 9

Practice 9-1: Manipulating Data with Hive
Overview
In this practice, you:
• Create a database to store your Hive tables
• Create a simple external table in Hive that enables you to view the contents of the file
• Create a more sophisticated external table that parses the JSON fields and maps them
to the columns in the table
• Select the minimum and maximum time periods contained in the table by using HiveQL

Assumptions
You have successfully completed Practice 6.

Tasks
1. Access the Hive command line by typing hive at the Linux prompt.
$hive
2. Create a new Hive database called moviework. Ensure that the database has been
successfully created.
hive> create database moviework;
hive> show databases;
3. Specify that you want all DDL and DML operations to apply to a specific database. For
simplicity, you apply subsequent operations to the moviework database:
hive> use moviework;
4. In the moviework database, create a simple external table called movieapp_log_json.
This table will contain a single column called the_record.
hive> CREATE EXTERNAL TABLE movieapp_log_json (
the_record STRING
)
LOCATION '/user/oracle/moviework/applog/';
5. Write a query in the Hive command line that returns the first five rows from the table. After
reviewing the results, drop the table.
hive> SELECT * FROM movieapp_log_json LIMIT 5;
hive> drop table movieapp_log_json;

6. Define a more sophisticated table that parses the JSON file and maps its fields to columns.
To process the JSON fields, you use a popular serializer/deserializer (or SerDe) called
org.apache.hadoop.hive.contrib.serde2.JsonSerde. After creating the table,
review the results by selecting the first 20 rows:
hive> CREATE EXTERNAL TABLE movieapp_log_json (
custId INT,
movieId INT,
genreId INT,
time STRING,
recommended STRING,
activity INT,
rating INT,
price FLOAT
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/moviework/applog/';
hive> SELECT * FROM movieapp_log_json LIMIT 20;
7. HiveQL supports many standard SQL operations. Find the minimum and maximum time
periods that are in the log file:
hive> SELECT MIN(time), MAX(time) FROM movieapp_log_json;
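
If you want to explore further, HiveQL aggregations work much like SQL. The following optional follow-up (assuming the moviework database and movieapp_log_json table created above) counts clicks per activity type; run it from the Linux prompt with hive -e, or paste the SELECT into the Hive shell:

hive -e "
USE moviework;
SELECT activity, COUNT(*) AS clicks
FROM movieapp_log_json
GROUP BY activity;"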

Solution 9-1: Manipulating Data with Hive
1. Access the Hive command line by typing hive at the Linux prompt.

2. Create a new Hive database called moviework. Ensure that the database has been
successfully created.

3. Specify that you want all DDL and DML operations to apply to a specific database. For
simplicity, you apply subsequent operations to the moviework database:

4. In the moviework database, create a simple external table called movieapp_log_json.


This table will contain a single column called the_record.

5. Write a query in the Hive command line that returns the first five rows from the table. After
reviewing the results, drop the table.

6. Define a more sophisticated table that parses the JSON file and maps its fields to columns.
To process the JSON fields, you use a popular serializer/deserializer (or SerDe) called
org.apache.hadoop.hive.contrib.serde2.JsonSerde. After creating the table,
review the results by selecting the first 20 rows:

7. HiveQL supports many standard SQL operations. Find the minimum and maximum time
periods that are in the log file:

Practice 9-2: Extracting Facts by Using Hive
Overview
In this practice, you use HiveQL to filter and aggregate click data to build facts about users’
movie preferences. The query results are saved in a staging table that is used to populate
Oracle Database.

Assumptions
You have successfully completed Practice 9-1.

Tasks
1. Write a query to select only those clicks that correspond to starting, browsing, completing, or
purchasing movies. Use a CASE statement to transform the RECOMMENDED column into
integers, where ‘Y’ is 1 and ‘N’ is 0. Also, ensure that GENREID is not null. Include only the
first 25 rows.
hive> SELECT custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
time,
CASE recommended WHEN 'Y' THEN 1 ELSE 0 END recommended,
activity,
price
FROM movieapp_log_json
WHERE activity IN (2,4,5,11) LIMIT 25;
Select the movie ratings made by a user. Also consider the following: What happens if a
user rates the same movie multiple times? In this scenario, you should load only the user’s
most recent movie rating.
In Oracle Database 11g, you can use a windowing function. However, HiveQL does not
provide sophisticated analytic functions. Instead, you must use an inner join to compute the
result.
Note: Joins occur before WHERE clauses. To restrict the output of a join, a requirement
should be in the WHERE clause. Otherwise, it should be in the JOIN clause.

2. Write a query to select the customer ID, movie ID, recommended state, and most recent
rating for each movie.
hive> SELECT
m1.custid,
m1.movieid,
CASE WHEN m1.genreid > 0 THEN m1.genreid ELSE -1 END genreid,
m1.time,
CASE m1.recommended WHEN 'Y' THEN 1 ELSE 0 END recommended,
m1.activity,
m1.rating
FROM movieapp_log_json m1
JOIN
(SELECT
custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
MAX(time) max_time,
activity
FROM movieapp_log_json
GROUP BY custid,
movieid,
genreid,
activity
) m2
ON (
m1.custid = m2.custid
AND m1.movieid = m2.movieid
AND m1.genreid = m2.genreid
AND m1.time = m2.max_time
AND m1.activity = 1
AND m2.activity = 1
) LIMIT 25;

3. Load the results of the previous two queries into a staging table. You first create the staging
table:
hive> CREATE TABLE movieapp_log_stage (
custId INT,
movieId INT,
genreId INT,
time STRING,
recommended INT,
activity INT,
rating INT,
sales FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Note: If you come across “parse error/exception” at the time of table creation, run the
following script and restart from Practice 9-1.
/home/oracle/movie/moviework/reset/reset_mapreduce.sh
4. Load the results of the queries into the staging table:

hive>INSERT OVERWRITE TABLE movieapp_log_stage
SELECT * FROM (
SELECT custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
time,
CAST((CASE recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT)
recommended,
activity,
cast(null AS INT) rating,
price
FROM movieapp_log_json
WHERE activity IN (2,4,5,11)
UNION ALL
SELECT
m1.custid,
m1.movieid,
CASE WHEN m1.genreid > 0 THEN m1.genreid ELSE -1 END genreid,
m1.time,
CAST((CASE m1.recommended WHEN 'Y' THEN 1 ELSE 0 END) AS INT)
recommended,
m1.activity,
m1.rating,
cast(null as float) price
FROM movieapp_log_json m1
JOIN
(SELECT
custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
MAX(time) max_time,
activity
FROM movieapp_log_json
GROUP BY custid,
movieid,
genreid,
activity
) m2
ON (
m1.custid = m2.custid
AND m1.movieid = m2.movieid
AND m1.genreid = m2.genreid

AND m1.time = m2.max_time
AND m1.activity = 1
AND m2.activity = 1
)
) union_result;
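
After the INSERT OVERWRITE completes, a quick row count confirms that the union landed in the staging table (an optional check; the count depends on your data):

hive -e "USE moviework; SELECT COUNT(*) FROM movieapp_log_stage;"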

Solution 9-2: Extracting Facts by Using Hive
1. Write a query to select only those clicks that correspond to starting, browsing, completing, or
purchasing movies. Use a CASE statement to transform the RECOMMENDED column into
integers where ‘Y’ is 1 and ‘N’ is 0. Also, ensure that GENREID is not null. Include only the
first 25 rows.
Select the movie ratings made by a user. Also consider the following: What happens if a
user rates the same movie multiple times? In this scenario, you should load only the user’s
most recent movie rating.
In Oracle Database 11g, you can use a windowing function. However, HiveQL does not
provide sophisticated analytic functions. Instead, you must use an inner join to compute the
result.
Note: Joins occur before WHERE clauses. To restrict the output of a join, a requirement
should be in the WHERE clause. Otherwise, it should be in the JOIN clause.

2. Write a query to select the customer ID, movie ID, recommended state, and most recent
rating for each movie.

3. Load the results of the previous two queries into a staging table. You first create the staging
table:

4. Load the results of the queries into the staging table:

Practices for Lesson 10: Introduction to Oracle Big Data Connectors
Chapter 10

Practices for Lesson 11: Using Oracle Loader for Hadoop
Chapter 11

Practice 11-1: Loading Session Data into Oracle Database
Overview
In this practice, you learn how to use Oracle Loader for Hadoop (OLH) to load data into a table
in Oracle Database.
You load data in online mode with the direct path load option.

Tasks
1. Open a terminal window and clean up the existing HDFS files.
source /home/oracle/movie/moviework/reset/reset_conn.sh
The reset_conn.sh script cleans up any directories and files, and it creates a new HDFS
directory for storing output files.
2. Examine the following scripts:
a. moviesession.xml
b. loaderMap_moviesession.xml
c. runolh_session.sh
cd /home/oracle/movie/moviework/olh
more moviesession.xml
more loaderMap_moviesession.xml
more runolh_session.sh
Note: If a script is long, keep pressing the Enter key to page through the more output.
3. Create the target table in the database where the data needs to be loaded. This table is
hash-partitioned on the cust_id column.
sqlplus moviedemo/welcome1
@moviesession.sql
exit
4. Run the runolh_session.sh script to invoke the OLH job to load the session data. This
starts the MapReduce job, which loads the data into the target table.
sh runolh_session.sh
5. After the MapReduce job is completed, check to see if the rows are loaded.
sqlplus moviedemo/welcome1
select count(*) from movie_sessions_tab;

6. The table is available for querying. You can execute the following queries, or you can try
your own queries.
select cust_id from movie_sessions_tab where rownum < 10;
select max(num_browsed) from movie_sessions_tab;
select cust_id from movie_sessions_tab where num_browsed = 6;
select cust_id from movie_sessions_tab where num_browsed in (select
max(num_browsed) from movie_sessions_tab);

Solution 11-1: Loading Session Data into Oracle Database
1. Open a terminal window and clean up the existing HDFS files.
source /home/oracle/movie/moviework/reset/reset_conn.sh
The reset_conn.sh script cleans up any directories and files, and it creates a new HDFS
directory for storing output files.
2. Examine the following scripts:
a. moviesession.xml
This file contains the configuration parameters for the execution of OLH.
You can review the file to see the parameters, including:
mapreduce.inputformat.class: Specifies the input format of the input data
file
In this example, the input data is delimited text, so the value for this parameter is
the class name
oracle.hadoop.loader.lib.input.DelimitedTextInputFormat.
oracle.hadoop.loader.input.fieldTerminator: Specifies the character
that is used as a field terminator in the input data file
In this example, the field terminator is Tab (represented by its hex value).
mapreduce.outputformat.class: Specifies the type of load. We specify
here the value OCIOutputFormat to use the direct path online load option.
mapred.input.dir: Location of the input data file on HDFS
mapred.output.dir: Specifies the HDFS directory where output files should
be written, such as the _SUCCESS and _log files
oracle.hadoop.loader.loaderMapFile: Specifies the name and location
(typically on the client file system) of the loaderMap file
b. loaderMap_moviesession.xml
This file specifies the target table into which data will be loaded, as well as the
mapping of input data to columns in the target table. Note the date format in the
DATE_ID column, which specifies the date format in the input data file. The Java
date format should be used to specify the date format in the Loader Map file.
If a Loader Map file is not specified, OLH assumes that all columns in the table will
be loaded, and OLH will expect input data in column order.
In that case, the oracle.hadoop.loader.targetTable property and the
oracle.hadoop.loader.input.fieldNames property must be specified.

c. runolh_session.sh
This is the script to invoke Oracle Loader for Hadoop, which runs as a MapReduce
job on the Hadoop cluster. It uses moviesession.xml, the file containing the
configuration parameters.
cd /home/oracle/movie/moviework/olh
more moviesession.xml
more loaderMap_moviesession.xml
more runolh_session.sh
Note: If a script is long, keep pressing the Enter key to page through the more output.
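
As an illustration only, the parameters described in steps 2a and 2b are expressed as standard Hadoop configuration XML. The fragment below is a hypothetical sketch of that shape, not the actual moviesession.xml shipped with the practice: the HDFS paths are made up, and the shipped file should be checked for the exact values and class names.

cat > /tmp/example_olh_conf.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.inputformat.class</name>
    <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
  </property>
  <property>
    <name>oracle.hadoop.loader.input.fieldTerminator</name>
    <value>\u0009</value> <!-- Tab; the shipped file represents it as a hex value -->
  </property>
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value> <!-- direct path online load; verify the package name against the shipped file -->
  </property>
  <property>
    <name>mapred.input.dir</name>
    <value>/user/oracle/moviework/data/sessions</value> <!-- hypothetical HDFS input path -->
  </property>
  <property>
    <name>mapred.output.dir</name>
    <value>/user/oracle/moviework/olh_output</value> <!-- hypothetical HDFS output path -->
  </property>
  <property>
    <name>oracle.hadoop.loader.loaderMapFile</name>
    <value>file:///home/oracle/movie/moviework/olh/loaderMap_moviesession.xml</value>
  </property>
</configuration>
EOF
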
3. Create the target table in the database where the data needs to be loaded. This table is
hash-partitioned on the cust_id column.
sqlplus moviedemo/welcome1
@moviesession.sql
exit

You see that an external table is created in the Oracle database to load the data from Hadoop.

4. Run the runolh_session.sh script to invoke the OLH job to load the session data. This
starts the MapReduce job, which loads the data into the target table.
sh runolh_session.sh

Note that one row had a parse error. The .bad file containing the row and the error are
logged in the _olh directory under the directory specified in mapred.output.dir.
In this example, the date value is invalid because it is a time between 2:00 AM and 3:00 AM
on the day that Daylight Saving Time begins.

5. After the MapReduce job is completed, check to see if the rows are loaded.
sqlplus moviedemo/welcome1
select count(*) from movie_sessions_tab;

6. The table is available for querying. You can execute the following queries, or you can try
your own queries.
select cust_id from movie_sessions_tab where rownum < 10;
select max(num_browsed) from movie_sessions_tab;
select cust_id from movie_sessions_tab where num_browsed = 6;
select cust_id from movie_sessions_tab where num_browsed in (select
max(num_browsed) from movie_sessions_tab);

Practices for Lesson 12: Using Oracle SQL Connector for HDFS
Chapter 12

Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Overview
In this practice, you use Oracle SQL Connector for HDFS to access the data in:
• Hive tables
• HDFS files

Assumptions
The reset_conn.sh script has already been run to clean up the directories.

Tasks

Accessing the Hive Tables by Using OSCH


1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the following files located in the OSCH directory:
cd /home/oracle/movie/moviework/osch
more genloc_moviefact_hive.sh
more moviefact_hive.xml
b. Run the following script.
When prompted for a password, type welcome1.
sh genloc_moviefact_hive.sh
c. Take a look at the external table definition in SQL*Plus:
sqlplus moviedemo/welcome1
describe movie_fact_ext_tab_hive;
d. Try the following queries on this external table, which will query data in the Hive
table.
select count(*) from movie_fact_ext_tab_hive;
select custid from movie_fact_ext_tab_hive where rownum < 10;
e. You can also join the external table with the table in Oracle Database to list movie
titles by custid.
select custid, title from movie_fact_ext_tab_hive p, movie q
where p.movieid = q.movie_id and rownum < 10;
f. The data in the Hive table can be inserted into a database table by using SQL.
create table movie_fact_local as select * from
movie_fact_ext_tab_hive;

Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml
b. Run the following script.
sh genloc_moviefact_text.sh
You are prompted for the password, which is welcome1.
From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_text.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;
describe movie_fact_ext_tab_file;
d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;
e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here.
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;
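
As an optional follow-up, you can aggregate directly over the HDFS-backed external table from SQL*Plus. The sketch below assumes the external table and credentials used in the steps above (illustrative only; the top-10 limit uses a ROWNUM subquery so that it runs on Oracle Database 11g):

sqlplus -s moviedemo/welcome1 <<'EOF'
SELECT *
FROM (
  SELECT cust_id, COUNT(*) AS clicks
  FROM   movie_fact_ext_tab_file
  GROUP  BY cust_id
  ORDER  BY clicks DESC
)
WHERE ROWNUM <= 10;
EXIT;
EOF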

Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Accessing Hive Tables by Using OSCH
1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the following files located in the OSCH directory:
cd /home/oracle/movie/moviework/osch

more genloc_moviefact_hive.sh

more moviefact_hive.xml

b. Run the following script.
When prompted for a password, type welcome1.
sh genloc_moviefact_hive.sh

c. Take a look at the external table definition in SQL*Plus:
sqlplus moviedemo/welcome1
describe movie_fact_ext_tab_hive;

d. Try the following queries on this external table, which will query data in the Hive
table:
select count(*) from movie_fact_ext_tab_hive;
select custid from movie_fact_ext_tab_hive where rownum < 10;

e. You can also join the external table with the table in Oracle Database to list movie
titles by custid.
select custid, title from movie_fact_ext_tab_hive p, movie q
where p.movieid = q.movie_id and rownum < 10;

f. The data in the Hive table can be inserted into a database table by using SQL.
create table movie_fact_local as select * from
movie_fact_ext_tab_hive;
exit;

Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml

b. Run the following script.
sh genloc_moviefact_text.sh
You will be prompted for the password, which is welcome1.

From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_text.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;

describe movie_fact_ext_tab_file;

d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;

e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here:
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;

Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Chapter 13

Practices for Lesson 14: Using Oracle R Connector for Hadoop
Chapter 14

Practice 14-1: Working with Data in HDFS and Oracle Database
Overview
In this practice, you load the ORCH library to access some basic functions for manipulating
HDFS and Oracle Database.

Tasks
1. Change directory and start R:
$cd /home/oracle/movie/moviework/advancedanalytics
$R
2. Load the Oracle R Enterprise (ORE) library and connect to the Oracle database, and then
list the contents of the database to test the connection.
Note: If a table contains columns with unsupported data types, a warning message is
returned. If you are connected, you can just invoke ore.ls().
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
ore.ls()
3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory,
and list the directory contents in Hadoop Distributed File System (HDFS). Change directory
in HDFS and view the contents there:
library(ORCH)
hdfs.pwd()
hdfs.ls()
hdfs.cd ("/user/oracle/moviework/advancedanalytics/data")
hdfs.ls()
4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look
at the first few rows of each table, and get the table dimensions:
ore.sync("MOVIEDEMO","MOVIE_FACT")
MF <- MOVIE_FACT
names(MF)
head(MF,3)
dim(MF)

names(MOVIE_GENRE)
head(MOVIE_GENRE,3)
dim(MOVIE_GENRE)

5. Because the MOVIE_GENRE table is used later in the Hadoop recommendation jobs, copy a
subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This
requires using orch.connect to establish the connection to the database from ORCH.
MG_SUBSET <- MOVIE_GENRE[1:10000,]
hdfs.rm('movie_genre_subset')
orch.connect(host="localhost", user="moviedemo",
sid="orcl",passwd="welcome1",secure=F)
mg.dfs <- hdfs.push(MG_SUBSET, dfs.name='movie_genre_subset')
hdfs.exists('movie_genre_subset')
hdfs.describe('movie_genre_subset')
hdfs.size('movie_genre_subset')
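As an optional check (not part of the original lab steps), you can pull the pushed data back into the local R session and inspect it. This is a minimal sketch that uses hdfs.get, the same function Practice 14-2 uses to fetch results; the variable name mg.check is illustrative only.
# Optional verification: read the pushed HDFS file back into local R
# memory and compare its shape with MG_SUBSET.
mg.check <- hdfs.get(mg.dfs)
head(mg.check, 3)
dim(mg.check)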

Solution 14-1: Working with Data in HDFS and Oracle Database
1. Change directory and start R:
$cd /home/oracle/movie/moviework/advancedanalytics
$R

2. Load the Oracle R Enterprise (ORE) library and connect to the Oracle database, and then
list the contents of the database to test the connection.
Note: If a table contains columns with unsupported data types, a warning message is returned. If you are already connected, you can skip ore.connect and simply invoke ore.ls().
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
ore.ls()

3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory,
and list the directory contents in Hadoop Distributed File System (HDFS). Change directory
in HDFS and view the contents there:
library(ORCH)
hdfs.pwd()
hdfs.ls()
hdfs.cd ("/user/oracle/moviework/advancedanalytics/data")
hdfs.ls()

4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look
at the first few rows of each table, and get the table dimensions:
ore.sync("MOVIEDEMO","MOVIE_FACT")
MF <- MOVIE_FACT
names(MF)
head(MF,3)
dim(MF)

names(MOVIE_GENRE)
head(MOVIE_GENRE,3)
dim(MOVIE_GENRE)

5. Because the MOVIE_GENRE table is used later in the Hadoop recommendation jobs, copy a
subset of MOVIE_GENRE from the database to HDFS and validate that it exists. This
requires using orch.connect to establish the connection to the database from ORCH.
MG_SUBSET <- MOVIE_GENRE[1:10000,]
hdfs.rm('movie_genre_subset')
orch.connect(host="localhost", user="moviedemo",
sid="orcl",passwd="welcome1",secure=F)
mg.dfs <- hdfs.push(MG_SUBSET, dfs.name='movie_genre_subset')
hdfs.exists('movie_genre_subset')
hdfs.describe('movie_genre_subset')
hdfs.size('movie_genre_subset')

Practice 14-2: Executing a Simple MapReduce Job by Using Oracle R
Connector for Hadoop
Overview
In this practice, you use Oracle R Connector for Hadoop to execute a simple MapReduce job
that counts the number of movies for each genre. You then compare the results with ORE.

Assumptions
None

Tasks
1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session:
mg.dfs <- hdfs.attach(
  "/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
mg.dfs
hdfs.describe(mg.dfs)
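Optionally, before running the MapReduce job, you can peek at a few records of the attached file. This is a small sketch using hdfs.get (used later in this practice to fetch results); mg.peek is an illustrative variable name.
# Optional: pull the attached HDFS data into the local session and
# inspect the first rows and the column names the mapper will see.
mg.peek <- hdfs.get(mg.dfs)
head(mg.peek, 3)
names(mg.peek)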
2. Specify dry run mode, and then execute the MapReduce job that partitions the data based
on genre_id and counts the number of movies in each genre.
Note: You receive debug output while in dry run mode.
orch.dryrun(T)
res.dryrun <- NULL
res.dryrun <- hadoop.run(mg.dfs,
  mapper = function(key, val) {
    orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
  },
  reducer = function(key, vals) {
    count <- nrow(vals)
    orch.keyval(NULL, key, count)
  },
  config = new("mapred.config",
    map.output = data.frame(key=0, val=0),
    reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)
3. Retrieve the result of the Hadoop job, which is stored as an HDFS file.
Note: Because this is dry run mode, not all data may be used. As a result, only a subset of
results may be returned.
hdfs.get(res.dryrun)

4. Specify to execute using the cluster by setting orch.dryrun to FALSE, rerun the same
MapReduce job, and view the result.
Note: This takes longer to execute because it starts actual Hadoop jobs on the cluster.
orch.dryrun(F)
res.cluster <- NULL
res.cluster <- hadoop.run(mg.dfs,
  mapper = function(key, val) {
    orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
  },
  reducer = function(key, vals) {
    count <- nrow(vals)
    orch.keyval(NULL, key, count)
  },
  config = new("mapred.config",
    map.output = data.frame(key=0, val=0),
    reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)
hdfs.get(res.cluster)
5. Perform the same analysis by using ORE:
res.table <- table(MG_SUBSET$GENRE_ID)
res.table
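To compare the two results more directly, you can fetch the cluster output and order it by genre. This is an illustrative sketch, assuming the Hadoop result contains the GENRE_ID and COUNT columns declared in reduce.output; res.mr is an illustrative variable name.
# Sketch: fetch the MapReduce output and order it by genre so that it
# can be compared, row by row, with the ORE table() result above.
res.mr <- hdfs.get(res.cluster)
res.mr[order(res.mr$GENRE_ID), c("GENRE_ID", "COUNT")]
res.table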

Solution 14-2: Executing a Simple MapReduce Job by Using Oracle R
Connector for Hadoop
1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session:
mg.dfs <- hdfs.attach(
  "/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
mg.dfs
hdfs.describe(mg.dfs)

2. Specify dry run mode, and then execute the MapReduce job that partitions the data based
on genre_id and counts the number of movies in each genre.
Note: You receive debug output while in dry run mode.
orch.dryrun(T)
res.dryrun <- NULL
res.dryrun <- hadoop.run(mg.dfs,
  mapper = function(key, val) {
    orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
  },
  reducer = function(key, vals) {
    count <- nrow(vals)
    orch.keyval(NULL, key, count)
  },
  config = new("mapred.config",
    map.output = data.frame(key=0, val=0),
    reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)

3. Retrieve the result of the Hadoop job, which is stored as an HDFS file.
Note: Because this is dry run mode, not all data may be used. As a result, only a subset of
results may be returned.
hdfs.get(res.dryrun)

4. Specify to execute using the cluster by setting orch.dryrun to FALSE, rerun the same
MapReduce job, and view the result.
Note: This takes longer to execute because it starts actual Hadoop jobs on the cluster.
orch.dryrun(F)
res.cluster <- NULL
res.cluster <- hadoop.run(mg.dfs,
  mapper = function(key, val) {
    orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
  },
  reducer = function(key, vals) {
    count <- nrow(vals)
    orch.keyval(NULL, key, count)
  },
  config = new("mapred.config",
    map.output = data.frame(key=0, val=0),
    reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)
hdfs.get(res.cluster)

5. Perform the same analysis by using ORE:
res.table <- table(MG_SUBSET$GENRE_ID)
res.table

Practices for Lesson 15:
Using In-Database Analytics
Chapter 15

Practice 15-1: Getting Started with Oracle R Enterprise
Overview
In this practice, you are introduced to Oracle R Enterprise (ORE). You run R scripts in the R
environment and explore the analysis options that are available in ORE.

Tasks
1. Start the R Console.
$R
2. Load the ORE packages and connect to the moviedemo schema in the database with the
SID orcl on localhost and the password welcome1. Specify that table and view
metadata should be synchronized, and specify that tables are accessible in the current R
environment.
Note: You receive one or more warning messages if tables contain data types that are not
recognized by ORE.
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
3. Perform the following steps:
a. View the contents of the database schema:
ore.ls()
b. Determine if the CUSTOMER_V table exists:
ore.exists("CUSTOMER_V")
c. See that the table is an ore.frame, an R object proxy for the table in the database (a short illustrative sketch of this proxy behavior follows step 3e):
class(CUSTOMER_V)
d. Determine the table's dimensions and column names. These are computed in the database, and only the results are retrieved to the client:
dim(CUSTOMER_V)
names(CUSTOMER_V)
e. Answer the following questions about our customers:
i. Which gender (male or female) is better represented?
ii. Is the customer base skewed toward young customers or old customers?
iii. Are customers highly educated?
summary(CUSTOMER_V[,c("GENDER","INCOME","AGE","EDUCATION")])
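The following short sketch (referenced in step 3c) illustrates the proxy behavior: subsetting an ore.frame produces another ore.frame, and no rows are copied to the client until you pull them explicitly. The INCOME threshold and the variable name high.income are arbitrary and used only for illustration.
# Subsetting CUSTOMER_V yields another in-database proxy object;
# ore.pull() is what actually brings rows back to the R client.
high.income <- CUSTOMER_V[CUSTOMER_V$INCOME > 100000, c("CUST_ID", "INCOME")]
class(high.income)             # still an ore.frame
head(ore.pull(high.income), 3)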

4. Answer the following questions:
a. Are customers generally upper-income or lower-income?
b. Are there any surprises in the distribution of customer incomes?
c. What is the income range of the middle 50% of customers?
To answer these questions, use ORE functions to generate a histogram and boxplot for
income.
cust <- CUSTOMER_V
hist(cust$INCOME,col="red")
boxplot(cust$INCOME,xlab="income column",ylab="income",
main="Distribution of Customer Income",
col="red",notch=TRUE)
5. Answer the following questions:
a. Does it make sense to consider segmenting customers based on education and
income?
b. Can you extend your answer to step (a) to produce a boxplot for income by
education?
We first use the overloaded function split to partition the data in Oracle Database. Then
we use this list result for the boxplot:
cust.split <- with(cust, split(INCOME,as.factor(EDUCATION)))
boxplot(cust.split, xlab="Education",ylab="Income",boxwex = 0.5,
main="Distribution of Customer Income by Education",
col="red",notch=TRUE)
6. Answer the following questions:
a. Are there more single or married customers in each education group?
b. Can you use the overloaded table function on the ore.frame object cust to build a
contingency table of counts at each combination of factor levels to show the table
numerically?
cust.tab <- with (cust, table(EDUCATION,MARITAL_STATUS))
cust.tab
7. For customers, are there correlations between age, income, and the number of years since
first becoming a customer? In this example, we produce a “pairs” plot that shows scatterplots of each pair of columns. From the CUSTOMER_V table, we sample 20% of the
customers and select the columns AGE, INCOME, and YRS_CUSTOMER. Using the pairs
function, we not only produce a scatterplot but also draw a regression line (in red) and a
lowess curve in blue. Along the diagonal, a histogram of each column’s data is plotted. You
will notice that age correlates strongly with the number of years as a customer. Income and
age show a mild correlation.

C1 <- CUSTOMER_V
row.names(C1) <- C1$CUST_ID
N <- nrow(C1)
s <- sample(1:N,N*0.2)
with(C1[s,],
pairs(cbind(AGE, INCOME, YRS_CUSTOMER),
panel=function(x,y) {
points(x,y)
abline(lm(y~x),lty="dashed",col="red",lwd=2)
lines(lowess(x,y),col="blue",lwd=3)
},
diag.panel=function(x){
par(new=TRUE)
hist(x,main="",axes=FALSE, col="tan")
}
))
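As a follow-up sketch, you can back up the visual impression with a numeric correlation matrix computed on the same 20% sample; only the three columns of interest are pulled to the client, and samp is an illustrative variable name.
# Pull AGE, INCOME, and YRS_CUSTOMER for the sampled rows and compute
# their pairwise correlations locally.
samp <- ore.pull(C1[s, c("AGE", "INCOME", "YRS_CUSTOMER")])
round(cor(samp), 2)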
8. In this example, you answer the question “Which actor has the most movie titles to his or her credit?” You draw on three tables: MOVIE_CAST, CAST, and MOVIE. You join these tables, aggregate the result by actor name, and select the actors with more than 50 titles.
MC <- MOVIE_CAST
C1 <- CAST
M1 <- MOVIE[,c("MOVIE_ID","TITLE","YEAR")]
m1 <- merge(MC,C1,by="CAST_ID")[,c("NAME","MOVIE_ID","CAST_ID")]
names(m1) <- c("ACTOR","MOVIE_ID","CAST_ID")
ACTORS <- merge(m1,M1,by="MOVIE_ID")[,c("CAST_ID","ACTOR",
"MOVIE_ID","TITLE","YEAR")]
#row.names(ACTORS) <- c(ACTORS$MOVIE_ID,ACTORS$CAST_ID)
MOVIE_ACTORS <- ACTORS[,c("ACTOR","TITLE","YEAR")]
aggdata <- aggregate(MOVIE_ACTORS$ACTOR,
by = list(MOVIE_ACTORS$ACTOR),
FUN = length)
aggdata[aggdata$x > 50,]
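To read off the single most prolific actor rather than the full filtered list, you can follow the same pattern as step 9: pull the (small) aggregated result with ore.pull and then order it. This is a sketch; agg.local is an illustrative name, and Group.1/x are the default column names produced by aggregate with FUN = length.
# Pull the aggregated counts to the client, then order them to find the
# actors with the most titles.
agg.local <- ore.pull(aggdata)
head(agg.local[order(agg.local$x, decreasing = TRUE), ], 5)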

9. In this example, you answer the question “Which are the most popular movie genres based
on the number of movies produced in that genre?” Using the MOVIE_GENRE and GENRE
ore.frame objects, merge (join) the data so that you can use genre names instead of IDs.
Use the overloaded function aggregate on the joined ore.frame to count the number of
movies in each genre. The barplot window can be widened so that more labels can be
shown.
MG <- MOVIE_GENRE
G1 <- GENRE
m1 <- merge(MG,G1, by="GENRE_ID")
genre.cnts <- with (m1, aggregate(NAME,
by = list(NAME),
FUN = length))
class(genre.cnts)
names(genre.cnts) <- c("genre","cnt")
genre.cnts
gcnts <- ore.pull(genre.cnts)
gcnts.sorted <- gcnts[order(gcnts$cnt,decreasing=TRUE),]
barplot(height=gcnts.sorted$cnt, names=gcnts.sorted$genre,
main="Barplot of Movie Counts by Genre",
col="red",cex.names=0.7,las=2)
10. Which movie genre generally has the lowest gross income? Use the overloaded function
merge on the ore.frame objects MOVIE_GENRE, GENRE, and MOVIE to get a sense of the
distribution of gross earnings in a genre. Use the dim function to see the size of the
resulting data set. Use the split function and then graph the distribution of gross earnings by genre using the boxplot function.
Note: All of the heavy computational work is done in the database. Only summary data is
brought to the client to graph the statistics.
g <- merge(MOVIE_GENRE,GENRE,by="GENRE_ID")
m <- merge(MOVIE,g[,2:3],by="MOVIE_ID")
m$GENRE <- m$NAME
dim(m)
m.split <- split(m$GROSS,m$GENRE)
boxplot(m.split, ylab="Gross Earnings",xlab="Genre",col="green",
main="Distribution of Gross Earnings by Genre",
cex.axis=0.6, boxwex=.5, las=2)
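If you also want a numeric answer to which genre has the lowest typical gross, one sketch is to pull just the GENRE and GROSS columns and compute the median gross per genre on the client (the median is computed locally here rather than relying on in-database aggregate support; gg and med are illustrative names).
# Pull only two columns, then compute the median gross for each genre
# and list the genres with the lowest medians.
gg <- ore.pull(m[, c("GENRE", "GROSS")])
med <- aggregate(gg$GROSS, by = list(gg$GENRE), FUN = median, na.rm = TRUE)
names(med) <- c("genre", "median_gross")
head(med[order(med$median_gross), ])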

Solution 15-1: Getting Started with Oracle R Enterprise

1. Start the R Console.


$R

2. Load the ORE packages and connect to the moviedemo schema in the database with the
SID orcl on localhost and the password welcome1. Specify that table and view
metadata should be synchronized, and specify that tables are accessible in the current R
environment.
Note: You receive one or more warning messages if tables contain data types that are not
recognized by ORE.
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)

3. Perform the following steps:
a. View the contents of the database schema:
ore.ls()

b. Determine if the CUSTOMER_V table exists:


ore.exists("CUSTOMER_V")

c. See that the table is an ore.frame (an R object proxy for the table in the
database)
class(CUSTOMER_V)

d. Determine the table's dimensions and column names. These are computed in the database, and only the results are retrieved to the client:
dim(CUSTOMER_V)
names(CUSTOMER_V)

e. Answer the following questions about our customers:
i. Which gender (male or female) is better represented?
ii. Is the customer base skewed toward young customers or old customers?
iii. Are customers highly educated?
summary(CUSTOMER_V[,c("GENDER","INCOME","AGE","EDUCATION")])

4. Answer the following questions:


a. Are customers generally upper-income or lower-income?
b. Are there any surprises in the distribution of customer incomes?
c. What is the income range of the middle 50% of customers?
To answer these questions, use ORE functions to generate a histogram and boxplot for
income.

cust <- CUSTOMER_V
hist(cust$INCOME,col="red")
boxplot(cust$INCOME,xlab="income column",ylab="income",
main="Distribution of Customer Income",
col="red",notch=TRUE)

5. Answer the following questions:
a. Does it make sense to consider segmenting customers based on education and
income?
b. Can you extend your answer to step (a) to produce a boxplot for income by
education?
We first use the overloaded function split to partition the data in Oracle Database. Then
we use this list result for the boxplot:
cust.split <- with(cust, split(INCOME,as.factor(EDUCATION)))
boxplot(cust.split, xlab="Education",ylab="Income",boxwex = 0.5,
main="Distribution of Customer Income by Education",
col="red",notch=TRUE)

6. Answer the following questions:
a. Are there more single or married customers in each education group?
b. Can you use the overloaded table function on the ore.frame object cust to build a
contingency table of counts at each combination of factor levels to show the table
numerically?
cust.tab <- with (cust, table(EDUCATION,MARITAL_STATUS))
cust.tab

7. For customers, are there correlations between age, income, and the number of years since
first becoming a customer? In this example, we produce a “pairs” plot that shows scatterplots of each pair of columns. From the CUSTOMER_V table, we sample 20% of the
customers and select the columns AGE, INCOME, and YRS_CUSTOMER. Using the pairs
function, we not only produce a scatterplot but also draw a regression line (in red) and a
lowess curve in blue. Along the diagonal, a histogram of each column’s data is plotted. You
will notice that age correlates strongly with the number of years as a customer. Income and
age show a mild correlation.

C1 <- CUSTOMER_V
row.names(C1) <- C1$CUST_ID
N <- nrow(C1)
s <- sample(1:N,N*0.2)
with(C1[s,],
pairs(cbind(AGE, INCOME, YRS_CUSTOMER),
panel=function(x,y) {
points(x,y)
abline(lm(y~x),lty="dashed",col="red",lwd=2)
lines(lowess(x,y),col="blue",lwd=3)
},
diag.panel=function(x){
par(new=TRUE)
hist(x,main="",axes=FALSE, col="tan")
}
))

8. In this example, you answer the question “Which actor has the most movie titles to his or her credit?” You draw on three tables: MOVIE_CAST, CAST, and MOVIE. You join these tables, aggregate the result by actor name, and select the actors with more than 50 titles.
MC <- MOVIE_CAST
C1 <- CAST
M1 <- MOVIE[,c("MOVIE_ID","TITLE","YEAR")]
m1 <- merge(MC,C1,by="CAST_ID")[,c("NAME","MOVIE_ID","CAST_ID")]
names(m1) <- c("ACTOR","MOVIE_ID","CAST_ID")
ACTORS <- merge(m1,M1,by="MOVIE_ID")[,c("CAST_ID","ACTOR",
"MOVIE_ID","TITLE","YEAR")]
#row.names(ACTORS) <- c(ACTORS$MOVIE_ID,ACTORS$CAST_ID)
MOVIE_ACTORS <- ACTORS[,c("ACTOR","TITLE","YEAR")]
aggdata <- aggregate(MOVIE_ACTORS$ACTOR,
by = list(MOVIE_ACTORS$ACTOR),
FUN = length)
aggdata[aggdata$x > 50,]

9. In this example, you answer the question “Which are the most popular movie genres based
on the number of movies produced in that genre?” Using the MOVIE_GENRE and GENRE
ore.frame objects, merge (join) the data so that you can use genre names instead of IDs.
Use the overloaded function aggregate on the joined ore.frame to count the number of
movies in each genre. The barplot window can be widened so that more labels can be
shown.
MG <- MOVIE_GENRE
G1 <- GENRE
m1 <- merge(MG,G1, by="GENRE_ID")
genre.cnts <- with (m1, aggregate(NAME,
by = list(NAME),
FUN = length))
class(genre.cnts)
names(genre.cnts) <- c("genre","cnt")
genre.cnts
gcnts <- ore.pull(genre.cnts)
gcnts.sorted <- gcnts[order(gcnts$cnt,decreasing=TRUE),]
barplot(height=gcnts.sorted$cnt, names=gcnts.sorted$genre,
main="Barplot of Movie Counts by Genre",
col="red",cex.names=0.7,las=2)

10. Which movie genre generally has the lowest gross income? Use the overloaded function
merge on the ore.frame objects MOVIE_GENRE, GENRE, and MOVIE to get a sense of the
distribution of gross earnings in a genre. Use the dim function to see the size of the
resulting data set. Use the split function and then graph the distribution of gross earnings by genre using the boxplot function.
Note: All of the heavy computational work is done in the database. Only summary data is
brought to the client to graph the statistics.
g <- merge(MOVIE_GENRE,GENRE,by="GENRE_ID")
m <- merge(MOVIE,g[,2:3],by="MOVIE_ID")
m$GENRE <- m$NAME
dim(m)
m.split <- split(m$GROSS,m$GENRE)
boxplot(m.split, ylab="Gross Earnings",xlab="Genre",col="green",
main="Distribution of Gross Earnings by Genre",
cex.axis=0.6, boxwex=.5, las=2)

Practices for Lesson 16:
Oracle Big Data Integration
Options
Chapter 16

Practices for Lesson 16
Practices Overview
There are no practices for this lesson.

Practices for Lesson 17:
Oracle Big Data Use Cases
Chapter 17

Practices for Lesson 17: Overview
Practices Overview
In these practices, you examine a case scenario for a company in the telecommunications domain. The company faces an ever-increasing set of threats brought on by new technologies; these threats are eroding its traditional profit margins and weakening its control over its business and customers.
You analyze the problem and recommend a solution that uses the Oracle Big Data Stack.

Practice 17-1: Case Study
Overview
In this practice, you analyze the case scenario and provide a solution using the Oracle Big Data
Stack.

Case Scenario
XYZ Telecom is one of the largest communications service providers (CSPs) in the Asia-Pacific region. The company has recently been facing increasing competition from social media sites and from other providers (IP providers, WiFi, WiMAX, and so on), as well as declining customer acquisition. These challenges threaten its future business growth.
To help resolve these issues, XYZ Telecom has decided to focus on the following areas this
year:
• Increase its customers’ adoption of smartphones
• Optimize its network and increase its data services revenue
• Accelerate adoption of its new mobile Internet

Analytical Questions
1. How do you think XYZ Telecom can achieve its three key goals?

2. In which of the following ways can XYZ Telecom learn what people are saying about its
products and services?
a. Lengthy surveys
b. Real-time analysis of data
c. Analysis of large volume of structured and nonstructured data
d. Analysis of nonstructured content, IP network traffic, and Web proxy information
3. To improve the XYZ Telecom customer experience, which of the following should the
company analyze proactively?
a. Audio conversations
b. CRM service records
c. Wait for customer complaints
d. Social media comments
e. Network traffic information
4. Suggest two ways for XYZ Telecom to ensure better security.

5. Now that you have analyzed the key areas that XYZ Telecom needs to focus on, how and
why would you recommend Oracle Big Data as a solution?
Hint: Consider the four Vs of Big Data.

Solution 17-1: Case Study
Analytical Questions
1. How do you think XYZ Telecom can achieve its three key goals?
Use its data warehouse, mobile network feeds, and social media to provide better
insights into its customer base and network utilization.

2. In which of the following ways can XYZ Telecom learn what people are saying about its
products and services?
a. Lengthy surveys
b. Real-time analysis of data
c. Analysis of large volume of structured and nonstructured data
d. Analysis of nonstructured content, IP network traffic, and Web proxy information
3. To improve the XYZ Telecom customer experience, which of the following should the
company analyze proactively?
a. Audio conversations
b. CRM service records
c. Wait for customer complaints
d. Social media comments
e. Network traffic information
4. Suggest two ways for XYZ Telecom to ensure better security.

• Real-time fraud detection
• Government regulatory compliance (through the efficient and cost-effective storage of CDRs and customer information)

5. Now that you have analyzed the key areas that XYZ Telecom needs to focus on, how and
why would you recommend Oracle Big Data as a solution?
Hint: Consider the four Vs of Big Data.
Big Data Solution

Variety – Multiple Data Sources (BSS / OSS / VAS / Network)
• BSS: Billing Systems, CRM, ODS, ECM, Knowledge Management
• OSS: Provisioning Systems, Mediation Systems, and so on
• VAS: SDP, Reload, M-Commerce, Ring Back Tone, and so on
• Network: BTS, MSC, HLR, GGSN, IN
Types of data:
• Text files: log files, SMS text, tweets, email
• Binary files: ASN.1, network routers, email attachments
• Audio files: ring tones, voice messages, call center data
• Video files: movies, video clips

Value
• Better 360-degree view of the customer
- More detailed customer profiles
- Improved accuracy and timeliness of data
• Better customer experience
- Real-time response to customer issues
- Better quality of service
• Better ARPU and customer wallet share
- Real-time marketing
• Quicker response to market and government requests

Velocity – Access Speed of Data and Events
• 10,000s of xDRs per second
• 10,000s of VAS transactions per second
• 100,000s of network events per second

Volume – Volume of Data
• 1,000s of GB of xDR data per day
• 10s of TB of network data per day
• Different sources of data from new business models

