Big Data Essentials: Activity Guide
Table of Contents
Practices for Lesson 1: Introduction
Practices for Lesson 1
Practices for Lesson 2: Overview of Big Data
Practice 2-1: Overview of Big Data
Solution 2-1: Overview of Big Data
Practices for Lesson 3: Understanding the Oracle Big Data Solution
Practice 3-1: Understanding the Oracle Big Data Solution
Solution 3-1: Understanding the Oracle Big Data Solution
Practices for Lesson 4: Using Oracle Big Data Appliance
Practice 4-1: Using Oracle Big Data Appliance
Solution 4-1: Using Oracle Big Data Appliance
Practices for Lesson 5: Data Acquisition Options in BDA
Practice 5-1: Data Acquisition Options in BDA
Solution 5-1: Data Acquisition Options in BDA
Practices for Lesson 6: Using the Hadoop Distributed File System
Practice 6-1: Loading JSON Data into HDFS
Solution 6-1: Loading JSON Data into HDFS
Practices for Lesson 7: Using Flume in HDFS
Practice 7-1: Introduction to Flume
Solution 7-1: Introduction to Flume
Practices for Lesson 8: Using Oracle NoSQL Database
Practice 8-1: Using KVLite
Solution 8-1: Using KVLite
Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Solution 8-2: Loading Movie Data into Oracle NoSQL Database
Practices for Lesson 9: Using Hive
Practice 9-1: Manipulating Data with Hive
Solution 9-1: Manipulating Data with Hive
Practice 9-2: Extracting Facts by Using Hive
Solution 9-2: Extracting Facts by Using Hive
Practices for Lesson 10: Introduction to Oracle Big Data Connectors
Practices for Lesson 10
Practices for Lesson 11: Using Oracle Loader for Hadoop
Practice 11-1: Loading Session Data into Oracle Database
Solution 11-1: Loading Session Data into Oracle Database
Practices for Lesson 12: Using Oracle SQL Connector for HDFS
Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Practices for Lesson 13
Practices for Lesson 14: Using Oracle R Connector for Hadoop
Practice 14-1: Working with Data in HDFS and Oracle Database
Solution 14-1: Working with Data in HDFS and Oracle Database
Practice 2-1: Overview of Big Data
Questions
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are ___________, ___________, __________, and
__________.
c. Fill in the blank column:

   Today's challenge                     New data        What's possible?
   Retail: One-size-fits-all marketing   Social media    ?

d. Will Big Data replace data warehouses and databases?
e. If you are a business analyst, what kinds of skills do you need if you want to use Big Data?
Solution 2-1: Overview of Big Data
Answers
a. Which of the following statements best defines the term Big Data?
1) Data that is very large in volume but consistent in structure
2) Data that is very large in volume and high in value
3) Data that has low value but very large volume
4) Data that is rapidly changing and has very high value and volume
b. The four Vs that describe big data are volume, value, variety, and velocity.
c. Fill in the blank column:

   Today's challenge                     New data        What's possible?
   Retail: One-size-fits-all marketing   Social media    Sentiment analysis and segmentation

Note: Sentiment analysis is a statistical model that is used to gauge the sentiment toward a brand, a service, or any other subject of interest by analyzing the semantics of lexical or nonlexical text. It is a weighted, score-based analysis.
d. Will Big Data replace data warehouses and databases?
No.
Big Data needs data warehouses and databases to store structured data for further
analysis. Big Data should be combined with traditional systems to obtain a better
solution.
e. If you are a business analyst, what kinds of skills do you need if you want to use Big
Data?
• Ability to map Big Data with organizational goals
• Skills to extract massive amounts of data
• Good leadership skills to make business decisions
Practice 3-1: Understanding the Oracle Big Data Solution
Questions
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False
c. Oracle Database is used in the Acquire phase.
• True
• False
Match the following:
   Big Data             Storage System Used
   Unstructured data    Oracle NoSQL Database
   Schema-based data    HDFS
   Schema-less data     Oracle Database
Solution 3-1: Understanding the Oracle Big Data Solution
Answers
a. Identify the capabilities of the Oracle Big Data solution.
1) Deep analytics
2) High agility
3) Massive scalability
4) Low latency
5) All of the above (correct)
Explanation
1) Deep analytics is a fully parallel, extensive, and extensible toolbox full of advanced
and novel statistical and data-mining capabilities.
2) High agility is the ability to create temporary analytics environments in an end-
user-driven yet secure and scalable environment to deliver new and novel insights
into the operational business.
3) Massive scalability is the ability to scale analytics and sandboxes to previously
unknown sizes while leveraging previously untapped data potential.
4) Low latency is the ability to instantly act based on these advanced analytics in your
operational production environments.
All these features are integrated in the Oracle Big Data solution.
b. Oracle R Connector for Hadoop is used to refine, organize, and load the data from
Hadoop to Oracle Database.
• True
• False (correct)
Explanation
Oracle R Connector is the only connector that is used in the analysis phase of the
Oracle Big Data solution.
c. Oracle Database is used in the Acquire phase.
• True
• False
Practice 4-1: Using Oracle Big Data Appliance
Questions
a. How much usable storage is provided per rack?
b. How much usable storage is provided per rack for Oracle NoSQL Database?
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
d. How is InfiniBand used?
e. There are 18 nodes in the machine. How are the nodes used?
f. What is the difference between Big Data Appliance and an Exadata system/expansion
rack?
g. How does an administrator monitor Big Data Appliance?
h. Why should you buy this machine from Oracle? How does its cost compare to the cost
of building other systems?
Solution 4-1: Using Oracle Big Data Appliance
Answers
a. How much usable storage is provided per rack?
This depends on the software configuration.
Raw storage = 648 TB (18 nodes × 12 disks × 3 TB). By default, Hadoop runs on a
triple-replication scheme, so 648 TB of raw storage delivers over 200 TB of usable
space.
b. How much usable storage is provided per rack for Oracle NoSQL Database?
This depends on the software configuration. For NoSQL Database, we preconfigure
the set of masters and replicas on BDA. A single-rack BDA has 6 master nodes and 12
replicas. Each master has a 6 TB maximum.
So there can be 6 masters × 6 TB = 36 TB of storage in NoSQL Database on a single
rack. Each additional rack adds 36 TB of storage.
Because of triple replication, a NoSQL Database gets 3 * 36 TB = 108 TB of data
storage on BDA at full capacity.
Note that these numbers will change when Oracle NoSQL Database v.2 is released.
c. Is it possible to add a SAN or other external storage device to increase storage
capacity?
You cannot add SAN or NAS devices to scale storage. Hadoop is designed for directly
attached disks, and you scale both compute and storage at the same time by adding
nodes to the Hadoop cluster.
d. How is InfiniBand used?
Hadoop uses an enormous amount of network bandwidth by design because all nodes
continuously talk to each other and data is moved among nodes in the shuffle phase of
running a MapReduce program. The most common scaling problem in Hadoop clusters
is network saturation.
For large clusters, 1 Gb Ethernet requires difficult tuning (data locality is extremely
important to avoid excessive shuffling of data) and configuration. At 40 Gb/sec,
InfiniBand provides unprecedented scaling and performance for Hadoop, and it
provides extremely fast connectivity to other Oracle machines such as Exadata via Big
Data Connectors.
e. There are 18 nodes in the machine. How are the nodes used?
Node 1 is the Hadoop Name Node and the HBase master.
Node 2 runs ZooKeeper and Cloudera Manager.
Node 3 runs the JobTracker, a MySQL database for Cloudera Manager and Hive (and
an ODI Agent). These three nodes are also configured as data nodes by default; this
feature can be turned off if you add additional racks.
The remaining 15 nodes are data nodes for Hadoop.
If you hook up another rack, the 18 new nodes are data nodes that join the cluster.
Because Hadoop is designed for scale out, adding nodes is very easy. Data
automatically rebalances itself in Hadoop.
Practice 5-1: Data Acquisition Options in BDA
Questions
a. A NoSQL database typically does not support traditional SQL database queries.
• True
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
NoSQL Database RDBMS
Solution 5-1: Data Acquisition Options in BDA
Answers
a. A NoSQL database typically does not support traditional SQL database queries.
• True (correct)
• False
b. A key difference between NoSQL and RDBMS systems is that a NoSQL system stores
unstructured data and an RDBMS system stores structured data.
• True
• False (correct)
c. Suggest the acquiring mechanism for the following types of data:
• Customer profiles
• Blogs (weblogs)
• Reports
• Logs
• Time stamp
NoSQL: customer profiles, time stamp
HDFS: weblogs, reports, logs, time stamp
d. Place the following features in the correct column in the table.
1) Stores high-value data
2) Schema is application-defined.
3) Data is highly structured.
4) Stores low-value data
5) Data can be structured or unstructured.
6) Schema is self-describing.
NoSQL Database:
2) Schema is application-defined.
4) Stores low-value data
5) Data can be structured or unstructured.

RDBMS:
1) Stores high-value data
3) Data is highly structured.
6) Schema is self-describing.
Practice 6-1: Loading JSON Data into HDFS
Tasks
Note: Run the commands from the /home/oracle/movie/moviework/mapreduce directory.
You may see a difference in output with respect to the number of files/items displayed.
1. Start the terminal window, and run the reset_mapreduce.sh script to reset the practice
directory.
cd /home/oracle/movie/moviework/reset
./reset_mapreduce.sh
2. Review the commands that are available for the Hadoop Distributed File System. You will
find that its composition is similar to your local Linux file system. You will use the hadoop
fs command when interacting with HDFS.
cd /home/oracle/movie/moviework/mapreduce
hadoop fs
3. List the contents of /user/oracle.
hadoop fs -ls /user/oracle
4. Create a subdirectory called my_stuff in the /user/oracle folder, and then ensure that
the directory has been created.
hadoop fs -mkdir /user/oracle/my_stuff
hadoop fs -ls /user/oracle
5. Remove the my_stuff directory and then ensure that it has been removed.
hadoop fs -rmr my_stuff
hadoop fs -ls
6. Inspect the compressed JSON application log.
cd /home/oracle/movie/moviework/mapreduce
zcat movieapp_3months.log.gz | head
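Each line of the log is a single JSON click record, similar in shape to the following (the values here are illustrative only; the field names match the columns used in the Lesson 9 Hive practices):
{"custid":1185972,"movieid":256,"genreid":6,"time":"2012-07-01:00:00:07","recommended":"Y","activity":2,"price":2.99}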
7. Load a file into the HDFS from the local file system.
Specifically, you load a JSON log file that tracked activity in an online movie application. The
JSON data represents individual clicks from an online movie rental site. You use the basic
put commands for moving data into the HDFS.
Review the commands available for the HDFS, and then copy the gzipped file to the HDFS.
hadoop fs
hadoop fs -put movieapp_3months.log.gz /user/oracle/moviework/applog
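You can confirm the upload by listing the target directory and checking that the file appears:
hadoop fs -ls /user/oracle/moviework/applog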
Practice 7-1: Introduction to Flume
Tasks
1. Start the Firefox browser and open Cloudera Manager (which is bookmarked for you).
2. Log in to Cloudera Manager:
Username: admin
Password: welcome1
3. You can view the status of the Hadoop services running on BDA.
Right-click and start Flume if it is in the stopped state.
4. Open a terminal window and view the command options in Flume.
$ flume-ng help
5. Review the configuration file for the MoviePlex application.
$ cd /home/oracle/movie/moviedemo/scripts
$ more flume.conf
6. Review the agent file for the MoviePlex application.
$ more flume_movieagent.sh
Practice 8-1: Using KVLite
Tasks
1. Open a terminal window.
2. There are three Oracle NoSQL Database–specific environment variables. KVHOME is
where binaries are installed, KVROOT is where data files and config files are saved, and
KVDEMOHOME is where the source of the practice project is saved.
echo $KVROOT
echo $KVHOME
echo $KVDEMOHOME
3. Make sure $KVROOT does not exist already.
rm -rf $KVROOT
4. Start KVLite from the current working directory:
java -jar $KVHOME/lib/kvstore.jar kvlite -host localhost -root $KVROOT
Look for a response similar to what is listed below. Minimize the window (leave it running).
java -jar $KVHOME/lib/kvstore-2.*.jar kvlite -root $KVROOT
Created new kvlite store with args: -root /u02/kvroot -store kvstore -host localhost -port 5000 -admin 5001
5. Open a new tab in the terminal window and start an admin session:
java -jar $KVHOME/lib/kvstore.jar runadmin -host localhost -port 5000
You should be logged in to the KV shell.
6. Register the customer schema:
kv-> ddl add-schema -file /home/oracle/movie/moviedemo/nosqldb/bigdatademo/dataparser/schemas/customer.avsc
7. Register the remaining five schemas from the same schemas directory in the same way.
8. Run the show schemas command to make sure all six schemas are registered.
kv-> show schemas
Practice 8-2: Loading Movie Data into Oracle NoSQL Database
Tasks
1. Open a new terminal window and explore the resetDemo.sh script in the scripts folder:
cd /home/oracle/movie/moviedemo/nosqldb/scripts
more resetDemo.sh
2. Run the resetDemo.sh script to reset the MoviePlex demo.
./resetDemo.sh
This script deletes the existing KVROOT directory and log files, and stops the running server processes.
3. Explore the startDemo.sh script in the scripts folder:
more startDemo.sh
4. Run startDemo.sh to start the MoviePlex demo:
./startDemo.sh
This script starts Oracle WebLogic Server and the Oracle NoSQL Server, and loads the
movies into the database.
5. Open the Firefox browser and connect to the Oracle MoviePlex demo application by using
this URL:
http://localhost:7001/bigdatademo-UI-context-root/login.jsp
Note: You can also click the bookmarked Oracle MoviePlex login page.
6. Log in with one of the guest accounts:
Username: guest1/guest2/.../guest100
Password: Welcome1
Note: The home page content varies based on the guestXX login that you enter.
7. You should now be able to view the movies that the startDemo script loaded.
You can click a movie title, view its description, and personalize your favorite movies.
Note: You might not be able to view the movies in the classroom environment because of the
firewall settings.
Practice 9-1: Manipulating Data with Hive
Assumptions
You have successfully completed Practice 6.
Tasks
1. Access the Hive command line by typing hive at the Linux prompt.
$ hive
2. Create a new Hive database called moviework. Ensure that the database has been
successfully created.
hive> create database moviework;
hive> show databases;
3. Specify that you want all DDL and DML operations to apply to a specific database. For
simplicity, you apply subsequent operations to the moviework database:
hive> use moviework;
4. In the moviework database, create a simple external table called movieapp_log_json.
This table will contain a single column called the_record.
hive> CREATE EXTERNAL TABLE movieapp_log_json (
the_record STRING
)
LOCATION '/user/oracle/moviework/applog/';
5. Write a query in the Hive command line that returns the first five rows from the table. After
reviewing the results, drop the table.
hive> SELECT * FROM movieapp_log_json LIMIT 5;
hive> drop table movieapp_log_json;
6. Define a more sophisticated table that parses the JSON file and maps its fields to columns.
To process the JSON fields, you use a popular serializer/deserializer (or SerDes) called
org.apache.hadoop.hive.contrib.serde2.JsonSerde. After creating the table,
review the results by selecting the first 20 rows:
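A sketch of the statement, assuming the column names used by the queries in Practice 9-2 (the course's actual script may differ in column names and types; rating is included here because the next practice works with movie ratings):
hive> CREATE EXTERNAL TABLE movieapp_log_json (
custid INT,
movieid INT,
genreid INT,
time STRING,
recommended STRING,
activity INT,
rating INT,
price FLOAT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/moviework/applog/';
hive> SELECT * FROM movieapp_log_json LIMIT 20;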
Practice 9-2: Extracting Facts by Using Hive
Assumptions
You have successfully completed Practice 9-1.
Tasks
1. Write a query to select only those clicks that correspond to starting, browsing, completing, or
purchasing movies. Use a CASE statement to transform the RECOMMENDED column into
integers, where ‘Y’ is 1 and ‘N’ is 0. Also, ensure that GENREID is not null. Include only the
first 25 rows.
hive> SELECT custid,
movieid,
CASE WHEN genreid > 0 THEN genreid ELSE -1 END genreid,
time,
CASE recommended WHEN 'Y' THEN 1 ELSE 0 END recommended,
activity,
price
FROM movieapp_log_json
WHERE activity IN (2,4,5,11) LIMIT 25;
2. Select the movie ratings made by a user. Also consider the following: What happens if a user rates the same movie multiple times? In this scenario, you should load only the user's most recent movie rating.
In Oracle Database 11g, you can use a windowing function. However, HiveQL does not
provide sophisticated analytic functions. Instead, you must use an inner join to compute the
result.
Note: Joins occur before WHERE clauses. To restrict the output of a join, a requirement
should be in the WHERE clause. Otherwise, it should be in the JOIN clause.
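One plausible formulation of this inner-join approach, assuming (hypothetically) that activity code 1 marks a rating event and that the most recent time value identifies the latest rating:
hive> SELECT m1.custid, m1.movieid, m1.rating
FROM movieapp_log_json m1
JOIN (SELECT custid, movieid, MAX(time) AS latest_time
FROM movieapp_log_json
WHERE activity = 1
GROUP BY custid, movieid) m2
ON (m1.custid = m2.custid
AND m1.movieid = m2.movieid
AND m1.time = m2.latest_time)
WHERE m1.activity = 1;
Note how this follows the join note above: the filter that restricts the subquery lives inside it, and the final restriction on m1 is in the WHERE clause.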
Practice 11-1: Loading Session Data into Oracle Database
Tasks
1. Open a terminal window and clean up the existing HDFS files.
source /home/oracle/movie/moviework/reset/reset_conn.sh
The reset_conn.sh script cleans up any directories and files, and it creates a new HDFS
directory for storing output files.
2. Examine the following scripts:
a. moviesession.xml
b. loaderMap_moviesession.xml
c. runolh_session.sh
cd /home/oracle/movie/moviework/olh
more moviesession.xml
more loaderMap_moviesession.xml
more runolh_session.sh
Note: Press Enter to scroll through the larger scripts.
3. Create the target table in the database where the data needs to be loaded. This table is
hash-partitioned on the cust_id column.
sqlplus moviedemo/welcome1
@moviesession.sql
exit
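For reference, a hash-partitioned table of this kind is declared as in the following sketch. The column list here is hypothetical; the real definition comes from moviesession.sql, and only the PARTITION BY clause is the point:
-- Illustrative only: columns are assumptions, not the course schema
CREATE TABLE movie_sessions_tab (
cust_id       NUMBER,
session_start TIMESTAMP,
session_end   TIMESTAMP,
num_clicks    NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 4;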
4. Run the runolh_session.sh script to invoke the OLH job to load the session data. This starts the MapReduce job, which loads the data into the target table.
sh runolh_session.sh
5. After the MapReduce job is completed, check to see if the rows are loaded.
sqlplus moviedemo/welcome1
select count(*) from movie_sessions_tab;
Note that one row had a parse error. The .bad file containing the row and the error are
logged in the _olh directory under the directory specified in mapred.output.dir.
In this example, the date value is invalid because it is a time between 2:00 AM and 3:00 AM on the day that Daylight Saving Time begins.
Practice 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Overview
In this practice, you use Oracle SQL Connector for HDFS to access the data in:
• Hive tables
• HDFS files
Assumptions
The reset_conn.sh script has already been run to clean up the directories.
Tasks
Accessing Hive Tables by Using OSCH
1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the genloc_moviefact_hive.sh and moviefact_hive.xml files in the /home/oracle/movie/moviework/osch directory.
b. Run the script (type welcome1 when prompted for the password):
sh genloc_moviefact_hive.sh
c. In SQL*Plus, describe and query the new external table movie_fact_ext_tab_hive.
Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml
b. Run the following script.
sh genloc_moviefact_text.sh
You are prompted for the password, which is welcome1.
From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_file.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;
describe movie_fact_ext_tab_file;
d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;
e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here.
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;
Solution 12-1: Accessing Hadoop Data with Oracle SQL Connector for HDFS
Accessing Hive Tables by Using OSCH
1. Execute the following steps to access the data residing in Hive by using OSCH:
a. Review the following files located in the OSCH directory:
cd /home/oracle/movie/moviework/osch
more genloc_moviefact_hive.sh
more moviefact_hive.xml
b. Run the following script.
When prompted for a password, type welcome1.
sh genloc_moviefact_hive.sh
c. Take a look at the external table definition in SQL*Plus:
sqlplus moviedemo/welcome1
describe movie_fact_ext_tab_hive;
d. Try the following queries on this external table, which will query data in the Hive
table:
select count(*) from movie_fact_ext_tab_hive;
select custid from movie_fact_ext_tab_hive where rownum < 10;
e. You can also join the external table with the table in Oracle Database to list movie
titles by custid.
select custid, title from movie_fact_ext_tab_hive p, movie q
where p.movieid = q.movie_id and rownum < 10;
f. The data in the Hive table can be inserted into a database table by using SQL.
create table movie_fact_local as select * from
movie_fact_ext_tab_hive;
exit;
Accessing HDFS Files by Using OSCH
2. Execute the following steps to access the data residing in HDFS by using OSCH:
a. Review the following files:
more genloc_moviefact_text.sh
more moviefact_text.xml
b. Run the following script.
sh genloc_moviefact_text.sh
You will be prompted for the password, which is welcome1.
From the output on your monitor, you can see the external table definition as well as
the location files. The location files contain the URIs of the data files on HDFS.
c. Query the newly created external table movie_fact_ext_tab_file.
sqlplus moviedemo/welcome1
select count(*) from movie_fact_ext_tab_file;
describe movie_fact_ext_tab_file;
d. Select cust_id for a few rows in the table.
select cust_id from movie_fact_ext_tab_file where rownum < 10;
e. Join with the movie table in the database to list movie titles by cust_id. The first
few rows are listed here:
select cust_id, title from movie_fact_ext_tab_file p, movie q
where p.movie_id = q.movie_id and rownum < 10;
Practices for Lesson 13: Using ODI Application Adapter for Hadoop
Practices for Lesson 14: Using Oracle R Connector for Hadoop
Practice 14-1: Working with Data in HDFS and Oracle Database
Tasks
1. Change directory and start R:
$ cd /home/oracle/movie/moviework/advancedanalytics
$ R
2. Load the Oracle R Enterprise (ORE) library and connect to the Oracle database, and then
list the contents of the database to test the connection.
Note: If a table contains columns with unsupported data types, a warning message is
returned. If you are connected, you can just invoke ore.ls().
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
ore.ls()
3. Load the Oracle R Connector for Hadoop (ORCH) library, get the current working directory,
and list the directory contents in Hadoop Distributed File System (HDFS). Change directory
in HDFS and view the contents there:
library(ORCH)
hdfs.pwd()
hdfs.ls()
hdfs.cd("/user/oracle/moviework/advancedanalytics/data")
hdfs.ls()
4. Using ORE, view the names of the database tables MOVIE_FACT and MOVIE_GENRE, look
at the first few rows of each table, and get the table dimensions:
ore.sync("MOVIEDEMO","MOVIE_FACT")
MF <- MOVIE_FACT
names(MF)
head(MF,3)
dim(MF)
names(MOVIE_GENRE)
head(MOVIE_GENRE,3)
dim(MOVIE_GENRE)
Assumptions
None
Tasks
1. Use hdfs.attach() to attach the movie_genre HDFS file to the working session:
mg.dfs <- hdfs.attach("/user/oracle/moviework/advancedanalytics/data/movie_genre_subset")
mg.dfs
hdfs.describe(mg.dfs)
2. Specify dry run mode, and then execute the MapReduce job that partitions the data based
on genre_id and counts the number of movies in each genre.
Note: You receive debug output while in dry run mode.
orch.dryrun(T)
res.dryrun <- NULL
res.dryrun <- hadoop.run(mg.dfs, mapper = function(key, val) {
orch.keyvals(val$GENRE_ID, rep(1, nrow(val)))
},
reducer = function(key, vals) {
count <- nrow(vals)
orch.keyval(NULL, key, count)
} ,
config = new("mapred.config", map.output = data.frame(key=0, val=0),
reduce.output = data.frame(key=NA, GENRE_ID=0, COUNT=0))
)
3. Retrieve the result of the Hadoop job, which is stored as an HDFS file.
Note: Because this is dry run mode, not all data may be used. As a result, only a subset of
results may be returned.
hdfs.get(res.dryrun)
Tasks
1. Start the R Console.
$ R
2. Load the ORE packages and connect to the moviedemo schema in the database with the
SID orcl on localhost and the password welcome1. Specify that table and view
metadata should be synchronized, and specify that tables are accessible in the current R
environment.
Note: You receive one or more warning messages if tables contain data types that are not
recognized by ORE.
library(ORE)
ore.connect("moviedemo","orcl","localhost","welcome1",all=TRUE)
3. Perform the following steps:
a. View the contents of the database schema:
ore.ls()
b. Determine if the CUSTOMER_V table exists:
ore.exists("CUSTOMER_V")
c. See that the table is an ore.frame (an R object proxy for the table in the
database):
class(CUSTOMER_V)
d. Determine the table’s dimensions and summary statistics. The dimensions and
summary are computed in the database with only the results being retrieved:
dim(CUSTOMER_V)
names(CUSTOMER_V)
e. Answer the following questions about our customers:
i. Which gender (male or female) is better represented?
ii. Is the customer base skewed toward young customers or old customers?
iii. Are customers highly educated?
summary(CUSTOMER_V[,c("GENDER","INCOME","AGE","EDUCATION")])
7. For customers, are there correlations between age, income, and the number of years since first becoming a customer? In this example, we produce a “pairs” plot, which shows a scatterplot for each pair of columns. From the CUSTOMER_V table, we sample 20% of the customers and select the columns AGE, INCOME, and YRS_CUSTOMER. With the pairs function, we not only produce the scatterplots but also draw a regression line (in red) and a lowess curve (in blue). Along the diagonal, a histogram of each column’s data is plotted. You will notice that age correlates strongly with the number of years as a customer. Income and age show a mild correlation. A sketch of this step appears below.
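A minimal open-source R sketch of this plot, assuming the sample is pulled into local memory with ore.pull (the course code may instead sample in the database):
# Pull the three columns of interest from the database into a local data frame
df <- ore.pull(CUSTOMER_V[, c("AGE", "INCOME", "YRS_CUSTOMER")])
# Sample roughly 20% of the customers
set.seed(42)
samp <- df[sample(nrow(df), floor(0.2 * nrow(df))), ]
# Diagonal panel: histogram of each column
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  rect(h$breaks[-length(h$breaks)], 0, h$breaks[-1], h$counts / max(h$counts))
}
# Off-diagonal panel: scatterplot with regression line (red) and lowess curve (blue)
panel.fit <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "red")
  lines(lowess(x, y), col = "blue")
}
pairs(samp, diag.panel = panel.hist, panel = panel.fit)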
Case Scenario
XYZ Telecom is one of the largest communications service providers (CSPs) in Asia-Pacific. The company has recently been facing increasing competition from social media sites and from other providers (IP providers, WiFi, WiMAX, and so on), as well as declining customer acquisition. All of these are challenges to future business growth.
To help resolve these issues, XYZ Telecom has decided to focus on the following areas this
year:
• Increase its customers’ adoption of smartphones
• Optimize its network and increase its data services revenue
• Accelerate adoption of its new mobile Internet
Analytical Questions
1. How do you think XYZ Telecom can achieve its three key goals?
2. In which of the following ways can XYZ Telecom learn what people are saying about its
products and services?
a. Lengthy surveys
b. Real-time analysis of data
c. Analysis of large volumes of structured and nonstructured data
d. Analysis of nonstructured content, IP network traffic, and Web proxy information
3. To improve the XYZ Telecom customer experience, which of the following should the
company analyze proactively?
a. Audio conversations
b. CRM service records
c. Wait for customer complaints
d. Social media comments
e. Network traffic information
4. Suggest two ways for XYZ Telecom to ensure better security.
•
•
5. Now that you have analyzed the key areas that XYZ Telecom needs to focus on, how and
why would you recommend Oracle Big Data as a solution?
Hint: Consider the four Vs of Big Data.