
Impact Determination Service Cloud Computing Part Design

HadoopInitiator & IDSFilter & HiveReport

High level design

The HadoopInitiator component will be responsible for communicating with the Hadoop cluster, so
that we can control the MapReduce job, monitor the job status, and extend the cluster when
computing resources are running low.
The IDSFilter (MapReduce component) is responsible for processing the transaction data from
retailers and producing a list of impacted customers and product information.
The HiveReport component is a new component used to demonstrate the BI-in-cloud feature.
We use Hive (a subproject of Hadoop) directly, since it is easy to reuse the current
environment of the Hadoop cluster.

[Diagram: HadoopMaster, HadoopInitiator, ScaleListener, IDSFilter, HiveJasperServlet, Hive data warehouse]
As the diagram shows, there are four components related to Hadoop:
 IDSFilter: focuses on data filtering and sorting.
 HadoopInitiator: focuses on
 initiating a MapReduce job
 monitoring the MapReduce job status and providing input for the related queue listeners
 ScaleListener: a sub-component of HadoopInitiator that monitors the cluster's scalability
status and sends notifications to the cloud provisioning system to scale the Hadoop cluster
up/down.
 HiveReport: a simple component for demonstrating the BI-in-cloud feature.

Detailed design of each component

 IDSFilter - MapReduce component


Classes for analysis and filtering/sorting:
 IDSMapper - implements the Mapper interface of MapReduce
 IDSReducer - implements the Reducer interface of MapReduce
 IDSPartitioner - implements the Partitioner interface of MapReduce; just a placeholder in
case the output file format needs multiple Reducers.
 IDSFilter - the job control class that invokes the Mapper and Reducer; the entry point of
our custom Java program, which will be uploaded to AWS S3 (see the sketch after this list).
 FilterDriver – just the entry point of the Java program jar.
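A minimal sketch of how the job control and entry classes could wire these pieces together, assuming the standard org.apache.hadoop.mapreduce Job API; the job name and the way the S3 paths are passed in are illustrative, since the naming convention is still a to-do below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Job control class: configures and submits the filtering/sorting job.
// IDSMapper and IDSReducer are the classes listed above.
public class IDSFilter {

    public static boolean runJob(String inputPath, String outputPath) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ids-filter");

        job.setJarByClass(IDSFilter.class);
        job.setMapperClass(IDSMapper.class);
        job.setReducerClass(IDSReducer.class);
        // IDSPartitioner stays a placeholder; enable it once the output
        // format needs more than one Reducer.
        // job.setPartitionerClass(IDSPartitioner.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        return job.waitForCompletion(true);
    }
}

// FilterDriver: the jar entry point; it simply delegates to IDSFilter.
class FilterDriver {
    public static void main(String[] args) throws Exception {
        System.exit(IDSFilter.runJob(args[0], args[1]) ? 0 : 1);
    }
}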

The input file will be uploaded to Amazon S3 storage:


Todo:
 Decide the input file path and naming convention, e.g. S3://input/$Product or
S3://input/$Product.$date
Input file format:
TransactionID|TransactionDate|CustomerName|EmailID|
Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo
TransactionID|TransactionDate|CustomerName|EmailID|
Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo$Product,Manufacturer,BatchNo

Suggestions:
1. Suggest adding a count number after each BatchNo, so that we can know the real impact;
this makes the data more realistic.
2. Suggest including only one product per line, so that we do not need complicated logic in the
MapReduce module. Putting more than one record in a line makes the MapReduce component
more complicated, and we might need to add more steps to this module, which means more
than one job flow would be needed.

Eg:
TransactionID|TransactionDate|CustomerName|EmailID|
Product,Manufacturer,BatchNo,Count

TransactionID|TransactionDate|CustomerName|EmailID|
Product,Manufacturer,BatchNo,Count

Output file format:

Product|Manufacturer|BatchNo|TotalCount|
EmailID$EmailID$EmailID$EmailID

Product|Manufacturer|BatchNo|TotalCount|
EmailID$EmailID$EmailID$EmailID

Todo:
 Decide the output file path and naming convention, e.g.
S3://output/$manufacturer.$product.$batchno
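A minimal sketch of IDSMapper and IDSReducer against the formats above, assuming the suggested one-product-per-line input (Product,Manufacturer,BatchNo,Count in the last field); the field handling is illustrative, and note that the default TextOutputFormat separates key and value with a tab rather than a literal "|".

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Maps one transaction line to (Product|Manufacturer|BatchNo, EmailID,Count).
public class IDSMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
        String[] fields = line.toString().split("\\|");
        if (fields.length < 5) {
            return; // skip malformed lines
        }
        String emailId = fields[3];
        String[] product = fields[4].split(",");
        if (product.length < 3) {
            return;
        }
        String key = product[0] + "|" + product[1] + "|" + product[2];
        String count = product.length > 3 ? product[3] : "1";
        context.write(new Text(key), new Text(emailId + "," + count));
    }
}

// Aggregates each batch into Product|Manufacturer|BatchNo|TotalCount plus EmailID$EmailID$...
class IDSReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long totalCount = 0;
        StringBuilder emails = new StringBuilder();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            if (emails.length() > 0) {
                emails.append('$');
            }
            emails.append(parts[0]);
            totalCount += parts.length > 1 ? Long.parseLong(parts[1]) : 1;
        }
        context.write(new Text(key.toString() + "|" + totalCount), new Text(emails.toString()));
    }
}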

HadoopInitiator – Remote MapReduce Job Control

Description: Class to communicate with the Hadoop cluster, focusing on:
 add a job flow
 get the job state - this function can be abstracted to generate the message
queue of job flow states
 terminate a job flow
 add a file to the Hadoop distributed file system
o copy the file onto the Hadoop master
o call the put subcommand to add it
 get a file from the Hadoop distributed file system
o Option 1:
 save it locally
 copy it to the server that needs this file
o Option 2:
 get the file directly via its URL:
http://datanode:50075/$user_dir/$file_name
 update the job flow with parameters (id, instances)
 monitor the scalability of the Hadoop cluster and dynamically scale it up/down

Here we use an SSH connection to reach the primary Hadoop namenode (the first
master; we will call it the Hadoop master. If the job tracker is not configured on the
same server, the job controller had better talk to the job tracker directly; let's
keep it simple here).
The HadoopInitiator must be able to reach the Hadoop master via SSH; we need to make
sure of this when the Hadoop cluster is built. This information can be stored in a
configuration file (we will hard-code it in the ScaleListener temporarily). A sketch of
this SSH-based job control follows at the end of this section.
There are two ways to do this:
 Use SOAP - just use the MapReduce WSDL provided by Amazon
 Use HttpClient to send requests and receive XML responses; it is easier and faster
Notes:
We need a generic design here to make sure the component can work with both
Apache MapReduce and Amazon MapReduce.
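The sketch below shows what the SSH-based control described above could look like, using the JSch library as the SSH client (an assumption; any SSH library would do). The host, credentials, and commands are illustrative and would normally come from the configuration file mentioned above.

import java.io.InputStream;
import java.util.Scanner;

import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class HadoopInitiator {

    private final String masterHost;
    private final String user;
    private final String privateKeyPath;

    public HadoopInitiator(String masterHost, String user, String privateKeyPath) {
        this.masterHost = masterHost;
        this.user = user;
        this.privateKeyPath = privateKeyPath;
    }

    // Runs a shell command on the Hadoop master over SSH and returns its stdout.
    public String runOnMaster(String command) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity(privateKeyPath);
        Session session = jsch.getSession(user, masterHost, 22);
        session.setConfig("StrictHostKeyChecking", "no"); // demo only
        session.connect();
        try {
            ChannelExec channel = (ChannelExec) session.openChannel("exec");
            channel.setCommand(command);
            InputStream out = channel.getInputStream();
            channel.connect();
            Scanner scanner = new Scanner(out).useDelimiter("\\A");
            String output = scanner.hasNext() ? scanner.next() : "";
            channel.disconnect();
            return output;
        } finally {
            session.disconnect();
        }
    }

    // Adds a file to HDFS: copy it to the master first (scp/sftp omitted), then "hadoop fs -put".
    public void putFile(String pathOnMaster, String hdfsPath) throws Exception {
        runOnMaster("hadoop fs -put " + pathOnMaster + " " + hdfsPath);
    }

    // Submits the IDSFilter jar as a new job ("add a job flow").
    public String addJobFlow(String jarPath, String input, String output) throws Exception {
        return runOnMaster("hadoop jar " + jarPath + " FilterDriver " + input + " " + output);
    }
}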

ScaleListener – Hadoop Auto-Scaling component

 IScaleListener – the interface of this component; we can use different
implementations based on the environment our cloud is built on.
o Just a placeholder for now. The implementation depends on how the
scalability information is collected (see the sketch after this list).
 ScaleListener – dynamically checks the scaling needs of the MR component.
o Option 1: get CPU/memory utilization data from the HP Cloud controller
o Option 2: get the utilization from the Hadoop Vaidya script
o Option 3: use a tool like CSSH/Puppet to directly get utilization info
from all server instances in the Hadoop cluster
 AWS scale toolkit wrapper - wraps Amazon's Auto Scaling toolkit
(deprecated; we won't use Amazon MapReduce, it is just left here for
future reuse)
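A possible shape for the IScaleListener placeholder is sketched below; the method names and the ClusterUtilization value object are assumptions, since the real interface depends on how the scalability information is collected.

// Placeholder interface for the auto-scaling component; implementations map to
// the options above (cloud controller, Vaidya script, CSSH/Puppet).
public interface IScaleListener {

    // Polls the current utilization of the Hadoop cluster.
    ClusterUtilization collectUtilization() throws Exception;

    // Asks the cloud provisioning system to scale the cluster up or down
    // by the given number of worker nodes (negative = scale down).
    void requestScale(int deltaNodes) throws Exception;
}

// Simple value object carrying the metrics the listener decides on.
class ClusterUtilization {
    final double avgCpuPercent;
    final double avgMemPercent;

    ClusterUtilization(double avgCpuPercent, double avgMemPercent) {
        this.avgCpuPercent = avgCpuPercent;
        this.avgMemPercent = avgMemPercent;
    }
}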

HiveReport – BI in Cloud sample

Hive is a subproject of Hadoop. By using Hive, you can build a data warehouse directly on top of
the Hadoop distributed file system. It provides the tools to query data in HDFS directly,
without transformation or integration.
Since Hive provides HiveQL (similar to normal SQL) and JDBC-like
operations, we can simply use normal reporting tools (e.g. JasperReports) to
analyze the database and generate summary or analysis reports over all the
distributed data.

Steps:
1. Design what kind of report we want to generate
2. Create Hive tables to store the required data
3. Use JasperReports to generate a report, using a JDBC connection to get data from the
Hive data warehouse (a query sketch follows after these steps)
4. Create a HiveServlet to generate the report
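A minimal sketch of the data access in step 3, assuming the classic Hive JDBC driver (org.apache.hadoop.hive.jdbc.HiveDriver) listening on port 10000; the host, table, and column names are hypothetical placeholders for the IDSFilter output loaded into Hive.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReportQuery {
    public static void main(String[] args) throws Exception {
        // Older Hive releases ship org.apache.hadoop.hive.jdbc.HiveDriver with a
        // jdbc:hive:// URL; HiveServer2 uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2://.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://hadoop-master:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hypothetical summary table holding the IDSFilter output.
        ResultSet rs = stmt.executeQuery(
                "SELECT product, manufacturer, batchno, total_count "
                + "FROM impacted_products ORDER BY total_count DESC");
        while (rs.next()) {
            System.out.printf("%s|%s|%s|%d%n",
                    rs.getString(1), rs.getString(2), rs.getString(3), rs.getLong(4));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}

The same JDBC connection details can be plugged into a JasperReports data source so the HiveServlet only has to fill and render the compiled report.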

Notes:
- The performance of Hive is a bit poor; the same feature implemented with plain MR
is almost 3 times faster than Hive, but the Hive team is improving the
performance. I think it is still well worth trying.
- We could even put all the data in HDFS instead of using a standalone database to
store it. Then we would have a purely distributed database.
