MapReduce Component Design
The HadoopInitiator component will be responsible for communicating with the Hadoop cluster, so
that we can control the MapReduce job, monitor the job status, and extend the cluster when
computing resources are lacking.
The IDSFilter (MapReduce component) is responsible for processing the transaction data from
retailers and producing a list of impacted customers and product information.
And the HiveReport component is a new component used to demonstrate the BI feature in the
cloud. We use Hive (a subproject of Hadoop) directly, since it is easy to reuse the current
Hadoop cluster environment.
(Component diagram: HadoopMaster, HadoopInitiator, ScaleListener, IDSFilter)
Suggestion:
1. Suggest adding a count number after each BatchNo, so that we can really know the impact
and make the data more realistic.
2. Suggest including only one product per line, so that we don't need complicated logic in the
MapReduce module. Putting more than one record on a line would make the MapReduce
component more complicated, and we might need to add more steps to this module, which
means more than one job flow would be needed.
Eg:
Input:
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
TransactionID|TransactionDate|CustomerName|EmailID|Product,Manufacturer,BatchNo,Count
Output:
Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID
Product|Manufacturer|BatchNo|TotalCount|EmailID$EmailID$EmailID$EmailID
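The suggested one-product-per-line format can be sketched as a small map/reduce simulation. This is plain Python for illustration, not the actual IDSFilter code, and the field values in the usage below are made up:

```python
# Simulate the IDSFilter map/reduce over the proposed single-product-per-line
# input format, producing Product|Manufacturer|BatchNo|TotalCount|email$email lines.
from collections import defaultdict

def map_line(line):
    """Map one transaction line to (Product|Manufacturer|BatchNo, (count, email))."""
    tx_id, tx_date, customer, email, product_part = line.strip().split("|")
    product, manufacturer, batch_no, count = product_part.split(",")
    key = "|".join([product, manufacturer, batch_no])
    return key, (int(count), email)

def reduce_all(pairs):
    """Aggregate mapped pairs into the suggested output lines."""
    totals = defaultdict(int)
    emails = defaultdict(list)
    for key, (count, email) in pairs:
        totals[key] += count
        emails[key].append(email)
    return ["%s|%d|%s" % (key, totals[key], "$".join(emails[key]))
            for key in sorted(totals)]

lines = [
    "T1|2010-01-02|Alice|alice@example.com|Milk,Acme,B42,2",
    "T2|2010-01-03|Bob|bob@example.com|Milk,Acme,B42,1",
]
result = reduce_all(map_line(l) for l in lines)
# One output line per Product|Manufacturer|BatchNo, with the email list joined by '$'.
```

With one product per line, a single map and reduce pass is enough, which is exactly why the suggestion avoids multi-product lines and extra job flows.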
Todo:
Decide the output file path and naming convention, eg.
S3://output/$manufacturer.$product.$batchno
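The naming convention above is still to be decided; as a sketch only, the pattern from the example would be built like this (bucket name and field order follow the example, nothing here is final):

```python
# Build the proposed S3 output path: s3://<bucket>/<manufacturer>.<product>.<batchno>.
# The "output" bucket name is taken from the example and is not a decided value.
def s3_output_path(manufacturer, product, batch_no, bucket="output"):
    return "s3://%s/%s.%s.%s" % (bucket, manufacturer, product, batch_no)
```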
Job Control
Description: class to communicate with the AWS Management Console / Hadoop cluster. It focuses on:
- Add a job flow
- Get the job state - this function can be abstracted to generate the message queue of job flow states
- Terminate a job flow
- Add a file to the Hadoop distributed file system:
  o Copy the file onto the Hadoop master
  o Call the put sub-command to add it
- Get a file from the Hadoop distributed file system:
  o Option 1: save it locally, then copy it to the server that needs this file
  o Option 2: get this file directly via its URL:
    http://datanode:50075/$user_dir/$file_name
- Update the job flow with parameters (id, instances)
- Monitor the scalability of the Hadoop cluster, dynamically scaling up/down
Here we use an SSH connection to connect to the primary Hadoop namenode (the first
master; we will call it the Hadoop master. If the job tracker is not configured on the
same server, then the job controller had better talk to the job tracker directly. Let's
keep it simple here).
The HadoopInitiator can reach the Hadoop master via SSH; we need to make sure of
this when the Hadoop cluster is built. This information can be stored in a
configuration file (we will hard-code it in the ScaleListener temporarily).
There are two ways to do this:
- Use SOAP - just use the MapReduce WSDL provided by Amazon
- Use HttpClient to send the request and receive the XML response; this is easier and faster
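For the HttpClient option, the interesting part is parsing the job flow state out of the XML response. A hedged sketch follows; the XML layout below is hypothetical and the real Amazon Elastic MapReduce response format should be checked against the API documentation:

```python
# Parse job-flow states out of an XML response. The element names here
# (DescribeJobFlowsResult, JobFlow, JobFlowId, State) are assumptions for
# illustration, not the verified Amazon response schema.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """
<DescribeJobFlowsResult>
  <JobFlow>
    <JobFlowId>j-1234567890</JobFlowId>
    <State>RUNNING</State>
  </JobFlow>
</DescribeJobFlowsResult>
"""

def job_flow_states(xml_text):
    """Return {job_flow_id: state} parsed from the XML response body."""
    root = ET.fromstring(xml_text)
    return {jf.findtext("JobFlowId"): jf.findtext("State")
            for jf in root.findall("JobFlow")}
```

The same parsing works whatever HTTP client sends the request, which is part of why the HttpClient option is the simpler of the two.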
Notes:
Need a generic design here to make sure the component can work with both
Apache MapReduce and Amazon MapReduce.
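The generic design asked for in the notes could take the shape of one abstract job-control interface with a backend per cluster type. A minimal sketch, with illustrative class and method names (the in-memory state is a stand-in, not real cluster calls):

```python
# One interface, two backends: the component talks to JobFlowController and
# does not care whether Apache MapReduce or Amazon MapReduce is behind it.
from abc import ABC, abstractmethod

class JobFlowController(ABC):
    @abstractmethod
    def add_job_flow(self, name, steps): ...
    @abstractmethod
    def get_job_state(self, job_flow_id): ...
    @abstractmethod
    def terminate_job_flow(self, job_flow_id): ...

class ApacheHadoopController(JobFlowController):
    """Would drive a self-managed cluster, e.g. over SSH to the Hadoop master.
    Here job state is tracked in memory purely to show the interface."""
    def __init__(self):
        self._jobs = {}
    def add_job_flow(self, name, steps):
        job_id = "job-%d" % (len(self._jobs) + 1)
        self._jobs[job_id] = "RUNNING"
        return job_id
    def get_job_state(self, job_flow_id):
        return self._jobs[job_flow_id]
    def terminate_job_flow(self, job_flow_id):
        self._jobs[job_flow_id] = "TERMINATED"

class AmazonEMRController(JobFlowController):
    """Would call the Amazon Elastic MapReduce web service (SOAP or HTTP)."""
    def add_job_flow(self, name, steps):
        raise NotImplementedError
    def get_job_state(self, job_flow_id):
        raise NotImplementedError
    def terminate_job_flow(self, job_flow_id):
        raise NotImplementedError
```

The HadoopInitiator would then be configured with whichever backend matches the cluster, keeping the rest of the component unchanged.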
Hive is a subproject of Hadoop. By using Hive, you can build a data warehouse directly on top of
the Hadoop distributed file system. It provides the tools to query data from HDFS directly,
without transformation or integration.
Since Hive provides HiveQL (similar to normal SQL) and JDBC-like
operations, we can just use normal reporting tools (eg. JasperReport) to
analyze the database and generate summary or analysis reports of all the distributed
data.
Steps:
1. Design what kind of report we want to generate
2. Create Hive tables to store required data
3. Use JasperReport to generate a report, using a JDBC connection to get data from
the Hive data warehouse.
4. Create HiveServlet to generate the report
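Steps 2 and 3 could look roughly like the HiveQL below, kept here as query strings that a JDBC-based reporting tool (eg. JasperReport) would run. The table and column names are illustrative assumptions, not a finalized schema:

```python
# Hypothetical Hive DDL for step 2: a table holding the IDSFilter output,
# using '|' as the field delimiter to match the proposed output format.
CREATE_IMPACT_TABLE = """
CREATE TABLE impacted_products (
  product STRING,
  manufacturer STRING,
  batch_no STRING,
  total_count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
"""

# Hypothetical report query for step 3: impact summarized per manufacturer
# and product, suitable as the dataset query of a JasperReport template.
REPORT_QUERY = """
SELECT manufacturer, product, SUM(total_count) AS impacted
FROM impacted_products
GROUP BY manufacturer, product
"""
```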
Notes:
- The performance of Hive is a bit poor: the same feature implemented with normal MR
is almost 3 times faster than Hive, but the Hive team is improving the
performance. I think it's well worth trying.
- We could even put all the data in HDFS, instead of using a standalone database to
store it. Then we would have a pure distributed database.