
UNIT 3

1.i) In case you are assigned to implement the PageRank algorithm using MapReduce for a webpage graph analysis system, break down the parameters associated with the anatomy of a MapReduce job run.
ANS:
Input Path: Path to the input graph data in HDFS.
Output Path: Path to store the PageRank results.
Number of Iterations: Number of times the MapReduce job iterates until the ranks converge.
Damping Factor: Probability (typically 0.85) that a random surfer follows a link rather than jumping to a random page.
Mapper and Reducer Classes: Custom classes implementing the PageRank algorithm.
Job Configuration: Configuration settings for the MapReduce job, such as memory and parallelism (see the driver sketch below).
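A minimal sketch of how these parameters could be wired together with the standard Hadoop MapReduce Java API is shown below; the mapper/reducer bodies and the pagerank.* property names are hypothetical placeholders, not a definitive implementation.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankDriver {

    // Skeleton mapper: would parse an adjacency-list line and emit rank contributions.
    public static class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // PageRank-specific parsing and emission elided in this sketch.
        }
    }

    // Skeleton reducer: would sum contributions and apply the damping factor.
    public static class PageRankReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text page, Iterable<Text> contribs, Context ctx)
                throws IOException, InterruptedException {
            // newRank = (1 - d) + d * sum(contributions); emission elided.
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setFloat("pagerank.damping", 0.85f); // damping factor (hypothetical key)
        conf.setInt("pagerank.iterations", 10);   // iteration count (hypothetical key)

        // One iteration shown; a real driver would loop, chaining each
        // iteration's output path into the next iteration's input path.
        Job job = Job.getInstance(conf, "pagerank");
        job.setJarByClass(PageRankDriver.class);
        job.setMapperClass(PageRankMapper.class);
        job.setReducerClass(PageRankReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input graph in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path for ranks
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Passing the damping factor through the Configuration lets mappers and reducers read it at run time via context.getConfiguration().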
ii) Explain how you would generate test data for your collaborative filtering MapReduce job, how you would conduct local tests, and the outcome of this approach.
ANS:

Synthetic Data Generation: Create user-item rating matrices with random values.
Subsample Real Data: Extract a small, representative subset from a real dataset.
Local Testing: Run the MapReduce job on a local Hadoop instance or in pseudo-distributed mode using the generated data.
Outcome: Logic errors are caught quickly on small data, giving a fast, low-cost iteration loop before the job is run at cluster scale.
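A minimal sketch of synthetic test-data generation, assuming a tab-separated "userId itemId rating" input format; the file name, sizes, and density are illustrative assumptions:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class RatingsGenerator {
    public static void main(String[] args) throws IOException {
        int users = 100, items = 50; // small sizes chosen for local testing
        double density = 0.1;        // fraction of (user, item) pairs that get a rating
        Random rnd = new Random(42); // fixed seed so test runs are repeatable

        try (PrintWriter out = new PrintWriter("ratings.txt")) {
            for (int u = 0; u < users; u++) {
                for (int i = 0; i < items; i++) {
                    if (rnd.nextDouble() < density) {
                        int rating = 1 + rnd.nextInt(5);           // rating in 1..5
                        out.println(u + "\t" + i + "\t" + rating); // userId, itemId, rating
                    }
                }
            }
        }
    }
}
```

The resulting ratings.txt can then be fed to the job with mapreduce.framework.name set to local, which runs the whole pipeline in a single JVM and makes the quick feedback loop possible.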
iii) Assume that you are tasked with implementing a word count job using classic MapReduce on a large dataset stored in HDFS; sketch the MapReduce flow for this job.
ANS:
 Mapper:
   Input: (File Offset, Line)
   Output: (Word, 1)
   Logic: Tokenize each line and emit (word, 1) for each word.
 Reducer:
   Input: (Word, List<1>)
   Output: (Word, Count)
   Logic: Sum up the list of 1s for each word and emit (word, count).
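The sketch above corresponds to the standard Hadoop WordCount example; a runnable version is given below as a sketch (class and variable names follow the usual tutorial conventions rather than any single official source):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Tokenize the line and emit (word, 1) for each token.
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            // Sum the list of 1s and emit (word, count).
            int sum = 0;
            for (IntWritable one : ones) sum += one.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```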
2.i) With reference to failure occurrences in classic MapReduce and YARN, specify how classic MapReduce handles task failures during job execution and manages resource utilization.

Ans:
 Task Retries: A failed task is retried automatically, up to four attempts by default.
 Task Re-Execution: A task that keeps failing is re-executed on a different node.
 Speculative Execution: Slow tasks are detected and duplicate copies are launched on other nodes; whichever copy finishes first wins.
 Heartbeat Mechanism: TaskTrackers send regular heartbeats to the JobTracker; missed heartbeats indicate failure.
 JobTracker Actions: Reassigns failed tasks and rebalances the job load (the related configuration knobs are sketched below).
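These behaviours are driven by job configuration. A minimal sketch, assuming the Hadoop 2.x property names (the Hadoop 1.x classic equivalents were mapred.map.max.attempts and mapred.map.tasks.speculative.execution); the values shown are the usual defaults, not tuning advice:

```java
import org.apache.hadoop.conf.Configuration;

public class FailureTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);    // attempts per map task before the job fails
        conf.setInt("mapreduce.reduce.maxattempts", 4); // attempts per reduce task
        conf.setBoolean("mapreduce.map.speculative", true);    // allow speculative map duplicates
        conf.setBoolean("mapreduce.reduce.speculative", true); // allow speculative reduce duplicates
        return conf;
    }
}
```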

ii) In a YARN cluster, when a NodeManager becomes unresponsive, describe how YARN handles the node failure and redistributes resources to ensure the continued execution of applications.

Ans:
 NodeManager Unresponsiveness: The ResourceManager detects unresponsive NodeManagers through missed heartbeats.
 ResourceManager Actions: Marks the node as lost and redistributes its tasks to other healthy nodes.
 Container Reassignment: Containers from the failed node are reallocated to available nodes.
 ApplicationMaster Recovery: The ApplicationMaster can be restarted on a different node to continue managing its application (the relevant timeouts are sketched below).
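A minimal sketch of the configuration knobs that govern this behaviour, assuming the standard yarn-site.xml property names; the values shown are the usual defaults, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class YarnFailureTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // How long the ResourceManager waits for NodeManager heartbeats
        // before declaring the node lost (default is 10 minutes).
        conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600_000L);
        // How many times a failed ApplicationMaster may be restarted.
        conf.setInt("yarn.resourcemanager.am.max-attempts", 2);
        return conf;
    }
}
```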

iii) While optimizing the performance of a classic MapReduce job on a large dataset, describe the standard techniques used to improve job execution time and resource utilization.

Ans:
 Data Locality: Ensure tasks run on the nodes where their data resides to minimize data transfer.
 Combiner Usage: Use a combiner to reduce the amount of data transferred between mapper and reducer (see the sketch after this list).
 Configuration Tuning: Adjust parameters such as the number of mappers and reducers and the memory allocation.
 Intermediate Data Compression: Compress intermediate map output to reduce I/O and network load.
 Incremental Processing: Process only new or changed data to avoid reprocessing the entire dataset.
 Custom Partitioners: Use custom partitioners to ensure even data distribution among reducers.
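Two of these techniques applied to the WordCount job sketched earlier, as a minimal example; the Snappy codec requires the native library on the cluster, so treat the codec choice as an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TuningExample {
    public static Job tunedJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut shuffle I/O and network load.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "tuned word count");
        // Reuse the reducer as a combiner: summing counts is associative and
        // commutative, so computing partial sums on the map side is safe.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        return job;
    }
}
```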
3.i) Differentiate MongoDB and a traditional DBMS.

ANS:
 Data Model: MongoDB stores BSON documents; a relational DBMS stores rows in tables.
 Schema: MongoDB is schema-flexible; a relational DBMS enforces a fixed schema.
 Query Language: MongoDB uses its own query API; a relational DBMS uses SQL.
 Scaling: MongoDB scales horizontally via sharding; a relational DBMS typically scales vertically.
 Relationships: MongoDB favours embedded documents and $lookup; a relational DBMS uses joins and foreign keys.

ii) Give the importance of the dropDatabase() method and the drop() method.

 The dropDatabase() method is crucial for completely removing a database, including all collections and data, freeing up resources and simplifying database management.
 The drop() method is important for deleting specific collections within a database, allowing more granular control over data cleanup and resource management.
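In the mongo shell these are db.dropDatabase() and db.<collection>.drop(); a minimal sketch of the MongoDB Java driver equivalents is below, where the connection URI, database, and collection names are illustrative assumptions:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class DropExample {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (URI is an assumption).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("testdb"); // illustrative database name
            db.getCollection("users").drop(); // drop one collection ("users" is illustrative)
            db.drop();                        // drop the whole database, all collections included
        }
    }
}
```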

iii)Specify the Key components of MongoDB architecture

1. Documents: Basic units of data in BSON format.

2. Collections: Groups of documents, similar to tables.

3. Databases: Containers for collections.

4. Replica Sets: Ensure high availability through data replication.

5. Sharding: Distributes data across servers for horizontal scaling.

6. Mongod: Primary daemon handling data requests.

7. Mongos: Routes client requests in sharded clusters.

8. Config Servers: Store metadata and configuration for sharded clusters.
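How a client addresses some of these components can be sketched with the Java driver: connecting to a replica set by listing its members and naming the set via the replicaSet URI option (host names and the set name rs0 are illustrative):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class ReplicaSetConnect {
    public static void main(String[] args) {
        // List several replica-set members; the driver discovers the primary
        // and fails over automatically if it becomes unavailable.
        String uri = "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0";
        try (MongoClient client = MongoClients.create(uri)) {
            System.out.println(client.getDatabase("admin").getName());
        }
        // In a sharded cluster, the client would instead connect to one or
        // more mongos routers, which consult the config servers for metadata.
    }
}
```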
