Unit 3 - Bigdata
Synthetic Data Generation: Create user-item rating matrices with random values.
Subsample Real Data: Extract a small, representative subset from a real dataset.
Local Testing: Run the MapReduce job on a local Hadoop instance or in pseudo-distributed mode using the generated data.
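The first step above can be sketched quickly. The snippet below is an illustrative Python helper (the function name and parameters are my own, not from any Hadoop tool) that generates a sparse user-item rating matrix as (user, item, rating) triples, suitable as test input for a local MapReduce run.

```python
import random

def synthetic_ratings(n_users, n_items, density=0.1, seed=42):
    """Generate a sparse user-item rating matrix as (user, item, rating) triples.

    Ratings are random integers from 1 to 5; `density` controls the
    fraction of user-item cells that receive a rating.
    """
    rng = random.Random(seed)  # fixed seed so test data is reproducible
    triples = []
    for u in range(n_users):
        for i in range(n_items):
            if rng.random() < density:
                triples.append((u, i, rng.randint(1, 5)))
    return triples

# Example: a small matrix for local testing.
ratings = synthetic_ratings(100, 50, density=0.05)
print(len(ratings), ratings[:2])
```

Each triple can then be written out as one line of a text file and fed to the job exactly like real input.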
iii) Assume that you are tasked with implementing a word count job using classic MapReduce on a large dataset stored in HDFS. Sketch the Mapper and Reducer for this job.
ANS:
Mapper: For each input line, tokenize it into words and emit a (word, 1) key-value pair for every word.
Reducer: For each word, sum the list of counts received from the shuffle phase and emit (word, total).
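The Mapper and Reducer logic can be sketched as follows. A real job would implement these in Java against Hadoop's Mapper/Reducer API; this minimal Python sketch only simulates the key-value flow (map, shuffle/group-by-key, reduce) in memory.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.strip().split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all partial counts for one word.
    yield (word, sum(counts))

def run_wordcount(lines):
    # Shuffle phase: group mapper output by key, as Hadoop does between phases.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    result = {}
    for word, counts in grouped.items():
        for w, total in reducer(word, counts):
            result[w] = total
    return result

print(run_wordcount(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

In the real job, `grouped` never materializes on one machine: the framework partitions keys across reducers and streams each key's values to its reducer.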
Ans:
Task Retries: When a task fails, it is retried a default number of times (usually 4).
Task Re-Execution: If a task fails beyond retries, it is re-executed on a different node.
Speculative Execution: Slow tasks are detected and duplicate copies are launched on different nodes; whichever copy finishes first is used.
Heartbeat Mechanism: TaskTrackers send regular heartbeats to the JobTracker; missed
heartbeats indicate failure.
JobTracker Actions: Reassigns failed tasks and rebalances the job load.
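The retry behavior above can be illustrated with a small sketch. This is not Hadoop code; it is a hypothetical Python loop mimicking the default policy of re-running a failed task up to four times before giving up.

```python
def run_with_retries(task, max_attempts=4):
    """Re-run a failing task up to max_attempts times, mimicking the
    JobTracker's default task-retry policy (usually 4 attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except Exception:
            if attempt == max_attempts:
                raise  # failed beyond the retry limit: report the job as failed

# Example: a task that fails twice, then succeeds on the third attempt.
attempts_seen = []
def flaky_task(attempt):
    attempts_seen.append(attempt)
    if attempt < 3:
        raise RuntimeError("task failed")
    return "done"

print(run_with_retries(flaky_task))  # done
```

In real Hadoop, each retry may additionally be scheduled on a different node, which is the re-execution step listed above.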
Ans:
Node Manager Unresponsiveness: The ResourceManager detects unresponsive
NodeManagers through missed heartbeats.
ResourceManager Actions: Marks the node as lost and redistributes tasks to other
healthy nodes.
Container Reassignment: Containers from the failed node are reallocated to available
nodes.
Application Master Recovery: The ApplicationMaster can restart on a different node to
continue managing its application.
Ans:
Data Locality: Ensure tasks run on nodes where data is located to minimize data
transfer.
Combiner Usage: Use a combiner to reduce the amount of data transferred between
Mapper and Reducer.
Configuration Tuning: Adjust parameters like the number of mappers, reducers, and
memory allocation.
Intermediate Data Compression: Compress intermediate data to reduce I/O and
network load.
Incremental Processing: Process only new or changed data to avoid reprocessing the
entire dataset.
Custom Partitioners: Use custom partitioners to ensure even data distribution among
reducers.
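The Combiner Usage point above can be made concrete with a sketch. The snippet below (illustrative Python, not Hadoop's API) counts how many (word, 1) records a mapper would emit for a word count job, then shows how a mapper-side combiner collapses them before the shuffle.

```python
from collections import Counter

def map_phase(lines):
    # Mapper output without a combiner: one (word, 1) pair per token.
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def combine(pairs):
    # Combiner: pre-aggregate counts on the mapper side before the shuffle,
    # so each distinct word crosses the network once per mapper.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return list(local.items())

lines = ["big data big jobs", "big data"] * 100
raw = map_phase(lines)
combined = combine(raw)
print(len(raw), len(combined))  # 600 3
```

Here 600 intermediate records shrink to 3, one per distinct word, which is exactly the I/O and network saving the combiner provides.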
3. i) Differentiate between MongoDB and a traditional DBMS.