
Unit - 4

1.
(i) Audit management wants to analyze large volumes of transaction logs to identify potential
fraudulent activity. Assess the exploitation of HDFS (Hadoop Distributed File System) in
this context.

HDFS (Hadoop Distributed File System) is an excellent choice for analyzing large transaction
logs to identify fraudulent activity. It splits files into blocks and distributes them across
multiple nodes, allowing parallel processing and rapid analysis of vast volumes of data. This
scalability makes it practical to scan complete transaction histories for anomalies and
suspicious patterns indicative of fraud. Block replication (three copies by default) provides
fault tolerance and high availability, so analysis can continue even when hardware fails.
Additionally, HDFS integrates well with other Hadoop ecosystem tools such as MapReduce and
Apache Spark, providing the data processing and analytics capabilities essential for effective
fraud detection.
(ii) Scaling out a Hadoop cluster by adding more nodes is more cost-effective and efficient
than scaling up by adding more resources to existing nodes – Conclude this assertion with
respect to statements (a) and (b) below.
(a) The outcomes of scaling out in terms of performance.
(b) The outcomes of scaling up in terms of performance.

Scaling out a Hadoop cluster by adding more nodes is more cost-effective and efficient than
scaling up by adding more resources to existing nodes. This conclusion can be drawn based on
the following:

(a) The outcomes of scaling out in terms of performance: Scaling out involves adding more
nodes to the cluster, which distributes the data and processing load across more machines. This
improves performance by allowing parallel processing and reducing the time required for
data-intensive tasks. It enhances fault tolerance and provides better resource utilization, making
it easier to handle larger datasets and complex computations efficiently.

(b) The outcomes of scaling up in terms of performance: Scaling up involves adding more
resources (CPU, RAM, storage) to existing nodes. While this can improve performance to some
extent, it has limitations due to hardware constraints and diminishing returns. As the resource
requirements grow, the cost of high-end hardware increases significantly, making it less
cost-effective. Additionally, scaling up does not improve fault tolerance: the cluster still
depends on a single, larger node, which remains a single point of failure and a potential
bottleneck.

In conclusion, scaling out offers better performance, scalability, and cost-effectiveness for
large-scale data processing in Hadoop clusters than scaling up.
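A toy back-of-the-envelope model in Python makes the contrast concrete. The throughput numbers
below are illustrative assumptions, not benchmarks, and the model ignores shuffle and
coordination overhead.

# Toy model: time to scan a dataset under scale-out vs. scale-up.
# All numbers are illustrative assumptions, not measurements.

DATA_TB = 100                  # dataset size in terabytes
NODE_THROUGHPUT_TB_H = 0.5     # assumed scan throughput of one commodity node

def scan_hours_scale_out(nodes):
    # Scaling out: nodes scan their local blocks in parallel (ideal case).
    return DATA_TB / (NODE_THROUGHPUT_TB_H * nodes)

def scan_hours_scale_up(speedup):
    # Scaling up: one node, made faster by a bounded hardware speedup factor.
    return DATA_TB / (NODE_THROUGHPUT_TB_H * speedup)

print(scan_hours_scale_out(10))  # 20.0 hours with 10 commodity nodes
print(scan_hours_scale_up(4))    # 50.0 hours even with a 4x faster single node
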
2.
(i) Construct the design choice that you would make to ensure high availability and fault
tolerance in the HDFS setup. Consider factors such as NameNode configuration, data
replication strategy and hardware selection.

To avoid the NameNode being a single point of failure, configure NameNode High Availability
with an active and a standby NameNode that share edit logs through a Quorum Journal Manager,
and use ZooKeeper for automatic failover. For the replication strategy, keep the default
replication factor of 3 and enable rack awareness so block replicas are spread across racks,
protecting data from both disk and rack failures. For hardware, give the NameNodes reliable
machines with ample RAM (the entire namespace is held in memory), while the DataNodes can be
inexpensive commodity servers with large local disks, since replication lets HDFS tolerate
individual DataNode failures.

(ii) In view of sensor data processing, a production plant wants to analyze real time sensor
data from machines to monitor production efficiency. Suggest the integrative actions of
HDFS.

HDFS can integrate with real-time processing tools like Apache Kafka and Apache Spark
Streaming to analyze sensor data in a production plant. Sensor data is ingested via Kafka, stored
in HDFS for durability, and processed in real-time using Spark Streaming. This setup ensures
efficient, scalable monitoring of production efficiency, enabling timely insights and
decision-making.
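A minimal sketch of that pipeline using PySpark Structured Streaming is given below. The broker
address, topic name, JSON schema and HDFS paths are assumptions for the example, and the Kafka
source requires the spark-sql-kafka connector package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("SensorMonitor").getOrCreate()

# Assumed JSON payload produced by the plant's sensors.
schema = StructType([
    StructField("machine_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Ingest sensor events from Kafka (broker and topic names are assumptions).
raw = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensors")
            .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Rolling per-machine averages over 5-minute windows.
metrics = (events.withWatermark("event_time", "10 minutes")
                 .groupBy(F.window("event_time", "5 minutes"), "machine_id")
                 .agg(F.avg("temperature").alias("avg_temp")))

# Persist results to HDFS for durability and later batch analysis.
query = (metrics.writeStream.outputMode("append")
                .format("parquet")
                .option("path", "hdfs://namenode:8020/metrics")
                .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/sensors")
                .start())
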

(iii) Conclude the associated dataflow in HDFS during a MapReduce job and identify
potential bottlenecks. Consider factors such as data locality and storage.

In HDFS during a MapReduce job, the data is divided into blocks and stored across different
nodes, leveraging data locality by assigning map tasks to nodes where the data resides. This
minimizes network bandwidth usage. Map tasks process these data blocks, generating
intermediate key-value pairs. The shuffle and sort phase redistributes these pairs across the
network to appropriate reduce tasks, which aggregate and produce the final output. Potential
bottlenecks include network congestion during the shuffle phase, which can slow down data
transfer, and data skew, where uneven data distribution causes some nodes to handle
disproportionately more data, leading to imbalanced workloads and reduced overall efficiency.
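The toy Python simulation below walks through the same map, shuffle and reduce phases on a word
count and shows how a hot key creates skew; it illustrates the dataflow only and is not Hadoop's
actual implementation.

from collections import defaultdict

# Toy word count over "blocks", mimicking the MapReduce dataflow.
blocks = [["alice", "bob"], ["bob", "bob", "carol"], ["bob"]]

# Map phase: each block is processed where it resides (data locality);
# mappers emit intermediate key-value pairs.
intermediate = [(word, 1) for block in blocks for word in block]

# Shuffle phase: pairs are grouped by key and moved across the network to
# reducers. A hot key like "bob" lands entirely on one reducer: data skew.
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: each reducer aggregates the values for its keys.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'alice': 1, 'bob': 4, 'carol': 1}
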
3.
(i) Give an example for a schema-less database

Schema-less databases are known as NoSQL databases because data isn't stored in relational
tables. Instead, data is stored in other structures, such as key-value pairs, documents,
columns, or graphs. Examples of schema-less databases include MongoDB and RavenDB.
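As a small illustration of the schema-less property, the pymongo sketch below (assuming a local
MongoDB instance; the database, collection and field names are invented) inserts two documents
with different shapes into the same collection.

from pymongo import MongoClient  # assumed installed: pip install pymongo

# Connect to a local MongoDB instance (an assumption for this sketch).
client = MongoClient("mongodb://localhost:27017")
users = client["demo"]["users"]

# No schema is declared up front: documents in one collection may differ.
users.insert_one({"name": "Alice", "email": "alice@example.com"})
users.insert_one({"name": "Bob", "tags": ["admin"], "age": 30})

print(users.find_one({"name": "Bob"}))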

(ii) Specify the features of the key-value and document data models

Key-Value Data Model Features:

1. Simplicity: Data is stored as a collection of key-value pairs, where each key is unique.
2. Scalability: Highly scalable, suitable for distributed systems, allowing horizontal scaling by
adding more nodes.
3. Performance: Optimized for fast read and write operations, making it efficient for
high-velocity data.
4. Flexibility: Allows for storing arbitrary data types, providing schema-less data storage.
5. Ease of Use: Simple data model makes it easy to use and understand, ideal for applications
requiring straightforward data retrieval by key.
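To illustrate these features, a minimal sketch using the redis-py client follows; it assumes a
local Redis server on the default port, and the keys and values are invented for the example.

import redis  # third-party client, assumed installed: pip install redis

# Connect to a local Redis server (assumed running on the default port).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Simple, fast writes and reads, addressed by a unique string key.
r.set("user:1001:name", "Alice")
r.set("user:1001:last_login", "2024-05-01T10:00:00Z")

print(r.get("user:1001:name"))  # -> "Alice"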

Document Data Model Features:

1. Schema Flexibility: Supports semi-structured data, allowing documents to have varying
structures, which can evolve over time.
2. Hierarchical Data: Documents can contain nested structures, arrays, and complex data types,
closely resembling real-world data models.
3. Rich Querying: Offers advanced querying capabilities, including filtering, aggregation, and
indexing on document fields.
4. ACID Transactions: Some document databases provide ACID compliance for transactions,
ensuring data integrity.
5. Compatibility with JSON/BSON: Often uses JSON or BSON formats, making it compatible
with web technologies and easy to work with in various programming languages.
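A short pymongo sketch of features 2 and 3, nested structures and querying on nested fields, is
given below; the server address and all names are assumptions made for the example.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["demo"]["orders"]

# Feature 2: documents can hold nested objects and arrays.
orders.insert_one({
    "order_id": 1,
    "customer": {"name": "Alice", "city": "Madrid"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}],
})

# Feature 3: query and index on fields inside the document via dot notation.
orders.create_index("customer.city")
for doc in orders.find({"customer.city": "Madrid"}):
    print(doc["order_id"])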

(iii) Why is Facebook using a graph database?

Facebook uses a graph database to efficiently manage and query its complex social network data.
Graph databases excel in handling interconnected data, enabling fast and intuitive queries about
relationships between users, such as friends, likes, and shares. This structure supports real-time
recommendations, social graph searches, and personalized content delivery, enhancing user
experience and enabling sophisticated data analysis and insights.
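Facebook's actual graph store (TAO) is proprietary, so the toy Python sketch below only
illustrates the kind of relationship query a graph model makes cheap, such as friend-of-friend
suggestions; the graph data is invented for the example.

# Toy social graph as adjacency sets; illustrative only, not Facebook's TAO.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}

def suggest_friends(user):
    # Friend-of-friend traversal: one hop past the user's direct friends.
    suggestions = set()
    for friend in friends[user]:
        suggestions |= friends[friend]
    return suggestions - friends[user] - {user}

print(suggest_friends("alice"))  # {'dave'}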
