BDA COMBINED
Big Data Analysis – Module 1
Q. Explain the types of Big Data, their sources, and their pros and cons
Big data can be classified into different types based on its structure, source, and nature. Here's
an overview of the main types of big data, along with their sources, pros, and cons:
1. Structured Data:
• Source: This type of data is highly organized and typically resides in relational
databases or spreadsheets. It follows a clear and defined schema.
• Pros:
• Easy to store, query, and analyze using traditional database systems.
• Offers consistency and uniformity.
• Cons:
• Limited flexibility for accommodating new data types or changes in
schema.
• Not well-suited for handling unstructured or semi-structured data.
2. Unstructured Data:
• Source: This data doesn't have a specific form or structure and can come from
sources like text files, emails, videos, images, social media posts, and more.
• Pros:
• Captures a wide range of information, including human insights,
sentiments, and interactions.
• Enables analysis of rich and diverse data sources.
• Cons:
• Difficult to process and analyze using traditional database tools.
• Requires advanced analytics techniques, such as natural language
processing or image recognition.
3. Semi-Structured Data:
• Source: This type of data doesn't conform to a strict schema but has some
organizational properties. Examples include JSON, XML, and NoSQL
databases.
• Pros:
• Combines the flexibility of unstructured data with some level of
organization.
• Well-suited for storing and processing diverse data types.
• Cons:
• May still pose challenges for integration and analysis due to varying
structures and formats.
• Requires specialized tools and skills for effective management.
• Skill Gap: Requires specialized skills, expertise, and training in data science,
analytics, and technology.
Disadvantages and Limitations of Hadoop:
1. Real-Time Processing: Hadoop's traditional MapReduce framework is not well suited to real-time or interactive processing and analysis because of its batch-oriented nature, latency, and job overhead.
2. Data Security: Hadoop's native security features are limited, so additional tools, technologies, and configurations are needed to enforce data security, privacy, compliance, and governance.
3. Data Management and Governance: Hadoop lacks comprehensive data management, governance, metadata management, lineage tracking, and data cataloging capabilities, so it must be integrated with additional tools and platforms to cover these aspects effectively.
4. Complex Ecosystem: Hadoop's large and constantly evolving ecosystem of tools, frameworks, and versions can lead to compatibility, interoperability, integration, and versioning issues for users and organizations.
5. Hardware and Infrastructure Dependencies: Hadoop's performance, reliability, and scalability depend on the underlying hardware, network, and environment, so careful planning, configuration, optimization, and maintenance are required to keep the cluster operating efficiently.
Hadoop is a framework for distributed storage and processing of large datasets across clusters
of computers using simple programming models. It consists of several key components:
1. MapReduce: MapReduce is a programming model and processing engine for
distributed computing on large data sets. It divides the computation into two phases:
Map and Reduce. In the Map phase, input data is divided into smaller chunks,
processed in parallel by map tasks, and outputs intermediate key-value pairs. In the
Reduce phase, these intermediate results are aggregated and processed to produce
the final output. MapReduce abstracts away the complexities of parallel and distributed
processing, making it easier to process large-scale data sets across a cluster of
machines efficiently.
2. HDFS (Hadoop Distributed File System): HDFS is a distributed file system designed
to store large volumes of data reliably and efficiently across commodity hardware. It
follows a master-slave architecture where the NameNode serves as the master and
manages the file system namespace and metadata, while DataNodes store the actual
data blocks and handle read/write requests from clients. HDFS is fault-tolerant and
highly scalable, making it suitable for storing and processing Big Data applications in
Hadoop.
3. YARN (Yet Another Resource Negotiator): YARN is the resource management layer
of Hadoop, introduced in Hadoop 2.x to separate the resource management and job
scheduling functionalities from MapReduce. YARN allows multiple data processing
engines to run on top of Hadoop, enabling a broader range of applications beyond
MapReduce, such as Apache Spark, Apache Flink, and Apache Tez. It consists of
ResourceManager, which manages cluster resources, and NodeManagers, which run
on individual nodes and manage resources available on that node. YARN enhances
the flexibility, scalability, and utilization of resources in Hadoop clusters.
4. Hadoop Common: Hadoop Common includes the essential utilities and libraries that
support other Hadoop modules. It provides the foundational components for Hadoop,
including the Java libraries, utilities, and necessary infrastructure for distributed
computing. Hadoop Common includes modules for authentication, configuration
management, I/O operations, and other common functionalities required by various
Hadoop components. It ensures interoperability and consistency across different
Hadoop ecosystem projects and serves as the core framework upon which other
Hadoop components are built.
HDFS:
• HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components i.e.
1. Name node
2. Data Node
• The Name Node is the prime node and holds the metadata (data about data), so it requires comparatively fewer resources than the Data Nodes, which store the actual data. The Data Nodes are commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
• HDFS coordinates the clusters and the underlying hardware, and thus works at the heart of the system.
YARN:
• Yet Another Resource Negotiator (YARN), as the name implies, manages the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
• It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager allocates resources to the applications in the system, whereas the Node Managers manage the resources (CPU, memory, bandwidth) available on each machine and report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and negotiates resources as per the requirements of the two.
MapReduce:
• By using distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps in writing applications that transform big data sets into manageable results.
• MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and organizes it into groups. Map generates intermediate key-value pairs, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig executes the commands, and all the MapReduce activities are taken care of in the background. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
• Using an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time (interactive) and batch processing. All the SQL data types are supported by Hive, which makes query processing easier.
• Like other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
• The JDBC and ODBC drivers establish the connections and data-storage permissions, whereas the HIVE Command Line helps in processing queries.
Mahout:
• Mahout provides a platform for creating scalable machine learning applications. It offers libraries for common techniques such as classification, clustering, and collaborative filtering that can run on top of Hadoop.
5. Data Security and Governance: NoSQL databases often lack the comprehensive security, governance, compliance, auditing, and monitoring features of traditional relational database management systems, so additional tools, configurations, and practices are needed to enforce data security, privacy, and governance.
6. Integration and Compatibility: NoSQL databases may face challenges in integrating, coexisting, and remaining compatible with existing relational databases, applications, and tools, so careful planning, architecture, and management are required for seamless integration.
7. Skills and Expertise Gap: Organizations may face a skills gap in NoSQL, distributed computing, big data, data science, and analytics, requiring recruitment, training, and retention of the right talent to build, operate, and optimize NoSQL databases effectively.
Q. Business Drivers
A business driver is any factor or catalyst that influences an organization's decision to
adopt big data analytics solutions or initiatives. These drivers are typically aligned with the
strategic objectives and goals of the business and are instrumental in justifying investments
in big data technologies and analytics capabilities. Understanding and identifying these drivers
is crucial for organizations to effectively harness the power of big data and derive actionable
insights to gain a competitive advantage. Here's a detailed explanation of business drivers in
the context of big data analysis:
1. Competitive Advantage: Many organizations leverage big data analytics to gain a
competitive edge in their respective industries. By analyzing large volumes of data
from various sources such as customer transactions, social media interactions, and
market trends, companies can uncover valuable insights about customer preferences,
market trends, and competitor strategies. These insights enable organizations to make
informed decisions, develop innovative products and services, and better serve their
customers, ultimately helping them outperform their competitors.
2. Improved Decision Making: Big data analytics empowers organizations to make
data-driven decisions based on real-time insights rather than relying on intuition or past
experiences. By analyzing historical data and identifying patterns, trends, and
correlations, businesses can anticipate market changes, identify emerging
opportunities, and mitigate risks more effectively. This leads to more informed
decision-making across all levels of the organization, resulting in better outcomes and
improved performance.
3. Enhanced Customer Experience: Customer data is one of the most valuable assets
for businesses, and big data analytics enables organizations to gain deeper insights
into customer behavior, preferences, and sentiment. By analyzing customer
interactions across multiple touchpoints, businesses can personalize their products,
services, and marketing efforts to meet the individual needs and preferences of their
customers. This not only improves customer satisfaction and loyalty but also drives
revenue growth through increased sales and repeat business.
4. Cost Reduction and Efficiency Improvement: Big data analytics can help
organizations optimize their operations, streamline processes, and identify
inefficiencies that may be causing unnecessary costs. By analyzing operational data
and identifying areas for improvement, businesses can optimize their supply chain,
reduce waste, and enhance productivity, leading to cost savings and improved
efficiency. Additionally, predictive analytics can help organizations anticipate
equipment failures, prevent downtime, and optimize maintenance schedules, further
reducing operational costs and improving overall efficiency.
5. Risk Management and Compliance: Big data analytics can play a crucial role in risk
management and compliance by identifying potential risks and compliance issues
before they escalate into major problems. By analyzing data from various sources,
including financial transactions, customer interactions, and regulatory filings,
organizations can detect anomalies, fraudulent activities, and compliance breaches in
real-time. This enables businesses to take proactive measures to mitigate risks, ensure
regulatory compliance, and protect their reputation and brand image.
6. Innovation and New Revenue Streams: Big data analytics can fuel innovation by
uncovering new insights, trends, and opportunities that organizations may not have
otherwise identified, opening the door to new products, services, and revenue streams.
3. Document Database:
A document database stores and fetches data as key-value pairs, but here the values are called documents. A document is a complex data structure and can take the form of text, arrays, strings, JSON, XML, or any similar format. The use of nested documents is also very common. This model is very effective, as most of the data created today is unstructured and usually in the form of JSON.
Advantages:
• This format is well suited to semi-structured data.
• Storing, retrieving, and managing documents is easy.
Limitations:
• Handling relationships that span multiple documents is challenging.
• Aggregation operations may not work accurately.
Examples:
• MongoDB
• CouchDB
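To make the idea concrete, here is a tiny illustration using plain Python dictionaries as stand-in documents; the field names and values are made up, and no particular document database is assumed.

```python
# Two documents in one "collection"; note they do not have to share a schema.
orders = [
    {"_id": 1, "customer": "asha", "items": ["laptop", "mouse"], "total": 55000},
    {"_id": 2, "customer": "ravi", "address": {"city": "Mumbai"}, "total": 1200},
]

# Query by a field value, much like a document store's find() would.
print([doc for doc in orders if doc["total"] > 2000])
```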
Big Data Analysis – Module 3
• It is similar to the Reduce phase but is applied locally on each mapper node
before the data is shuffled and sent to reducers.
• The purpose of the Combiner is to perform partial aggregation of intermediate
key-value pairs to reduce the volume of data transferred during the shuffling
phase.
• By combining intermediate results locally, it reduces the amount of data that
needs to be transferred over the network, thereby improving performance.
Workflow:
1. Splitting: The input data is divided into smaller chunks or splits.
2. Mapping: Each split is processed by the Map function to produce intermediate key-
value pairs.
3. Shuffling & Sorting: Intermediate key-value pairs are shuffled and sorted by key to
group the values associated with each key.
4. Combining (Optional): Local aggregation of intermediate key-value pairs to reduce
data transfer.
5. Reducing: Each unique key and its list of values are processed by the Reduce function
to produce the final output.
Example:
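The worked example from the notes is not reproduced here; as a stand-in, below is a minimal pure-Python sketch of a word-count job following the workflow above (splitting, mapping, shuffling and sorting, reducing). The helper names such as map_fn and reduce_fn are illustrative and not part of Hadoop's actual API.

```python
from collections import defaultdict

def map_fn(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all counts recorded for a single word."""
    return word, sum(counts)

def run_word_count(splits):
    grouped = defaultdict(list)              # Shuffle & sort: group pairs by key
    for split in splits:                     # Splitting
        for word, count in map_fn(split):    # Mapping
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in sorted(grouped.items()))  # Reducing

splits = ["big data is big", "data analysis on big data"]
print(run_word_count(splits))
# {'analysis': 1, 'big': 3, 'data': 3, 'is': 1, 'on': 1}
```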
Q. What are Combiners? How do they work, and what are their pros and cons?
Combiners, also known as "mini-reducers," are optional components in the MapReduce
framework used for improving the efficiency of data processing by performing partial
aggregation of intermediate key-value pairs locally on the mapper nodes before data is
shuffled and sent to the reducer nodes. Combiners are essentially a subset of the Reduce
function and are applied within the Map phase.
How Combiners Work:
1. Local Aggregation:
• During the Map phase, each mapper node processes a portion of the input data
and produces intermediate key-value pairs.
• Instead of immediately sending these intermediate results to the reducer
nodes, the mapper node first applies the Combiner function to locally aggregate
the intermediate key-value pairs with the same key.
• The Combiner function performs partial aggregation, such as summing up
counts or finding maximum values, on the intermediate data.
2. Reducing Data Volume:
• By performing partial aggregation locally on the mapper nodes, the Combiner
reduces the volume of data that needs to be transferred over the network during
the shuffling phase.
• This reduction in data volume can lead to significant improvements in
performance, especially in scenarios where the volume of intermediate data is
substantial.
3. Example:
• Consider a word count example where each mapper node processes a portion
of a large text document.
• Instead of sending all intermediate word-count pairs to the reducers, the
mapper node first applies the Combiner function to locally sum up the counts
for each word.
• As a result, the amount of data transferred during the shuffling phase is
reduced, leading to faster processing.
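A minimal sketch of this word-count-with-combiner idea in plain Python (illustrative only, not Hadoop's actual API; combine_fn is a hypothetical helper standing in for the Combiner):

```python
from collections import Counter, defaultdict

def map_fn(split):
    for word in split.lower().split():
        yield word, 1

def combine_fn(pairs):
    """Combiner: partial, per-mapper aggregation of (word, count) pairs."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()                 # far fewer pairs leave each mapper

def reduce_fn(word, counts):
    return word, sum(counts)

splits = ["to be or not to be", "to see or not to see"]
shuffled = defaultdict(list)
for split in splits:
    for word, count in combine_fn(map_fn(split)):   # local aggregation
        shuffled[word].append(count)                # simulated network transfer
print(dict(reduce_fn(w, c) for w, c in shuffled.items()))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```

Without the combiner, twelve (word, 1) pairs would cross the simulated network in this toy run; with it, only eight partially aggregated pairs do, and the saving grows with the size of each split.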
Pros of Combiners:
1. Reduced Data Transfer:
• Combiners reduce the volume of intermediate data transferred over the
network during the shuffling phase, leading to reduced network traffic and faster
processing.
• This is particularly beneficial in distributed computing environments where
network bandwidth may be a bottleneck.
2. Improved Performance:
• By cutting down the amount of intermediate data that reducers have to fetch and process, combiners reduce network load and shorten overall job execution time.
2. Projection (π):
• Projection is the process of selecting specific columns (attributes) from a
relation while eliminating duplicates.
• It's denoted by the Greek letter pi (π) followed by the list of attributes to be
retained inside parentheses.
• Union combines the results of two queries and returns a set of all distinct rows
present in either or both result sets.
• Intersection returns only the rows that appear in both result sets.
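A small plain-Python sketch of these operations on relations represented as sets of tuples (the relation names and attribute positions are made up for illustration):

```python
# A relation as a set of tuples; attributes are addressed by position.
employees = {("alice", "hr", 50000), ("bob", "it", 60000), ("carol", "hr", 50000)}

def project(relation, positions):
    """Projection (π): keep only the given columns; the set removes duplicates."""
    return {tuple(row[p] for p in positions) for row in relation}

print(project(employees, [1]))   # {('hr',), ('it',)}

r = {("alice",), ("bob",)}
s = {("bob",), ("carol",)}
print(r | s)    # union: all distinct rows present in either relation
print(r & s)    # intersection: rows present in both relations
```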
7
Big Data Analysis – Module 3
8
Big Data Analysis – Module 3
<Numerical>
Q. MapReduce Matrix Multiplication
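The worked numerical itself is not reproduced here; as a stand-in, below is a small pure-Python sketch of the standard one-pass MapReduce scheme for C = A x B. The map phase emits every element of A and B once for each output cell (i, k) it contributes to, the shuffle groups values by (i, k), and the reduce joins the A and B values on the shared index j and sums the products. The matrix sizes and values are illustrative.

```python
from collections import defaultdict

A = [[1, 2],
     [3, 4]]        # m x n
B = [[5, 6],
     [7, 8]]        # n x p
m, n, p = 2, 2, 2

def map_phase():
    # Tag every element with each output cell (i, k) it contributes to.
    for i in range(m):
        for j in range(n):
            for k in range(p):
                yield (i, k), ("A", j, A[i][j])
    for j in range(n):
        for k in range(p):
            for i in range(m):
                yield (i, k), ("B", j, B[j][k])

groups = defaultdict(list)              # shuffle: group values by output cell
for key, value in map_phase():
    groups[key].append(value)

C = [[0] * p for _ in range(m)]         # reduce: join on j and sum the products
for (i, k), values in groups.items():
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    C[i][k] = sum(a[j] * b[j] for j in a if j in b)

print(C)    # [[19, 22], [43, 50]]
```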
Big Data Analysis – Module 4
Applications:
1. Network Traffic Monitoring: DGIM can be used to estimate the number of active
connections or packets with certain characteristics in network traffic streams.
2. Social Media Analytics: It can approximate the frequency of specific events or
keywords in real-time social media feeds.
3. Web Traffic Analysis: DGIM can help estimate the number of active users on a
website or the popularity of certain pages.
Advantages:
1. Memory Efficiency: DGIM consumes a relatively small amount of memory
compared to storing the entire data stream, making it suitable for memory-
constrained environments.
2. Real-Time Analysis: It provides approximate counts in real-time, making it suitable
for monitoring rapidly changing data streams.
3. Accuracy: DGIM provides reasonably accurate estimates of the count of 1s in the
stream, especially for large data sets.
Disadvantages:
1. Approximation: While DGIM provides estimates, these estimates can have a margin
of error depending on the specific characteristics of the data stream and the chosen
parameters.
2. Limited Scope: DGIM is specifically designed for counting the number of 1s in
binary data streams and may not be directly applicable to other types of data analysis
tasks.
3. Complexity: Understanding and implementing DGIM requires a solid understanding
of data structures and algorithms, as well as the specific nuances of streaming data
processing. It may not be suitable for all development teams or applications.
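The algorithm's mechanics are covered earlier in the module; purely as an illustrative sketch of the bucket idea behind DGIM — keep at most two buckets of each power-of-two size, and when a third appears, merge the two oldest — here is a simplified version for a fixed window over a 0/1 stream (the window size and the input bits are made up):

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits of a 0/1 stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []        # (end_timestamp, size), newest first

    def add(self, bit):
        self.time += 1
        # Drop buckets that have slid entirely out of the window.
        self.buckets = [(t, s) for t, s in self.buckets
                        if t > self.time - self.window]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            size = 1
            # Whenever three buckets share a size, merge the two oldest.
            while [s for _, s in self.buckets].count(size) == 3:
                idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                newer, older = idxs[-2], idxs[-1]
                self.buckets[newer] = (self.buckets[newer][0], size * 2)
                del self.buckets[older]
                size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        sizes = [s for _, s in self.buckets]
        # All buckets except the oldest count fully, plus half of the oldest one.
        return sum(sizes[:-1]) + sizes[-1] // 2

d = DGIM(window=16)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]:
    d.add(b)
print(d.estimate())     # approximate count of 1s in the window (exact count: 11)
```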
<Numerical>
Q. Bloom filter <link>
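The numerical itself is linked externally; as a quick refresher on the data structure, here is a minimal Bloom filter sketch (a bit array of m bits plus k hash functions; m and k below are illustrative choices, not values from the question):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k            # m bits, k hash functions
        self.bits = [0] * m

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Never a false negative; false positives are possible.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["hadoop", "spark", "hive"]:
    bf.add(word)
print("spark" in bf)     # True
print("flink" in bf)     # almost certainly False (could be a false positive)
```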
Q. FM (Flajolet–Martin) algorithm <link>
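The numerical is linked externally; as a reminder of the idea, the Flajolet–Martin algorithm estimates the number of distinct elements in a stream from the maximum number of trailing zeros R seen among hashed elements, giving an estimate of roughly 2^R. A minimal sketch (the hash choice and sample stream are illustrative):

```python
import hashlib

def trailing_zeros(x):
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Flajolet–Martin: estimate distinct elements as 2**R, where R is the
    largest number of trailing zeros seen among the hashed items."""
    R = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = ["a", "b", "a", "c", "d", "b", "e", "a"]     # 5 distinct items
print(fm_estimate(stream))    # rough power-of-two estimate of the distinct count
```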
Big Data Analysis – Module 5
Overview:
Support Vector Machine (SVM) is primarily known as a supervised learning algorithm for
classification and regression tasks. However, it can also be adapted for clustering tasks. SVM
clustering aims to find a hyperplane that separates data points into different clusters while
maximizing the margin between clusters.
Key Concepts:
1. Hyperplane: In SVM clustering, the hyperplane represents the decision boundary that
separates data points into different clusters. The goal is to find the hyperplane that
maximizes the margin between clusters.
2. Support Vectors: Support vectors are the data points closest to the hyperplane. They
play a crucial role in determining the position and orientation of the hyperplane.
3. Kernel Trick: SVM clustering often uses the kernel trick to map the input data into a
higher-dimensional space where it is easier to find a hyperplane that separates
clusters. Common kernels used include linear, polynomial, and radial basis function
(RBF) kernels.
Advantages of SVM Clustering:
• Effective for high-dimensional data.
• Can handle non-linearly separable data using the kernel trick.
• Robust to outliers due to the use of support vectors.
• Can work well with small to medium-sized datasets.
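SVM clustering as such is not part of scikit-learn; as a minimal sketch of the underlying machinery the notes describe (maximum-margin hyperplane, support vectors, RBF kernel), here is a supervised SVC fit on toy data, assuming scikit-learn is installed:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated groups of points; the labels stand in for "clusters".
X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# The RBF kernel maps the data into a higher-dimensional space (kernel trick)
# and the maximum-margin hyperplane is fitted there.
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X, y)

print("support vectors per class:", model.n_support_)
print("prediction for a new point:", model.predict([[0.0, 0.0]]))
```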
Overview:
Parallel SVM clustering is an extension of SVM clustering that leverages parallel computing
techniques to accelerate the training process, especially for large-scale datasets.
Key Concepts:
1. Parallel Computing: Parallel SVM clustering distributes the computation across
multiple processing units (e.g., CPU cores, GPUs, or distributed systems) to train the
SVM model concurrently on different parts of the dataset.
2. Data Partitioning: The dataset is divided into smaller partitions, and each partition is
processed independently on different processing units. This allows for parallel training
of SVM models on each partition.
3. Aggregation: After training SVM models on individual partitions, the results are
aggregated to obtain the final clustering model.
Advantages of Parallel SVM Clustering:
• Scalability: Can handle large-scale datasets by distributing computation across
multiple processing units.
• Faster Training: Parallel processing accelerates the training process, leading to
reduced training time.
• Improved Efficiency: Utilizes available computational resources more efficiently by
parallelizing training tasks.
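A toy illustration of the partition-train-aggregate idea using Python's multiprocessing and scikit-learn (both assumed available); real parallel SVM implementations are considerably more involved:

```python
from multiprocessing import Pool

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

def train_partition(args):
    """Train one SVM independently on one data partition."""
    X_part, y_part = args
    return SVC(kernel="rbf").fit(X_part, y_part)

if __name__ == "__main__":
    X, y = make_classification(n_samples=4000, n_features=10, random_state=0)

    # Data partitioning: split the dataset into 4 chunks.
    parts = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

    # Parallel training: one SVM per partition, one worker process per chunk.
    with Pool(processes=4) as pool:
        models = pool.map(train_partition, parts)

    # Aggregation: a simple majority vote over the per-partition models.
    votes = np.array([m.predict(X[:5]) for m in models])
    combined = (votes.mean(axis=0) >= 0.5).astype(int)
    print("ensemble predictions for the first 5 rows:", combined)
```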
Overview:
K-Nearest Neighbors (KNN) clustering is a simple and intuitive clustering algorithm that
assigns each data point to the cluster represented by the majority of its K nearest neighbors.
Key Concepts:
1. K Neighbors: KNN clustering determines the cluster assignment of each data point
based on the majority vote of its K nearest neighbors in the feature space.
2. Distance Metric: The choice of distance metric (e.g., Euclidean distance, Manhattan
distance, etc.) plays a crucial role in determining the nearest neighbors.
3. Hyperparameter K: The value of K is a hyperparameter that needs to be specified by
the user. It controls the level of granularity in clustering. A smaller K value leads to
more local clustering, while a larger K value leads to more global clustering.
Advantages of KNN Clustering:
• Simple and easy to understand.
• No assumptions about the underlying data distribution.
• Can handle non-linear decision boundaries.
• Can work well with both small and large datasets.
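The majority-vote scheme described above is what scikit-learn's KNeighborsClassifier implements for labeled data; a minimal sketch (here the cluster labels come from the toy dataset rather than from an unsupervised step):

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=7)

# K = 5 nearest neighbours with the default Euclidean distance metric.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# A new point is assigned to the cluster that wins the majority vote
# among its 5 nearest neighbours.
print(knn.predict([[0.0, 0.0]]))
```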
<Numerical>
Q. PCY Algorithm <link>
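The worked numerical is linked externally; as a compact reminder of the two PCY passes — hash every pair into a bucket on pass 1, then count only pairs of frequent items that hash to a frequent bucket on pass 2 — here is a small sketch with an illustrative support threshold and bucket count:

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"bread", "milk"},
]
support, n_buckets = 3, 7

def bucket_of(pair):
    return hash(frozenset(pair)) % n_buckets

# Pass 1: count single items and hash every pair into a bucket.
item_counts, bucket_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)
    for pair in combinations(sorted(basket), 2):
        bucket_counts[bucket_of(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= support}

# Pass 2: count a pair only if both items are frequent AND it falls
# into a frequent bucket (the PCY candidate condition).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        if (set(pair) <= frequent_items
                and bucket_counts[bucket_of(pair)] >= support):
            pair_counts[pair] += 1

print({p: c for p, c in pair_counts.items() if c >= support})
# {('bread', 'milk'): 3}
```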
Big Data Analysis – Module 6
• User-Centric Ranking: Search engines can leverage user data and behavior
to personalize search results based on individual preferences, search history,
location, and demographics. By incorporating user feedback and implicit
signals, such as click-through rates and dwell time, search engines can deliver
more relevant and diverse results.
5. Scale and Computation Complexity:
• Problem: PageRank requires iterative computation over a large web graph, which can
be computationally intensive and time-consuming, especially for search engines
indexing billions of web pages.
• Solution:
• Parallelization and Distributed Computing: To address scalability issues,
search engines deploy distributed computing frameworks and parallel
processing techniques to compute PageRank efficiently across multiple
servers or clusters. This enables faster processing and real-time updates of
search indices.
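To make the iterative computation concrete, here is a small pure-Python power-iteration sketch of PageRank (the damping factor 0.85, iteration count, and toy link graph are illustrative); production systems distribute essentially this loop across many machines:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            if not outlinks:                      # dangling page: spread evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```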
The "Bow Tie Structure" is a conceptual model used to describe the organization and
connectivity of the World Wide Web. Proposed by researchers at IBM in 2000, this model
visualizes the web as a bow tie, with various components representing different types of web
pages and their relationships. Let's explore the structure of the web using the bow tie analogy:
1. Core:
• Description: The core of the bow tie represents the central hub of the web, consisting
of highly interconnected and authoritative web pages. These pages typically include
major search engines, directories, and popular websites with a significant number of
incoming and outgoing links.
• Characteristics:
• High PageRank: Core pages often have high PageRank scores, indicating their
importance and authority within the web graph.
• Dense Connectivity: Pages in the core are densely interconnected, forming a
highly cohesive network.
2. Tendrils:
• Description: Tendrils extend outward from the core and represent pages that are
linked to the core but have fewer connections among themselves. These pages include
niche websites, blogs, and forums that are linked to from the core but may not have
extensive connections with other pages outside their niche.
• Characteristics:
• Sparse Connectivity: Tendril pages have relatively few links among themselves and rely on their links to and from the core for visibility.
• Lower PageRank: They typically carry lower PageRank scores than core pages because of their limited connectivity.
1. Comprehensive Linking:
• Ensure that all web pages are properly linked within the website's navigation
menus, footer, sidebar, and contextual links within content. This helps users
navigate between pages and reduces the likelihood of dead ends.
2. Site Audits:
• Conduct regular site audits to identify and rectify dead-end pages. Review
website analytics to identify pages with low engagement or high bounce rates,
which may indicate dead-end pages that need attention.
3. Internal Linking Strategy:
• Develop an internal linking strategy to interconnect related pages and content
topics within the website. Use anchor text and contextual links to guide users
to relevant content and improve navigation flow.
4. 404 Error Handling:
• Implement custom 404 error pages with helpful navigation links to redirect
users who encounter dead-end pages. Provide suggestions for alternative
content or actions to keep users engaged on the website.
2. Serendipitous Discovery:
• Collaborative filtering can uncover new and unexpected items that a user might
like, leading to serendipitous discoveries and exploration of diverse content.
3. Scalability:
• Collaborative filtering can scale to large datasets and diverse types of items,
making it suitable for various applications and domains.
Challenges of Collaborative Filtering:
1. Cold Start Problem:
• Collaborative filtering struggles to make recommendations for new users or
items with limited interaction data, leading to the cold start problem.
2. Sparsity:
• The user-item interaction matrix can be highly sparse, especially for large
datasets with many users and items. This sparsity can make it challenging to
find sufficient overlaps between users or items for accurate recommendations.
3. Popularity Bias:
• Collaborative filtering tends to recommend popular items more frequently,
leading to a bias towards mainstream or well-known items and overlooking
niche or long-tail content.
4. Data Privacy and Security:
• Collaborative filtering relies on user data for making recommendations, raising
concerns about data privacy, security, and potential misuse of personal
information.
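A toy user-based collaborative-filtering sketch on a small user-item matrix (cosine similarity between users, prediction as a similarity-weighted average); the ratings are made up for illustration:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other in range(len(ratings)):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += sim
    return num / den if den else 0.0

# Predict how user 0 would rate item 2, which they have not rated yet.
print(round(predict(0, 2), 2))
```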
1. Nodes (Vertices):
• Individuals or Entities: Each node in the graph represents an individual user, entity,
or object within the social network. For example, in a social media network, nodes can
represent users, while in a professional network like LinkedIn, nodes can represent
professionals or organizations.
• Attributes: Nodes may have attributes associated with them, such as user profiles,
demographics, interests, affiliations, or any other relevant information. These attributes
enrich the graph and provide additional context for analysis.
2. Edges (Links):
• Relationships or Interactions: Edges between nodes represent relationships or
interactions between individuals in the social network. These interactions can take
various forms depending on the nature of the social network, including friendships,
follows, connections, interactions, collaborations, or any other type of relationship.
• Directed vs. Undirected: Edges in social networks can be either directed or
undirected. In a directed graph, edges have a direction, indicating the flow or
asymmetry of the relationship (e.g., follows on Twitter). In an undirected graph, edges
have no direction, representing symmetric relationships (e.g., friendships on
Facebook).
3. Types of Social Networks:
• Friendship Networks: Social networks like Facebook, where nodes represent users,
and edges represent friendships or mutual connections between users.
• Follower Networks: Social media platforms like Twitter, where nodes represent users,
and directed edges represent the "follows" relationship between users.
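A minimal sketch of both graph types using networkx (assumed installed); the node names and edges are made up:

```python
import networkx as nx

# Directed "follows" graph: an edge u -> v means "u follows v".
follows = nx.DiGraph()
follows.add_nodes_from(["alice", "bob", "carol"], platform="twitter-like")
follows.add_edges_from([("alice", "bob"), ("carol", "bob"), ("bob", "alice")])

# Undirected friendship graph: edges are symmetric relationships.
friends = nx.Graph()
friends.add_edges_from([("alice", "bob"), ("bob", "carol")])

print("bob's followers:", list(follows.predecessors("bob")))
print("bob's friends:", list(friends.neighbors("bob")))
```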
<Numerical>
Q. Clique Percolation Method <Link>
Clique percolation is a method for identifying overlapping communities based on cliques,
which are subsets of nodes where each node is directly connected to every other node in the
subset.
Steps:
1. Find Maximal Cliques:
• Identify all maximal cliques in the network. Maximal cliques are cliques that
cannot be extended by adding another node from the graph while still
maintaining the clique property.
2. Construct Clique Graph:
• Create a new graph where each node represents a maximal clique.
• Connect nodes in the clique graph if the corresponding cliques overlap by k-1
nodes, where k is the size of the cliques.
3. Identify Communities:
• Communities in the original graph correspond to connected components in the
clique graph.
• Each connected component represents a community, and nodes belonging to
the same component (connected by edges) form the community.
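A small sketch of these three steps with k = 3, using networkx (assumed available); the example graph is made up:

```python
from itertools import combinations

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),      # triangle {1, 2, 3}
    (3, 4), (2, 4),              # triangle {2, 3, 4}, overlapping it in 2 nodes
    (5, 6), (5, 7), (6, 7),      # separate triangle {5, 6, 7}
])
k = 3

# Step 1: all maximal cliques of size >= k.
cliques = [frozenset(c) for c in nx.find_cliques(G) if len(c) >= k]

# Step 2: clique graph -- connect cliques that share at least k-1 nodes.
clique_graph = nx.Graph()
clique_graph.add_nodes_from(cliques)
for c1, c2 in combinations(cliques, 2):
    if len(c1 & c2) >= k - 1:
        clique_graph.add_edge(c1, c2)

# Step 3: each connected component of the clique graph is one community.
communities = [frozenset().union(*component)
               for component in nx.connected_components(clique_graph)]
print(communities)   # e.g. [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7})]
```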