
SETHU INSTITUTE OF TECHNOLOGY, KARIAPATTI

(An Autonomous Institution Affiliated to Anna University, Chennai)


Regulation – 2021
21UAD404 – Data Warehousing and Data Mining
(Answer Key)
1. Compare a data warehouse with a database. How are they similar?
Both a data warehouse and a database store structured data, allow querying, prioritize data
integrity, require management tasks, support scalability, and integrate with other systems.
However, databases primarily handle transactional data, with performance optimized for
real-time operations, while data warehouses focus on historical and analytical data for
complex querying and reporting.
2. Apply the Galaxy schema to a college library and draw the corresponding schema.
The Galaxy (fact constellation) schema is a dimensional modeling technique in which
multiple fact tables share common dimension tables. For a college library, it could include
dimensions such as time, book, author, student, and location, shared by fact tables such as
book borrowings and, for example, book returns or fines.

3. Differentiate fact and dimension table.


Aspect | Fact Table | Dimension Table
Content | Contains quantitative data or measures | Contains descriptive attributes
Granularity | Lowest level (e.g., transactional level) | Higher level (e.g., categories)
Size | Often large | Relatively smaller
Examples | Sales revenue, quantity sold, temperature | Time, product, customer, location, employee
Relationship | Foreign keys referencing dimension tables | No foreign keys; used for context and filtering

4. What is Snowflake Schema?


The snowflake schema is a type of dimensional modeling technique used in data
warehousing. It extends the star schema by normalizing dimension tables to eliminate
redundancy and improve data integrity. In a snowflake schema, dimension tables are
organized into multiple related tables, resembling a snowflake shape when viewed in a
diagram, hence the name.
5. State why concept hierarchies are useful in data mining.
Concept hierarchies are useful in data mining because they allow data to be generalized and
analyzed at multiple levels of abstraction (for example, day -> month -> year, or city -> state ->
country). This supports summarization, roll-up and drill-down analysis, classification and
prediction at different levels, and the incorporation of domain knowledge into the analysis.
6. What are the essential steps in the process of knowledge discovery in databases (KDD)?
The essential steps in the process of Knowledge Discovery in Databases (KDD) are data
cleaning, data integration, data selection, data transformation, data mining, pattern
evaluation, and knowledge presentation.
Data mining is the step in which algorithms are applied to discover meaningful patterns,
associations, or models in the data; pattern evaluation then identifies the truly interesting
patterns among them.
7. What is meant by Lazy Learner?
A "Lazy Learner" is a term often used to describe a machine learning algorithm that defers
the majority of its computation or learning process until it is presented with a query or
request for prediction. Instead of actively building a model based on a training dataset
upfront, a lazy learner postpones this process until it receives a specific task, at which
point it analyzes the training data relevant to that task and makes predictions accordingly.
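A minimal illustrative sketch of a lazy learner, assuming scikit-learn is available (k-nearest neighbours is a typical example; the tiny dataset below is invented for illustration):

# Illustrative sketch of a lazy learner: k-nearest neighbours (assumes scikit-learn).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[25], [30], [45], [50]]          # e.g. customer ages (made-up values)
y_train = ["young", "young", "old", "old"]  # made-up class labels

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # "training" only stores the data points

print(knn.predict([[28]]))       # the real work (distance computation) happens at query time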
8. What is MBA in data mining (Candidate generation technique)?
Market Basket Analysis (MBA) is a data mining technique used to identify patterns and
relationships in large datasets, particularly in the context of retail and sales transactions.
The primary goal is to understand which items are frequently purchased together. This
information can be utilized for various business strategies, including inventory
management, marketing, and sales promotions.
9. State the difference between Classification and Clustering?
Aspect | Classification | Clustering
Definition | Supervised learning technique for assigning labels to new observations based on labeled training data | Unsupervised learning technique for grouping similar data points into clusters without predefined labels
Supervision | Supervised (requires labeled data) | Unsupervised (does not require labeled data)
Output | Discrete labels for new data points | Groups (clusters) of similar data points
Algorithms | Decision Trees, SVM, Naive Bayes, k-NN, Neural Networks | k-Means, Hierarchical Clustering, DBSCAN, GMM
Example Use Cases | Email filtering, image recognition, medical diagnosis, credit scoring | Market segmentation, image segmentation, anomaly detection, document clustering

10. Compare CLARA and CLARANS?


Algorithm | Approach | Suitable for Large Datasets? | Key Feature
CLARA | Representative-based (k-medoid) clustering applied to random samples | Yes | Combines results from multiple samples (subsets)
CLARANS | Randomized search over candidate medoid sets | Yes | More flexible search; handles noise and outliers

PART B
11. (a) Suppose that a data warehouse consists of the three dimensions time, doctor, and
patient, and the two measures count and charge, where charge is the fee that a doctor
charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data
warehouses. 8 Marks
(b) Draw a schema diagram for the above data warehouse using one of the schema classes
listed in (a). 8 Marks
Answer:
(a) Three Classes of Schemas for Data Warehouses (8 Marks):
1. Star Schema
2. Snowflake Schema (4 marks) *diagram or explanation
3. Galaxy Schema (4 marks) *diagram or explanation
(b) Schema Diagram using Star Schema (8 Marks):
 A well-defined Star Schema diagram depicting the relationships between dimension tables
(Time, Doctor, Patient) and the central Fact Table. The diagram should include:
(4 marks) *diagram or explanation
o Primary keys (PK) for each dimension table.
o Foreign keys (FK) in the Fact Table referencing the dimension tables.
o Measures (count and charge) stored in the Fact Table.
 Clear and concise explanation of the schema (4 marks): *diagram or explanation
o Description of the dimension tables and their attributes.
o Explanation of the Fact Table and its role in storing measures.
o How the relationships between tables enable data analysis.

(OR)
11. (b) Construct a data warehouse for a University / Hospital / Enterprise using
Galaxy schemas with necessary description.
Galaxy schema for University Data Warehouse: (4 marks) *diagram or explanation
1. Fact Table: Enrollment
o Columns: StudentID, CourseID, SemesterID, Grade
o Description: Stores information about student enrollments in courses,
including the student ID, course ID, semester ID, and grade achieved.
2. Dimension Table: Student
o Columns: StudentID, Name, Gender, DateOfBirth, Major
o Description: Stores details about students, including their ID, name, gender,
date of birth, and major.
3. Dimension Table: Course
o Columns: CourseID, CourseName, Department, Credits
o Description: Contains information about courses offered by the university,
including course ID, name, department, and credits.
4. Dimension Table: Semester
o Columns: SemesterID, Year, Term
o Description: Holds information about academic semesters, including the
semester ID, year, and term (e.g., Fall, Spring).
5. Dimension Table: Faculty
o Columns: FacultyID, Name, Department
o Description: Stores details about faculty members, including their ID, name,
and department.
The relationships between the tables can be represented as follows:
Enrollment (Fact Table): (4 marks) *diagram or explanation
 StudentID (Foreign Key) -> Student (Dimension Table)
 CourseID (Foreign Key) -> Course (Dimension Table)
 SemesterID (Foreign Key) -> Semester (Dimension Table)
Student (Dimension Table): (4 marks) *diagram or explanation
 Major (Foreign Key) -> Department (Dimension Table)
Course (Dimension Table):
 Department (Foreign Key) -> Department (Dimension Table)
Faculty (Dimension Table): (4 marks) *diagram or explanation
 Department (Foreign Key) -> Department (Dimension Table)
(The Department references above assume a shared Department dimension table with attributes
such as DepartmentID and DepartmentName. In a full Galaxy, i.e. fact constellation, schema a
second fact table, for example a Teaching fact keyed by FacultyID, CourseID, and SemesterID,
would share these dimension tables with the Enrollment fact.)
This Galaxy schema allows for analyzing student enrollments, grades, student details,
course information, semester data, and faculty information in a comprehensive manner
within the University's data warehouse.
Note: The schemas for a Hospital or Enterprise data warehouse would have different sets of
tables and relationships based on their specific requirements and data structures.
12 (a). Explain in detail the OLAP operations in the multi-dimensional data model.
The OLAP operations in a multi-dimensional data model are: (4 marks) *diagram or explanation
1. Slice: The slice operation selects a sub-cube by fixing a specific value or range of values along
one or more dimensions. It reduces the data to a subset based on specific criteria, focusing on a
particular "slice" of the multi-dimensional cube. For example, slicing by the time dimension to
retrieve data for a specific month or year. (Answer: "The slice operation selects a sub-cube by
fixing specific values along one or more dimensions.")
2. Dice: The dice operation selects a sub-cube by specifying conditions on multiple dimensions
simultaneously. It enables the extraction of data that satisfies multiple criteria across different
dimensions. For example, dicing by the time and product dimensions to retrieve data for a
specific month and a particular product category. (Answer: "The dice operation selects a sub-
cube by specifying conditions on multiple dimensions simultaneously.") (4 marks) *diagram
or explanation
3. Drill-Down: The drill-down operation allows for navigating from a higher level of aggregation
to a lower level of detail. It involves expanding a dimension to view more specific levels of
data. For example, drilling down from the year level to the month level or from a region level
to the city level. (Answer: "The drill-down operation allows for navigating from a higher level
of aggregation to a lower level of detail.") (4 marks) *diagram or explanation
4. Roll-Up: The roll-up operation is the opposite of drill-down. It involves summarizing data from
a lower level of detail to a higher level of aggregation. It aggregates data along one or more
dimensions, collapsing the data to a higher level of abstraction. For example, rolling up from
the month level to the quarter level or from a product category level to the product group level.
(Answer: "The roll-up operation involves summarizing data from a lower level of detail to a
higher level of aggregation.") (4 marks) *diagram or explanation
5. Pivot: The pivot operation rotates the multi-dimensional cube to view the data from different
perspectives. It allows for the reorientation of dimensions and measures to provide alternative
views of the data. For example, pivoting the rows and columns to view sales data by different
product categories and regions. (Answer: "The pivot operation allows for the reorientation of
dimensions and measures to provide alternative views of the data.")
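As an illustration of these five operations, a minimal sketch using pandas (assumed available); the sales figures below are invented for illustration only:

# Illustrative OLAP-style operations on a small, invented sales cube (assumes pandas).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "product": ["A", "B", "A", "A", "B", "B"],
    "revenue": [100, 150, 120, 130, 160, 170],
})

slice_2023 = sales[sales["year"] == 2023]                              # slice: fix one dimension
dice = sales[(sales["year"] == 2023) & (sales["region"] == "North")]   # dice: conditions on several dimensions
roll_up = sales.groupby("year")["revenue"].sum()                       # roll-up: quarter -> year
drill_down = sales.groupby(["year", "quarter"])["revenue"].sum()       # drill-down: year -> quarter
pivot = sales.pivot_table(index="product", columns="region",
                          values="revenue", aggfunc="sum")             # pivot: rotate the view
print(pivot)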
12 (b).
i) Differentiate Star schema vs Snowflake schema vs Galaxy schema (10)
ii) With relevant examples discuss the different schema operations. (6)
Answer:
i)
Schema | Structure (4 marks) | Normalization (4 marks) | Joins Required | Query Performance (2 marks) | Complexity
Star Schema | Fact table connected directly to multiple dimension tables | Denormalized | Minimal | Fast | Simple
Snowflake Schema | Fact table connected to multiple dimension tables, which are further normalized into sub-tables | Normalized | Increased | Potential impact on performance | Increased
Galaxy Schema | Multiple fact tables sharing common dimension tables | Flexible | Depends on the specific schema design | Depends on the specific schema design | Moderate
ii) different schema operations:
1. Star Schema Operations: (4 marks) *diagram or explanation
a. Slice: Selecting sales data for a specific month or product category from a star schema data
warehouse.
b. Drill-Down: Analyzing sales data at a city or store level instead of just at the region level.
c. Roll-Up: Aggregating monthly sales data to quarterly or annual totals for a broader view of
performance.
d. Pivot: Viewing sales data by different dimensions, such as analyzing sales by product category and
region simultaneously.
2. Snowflake Schema Operations:
Snowflake schema operations are similar to star schema operations, but with additional joins due to the
normalization of dimension tables. The examples provided for star schema operations can also apply to
snowflake schema with the added complexity of joins.
3. Galaxy Schema Operations: (2 marks) *diagram or explanation
Galaxy schema operations are like star schema operations. The examples provided for star schema
operations can be extended to galaxy schema for analyzing data from multiple dimensions and their
relationships. For example, slicing enrollment data for a specific semester, drill-down to analyze
enrollment by department or course level, roll-up to analyze overall student enrollment, and pivot to
view enrollment data from different perspectives like student gender or faculty department.
13 a)
i) Explain in detail the data mining functionalities. (8)
ii) Explain in detail the interestingness of patterns in data mining. (8)
Answer Key
(i) Data Mining Functionalities (8 Marks):
Data mining functionalities refer to the various techniques used to extract knowledge and patterns from
large datasets. These functionalities can be broadly categorized into two main types:
1. Descriptive Mining: (4 Marks) *diagram or explanation
This type of mining focuses on summarizing and describing the general characteristics of the
data. It helps understand the current state of the data and identify trends or relationships. Here
are some functionalities within descriptive mining:
o Data Characterization: Summarizes the general features of a data set, such as
measures of central tendency (mean, median) and dispersion (standard deviation).
o Concept Description: Identifies and describes groups or categories within the data.
This could involve finding customer segments or product categories.
o Mining Frequent Patterns: Discovers patterns that occur frequently in the data. This
could involve finding commonly purchased product combinations or identifying
frequent customer demographics.
2. Predictive Mining: (4 Marks) *diagram or explanation
This type of mining aims to predict future trends or behaviors based on historical data. These
functionalities leverage patterns found in the data to make forecasts or classifications. Here are
some functionalities within predictive mining:
o Classification: Builds models to classify new data points into predefined categories.
For example, classifying a customer as high-risk or low-risk based on credit history.
o Regression: Identifies relationships between variables and predicts the value of a
dependent variable based on independent variables. For example, predicting future sales
based on historical sales data and marketing campaigns.
o Clustering: Groups data points into clusters based on similarities. This helps identify
customer segments or product categories with similar characteristics.
(ii) Interestingness of Patterns in Data Mining (8 Marks):
Not all discovered patterns in data mining are equally useful. Interestingness measures are used to
evaluate the quality and relevance of patterns for specific business goals. Here are some key aspects of
interestingness: (4 Marks) *diagram or explanation
 Actionability: Does the pattern provide insights that can be used to make informed decisions
or take action?
 Novelty: Is the pattern unexpected or surprising? Does it reveal new knowledge not previously
known?
 Understandability: (4 Marks) *diagram or explanation
Can the pattern be easily interpreted and understood by domain experts?
 Accuracy: In the case of predictive models, how accurate are the predictions made based on
the pattern? Metrics like precision, recall, and F1-score can be used.
 Significance: Is the pattern statistically significant? Does it represent a true trend or simply
random noise? Statistical tests can be used to assess significance.
Data mining algorithms often employ various interestingness measures to prioritize patterns and guide
users towards the most valuable insights. These measures are typically a combination of the factors
mentioned above, weighted based on the specific goals of the data mining task.
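As a small illustration of the accuracy aspect mentioned above, precision, recall, and F1-score can be computed from a model's predictions; a minimal sketch assuming scikit-learn, with made-up labels:

# Illustrative computation of precision, recall and F1-score (labels are made up).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classes predicted by some model

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1-score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall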
(OR)
13 b) Describe classification in data mining systems and explain the various types of
classification algorithms.
Answer Key: Issues in Data Mining and Data Preprocessing (16 Marks)
(a) Issues in Data Mining (4 Marks): *diagram or explanation
Data mining, while powerful for extracting knowledge, faces several challenges:
 Data Quality: A significant issue is the quality of the raw data. Data may be incomplete
(missing values), inconsistent (errors or inconsistencies), or irrelevant (data not relevant to the
task). This can lead to inaccurate or misleading patterns.
 Data Complexity: Modern datasets are often large, complex, and heterogeneous. This
complexity can make processing and analysis time-consuming and computationally expensive.
Choosing the right algorithms and techniques for such data can be challenging.
 Scalability: Data mining algorithms need to scale effectively to handle ever-growing datasets.
Traditional algorithms might struggle with massive data volumes, requiring specialized
techniques or distributed processing.
 Privacy Concerns: (4 Marks): *diagram or explanation
 Data mining may raise privacy concerns as it involves accessing and analyzing potentially
sensitive information. Balancing data exploration with user privacy is crucial.
 Interpretability: Data mining models can be complex and difficult to understand. Explaining
the reasoning behind a model's predictions is crucial for gaining user trust and ensuring reliable
insights.
 Knowledge Discovery Process: The knowledge discovery process itself is iterative and
requires expertise. Choosing the right algorithms, data preparation techniques, and interpreting
the results effectively requires domain knowledge and experience in data mining.
(b) Issues in Data Preprocessing (4 Marks): *diagram or explanation
Data preprocessing is a crucial step in data mining, but it also comes with its own set of challenges:
 Time-consuming: Data preprocessing can be a time-consuming and resource-intensive task,
especially for large datasets. Automating or optimizing data cleaning and transformation steps
is essential.
 Data Loss: Cleaning data often involves removing incomplete or noisy data points. This can
lead to information loss, and it's essential to find a balance between cleaning and retaining
valuable data.
 Choosing Techniques: Selecting the right data preprocessing techniques depends on the
specific data and the intended analysis. Different data types require different cleaning methods.
(4 Marks) *diagram or explanation
 Bias Introduction: Incorrect or biased cleaning methods can introduce bias into the data,
impacting the results of the data mining process. Careful evaluation of chosen methods is
crucial.
 Cost-effectiveness: Data preprocessing may require specialized tools and expertise. Finding
cost-effective ways to prepare data while ensuring quality is important.
Note: You can award marks based on the clarity, comprehensiveness, and explanation provided for
each issue in both sections (a) and (b).

14 a) Apply the Apriori algorithm for discovering frequent itemsets and mining association
rules from the following table. Use a minimum support of 3 and a minimum confidence of 50%.
Illustrate each step of the Apriori algorithm.
Trans ID | Items Purchased
101 | milk, bread, eggs
102 | milk, juice
103 | juice, butter
104 | milk, bread, eggs
105 | coffee, eggs
106 | coffee
107 | coffee, juice
108 | milk, bread, cookies, eggs
109 | cookies, butter
110 | milk, bread

Answer Key: Applying Apriori Algorithm (16 Marks)


Minimum Support (min_sup): 3 (given)
Minimum Confidence (min_conf): 50% (given)
Apriori Algorithm Steps:
We will illustrate each step of the Apriori algorithm for the given transaction data.
Step 1: Find Frequent Itemsets (L-sets) (4 Marks): *table or diagram explanation
 L1 (Frequent Single Items):
o Count the occurrences of each item:
 milk: 5
 bread: 4
 eggs: 4
 juice: 3
 coffee: 3
 cookies: 2
 butter: 2
o Items with support >= min_sup (3) are frequent; cookies and butter are pruned.
 Frequent items: milk, bread, eggs, juice, coffee
o L1 = {milk, bread, eggs, juice, coffee}
 L2 (Frequent Pairs): (4 Marks): *table or diagram explanation
o Join frequent items from L1 to generate candidate pairs and count their co-occurrence in the
transactions.
o Candidate pair supports: {milk, bread}: 4, {milk, eggs}: 3, {bread, eggs}: 3, {milk, juice}: 1,
{milk, coffee}: 0, {bread, juice}: 0, {bread, coffee}: 0, {eggs, juice}: 0, {eggs, coffee}: 1,
{juice, coffee}: 1
o Prune infrequent pairs (support < min_sup).
 Frequent pairs: {milk, bread} (support: 4), {milk, eggs} (support: 3), {bread, eggs} (support: 3)
o L2 = {{milk, bread}, {milk, eggs}, {bread, eggs}}
 L3 (Frequent Triplets):
o The only candidate generated from L2 is {milk, bread, eggs}; it occurs in transactions 101, 104,
and 108, so its support is 3 >= min_sup.
o L3 = {{milk, bread, eggs}}
Step 2: Generate Association Rules
From the frequent itemsets (L2 and L3), generate association rules based on the formula:
Confidence (X => Y) = (Support(X U Y) / Support(X)) * 100%
where X is the antecedent (LHS) and Y is the consequent (RHS) of the rule.
 Rules from L1: (Not applicable)
o A rule needs at least one item on each side, so no rules are generated from single-item sets.
 Rules from L2: (4 Marks): *table or diagram explanation
o milk => bread: 4 / 5 = 80%; bread => milk: 4 / 4 = 100%
o milk => eggs: 3 / 5 = 60%; eggs => milk: 3 / 4 = 75%
o bread => eggs: 3 / 4 = 75%; eggs => bread: 3 / 4 = 75%
 Rules from L3:
o {milk, bread} => eggs: Confidence = Support({milk, bread, eggs}) / Support({milk, bread})
= 3 / 4 = 75%
o {milk, eggs} => bread: Confidence = 3 / 3 = 100%
o {bread, eggs} => milk: Confidence = 3 / 3 = 100%
o milk => {bread, eggs}: 3 / 5 = 60%; bread => {milk, eggs}: 3 / 4 = 75%; eggs => {milk, bread}: 3 / 4 = 75%
All of these rules meet the minimum confidence threshold of 50%, so all are retained.
Answer: (4 Marks): *table or diagram explanation
 Frequent itemsets: L1 = {milk, bread, eggs, juice, coffee}, L2 = {{milk, bread}, {milk, eggs},
{bread, eggs}}, L3 = {{milk, bread, eggs}}
 Association rules (all with confidence >= 50%), for example:
o {milk, bread} => eggs (confidence: 75%)
o {milk, eggs} => bread (confidence: 100%)
o {bread, eggs} => milk (confidence: 100%)
Note: You can award marks based on the clarity of each step explanation, including calculations and
justifications for pruning infrequent itemsets or rules.
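A minimal plain-Python sketch of the counting above (not an optimised Apriori implementation; it simply enumerates candidate itemsets over the ten transactions from the question and then derives the rules):

# Illustrative support/confidence counting for the transactions in the question.
from itertools import combinations

transactions = [
    {"milk", "bread", "eggs"}, {"milk", "juice"}, {"juice", "butter"},
    {"milk", "bread", "eggs"}, {"coffee", "eggs"}, {"coffee"},
    {"coffee", "juice"}, {"milk", "bread", "cookies", "eggs"},
    {"cookies", "butter"}, {"milk", "bread"},
]
min_sup, min_conf = 3, 0.5

def support(itemset):
    # Number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
for k in (1, 2, 3):                       # L1, L2, L3
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_sup:
            frequent[frozenset(cand)] = s

for itemset, s in sorted(frequent.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), "support =", s)

# Association rules from frequent itemsets of size >= 2
for itemset in (i for i in frequent if len(i) > 1):
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            conf = frequent[itemset] / support(set(lhs))
            if conf >= min_conf:
                print(sorted(lhs), "=>", sorted(itemset - set(lhs)),
                      "confidence = {:.0%}".format(conf))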

(OR)
14 b) Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether or not to play on a particular day
according to the weather conditions.
Problem: If the weather is sunny, should the player play or not?
    Outlook   Temperature   Humidity   Windy   Play Golf
1 Rainy Hot High False No
2 Rainy Hot High True No
3 Overcast Hot High False Yes
4 Sunny Mild High False Yes
5 Sunny Cool Normal False Yes
6 Sunny Cool Normal True No
7 Overcast Cool Normal True Yes
8 Rainy Mild High False No
9 Rainy Cool Normal False Yes
10 Sunny Mild Normal False Yes
11 Rainy Mild Normal True Yes
12 Overcast Mild High True Yes
13 Overcast Hot Normal False Yes
14 Sunny Mild High True No
Answer Key: Play Golf Based on Sunny Weather (8 Marks)
Problem: Based on the provided dataset, can we determine if someone should play golf on a sunny
day?
Analysis: (4 Marks): *table or diagram explanation
Looking only at the sunny days (rows 4, 5, 6, 10, and 14), we see mixed results for the "Play
Golf" variable: three "Yes" (rows 4, 5, and 10) and two "No" (rows 6 and 14).
Limitations: (4 Marks): *table or diagram explanation
 This dataset is very small (14 entries) and may not capture the full range of weather conditions
that can affect playing golf.
 Other factors beyond sunshine, such as temperature, humidity, and wind, also significantly
influence the decision to play golf.
Conclusion: (4 Marks): *table or diagram explanation
Based solely on sunshine, the data provides inconclusive evidence. We cannot definitively say
whether someone should play golf on a sunny day based on this limited dataset.
Additional Considerations (Bonus Points): (4 Marks): *table or diagram explanation
 Decision Tree Learning: This dataset could be used to train a decision tree learning algorithm,
considering all weather attributes (Outlook, Temperature, Humidity, Windy) to predict "Play
Golf" for future scenarios.
 Feature Importance: The decision tree could reveal which weather attributes are most
important in influencing the decision to play golf.
 More Data: A larger dataset capturing a wider range of weather conditions would be necessary
to draw more reliable conclusions about playing golf on sunny days.
Note: You can award marks based on the clarity of the analysis, recognizing limitations of the data,
and suggesting potential solutions using machine learning techniques.
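A minimal sketch of the decision-tree idea mentioned above, assuming scikit-learn and pandas are available; the categorical attributes are one-hot encoded, and the query row matches the sunny case asked about in the question:

# Illustrative decision-tree sketch on the weather dataset from the question.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Outlook":     ["Rainy", "Rainy", "Overcast", "Sunny", "Sunny", "Sunny", "Overcast",
                    "Rainy", "Rainy", "Sunny", "Rainy", "Overcast", "Overcast", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool", "Mild", "Cool",
                    "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal", "High",
                    "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Windy":       [False, True, False, False, False, True, True, False, False, False,
                    True, True, False, True],
    "Play":        ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes",
                    "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data.drop(columns="Play"))   # one-hot encode the categorical attributes
y = data["Play"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Query: a sunny, mild, high-humidity, calm day
query = pd.DataFrame([{"Outlook": "Sunny", "Temperature": "Mild",
                       "Humidity": "High", "Windy": False}])
query = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(tree.predict(query))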

15 a) Let's say we want to cluster a group of 20 individuals between the ages of 20 and 40. We
have collected data on their ages, which are as follows:
25, 22, 28, 36, 32, 23, 27, 30, 31, 29, 33, 24, 26, 34, 37, 38, 21, 35, 39, 40
Our goal is to divide these individuals into two clusters based on their age using the k-means
algorithm.
Answer Key: Clustering Individuals by Age using K-Means (8 Marks)
Scenario:
We want to cluster 20 individuals (ages 20-40) into two groups (k=2) using the k-means algorithm
based on their age.
K-Means Algorithm Steps: (4 Marks): *table or diagram explanation
1. Define k (number of clusters): k = 2 (two clusters)
2. Initialize centroids: Choose two initial centroids (cluster centers). Since we don't have prior
knowledge, a common approach is to randomly select two data points from the dataset.
 Example: We randomly pick ages 28 and 35 as initial centroids.
*Note: the clusters may differ accordingly
(4 Marks): *table or diagram explanation
3. Assign data points to clusters: Calculate the distance (e.g., Euclidean distance) between each
individual's age and both centroids. Assign each data point to the cluster with the closest
centroid.
4. Recompute centroids: Calculate the average age of all individuals within each cluster. These
new averages become the updated centroids.
5. Repeat steps 3 and 4: Re-assign data points based on the updated centroids and recalculate
new centroids again. (4 Marks): *table or diagram explanation
6. Convergence: Continue iterating steps 3 and 4 until the centroids no longer change
significantly (convergence is achieved). This indicates stable clusters.
Result:
After iterating the k-means algorithm, you will obtain two clusters containing individuals with ages
closer to their respective cluster centroids.
Note: Since the initial centroids are chosen randomly, the specific cluster assignments for each
individual might vary slightly across different runs of the k-means algorithm. However, the overall
clustering pattern (dividing individuals into two age groups) should remain consistent.
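A minimal sketch of these steps using scikit-learn (assumed available), applied to the 20 ages from the question with k = 2; as noted above, the exact labels may vary between runs:

# Illustrative k-means run on the 20 ages from the question (assumes numpy and scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([25, 22, 28, 36, 32, 23, 27, 30, 31, 29,
                 33, 24, 26, 34, 37, 38, 21, 35, 39, 40]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)

print("cluster labels:   ", kmeans.labels_)                  # which cluster each age falls into
print("cluster centroids:", kmeans.cluster_centers_.ravel()) # average age of each cluster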
Additional Considerations (Bonus Points):
 Evaluation Metrics: You can evaluate the quality of the clusters using metrics like Silhouette
Coefficient.
 K-means Limitations: K-means assumes spherical clusters. If the age distribution is not
evenly spread, the clusters might not be perfectly divided.
 Alternative Clustering: Hierarchical clustering could be explored for a more exploratory
approach to identify potential cluster structures in the data.
Key Points: (4 Marks): *table or diagram explanation
 The answer focuses on applying the k-means algorithm steps with k=2 for the given scenario.
 It acknowledges the random initialization and potential variations in cluster assignments across
different runs.
(OR)
15 b)
i) Select the suitable example to compare and analyze the systematic way of implementing
agglomerative and Divisive hierarchical clustering. (10)
ii) Compare and contrast the CLARA and CLARANS. (6)
Answer Key: Hierarchical Clustering & Clustering Algorithms (16 Marks)
(i) Comparing and Analyzing Agglomerative vs. Divisive Hierarchical Clustering (10 Marks):
Suitable Example: Customer segmentation in a retail store based on purchase history.
Agglomerative Hierarchical Clustering: (4 Marks): *table or diagram explanation
 Start: Treat each customer as a separate cluster (one cluster per customer).
 Merging: In each step, merge the two most similar customer clusters based on a similarity
measure (e.g., total purchase amount similarity).
 Continue merging: Repeat the merging process until a desired number of clusters (e.g., 2
clusters) is reached.
 Analysis: This approach helps identify groups of customers with similar purchase patterns,
allowing for targeted marketing campaigns.
Divisive Hierarchical Clustering: (4 Marks): *table or diagram explanation
 Start: Consider all customers as a single large cluster.
 Splitting: In each step, recursively divide the current cluster into two sub-clusters based on a
dissimilarity measure (e.g., difference in average purchase amount).
 Continue splitting: Repeat the splitting process until a desired number of clusters (e.g., 2
clusters) is reached or a stopping criterion is met (e.g., minimum cluster size).
 Analysis: This approach helps identify distinct customer segments with very different
purchasing behaviors.
Comparison and Analysis: (2 Marks): *table or diagram explanation
 Bottom-up vs. Top-down: Agglomerative builds clusters from individual data points, while
Divisive splits a single cluster into sub-clusters.
 Merging vs. Splitting: Agglomerative focuses on merging similar data points, while Divisive
focuses on separating dissimilar data points.
 Flexibility: Agglomerative offers more flexibility in choosing the merging criteria and order,
potentially leading to more nuanced clusters.
 Interpretability: The dendrogram generated by agglomerative clustering provides a visual
representation of the merging process, aiding in understanding cluster formation.
Choosing the Right Approach:
 Agglomerative is well-suited for identifying natural groupings where data points have varying
degrees of similarity.
 Divisive is better suited for finding well-separated clusters or identifying outliers that might be
hidden in agglomerative clustering.
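A minimal sketch of the agglomerative (bottom-up) approach described above, assuming scikit-learn; the customer purchase figures are invented for illustration, and note that scikit-learn does not ship an off-the-shelf divisive routine:

# Illustrative agglomerative clustering of invented customer purchase data (assumes scikit-learn, numpy).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Each row: [total purchase amount, number of visits] for a hypothetical customer
customers = np.array([[120, 3], [150, 4], [130, 3], [900, 20],
                      [950, 22], [880, 19], [500, 10], [520, 11]])

agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(customers)
print("cluster labels:", agg.labels_)   # customers grouped into 2 segments, merged bottom-up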
(ii) Comparing and Contrasting CLARA and CLARANS (6 Marks):
CLARA (Clustering LARge Applications) and CLARANS (Clustering Large Applications based
upon RANdomized Search) are k-medoid partitioning algorithms designed to handle large
datasets efficiently.
Similarities: (2 Marks): *table or diagram explanation
 Both are partitioning algorithms that divide the data points into a pre-defined number of
clusters (k), each represented by a medoid.
 Both use sampling or randomized search to reduce computational cost while maintaining
acceptable cluster quality.
 They can handle large datasets for which the basic PAM (k-medoids) algorithm would be too
expensive.
Differences: (2 Marks): *table or diagram explanation
 Search Strategy: CLARA applies PAM to several random samples of the data and keeps the
best set of medoids found; its quality depends on how representative the samples are.
CLARANS searches over the full dataset, treating each possible set of k medoids as a node in a
graph and examining only a randomly chosen subset of neighbouring solutions at each step.
 Robustness: Because CLARANS is not confined to a fixed sample, it is more flexible and
generally more robust to noise and outliers than CLARA, at the cost of extra computation.
Choosing the Right Approach: (2 Marks): *table or diagram explanation
 Use CLARA for large datasets with minimal noise to improve efficiency compared to
traditional partitioning algorithms like k-means.
 Use CLARANS for large datasets with potential noise to ensure the clustering process is less
sensitive to outliers and improves cluster quality.
