Data Mining
Explain with suitable examples overfitting and underfitting for a regression machine
learning model. (10)
=>
Overfitting:
● Overfitting occurs when a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new data.
● This means that the noise or random fluctuations in the training data are picked up and
learned as concepts by the model.
● The problem is that these concepts do not apply to new data and negatively impact the
model’s ability to generalize.
Example of Overfitting:
● Suppose we have a dataset of house prices where the features include the number of
rooms, the total area, the age of the house, and the postal code.
● An overfitted model might learn that a specific postal code (e.g., 90210) has extremely
high house prices, and therefore predicts high prices for all houses with this postal code.
● However, this model will perform poorly when predicting prices for new data,
especially for houses in postal code 90210 that are not as expensive.
Underfitting:
● Underfitting occurs when a model cannot adequately capture the underlying structure of
the data.
● An underfitted model produces erroneous or unreliable outcomes on new data, and
also performs poorly on the training data.
Example of Underfitting:
● Using the same dataset of house prices, an underfitted model might only consider the
number of rooms when predicting the house price.
● This model will perform poorly because it fails to consider other important features
such as the total area and the age of the house.
● The model will not generalize well to new data because it has not adequately learned
the underlying relationship between all the features and the house price.
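A minimal sketch of both failure modes on synthetic data (an illustration assuming NumPy; the quadratic ground truth and the polynomial degrees are arbitrary choices, not from the original notes). A degree-1 fit underfits the curved trend, while a degree-9 fit chases the noise and does worse on held-out points:

```python
import numpy as np

# Synthetic regression data: a quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 0.5 * x**2 - 3 * x + rng.normal(scale=4.0, size=x.size)

x_train, y_train = x[::2], y[::2]   # every other point for training
x_test, y_test = x[1::2], y[1::2]   # held-out points for testing

for degree in (1, 2, 9):  # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

Expect the degree-1 model to show high error on both sets (underfitting), and the degree-9 model to show a low training error but a noticeably higher test error (overfitting).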
3. Differentiate between soft and hard clustering. Give a suitable example of each and
mention one machine learning technique. (10)
=>
Hard Clustering:
● In hard clustering, each data point either belongs to a cluster completely or not.
● There is no concept of partial membership in hard clustering. A data point belongs to
exactly one cluster.
● It is also known as strict partitioning.
● Example: segmenting customers so that each customer is placed in exactly one
segment. K-means is a typical hard clustering technique.
Soft Clustering:
● In soft clustering, instead of assigning each data point to exactly one cluster, a
probability or likelihood of that data point belonging to each cluster is assigned.
● It allows data points to belong to multiple clusters with different degrees of
membership. This degree of membership is often based on the probability of the data
point being generated from each cluster’s (usually Gaussian) distribution.
● Example: modeling customers who partly match several segments, with a membership
probability for each. Gaussian Mixture Models (and fuzzy c-means) are typical soft
clustering techniques.
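A minimal sketch of the contrast (assuming scikit-learn is installed; the toy data is illustrative). KMeans assigns each point exactly one label, while GaussianMixture returns a membership probability per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 1-D data with two loose groups and one in-between point
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [3.0]])

# Hard clustering: each point gets exactly one cluster label
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard labels:", hard_labels)

# Soft clustering: each point gets a probability for every cluster
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft memberships:\n", gm.predict_proba(X).round(2))
```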
2. Logistic Regression:
● Despite its name, logistic regression is used to fit a regression model when the response
variable is binary.
● Logistic regression uses the concept of odds ratios, which is the odds of an event
happening.
● For example, we could use logistic regression to model the probability of a student
passing an exam based on the number of hours they study.
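A minimal sketch of the exam example (assuming scikit-learn; the hours and outcomes are made-up toy data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hours studied vs. exam outcome (1 = pass, 0 = fail); toy data
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# Predicted probability of passing after 2.2 hours of study
print(model.predict_proba([[2.2]])[0, 1])
```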
3. Polynomial Regression:
● Polynomial regression models the relationship between the response and a predictor as
an nth-degree polynomial, allowing it to fit curved (non-linear) trends.
● For example, it can model how crop yield changes with fertilizer amount, where yield
first rises and then falls.
4. Ridge Regression:
● Ridge regression is a linear regression variant that adds an L2 penalty on the size of the
coefficients, shrinking them toward zero to reduce overfitting, especially when
predictors are highly correlated.
● Unlike lasso, ridge shrinks coefficients but does not set them exactly to zero.
5. Lasso Regression:
● Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a type of linear
regression that uses shrinkage like ridge regression, but with the ability to reduce the
coefficient of less important features to zero.
● This property makes it useful for feature selection in cases where there are a large
number of predictor variables.
● For example, lasso regression can be used in genetic studies where there are thousands
of genes, but only a few are likely to be relevant to the disease being studied.
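A minimal sketch of lasso’s feature-selection behavior (assuming scikit-learn; the data is synthetic, with only the first of ten features actually relevant, and alpha=0.1 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends only on the first feature; the rest are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives coefficients of irrelevant features to zero
print(lasso.coef_.round(2))
```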
5. Explain the working principle of the K-Means machine learning technique with a suitable
example. Comment on the elbow plot and convergence. (10)
=>
K-means clustering is a popular unsupervised machine learning technique for grouping
similar data points together:
1. Define K: You specify the desired number of clusters (K) beforehand. This is crucial
for K-means.
2. Initialize centroids: The algorithm randomly selects K data points as initial cluster
centers (centroids).
3. Assign points to clusters: Each data point is assigned to the closest centroid based on
distance (usually Euclidean distance).
4. Recompute centroids: The centroids are recalculated as the mean of the points
assigned to each cluster.
5. Repeat: Steps 3 and 4 are repeated until a stopping condition is met (usually no
significant changes in cluster assignments).
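A minimal NumPy sketch of these five steps (an illustrative implementation, not an optimized one; it also ignores the empty-cluster edge case for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5], [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```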
Example:
Imagine clustering customer data based on purchase history. You might set K=3 (e.g.,
high-spenders, budget-conscious, moderate spenders). K-means would iteratively group
customers into these categories based on their spending patterns.
Elbow Plot:
● The elbow plot helps determine the optimal number of clusters (K).
● It plots the total within-cluster sum of squared distances (inertia) against different
values of K.
● The "elbow" point indicates where adding more clusters no longer significantly
improves the clustering.
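A minimal sketch of building an elbow plot with scikit-learn and matplotlib (both assumed installed; the three-group toy data is illustrative, so the bend should appear near K = 3):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Toy customer data: three loose groups in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(30, 2)) for loc in (0, 5, 10)])

# Inertia (within-cluster sum of squares) for each candidate K
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]

plt.plot(range(1, 9), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia")
plt.title("Elbow plot: pick K near the bend")
plt.show()
```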
Convergence:
● K-means is guaranteed to converge, because each iteration can only decrease (or leave
unchanged) the total within-cluster sum of squares, and there are finitely many possible
cluster assignments.
● However, it converges to a local optimum that depends on the initial centroids, not
necessarily the global optimum. Running the algorithm several times with different
random initializations (or using k-means++ seeding) is common practice.
6. What is the significance of the elbow plot in the K-NN machine learning technique? Explain
the k-Nearest Neighbor supervised method with a suitable example. (10)
=>
K-Nearest Neighbors (K-NN):
● K-NN is a type of instance-based learning, or lazy learning, where the function is only
approximated locally and all computation is deferred until function evaluation.
● It is a non-parametric method used for classification and regression. In both cases, the
input consists of the k closest training examples in the feature space.
Example of K-NN:
● Suppose we have a dataset of movies with features like length, budget, and genre. We
want to predict the genre of a new movie.
● We could use the K-NN algorithm to find the k movies that are most similar based on
the features. The genre of the new movie could be predicted as the most common genre
among these k movies.
● In the context of K-NN, the elbow plot is typically used to choose the optimal number
of neighbors k.
● The x-axis represents the number of neighbors, and the y-axis is some measure of
prediction error.
● As the number of neighbors increases, the prediction error decreases and reaches a
minimum, after which it starts to increase again. This point, where the error is at its
minimum, is called the “elbow”.
● The elbow point is considered to be a good choice for k because it represents a point of
diminishing returns where increasing k does not result in significant improvement in
prediction.
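A minimal sketch of this procedure (assuming scikit-learn and matplotlib; the dataset is synthetic and purely illustrative). It plots cross-validated error against k so the "elbow" can be read off:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cross-validated error for each candidate number of neighbors
ks = range(1, 31)
errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in ks]

plt.plot(ks, errors, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("Cross-validated error")
plt.title("Choosing k: look for where the error stops improving")
plt.show()
```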
Classification Example:
● Suppose we have a dataset of patients with features like age and blood pressure, and we
want to predict whether a new patient will have a disease (class 1) or not (class 2).
● We could use the K-NN algorithm with k=5. This means we find the 5 patients in the
dataset that are most similar to the new patient based on their features.
● Suppose out of these 5 nearest neighbors, 3 of them have the disease (class 1) and 2 do
not (class 2). Since the majority class among the neighbors is class 1, we predict that the
new patient will have the disease.
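A minimal sketch of the patient example (assuming scikit-learn; the ages, blood pressures, and labels are made-up toy values). Note the scaling step, which matters because the two features live on different scales:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy patient data: [age, blood pressure]; 1 = disease, 0 = no disease
X = np.array([[25, 120], [30, 125], [45, 140], [50, 150],
              [35, 130], [60, 160], [40, 135], [55, 155]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Standardize so age and blood pressure contribute comparably to distance
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X), y)

new_patient = scaler.transform([[48, 145]])
print(knn.predict(new_patient))        # majority class among the 5 neighbors
print(knn.predict_proba(new_patient))  # e.g. 3 of 5 neighbors -> 0.6
```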
Classification Diagram:
● Imagine a scatter plot with age on the x-axis and blood pressure on the y-axis. Each
point on the plot represents a patient from the dataset. Points are colored according to
their class: let’s say red for class 1 (disease) and blue for class 2 (no disease).
● When a new patient comes in, we plot their point on the graph. We then draw a circle
around the point that encompasses the 5 closest points (as per Euclidean distance or
some other distance metric). This is the k=5 nearest neighbors.
● If the majority of the points inside the circle are red, we classify the new patient as
having the disease (class 1). If the majority are blue, we classify them as not having the
disease (class 2).
Remember, while K-NN is a powerful and simple classification algorithm, it has its limitations.
It is sensitive to the scale of the data and irrelevant features can cause problems because all
features contribute to the similarity and thus affect the classification. Feature selection and data
scaling are important pre-processing steps when using K-NN. Also, K-NN can be
computationally expensive when dealing with large datasets because the distance to all points
in the dataset needs to be calculated for each prediction. The choice of k is also crucial as a
small k can lead to a model that is sensitive to noise, while a large k can lead to a model that is
too generalized. The elbow method can be used to choose an optimal k.
8. Explain the Apriori machine learning technique. What are some common applications of
association rule mining in real-world scenarios? (10)
=>
Apriori Algorithm:
● The Apriori algorithm is a popular algorithm for mining frequent itemsets for boolean
association rules.
● It uses a breadth-first search strategy to count the support of itemsets and uses a
candidate generation function which exploits the downward closure property of support.
1. The algorithm starts with frequent itemsets of length 1 (the items themselves), and in
each subsequent iteration, generates the frequent itemsets of length k+1 from the
frequent itemsets of length k.
2. After all frequent itemsets have been found, strong association rules (rules that satisfy
the minimum support and confidence) are generated from the frequent itemsets.
Example of Apriori:
● Consider four supermarket transactions: {bread, butter}, {bread, butter, milk},
{bread, milk}, and {butter, milk}. With a minimum support of 50% (2 of 4
transactions), the frequent itemsets include {bread}, {butter}, {milk},
{bread, butter}, {bread, milk}, and {butter, milk}.
● From the frequent itemset {bread, butter}, the rule bread → butter can be generated; its
confidence is support({bread, butter}) / support({bread}) = 2/3 ≈ 67%.
Common Applications of Association Rule Mining:
● Market basket analysis: discovering products frequently bought together to guide shelf
placement, promotions, and bundling.
● Recommendation systems: "customers who bought X also bought Y" suggestions.
● Web usage mining: finding pages commonly visited together within a session.
● Medical and bioinformatics applications: finding symptoms, diagnoses, or genes that
frequently co-occur.
Remember, while the Apriori algorithm is a powerful tool for mining frequent itemsets, it has
its limitations. It can be quite slow and memory-intensive for large datasets. Other algorithms
like FP-Growth can be used when dealing with large datasets. Also, the choice of minimum
support and confidence levels can greatly affect the results of the algorithm. These should be
chosen carefully based on the specific requirements of the problem.
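A minimal sketch of the supermarket example above using the mlxtend library (an assumption: mlxtend is a separate package, installed with `pip install mlxtend`; the support and confidence thresholds are illustrative):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions from the example above
df = pd.DataFrame(
    [[1, 1, 0],   # {bread, butter}
     [1, 1, 1],   # {bread, butter, milk}
     [1, 0, 1],   # {bread, milk}
     [0, 1, 1]],  # {butter, milk}
    columns=["bread", "butter", "milk"],
).astype(bool)

# Frequent itemsets at 50% minimum support, then rules at 60% confidence
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```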
Eager vs. Lazy Learners:
● Eager Learners:
○ Build a complete model using the entire training data before making predictions.
○ Examples: Decision Trees, Support Vector Machines (SVM), Logistic
Regression.
○ Advantages: Fast prediction times for unseen data.
○ Disadvantages: Can be computationally expensive for large datasets.
● Lazy Learners:
○ Do not build a model upfront. They analyze the training data only when making
a prediction for a new instance.
○ Examples: k-Nearest Neighbors (k-NN), Instance-based learning.
○ Advantages: No upfront training cost; new training data can be incorporated
immediately.
○ Disadvantages: Slower prediction times compared to eager learners.
Decision Trees for Classification:
● The decision tree model consists of internal nodes representing features (questions) and
leaf nodes representing class labels (answers).
● During classification, a new data point traverses the tree based on its feature values,
reaching a leaf node that predicts its class.
● Decision trees are interpretable, allowing you to understand the logic behind the
predictions.
Example:
Imagine classifying emails as spam or not-spam. A decision tree might ask questions like
"Does the email contain certain keywords?" or "Is the sender unknown?". Based on the
answers, the email is classified as spam or not-spam.
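A minimal sketch of the spam example (assuming scikit-learn; the two binary features and the labels are toy values). `export_text` prints the learned question/answer structure, which shows why decision trees are interpretable:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy email features: [contains spam keywords, sender unknown]; 1 = spam
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Inspect the tree's internal questions and leaf answers
print(export_text(tree, feature_names=["has_keywords", "unknown_sender"]))

# Classify a new email: keywords present, sender known
print(tree.predict([[1, 0]]))
```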
10. Explain in detail the decision tree algorithm as a regression application with a suitable
example.
=> refer Q.9 as well
While decision trees are primarily used for classification, they can also be effective for
regression tasks:
Example:
Imagine predicting car prices based on features like mileage, year, and engine size. A decision
tree regression model might:
1. Start by splitting cars based on mileage (e.g., high vs. low mileage).
2. Within each mileage range, further split by year (e.g., recent vs. older models).
3. For each subgroup (mileage range and year), predict the average car price as the final
outcome (leaf node).
● Splitting Criterion: Regression trees typically choose splits that maximize variance
reduction (equivalently, minimize mean squared error) in the target variable, whereas
classification trees use criteria like information gain or Gini impurity to maximize
class separation.
● Leaf Nodes: Leaf nodes in regression trees contain predicted continuous values
(average price), while classification trees have class labels (e.g., spam/not-spam).
● Decision tree regression can be a good choice for capturing non-linear relationships
between features and the target variable.
● However, they can be prone to overfitting, so techniques like pruning or setting
minimum samples per split are crucial.
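A minimal sketch of the car-price regression example above (assuming scikit-learn; all prices and feature values are made-up toy data, and max_depth=2 is an arbitrary guard against overfitting):

```python
from sklearn.tree import DecisionTreeRegressor

# Toy car data: [mileage (thousand km), year, engine size (L)] -> price ($1000s)
X = [[120, 2010, 1.6], [30, 2020, 2.0], [80, 2015, 1.8],
     [10, 2022, 2.5], [150, 2008, 1.4], [60, 2018, 2.0]]
y = [5.0, 22.0, 12.0, 30.0, 3.5, 17.0]

# Limiting depth is a simple form of pruning against overfitting
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# The leaf the new car lands in predicts the average price of its subgroup
print(reg.predict([[45, 2019, 2.0]]))
```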
11. What is a decision tree machine learning model? Explain its core logic with suitable
examples of how it is used for classification tasks.
=> refer to the decision tree explanations in Q.9 and Q.10 above.
12. Differentiate between the following approaches used for the integration of a data
mining system with a database or data warehouse system: no coupling, loose coupling,
semi-tight coupling, and tight coupling. State which approach you think is the most
popular, and why.
=>
No Coupling:
● The data mining system does not use any functions of the database or data warehouse
system. It fetches data from a flat file, processes it with its own algorithms, and stores
the results in another file.
● It cannot exploit the DB/DW system's query processing, indexing, or storage
management, and data must be extracted and prepared separately.
Loose Coupling:
● The data mining system uses some facilities of the DB/DW system: it fetches data from
the database, performs mining with its own algorithms, and stores the results back in
the database or in a file.
● It benefits from the DB/DW system's data access and storage, but the mining itself
does not exploit query optimization or scalable in-database processing.
Semi-Tight Coupling:
● In addition to loose coupling, efficient implementations of a few essential data mining
primitives (e.g., sorting, indexing, aggregation, and precomputation of statistics) are
provided inside the DB/DW system, improving performance.
Tight Coupling:
● The data mining system is smoothly integrated into the DB/DW system, which treats
mining as one of its functional components; mining queries are optimized together
with database queries, giving the best performance and scalability.
Most Popular Approach:
● Loose Coupling: This is generally considered the most popular approach for data
mining system integration, because it exploits the database's data handling while
keeping the mining system flexible, portable, and easy to implement.
14. Explain with suitable diagrams the concept of knowledge discovery in databases
(KDD) and discuss its significance in the context of data mining
=>
Concept and Significance:
● KDD refers to the overall process of uncovering valuable knowledge from large
datasets. It's a broader framework encompassing various techniques, including data
mining.
● The KDD process is usually drawn as a pipeline: data selection → data preprocessing
(cleaning) → data transformation → data mining → interpretation and evaluation of the
discovered patterns.
● Data mining is a crucial step within KDD that focuses on extracting specific patterns
and models from the data.
● Its significance is that it frames mining within a complete workflow: data quality,
preparation, and evaluation matter as much as the mining algorithms themselves for
producing valid and useful knowledge.
Hierarchical Clustering:
● The choice between agglomerative (bottom-up) and divisive (top-down) clustering
depends on the problem and the computational resources available.
● Hierarchical clustering doesn’t require us to specify the number of clusters, which is an
advantage over k-means clustering.
● The quality of the hierarchical clustering result can be highly sensitive to the choice of
distance metric.
● Hierarchical clustering can be visualized using a dendrogram, which helps with
understanding the arrangement of the clusters.
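A minimal sketch of such a dendrogram (assuming SciPy and matplotlib; the six toy points and Ward linkage are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Toy 2-D points forming two loose groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.1],
              [8.0, 8.0], [8.5, 8.2], [7.8, 7.9]])

# Agglomerative (bottom-up) clustering with Ward linkage
Z = linkage(X, method="ward")

dendrogram(Z)
plt.title("Dendrogram of agglomerative clustering")
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```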