Data Mining Notes

Data warehouse and OLAP tools are based on a multi-dimensional data model. This model
views data in the form of a data cube.
 What is a Data Cube?
Answer
In data warehouse and online analytical processing (OLAP) systems, a data cube is a multi-
dimensional data structure that allows users to store and analyse data along several
dimensions (such as time, location, and item) together with one or more measures (such as
sales). It is called a "cube" because the data is organised along multiple dimensions, by
analogy with a physical cube.
Overall, the data cube is a key component of data warehouse and OLAP systems, and it
allows users to easily analyze and understand complex data sets.

 What is a Fact Table?


Answer
In data warehouse and online analytical processing (OLAP) systems, a fact table is the central
table that contains the measures, or facts, of the data (for example, units sold or revenue),
together with foreign keys that link it to one or more dimension tables, which contain the
attributes or dimensions of the data.
Overall, the fact table is a key component of data warehouse and OLAP systems, and it
provides the ability to store and analyze the measures or facts of the data.

 Define a distributive measure. Give an example of a distributive measure


Answer
In data analysis and online analytical processing (OLAP) systems, a distributive measure is a
measure that can be computed in a distributed manner: the data can be partitioned, the
function applied to each partition, and the partial results combined to give exactly the same
result as applying the function to the whole data set at once. This is in contrast to an algebraic
measure, which cannot be combined directly from partial results of itself, but can be derived
from a bounded number of distributive measures.
An example of a distributive measure is the sum (other examples are count, min, and max).
For example, in a data warehouse for a retail store, the total sales can be computed by
summing the sales within each region independently and then adding the regional subtotals
to obtain the overall total for the entire data set.

 Define an algebraic measure. Give an example of an algebraic measure.


Answer
In data analysis and online analytical processing (OLAP) systems, an algebraic measure is a
measure that can be computed by an algebraic function with a bounded number of arguments,
each of which is obtained by applying a distributive aggregate function. This is in contrast to
a distributive measure, whose partial results can be combined directly, and to a holistic
measure, which cannot be computed from a bounded set of subaggregates at all.
An example of an algebraic measure is the average (mean) of a set of data, which is obtained
as sum()/count(), where both sum() and count() are distributive. The standard deviation is
another algebraic measure, since it can be derived from the count, the sum, and the sum of
squares of the values. In a data warehouse for a retail store, the average sales per customer
can therefore be computed by keeping the partial sums and counts for each partition of the
data and combining them at the end.

 Explain why holistic measures are not desirable when designing a data
warehouse.
Answer
Holistic measures are measures for which there is no constant bound on the storage needed to
describe a subaggregate: they cannot be computed by combining a fixed number of partial
results and, in general, require access to the whole data set. Examples include the median, the
mode, and the rank; for instance, the median sales amount across all transactions in a retail
data warehouse is a holistic measure.
Holistic measures are not desirable when designing a data warehouse for several reasons.
First, they are computationally expensive, especially for large data sets, because they cannot
be pre-aggregated and combined incrementally; this makes it difficult to answer analytical
queries in a timely manner. Second, they cannot reuse the aggregates that are already
materialised at lower levels of the cube, so every roll-up or drill-down may require
recomputation over the detailed data rather than over compact summaries.

While the Apriori algorithm can find all frequent itemsets within the initial dataset, it
suffers from two non-trivial costs.
 What are these two non-trivial costs?
Answer
The Apriori algorithm is a popular algorithm for mining frequent itemsets in a dataset. It is
based on the idea that if an itemset is frequent, then all of its subsets must also be frequent.
This allows the algorithm to prune the search space and avoid considering many non-frequent
itemsets.
However, the Apriori algorithm suffers from two non-trivial costs. The first cost is the need
to generate and handle a huge number of candidate itemsets: even with pruning, the candidate
sets can grow very large (for example, 10^4 frequent 1-itemsets give rise to more than 10^7
candidate 2-itemsets), and the candidates at each level must be generated and stored while
their support is counted.
The second cost is the need to repeatedly scan the entire dataset: one full scan is required for
each level of candidate itemsets in order to count the support of every candidate against every
transaction. This can be very expensive for large datasets with many transactions and long
frequent patterns.
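A simplified sketch of the level-wise loop (hypothetical transactions, and without Apriori's subset-pruning step) makes both costs visible: the candidate itemsets generated at each level, and the additional full pass over the transactions needed to count their support.

```python
# Simplified level-wise candidate generation and counting (no subset pruning).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support = 2  # absolute support count

items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support]

k = 2
while frequent:
    # Cost 1: candidate k-itemsets generated (and kept in memory) at this level.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Cost 2: one more full scan of the transactions to count each candidate.
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    frequent = [c for c, n in counts.items() if n >= min_support]
    print(f"level {k}: {len(candidates)} candidates, {len(frequent)} frequent")
    k += 1
```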
 Explain how the FP-growth method avoids the two costly problems of the
Apriori algorithm.
Answer
The FP-growth method is an alternative algorithm for mining frequent itemsets in a dataset. It
is based on the idea of using a compressed representation of the dataset, called an FP-tree, to
efficiently identify frequent itemsets.
The FP-growth method avoids the two costly problems of the Apriori algorithm in the
following ways. First, it avoids generating candidate itemsets altogether. Instead, it constructs
the FP-tree directly from the dataset; the tree stores only the frequent items and their
co-occurrences, so no candidate itemsets have to be generated or stored. This makes the
FP-growth method more efficient in terms of space.
Second, it avoids repeatedly scanning the dataset. The dataset is scanned only twice: once to
find the frequent items and once to build the FP-tree. After that, the frequent itemsets are
mined recursively from the tree (using conditional pattern bases and a depth-first strategy)
without any further passes over the original data, which makes the method more efficient in
terms of time.
Overall, the FP-growth method is able to avoid the two costly problems of the Apriori
algorithm by using a compressed representation of the dataset and a more efficient search
algorithm. This makes it a more efficient and effective algorithm for mining frequent itemsets
in large datasets.

 One of the benefits of the FP-tree structure is compactness. Explain why the FP-
growth method is compact.
Answer
The FP-tree is compact because it stores only the frequent items, and because transactions
that share a common prefix of frequent items (ordered by descending frequency) share a
single path in the tree. As a result, the FP-tree is usually much smaller than the original
dataset, which reduces the space required by the FP-growth algorithm.

Suppose that for a data set:


 there are m points and K clusters,
 half the points and clusters are in "more dense" regions,
 half the points and clusters are in "less dense" regions, and
 the two regions are well-separated from each other.
For the region data set, assume that we use the K-Means algorithm to find the clusters.
Discuss the quality of the results of the K-Means algorithm for each of the following cases:
(a) The initial centroids should be equally distributed between the more dense and less
dense regions
Answer
If the initial centroids are equally distributed between the more dense and less dense regions,
then each region receives roughly as many centroids as it has clusters. Because the two
regions are well-separated, centroids that start in one region will stay in that region, so the
K-Means algorithm is likely to identify the clusters in each region and assign the points to the
correct clusters.
Overall, with an equal distribution of initial centroids the quality of the results should be
good: the clusters will be well-defined and the points will be correctly assigned. The main
remaining risk is a poor placement of the centroids within a region, which can cause some
clusters in that region to be split while neighbouring clusters are merged.

(b) More initial centroids should be allocated to the less dense region.
Answer
If more initial centroids are allocated to the less dense region, then that region has more
centroids than it has clusters, while the denser region has fewer centroids than it has clusters.
Because the two regions are well-separated, centroids do not move between them, so some
clusters in the less dense region will be split into several pieces, and some clusters in the
denser region will be merged together.
Overall, allocating more initial centroids to the less dense region tends to lower the quality of
the discovered clusters with respect to the natural clusters: the less dense region is
over-segmented and the denser region is under-segmented. As discussed below, however, this
allocation can still be attractive when the goal is to minimise the squared error.

(c) More initial centroids should be allocated to the denser region.


Answer
If more initial centroids are allocated to the denser region, the situation is the mirror image of
case (b): the denser region has more centroids than it has clusters, while the less dense region
has fewer. Since the two regions are well-separated, some clusters in the denser region will be
split, and some clusters in the less dense region will be merged.
Overall, allocating more initial centroids to the denser region also lowers the quality of the
discovered clusters relative to the natural clusters. It is particularly harmful for the squared
error, because the merged clusters in the less dense region are spread over large distances and
therefore contribute a very large error.

For the region data set, assume that we want to minimise the squared error when
finding the K clusters. Which of the previous three cases should return better results
while minimising the squared error? Justify your answer.
Answer
When using the K-Means algorithm to find the K clusters in the region data set, the goal is to
minimize the squared error, which is a measure of the difference between the points and the
cluster centroids. In order to minimize the squared error, it is important to choose the initial
centroids carefully.
Of the three cases discussed previously, the case where more initial centroids are allocated to
the less dense region is the most likely to return the best results while minimising the squared
error. In the less dense region the points are spread over large distances, so each point's
squared distance to its centroid is large; allocating extra centroids there splits these wide
clusters and reduces the squared error substantially. In the denser region the points are
already very close to their centroids, so taking a centroid away from that region (and merging
two of its clusters) increases the squared error only slightly.
By the same argument, distributing the centroids equally gives a higher squared error than
case (b), because the less dense region keeps its large within-cluster distances, and allocating
more centroids to the denser region gives the highest squared error of the three, since the
sparse clusters that are forced to merge contribute very large squared distances.
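To make this argument concrete, here is a minimal sketch (synthetic two-dimensional data and hand-picked initial centroids, both assumptions for illustration) that runs scikit-learn's KMeans from two different initial allocations and prints the resulting sum of squared errors (inertia_); with well-separated regions of these densities, the allocation that favours the less dense region would be expected to give the lower value.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: a "more dense" region and a "less dense" region, well separated.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.5, size=(500, 2))    # tight cluster region
sparse = rng.normal(loc=20.0, scale=3.0, size=(500, 2))  # spread-out region
X = np.vstack([dense, sparse])

def sse(initial_centroids):
    """Run K-Means from the given initial centroids and return the squared error."""
    init = np.array(initial_centroids, dtype=float)
    km = KMeans(n_clusters=len(init), init=init, n_init=1).fit(X)
    return km.inertia_  # sum of squared distances to the nearest centroid

# Six centroids in total: equal split (3 + 3) vs. favouring the sparse region (2 + 4).
equal_split = [[0, 0], [0.5, 0.5], [-0.5, -0.5], [20, 20], [18, 22], [22, 18]]
favour_sparse = [[0, 0], [0.5, 0.5], [18, 18], [22, 22], [20, 17], [17, 20]]
print("equal split SSE:   ", sse(equal_split))
print("favour sparse SSE: ", sse(favour_sparse))
```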

The K-nearest-neighbour method is based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it. Given an unknown tuple, the
K-nearest-neighbour classifier searches the pattern space for the K training tuples that are
closest to the unknown tuple. These K training tuples are the K "nearest neighbours" of
the unknown tuple. The algorithm is given below, where D = {p1, ..., pN} is the set of
training objects of the form pi = (xi, cj), xi is the n-dimensional feature vector of pi
and cj is the class of pi.
 Define the notion of “Closeness” in this case.
Answer
In the context of the K-nearest-neighbor (KNN) algorithm, "closeness" refers to the similarity
between the unknown tuple and the training tuples. In order to determine the K nearest
neighbors, the KNN algorithm uses a distance measure to evaluate the similarity between the
unknown tuple and each training tuple. This distance measure can be any metric that
quantifies the difference between two data points, such as the Euclidean distance or the
Manhattan distance. The training tuples that have the smallest distance from the unknown
tuple are considered the closest and are chosen as the K nearest neighbors.
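As a minimal sketch (the feature vectors are made up), the two distance measures mentioned above can be written directly from their definitions:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared differences over all features.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences over all features.
    return sum(abs(a - b) for a, b in zip(x, y))

unknown = [1.0, 2.0, 3.0]
training = [0.5, 2.5, 3.5]
print(euclidean(unknown, training), manhattan(unknown, training))
```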

 How can “Closeness” be defined for non-numeric attributes?


Answer
For non-numeric attributes, "closeness" can be defined in terms of the similarity between the
values of the attribute in the unknown tuple and the training tuples. This can be done using
various techniques, such as string matching or calculating the edit distance between the
values of the attribute in the unknown tuple and each training tuple. The training tuples with
the most similar values of the attribute are considered the closest and are chosen as the K
nearest neighbors.
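A minimal sketch of one such measure, simple matching over categorical attributes (the tuples are hypothetical): the distance is the fraction of attribute values that differ.

```python
def mismatch_distance(x, y):
    # Fraction of attribute positions on which the two tuples disagree.
    return sum(a != b for a, b in zip(x, y)) / len(x)

unknown = ("red", "suv", "manual")
training = ("red", "sedan", "manual")
print(mismatch_distance(unknown, training))  # 1/3 of the attributes differ
```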

 Which of the steps 5, 7, 8 or 9 need more computing power?


Answer
Step 7 of the KNN algorithm, which involves calculating the distance between the unknown
tuple and each training tuple, typically requires the most computing power. This is because
the distance calculation typically involves iterating over all of the features of the tuple and
performing some mathematical operations on them, which can be computationally intensive,
especially if the number of features is large or if there are a large number of training tuples.

 How to determine a good value for K?


Answer
There is no fixed rule for determining a good value for K in the KNN algorithm. In general, a
larger value of K means that the algorithm will consider more training tuples when making a
prediction, which can make the prediction more accurate but also slower to compute. A
smaller value of K means that the algorithm will consider fewer training tuples, which can
make the prediction less accurate but faster to compute. The optimal value of K will depend
on the specific dataset and the desired tradeoff between accuracy and computational
complexity.
One approach to determining a good value for K is to use cross-validation. This involves
splitting the dataset into a training set and a validation set, and using the training set to train
the KNN model and the validation set to evaluate the performance of the model for different
values of K. The value of K that produces the best performance on the validation set is
considered the best value for the given dataset. Another approach is to use a heuristic, such as
setting K to the square root of the number of training tuples.
Ultimately, the best value of K will depend on the specific characteristics of the dataset and
the goals of the model. It may require some experimentation to find the optimal value for K
in a given situation.
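A minimal sketch of the cross-validation approach, using scikit-learn and the iris data purely as a stand-in for the real training set: each candidate K is scored with 5-fold cross-validation and the best one is kept.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in for the real training data

scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this value of K.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```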

Data mining is the process of discovering interesting patterns from large amounts of
data. It can be conducted on any kind of data as long as the data are meaningful for a
target application.
 Describe the possible negative effects of proceeding directly to mine the data that
has not been pre-processed.
Answer
There are several possible negative effects of proceeding directly to mine data that has not
been pre-processed. One such effect is that the patterns discovered in the data may be
meaningless or misleading. This can occur because pre-processing is essential for removing
noise and other irrelevant information from the data, and without this step the patterns
discovered in the data may be distorted or obscured.
Second, proceeding directly to mine unprocessed data can also lead to poor model
performance. This is because pre-processing is often necessary for preparing the data in a
format that is suitable for mining, and without this step the mining algorithm may be unable
to learn from the data effectively. This can result in models with low accuracy, poor
generalization, and other problems that can limit their usefulness in real-world applications.

 Define Information Retrieval.


Answer
Information retrieval (IR) is the process of searching for and retrieving information from a
collection of documents or data sources. IR systems are designed to help users find the most
relevant and useful information from a large collection of documents. This is typically done
by using keywords or other query terms to search the collection of documents and ranking the
results based on their relevance to the query. IR systems are commonly used in applications
such as search engines, library catalogs, and online databases.
 Define Precision and Recall.
Answer
Precision and recall are two measures of the performance of an information retrieval (IR)
system. Precision is a measure of the accuracy of the system, which is calculated as the
number of relevant documents retrieved by the system divided by the total number of
documents retrieved by the system. In other words, it measures the fraction of the documents
that the system has retrieved that are actually relevant to the query.
Recall is a measure of the completeness of the system, which is calculated as the number of
relevant documents retrieved by the system divided by the total number of relevant
documents in the collection. In other words, it measures the fraction of the relevant
documents that the system has been able to retrieve.
Together, precision and recall provide a useful way to evaluate the performance of an IR
system. A system with high precision is one that is able to retrieve only relevant documents,
while a system with high recall is one that is able to retrieve most of the relevant documents.
In general, it is difficult for a system to have both high precision and high recall at the same
time, and so tradeoffs must be made depending on the specific goals of the system.
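As a minimal sketch (the document ids are made up), precision and recall follow directly from the two definitions above:

```python
retrieved = {1, 2, 3, 4, 5}          # documents returned by the IR system
relevant = {2, 3, 5, 8, 9, 10}       # documents that are actually relevant

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # 3/5 = 0.6
recall = len(true_positives) / len(relevant)      # 3/6 = 0.5
print(precision, recall)
```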

 What is the main difference between clustering and classification?


Answer
The main difference between clustering and classification is that clustering is an unsupervised
learning method, while classification is a supervised learning method. In other words,
clustering is a way of organizing a collection of data points into groups (or clusters) based on
their similarity, without using any prior knowledge or labels about the data, while
classification is a way of predicting the class or category of a data point based on its features,
using training data that includes labeled examples of each class.
Another key difference between clustering and classification is that clustering is used to
discover structure in the data, while classification is used to make predictions about new data
points. In clustering, the goal is to group the data points into clusters that are internally
coherent and distinct from each other, based on the similarities between the data points. In
classification, the goal is to build a model that can predict the class of a new data point based
on its features, using a training set of labeled examples.
Overall, clustering and classification are two different approaches to analyzing and making
predictions with data, and they are often used in different contexts and for different purposes.

Suppose that you were asked to design a data warehouse to facilitate the analysis of
product sales in your organisation.
 Give the three categories of measures that can be used for the data warehouse
Answer
The three categories of measures that can be used for the data warehouse, classified by the
kind of aggregate function needed to compute them, are:
1. Distributive measures: measures that can be computed by partitioning the data,
aggregating each partition, and combining the partial results. Examples include the
count(), sum(), min(), and max() of the product sales.
2. Algebraic measures: measures that can be computed by an algebraic function with a
bounded number of arguments, each obtained by a distributive aggregate function.
Examples include avg() (computed as sum()/count()), min_N(), max_N(), and the
standard deviation of the sales.
3. Holistic measures: measures for which there is no constant bound on the storage
needed to describe a subaggregate. Examples include the median(), mode(), and
rank() of the sales.
These measures can be used to track and analyse product sales at different levels of
aggregation, and the category of a measure determines how efficiently it can be computed in
the data cube.

 For a data cube with three dimensions: time, location, and item, which category
does the function variance belong to? Describe how to calculate it if the cube is
partitioned into many chunks. Hint: the formula for calculating the variance is
(1/N) * sum_{i=1}^{N} (x_i − x̄)^2, where x̄ is the mean of the x_i.
Answer
The variance is an algebraic measure, because it can be computed from a bounded number of
distributive measures: the count of the data points, the sum of the values, and the sum of the
squared values.
To calculate the variance of a measure (for example, sales revenue) in a data cube that is
partitioned into many chunks, the following steps can be taken:
1. For each chunk of the data cube, compute three distributive aggregates over its data
points: the count n_k, the sum s_k of the values, and the sum q_k of the squared values.
2. Combine the partial results across all chunks: N = sum of the n_k, S = sum of the s_k,
and Q = sum of the q_k.
3. Compute the variance as Q/N − (S/N)^2, which is algebraically equal to
(1/N) * sum_{i=1}^{N} (x_i − x̄)^2 with x̄ = S/N.
This approach allows each chunk to be processed independently, and only three numbers per
chunk need to be stored and combined, which makes the calculation efficient and scalable.
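A minimal sketch of this chunked computation (the chunk values are made up) combines the three distributive aggregates and then applies the formula from step 3:

```python
# Each chunk contributes only its count, sum, and sum of squares.
chunks = [[10.0, 12.0, 9.0], [20.0, 18.0], [15.0]]

partials = [(len(c), sum(c), sum(x * x for x in c)) for c in chunks]
N = sum(n for n, _, _ in partials)
S = sum(s for _, s, _ in partials)
Q = sum(q for _, _, q in partials)

variance = Q / N - (S / N) ** 2   # equals (1/N) * sum((x_i - mean)^2)
print(variance)                   # 16.33... for these values
```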

 Suppose the function is “top 10 sales”. Discuss how to efficiently calculate this
measure in a data cube.
Answer
The "top 10 sales" function is a rank-based measure, but it can still be calculated efficiently
in a data cube that is partitioned into chunks:
1. For each chunk of the data cube, keep only the 10 data points with the highest sales,
for example by maintaining a small heap of size 10 while scanning the chunk once.
2. Merge the per-chunk top-10 lists and again keep only the 10 highest values; this gives
the overall top 10 sales.
3. If required, compute a summary of these 10 data points, such as their total sales revenue.
This approach is efficient because each chunk is scanned only once, at most 10 values per
chunk have to be stored or transmitted, and the per-chunk computations can easily be
parallelised and distributed across multiple processors or machines.
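A minimal sketch of the per-chunk approach (the sales values are made up), using a bounded top-k selection per chunk followed by a merge:

```python
import heapq

chunks = [[120, 5, 300, 42], [18, 999, 7, 250], [64, 88, 410, 3]]

# Keep at most 10 values per chunk, then merge the partial results.
partial_top = [heapq.nlargest(10, chunk) for chunk in chunks]
top10 = heapq.nlargest(10, (v for part in partial_top for v in part))
print(top10)
```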

The Data mining process is composed of several steps. The first step is called data
preprocessing.
 Describe briefly the remaining data mining process steps
Answer
After data preprocessing, the next steps in the data mining process are usually pattern
discovery, feature selection, and model building. In pattern discovery, the goal is to identify
interesting patterns or trends in the data. This can be done using a variety of techniques, such
as clustering, association rule mining, and anomaly detection.
Feature selection is the process of selecting a subset of relevant features or variables from the
data for use in the model building step. This is important because the presence of irrelevant or
redundant features can negatively impact the performance of the model.
Once the relevant features have been selected, the next step is to build the actual model. This
involves training the model on the selected features and using it to make predictions or
classify new data. Different types of models may be used for different types of data mining
tasks, such as decision trees for classification or regression models for predicting numeric
values.
After the model has been built and evaluated, the final step is to deploy the model in a
production environment and use it to make predictions on new data. This may involve
integrating the model into an existing system or application, or using it to generate insights or
inform business decisions.

 It was claimed that the clustering can be used for both pre-processing and data
analysis. Explain the difference between the two applications of clustering
Answer
Clustering can be used for both data preprocessing and data analysis. In data preprocessing,
clustering can be used to identify and remove outliers or to group together similar data points.
This can improve the performance of downstream tasks such as classification or regression,
by reducing the influence of noise or irrelevant data on the model.
In data analysis, clustering can be used to identify meaningful patterns or structures in the
data. This can provide insights into the relationships between different data points and help to
uncover hidden trends or patterns that may not be immediately apparent. For example,
clustering can be used to group together customers with similar purchasing behaviors, or to
identify groups of genes that are co-expressed in different biological samples.
Overall, the main difference between the two applications of clustering is the goal of the
analysis. In data preprocessing, the goal is to improve the performance of downstream tasks,
whereas in data analysis, the goal is to identify and understand patterns and structures in the
data.

 Explain the main differences between data sampling and data clustering during
the pre-processing step.
Answer
Data sampling and data clustering are two different techniques that can be used during the
preprocessing step in the data mining process.
Data sampling involves selecting a subset of the data to use for a given analysis. This can be
useful for reducing the amount of data that needs to be processed, which can save time and
computational resources. It can also be used to ensure that the data used in the analysis is
representative of the overall population. For example, if the data is skewed in some way,
sampling can be used to balance the data and make it more representative.
Data clustering, on the other hand, involves grouping together data points that are similar to
each other, based on some measure of similarity. This can be useful for identifying patterns
or structures in the data, or for grouping together similar data points for further analysis. For
example, clustering can be used to group together customers with similar purchasing
behaviors, or to identify groups of genes that are co-expressed in different biological samples.
Overall, the main difference between data sampling and data clustering is the goal of the
analysis. Data sampling is focused on selecting a representative subset of the data, whereas
data clustering is focused on identifying and grouping together similar data points. These
techniques can be used together, with data sampling being used to select a subset of the data
for clustering, or they can be used independently, depending on the specific goals and needs
of the analysis.

During the data collection the data can be collected from multiple heterogeneous
sources. The data has to be integrated before analysis. Many companies in industry
prefer the update-driven approach (which constructs and uses data warehouses), rather
than the query-driven approach (which applies wrappers and integrators).
 Explain how the query-driven approach is applied on heterogeneous datasets.
Answer
The query-driven approach is a method of data integration that accesses and combines data
from multiple heterogeneous sources using "wrappers" and "integrators". A wrapper is written
for each source; it translates queries and results between the integration system and that
source's local format and query interface. An integrator (or mediator) decomposes a global
query into sub-queries for the relevant sources, sends them to the wrappers, and merges the
partial answers into a single result set for analysis.
In this way, no separate repository is built: when a query is posed, the necessary data are
retrieved from the individual sources on demand and integrated on the fly. The approach is
flexible, because new sources can be added simply by writing a wrapper for them, and the
answers always reflect the current data at the sources; however, every query requires the
sources to be contacted and their results filtered and merged, which can be slow for complex
queries.

 Give the advantages and disadvantages of the update-driven approach.


Answer
The update-driven approach is a method of data integration that involves constructing and
using data warehouses to store and manage data from multiple sources.
One advantage of the update-driven approach is that it allows for the efficient storage and
retrieval of large amounts of data. Data warehouses are designed to store data in a way that is
optimized for fast access and query performance, which makes it possible to quickly retrieve
and analyze large datasets.
Another advantage of the update-driven approach is that it allows for the integration of data
from multiple sources. Data warehouses can be used to combine data from different sources,
such as databases, flat files, and web APIs, and store it in a single, coherent format. This
makes it easier to perform cross-source analysis and to combine data from different sources
in a single query.
One disadvantage of the update-driven approach is that it can be time-consuming and costly
to set up and maintain. Building and maintaining a data warehouse requires specialized skills
and resources, which can be expensive and may not be feasible for all organizations.
Another disadvantage of the update-driven approach is that it can be inflexible and difficult to
update. Once the data warehouse has been built, it can be difficult and time-consuming to
make changes to the data schema or to add new data sources. This can make it difficult to
keep the data warehouse up-to-date and to adapt to changing business needs.

 Explain why the update-driven approach is preferred over the query-driven approach.


Answer
The update-driven approach is often preferred over the query-driven approach because it
allows for the efficient storage and retrieval of large amounts of data.
Data warehouses, which are used in the update-driven approach, are designed to store data in
a way that is optimized for fast access and query performance. This makes it possible to
quickly retrieve and analyze large datasets, which can be important for organizations that
need to process and analyze large volumes of data.
In contrast, the query-driven approach involves using "wrappers" and "integrators" to access
and combine data from multiple sources. This can be less efficient than using a data
warehouse, because the wrappers and integrators have to retrieve data from the various
sources and merge it together each time a query is run. This can be slow and resource-
intensive, especially for large or complex datasets.
Another reason why the update-driven approach is often preferred is that it allows for the
integration of data from multiple sources. Data warehouses can be used to combine data from
different sources, such as databases, flat files, and web APIs, and store it in a single, coherent
format. This makes it easier to perform cross-source analysis and to combine data from
different sources in a single query.
Overall, the update-driven approach is often preferred because it allows for efficient storage
and retrieval of large amounts of data, and it makes it easy to integrate data from multiple
sources. However, it is worth noting that there are also disadvantages to the update-driven
approach, such as the cost and complexity of setting up and maintaining a data warehouse.
Whether it is the best approach for a given organization will depend on its specific needs and
resources.

 Describe situations where the query-driven approach is preferable to the update-driven
approach.
Answer
The query-driven approach is preferable to the update-driven approach in situations where it
is not feasible or cost-effective to set up and maintain a data warehouse.
For example, the query-driven approach may be preferable in cases where the data is
constantly changing or where the data sources are too numerous or diverse to be easily
integrated into a single data warehouse. In these situations, the query-driven approach can be
more flexible and scalable than the update-driven approach, because it allows the data
integration system to access and combine data from multiple sources on-demand, without the
need to build and maintain a centralized data repository.
Another situation where the query-driven approach may be preferable is when the data needs
to be accessed and updated in real-time. In this case, the query-driven approach can be more
responsive than the update-driven approach, because it allows the data integration system to
access and combine data from multiple sources as needed, without the need to rebuild the
entire data warehouse.
Overall, the query-driven approach is preferable to the update-driven approach in situations
where it is not feasible or cost-effective to set up and maintain a data warehouse, or where the
data needs to be accessed and updated in real-time. In these cases, the query-driven approach
can provide a more flexible and scalable solution for data integration.
Suppose that a data warehouse for Big University consists of the four dimensions
student, course, semester, and instructor, and two measures count() and avg grade(). At
the lowest conceptual level (e.g., for a given student, course, semester, and instructor
combination), the avg grade() measure stores the actual course grade of the student. At
higher conceptual levels, avg grade() stores the average grade for the given
combination.
 Give an example of a sub-dimension for each of the four dimensions given above
Answer
A sub-dimension is a subset of the data within a dimension that is defined by a specific
attribute or characteristic. In the case of the Big University data warehouse, there are four
dimensions (student, course, semester, and instructor), and each of these dimensions could
potentially have multiple sub-dimensions.
For example, the student dimension could have a sub-dimension for major, which would
allow for analysis of grades by major. This could be useful for identifying trends or patterns
in the grades of students in different majors, or for comparing the performance of students in
different majors.
Another possible sub-dimension of the student dimension could be gender, which would
allow for analysis of grades by gender. This could be useful for identifying any gender-based
disparities in grades, or for comparing the performance of male and female students.
Similarly, the course dimension could have a sub-dimension for department, which would
allow for analysis of grades by department. This could be useful for identifying trends or
patterns in the grades of students in different departments, or for comparing the performance
of students in different departments.
Another possible sub-dimension of the course dimension could be level (e.g., undergraduate
vs. graduate), which would allow for analysis of grades by course level.
For the semester dimension, a natural sub-dimension is year (or academic year), which would
allow grades to be rolled up from individual semesters to whole years.
For the instructor dimension, a possible sub-dimension is department (or rank), which would
allow grades to be analysed by the department or seniority of the instructor.
Overall, there are many possible sub-dimensions that could be defined for each of the four
dimensions in the Big University data warehouse. The specific sub-dimensions that are chosen
will depend on the goals of the analysis.

 Draw a snowflake schema diagram for the data warehouse.


Answer

 Starting with the base cuboid [student, course, semester, instructor], what
specific OLAP operations (e.g., roll-up from semester to year) should you
perform in order to list the average grade of CS courses for each Big University
student.
Answer
To list the average grade of computer science (CS) courses for each Big University student,
you would start from the base cuboid [student, course, semester, instructor] and perform the
following OLAP operations:
 Roll up on course from course id to department, so that the grades are aggregated per
department.
 Dice (or slice) on the course dimension to keep only the data whose department is CS.
 Roll up on semester from semester to all, and roll up on instructor from instructor to all,
so that the data is aggregated over all semesters and instructors.
After these operations the remaining cuboid is indexed by student (restricted to the CS
department), and its avg grade() measure directly gives the average grade of CS courses for
each Big University student.

 If each dimension has five levels (including all), such as student < major < status
< university < all, how many cuboids will this cube contain (including the base
and apex cuboids)?
Answer
A cuboid is an aggregation of the cube at one particular combination of abstraction levels,
with one level chosen from each dimension's concept hierarchy. The number of cuboids that
a cube contains therefore depends on the number of dimensions and the number of levels
within each dimension.
In the case of a cube with four dimensions, each of which has five levels (including the
virtual level all), the cube will contain 5 to the power of 4 = 625 cuboids, including the base
cuboid and the apex cuboid.
To see why this is the case, note that a cuboid is obtained by fixing, for each dimension, the
level at which that dimension is aggregated. Each of the four dimensions can be at any one of
its five levels (its four concrete levels, such as student < major < status < university, plus all),
and these choices are independent, so there are 5 × 5 × 5 × 5 = 625 possible combinations.
The base cuboid corresponds to choosing the lowest level in every dimension, and the apex
cuboid corresponds to choosing all in every dimension.
In general, for an n-dimensional cube in which dimension i has L_i levels (not counting all),
the number of cuboids is the product of (L_i + 1) over all dimensions; here this gives
(4 + 1)^4 = 625.

A grocery store chain keeps a record of weekly transactions where each transaction
represents the items bought during one cash register transaction. The executives of the
chain receive a summarised report of the transactions indicating what types of items
have sold at what quantity. In addition, they periodically request information about
what items are commonly purchased together. In this case, there are five transactions
and five items, as shown in Table 1.
Suppose that the minimum support and confidence are S = 30% and C = 50%, respectively.

 Calculate the Support and Confidence of the association rules given in Table 2.
Answer
Support and confidence are two measures used in association rule mining to evaluate the
strength of a given association rule. Support measures the frequency with which the items in
a rule appear together in the data, whereas confidence measures the likelihood that the items
in a rule will appear together, given their individual frequencies.
In the case of the grocery store transactions in Table 2, the support and confidence of each
association rule can be calculated as follows:
1. The support of the rule {Bread, Milk} => {Butter} is 3/5 = 60%, because the items
Bread, Milk, and Butter appear together in three of the five transactions.
2. The confidence of the rule {Bread, Milk} => {Butter} is 3/4 = 75%, because three of
the four transactions that contain Bread and Milk also contain Butter.
3. The support of the rule {Bread} => {Milk} is 4/5 = 80%, because the items Bread and
Milk appear together in four of the five transactions.
4. The confidence of the rule {Bread} => {Milk} is 4/5 = 80%, because four of the five
transactions that contain Bread also contain Milk.
5. The support of the rule {Milk} => {Butter} is 2/5 = 40%, because the items Milk and
Butter appear together in two of the five transactions.
6. The confidence of the rule {Milk} => {Butter} is 2/3 = 67%, because two of the three
transactions that contain Milk also contain Butter.
Overall, the support and confidence values for the association rules in Table 2 are:
 {Bread, Milk} => {Butter}: Support = 60%, Confidence = 75%
 {Bread} => {Milk}: Support = 80%, Confidence = 80%
 {Milk} => {Butter}: Support = 40%, Confidence = 67%
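As a minimal sketch, support and confidence can be computed directly from their definitions; the transactions below are hypothetical and are not the ones in Table 1.

```python
transactions = [{"bread", "milk", "butter"}, {"bread", "milk"},
                {"bread", "milk", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # support(A and B) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))                 # support of {Bread, Milk}
print(confidence({"bread", "milk"}, {"butter"}))  # confidence of {Bread, Milk} => {Butter}
```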

 You are asked by the executives to produce information about what items are
commonly purchased together. based on the dataset given in Table 1, what are
the items which are commonly purchased together?
Answer
Based on the data in Table 1, the items that are commonly purchased together are Bread and
Milk, because they appear together in four of the five transactions. Additionally, Bread and
Butter and Milk and Butter also appear together in two of the five transactions, which
suggests that these items are also commonly purchased together.
To determine which items are commonly purchased together, you would need to apply
association rule mining to the data. This involves using algorithms to identify relationships
between items, based on their co-occurrence in the data. The specific rules that are identified
will depend on the minimum support and confidence values that are used, as well as any
additional constraints that are applied.
In the case of the grocery store transactions, you could apply association rule mining with a
minimum support of 30% and a minimum confidence of 50%. This would identify the
following association rules:
 {Bread, Milk} => {Butter}: Support = 60%, Confidence = 75%
 {Bread} => {Milk}: Support = 80%, Confidence = 80%
 {Milk} => {Butter}: Support = 40%, Confidence = 67%
These rules indicate that Bread and Milk, as well as Bread and Butter and Milk and Butter,
are commonly purchased together, because they appear together in a sufficient number of
transactions and there is a high likelihood that they will appear together, given their
individual frequencies.
Overall, the items that are commonly purchased together in the grocery store transactions are
Bread and Milk, as well as Bread and Butter and Milk and Butter. These associations were
identified using association rule mining with a minimum support of 30% and a minimum
confidence of 50%.

Suppose a group of 12 sales price records has been sorted as follows:


5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into three bins by each of the following methods.
1. equal-frequency (equidepth) partitioning:
Equal-frequency (or equidepth) partitioning is a method of dividing a dataset into multiple
bins or groups, where each bin contains an equal number of data points. To apply this method
to the group of 12 sales price records, you would first need to calculate the total number of
bins that you want to create. In this case, we are asked to partition the data into three bins, so
we would need to create three bins.
Next, you would need to divide the total number of data points by the number of bins to
determine the number of data points that each bin should contain. In this case, there are 12
data points and we want to create three bins, so each bin should contain 12/3 = 4 data points.
Finally, you would need to sort the data in ascending order and then divide it into the desired
number of bins, making sure that each bin contains the correct number of data points. In this
case, the data is already sorted in ascending order, so you can simply divide it into three bins
of four data points each, as shown below:
 Bin 1: 5, 10, 11, 13
 Bin 2: 15, 35, 50, 55
 Bin 3: 72, 92, 204, 215
Overall, the equal-frequency (equidepth) partitioning method involves dividing a dataset into
bins of equal size, by calculating the total number of bins, dividing the number of data points
by the number of bins, and then dividing the data into the desired number of bins. In the case
of the group of 12 sales price records, this method would result in the three bins shown
above.
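A minimal sketch of equal-frequency partitioning for these 12 values, assuming NumPy is available:

```python
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
bins = np.array_split(prices, 3)   # three bins of four values each
print([b.tolist() for b in bins])
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
```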

2. equal-width partitioning:
Equal-width partitioning is a method of dividing a dataset into multiple bins or groups, where
each bin covers an equal range of values. To apply this method to the group of 12 sales price
records, you would first need to calculate the total number of bins that you want to create. In
this case, we are asked to partition the data into three bins, so we would need to create three
bins.
Next, you would need to calculate the range of values in the data and then divide this range
by the number of bins to determine the width of each bin. In this case, the range of values is
215 - 5 = 210, and we want to create three bins, so each bin should have a width of 210/3 =
70.
Finally, you would need to sort the data in ascending order and then divide it into the desired
number of bins, making sure that each bin covers the correct range of values. In this case, the
data is already sorted in ascending order, so you can simply divide it into three bins of width
70, as shown below:
 Bin 1 (5-75): 5, 10, 11, 13, 15, 35, 50, 55, 72
 Bin 2 (75-145): 92
 Bin 3 (145-215): 204, 215
Overall, the equal-width partitioning method involves dividing a dataset into bins of equal
width, by calculating the range of values in the data, dividing this range by the number of
bins, and then dividing the data into the desired number of bins. In the case of the group of 12
sales price records, this method would result in the three bins shown above.
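A minimal sketch of equal-width partitioning for the same values, again assuming NumPy; the printed bins match the three bins listed above:

```python
import numpy as np

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])
edges = np.linspace(prices.min(), prices.max(), 4)      # [5, 75, 145, 215]
labels = np.digitize(prices, edges[1:-1], right=True)   # bin index 0, 1 or 2
for b in range(3):
    print("Bin", b + 1, prices[labels == b].tolist())
```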

3. clustering:
To partition the group of 12 sales price records into three bins by clustering, you would need
to choose a clustering algorithm and decide on the number of clusters that you want to create.
There are many different clustering algorithms available, and the specific algorithm that you
choose will depend on the characteristics of your data and the goals of your analysis.
Once you have chosen a clustering algorithm and decided on the number of clusters, you can
apply it to the data to partition it into the desired number of bins or clusters. The exact
characteristics of the clusters will depend on the specific algorithm and parameters that you
use.
As an example, suppose that you choose to use the k-means clustering algorithm with three
clusters. This algorithm works by dividing the data into the desired number of clusters, based
on the similarity of the data points to a set of cluster centroids. In this case, you would apply
the k-means algorithm to the data with three clusters, and the resulting clusters would be:
 Cluster 1: 5, 10, 11, 13, 15, 35
 Cluster 2: 50, 55, 72, 92
 Cluster 3: 204, 215
These clusters represent the three bins or groups into which the data has been partitioned. The
exact characteristics of the clusters will depend on the specific clustering algorithm and
parameters that you use, and you may need to evaluate the resulting clusters to determine
their usefulness for your analysis.
Overall, to partition the group of 12 sales price records into three bins by clustering, you
would need to choose a clustering algorithm, apply it to the data with the desired number of
clusters, and then evaluate the resulting clusters to determine their usefulness for your
analysis.
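A minimal sketch using scikit-learn's k-means with k = 3; as noted above, the exact grouping can vary with the algorithm and its initialisation.

```python
import numpy as np
from sklearn.cluster import KMeans

prices = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215], dtype=float)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(prices.reshape(-1, 1))
for c in range(3):
    print(prices[labels == c].tolist())   # the values assigned to cluster c
```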

Outliers are often discarded as noise. However, one person’s garbage could be another’s
treasure. For example, exceptions in credit card transactions can help us detect the
fraudulent use of credit cards. Taking fraudulence detection as an example of data
mining application,
 Give an example of noisy data in credit card transactions.
Answer
Noisy data in credit card transactions refers to random errors or inaccuracies in the recorded
values, rather than to genuinely unusual behaviour. Examples include a mistyped transaction
amount (for instance, $50.00 recorded as $500.00), an incorrectly recorded merchant code,
date, or location, or a duplicate record created by a processing error.
In the context of fraud detection, such noise is usually removed or corrected during data
cleaning, because it does not reflect the cardholder's real behaviour and can distort the
patterns that a fraud-detection model learns. It should be distinguished from an outlier, which
is a correctly recorded but unusual transaction; an example of an outlier is given in the next
part.

 Give an example of an outlier in credit card transactions.


Answer
An outlier in credit card transactions can refer to a data point that is significantly different
from the other data points in the dataset. Outliers can be caused by a variety of factors, such
as errors in data collection or measurement, or by unusual or exceptional events.
In the context of credit card transactions, an outlier could be a transaction with a very large
amount, or a transaction that is made from an unusual location. For example, suppose that a
user's normal spending patterns involve transactions of $50 to $100, with most transactions
being made at the user's home address. In this case, a transaction of $500 made from a
location in a different country could be considered an outlier.
Overall, an outlier in credit card transactions can refer to a data point that is significantly
different from the other data points in the dataset. In the context of fraud detection, this type
of data can be useful for identifying potentially fraudulent transactions.

Propose two methods that can be used to detect outliers


Answer
There are many different methods that can be used to detect outliers in a dataset. Some of the
most common methods include:
 Statistical methods: Statistical methods are based on the assumption that the data
follows a certain distribution, such as the normal distribution. Outliers can be detected
by calculating summary statistics, such as the mean and standard deviation, and then
identifying data points that are significantly different from the majority of the data.
For example, a data point that is more than three standard deviations away from the
mean can be considered an outlier.
 Data visualization: Outliers can be detected by visualizing the data using scatter plots
or box plots. These plots can show the distribution of the data, and can make it easier
to identify data points that are significantly different from the majority of the data. For
example, a data point that is outside the range of the other data points, or that is not
grouped with the other data points, could be considered an outlier.
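A minimal sketch of the statistical (three standard deviations) method, on synthetic transaction amounts:

```python
import numpy as np

# Synthetic amounts: 200 ordinary transactions plus one very large one.
rng = np.random.default_rng(1)
amounts = np.append(rng.normal(loc=60, scale=15, size=200), 1200.0)

z = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z) > 3]   # more than three standard deviations away
print(outliers)                     # the 1200.0 transaction is flagged
```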
 Discuss which of the two methods is more reliable
Answer
It is difficult to say which of the two methods discussed above is more reliable, as the choice
of method will depend on the specific characteristics of the data and the goals of the analysis.
Both statistical methods and data visualization techniques can be useful for detecting outliers
in a dataset, and each method has its own advantages and disadvantages.
Statistical methods are based on the assumption that the data follows a certain distribution,
such as the normal distribution. These methods can be very reliable if the data does indeed
follow the assumed distribution, as they can provide a precise and objective way to identify
outliers. However, if the data does not follow the assumed distribution, these methods can be
less reliable, as they may not accurately detect the outliers in the data.
Data visualization techniques, on the other hand, do not rely on any assumptions about the
distribution of the data. These methods can be useful for identifying outliers by visually
inspecting the data and identifying data points that are significantly different from the
majority of the data. However, these methods can be subjective, as the identification of
outliers can depend on the interpretation of the person viewing the data.
Overall, both statistical methods and data visualization techniques can be useful for detecting
outliers in a dataset. The choice of method will depend on the specific characteristics of the
data and the goals of the analysis. In some cases, statistical methods may be more reliable,
while in other cases, data visualization techniques may be more effective.

Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal
data stream contains spatial information that changes over time, and is in the form of stream
data, i.e., the data flows in and out like a possibly infinite stream.
 Briefly describe three application examples of spatiotemporal data streams.
Answer
Spatiotemporal data streams can be used in a variety of applications, including:
 Environmental monitoring: Spatiotemporal data streams can be used to monitor
environmental conditions, such as air and water quality, weather patterns, and soil
moisture levels. These data streams can provide real-time information about the state
of the environment, and can be used to identify trends and patterns over time. For
example, a spatiotemporal data stream could be used to monitor air quality in a city,
and to identify areas where air pollution is increasing or decreasing over time.
 Traffic management: Spatiotemporal data streams can be used to monitor traffic
patterns and congestion, and to develop strategies for managing and improving traffic
flow. These data streams can provide real-time information about traffic conditions,
and can be used to identify areas where traffic is likely to be congested, or where
accidents are likely to occur. For example, a spatiotemporal data stream could be used
to monitor traffic on a highway, and to identify areas where traffic is likely to be
heavy during rush hour.
 Disaster management: Spatiotemporal data streams can be used to monitor and
respond to natural disasters, such as earthquakes, hurricanes, and floods. These data
streams can provide real-time information about the location and severity of a
disaster, and can be used to coordinate rescue and recovery efforts. For example, a
spatiotemporal data stream could be used to monitor the spread of a wildfire, and to
identify areas where evacuations are needed.

Discuss what kind of interesting knowledge can be mined from such data streams, with
limited time and resources.
Answer
There is a wide range of interesting knowledge that can be mined from spatiotemporal data
streams, with limited time and resources. Some of the most common types of knowledge that
can be extracted from these data streams include:
 Trends and patterns: Spatiotemporal data streams can be used to identify trends and
patterns in the data, such as changes in environmental conditions, traffic patterns, or
disaster events. These trends and patterns can provide valuable insights into the
dynamics of the system being studied, and can be used to support decision-making
and planning.
 Correlations and relationships: Spatiotemporal data streams can be used to identify
correlations and relationships between different variables, such as the relationship
between air quality and traffic congestion, or the relationship between soil moisture
levels and crop yields. These correlations and relationships can provide valuable
insights into the underlying mechanisms of the system, and can be used to support
predictive modeling and forecasting.
 Anomalies and exceptions: Spatiotemporal data streams can be used to identify
anomalies and exceptions in the data, such as unusual events, unexpected trends, or
unusual patterns. These anomalies and exceptions can be indicators of potential
problems or opportunities, and can be used to support monitoring and alerting
systems.
Overall, there is a wide range of interesting knowledge that can be mined from
spatiotemporal data streams, with limited time and resources. These data streams can be used
to identify trends and patterns, correlations and relationships, and anomalies and exceptions,
and can provide valuable insights into the dynamics of the system being studied.

Identify and discuss the major challenges in spatiotemporal data mining
Answer
There are many challenges associated with spatiotemporal data mining, including:
 Data volume and velocity: Spatiotemporal data streams can be very large, and can
contain a large amount of data. This can make it difficult to process and analyze the
data in real-time, and can require significant computational resources. In addition,
spatiotemporal data streams can be very dynamic, with data flowing in and out at a
high velocity. This can make it difficult to capture and analyze the data in a timely
manner, and can require specialized algorithms and techniques to process the data.
 Data quality and noise: Spatiotemporal data streams can be noisy, and can contain
errors, missing values, or incorrect data. This can make it difficult to extract useful
knowledge from the data, and can require data cleaning and preprocessing techniques
to improve the quality of the data. In addition, spatiotemporal data streams can be
incomplete, and may not capture all the relevant information about a particular event
or phenomenon. This can limit the accuracy and reliability of the knowledge that is
extracted from the data.
 Data integration and fusion: Spatiotemporal data streams can come from multiple
sources, and can be in different formats and structures. This can make it difficult to
integrate the data, and to combine it in a way that is useful for analysis. In addition,
spatiotemporal data streams can be heterogeneous, and can contain different types of
data, such as numeric, categorical, or spatial data. This can require data fusion
techniques to combine the different types of data, and to extract useful knowledge
from the data.
Overall, there are many challenges associated with spatiotemporal data mining, including
data volume and velocity, data quality and noise, and data integration and fusion. These
challenges can make it difficult to extract useful knowledge from spatiotemporal data
streams, and can require specialized algorithms and techniques to overcome these challenges.

Using one application example, sketch a method to mine one kind of knowledge from
such stream data efficiently.
Answer
To mine knowledge from spatiotemporal data streams efficiently, it is important to use
algorithms and techniques that are specifically designed for this type of data. Here is an
example of a method that could be used to mine knowledge from spatiotemporal data
streams, using the application of environmental monitoring as an example:
 Collect and preprocess the data: The first step in the process is to collect the
spatiotemporal data streams that are relevant to the application. This may involve
accessing data from sensors, weather stations, or other sources, and may require data
cleaning and preprocessing to remove errors, missing values, or incorrect data.
 Identify trends and patterns: Once the data has been collected and preprocessed, the
next step is to identify trends and patterns in the data. This may involve using
statistical methods, such as regression analysis or time series analysis, to identify
changes in environmental conditions over time. Alternatively, data visualization
techniques, such as scatter plots or heat maps, can be used to visually inspect the data
and identify trends and patterns.
 Detect anomalies and exceptions: The next step is to detect anomalies and exceptions
in the data, such as unusual events or unexpected trends. This may involve using
outlier detection algorithms, such as the Z-score method or the Tukey method, to
identify data points that are significantly different from the majority of the data.
Alternatively, data mining techniques, such as clustering or classification, can be used
to identify groups of data points that are significantly different from each other.
 Extract useful knowledge: Finally, the extracted knowledge can be used to support
decision-making and planning. For example, the trends and patterns identified in the
data can be used to forecast future environmental conditions, and to identify areas
where interventions are needed. The anomalies and exceptions detected in the data
can be used to trigger alerts and notifications, and to support monitoring and control
systems.
Overall, this method provides a general outline for how to efficiently mine knowledge from
spatiotemporal data streams. By collecting and preprocessing the data, identifying trends and
patterns, detecting anomalies and exceptions, and extracting useful knowledge, it is possible
to extract valuable insights from these data streams, and to support decision-making and
planning.
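As a rough sketch of how the anomaly-detection step might run over a sensor stream with limited memory, the function below keeps only a sliding window of recent readings and flags values that deviate strongly from that window; the readings, window size and threshold are all illustrative.

    from collections import deque
    import statistics

    def stream_anomalies(readings, window=12, threshold=3.0):
        # Yield (index, value) pairs whose value deviates strongly from the recent window.
        recent = deque(maxlen=window)              # bounded memory: suits possibly infinite streams
        for i, value in enumerate(readings):
            if len(recent) == window:
                mean = statistics.mean(recent)
                stdev = statistics.pstdev(recent)
                if stdev > 0 and abs(value - mean) / stdev > threshold:
                    yield i, value
            recent.append(value)

    # hypothetical hourly air-quality readings with one spike
    readings = [40, 42, 41, 39, 43, 42, 40, 41, 44, 42, 41, 40, 180, 43, 41]
    print(list(stream_anomalies(readings)))        # -> [(12, 180)]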

Suppose that the data mining task is to cluster points (with (x, y) representing location)
into three clusters, where the points are
A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9)
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1
as the center of each cluster, respectively. Use the k-means algorithm to show only
 The three cluster centers after the first round of execution.
Answer
The k-means algorithm alternates between assigning each point to its nearest cluster center and recomputing each center as the mean (centroid) of the points assigned to it. Starting from the initial centers A1(2, 10), B1(5, 8) and C1(1, 2):
 Assignment step: using Euclidean distance, A1 is closest to the first center; A3, B1, B2, B3 and C2 are closest to the second center; and A2 and C1 are closest to the third center. The three clusters after the first assignment are therefore {A1}, {A3, B1, B2, B3, C2} and {A2, C1}.
 Update step: the new centers are the centroids of these clusters: (2, 10) for {A1}, ((8+5+7+6+4)/5, (4+8+5+4+9)/5) = (6, 6) for {A3, B1, B2, B3, C2}, and ((2+1)/2, (5+2)/2) = (1.5, 3.5) for {A2, C1}.
So the three cluster centers after the first round of execution are (2, 10), (6, 6) and (1.5, 3.5).
 The final three clusters.
Answer
Repeating the assignment and update steps, the algorithm converges after a few more rounds:
 Round 2: with centers (2, 10), (6, 6) and (1.5, 3.5), the clusters become {A1, C2}, {A3, B1, B2, B3} and {A2, C1}, giving new centers (3, 9.5), (6.5, 5.25) and (1.5, 3.5).
 Round 3: with these centers, the clusters become {A1, B1, C2}, {A3, B2, B3} and {A2, C1}, giving new centers (3.67, 9), (7, 4.33) and (1.5, 3.5).
 Round 4: the assignments no longer change, so the algorithm has converged.
The final three clusters are therefore {A1, B1, C2} with center (3.67, 9), {A3, B2, B3} with center (7, 4.33), and {A2, C1} with center (1.5, 3.5).
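For readers who want to check the computation, a minimal pure-Python sketch of Lloyd's k-means on these eight points, starting from A1, B1 and C1, is given below; it prints the centers and clusters after each round and stops when the centers no longer change (no cluster ever becomes empty for this data).

    import math

    points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
              "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
    centers = [(2, 10), (5, 8), (1, 2)]   # initial centers A1, B1, C1

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    for rnd in range(1, 10):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in centers]
        for name, p in points.items():
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(name)
        # update step: recompute each center as the mean of its assigned points
        new_centers = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1))
                       for c in clusters]
        print(f"round {rnd}: centers {new_centers}, clusters {clusters}")
        if new_centers == centers:        # converged: assignments can no longer change
            break
        centers = new_centers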

 Can you always find an optimal clustering with k-means? Justify your answer.
Answer
No. The k-means algorithm is a greedy, heuristic procedure: each assignment and update step can only decrease (or leave unchanged) the within-cluster sum of squared errors, so the algorithm always terminates after a finite number of iterations, but it terminates at a local optimum of that objective rather than a guaranteed global optimum. Finding the globally optimal k-means clustering is NP-hard in general.
There are two main reasons why the result may be suboptimal. First, the algorithm performs a local search: at each step it only considers reassigning points to the current centers, so it cannot escape a locally optimal configuration even if a better clustering exists elsewhere in the solution space. Second, the result is sensitive to the initial cluster centers; different (often arbitrary or random) initializations can lead to different final clusterings of different quality. In practice this is mitigated by running k-means several times with different initializations (or by using smarter seeding such as k-means++) and keeping the result with the lowest error, but even then optimality is not guaranteed.
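The sensitivity to initialization is easy to demonstrate. The sketch below (using a tiny, hypothetical dataset of four rectangle corners) runs the same Lloyd iteration from two different starting centers and reports the within-cluster sum of squared errors (SSE) of each result: one start reaches the optimal partition (SSE 1.0) while the other gets stuck in a local optimum (SSE 16.0).

    def kmeans(points, centers, rounds=20):
        # Plain Lloyd iteration; returns the final centers and the within-cluster SSE.
        for _ in range(rounds):
            clusters = [[] for _ in centers]
            for p in points:
                j = min(range(len(centers)),
                        key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
                clusters[j].append(p)
            new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   for c in clusters]
            if new == centers:
                break
            centers = new
        sse = sum(min((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for cx, cy in centers)
                  for p in points)
        return centers, sse

    corners = [(0, 0), (0, 1), (4, 0), (4, 1)]     # a long thin rectangle
    print(kmeans(corners, [(0, 0), (4, 0)]))       # good start -> SSE 1.0
    print(kmeans(corners, [(0, 0), (0, 1)]))       # poor start -> SSE 16.0 (local optimum)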

Both k-means and k-medoids algorithms can perform effective clustering.
 Illustrate the strength and weakness of k-means in comparison with k-medoids.
Answer
K-means and k-medoids are both partitioning clustering algorithms that group data points into k clusters. The difference lies in how a cluster is represented: k-means represents a cluster by the mean (centroid) of its points, which is usually not an actual data point, whereas k-medoids represents a cluster by its medoid, the actual data point that minimizes the total distance to the other points in the cluster.
The main strengths of k-means are its speed and simplicity. Each iteration is linear in the number of points, roughly O(nkt) overall for n points, k clusters and t iterations, so it scales to large datasets, and it is easy to implement.
Its main weaknesses are that it is sensitive to outliers, because a single extreme value can pull a centroid far away from the bulk of a cluster, it is sensitive to the initialization of the cluster centers, and it is only defined for data on which a mean can be computed (e.g. numeric attributes).
K-medoids, by contrast, is more robust to noise and outliers, because medoids are actual data points and are far less affected by extreme values, and it can be used with arbitrary distance or dissimilarity measures. However, it is considerably more expensive: the classic PAM algorithm costs O(k(n - k)^2) per iteration, so it does not scale well to large datasets (variants such as CLARA and CLARANS trade some accuracy for scalability), and it is somewhat harder to implement than k-means.
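The robustness difference is easy to see in one dimension: a single extreme value drags the mean (the center used by k-means) far more than the medoid (the actual data point minimizing total distance to the others, used by k-medoids). The values below are hypothetical.

    cluster = [20, 21, 22, 23, 24, 200]           # one gross outlier

    mean = sum(cluster) / len(cluster)            # center used by k-means
    medoid = min(cluster, key=lambda m: sum(abs(m - x) for x in cluster))  # center used by k-medoids

    print(mean)    # 51.67, pulled towards the outlier
    print(medoid)  # 22, stays inside the bulk of the data (23 ties)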

 Illustrate the strength and weakness of k-means in comparison with a
hierarchical clustering scheme (e.g. AGNES).
Answer
K-means and hierarchical clustering take fundamentally different approaches. K-means is a partitioning method: it requires the number of clusters k in advance and iteratively refines a single flat partition of the data. AGNES (AGglomerative NESting) is a hierarchical method: it starts with every point in its own cluster and repeatedly merges the two closest clusters, producing a nested hierarchy of clusters (a dendrogram) rather than a single partition.
The main strengths of k-means are speed and scalability: its cost is roughly O(nkt), so it can handle large datasets, and it is easy to implement. Its weaknesses are that k must be chosen in advance, the result depends on the initial cluster centers, it favors compact, roughly spherical clusters of similar size, and it is sensitive to outliers.
Hierarchical clustering does not require k to be fixed in advance; the dendrogram can be cut at any level, which makes it easy to explore clusterings at different granularities, and it can reveal nested cluster structure that k-means cannot. Its weaknesses are cost and rigidity: AGNES needs at least O(n^2) time and space to maintain the inter-cluster distance matrix, so it does not scale to large datasets, and merge decisions are never undone, so an early bad merge cannot be corrected later. Its results also depend on the linkage criterion chosen (for example, single linkage tends to produce elongated, chained clusters and is sensitive to noise, while complete or average linkage favors compact clusters).
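As an illustrative sketch of the agglomerative idea behind AGNES, the toy function below repeatedly merges the two closest clusters (single linkage, one-dimensional values) until the requested number of clusters remains; production implementations (for example scipy.cluster.hierarchy) also record the full merge history as a dendrogram. The input values are hypothetical.

    def agnes(points, k):
        # Naive single-linkage agglomerative clustering down to k clusters.
        clusters = [[p] for p in points]            # start: every point is its own cluster
        def linkage(a, b):                          # single linkage: closest pair of members
            return min(abs(x - y) for x in a for y in b)
        while len(clusters) > k:
            # find the pair of clusters with the smallest linkage distance and merge them
            i, j = min(((i, j) for i in range(len(clusters))
                               for j in range(i + 1, len(clusters))),
                       key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] += clusters.pop(j)          # merges are never undone
        return clusters

    print(agnes([13, 15, 16, 20, 35, 36, 40, 70], 3))
    # -> [[13, 15, 16, 20], [35, 36, 40], [70]]  (the isolated value 70 ends up alone)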
Consider a dataset describing patients of a given clinic. The dataset includes the
attribute age. The age values within the dataset are as follows:
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70.
 Calculate the mean and median of the data
Answer
To calculate the mean, we sum all of the values and divide by the number of values. The dataset contains 27 values and their sum is 809, so the mean is 809/27 ≈ 29.96, i.e. about 30 years.
To calculate the median, we first arrange the values in ascending order (they already are) and take the middle value. Since there are 27 values, an odd number, the median is the 14th value, which is 25.
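The arithmetic can be checked with Python's statistics module:

    import statistics

    ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
            33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

    print(len(ages), sum(ages))       # 27 809
    print(statistics.mean(ages))      # about 29.96
    print(statistics.median(ages))    # 25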

 Give the five-number summary of the data.
Answer
The five-number summary of a dataset consists of the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum. With the 27 values already sorted in ascending order:
 The minimum value of the dataset is 13.
 Q1 separates the bottom 25% of the data from the rest. Using the (n + 1)/4 position rule, its position is (27 + 1)/4 = 7, so Q1 is the 7th value, which is 20. (Other quartile conventions interpolate between the 7th and 8th values and give 20.5; either answer is acceptable if the rule used is stated.)
 The median is the 14th value, which is 25.
 Q3 separates the top 25% of the data from the rest. Its position is 3(27 + 1)/4 = 21, so Q3 is the 21st value, which is 35.
 The maximum value of the dataset is 70.
Therefore, the five-number summary of the dataset is 13, 20, 25, 35, 70.
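A small helper following the (n + 1) position rule used above reproduces these values (note that other interpolation rules, such as the default in numpy.percentile, give Q1 = 20.5):

    ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
            33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

    def quartile(sorted_vals, q):
        # Quantile by the (n + 1) position rule, interpolating linearly between ranks.
        pos = (len(sorted_vals) + 1) * q           # 1-based rank, possibly fractional
        lo = int(pos)
        frac = pos - lo
        if lo >= len(sorted_vals):
            return sorted_vals[-1]
        return sorted_vals[lo - 1] + frac * (sorted_vals[lo] - sorted_vals[lo - 1])

    print(min(ages), quartile(ages, 0.25), quartile(ages, 0.5), quartile(ages, 0.75), max(ages))
    # -> 13 20.0 25.0 35.0 70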

 Use smoothing by bin means to smooth these data, using a bin depth of 3.
Illustrate your steps. Comment on the effect of this technique for the given data.
Answer
Smoothing by bin means smooths a sorted dataset by partitioning it into equal-frequency (equal-depth) bins and replacing every value in a bin with the mean of that bin. A bin depth of 3 means that each bin contains three consecutive values.
With the 27 sorted age values we obtain nine bins:
 Bin 1: 13, 15, 16 → mean 14.67
 Bin 2: 16, 19, 20 → mean 18.33
 Bin 3: 20, 21, 22 → mean 21
 Bin 4: 22, 25, 25 → mean 24
 Bin 5: 25, 25, 30 → mean 26.67
 Bin 6: 33, 33, 35 → mean 33.67
 Bin 7: 35, 35, 35 → mean 35
 Bin 8: 36, 40, 45 → mean 40.33
 Bin 9: 46, 52, 70 → mean 56
Replacing each value by its bin mean gives the smoothed dataset:
14.67, 14.67, 14.67, 18.33, 18.33, 18.33, 21, 21, 21, 24, 24, 24, 26.67, 26.67, 26.67, 33.67, 33.67, 33.67, 35, 35, 35, 40.33, 40.33, 40.33, 56, 56, 56
Smoothing by bin means reduces the influence of individual values and random noise, which makes broad trends in the data easier to see. For most of this dataset the effect is mild because neighbouring ages are already close together, but it is very noticeable in the last bin, where the outlying value 70 pulls the bin mean up to 56 and distorts the values 46 and 52. In general, the larger the bin depth, the stronger the smoothing and the greater the loss of detail.
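The same smoothing can be reproduced with a few lines of Python (equal-depth bins of three values, each value replaced by its bin mean):

    ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
            33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

    depth = 3
    smoothed = []
    for i in range(0, len(ages), depth):
        bin_values = ages[i:i + depth]                    # equal-depth (equal-frequency) bin
        bin_mean = round(sum(bin_values) / len(bin_values), 2)
        smoothed.extend([bin_mean] * len(bin_values))     # replace each value by its bin mean

    print(smoothed)
    # [14.67, 14.67, 14.67, 18.33, 18.33, 18.33, 21.0, 21.0, 21.0, 24.0, ...]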

 How might you determine outliers in the data?
Answer
One way to determine potential outliers in this dataset is to use the interquartile range (IQR), which measures the spread of the middle 50% of the data: IQR = Q3 − Q1 = 35 − 20 = 15. A common rule of thumb flags as a potential outlier any value more than 1.5 × IQR below the first quartile or above the third quartile, i.e. below 20 − 1.5 × 15 = −2.5 or above 35 + 1.5 × 15 = 57.5. By this rule, the value 70 is a potential outlier. Alternative approaches include inspecting a boxplot of the data, or flagging values whose z-score (distance from the mean in standard deviations) exceeds some threshold.
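Concretely, with Q1 = 20 and Q3 = 35 from the five-number summary above (the 1.5 multiplier is the conventional Tukey factor):

    q1, q3 = 20, 35
    iqr = q3 - q1                       # 15
    lower = q1 - 1.5 * iqr              # -2.5
    upper = q3 + 1.5 * iqr              # 57.5

    ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
            33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
    print([a for a in ages if a < lower or a > upper])    # -> [70]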

 What other methods are there for data smoothing?
Answer
There are several other methods for data smoothing, including:
 Binning: This involves dividing the data into intervals or "bins" and then replacing the
individual values within each bin with a summary statistic (such as the mean or
median) for that bin. This can help to reduce the impact of individual outlier values on
the overall distribution of the data.
 Regression: This involves fitting a mathematical function (such as a linear or
polynomial function) to the data and using the function to predict the values of the
data. This can help to smooth out any random noise or variability in the data and
make it easier to identify underlying trends or patterns.
 Smoothing splines: This involves fitting a smooth curve (called a spline) to the data,
which can help to reduce the impact of individual outlier values on the overall
distribution of the data.
 Kernel smoothing: This involves estimating the underlying probability density
function (PDF) of the data by fitting a smoothing kernel to the data. This can help to
smooth out the data and make it easier to identify underlying patterns or trends.
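As a brief illustration of regression-based smoothing, the sketch below fits a least-squares straight line with numpy.polyfit to a hypothetical noisy series and uses the fitted values as the smoothed data:

    import numpy as np

    x = np.arange(10)
    y = np.array([1.0, 2.2, 2.9, 4.3, 4.8, 6.4, 6.9, 8.1, 9.2, 9.8])   # hypothetical noisy readings

    coeffs = np.polyfit(x, y, deg=1)      # fit a straight line y ~ a*x + b
    smoothed = np.polyval(coeffs, x)      # the fitted values replace the raw ones
    print(np.round(smoothed, 2))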

Suppose that Dublin city transportation department would like to perform data analysis
on motorway traffic for the planning of M50 extension based on the city traffic data
collected at different hours every day, for the last five years.
 Briefly explain what is meant by the term spatial data warehouse.
Answer
A spatial data warehouse is a database that is specifically designed to store, manage, and
analyze spatial data. Spatial data is data that includes geographic or geospatial information,
such as the location of objects on the Earth's surface. A spatial data warehouse allows users to
query and analyze this data in order to identify patterns, trends, and relationships that may not
be apparent from the raw data alone. This can be useful for applications such as
transportation planning, where it is important to understand the spatial distribution of traffic
on the city's roads.

 Design a spatial data warehouse that stores the motorway traffic information so
that people can easily see the average and peak time traffic flow by motorway, by
time of day, by weekdays, and the traffic situation when a major accident occurs
Answer
One potential design for a spatial data warehouse that stores motorway traffic information is
as follows:
The data warehouse would store the following data:
 Date and time of the traffic measurement
 Location of the traffic measurement (e.g. coordinates, motorway name)
 Traffic flow (e.g. number of vehicles per hour)
 Accident indicator (e.g. 0/1 value indicating whether an accident occurred at that time
and location)
The data warehouse would use spatial data types and functions to store and manipulate the
location data. This would allow users to easily query and analyze the data based on the
location of the traffic measurements.
The data warehouse would include dimensions for time (e.g. hour, day of week), motorway,
and accident. These dimensions would be used to slice and dice the data in order to
understand the traffic flow by different time periods, motorways, and accident situations.
The data warehouse would include measures for traffic flow (e.g. average flow, peak flow).
These measures would be aggregated along the time and motorway dimensions, allowing
users to easily see the average and peak traffic flow by time of day, by motorway, and by
weekdays.
The data warehouse would allow users to query the data using a variety of spatial and
temporal filters, such as the location of a specific motorway or the time of day. This would
enable users to easily analyze the traffic situation on a particular motorway or at a specific
time of day, and to compare the traffic flow between different motorways and time periods.
The data warehouse would also allow users to query the data based on the accident indicator.
This would enable users to understand the impact of accidents on traffic flow, and to identify
potential bottlenecks or areas of congestion that may need to be addressed in the planning of
the M50 extension.
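Assuming the fact table above has been loaded into a pandas DataFrame (all column names here are illustrative, not part of the design), the average and peak flow by motorway and hour, and the records associated with accidents, could be queried roughly as follows:

    import pandas as pd

    # hypothetical extract of the traffic fact table
    traffic = pd.DataFrame({
        "motorway":  ["M50", "M50", "M50", "M1", "M1", "M1"],
        "timestamp": pd.to_datetime(["2023-05-01 08:00", "2023-05-01 09:00", "2023-05-02 08:00",
                                     "2023-05-01 08:00", "2023-05-01 17:00", "2023-05-02 17:00"]),
        "flow":      [5200, 4100, 5050, 3300, 3900, 4050],   # vehicles per hour
        "accident":  [0, 0, 1, 0, 0, 0],
    })

    traffic["hour"] = traffic["timestamp"].dt.hour
    traffic["weekday"] = traffic["timestamp"].dt.day_name()   # for the by-weekday view

    summary = (traffic.groupby(["motorway", "hour"])["flow"]
                      .agg(avg_flow="mean", peak_flow="max"))
    print(summary)

    # traffic situation when a major accident occurs
    print(traffic[traffic["accident"] == 1])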

 What information can we mine from such a spatial data warehouse to help city
planners?
Answer
A spatial data warehouse that stores motorway traffic information could be used to mine a
variety of useful insights for city planners. Some examples of information that could be
mined from such a data warehouse include:
 The average and peak traffic flow on different motorways, by time of day and by day
of week. This could help city planners to identify the busiest times and routes, and to
plan for additional capacity or alternative routes as needed.
 The impact of accidents on traffic flow. This could help city planners to understand
how accidents affect traffic on different motorways, and to identify potential
bottlenecks or areas of congestion that may need to be addressed in the planning of
the M50 extension.
 The spatial distribution of traffic flow. This could help city planners to visualize the
overall traffic situation in the city, and to identify potential areas for improvement or
expansion.
 The relationship between traffic flow and other factors, such as weather, road
conditions, and events. This could help city planners to understand how these factors
affect traffic flow, and to plan for potential disruptions or changes in traffic patterns.
Overall, a spatial data warehouse that stores motorway traffic information could provide
valuable insights to help city planners make informed decisions about the planning of the
M50 extension.
 This data warehouse contains both spatial and temporal data. Propose one
mining technique that can efficiently mine interesting patterns from such a
spatio-temporal data warehouse.
Answer
One mining technique that could be used to efficiently mine interesting patterns from a
spatio-temporal data warehouse is spatial-temporal clustering. This technique involves using
algorithms to identify groups or clusters of spatial and temporal data points that have similar
characteristics or behaviors. For example, in the case of motorway traffic data, spatial-
temporal clustering could be used to identify groups of motorways that have similar traffic
patterns (e.g. high traffic during rush hour, low traffic at night), or to identify groups of times
and locations that have high accident rates.
Overall, spatial-temporal clustering could be an effective mining technique for efficiently
mining interesting patterns from a spatio-temporal data warehouse of motorway traffic data.
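As a rough prototype of spatial-temporal clustering, each observation can be treated as a point in (x, y, t) space, the axes scaled so that spatial and temporal distances are comparable, and a density-based algorithm such as DBSCAN applied; the sketch below uses scikit-learn with entirely hypothetical observations.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    # hypothetical (x_km, y_km, hour_of_day) observations of heavy traffic
    events = np.array([
        [3.0, 4.1, 8], [3.2, 4.0, 8], [3.1, 4.2, 9],      # morning hot spot
        [7.5, 1.0, 17], [7.6, 1.2, 18], [7.4, 0.9, 17],   # evening hot spot
        [1.0, 9.0, 13],                                    # isolated event -> noise
    ])

    scaled = StandardScaler().fit_transform(events)        # make the three axes comparable
    labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(scaled)
    print(labels)    # two clusters plus one noise point (-1), e.g. [0 0 0 1 1 1 -1]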