COMP1942 Question Paper

COMP1942 Exploring and Visualizing Data (Spring Semester 2014)

Midterm Examination (Question Paper)
Date: 1 April, 2014 (Tue)
Time: 12:05-13:20
Duration: 1 hour 15 minutes
Student ID:__________________ Student Name:_________________________________
Seat No. :__________________
Instructions:
(1) Please answer all questions in Part A and Part B in the answer sheet.
(2) You can optionally answer the bonus question in Part C in the answer sheet. You can obtain additional
marks for the bonus question if you answer it correctly.
(3) You can use a calculator.
Part A (Compulsory Short Questions)

Q1 (20 Marks)
(a) Consider a data set containing 10 transactions and 6 items.

We know that the lift ratio of the association rule “{A, B} → C” is 1.25.
We also know that the support of {A} is 7, the support of {B} is 5, the support of {C} is 6, the support
of {A, B} is 4, the support of {A, C} is 5 and the support of {B, C} is 3.
Is it always true that we can find the support of “{A, B} → C”? If yes, please explain why and write down
the support of “{A, B} → C”. Otherwise, please explain why not.
(b) In the Apriori algorithm, we know how to find some sets L1, C2, L2, ….
(i) Is it always true that the number of itemsets in L2 is smaller than or equal to the number of itemsets
in C2? If yes, please explain it. Otherwise, please give a counter example.
(ii) Is it always true that the number of itemsets in C2 is larger than or equal to the number of itemsets in
L1? If yes, please explain it. Otherwise, please give a counter example.
(c) We know that conditional FP-trees are constructed from an FP-tree. Is it always true that we can
reconstruct the FP-tree from all of the conditional FP-trees constructed? Please elaborate.
Q2 (20 Marks)
(a) Consider the forgetful sequential k-means clustering algorithm. Let a be a constant defined in this
algorithm.
(i) Please write down the steps of the forgetful sequential k-means clustering algorithm.
(ii) Consider a cluster found by the algorithm that contains n examples and whose initial mean is equal to
m_0. Let x_j be the j-th example added to this cluster and m_j be the mean vector of this cluster after the
first j examples are added, for j = 1, 2, …, n. We can express m_n in the following form.

m_n = X \cdot m_0 + \sum_{p=1}^{n} Y \cdot x_p

where X and Y are some expressions.

Please show that m_n can be expressed in this form. After you show this statement, please also
write down what X and Y are.
(You are not required to memorize the formula for this question. You just need to show how you
obtain the above expression and finally you can obtain X and Y.)
(b) We are given the following table with 3 input attributes, namely “Gender”, “Child” and “Income”, and
1 target attribute, namely “Insurance”. “Actual Insurance” corresponds to the actual values for
attribute “Insurance” and “Predicted Insurance” corresponds to the values for attribute “Insurance”
given by a classification model (e.g., decision tree).
Gender  Child  Income  Actual Insurance  Predicted Insurance
Male    Yes    High    Yes               Yes
Male    Yes    Low     No                Yes
Male    No     High    No                No
Female  Yes    High    Yes               No
Female  Yes    Medium  No                Yes
Female  No     Medium  Yes               Yes
Female  No     Low     No                No
(i) Please give the confusion matrix.
(ii) Please give the lift chart.
Q3 (20 Marks)
(a) Please give two reasons why we need to do clustering.

(b) We are given five data points.
a: (1, 2), b: (2, 4), c: (7, 6), d: (6, 9), e: (8, 9)
Suppose that there are two clusters. The first cluster contains points a and b while the second cluster
contains points c, d and e.
(i) (1) What is the center of the first cluster if we use the centroid linkage as a distance measurement?
(2) What is the center of the second cluster if we use the centroid linkage as a distance measurement?
(ii) Consider the agglomerative approach for hierarchical clustering.
Suppose that these two clusters are merged.
(1) What is the center of the merged cluster if we use the centroid linkage as a distance measurement?
(2) What is the center of the merged cluster if we use the median linkage as a distance measurement?
Q4 (20 Marks)
The following shows a history of customers with their incomes, ages and an attribute called “Have_iPhone”
indicating whether they have an iPhone. We also indicate whether they will buy an iPad or not in the last
column. You cannot use XLMiner in this question.
No.  Income  Age    Have_iPhone  Buy_iPad
1    high    young  yes          yes
2    high    old    yes          yes
3    medium  young  no           yes
4    high    old    no           yes
5    medium  young  no           no
6    medium  young  no           no
7    medium  old    no           no
8    medium  old    no           no
We want to train a CART decision tree classifier to predict whether a new customer will buy an iPad or not.
We define the value of attribute Buy_iPad to be the label of a record.
(a) Please find a CART decision tree according to the above example. In the decision tree, whenever
a node contains at most 3 records, we do not continue to process this node for splitting.
(b) Consider a new young customer whose income is medium and he has an iPhone. Please predict
whether this new customer will buy an iPad or not.
Part B (Compulsory Multiple-Choice (MC) Questions)

In this part, there are 4 multiple-choice questions, namely Q5, Q6, Q7 and Q8. This part carries 20 marks
in total. Each question is worth 5 marks.
Q5. [Removed]
A. [Removed]
B. [Removed]
C. [Removed]
D. [Removed]
E. [Removed]
Q6. [Removed]
A. [Removed]
B. [Removed]
C. [Removed]
D. [Removed]
E. [Removed]
Q7. [Removed]
A. [Removed]
B. [Removed]
C. [Removed]
D. [Removed]
E. [Removed]
Q8. [Removed]
A. [Removed]
B. [Removed]
C. [Removed]
D. [Removed]
E. [Removed]
Part C (Bonus Question)

Note: The following bonus question is an OPTIONAL question. You can decide whether you will answer
it or not.
Q9 (10 Additional Marks)
We are given four items, namely A, B, C and D. Their corresponding unit profits are pA, pB, pC and pD.
The following shows five transactions with these items. Each row corresponds to a transaction, where each
non-negative integer in the row is the number of occurrences of the corresponding item in the
transaction.
A B C D
0 0 3 2
3 4 0 0
0 0 1 3
1 0 3 5
6 0 0 0
The frequency of an itemset in a row is defined to be the minimum of the number of occurrences of all items
in the itemset. For example, itemset {C, D} in the first row has frequency = 2. But, itemset {C, D} in the
third row has frequency = 1.
The frequency of an itemset in the dataset is defined to be the sum of the frequencies of the itemset in all
rows in the dataset. For example, itemset {C, D} has frequency = 2+0+1+3+0 = 6.
Define a function f on an itemset s. This function will be specified later. One example of this function is
f(s) = \sum_{i \in s} p_i. In this example, if s = {C, D}, then f(s) = pC + pD.
The profit of an itemset s in the dataset is defined to be the product of the frequency of this itemset in the
dataset and f(s).
For example, itemset {C, D} has profit = 6 · f({C, D}).
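
The definitions above can be sketched in Python as follows. This is only an illustration of the definitions, not part of the question and not a solution to parts (a) or (b). The names ROWS, row_frequency, frequency and profit are hypothetical, itemsets are assumed to be non-empty, and the unit profits used in the usage lines are placeholder values that do not come from this paper.

```python
from typing import Callable, Dict, FrozenSet, List

Row = Dict[str, int]
Itemset = FrozenSet[str]

# The five transactions from the table above (item -> number of occurrences).
ROWS: List[Row] = [
    {"A": 0, "B": 0, "C": 3, "D": 2},
    {"A": 3, "B": 4, "C": 0, "D": 0},
    {"A": 0, "B": 0, "C": 1, "D": 3},
    {"A": 1, "B": 0, "C": 3, "D": 5},
    {"A": 6, "B": 0, "C": 0, "D": 0},
]


def row_frequency(itemset: Itemset, row: Row) -> int:
    """Frequency in one row: the minimum occurrence count over the items."""
    return min(row[item] for item in itemset)


def frequency(itemset: Itemset, rows: List[Row]) -> int:
    """Frequency in the dataset: the sum of the per-row frequencies."""
    return sum(row_frequency(itemset, row) for row in rows)


def profit(itemset: Itemset, rows: List[Row], f: Callable[[Itemset], float]) -> float:
    """Profit in the dataset: dataset frequency multiplied by f(itemset)."""
    return frequency(itemset, rows) * f(itemset)


# Usage: reproduces the {C, D} frequency example given in the text.
cd = frozenset({"C", "D"})
print(frequency(cd, ROWS))  # 2 + 0 + 1 + 3 + 0 = 6

# Placeholder unit profits (an assumption for illustration only).
unit_profit = {"A": 1, "B": 1, "C": 1, "D": 1}
print(profit(cd, ROWS, lambda s: sum(unit_profit[i] for i in s)))
```

Passing f as a parameter mirrors the wording of the question, which leaves f to be specified later.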
(a) Assume that we adopt function f such that f(s) = (\sum_{i \in s} p_i)/|s|, where |s| denotes the number of items in s.
Suppose that we know that pA = 10, pB = 10, pC = 10 and pD = 10.
We want to find all itemsets with profit at least 50.
Can the Apriori Algorithm be adapted to find these itemsets?
If yes, please write down the pseudo-code and illustrate it with the above example.
If no, please explain why the Apriori Algorithm cannot be adapted. In this case, please also design
an algorithm, write down the pseudo-code and illustrate it with the above example.
(b) Assume that we adopt function f such that f(s) = \sum_{i \in s} p_i.
Suppose that we know that pA = 5, pB = 10, pC = 6 and pD = 4.
We want to find all itemsets with profit at least 50.
Can the Apriori Algorithm be adapted to find these itemsets?
If yes, please write down the pseudo-code and illustrate it with the above example.
If no, please explain why the Apriori Algorithm cannot be adapted. In this case, please also design
an algorithm, write down the pseudo-code and illustrate it with the above example.
End of Paper