
1) How might you determine outliers in the data?

Ans: There are several ways to determine outliers in the data:

Visual inspection: One of the simplest ways to identify outliers is to visually
inspect the data using box plots, scatter plots, or histograms. In a box plot, outliers
are often represented as individual points beyond the whiskers of the box, while in
a scatter plot, outliers are points that are far from the main cluster of points.

Z-score: A z-score measures how many standard deviations a data point is from
the mean of the data set. A data point that has a z-score greater than 3 or less than
-3 is often considered an outlier.

Tukey's method: Tukey's method, also known as the interquartile range (IQR)
method, defines outliers as any data points that are more than 1.5 times the IQR
below the first quartile or above the third quartile.

Cook's distance: Cook's distance is a measure used in linear regression to identify
influential points that may be outliers. As a rule of thumb, a data point with a Cook's
distance greater than 1 is often flagged as influential and examined as a possible outlier.

Machine learning techniques: Various machine learning techniques such as
clustering, decision trees, and support vector machines can be used to identify
outliers in the data.

It is important to note that the determination of outliers depends on the context and
the purpose of the analysis, and outliers should be carefully examined and
interpreted before any action is taken.
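
The z-score and Tukey (IQR) rules described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names and the sample data are hypothetical, the z-score uses the population standard deviation, and the quartiles use the "inclusive" method of the standard library (other quartile conventions give slightly different fences).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # assumption: population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: fifteen values between 10 and 13 plus one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]
print(zscore_outliers(data))  # [95]
print(iqr_outliers(data))     # [95]
```

Both rules flag the same point here. On small samples the z-score rule is the weaker of the two, because a single extreme value inflates the standard deviation it is measured against.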

You are working in an organization's marketing team. The organization aims to
conduct a market survey consisting of 50 questions, each with four possible answers. The
customers can choose only one answer to each question. (3 marks)
(a) How would you convert this data into a form suitable for complete association
analysis? (Hint: Complete association analysis contains positive as well as negative
association rules.)
(b) In particular, what type of attributes would you have, and how many of them are in
your dataset (write the number of instances in your dataset)?

Ans:

(a) To convert this data into a form suitable for complete association analysis, we would need to
transform the categorical responses into binary values. This can be done by creating a binary
variable for each possible answer to each question, where the variable is 1 if the customer selected
that answer and 0 otherwise. For example, if the question is "Which of the following products do
you prefer?" and the possible answers are A, B, C, and D, we would create four binary variables: A, B,
C, and D. If a customer selected answer A, then the A variable would be 1 and all other variables
would be 0.
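(b) The attributes in this dataset would be binary variables representing the possible answers to each
question. There would be 4 binary variables for each question, making a total of 200 binary variables
in the dataset (50 questions x 4 possible answers per question). Because only one answer can be chosen
per question, every row contains exactly 50 ones and 150 zeros, provided all questions are answered.
Each instance in the dataset would represent one customer's responses to all 50 questions, resulting in
1 row per customer and 200 columns of binary variables. The number of instances therefore equals the
number of customers surveyed, which the question does not specify.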
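
The conversion in (a) and the attribute count in (b) can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name, the option labels A to D, and the example customer are hypothetical.

```python
def binarize_responses(responses, options=("A", "B", "C", "D")):
    """Map one customer's answers to a flat 0/1 row,
    with one binary attribute per (question, option) pair."""
    row = []
    for answer in responses:
        if answer not in options:
            raise ValueError(f"unexpected answer: {answer!r}")
        # Exactly one of the four attributes for this question is set to 1.
        row.extend(1 if answer == option else 0 for option in options)
    return row

# Hypothetical customer whose answers cycle through A, B, C, D over 50 questions.
customer = ["ABCD"[i % 4] for i in range(50)]
row = binarize_responses(customer)
print(len(row), sum(row))  # 200 50
```

Each surveyed customer contributes one such row, so the dataset has 200 columns and as many rows as there are customers.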

Suppose set A includes all the items in the market basket. In that case, what will be
the confidence of the rules Ø → A and A → Ø?

Ans: The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the
number of transactions containing X, that is, conf(X → Y) = support(X ∪ Y) / support(X). The empty set is a
subset of every transaction, so support(Ø) = 100%, and X ∪ Ø = X for any itemset X.

The confidence of the rule "if A then Ø" (A → Ø) is support(A ∪ Ø) / support(A) = support(A) / support(A).
Any transaction that contains A also contains Ø, so the number of transactions containing both A and Ø is equal
to the number of transactions containing A. The confidence of A → Ø is therefore 100%, provided at least one
transaction contains A.

The confidence of the rule "if Ø then A" (Ø → A) is support(A ∪ Ø) / support(Ø) = support(A) / 100%. Every
transaction contains Ø, so the denominator is the total number of transactions and is never zero. As a result,
the confidence of Ø → A equals the support of A, that is, the fraction of all transactions that contain every
item in A.
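
Both confidences can be checked numerically with a minimal Python sketch, assuming the usual definitions support(X) = fraction of transactions containing X and conf(X → Y) = support(X ∪ Y) / support(X). The toy transactions and function names are hypothetical. Since the empty set is a subset of every transaction, its support is 1.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = support(X union Y) / support(X).
    Raises ZeroDivisionError if no transaction contains X."""
    x, y = frozenset(antecedent), frozenset(consequent)
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical baskets; A holds every item that can appear in a basket.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"bread", "milk", "eggs"},
)]
A = {"bread", "milk", "eggs"}
empty = set()

print(support(empty, transactions))        # 1.0
print(confidence(A, empty, transactions))  # 1.0
print(confidence(empty, A, transactions))  # 0.5, the support of A
```
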

To compute the Gini index for an attribute, we first need to calculate the proportion of instances in each class
(distinct value) of the attribute. The Gini index is then calculated as 1 minus the sum of the squared proportions.

(a) For the Customer ID attribute:

There are 20 instances in the table.

Each customer ID is unique, so there are 20 classes in the attribute.

Since each class has only one instance, the proportion of each class is 1/20.

The Gini index for the Customer ID attribute is calculated as follows:

Gini index = 1 - ((1/20)^2 + (1/20)^2 + ... + (1/20)^2)

= 1 - (20 * (1/400))

= 1 - 0.05

= 0.95

Therefore, the Gini index for the Customer ID attribute is 0.95.

(b) For the Gender attribute:

There are 10 instances of male and 10 instances of female in the table.

There are two classes in the attribute: male and female.

The proportion of males in the attribute is 10/20 = 0.5.

The proportion of females in the attribute is also 10/20 = 0.5.

The Gini index for the Gender attribute is calculated as follows:

Gini index = 1 - (0.5^2 + 0.5^2)

= 1 - 0.5

= 0.5

Therefore, the Gini index for the Gender attribute is 0.5.
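
The two calculations above can be reproduced with a short Python sketch. The table itself is not shown here, so the column contents below are assumptions that match only the counts stated in the answer (20 unique customer IDs, 10 male and 10 female); the function name is hypothetical.

```python
from collections import Counter

def gini_index(values):
    """1 minus the sum of squared proportions of each distinct value."""
    counts = Counter(values)
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Assumed columns, built only from the counts stated in the answer.
customer_ids = list(range(1, 21))      # 20 unique IDs
genders = ["M"] * 10 + ["F"] * 10      # 10 male, 10 female

print(round(gini_index(customer_ids), 2))  # 0.95
print(gini_index(genders))                 # 0.5
```
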

What is a market basket transaction? As a manager analyzing market baskets in the
retail domain, what is interesting for you from the following pairs:
1. A rule that has high support and high confidence.
2. A rule that has reasonably high support but low confidence.
3. A rule that has low support and low confidence.
4. A rule that has low support and high confidence.
Justify your answer.

A market basket transaction refers to a record of items purchased together by a customer in a single transaction.
For example, if a customer buys bread, milk, and eggs in a single transaction, that transaction is a market basket
transaction.

As a manager analyzing market baskets in the retail domain, the four pairs can be assessed as follows:

A rule that has high support and high confidence: This rule indicates that the item(s) on the left-hand side of the
rule (antecedent) are frequently purchased together with the item(s) on the right-hand side (consequent). This is
valuable information for managers as it allows them to identify popular item combinations and possibly bundle
or promote them together to increase sales.

A rule that has reasonably high support but low confidence: This rule indicates that the item(s) on the antecedent
are frequently purchased, but most of the time without the item(s) on the consequent. Managers can use this
information to identify potential cross-selling opportunities and suggest complementary products to customers.

A rule that has low support and low confidence: This rule indicates that the item(s) on the antecedent and
consequent are rarely purchased together, making it less valuable for managers to focus on this rule.

A rule that has low support and high confidence: This rule indicates that the item(s) on the antecedent are rarely
purchased, but when they are, they are often purchased with the item(s) on the consequent. This type of rule can
help managers identify niche or under-promoted products that could benefit from increased marketing efforts.
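
The four cases can be illustrated with a small Python sketch that computes support and confidence for a rule and labels it against two thresholds. The baskets, the function names, and the thresholds (0.15 for support, 0.7 for confidence) are all hypothetical; in practice the cut-offs depend on the retailer and the data.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x, y = frozenset(antecedent), frozenset(consequent)
    count_x = sum(x <= t for t in transactions)
    count_xy = sum((x | y) <= t for t in transactions)
    support = count_xy / len(transactions)
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence

def label_rule(support, confidence, min_support=0.15, min_confidence=0.7):
    """Label a rule as high/low on each measure (thresholds are assumptions)."""
    s = "high" if support >= min_support else "low"
    c = "high" if confidence >= min_confidence else "low"
    return f"{s} support, {c} confidence"

# Hypothetical baskets.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"caviar", "champagne"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
    {"eggs"},
)]

rules = [
    ({"bread"}, {"milk"}),        # popular combination
    ({"milk"}, {"eggs"}),         # common together, but milk is usually bought without eggs
    ({"eggs"}, {"bread"}),        # rare and unreliable
    ({"caviar"}, {"champagne"}),  # niche but reliable
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs), ":", label_rule(*rule_metrics(lhs, rhs, transactions)))
```
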
