
MID-TERM EXAM

Name: Phạm Mai Linh

Student ID: 20070516

Duration: 120 minutes

Question 1:

- What is the main difference between the K-Nearest Neighbors and K-Means Clustering algorithms?

- Define Entropy and Information Gain. In a decision tree algorithm, how do high or low
values of Entropy and Information Gain affect data clustering and attribute selection?

- Distinguish between the two types of models: classification and regression. Give 2 examples
of each model type.

- In the confusion matrix, what are True Positive, True Negative, False Positive, and False Negative?
Present the formulas for calculating accuracy and error based on the confusion
matrix.

Question 2:

Given the data set {1, 2, 4, 6, 2}:

a. Calculate the mean.

b. Calculate the median.

c. Calculate the mode.
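
As a quick check on the three statistics asked for above, here is a minimal Python sketch using the standard library's `statistics` module. The variable names (`data`, `mean_value`, and so on) are illustrative choices, not part of the exam.

```python
import statistics

# The data set from the question, in the order given.
data = [1, 2, 4, 6, 2]

mean_value = statistics.mean(data)      # sum of the values divided by their count
median_value = statistics.median(data)  # middle value once the data is sorted
mode_value = statistics.mode(data)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Sorting first (1, 2, 2, 4, 6) makes the median and mode easy to read off by hand as well.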

Question 3:

The chart below shows the number of late arrivals at work (vertical axis) for each employee
(horizontal axis) in a store. Find the median number of late arrivals.

Question 4:

The graph below shows the salaries of 8 CEOs in a corporation. Each point represents one CEO, with
the monthly salary given by a number on the horizontal axis (million VND). Find the quartiles Q1,
Q2 (the median), and Q3, as well as the minimum and maximum of this data.

Question 5:

The following data on 20 customers is used to build a decision tree model that groups bad debts by the
attribute "age". Calculate the Information Gain and R values corresponding to selecting the "age"
attribute for clustering (use the log2 function; first count the cases and calculate the p_i
ratios). Draw conclusions about the clustering attribute and the data order.
[Figure: decision tree with a parent node (entropy parent) split into two child nodes, B (entropy child 1) and C (entropy child 2)]
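
The entropy and Information Gain calculation described in this question can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names `entropy` and `information_gain` are illustrative, and the class counts are HYPOTHETICAL placeholders, since the customer table itself is not available here. The "R" value mentioned in the question is not defined in the text, so it is not computed.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a node, given the class counts in it."""
    total = sum(counts)
    # Skip empty classes: the term p * log2(p) is taken as 0 when p = 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# HYPOTHETICAL counts in the form [bad debt, good debt]; NOT the exam's data.
parent = [10, 10]            # 20 customers before the split
children = [[8, 2], [2, 8]]  # the two child nodes after splitting on "age"

print(round(entropy(parent), 2))
print(round(information_gain(parent, children), 2))
```

Counting the cases in each node first, as the question suggests, gives exactly the `counts` lists these functions expect.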

Question 6:

Calculate the accuracy and error of the model based on the confusion matrix:

                       Actual
                       Positive    Negative
Predicted   Positive   TP = 10     FP = 2
            Negative   FN = 3      TN = 5
$$\text{Error of the model} = \frac{FN + FP}{TN + TP + FN + FP} \times 100\%$$
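
The arithmetic for this question can be sketched in Python as below. The error formula is the one given above; the accuracy formula, (TP + TN) divided by the total, is the standard definition and is an addition here, since the sheet does not print it. Variable names are illustrative.

```python
# Counts taken from the confusion matrix in the question.
tp, fp, fn, tn = 10, 2, 3, 5

total = tp + fp + fn + tn

# Accuracy: share of correct predictions, in percent (standard definition).
accuracy = (tp + tn) / total * 100
# Error: share of wrong predictions, in percent (formula from the question).
error = (fp + fn) / total * 100

print(f"Accuracy = {accuracy:.2f}%, Error = {error:.2f}%")
```

Because every prediction is either correct or wrong, the two percentages add up to 100%.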

Question 7:

Find the simple linear regression equation (y = a + bx) between the dependent variable y and the
independent variable x, using the data in the table:

xi    yi
1     3
3     5
5     11
7     14
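
One common way to obtain a and b is ordinary least squares, where b = S_xy / S_xx and a = mean(y) - b * mean(x). The sketch below assumes that this is the intended method; the variable names are illustrative.

```python
# Data from the table in the question.
xs = [1, 3, 5, 7]
ys = [3, 5, 11, 14]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sum of cross-deviations and sum of squared deviations of x.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

b = s_xy / s_xx          # slope
a = y_mean - b * x_mean  # intercept

print(f"y = {a:.2f} + {b:.2f}x")
```

Writing out `s_xy` and `s_xx` separately mirrors the intermediate sums usually shown in a hand calculation.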

Question 8:

No.   ($)             (Ordinal)   (Categorical)   (Binary)
      (Ratio-scale)
1     300             Music       Businessman     Visa Card
2     162             Music       Businessman     Master Card
3     180             Travel      Engineer        ATM Card
4     217             Music       Lecturer        ATM Card
5     181             Sporting    Engineer        Master Card
6     194             Travel      Businessman     Visa Card
7     256             Music       Lecturer        Master Card
8     270             Sporting    Businessman     Visa Card

Requirements:

Apply the K-means algorithm using the variable types given in the table above.
With K = 2, initially choose records 1 and 4 as the centers. Indicate which records
each cluster contains after the 4th iteration.

Note:

- For the ratio-scale variable, use the base-10 logarithm.
- Intermediate steps must be shown clearly.
- Round to 2 decimal places. Rounding rule: if the last decimal digit is >= 5, the
preceding digit is rounded up by 1. For example, 0.065 is rounded to 0.07;
0.064 is rounded to 0.06.
