Asc399 Feb23
INSTRUCTIONS TO CANDIDATES
2. Answer ALL questions in the Answer Booklet. Start each answer on a new page.
3. Do not bring any material into the examination room unless permission is given by the invigilator.
4. Please check to make sure that this examination pack consists of:
i. the Question Paper
ii. an Answer Booklet - provided by the Faculty
QUESTION 1
ii) Based on i), determine the variable that should be the first splitting attribute at
the root node and explain the reason.
(2 marks)
iii) Draw the decision tree that can be obtained from ii).
(3 marks)
b) Two splitting alternatives using GINI as the splitting criterion produce the purity values as
shown below.
ii) Information Gain is the basic criterion to decide whether a feature should be used to
split a node or not. It is also used to measure the reduction in entropy by splitting a
dataset according to a given value of a random variable.
Derive and explain the formula for Information Gain.
(3 marks)
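As a minimal sketch of the quantity described in ii), Information Gain can be computed as the entropy of the parent node minus the size-weighted entropy of the child nodes. The function names and the class counts below are hypothetical and are not taken from this paper.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical counts (not from the paper): 10 cases per class in the parent,
# split into two child nodes.
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(round(information_gain(parent, children), 4))
```

A larger gain means the split removes more uncertainty about the target.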
Based on the table above, which split can be classified as the best split? Justify your
answer.
(3 marks)
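The comparison of two splits under the GINI criterion could be sketched as follows. The table of purity values is not reproduced here, so the class counts for the two splits are hypothetical, and the sketch assumes the criterion is the size-weighted Gini impurity of the child nodes (lower is purer).

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children_counts):
    """Size-weighted Gini impurity over the child nodes of a split."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

# Hypothetical class counts per child node (not from the paper).
split_a = [[8, 2], [2, 8]]
split_b = [[6, 4], [4, 6]]

for name, split in (("A", split_a), ("B", split_b)):
    print(name, round(weighted_gini(split), 4))
```

Under this assumption, the split with the smaller weighted impurity would be preferred.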
QUESTION 2
a) The following is a two-way table on the distribution of smoking patients and the risk of
Heart Attack. Interpret the value of OR(Y=1)smoke if it is known that the odds ratio value is
3.
                      Heart Attack
Smoking Status    Y=0 (Not at risk)    Y=1 (At risk)
Non-Smokers       55                   70
Smokers           45                   30
(2 marks)
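For reference, the mechanics of an odds ratio from a two-way table can be sketched as below. The function name and argument order are hypothetical. The sketch only shows how the ratio is formed from cell counts; the question itself stipulates the value to interpret, and the sketch does not attempt to reproduce that value.

```python
def odds_ratio(exposed_event, exposed_no_event, unexposed_event, unexposed_no_event):
    """Odds of the event in the exposed group divided by the odds in the unexposed group."""
    odds_exposed = exposed_event / exposed_no_event
    odds_unexposed = unexposed_event / unexposed_no_event
    return odds_exposed / odds_unexposed

# Cell counts read from the table: smokers are treated as the exposed group
# and Y=1 (at risk) as the event.
print(round(odds_ratio(30, 45, 70, 55), 4))
```

An odds ratio above 1 means the odds of Y=1 are higher in the exposed group; a value below 1 means they are lower.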
b) In SAS EMiner, the variable selection process can be done using a variable selection node
with two procedures related to correlation analysis and association analysis. Between
these two procedures:
Fault = 1 if the accident was the policyholder's fault, 0 otherwise;
The estimated model for the logarithm of the odds of fraud, Z, is:
Z = 53.119 - 0.081 City + 0.367 Gender + 0.006 Age - 1.738 Fault - 0.142 Deductible
i) Explain the odds ratio for fraud in an accident where the policyholder was at fault,
assuming all other inputs are constant.
(3 marks)
ii) Explain the odds ratio for fraud in an accident where the policyholder is a male,
assuming all other inputs are constant.
(3 marks)
iii) Determine whether the odds ratio for fraud increases or decreases with age and explain
the value.
(4 marks)
iv) Find the probability of fraud in a claim by a male policyholder aged 30 years, who lives
in a major city, has a deductible of RM400 and who was not at fault in the accident.
Interpret the value.
(5 marks)
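The calculations behind parts i) to iv) can be sketched as follows. The coefficients come from the model above. The input coding is an assumption, because only the definition of Fault is given here: the sketch assumes City = 1 for a major city, Gender = 1 for male, Age in years, and Deductible entered in RM. The function names are hypothetical.

```python
from math import exp

# Coefficients taken from the estimated model for Z (log-odds of fraud).
INTERCEPT = 53.119
COEF = {"City": -0.081, "Gender": 0.367, "Age": 0.006,
        "Fault": -1.738, "Deductible": -0.142}

def coefficient_odds_ratio(name):
    """Odds ratio for a one-unit increase in an input, others held constant."""
    return exp(COEF[name])

def log_odds(inputs):
    """Z: the intercept plus the sum of coefficient times input value."""
    return INTERCEPT + sum(COEF[k] * inputs[k] for k in COEF)

def probability(inputs):
    """Probability of fraud from the logistic transform of Z."""
    return 1.0 / (1.0 + exp(-log_odds(inputs)))

# ASSUMED coding (not stated in the paper): City=1 major city, Gender=1 male,
# Deductible in RM.
claim = {"City": 1, "Gender": 1, "Age": 30, "Fault": 0, "Deductible": 400}

for name in ("Fault", "Gender", "Age"):
    print(name, round(coefficient_odds_ratio(name), 4))
print("Z =", round(log_odds(claim), 3))
print("P(fraud) =", round(probability(claim), 4))
```

If the actual coding differs (for example, Gender = 1 for female), the input values would need to be changed accordingly.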
QUESTION 3
a) Neural networks have the ability to learn by example by taking specific inputs and turning
them into a specific output. Explain how a neural network model can be kept up to date.
(4 marks)
b) A typical neural network model has an input layer, a second layer known as the hidden
layer and an output layer. In general, what is the minimum number of hidden layers
sufficient for the model? State ONE (1) advantage and ONE (1) disadvantage of having a
wider layer.
(3 marks)
c) Explain ONE (1) general use of Artificial Neural Networks in data analytics and give TWO
(2) of their practical commercial applications.
(3 marks)
QUESTION 4
a) Based on the following fit statistics, specify which model is prone to overfitting.
Show your workings.
Model          Train: Average   Valid: Average   Train:      Valid:      Train:      Valid:
Description    Squared Error    Squared Error    Mis. Rate   Mis. Rate   ROC Index   ROC Index
StepWLog       0.1343           0.1360           0.1965      0.1908      0.8480      0.8440
ChaidDT        0.1306           0.1358           0.1884      0.1916      0.8560      0.8420
(4 marks)
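One way to set out the workings is to compare each training statistic with its validation counterpart. The sketch below uses the values from the fit statistics table. The dictionary layout and the function name are hypothetical, and treating a larger train-to-validation gap as a sign of overfitting is a common rule of thumb rather than a formal test.

```python
# (train, valid) pairs taken from the fit statistics table.
fit = {
    "StepWLog": {"ase": (0.1343, 0.1360), "mis": (0.1965, 0.1908), "roc": (0.8480, 0.8440)},
    "ChaidDT":  {"ase": (0.1306, 0.1358), "mis": (0.1884, 0.1916), "roc": (0.8560, 0.8420)},
}

def gaps(stats):
    """Train-to-validation gaps; a positive value means validation is worse."""
    ase_train, ase_valid = stats["ase"]
    mis_train, mis_valid = stats["mis"]
    roc_train, roc_valid = stats["roc"]
    return {
        "ase": ase_valid - ase_train,   # error rises on validation
        "mis": mis_valid - mis_train,   # misclassification rises on validation
        "roc": roc_train - roc_valid,   # ROC index falls on validation
    }

for model, stats in fit.items():
    print(model, {k: round(v, 4) for k, v in gaps(stats).items()})
```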
b) Given the following confusion matrix for Model X, calculate the values of sensitivity,
specificity and misclassification rate.
(3 marks)
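The three measures can be sketched from the four cells of a confusion matrix as shown below. The confusion matrix for Model X is not reproduced here, so the counts are hypothetical, as is the function name; Y = 1 is assumed to be the positive class.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and misclassification rate from the four cells."""
    sensitivity = tp / (tp + fn)           # actual Y=1 cases predicted as Y=1
    specificity = tn / (tn + fp)           # actual Y=0 cases predicted as Y=0
    misclassification = (fp + fn) / (tp + fn + fp + tn)
    return sensitivity, specificity, misclassification

# Hypothetical counts (not the Model X matrix).
sens, spec, mis = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(round(sens, 4), round(spec, 4), round(mis, 4))
```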
c) The sensitivity and specificity values are sensitive to imbalanced samples of Y = 1
versus Y = 0 cases. Explain this situation and state one approach to overcome this
problem.
(3 marks)