Asc399 Feb23

CONFIDENTIAL CD/FEB 2023/ASC399

UNIVERSITI TEKNOLOGI MARA


FINAL EXAMINATION

COURSE        : INSURANCE DATA ANALYTICS
COURSE CODE   : ASC399
EXAMINATION   : FEBRUARY 2023
TIME          : 2 HOURS

INSTRUCTIONS TO CANDIDATES

1. This question paper consists of four (4) questions.

2. Answer ALL questions in the Answer Booklet. Start each answer on a new page.

3. Do not bring any material into the examination room unless permission is given by the invigilator.

4. Please check to make sure that this examination pack consists of:
i. the Question Paper
ii. an Answer Booklet - provided by the Faculty

5. Answer ALL questions in English.

DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO DO SO


This examination paper consists of 6 printed pages

© Hak Cipta Universiti Teknologi MARA CONFIDENTIAL



QUESTION 1

A campaign manager is interested in constructing a customer profile to predict whether a
voter will vote for their preferred party based on the voter's profile. The dataset has
three categorical attributes: Age (Youth, Middle Age and Senior), Location (Rural and
Urban), and Candidate (Charisma and Popular). Based on the data for these 16 voters,
answer the following questions.

ID   Age          Location   Candidate   Vote
1    Youth        Urban      Popular     Yes
2    Middle Age   Urban      Charisma    Yes
3    Senior       Rural      Popular     No
4    Senior       Urban      Popular     Yes
5    Youth        Urban      Charisma    No
6    Middle Age   Rural      Charisma    No
7    Middle Age   Rural      Popular     No
8    Youth        Urban      Charisma    Yes
9    Senior       Rural      Charisma    Yes
10   Senior       Urban      Charisma    No
11   Senior       Rural      Popular     Yes
12   Youth        Urban      Popular     Yes
13   Youth        Urban      Charisma    No
14   Middle Age   Urban      Popular     No
15   Youth        Rural      Popular     Yes
16   Middle Age   Rural      Charisma    Yes

a) The entropy value for Age is given as 0.9512.

i) Calculate the entropy value for Location and Candidate.


(6 marks)

ii) Based on i), determine the variable that should become the first splitting attribute
(the root node) and explain the reason.
(2 marks)

iii) Draw the decision tree that can be obtained from ii).
(3 marks)
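
The entropy figures in part a) can be checked with a short script. The following is a minimal sketch, assuming that "entropy value for an attribute" means the weighted average entropy of Vote across that attribute's categories (this reading reproduces the 0.9512 given for Age). The names `VOTERS`, `COLUMNS`, `entropy` and `split_entropy` are illustrative and do not come from the question paper.

```python
from collections import Counter, defaultdict
from math import log2

# Voter records transcribed from the Question 1 table:
# (Age, Location, Candidate, Vote), in ID order 1..16.
VOTERS = [
    ("Youth", "Urban", "Popular", "Yes"),
    ("Middle Age", "Urban", "Charisma", "Yes"),
    ("Senior", "Rural", "Popular", "No"),
    ("Senior", "Urban", "Popular", "Yes"),
    ("Youth", "Urban", "Charisma", "No"),
    ("Middle Age", "Rural", "Charisma", "No"),
    ("Middle Age", "Rural", "Popular", "No"),
    ("Youth", "Urban", "Charisma", "Yes"),
    ("Senior", "Rural", "Charisma", "Yes"),
    ("Senior", "Urban", "Charisma", "No"),
    ("Senior", "Rural", "Popular", "Yes"),
    ("Youth", "Urban", "Popular", "Yes"),
    ("Youth", "Urban", "Charisma", "No"),
    ("Middle Age", "Urban", "Popular", "No"),
    ("Youth", "Rural", "Popular", "Yes"),
    ("Middle Age", "Rural", "Charisma", "Yes"),
]
COLUMNS = {"Age": 0, "Location": 1, "Candidate": 2}


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_entropy(rows, col):
    """Weighted average entropy of Vote after splitting on column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])  # last field is the Vote label
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())


for name, col in COLUMNS.items():
    print(f"{name}: {split_entropy(VOTERS, col):.4f}")
```

Under this reading, the attribute with the lowest weighted entropy gives the purest child nodes and is the natural candidate for the root split.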


b) Two splitting alternatives using GINI as the splitting criterion produce the purity values as
shown below.

Split Alternative   Purity Value
Variable A          0.8424
Variable B          0.1485

i) Which is considered the best split? State your reason.
(3 marks)

ii) Information Gain is the basic criterion to decide whether a feature should be used to
split a node or not. It is also used to measure the reduction in entropy by splitting a
dataset according to a given value of a random variable.
Derive and explain the formula for Information Gain.

(3 marks)

iii) Given information as shown below:


                    Split 1       Split 2
Chi-square          12.4562389    27.9857342
df = (r-1)(c-1)     1             1
p-value             0.000056      0.000049
logworth            5             10

Based on the table above, which split can be classified as the best split? Justify your
answer.

(3 marks)
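
For reference on part b) ii), information gain is commonly written as the parent node's entropy minus the weighted entropy of the child nodes. The notation below is an assumption for illustration and not taken from the question paper: S is the set of cases at the node, A is the candidate splitting attribute, S_v is the subset of S where A takes value v, and p_k is the proportion of class k.

```latex
\[
H(S) = -\sum_{k} p_k \log_2 p_k ,
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]
```

A larger IG(S, A) means the split on A removes more uncertainty about the target.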

QUESTION 2

a) The following is a two-way table on the distribution of smoking patients and the risk of
Heart Attack. Interpret the value of OR(Y=1)smoke if it is known that the odds ratio value is
3.

                    Heart Attack
Smoking Status      Y=0 (Not at risk)   Y=1 (At risk)
Non-Smokers         55                  70
Smokers             45                  30
(2 marks)


b) In SAS EMiner, the variable selection process can be done using a variable selection node
with two procedures related to correlation analysis and association analysis. Between
these two procedures:

• state which analysis is appropriate when the target is binary.
• specify the test statistic involved in the selected analysis.
• explain how an input is selected using the chosen analysis.
(3 marks)

c) An insurance company has examined a random sample of 1,900 automobile accident
claims. A logistic regression model is fitted to this data, with the target coded as 1
for a fraudulent case and 0 otherwise. The five inputs included in the model
are:

City = 1 if the claimant lived in a large city, 0 otherwise;

Gender = 1 for male, 0 for female;

Age = age in years;

Fault = 1 if the accident was the policyholder's fault, 0 otherwise;

Deductible = Deductible amount (in RM)

The estimated model for the logarithm of the odds of fraud, Z, is:
Z = 53.119 - 0.081 City + 0.367 Gender + 0.006 Age - 1.738 Fault - 0.142 Deductible

i) Explain the odds ratio for fraud in an accident where the policyholder was at fault,
assuming all other inputs are constant.
(3 marks)

ii) Explain the odds ratio for fraud in an accident where the policyholder is a male,
assuming all other inputs are constant.
(3 marks)

iii) Determine whether the odds ratio for fraud increases or decreases with age and explain
the value.
(4 marks)

iv) Find the probability of fraud in a claim by a male policyholder aged 30 years, who lives
in a major city, has a deductible of RM400 and who was not at fault in the accident.
Interpret the value.
(5 marks)
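
The quantities asked for in part c) follow from the fitted equation for Z. The following is a minimal sketch, assuming the standard logistic relationships: the odds ratio for a one-unit increase in an input is exp(coefficient), and the probability is 1 / (1 + exp(-Z)). The names `INTERCEPT`, `COEFS`, `log_odds`, `probability` and `claim` are illustrative; the coefficient values are copied from the equation above.

```python
from math import exp

# Coefficients copied from the fitted equation for Z in Question 2 c).
INTERCEPT = 53.119
COEFS = {
    "City": -0.081,
    "Gender": 0.367,
    "Age": 0.006,
    "Fault": -1.738,
    "Deductible": -0.142,
}


def log_odds(inputs):
    """Z = intercept + sum of coefficient * input value."""
    return INTERCEPT + sum(COEFS[name] * value for name, value in inputs.items())


def probability(z):
    """Convert log-odds Z to a probability via the logistic function."""
    return 1.0 / (1.0 + exp(-z))


# Odds ratio per one-unit increase in each input, other inputs held constant.
odds_ratios = {name: exp(b) for name, b in COEFS.items()}

# Part iv): male, aged 30, lives in a large city, RM400 deductible, not at fault.
claim = {"City": 1, "Gender": 1, "Age": 30, "Fault": 0, "Deductible": 400}
z = log_odds(claim)
p = probability(z)

for name, value in odds_ratios.items():
    print(f"odds ratio for {name}: {value:.4f}")
print(f"Z = {z:.3f}, P(fraud) = {p:.4f}")
```

An odds ratio below 1 means the odds of fraud fall as that input increases, and an odds ratio above 1 means they rise.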


QUESTION 3

a) Neural networks have the ability to learn by example by taking specific inputs and turning
them into a specific output. Explain how a neural network model can be kept up to date.
(4 marks)

b) A typical neural network model has an input layer, a second layer known as the hidden
layer and an output layer. In general, what is the minimum number of hidden layers
sufficient for the model? State ONE (1) advantage and ONE (1) disadvantage of having a
wider layer.
(3 marks)

c) Explain ONE (1) general use of Artificial Neural Network in data analytics and give TWO
(2) of its commercial practical applications.

(3 marks)

QUESTION 4

a) Based on the following fit statistics, specify which model is more prone to
overfitting. Show your workings.

Model         Train: Average   Valid: Average   Train:      Valid:      Train:      Valid:
Description   Squared Error    Squared Error    Mis. Rate   Mis. Rate   ROC Index   ROC Index
StepWLog      0.1343           0.1360           0.1965      0.1908      0.8480      0.8440
ChaidDT       0.1306           0.1358           0.1884      0.1916      0.8560      0.8420

(4 marks)
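
One common way to show the workings is to compare each statistic between the training and validation sets. The following is a minimal sketch, assuming that overfitting is judged by how much each statistic worsens from training to validation; the `FIT` dictionary and the `degradation` helper are illustrative names, and the values are copied from the fit statistics table above.

```python
# Fit statistics from the table: statistic -> (train, valid).
FIT = {
    "StepWLog": {
        "Average Squared Error": (0.1343, 0.1360),
        "Mis. Rate": (0.1965, 0.1908),
        "ROC Index": (0.8480, 0.8440),
    },
    "ChaidDT": {
        "Average Squared Error": (0.1306, 0.1358),
        "Mis. Rate": (0.1884, 0.1916),
        "ROC Index": (0.8560, 0.8420),
    },
}


def degradation(stat, train, valid):
    """How much worse the validation value is than the training value.

    For error-type statistics, worse means higher on validation; for the
    ROC index, worse means lower on validation. Positive = performance drop.
    """
    return (train - valid) if stat == "ROC Index" else (valid - train)


gaps = {
    model: {stat: degradation(stat, *pair) for stat, pair in stats.items()}
    for model, stats in FIT.items()
}

for model, stats in gaps.items():
    for stat, gap in stats.items():
        print(f"{model:9s} {stat:22s} degradation = {gap:+.4f}")
```

The model whose statistics worsen more between training and validation is the one more likely to be overfitting.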

b) Given the following confusion matrix for Model X, calculate the values of sensitivity,
specificity and misclassification rate.

          Predicted Accepted   Predicted Rejected
Accept    1000                 500
Reject    2500                 6000

(3 marks)
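
The three measures in part b) can be computed directly from the four cell counts. The following is a minimal sketch under two assumptions that the question paper does not state: the rows are the actual outcomes, and "Accept" is treated as the positive class (Y = 1). The variable names are illustrative.

```python
# Cell counts from the Model X confusion matrix.
# Assumption: rows are actual outcomes and "Accept" is the positive class.
tp, fn = 1000, 500    # actual Accept: predicted Accepted, predicted Rejected
fp, tn = 2500, 6000   # actual Reject: predicted Accepted, predicted Rejected

total = tp + fn + fp + tn

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
misclassification_rate = (fp + fn) / total   # share of all cases predicted wrongly

print(f"sensitivity            = {sensitivity:.4f}")
print(f"specificity            = {specificity:.4f}")
print(f"misclassification rate = {misclassification_rate:.4f}")
```

If "Reject" were taken as the positive class instead, the sensitivity and specificity values would swap while the misclassification rate would stay the same.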


c) The sensitivity and specificity values are sensitive to imbalanced samples of Y = 1
versus Y = 0 cases. Explain this situation and state one approach to overcome this
problem.
(3 marks)

END OF QUESTION PAPER
