Asc399 Feb23

CONFIDENTIAL CD/FEB 2023/ASC399
UNIVERSITI TEKNOLOGI MARA

FINAL EXAMINATION
COURSE INSURANCE DATA ANALYTICS

COURSE CODE ASC399
EXAMINATION FEBRUARY 2023
TIME 2 HOURS
INSTRUCTIONS TO CANDIDATES
1. This question paper consists of four (4) questions.
2. Answer ALL questions in the Answer Booklet. Start each answer on a new page.
3. Do not bring any material into the examination room unless permission is given by the invigilator.
4. Please check to make sure that this examination pack consists of:
i. the Question Paper
ii. an Answer Booklet - provided by the Faculty
5. Answer ALL questions in English.
DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO DO SO

This examination paper consists of 6 printed pages
© Hak Cipta Universiti Teknologi MARA CONFIDENTIAL

QUESTION 1
A campaign manager is interested in constructing a customer profile to predict whether a
voter will vote for their preferred party based on the voter's profile. The dataset has
three categorical attributes: Age (Youth, Middle Age and Senior), Location
(Rural and Urban), and Candidate (Charisma and Popular). Based on the data for these 16
voters, answer the following questions.
ID   Age          Location   Candidate   Vote
1    Youth        Urban      Popular     Yes
2    Middle Age   Urban      Charisma    Yes
3    Senior       Rural      Popular     No
4    Senior       Urban      Popular     Yes
5    Youth        Urban      Charisma    No
6    Middle Age   Rural      Charisma    No
7    Middle Age   Rural      Popular     No
8    Youth        Urban      Charisma    Yes
9    Senior       Rural      Charisma    Yes
10   Senior       Urban      Charisma    No
11   Senior       Rural      Popular     Yes
12   Youth        Urban      Popular     Yes
13   Youth        Urban      Charisma    No
14   Middle Age   Urban      Popular     No
15   Youth        Rural      Popular     Yes
16   Middle Age   Rural      Charisma    Yes
a) The entropy value for Age is given as 0.9512.
i) Calculate the entropy value for Location and Candidate.
(6 marks)
ii) Based on i), determine the variable that should become the first splitting attribute at
the root node and explain the reason.
(2 marks)
iii) Draw the decision tree that can be obtained from ii).
(3 marks)
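
The entropy value attached to an attribute in part a) can be read as the weighted average of the Vote entropy within each level of that attribute. The sketch below is one way to compute it, assuming base-2 logarithms and the 16 rows transcribed from the table; the names `VOTERS`, `COLUMNS`, `entropy` and `split_entropy` are illustrative only. Under these assumptions the Age value reproduces the 0.9512 given in the question.

```python
from collections import Counter, defaultdict
from math import log2

# Rows transcribed from the voter table: (Age, Location, Candidate, Vote).
VOTERS = [
    ("Youth", "Urban", "Popular", "Yes"),
    ("Middle Age", "Urban", "Charisma", "Yes"),
    ("Senior", "Rural", "Popular", "No"),
    ("Senior", "Urban", "Popular", "Yes"),
    ("Youth", "Urban", "Charisma", "No"),
    ("Middle Age", "Rural", "Charisma", "No"),
    ("Middle Age", "Rural", "Popular", "No"),
    ("Youth", "Urban", "Charisma", "Yes"),
    ("Senior", "Rural", "Charisma", "Yes"),
    ("Senior", "Urban", "Charisma", "No"),
    ("Senior", "Rural", "Popular", "Yes"),
    ("Youth", "Urban", "Popular", "Yes"),
    ("Youth", "Urban", "Charisma", "No"),
    ("Middle Age", "Urban", "Popular", "No"),
    ("Youth", "Rural", "Popular", "Yes"),
    ("Middle Age", "Rural", "Charisma", "Yes"),
]

# Position of each input attribute inside a row (Vote is at index 3).
COLUMNS = {"Age": 0, "Location": 1, "Candidate": 2}


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_entropy(rows, attribute):
    """Weighted average entropy of Vote after splitting on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMNS[attribute]]].append(row[3])
    n = len(rows)
    return sum(len(votes) / n * entropy(votes) for votes in groups.values())


if __name__ == "__main__":
    for name in COLUMNS:
        print(f"{name}: {split_entropy(VOTERS, name):.4f}")
```

The attribute with the lowest weighted entropy leaves the purest branches, which is the usual argument for placing it at the root.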

b) Two splitting alternatives using Gini as the splitting criterion produce the purity values
shown below.
Split Alternative   Purity Value
Variable A          0.8424
Variable B          0.1485
i) Which is considered the best split? State your reason.
(3 marks)
ii) Information Gain is the basic criterion to decide whether a feature should be used to
split a node or not. It is also used to measure the reduction in entropy by splitting a
dataset according to a given value of a random variable.
Derive and explain the formula for Information Gain.
(3 marks)
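
One common statement of Information Gain is the entropy of the parent node minus the weighted average entropy of the child nodes, Gain = H(parent) - sum over k of (n_k / n) H(child_k). The sketch below illustrates that definition with class counts; the function names and the example counts are hypothetical and do not come from the question paper.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted


if __name__ == "__main__":
    # Hypothetical node: 10 "Yes" and 10 "No" cases split into two branches.
    print(information_gain([10, 10], [[8, 2], [2, 8]]))
```

A gain of zero means the split leaves the class mix unchanged, while a gain equal to the parent entropy means every branch is pure.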
iii) Given the information shown below:
                    Split 1      Split 2
Chi-square          12.4562389   27.9857342
df = (r-1)(c-1)     1            1
p-value             0.000056     0.000049
logworth            5            10
Based on the table above, which split can be classified as the best split? Justify your
answer.
(3 marks)
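
As background for part iii), logworth is conventionally defined as the negative base-10 logarithm of the p-value, so a smaller p-value gives a larger logworth. The sketch below only illustrates that definition with hypothetical p-values; it does not recompute the figures in the table, which are taken as given.

```python
from math import log10


def logworth(p_value):
    """Return -log10(p); larger values indicate stronger evidence for a split."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -log10(p_value)


if __name__ == "__main__":
    # Hypothetical p-values, not the ones in the table above.
    for p in (0.05, 0.001):
        print(p, round(logworth(p), 2))
```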
QUESTION 2
a) The following two-way table shows the distribution of patients by smoking status and risk of
heart attack. Interpret the value of OR(Y=1)_smoke if it is known that the odds ratio value is
3.
                  Heart Attack
Smoking Status    Y=0 (Not at risk)   Y=1 (At risk)
Non-Smokers       55                  70
Smokers           45                  30
(2 marks)

b) In SAS EMiner, the variable selection process can be done using a variable selection node
with two procedures related to correlation analysis and association analysis. Between
these two procedures:
• state which analysis is appropriate when the target is binary.
• specify the test statistic involved in the selected analysis.
• explain how an input is selected using the chosen analysis.
(3 marks)
c) An insurance company has examined a random sample of 1,900 automobile accident
claims. A logistic regression model is fitted to this data with the target coded as 1
for a case that was fraudulent, and 0 otherwise. The five inputs included in the model
are:
City = 1 if the claimant lived in a large city, 0 otherwise;
Gender = 1 for male, 0 for female;
Age = age in years;
Fault = 1 if the accident was the policyholder's fault, 0 otherwise;
Deductible = deductible amount (in RM).
The model estimate for the logarithm of the odds of fraud, Z, is:
Z = 53.119 - 0.081 City + 0.367 Gender + 0.006 Age - 1.738 Fault - 0.142 Deductible
i) Explain the odds ratio for fraud in an accident where the policyholder was at fault,
assuming all other inputs are constant.
(3 marks)
ii) Explain the odds ratio for fraud in an accident where the policyholder is a male,
assuming all other inputs are constant.
(3 marks)
iii) Determine whether the odds ratio for fraud increases or decreases with age and explain
the value.
(4 marks)
iv) Find the probability of fraud in a claim by a male policyholder aged 30 years, who lives
in a major city, has a deductible of RM400 and who was not at fault in the accident.
Interpret the value.
(5 marks)
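
For the fitted model in part c), the odds ratio for an input is the exponential of its coefficient, and the probability of fraud follows from the logistic transform p = 1 / (1 + e^(-Z)). The sketch below applies both to the coefficients stated in the question; the names `INTERCEPT`, `COEFFICIENTS`, `odds_ratio` and `fraud_probability` are illustrative, and treating "major city" as City = 1 is an assumption.

```python
from math import exp

# Coefficients copied from the fitted equation for Z (log-odds of fraud).
INTERCEPT = 53.119
COEFFICIENTS = {
    "City": -0.081,
    "Gender": 0.367,
    "Age": 0.006,
    "Fault": -1.738,
    "Deductible": -0.142,
}


def odds_ratio(name):
    """Multiplicative change in the odds of fraud per one-unit increase."""
    return exp(COEFFICIENTS[name])


def fraud_probability(inputs):
    """Logistic transform of Z; inputs left out of the dict are treated as 0."""
    z = INTERCEPT + sum(COEFFICIENTS[k] * v for k, v in inputs.items())
    return 1.0 / (1.0 + exp(-z))


if __name__ == "__main__":
    for name in ("Fault", "Gender", "Age"):
        print(f"OR({name}) = {odds_ratio(name):.4f}")
    # Male, aged 30, major city (assumed City = 1), RM400 deductible, not at fault.
    claim = {"City": 1, "Gender": 1, "Age": 30, "Fault": 0, "Deductible": 400}
    print(f"P(fraud) = {fraud_probability(claim):.4f}")
```

An odds ratio below 1 means the odds of fraud fall as the input increases, and an odds ratio above 1 means they rise.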

QUESTION 3
a) Neural networks have the ability to learn by example by taking specific inputs and turning
them into a specific output. Explain how a neural network model can be kept up to date.
(4 marks)
b) A typical neural network model has an input layer, a second layer known as the hidden
layer and an output layer. In general, what is the minimum number of hidden layers
sufficient for the model? State ONE (1) advantage and ONE (1) disadvantage of having a
wider layer.
(3 marks)
c) Explain ONE (1) general use of Artificial Neural Network in data analytics and give TWO
(2) of its commercial practical applications.
(3 marks)
QUESTION 4
a) Based on the following fit statistics, specify which model is more prone to an overfitting
problem. Show your workings.
Model         Train: Average   Valid: Average   Train:      Valid:      Train:      Valid:
Description   Squared Error    Squared Error    Mis. Rate   Mis. Rate   ROC Index   ROC Index
StepWLog      0.1343           0.1360           0.1965      0.1908      0.8480      0.8440
ChaidDT       0.1306           0.1358           0.1884      0.1916      0.8560      0.8420
(4 marks)
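
One way to set out the workings for part a) is to compare each statistic between the training and validation data, since a model that overfits tends to perform noticeably worse on validation data. The sketch below assumes the six figures per model appear in the order average squared error, misclassification rate, ROC index, each as a train and validation pair; the names `FIT` and `degradation` are illustrative.

```python
# Fit statistics transcribed from the table as (train, valid) pairs.
# Assumed column order: average squared error, misclassification rate, ROC index.
FIT = {
    "StepWLog": {
        "ase": (0.1343, 0.1360),
        "mis": (0.1965, 0.1908),
        "roc": (0.8480, 0.8440),
    },
    "ChaidDT": {
        "ase": (0.1306, 0.1358),
        "mis": (0.1884, 0.1916),
        "roc": (0.8560, 0.8420),
    },
}


def degradation(stats):
    """Positive values mean the model does worse on validation than training."""
    ase_train, ase_valid = stats["ase"]
    mis_train, mis_valid = stats["mis"]
    roc_train, roc_valid = stats["roc"]
    return {
        "ase": ase_valid - ase_train,  # error measures: higher is worse
        "mis": mis_valid - mis_train,
        "roc": roc_train - roc_valid,  # ROC index: lower is worse
    }


if __name__ == "__main__":
    for model, stats in FIT.items():
        gaps = degradation(stats)
        print(model, {k: round(v, 4) for k, v in gaps.items()})
```

The model with the larger gaps across the three measures is the one whose training performance generalises less well.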
b) Given the following confusion matrix for Model X, calculate the values of sensitivity,
specificity and misclassification rate.
          Predicted Accepted   Predicted Rejected
Accept    1000                 500
Reject    2500                 6000
(3 marks)
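
The three measures in part b) follow directly from the four cells of the confusion matrix. The sketch below assumes that the rows are the actual classes and that "Accept" is the positive (event) class, which the question does not state explicitly; the function name `classification_rates` is illustrative.

```python
def classification_rates(tp, fn, fp, tn):
    """Sensitivity, specificity and misclassification rate from a 2x2 matrix."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),  # true positives among actual positives
        "specificity": tn / (tn + fp),  # true negatives among actual negatives
        "misclassification": (fn + fp) / total,
    }


if __name__ == "__main__":
    # Assumption: rows are actual classes and "Accept" is the positive class.
    rates = classification_rates(tp=1000, fn=500, fp=2500, tn=6000)
    for name, value in rates.items():
        print(f"{name}: {value:.4f}")
```

If "Reject" were treated as the positive class instead, the sensitivity and specificity values would swap while the misclassification rate would stay the same.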

c) The sensitivity and specificity values are sensitive to imbalanced samples of Y = 1
versus Y = 0 cases. Explain this situation and state one approach to overcome this
problem.
(3 marks)
END OF QUESTION PAPER
