Assignment 3, Spring 2024

Instructions

1) Make sure you write your group members’ names and ERP IDs clearly. Only one group member
should submit the assignment.
a. File Name Format: Group Number_Members’ ERP

2) For Question 2, I have colored in red the parts that actually need a response. The other parts are
intermediary workings where no written response is necessary.
3) For Question 3, show your working and explanations in the Excel answer sheet.
4) Submit your Excel answer sheet for both the questions, the R file for Question 2 (.R script), and a
PDF document answering all questions.
5) The deadline for the assignment is Tuesday, 14th May 2024, 11:55 PM.
6) You may approach the TAs or me with questions until Sunday, 12th May 2024, 5:00 PM.

7) You can use late days for this submission (maximum 3 allowed; 14th May 11:56 PM to 15th May 11:55
PM counts as 1 late day). The procedure for a late-day submission is as follows:
a. Submit the assignment via an email to both the TAs.
b. Do not submit assignments solely to Sir’s email.
c. Email Subject Line: Late Day Submission - Group No. & Assignment No.
d. Mention the number of late days utilized in the email body.
Question 1 – Decision Trees
Following the steps defined below, create a decision tree model to predict whether an individual has
diabetes:

1) Randomize the data using the Analysis ToolPak in Excel.
a. To do this, use the Random Number Generation tool and generate one uniform random
variable with 768 observations (because you have 768 rows of data) with seed = 123.
b. Now sort your data in ascending order by the random variable you just generated.
2) Load the data into R. You can use the read.csv function for this purpose.
3) Convert the variable Outcome into a factor variable
4) Remove the random variable column from your data (because we only needed it to randomize
the data and we do not need that column anymore)
5) Split your data into training and test sets. Use the first 500 rows for training and the remaining 268
rows for testing.
6) Create a decision tree model on your training data to predict the "Outcome" variable using the
rpart function.
7) In your console, print the decision tree model you just created and explain how to read the output
and what each value means. You don't have to explain every node; a few terminal nodes are
enough to show you understand how to interpret the output.
8) Some parameters control how the decision tree model is built. These are documented in the
help file of rpart. Type "?rpart" to bring up the help file and scroll down to the "control"
argument. You will see a hyperlink titled "rpart.control". Click on the hyperlink and read the help
file.
9) Create a decision tree model where every terminal node has at least 25 observations. Do you
notice any difference between this model and the model created in part (6) above? Explain.
10) Plot the decision tree model from (9) above using rpart.plot.
11) Predict the probability of having diabetes for each observation in both the training and test data.
Create the ROC and precision-recall curves for each data set and report the area under each
curve.
12) Compare the output of (11) above to part (9) of the logistic regression question of Assignment
2. Which model is better? Why?
13) An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80,
insulin = 100, BMI = 25, Age = 50. According to your final model from part (9) above, what is
the probability that the individual has diabetes? Explain.
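
As a rough illustration of the workflow in steps 3 to 6, 9, 11 and 13, the sketch below runs the same sequence on synthetic data rather than the assignment's diabetes file. Everything about the data is an assumption made for illustration: the column names (Glucose, BMI, Age, RandomVar), the way Outcome is simulated, and the helper roc_auc are all hypothetical, and method = "class" is one reasonable choice rather than a requirement of the assignment. It uses only the rpart package and base R, and it computes only the ROC area (via ranks), not the precision-recall area.

```r
# Minimal sketch on SYNTHETIC data; not the assignment's diabetes data set.
library(rpart)

set.seed(123)  # seed for the synthetic data only (unrelated to the Excel seed)
n <- 768
dat <- data.frame(
  Glucose   = round(rnorm(n, mean = 120, sd = 30)),
  BMI       = round(rnorm(n, mean = 32, sd = 7), 1),
  Age       = sample(21:80, n, replace = TRUE),
  RandomVar = runif(n)  # hypothetical stand-in for the Excel random column
)
# Simulated outcome: higher glucose and BMI raise the chance of Outcome = 1
p <- plogis(-8 + 0.04 * dat$Glucose + 0.08 * dat$BMI)
dat$Outcome <- rbinom(n, size = 1, prob = p)

# Step 1b (done in Excel in the assignment): sort by the random column
dat <- dat[order(dat$RandomVar), ]

# Step 3: make Outcome a factor so rpart fits a classification tree
dat$Outcome <- factor(dat$Outcome)

# Step 4: drop the random column, which was only needed for shuffling
dat$RandomVar <- NULL

# Step 5: first 500 rows for training, remaining 268 for testing
train <- dat[1:500, ]
test  <- dat[501:768, ]

# Step 6: tree with default control settings
fit_default <- rpart(Outcome ~ ., data = train, method = "class")

# Step 9: require at least 25 observations in every terminal node
fit_min25 <- rpart(Outcome ~ ., data = train, method = "class",
                   control = rpart.control(minbucket = 25))
print(fit_min25)

# Step 11: predicted probability of class "1" on the test rows
prob_test <- predict(fit_min25, newdata = test, type = "prob")[, "1"]

# Hypothetical helper: ROC AUC from the rank-sum (Mann-Whitney) identity
roc_auc <- function(scores, labels) {
  pos   <- labels == "1"
  r     <- rank(scores)  # ties receive average ranks
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_test <- roc_auc(prob_test, test$Outcome)

# Step 13: probability for one new individual (only the synthetic columns)
new_person <- data.frame(Glucose = 130, BMI = 25, Age = 50)
prob_new <- predict(fit_min25, newdata = new_person, type = "prob")[, "1"]
```

On the real data, dat would instead come from read.csv and would contain all of the original predictors. Note that when only minbucket is passed to rpart.control, minsplit is set to three times minbucket, which can by itself change which splits are attempted compared with the default model.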
Question 2 – Linear Optimization
