Professional Documents
Culture Documents
CS-7830 Assignment-2 Questions 2022
CS-7830 Assignment-2 Questions 2022
In this assignment, you will perform exploratory data analysis (EDA) on the given
dataset and implement Logistic Regression from scratch using a programming
language of your choice. As part of your implementation of logistic regression (LR),
you will code the Gradient Descent Algorithm that we discussed in class to find out
the parameters for Θ. One way to verify gradient descent is working as expected is
to monitor the value of J(Θ) whether it is decreasing with each training iteration.
Follow software design principles and document (comment) your code clearly
explaining what you did and why you did what you did. In your report, include a
README that states how your code is supposed to be run to obtain the expected
results.
Please address the subparts in each section to receive full credit and justify what
you did in your implementation as well as the results you obtained.
You will use a dataset representing ten years of clinical care at 130 US hospitals and
integrated delivery networks. It includes over 50 features representing patient and
hospital outcomes. The dataset is included in the assignment with the filename
diabetic_data_final.csv. You may use any train/val/test split ratio and/or K-Fold
cross-validation as you see fit. Note that your model should never be exposed to
your testing (hold-out) data at train time.
e. Cross-tabulation shows the frequency by which groups of data from the two
features appear. Generate cross-tabulations between the dependent variable,
time_in_hospital with each of the independent variables gender,
max_glu_serum, and insulin. Explain what can you infer from each of these
cross-tabulation results? (5)
a. Using the training/validation/test split ratio of your choice for the complete
dataset, train your Logistic Regression (LR) model to predict time_in_hospital
using the num_lab_procedures as an input feature. (15)
b. Evaluate the performance of your model on the training set and hold-out
(testing) set using a metric discussed in class (e.g., precision, recall, F1
score, and confusion matrix). Discuss your results. (15)
3. Logistic Regression with multiple variables and training samples (45
points)
a. Using the training/validation/test split ratio of your choice for the complete
dataset, train your LR model to predict time_in_hospital using the 10 features
below: (10)
age, num_lab_procedures, num_procedures, num_medications,
number_outpatient, number_emergency, number_inpatient,
number_diagnosis, insulin, and diabetesMed.
b. Using the training/validation/test split ratio of your choice for the complete
dataset, train your LR model to predict time_in_hospital using forward
selection to select the most significant features in the dataset as input
variables. Which subset of features gave you the best performance? What are
your thoughts on these features getting selected? (10)
c. Compare the performance of the model built using all of the features in (3a.)
with the model trained using the selected features (3b.). Which set of features
performed better? (10)
d. Train your LR model using training sample size within the range {25000,
50000, 75000} and the features selected in (3b.) and compute cross-entropy
loss (log loss) on the same hold-out set and plot your results (log-loss on y-
axis and train sample size on x-axis). What is the impact of increasing
training sample size? Justify why the log-loss decreases, increases or doesn’t
change at all. (15)
b. Cost Function:
i. Keeping the best regularized (or not) model after the experiments from
4a., train your LR model to predict sleep_quality by changing the cost
function to the following: (10)
ii. Compare the performance of both the models (4.b.i and 4.a.). Do they
give the same solution with a difference in cost function? (10)
Submit a zipped file containing your code(s) and report (in pdf) in the Dropbox
folder titled “Assignment 2-LastName” on Pilot.
Academic Integrity: Please note that the code and report you submit should be your
work, and yours alone. If plagiarism is detected, it will be dealt with strictly and
in accordance with Wright State guidelines.