Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Assignment 2

Due Date/Time: Oct 21, 2022 11:59 PM


Total Points: 150

In this assignment, you will perform exploratory data analysis (EDA) on the given
dataset and implement Logistic Regression from scratch using a programming
language of your choice. As part of your implementation of logistic regression (LR),
you will code the Gradient Descent Algorithm that we discussed in class to find out
the parameters for Θ. One way to verify gradient descent is working as expected is
to monitor the value of J(Θ) whether it is decreasing with each training iteration.
Follow software design principles and document (comment) your code clearly
explaining what you did and why you did what you did. In your report, include a
README that states how your code is supposed to be run to obtain the expected
results.

Please address the subparts in each section to receive full credit and justify what
you did in your implementation as well as the results you obtained.

You will use a dataset representing ten years of clinical care at 130 US hospitals and
integrated delivery networks. It includes over 50 features representing patient and
hospital outcomes. The dataset is included in the assignment with the filename
diabetic_data_final.csv. You may use any train/val/test split ratio and/or K-Fold
cross-validation as you see fit. Note that your model should never be exposed to
your testing (hold-out) data at train time.

1. Exploratory Data Analysis (35 points)

a. Consider the following numeric variables in the dataset: time_in_hospital,


num_lab_procedures, num_procedures, num_medications,
number_outpatient, number_emergency, number_inpatient, and
number_diagnoses. Summarize the statistics of these variables into count,
mean, standard deviation, minimum, 25% percentile, 50% percentile, 75%
percentile, and maximum. (5)

b. Consider the following object/string variables in the dataset: age, insulin,


change, and diabetesMed. Summarize the statistics of these variables into
count, unique value, top value, and frequency of top value. (5)

c. Generally, age is considered as a numeric data type but in this dataset, it is of


an object data type. Can you think of a reason for this and is there a way to
encode this age variable as a numeric data type? If so, how would you do
that? Similarly, explain how would you encode the variable readmitted from
object type to numeric type? (5)

d. Treat the variable time_in_hospital as a dependent variable and


num_lab_procedures, num_procedures, num_medications, and
number_diagnoses as independent variables. Observe the relationship between
the dependent variable and each independent variable separately using box
plots. What are your general observations and inferences by looking at these
box plots? (5)

e. Cross-tabulation shows the frequency by which groups of data from the two
features appear. Generate cross-tabulations between the dependent variable,
time_in_hospital with each of the independent variables gender,
max_glu_serum, and insulin. Explain what can you infer from each of these
cross-tabulation results? (5)

f. Generate separate count plots between the dependent variable


time_in_hospital and each of these independent variables, gender, age,
num_procedures, and num_medications. What do these count plots tell you
about the relationship between the dependent variable and each of these
independent variables? (5)

g. Consider a subset of the following variables: time_in_hospital,


num_lab_procedures, num_procedures, num_medications, and
number_diagnoses. Generate a correlation matrix and heat map of
correlations between these variables. How would you interpret the results as
visualized from the heat map? (5)

For questions 2, 3, and 4, implement Logistic Regression from scratch using a


programming language of your choice. As part of your implementation of
logistic regression (LR), code the Gradient Descent Algorithm to find out the
parameters for Θ. You may use any train/val/test split ratio and/or K-Fold
cross-validation as you see fit.

2. Logistic Regression using One Feature (30 points)

a. Using the training/validation/test split ratio of your choice for the complete
dataset, train your Logistic Regression (LR) model to predict time_in_hospital
using the num_lab_procedures as an input feature. (15)

b. Evaluate the performance of your model on the training set and hold-out
(testing) set using a metric discussed in class (e.g., precision, recall, F1
score, and confusion matrix). Discuss your results. (15)
3. Logistic Regression with multiple variables and training samples (45
points)

a. Using the training/validation/test split ratio of your choice for the complete
dataset, train your LR model to predict time_in_hospital using the 10 features
below: (10)
age, num_lab_procedures, num_procedures, num_medications,
number_outpatient, number_emergency, number_inpatient,
number_diagnosis, insulin, and diabetesMed.

b. Using the training/validation/test split ratio of your choice for the complete
dataset, train your LR model to predict time_in_hospital using forward
selection to select the most significant features in the dataset as input
variables. Which subset of features gave you the best performance? What are
your thoughts on these features getting selected? (10)

c. Compare the performance of the model built using all of the features in (3a.)
with the model trained using the selected features (3b.). Which set of features
performed better? (10)

d. Train your LR model using training sample size within the range {25000,
50000, 75000} and the features selected in (3b.) and compute cross-entropy
loss (log loss) on the same hold-out set and plot your results (log-loss on y-
axis and train sample size on x-axis). What is the impact of increasing
training sample size? Justify why the log-loss decreases, increases or doesn’t
change at all. (15)

Evaluate performance of your models using a metric discussed in class


(e.g., confusion matrix). You may also use graphs for explaining your
observations.

4. Experiments with regularization and Cost function (40 points)

a. Regularization and Feature Scaling:


i. For the best performing model in Q 3. (Model from 3c.), does
regularization improve the performance? (10)
ii. Does Feature Scaling improve the performance for the model in Q
4a.? (10)
Evaluate performance of your models using a metric discussed in class
(such as confusion matrix). You may also use graphs for explaining your
observations.

b. Cost Function:
i. Keeping the best regularized (or not) model after the experiments from
4a., train your LR model to predict sleep_quality by changing the cost
function to the following: (10)

ii. Compare the performance of both the models (4.b.i and 4.a.). Do they
give the same solution with a difference in cost function? (10)

Submit a zipped file containing your code(s) and report (in pdf) in the Dropbox
folder titled “Assignment 2-LastName” on Pilot.

Academic Integrity: Please note that the code and report you submit should be your
work, and yours alone. If plagiarism is detected, it will be dealt with strictly and
in accordance with Wright State guidelines.

You might also like