Idm Ass3
Data Understanding
We have been given a challenge from the medical field: identifying diseases from the symptoms that patients exhibit. The task at hand is to partition the incoming data into the right number of clusters. Each cluster is expected to contain only one class (disease), and as prospective data scientists we are to solve this problem by partitioning the data so that our predicted cluster assignments match the actual class labels as closely as possible.
Initial surface-level observation shows that the data has 2784 rows and 133 columns, where each row represents an individual patient and each column indicates a symptom. Each column contains binary values, 0 or 1: a value of 0 indicates the patient is not exhibiting that symptom, and a value of 1 indicates that they are. Our job is to organize these rows into clusters and produce as high an Adjusted Rand Index as possible.
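Since submissions are scored by Adjusted Rand Index, it helps to see what ARI actually measures: agreement between two partitions, ignoring the cluster label names themselves. A minimal sketch using scikit-learn (the label vectors here are made up purely for illustration):

```python
from sklearn.metrics import adjusted_rand_score

# Toy ground-truth diseases and two candidate clusterings (illustrative only).
true_labels  = [0, 0, 1, 1, 2, 2]
pred_perfect = [2, 2, 0, 0, 1, 1]   # same partition, different label names
pred_poor    = [0, 1, 0, 1, 0, 1]   # splits every disease across clusters

# ARI is 1.0 for an identical partition (label names do not matter)...
print(adjusted_rand_score(true_labels, pred_perfect))  # -> 1.0
# ...and near 0 (or negative) for a clustering no better than chance.
print(adjusted_rand_score(true_labels, pred_poor))
```

This is why a clustering can score perfectly even though K-Means numbers its clusters arbitrarily: only the grouping matters, not which integer each group gets.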
Data Preparation
Since there were no null values, no imputation was required. Since there were no categorical columns (other than the row ID column, which I dropped before starting), no encoding was required. As I opted to perform K-Means clustering first, I capped any outlier values in the data, since K-Means is very sensitive to outliers. I also checked that the data wasn't skewed, and since the values were 0s and 1s, I opted for normalization instead of standardization. This is the standard preprocessing I did; any additional preprocessing applied to the data afterwards is mentioned in the "Data Preprocessing" column of the Data Modelling table.
NOTE: All of the above-mentioned preprocessing was applied to the data only after my 8th submission.
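The capping and normalization steps described above can be sketched as follows. This is a hedged illustration, not the exact code used: the IQR clipping rule and min-max scaling are common choices for "capping outliers" and "normalization", and the column names and toy matrix are hypothetical stand-ins for the real 2784 x 133 symptom data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def cap_outliers_iqr(df: pd.DataFrame) -> pd.DataFrame:
    """Clip each column to [Q1 - 1.5*IQR, Q3 + 1.5*IQR], a common capping rule."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds (Series indexed by column name).
    return df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# Hypothetical stand-in for the symptom matrix (real data: 2784 rows x 133 binary columns).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(10, 4)),
                 columns=[f"symptom_{i}" for i in range(4)])

X_capped = cap_outliers_iqr(X)
# Min-max normalization keeps every feature in [0, 1].
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X_capped), columns=X.columns)
print(X_scaled.head())
```

Note that on purely 0/1 data both steps are close to no-ops (the values already lie in [0, 1]), which is consistent with normalization being preferred over standardization here.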
Data Modelling
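Since K-Means was the first algorithm tried, here is a minimal sketch of the fit-and-score loop: fit K-Means with some candidate k, then evaluate the resulting labels with ARI. The data below is synthetic (three made-up "diseases", each a binary symptom pattern with a small amount of noise); the real workflow would substitute the preprocessed 2784 x 133 matrix and the submission scoring.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Synthetic stand-in: three diseases, each with a characteristic 20-symptom pattern.
patterns = rng.integers(0, 2, size=(3, 20))
labels_true = rng.integers(0, 3, size=300)
# Each patient gets their disease's pattern, with ~5% of symptoms flipped as noise.
X = patterns[labels_true] ^ (rng.random((300, 20)) < 0.05)

# Fit K-Means with a hypothetical k=3 and score the partition with ARI.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("ARI:", adjusted_rand_score(labels_true, km.labels_))
```

With well-separated symptom patterns the ARI comes out close to 1; on the real data the score depends on how cleanly the diseases separate and on the choice of k.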
Evaluation
1) Which algorithm worked best for the given dataset and why?
2) What is the optimal number of clusters in the data as per your findings and why?
3) What were the overall challenges that you faced while improving the score?
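For question 2, one common way to justify the optimal number of clusters is to sweep k and compare an internal quality measure such as the silhouette score (the elbow method on inertia works similarly). A sketch on synthetic, well-separated data; the range of k and the data itself are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Synthetic data: 4 well-separated blobs standing in for the symptom matrix.
centers = rng.normal(0, 10, size=(4, 5))
X = np.vstack([c + rng.normal(0, 1, size=(50, 5)) for c in centers])

# Fit K-Means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)  # 4 for these well-separated blobs
```

On the real dataset the same sweep, cross-checked against the submission ARI, is one defensible way to argue for a particular cluster count.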