
IDM Assignment 3

Hamza Faisal 22971


Business Understanding

The challenge comes from the medical field and concerns identifying diseases from the symptoms that patients exhibit. The task at hand is to partition the incoming data into the right number of clusters. Each cluster is expected to contain only one class (disease), and as prospective data scientists we are to partition the data so that our cluster assignments match the actual class values as closely as possible.

Data Understanding

Initial surface-level observations show that the data has 2784 rows and 133 columns, where each row represents an individual patient and each column indicates a symptom that patient may be exhibiting. Each column contains binary values: a 0 indicates the patient is not exhibiting that symptom and a 1 indicates that they are. Our job is to organize these patients into clusters and achieve as high an Adjusted Rand Index as possible.
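
For reference, the Adjusted Rand Index rewards groupings that match the true classes regardless of which label numbers the clusters happen to receive. A minimal illustration, assuming scikit-learn is available (this snippet is illustrative and not part of the graded submissions):

```python
# Toy illustration of the Adjusted Rand Index (ARI), the grading metric.
# ARI compares a predicted clustering with true class labels: 1.0 is a
# perfect match, values near 0 indicate chance-level agreement. Only the
# grouping matters, not the label values themselves.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
predicted   = [1, 1, 0, 0, 2, 2]  # same grouping, different label names

print(adjusted_rand_score(true_labels, predicted))  # prints 1.0
```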

Data Preparation

Since there were no null values, no imputation was required. Since there were no categorical columns (other than the row-ID column, which I dropped before starting), no encoding was required. As I opted to perform K-Means clustering first, I ran code that capped any outlier values in the data, since K-Means is very sensitive to outliers. I also verified that the data wasn't skewed, and since the values were 0s and 1s, I opted for normalization instead of standardization. This is the standard preprocessing I did; any additional preprocessing done afterwards is noted in the "Data Preprocessing" column of the Data Modelling table.

NOTE: All of the above-mentioned preprocessing was applied to the data only after my 8th submission.
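
A minimal sketch of the preprocessing described above, assuming the data ships as a CSV with an "ID" column (the actual file and column names are assumptions). Outliers are capped with the common 1.5 × IQR rule and values are normalized to [0, 1] rather than standardized, since the features are binary:

```python
# Preprocessing sketch: drop the row-ID column, cap outliers, normalize.
# File name "symptoms.csv" and column name "ID" are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("symptoms.csv")
df = df.drop(columns=["ID"])  # row-ID column carries no signal

# Cap values beyond 1.5 * IQR of each column (K-Means is outlier-sensitive).
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
df = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)

# Normalize to [0, 1]; a no-op for columns that are already strictly 0/1.
X = MinMaxScaler().fit_transform(df)
```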

Data Modelling

Sub No. | Data Preprocessing | Model Details | Clustering Type | Score
1 | None | Performed on KNIME with 4 clusters, centroid initialization set to first k rows and 99 max. iterations | K-Means | 0.03655
2 | None | Performed on KNIME with 9 clusters, centroid initialization set to random and 99 max. iterations | K-Means | 0.35517
3 | None | Performed on KNIME with 9 clusters, centroid initialization set to random and 199 max. iterations | K-Means | 0.35517
4 | None | Performed on KNIME with 5 clusters, centroid initialization set to random and 199 max. iterations | K-Means | 0.36637
5 | None | Performed on KNIME with 2 clusters, centroid initialization set to random and 199 max. iterations | K-Means | 0.16042
6 | None | Performed on KNIME with 500 clusters, centroid initialization set to first k rows and 350 max. iterations | K-Means | 0.15725
7 | None | Performed on KNIME with 5 clusters, centroid initialization set to random and 50 max. iterations | K-Means | 0.36637
8 | None | All models from here on were made in Python. Clusters = 3, affinity = Euclidean, linkage = ward | Agglomerative | 0.07237
9 | None | Default K-Means clustering in Python (8 clusters) | K-Means | 0.42015
10 | None | Default K-Means again. Score increased this time due to randomness. | K-Means | 0.67049
11 | None | Default agglomerative clustering (2 clusters) | Agglomerative | 0.01586
12 | None | Default agglomerative clustering again to check if there is a pattern to the randomness. Apparently agglomerative clustering does not display randomness. | Agglomerative | 0.01586
13 | None | 16 clusters, 300 max. iterations, algorithm = "full" | K-Means | 0.49704
14 | None | 16 clusters, 200 max. iterations, algorithm = "full" | K-Means | 0.69159
15 | None | 16 clusters, 180 max. iterations, algorithm = "full" | K-Means | 0.66589
16 | None | 16 clusters, 120 max. iterations, algorithm = "l1" | K-Means | 0.6518
17 | None | 18 clusters, 300 max. iterations, algorithm = "full" | K-Means | 0.55567
18 | None | Default K-Means | K-Means | 0.80573
19 | None | 10 clusters, 150 max. iterations, algorithm = "full", random state = 0, n_init = 20 | K-Means | 0.48645
20 | None | 20 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 20 | K-Means | 0.46266
21 | None | 24 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 50 | K-Means | 0.48645
22 | None | 30 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 100 | K-Means | 0.46266
23 | None | Default agglomerative clustering | Agglomerative | 0.01586
24 | None | Accidentally uploaded the last file again | Agglomerative | 0.01586
25 | None | Increased clusters from 2 to 10 | Agglomerative | 0.46672
26 | None | Increased clusters from 10 to 16 | Agglomerative | 0.50846
27 | None | Increased clusters from 16 to 24 | Agglomerative | 0.34617
28 | None | 22 clusters, linkage = single | Agglomerative | 0.37563
29 | None | 16 clusters, linkage = complete | Agglomerative | 0.38636
30 | None | Accidentally uploaded previous file | Agglomerative | 0.38636
31 | None | 30 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 100 | K-Means | 0.56308
32 | None | 40 clusters, 340 max. iterations, algorithm = "full", random state = 42, n_init = 362 | K-Means | 0.50114
33 | None | 16 clusters, affinity = cosine, linkage = complete | Agglomerative | 0.73412
34 | None | 10 clusters, affinity = cosine, linkage = single | Agglomerative | 0.58181
35 | None | 24 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 50 | K-Means | 0.68636
36 | None | 10 clusters, 150 max. iterations, algorithm = "full", random state = 0, n_init = 20 | K-Means | 0.34242
37 | None | 40 clusters, 340 max. iterations, algorithm = "l2" | K-Means | 0.45113
38 | None | 8 clusters, 10 max. iterations, algorithm = "auto" | K-Means | 0.71249
39 | None | 20 clusters, affinity = cosine, linkage = ward | Agglomerative | 0.52964
40 | None | 10 clusters, 100 max. iterations, algorithm = "full", random state = 0, n_init = 100 | K-Means | 0.62693
41 | Dropped 45 columns with sum 0 | Clusters = 8, random state = 0. Dropped these columns because a value of 0 meant no one had that particular symptom, but the score did not increase due to loss of data. | K-Means | 0.51232
42 | None | Clusters = 7, random state = 0 | K-Means | 0.49442
43 | None | Clusters = 6, random state = 0 | K-Means | 0.46417
44 | Dropped 36 columns with multicollinearity > 0.8 | Used default K-Means. Dropping correlated columns did not improve the score at all. | K-Means | 0.39123
45 | None | Clusters = 87, affinity = "l1", linkage = "complete" | Agglomerative | 0.2697
46 | None | Clusters = 87, rest of the parameters set to default | K-Means | 0.25013
47 | None | Clusters = 3, rest of the parameters set to default | K-Means | 0.07237
48 | None | Accidentally submitted the previous file again | K-Means | 0.07237
49 | None | Default K-Means; only changed n_init to 1000 | K-Means | 0.49422
50 | None | n_clusters = 16, n_init = 50, max_iter = 350, verbose = 0, random_state = 42, algorithm = "full" | K-Means | 0.47456
51 | None | Clusters = 16, max. iterations = 350, algorithm = "elkan" | K-Means | 0.60619
52 | None | 24 clusters, 300 max. iterations, algorithm = "full", random state = 0, n_init = 50 | K-Means | 0.45113
53 | None | Clusters = 18, rest of the settings on default | K-Means | 0.4974
54 | None | Did not record parameters | K-Means | 0.46266
55 | None | Did not record parameters | K-Means | 0.36637
56 | None | 18 clusters, affinity = cosine, linkage = single | Agglomerative | 0.38636
57 | None | 18 clusters, affinity = cosine, linkage = average | Agglomerative | 0.38636
58 | None | 24 clusters, linkage = complete | Agglomerative | 0.52964
59 | None | Clusters = 22, linkage = single | Agglomerative | 0.37563
60 | None | n_clusters = 16, affinity = "manhattan" | Agglomerative | 0.38636
61 | None | 3 clusters, 99 max. iterations, fuzzifier = 2 | Fuzzy C-Means | 0.25346
62 | None | init = "k-means++", rest of the parameters default | K-Means | 0.66589
63 | None | Ran the previous model again to check randomness | K-Means | 0.37752
64 | None | Clusters = 16, max. iterations = 350, algorithm = "elkan", init = "k-means++" | K-Means | 0.62777
65 | None | Fitted data with default DBSCAN clustering. No improvement. | DBSCAN | 0.15725
66 | None | Previous model except I reduced the distance between clusters, but still no change | DBSCAN | 0.15725
67 | None | Fitted data with the default MeanShift clustering algorithm | MeanShift | 0.02808
68 | None | Clusters = 6, linkage = average | Agglomerative | 0.59426
69 | None | linkage = average, rest of the parameters set to default | Agglomerative | 0.07729
70 | None | Accidentally uploaded previous file | Agglomerative | 0.07729
71 | None | Clusters = 16, max. iterations = 99, fuzzifier = 7 | Fuzzy C-Means | 0.25346
72 | None | Clusters = 16, max. iterations = 99, fuzzifier = 100 | Fuzzy C-Means | 0.16055
73 | None | linkage = average, clusters = 5 | Agglomerative | 0.23619
74 | None | 12 clusters, 200 max. iterations, algorithm = "full", random state = 0 | K-Means | 0.61053
75 | None | 18 clusters, affinity = cosine, linkage = single | Agglomerative | 0.34103
76 | None | 24 clusters, affinity = cosine, linkage = single | Agglomerative | 0.35848
77 | None | Fitted the data 10 times with default K-Means, took the average of the cluster labels, and rounded them down to the floor value | K-Means | 0.46439
78 | None | 22 clusters, linkage = single, affinity = manhattan | Agglomerative | 0.21985
79 | None | 32 clusters, 300 max. iterations, algorithm = "full", random state = 42 | K-Means | 0.51168
80 | None | 16 clusters, affinity = cosine, linkage = complete | Agglomerative | 0.40888

Evaluation

1) Which algorithm worked best for the given dataset and why?

2) What is the optimal number of clusters in the data as per your findings and why?

3) What were the overall challenges you faced while improving the score?
