
NetLogo K-Means Guidelines

The embedded applet allows you to watch the K-Means clustering algorithm at work on randomly
generated datasets. Your exploration of this clustering simulation has a dual purpose: first, it will give you
more insight into how partitioning clustering works; second, it will illustrate how the choice of the
number of clusters K can affect the results of the clustering algorithm, under different controlled scenarios
for the simulated data. Working with simulated data has the advantage that it is easier to evaluate
the properties of a cluster solution when the true underlying structure of the data is known. This, of
course, is never the case with real data.

To watch the clustering algorithm at work, please follow these steps:

1. Generate simulated data. For this, you need to choose parameters in the top two green sliders:

a. number-of-real-clusters-to-generate: the number of clusters that are inherent in the data; these
are the “real” clusters from which the data is randomly generated*
b. num-data-points: the number of data points (the size of the sample of simulated data)

*Note: In this applet you have control over the data generation process, so you can set the number of
“real” clusters. The applet will then generate data with this number of clusters. In reality, you won’t
know what procedure generated the data you are analyzing – you will examine it and make an
informed decision about an appropriate K. It’s important to bear in mind that clusters aren’t simply
“in” the data; they are something we read out of it, and impose upon it.

2. Choose K, the number of clusters that the K-means algorithm will partition the data into. You will
need to specify this value in the third green slider.
Of course, K does not need to equal the “real” number of clusters. For non-simulated (real) data, it
seldom does, because the true structure of the data is always unknown.

3. Click ‘setup’ to generate the data.

You will notice that the simulated data points are grey (they have not been assigned to any cluster yet)
and the centroids that you specified are colored.

4. Iterate through the steps of the K-means clustering algorithm:

● Click ‘Assign Points’


This assigns each point in the dataset to its closest centroid. Note that the data points
now change color to match their assigned centroid.
● Click ‘Update Centroids’
The previous assignment has formed new clusters, so the centroids must now move. This step
recomputes each centroid as the mean of the points assigned to it in the
previous step.
● Click ‘Assign Points’ again

● Click ‘Update Centroids’ again
● Keep clicking these two buttons in sequence until the clustering stabilizes (i.e., it no longer
changes)
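The two buttons correspond to the two alternating steps of the K-means algorithm. As a rough sketch (in Python with made-up toy data, not the applet’s actual NetLogo code), one round of ‘Assign Points’ followed by ‘Update Centroids’ might look like this:

```python
import math

# Toy 2-D data: two loose groups (hypothetical values, not the applet's data)
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 7.5), (7.8, 8.2)]
centroids = [(0.0, 0.0), (9.0, 9.0)]  # K = 2 starting centroids

def assign_points(points, centroids):
    """'Assign Points': label each point with the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def update_centroids(points, labels, k):
    """'Update Centroids': move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        new_centroids.append(tuple(sum(coord) / len(members)
                                   for coord in zip(*members)))
    return new_centroids

# One full round of clicking the two buttons:
labels = assign_points(points, centroids)          # points take a cluster color
centroids = update_centroids(points, labels, 2)    # centroids move to the means
```

Repeating these two calls in a loop until the labels stop changing is exactly what you do by hand with the buttons.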

To avoid all of the clicking back and forth:


● Drag the ‘model speed’ slider at the top of the applet all the way to the left
● Click ‘setup’ again to generate new data
● Click ‘Find Clusters (go)’ to start the clustering procedure – it will run until it converges

Reflection Questions

● Does the resulting clustering align with what you would expect when you visually examine the
data? Do you ‘agree’ with the clusters that the algorithm identified?
● Did the visualization of the procedure change/improve your understanding of what the clustering
algorithm is doing?
● Do you notice anything about the procedure that might affect how you interpret results?

Hopefully you now have a better idea about what’s going on ‘under the hood’. Here’s something else to
try:

● Generate a dataset and let the algorithm find a stable clustering


● Take a screenshot of the clustering and save it
● Click the ‘Reset centroids’ button
● Note that the data set is the same, but the initial centroids are different
● Run the algorithm again to find a stable clustering
● Take another screenshot and save it
● Reset the centroids again and repeat the process another few times

Reflection Questions

● What do you notice when you compare the different cluster solutions?
● How do you account for this? What does this tell you about the algorithm?
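The reset-centroids experiment can also be sketched in code. The following Python sketch (toy data, not the applet’s implementation) runs the assign/update loop to convergence from two different starting positions on the same dataset; the four corners of a square admit both a left/right and a top/bottom split as stable K = 2 solutions:

```python
import math

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the labels stabilize.

    Note: this toy version assumes no cluster ever becomes empty.
    """
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:      # converged: assignments unchanged
            break
        labels = new_labels
        centroids = [
            tuple(sum(c) / len(group) for c in zip(*group))
            for group in ([p for p, l in zip(points, labels) if l == j]
                          for j in range(len(centroids)))
        ]
    return centroids, labels

# Four corners of a square: different starting centroids converge to
# different, equally stable clusterings.
square = [(0, 0), (0, 4), (4, 0), (4, 4)]
sol_a, _ = kmeans(square, [(0, 2), (4, 2)])   # left/right start
sol_b, _ = kmeans(square, [(2, 0), (2, 4)])   # top/bottom start
```

Both runs converge, yet `sol_a` and `sol_b` are different partitions of the same data, which is precisely what resetting the centroids in the applet can reveal.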

Try adjusting the green sliders at the top of the applet and then run the clustering algorithm again. The
sliders change certain aspects of the randomly-generated data: the number of clusters that are inherent
in the data (the “real” clusters from which the data has been randomly generated), the number of data
points (the size of the sample of simulated data), and the number of centroids that the algorithm will use
(the number of clusters that the algorithm will partition the data into: K).

Reflection Questions

● What happens when there are more centroids than “real” clusters? (This is akin to “overfitting”:
looking for a more complex structure in the data than reality presents.)
● What happens when there are fewer centroids than “real” clusters? (This is akin to looking for
a more parsimonious, simpler structure than the one reality presents, thus necessarily
neglecting some aspects of the real data.)
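One way to see why choosing K is hard: the within-cluster sum of squared distances (often called inertia) keeps shrinking as K grows, even past the “real” structure. This Python sketch (with hypothetical hand-placed centroids, not output from the applet) compares too few, matching, and too many centroids on data with two real clusters:

```python
import math

def inertia(points, centroids):
    """Within-cluster sum of squared distances: each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Two tight "real" clusters (hypothetical data, not the applet's)
data = [(0, 0), (0, 1), (1, 0), (1, 1),      # cluster around (0.5, 0.5)
        (9, 9), (9, 10), (10, 9), (10, 10)]  # cluster around (9.5, 9.5)

too_few    = inertia(data, [(5, 5)])                    # K = 1: real clusters merged
just_right = inertia(data, [(0.5, 0.5), (9.5, 9.5)])    # K = 2: matches reality
too_many   = inertia(data, [(0, 0.5), (1, 0.5),
                            (9, 9.5), (10, 9.5)])       # K = 4: real clusters split
```

Here `too_few > just_right > too_many`: the fit improves numerically even when K exceeds the real number of clusters, so raw fit alone cannot tell you the “right” K.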

When you’ve played with the applet for a bit, head over to the forum on the Hub to discuss it with your
peers and ask any questions you still have. To get the conversation going, address the following question:

Real data doesn’t necessarily have ‘natural’ implicit clusters, but clustering algorithms will still identify
clusters. What do analysts need to bear in mind when interpreting or using algorithm-derived cluster
solutions?
