Principal Coordinates Analysis - Towards Data Science
https://towardsdatascience.com/principal-coordinates-analysis-cc9a572ce6c 1/47
15/11/23, 18:37 Principal Coordinates Analysis | Towards Data Science
The generated mappings help you understand which items are close to each
other and which are different. They can also help you identify groups or
clusters.
With a Principal Coordinates Analysis, the primary goal is to create the best
possible mapping, and one thing you can look for in the graph is
groups/clusters. With clustering, the primary goal is to identify clusters,
and one thing you can do with those clusters is to try to plot them on a map.
Perceptual Mapping
The first example of Principal Coordinate Analysis that we’re going to see is a
Perceptual Mapping use case. Perceptual Mapping means that you make a
geographic map, but you use an unusual distance measure.
Of course, there are all types of distance measures that you can use for a
map. Yet the idea of Perceptual Mapping is to create a visualization that gives
you a great insight into other dimensions than distance.
For this example, I have created a small data set with travel times by train
between cities in France. Ever since I’ve been living in France, I’ve been
surprised that some cities that are close together take a long time to reach
by train, whereas some long distances are very fast to travel. This can be due
to factors like the type of train that can run between two cities or
geographical barriers like mountains.
Product Mapping
The second example that we’ll look at is an example from product branding.
We will use a simulated data set of distances between products. Imagine that
you're a company, and you want to introduce a new product. You could use
this technique to map the product among existing products, to find out
whether it is different enough to be introduced.
Product Mapping can be done using metric data, but it is often done using
non-metric data as well. You can only use Principal Coordinate Analysis if
you have data on a metric scale. For ordinal numeric data, you need to use a
method called Non-Metric Multidimensional Scaling.
When starting with this method, you need data that comes down to having a
so-called dissimilarity matrix. This means that you have a distance or
Be aware that this method can be used only for real measurements of
distance. For example, if your value is twice as high, your distance must be
twice as large. Therefore, you cannot use it for ordinal data like consumers
filling in a measurement scale from one to five. You need to use Non-Metric
Multidimensional Scaling for that.
Once you have the metric distance matrix, you can compute your solution
using the Torgerson method (when distances are Euclidean) or otherwise with
the iterative method.
The Torgerson formula for the double centering starts by computing the
squares of the distances:
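For reference, a common textbook way to write this double centering is sketched below. The symbols $D^{(2)}$ (the matrix of squared distances $d_{ij}^{2}$), $J$ (the centering matrix), and $n$ (the number of items) are assumptions of this sketch and may differ from the article's own notation; only the name $B$ comes from the text.

```latex
D^{(2)} = \left[ d_{ij}^{2} \right], \qquad
J = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \qquad
B = -\tfrac{1}{2}\, J\, D^{(2)}\, J
```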
Then, you apply a Singular Value Decomposition on the matrix B. Once you
do that, you take the first two dimensions of your SVD and you use them as
axes to plot your mapping. The scores of your items on the first two
dimensions will be used as coordinates for your map.
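The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in this article: the function name `classical_mds` and the toy rectangle data are made up for the example, and because the matrix B is symmetric the sketch uses an eigendecomposition, which yields the same axes as the SVD for this kind of matrix.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Torgerson (classical) scaling of a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Centering matrix J = I - (1/n) * 11^T
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix B = -1/2 * J * D^2 * J (D^2 is element-wise)
    b = -0.5 * centering @ (dist ** 2) @ centering
    # B is symmetric, so an eigendecomposition plays the role of the SVD here
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the dimensions with the largest eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    # Scale each eigenvector by the square root of its eigenvalue
    # (negative eigenvalues are clipped to zero)
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy example: the four corners of a 3 x 4 rectangle (made-up data)
points = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)
# Distances between the recovered coordinates
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

For truly Euclidean input like this toy example, the recovered coordinates reproduce the original distances, although the map itself may come out rotated or mirrored.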
Principal Coordinates Analysis — the cost function for the iterative method
Let’s start with a general overview of France’s 10 largest cities for those who
are not familiar with them:
For this analysis, I have used an itinerary planner to obtain the travel time
from each city to the other by train. I have put those travel times (in
minutes) in a distance matrix. You can obtain the data directly from an S3
bucket using the following line of code:
Note that this data set is already a distance matrix, but it is stored in a
data frame format. We need to convert it into a distance matrix object as
follows:
Principal Coordinates Analysis — convert the data into a real distance matrix object
We can then use the cmdscale function from the stats package, which will do
all the mathematics of the Torgerson method for us, and we will obtain the
coordinates for our mapping.
You will obtain the new coordinates for each of the cities and that looks as
follows:
This coordinate matrix is the output of the model. Of course, a logical next
step is to plot those coordinates to obtain the visual version of the mapping.
We can then create the mapping as follows:
In the following graph, you see the mapping that you obtain. The cities are
now relocated based on travel times by train rather than by kilometers:
Principal Coordinates Analysis: the top 10 cities of France reorganized based on train travel time
Those who have looked at the original map of France will notice that there is
something weird going on: cities in the east are all shown in the west and
vice versa. Since Principal Coordinates Analysis is based only on distances, it
does not preserve the original directions. We can easily mirror the map by
reversing the x-axis to get east and west back in place. You can do this with
the following code:
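Mirroring the map is harmless because a distance-based solution is only determined up to rotation and reflection. The NumPy sketch below illustrates this with made-up coordinates (they are not the fitted city coordinates from this article, and the helper `pairwise_distances` is a hypothetical name): reversing the sign of the first coordinate swaps left and right but leaves every pairwise distance unchanged.

```python
import numpy as np

# Made-up 2-D coordinates standing in for the output of the analysis
coords = np.array([[-2.0, 1.0], [0.5, -1.5], [3.0, 0.5], [-1.0, 2.0]])

def pairwise_distances(xy):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

# Mirror the map left-to-right by reversing the sign of the first coordinate
flipped = coords * np.array([-1.0, 1.0])

# Check that the mirrored map reproduces the same pairwise distances
same_distances = np.allclose(pairwise_distances(coords),
                             pairwise_distances(flipped))
```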
You will now obtain the final mapping of the top 10 French cities,
reorganized based on travel times by train:
Principal Coordinates Analysis: the top 10 cities of France reorganized based on train travel time
This mapping allows us to see several interesting things. Firstly, we see that
most of the cities are projected much closer to each other, except for three
outliers: Nice, Toulouse, and Bordeaux are pulled away from each other, which
reflects the longer travel times between cities in the south of France.
The cities Marseille and Montpellier are also in the south of France, yet those
are moved much closer to Paris and further away from the other southern
cities. This can be explained by the quick train line from Paris to Marseille.
In the northern part of France, we see the distance from east to west being
made much smaller. Nantes and Strasbourg are shown as being very close
together, although they are on opposite sides of the country.
Now if you really want to move forward with this mapping, you could use
mapping or GIS packages like the cartography package to make this map
look stunning. Be aware that the current mapping is based on a 0-centered
map. Yet when you want to project onto a map of the country, you will need
to make some additional decisions, including how to scale the map and
where to place the center.
Unlike in the previous example, we do not yet have a distance matrix. Since
the input for a Principal Coordinates Analysis is a distance matrix, we need
to compute that distance matrix first, based on the data. The dist function
in R computes the Euclidean distances between observations, as follows:
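To make the computation concrete, here is a minimal NumPy sketch of a Euclidean distance matrix between observations. It is an illustration in Python rather than the R call described above, and the `ratings` values are made up for the example.

```python
import numpy as np

# Made-up scores for three products on three attributes (rows = products)
ratings = np.array([
    [1.0, 2.0, 2.0],
    [4.0, 6.0, 2.0],
    [1.0, 2.0, 5.0],
])

# Difference between every pair of rows, via broadcasting
diffs = ratings[:, None, :] - ratings[None, :, :]
# Euclidean distance: square root of the summed squared differences
dist_matrix = np.sqrt((diffs ** 2).sum(axis=-1))
```

The result is a symmetric matrix with zeros on the diagonal, which is the kind of input that the analysis expects.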
Now we need to fit the Principal Coordinates Analysis using cmdscale. The
code is shown below:
You can generate the plot of the 10 candies on the two dimensions of the
Principal Coordinates Analysis as follows:
You can get some interesting insights from this graph. Most of the candies
are grouped at the bottom of the graph. There is one very different candy:
candy 5. Then for the other candies, we might distinguish two groups: one
group on the bottom right (candies 2, 3, 4, and 1) and a group on the bottom
left (candies 10, 9, 6, 7, and 8).
To find out a bit more about what the dimensions actually mean, it can be
interesting to analyze correlations between the original variables and the
two dimensions. This can be done as follows:
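As a rough illustration of the idea, the NumPy sketch below correlates a set of original variables with two dimensions. All values are made up: `ratings` is random data, and `dims` is a stand-in for fitted coordinates that is deliberately built from two of the variables so that the expected pattern is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 10 items scored on 3 hypothetical attributes
ratings = rng.normal(size=(10, 3))
# Stand-in for the two fitted dimensions: dimension 1 copies attribute 0 and
# dimension 2 is the negative of attribute 2, purely for illustration
dims = np.column_stack([ratings[:, 0], -ratings[:, 2]])

# np.corrcoef treats rows as variables, so stack the transposed blocks
full = np.corrcoef(np.vstack([ratings.T, dims.T]))
# Keep only the attribute-by-dimension block (3 rows, 2 columns)
loadings = full[:3, 3:]
```

Each row of `loadings` belongs to one original variable and each column to one dimension, so a value near 1 or -1 indicates that the dimension mostly represents that variable.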
This tells us that the first dimension is strongly correlated with Sweetness
and Sourness. The second dimension is mainly representing Saltiness.
We can conclude two things for the question of our candy company:
Firstly, the company does not yet have candy in the top right of the graph.
It may be interesting for them to study whether this would have any
added value. This would be a candy that scores high on both dimensions.
This candy could for example be a Sweet/Salty combination.
The data are a distance matrix with the travel times in minutes from each
city to the other. It looks as follows:
The Python function that we’re going to use for the Principal Coordinates
Analysis can only take a symmetrical distance matrix. This means that we
have to fill in the NAs with the corresponding values. This is easy to do by
replacing the NAs by 0 and doing a sum of the original matrix and the
transposed matrix:
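The trick described above can be sketched as follows. This is a minimal NumPy illustration with made-up travel times, not the article's own data or code; it assumes that the missing half is marked with NaN and that the diagonal is 0.

```python
import numpy as np

# Made-up lower-triangular travel times; NaN marks the missing upper half
half = np.array([
    [0.0, np.nan, np.nan],
    [120.0, 0.0, np.nan],
    [200.0, 90.0, 0.0],
])

# Replace the missing values with 0, then add the transpose to fill the other half
filled = np.nan_to_num(half, nan=0.0)
symmetric = filled + filled.T
```

Note that this only works because the diagonal is 0: any non-zero diagonal value would be doubled by the sum.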
Principal Coordinates Analysis — converting the half distance matrix to the symmetric distance matrix
Now we get to the modeling. You can use the scikit-bio package for your
Principal Coordinates Analysis. You can use the code below to install and
import the package and to fit the model. Finally, you print the coordinates
in the first two dimensions using the .samples attribute:
Finally, we want to create a plot of those 10 coordinates. You can use the
following code to plot the cities with matplotlib:
The resulting map is shown below. It is the same output as the one obtained
in R, except that it is mirrored. I will not repeat the conclusions, as they
are exactly the same as in the R analysis above.
Principal Coordinates Analysis — mapping the cities based on travel times by train
Now that you have this matrix, you can move on to fitting the model. We’ll
use the skbio package again and plot the results with matplotlib:
You will obtain the same graph as the one produced by the equivalent R
code. The Python graph is shown below:
Conclusion
Using Principal Coordinates Analysis, we have visualized the 10 largest cities
of France and created an alternative map of France based on travel times by
train.
I hope that this article was useful for you. Don’t hesitate to stay tuned for more
maths, stats, and data content!