
Here are some tips from your old professor that will help you execute this project.

(1) Use MySQL to create a table that has everything you need before you go to
Orange. This will need to include the TPP score, which you will have to
calculate. Gene Parmeasan and Chris Cassa told you how.

(2) Once you have organized your data in MySQL, export it as a ".csv" file that you can
load into Orange. You will find the export button at the top of MySQL's results grid.

(3) In Orange, set the identifier attributes to "skip". They aren't meaningful measures
for your analysis, but we have to tell the computer that.

(4) Before you do your K-Means analysis, use Orange's Select Columns tool to remove
TPP, Route ID and Country (for now). Why, do you think?

(5) Before you do your K-Means analysis, use Orange's pre-processing tool. This tool
allows us to use measures at different scales at the same time, without measures with
large numbers, like city population, overwhelming other important measures, like
customers served. We have to tell the computer that.
Use these normalization parameters
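
To see why this step matters, here is a minimal Python sketch of one common kind of normalization: rescaling each column to mean 0 and standard deviation 1. The column names and numbers are made up for illustration, and the normalization parameters used in the course may differ from this choice.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical routes: the two measures live on very different scales.
city_population = [250_000, 1_200_000, 8_500_000, 640_000]
customers_served = [120, 310, 95, 480]

scaled_population = standardize(city_population)
scaled_customers = standardize(customers_served)

# After scaling, both columns have comparable spread, so neither one
# dominates the distance calculations that K-Means relies on.
print([round(v, 2) for v in scaled_population])
print([round(v, 2) for v in scaled_customers])
```

Without this step, a difference of a few thousand people in city population would swamp any difference in customers served.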

(6) If you want to check your work against the course team's, set K = 5 for your cluster
analysis. That's what we did.
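
If you are curious what K-Means does with that K, the following is a rough Python sketch of the basic algorithm (assign each point to its nearest center, move each center to the mean of its points, repeat). It is an illustration only: the function names, the seed, and the sample points are hypothetical, and Orange's own widget may initialize and refine clusters differently.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=100, seed=0):
    """Basic K-Means: returns (cluster index per point, list of centers)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has settled
        assignments = new_assignments
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical routes described by two already-normalized measures.
routes = [
    (0.1, 0.2), (0.0, 0.3), (2.1, 2.0), (2.0, 2.2), (-2.0, 1.9),
    (-2.1, 2.1), (2.0, -2.0), (2.2, -1.9), (-2.0, -2.1), (-1.9, -2.0),
]
labels, centers = k_means(routes, k=5)
print(labels)
```

The printed list holds one cluster index per route, in the same order as the input.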

(7) Use the Select Columns widget after your K-Means widget to get cluster assignments,
and merge that data with your original file so you can see routes, their attributes, and
their country and cluster assignments.

(8) Spend some time with these boxplots. Can you capture in one image why cluster-based
analysis is more useful than country-based analysis in this context? I think it's
there.

(9) Export a table that contains route_IDs and their cluster assignments back to MySQL
for writing queries. Pay careful attention so that the .csv you export has only the
columns you need. Select Columns can help you do this.
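
If you ever need to trim a file outside Orange, the same idea can be sketched in a few lines of Python. The column names and rows below are hypothetical stand-ins; substitute whatever your own table uses.

```python
import csv
import io

# Hypothetical rows, as they might look after clustering.
rows = [
    {"route_id": "R001", "country": "A", "customers_served": 120, "cluster": "C1"},
    {"route_id": "R002", "country": "B", "customers_served": 310, "cluster": "C3"},
    {"route_id": "R003", "country": "A", "customers_served": 95, "cluster": "C1"},
]

# Only these columns should reach the file that goes back to MySQL.
keep = ["route_id", "cluster"]

def write_selected_columns(rows, columns, handle):
    """Write only the named columns to an open text handle as CSV."""
    # extrasaction="ignore" silently drops every key not listed in columns.
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()  # stands in for a real file in this sketch
write_selected_columns(rows, keep, buffer)
print(buffer.getvalue())
```

To write an actual file, pass a handle opened with `open("clusters.csv", "w", newline="")` instead of the in-memory buffer (the file name is just an example).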

The easiest way to input this data into MySQL is to use the Import Data Wizard, which
you can find by right-clicking on the table into which you will be importing. I bet you
will need to make that table first. Did you do that? You can do that. Chris Cassa told you how.

(10) Now that you can query this data with the cluster assignment information included,
let's answer some managerially relevant questions. Please keep in mind that the cluster
names are not persistent. This means that this analysis will result in different cluster
names when different people do the exact same thing -- even when the same person
does the same thing, but at different times. Importantly, cluster membership does not
change each time we run the algorithm. The names (c1, c2, c3, ...) simply apply to
different clusters on different machines and on different days, so it's best to use
descriptors for each of the clusters. Box plots can help you do that.
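
A small Python sketch can make the naming point concrete. The route IDs, cluster names, and the two "runs" below are invented for illustration. The sketch compares two runs by which routes are grouped together and ignores what each group happens to be called.

```python
def partition(assignments):
    """Group route IDs by cluster name, then forget the names themselves."""
    groups = {}
    for route_id, label in assignments.items():
        groups.setdefault(label, set()).add(route_id)
    return {frozenset(members) for members in groups.values()}

# Two hypothetical runs of the same analysis on different days.
run_monday = {"R001": "C1", "R002": "C1", "R003": "C2", "R004": "C3"}
run_tuesday = {"R001": "C3", "R002": "C3", "R003": "C1", "R004": "C2"}

print(run_monday == run_tuesday)                        # False: the names differ
print(partition(run_monday) == partition(run_tuesday))  # True: same groupings
```

This is why a query that filters on a name like "C1" may pick out a different cluster in someone else's results, while a descriptor based on what the box plots show stays meaningful.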

(11) Smile! You are a real-world data hero for doing all of this.
