Pl152 Preliminary Report

CO7201 MSc Individual Project
Department of Computer Science

CO7201-MSc Project
Machine Learning
Pengyuan Liu
CFS ID:pl152
Friday, July 1st, 2016
Word Count - 1492

Table of Contents
1. Introduction
1.1. Motivation
1.2. Objectives
1.3. Challenges
2. Requirements
2.1. Aim
2.2. Essentials
2.3. Recommended
2.4. Optional
3. Technical Specification
4. Background Study
4.1. Introduction to Clustering Algorithms
4.2. Introduction to Geodemographic Analysis
4.3. Introduction to PyCharm
4.4. Introduction to QGIS
4.5. Reading List
5. Work Plan
6. Risk Plan
7. References
DECLARATION
All sentences or passages quoted in this report, or computer code of any form whatsoever
used and/or submitted at any stages, which are taken from other people’s work have been
specifically acknowledged by clear citation of the source, specifying author, work, date and
page(s). Any part of my own written work, or software coding, which is substantially based upon
other people’s work, is duly accompanied by clear citation of the source, specifying author, work,
date and page(s). I understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this module and the degree examination as a whole.
Name: Pengyuan Liu
Date: 30/06/2016
1. Introduction
1.1. Motivation
Clustering, a widely used unsupervised learning method in machine learning, has
proved useful in geodemographic classification, for example to explore the quality
of Volunteered Geographic Information (VGI) [1].
Geodemographic classification has been defined as ‘the analysis of people by where they
live’ [2]; it involves categorical summary measures that aim to capture the multidimensional
characteristics of small geographical areas. The analysis of geodemographic classification is
an applied and often interdisciplinary area of research, with methodological
developments seeking to address real-world problems that have a spatial dimension [2].
A recently published study [1] on the use of geodemographic classification for
exploring the quality of Volunteered Geographic Information (VGI) provides a good example
of applying machine learning techniques to data analysis and visualization. The
study [1] mainly uses the fuzzy c-means clustering algorithm to classify the content of
OpenStreetMap in Leicestershire. This project extends that research by
implementing multiple clustering algorithms to analyze the content of OpenStreetMap, and by
applying clustering methods to further analysis in other areas of social science.
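To make the fuzzy c-means idea concrete, the following is a minimal NumPy sketch of the standard textbook algorithm. It is not the implementation used in [1]; the function name fuzzy_c_means, its parameters (the fuzzifier m, the tolerance tol) and the synthetic data are assumptions chosen for illustration only.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Hypothetical helper: return (centers, memberships) for data X."""
    rng = np.random.RandomState(seed)
    # Random initial memberships; each row is normalised to sum to 1.
    u = rng.rand(X.shape[0], n_clusters)
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centres are means weighted by membership ** m.
        weights = u ** m
        centers = weights.T.dot(X) / weights.sum(axis=0)[:, np.newaxis]
        # Distance from every point to every centre: (n_points, n_clusters).
        dist = np.linalg.norm(X[:, np.newaxis, :] - centers[np.newaxis, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        # Membership update: inversely related to distance, rows sum to 1.
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u


# Synthetic two-blob data, purely illustrative (not OpenStreetMap data).
rng = np.random.RandomState(1)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(4.0, 4.0), scale=0.3, size=(30, 2)),
])
centers, memberships = fuzzy_c_means(X, n_clusters=2)
hard_labels = memberships.argmax(axis=1)  # collapse soft memberships
```

Unlike k-means, each point receives a degree of membership in every cluster; taking the argmax, as in the last line, discards that information and is only needed when a single class per area is required.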
1.2. Objectives
The main objectives of this project are as follows:
(1) Produce a research report that demonstrates a deep understanding of machine learning,
particularly of clustering algorithms in unsupervised learning.
(2) Implement multiple clustering algorithms in Python, mainly with the scikit-learn package,
supplemented by other packages. Analyze the advantages and drawbacks of the implemented
clustering algorithms.
(3) Visualize the output of the clustering algorithms in QGIS, an open-source Geographic
Information System. Compare the visualizations with the researcher’s output [1].
(4) Apply clustering algorithms to other social science data.
1.3. Challenges
Researching different clustering algorithms and understanding the mathematical principles behind
the implemented algorithms could be challenging in itself. The results will also need to be properly
visualized in QGIS in order to compare and analyze the implemented algorithms.
On the technical side, connecting the output of the implemented algorithms to QGIS and
producing the visualization could be challenging. The structure of the data to be clustered
is the main concern when implementing the algorithms.
2. Requirements
2.1. Aim
The aim of this project is to research and compare the use of clustering algorithms for
exploring the quality of Volunteered Geographic Information (VGI).
2.2. Essentials
These requirements represent the basic aim of the project. Without any one of them, the project
might not illustrate the results of the research properly or meet its intended purpose, which is to
explore different clustering algorithms for VGI.
(1) Implement multiple clustering algorithms in Python using the scikit-learn package,
self-implemented algorithms or other supplementary packages.
(2) Build a map of Leicestershire in QGIS from the provided data and the 2011 UK Census
Output Area (OA) boundaries, which can be downloaded from the UK Data Service.
(3) Visualize the output of the implemented algorithms on the provided data in QGIS.
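One possible way to connect requirements (1) and (3) is to write the cluster labels to a CSV file, which QGIS can load as a delimited text layer and join to the OA boundary layer on a shared key field. The sketch below assumes this workflow; the function name write_cluster_table, the column names oa_code and cluster, and the example codes are hypothetical placeholders rather than real census identifiers or project results.

```python
import csv
import io


def write_cluster_table(oa_codes, labels, out_file):
    """Write one row per Output Area: its code and its cluster label."""
    if len(oa_codes) != len(labels):
        raise ValueError("oa_codes and labels must have the same length")
    writer = csv.writer(out_file)
    writer.writerow(["oa_code", "cluster"])  # hypothetical column names
    for code, label in zip(oa_codes, labels):
        writer.writerow([code, int(label)])


# Placeholder codes and labels, not real census identifiers or results.
example_codes = ["OA_0001", "OA_0002", "OA_0003"]
example_labels = [0, 2, 1]

# An in-memory buffer stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_cluster_table(example_codes, example_labels, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the export independent of any particular clustering library means the same table format can be reused for every algorithm that is compared.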
2.3. Recommended
These requirements add extra analysis to the project, making the research more reliable and
more substantial.
(1) Compare the visualizations of the different implemented clustering algorithms in QGIS to
explore the differences between the researcher’s output [1] and the output of this
project.
(2) Compare the algorithms against chosen criteria, such as efficiency and accuracy, to
illustrate the advantages and disadvantages of each algorithm.
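A comparison of this kind could be sketched as follows. Because clustering has no ground-truth labels, the silhouette score is used here as an assumed stand-in for "accuracy", and wall-clock fitting time as a stand-in for "efficiency"; the actual criteria for the project are still to be chosen. The data is synthetic and the two algorithms are examples only.

```python
import time

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(6.0, 6.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0.0, 6.0), scale=0.5, size=(100, 2)),
])

# Candidate algorithms; the selection here is an assumption.
candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters.
    results[name] = {
        "seconds": elapsed,
        "silhouette": silhouette_score(X, labels),
    }

for name, row in results.items():
    print("{}: {:.4f} s, silhouette {:.3f}".format(
        name, row["seconds"], row["silhouette"]))
```

Timings from a single run on a small data set are noisy, so a real comparison would repeat each fit several times and use the project data rather than synthetic points.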
2.4. Optional
These requirements are optional and will not affect the research output of the project.
(1) Choose other social science data that can be visualized in QGIS and clustered with a
suitable clustering algorithm, selected on the basis of the earlier research results.
(2) Analyze the data and present the clustered results clearly as a visualization in QGIS.
3. Technical Specification
The following environment is recommended for the project:
(1) Ubuntu Linux (32-bit or 64-bit)
(2) Python 3.4 with the NumPy and scikit-learn packages
(3) PyCharm Community Edition
(4) QGIS, an open-source Geographic Information System
4. Background Study
4.1. Introduction to Clustering Algorithms
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other than
to those in other groups (clusters) [3]. It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, and computer
graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a cluster
and how to efficiently find them. Popular notions of clusters include groups with small distances
among the cluster members, dense areas of the data space, intervals or particular statistical
distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
The appropriate clustering algorithm and parameter settings (including values such as the distance
function to use, a density threshold or the number of expected clusters) depend on the individual
data set and intended use of the results.
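The point about differing notions of a cluster, and the parameters each notion requires, can be illustrated with a small scikit-learn sketch. The data is synthetic and the parameter values (n_clusters, eps, min_samples) are assumptions picked to suit it, not recommendations for the project data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic 2-D data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Centroid-based notion of a cluster: the expected number of clusters
# must be supplied as a parameter.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based notion of a cluster: the density threshold is set through
# eps (neighbourhood radius) and min_samples; the label -1 marks noise.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_kmeans = len(set(kmeans_labels.tolist()))
n_dbscan = len(set(dbscan_labels.tolist()) - {-1})
print(n_kmeans, n_dbscan)
```

On data this clean both notions agree; they tend to diverge on elongated, nested or noisy groups, which is why the choice of algorithm depends on the data set and the intended use.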
4.2. Introduction to Geodemographic Analysis
Geodemographic classification has been defined as ‘the analysis of people by where they live’;
it involves categorical summary measures that aim to capture the multidimensional characteristics
of small geographical areas [2].
The idea that census outputs could serve to identify and to characterize the geographies of
cities gathered momentum with the increased availability of national census data and the
computational ability to look for patterns in such data [4]. Of particular importance to the emerging
geodemographic industry was the development of clustering techniques to group statistically
similar neighborhoods into classes on a 'like with like' basis. More recently, data have become
available at finer geographical resolutions (such as postal units), often originating from private
commercial (i.e. non-governmental) sources.
4.3. Introduction to PyCharm
PyCharm is an Integrated Development Environment (IDE) used for programming in Python.
It provides code analysis, a graphical debugger, an integrated unit tester, integration with version
control systems (VCSes), and supports web development with Django. PyCharm is developed by
the Czech company JetBrains [5].
It is cross-platform, running on Windows, Mac OS X and Linux. PyCharm has a Professional
Edition, released under a proprietary license, and a Community Edition, released under the Apache
License.
4.4. Introduction to QGIS
QGIS is a cross-platform free and open-source desktop geographic information system (GIS)
application that provides data viewing, editing, and analysis [6]. QGIS allows users to create maps
with many layers using different map projections. Maps can be assembled in different formats and
for different uses. QGIS allows maps to be composed of raster or vector layers. As is typical for this kind
of software, vector data is stored as point, line, or polygon features. Different kinds of
raster images are supported, and the images can be georeferenced.
4.5. Reading List
The reading list so far is as follows:

(1) Brunsdon, C., & Singleton, A. (Eds.). (2015). Geocomputation: A Practical Primer.
SAGE. Chapter 8, Geodemographic Analysis
(2) De Sabbata, S. (2016). Exploring Volunteered Geographic Information using
Geodemographics.
(3) scikit-learn documentation
(4) A comparison between fuzzy c-means and k-means
(5) QGIS user guide
(6) Machine Learning in Action
5. Work Plan
Serial No.  Task                                                                End Date
1.          Gather knowledge about clustering algorithms and developing basic   16/06/2016
            understanding about the project.
2.          Project Description                                                 16/06/2016
3.          Detailed research about algorithms and visualization technologies.  28/06/2016
4.          Preliminary Report                                                  01/07/2016
5.          Start of Development process, implement algorithms with provided    07/07/2016
            data.
6.          Implement algorithms output into QGIS to explore output with        22/07/2016
            visualization
7.          Write interim report and research methods for comparing different   29/07/2016
            clustering algorithms.
8.          Interview with the second marker and compare algorithms, try to     05/08/2016
            analyze more algorithms if it is necessary for final report.
9.          Compare algorithms and visualizations and generate the report for   10/08/2016
            the results.
10.         Final-report Template                                               12/08/2016
11.         Implement optional requirements                                     26/08/2016
12.         Final Report                                                        09/09/2016
13.         Viva                                                                16/09/2016
6. Risk Plan
Risk: During the research process, software crashes caused by technical or external
factors could lead to the loss of files.
Solution: Taking regular backups of project documents and committing to SVN on a daily basis
will reduce the risk.
7. References
1. De Sabbata, S. (2016). Exploring Volunteered Geographic Information using
Geodemographics.
2. Brunsdon, C., & Singleton, A. (Eds.). (2015). Geocomputation: A Practical Primer. SAGE.
3. https://en.wikipedia.org/wiki/Cluster_analysis
4. https://en.wikipedia.org/wiki/Geodemography
5. https://en.wikipedia.org/wiki/PyCharm
6. https://en.wikipedia.org/wiki/QGIS
