
CO7201 MSc Individual Project Department of Computer Science

Department of Computer Science


CO7201-MSc Project

Machine Learning
Pengyuan Liu
CFS ID:pl152

Friday, July 1st, 2016

Word Count - 1492



Table of Contents
1. Introduction
1.1. Motivation
1.2. Objectives
1.3. Challenges
2. Requirements
2.1. Aim
2.2. Essentials
2.3. Recommended
2.4. Optional
3. Technical Specification
4. Background Study
4.1. Introduction to Clustering Algorithms
4.2. Introduction to Geodemographic Analysis
4.3. Introduction to PyCharm
4.4. Introduction to QGIS
4.5. Reading List
5. Work Plan
6. Risk Plan
7. References

DECLARATION
All sentences or passages quoted in this report, or computer code of any form whatsoever

used and/or submitted at any stages, which are taken from other people’s work have been

specifically acknowledged by clear citation of the source, specifying author, work, date and

page(s). Any part of my own written work, or software coding, which is substantially based upon

other people’s work, is duly accompanied by clear citation of the source, specifying author, work,

date and page(s). I understand that failure to do this amounts to plagiarism and will be

considered grounds for failure in this module and the degree examination as a whole.

Name: Pengyuan Liu

Date: 30/06/2016
CO7201 MSc Individual Project Department of Computer Science

1. Introduction

1.1. Motivation

Clustering, a widely used unsupervised learning method in Machine Learning, has proved
useful in geodemographic classification, including its use for exploring the quality of
Volunteered Geographic Information (VGI) [1].

Geodemographic classification has been defined as ‘the analysis of people by where they

live’; [2] it involves categorical summary measures that aim to capture the multidimensional

characteristics of small geographical areas. The analysis of geodemographic classification has

been an applied and often interdisciplinary area of research, with methodological

developments seeking to address real-world problems with spatial dimension. [2]

A recently published study [1] on the use of geodemographic classification for exploring
the quality of Volunteered Geographic Information (VGI) provides a good example of applying
machine learning techniques to data analysis and visualization. The study [1] mainly uses the
fuzzy c-means clustering algorithm to classify the content of OpenStreetMap in Leicestershire.
This project is an opportunity to extend that research by implementing multiple clustering
algorithms to analyze the content of OpenStreetMap, and to foster further analysis in other
social science areas with clustering methods.

1.2. Objectives

The main objectives of this project are as follows:

(1) Produce a research report that builds a deep understanding of Machine Learning,
particularly of clustering algorithms in unsupervised learning.

(2) Implement multiple clustering algorithms in Python, mainly with the scikit-learn package
and supplemented by other packages. Analyze the advantages and drawbacks of the implemented
clustering algorithms.

(3) Realize the visualizations in QGIS, an open-source Geographic Information System, using
the output of the clustering algorithms. Compare the visualizations with the researcher’s
output [1].

(4) Apply clustering algorithms to other social science data.

1.3. Challenges

Researching different clustering algorithms and understanding the mathematical principles of
the implemented algorithms could be quite challenging in itself. The results will also need to
be properly visualized in QGIS in order to compare and analyze the implemented algorithms.

On the more technical side, connecting the output of the implemented algorithms to QGIS and
realizing the visualization could be challenging. The structure of the data to be clustered is
the main concern when implementing the algorithms.

2. Requirements

2.1. Aim

The aim of this project is to research the use of clustering algorithms, and to compare them,
for exploring the quality of Volunteered Geographic Information (VGI).

2.2. Essentials

These requirements represent the basic aim of the project. Without any one of them, the
project might not illustrate the results of the research properly or meet its intended purpose,
which is to explore different clustering algorithms for VGI.

(1) Implement multiple clustering algorithms in Python with the scikit-learn package,
self-implemented algorithms or other supplementary packages.

(2) Build a map of Leicestershire in QGIS with the provided data and the 2011 UK Census
Output Area (OA) boundaries, which can be downloaded via the UK Data Service.

(3) Realize the visualization in QGIS with the implemented algorithms and the provided data.
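
One possible way to hand clustering results to QGIS is a plain CSV table with one row per area, which could then be joined to the boundary layer on a shared identifier. The sketch below is only an illustration under stated assumptions: the function name `write_cluster_csv`, the column names `oa_code` and `cluster`, and the example identifiers are all hypothetical and are not taken from the provided data.

```python
import csv
import io


def write_cluster_csv(oa_codes, labels, out):
    """Write one row per area: its identifier and its assigned cluster."""
    if len(oa_codes) != len(labels):
        raise ValueError("oa_codes and labels must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    # Hypothetical column names; the join field must match the boundary layer.
    writer.writerow(["oa_code", "cluster"])
    for code, label in zip(oa_codes, labels):
        writer.writerow([code, int(label)])


# Hypothetical identifiers and cluster labels, for illustration only.
example_codes = ["OA_0001", "OA_0002", "OA_0003"]
example_labels = [2, 0, 2]

buffer = io.StringIO()
write_cluster_csv(example_codes, example_labels, buffer)
print(buffer.getvalue())
```

In practice the same function could write to a file opened with `open(path, "w", newline="")`. The resulting table could then be loaded in QGIS as a delimited text layer and joined to the OA boundary layer on the identifier column.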


2.3. Recommended

These requirements add extra analysis to the project, making the research more reliable and
more substantial.

(1) Compare the visualizations for the different implemented clustering algorithms in QGIS to
explore the differences between the researcher’s output [1] and the output of this project.

(2) Compare the algorithms against chosen criteria, such as efficiency and accuracy, to
illustrate the advantages and disadvantages of the different algorithms.
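
A comparison of this kind could be sketched as below. This is a minimal illustration, not the project's method: it assumes wall-clock fitting time as the efficiency criterion and the silhouette coefficient as a stand-in for accuracy (clustering has no ground-truth labels to score against), and it runs on synthetic data from `make_blobs` rather than the OpenStreetMap data. The choice of the two algorithms and their parameters is also an assumption.

```python
import time

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the project data (assumption, not the OSM dataset).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Candidate algorithms; the choice and the parameters are illustrative only.
candidates = {
    "k-means": KMeans(n_clusters=4, n_init=10, random_state=42),
    "agglomerative": AgglomerativeClustering(n_clusters=4),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    results[name] = {
        "seconds": elapsed,  # efficiency: time taken to fit and label
        "silhouette": silhouette_score(X, labels),  # cluster quality in [-1, 1]
    }

for name, row in results.items():
    print(f"{name}: {row['seconds']:.4f} s, silhouette {row['silhouette']:.3f}")
```

Timings from a single run are noisy, so a fuller comparison would probably repeat each fit several times and report an average.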

2.4. Optional

These requirements are optional and will not affect the research output of the project.

(1) Choose other social science data that can be visualized in QGIS and clustered with a
suitable clustering algorithm, chosen on the basis of the earlier research results.

(2) Analyze the data and illustrate the findings with a visualization that shows the
clustered results clearly in QGIS.

3. Technical Specification
For the purposes of this research, the following environment is recommended for the project:

(1) Ubuntu Linux (64-bit or 32-bit)

(2) Python 3.4 with the NumPy and scikit-learn packages

(3) PyCharm Community Edition

(4) QGIS, an open-source Geographic Information System


4. Background Study
4.1. Introduction to Clustering Algorithms

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects

in the same group (called a cluster) are more similar (in some sense or another) to each other than

to those in other groups (clusters) [3]. It is a main task of exploratory data mining, and a common

technique for statistical data analysis, used in many fields, including machine learning, pattern

recognition, image analysis, information retrieval, bioinformatics, data compression, and computer

graphics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be

achieved by various algorithms that differ significantly in their notion of what constitutes a cluster

and how to efficiently find them. Popular notions of clusters include groups with small distances

among the cluster members, dense areas of the data space, intervals or particular statistical

distributions. Clustering can therefore be formulated as a multi-objective optimization problem.

The appropriate clustering algorithm and parameter settings (including values such as the distance

function to use, a density threshold or the number of expected clusters) depend on the individual

data set and intended use of the results.
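
To make the point about differing notions of a cluster concrete, the sketch below runs two scikit-learn algorithms on the same synthetic data. It is an illustration under assumptions: the data come from `make_blobs`, not from the project, and the parameter values (`n_clusters=3`, `eps=1.0`, `min_samples=5`) are chosen by hand for this toy example.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated 2-D groups (not project data).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]],
    cluster_std=0.5,
    random_state=0,
)

# Centroid-based notion: the number of expected clusters is the key parameter.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density-based notion: a distance threshold (eps) and a minimum neighbourhood
# size (min_samples) are set instead; the cluster count is an output.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

n_kmeans = len(set(kmeans_labels))
# DBSCAN marks noise points with the label -1, which is not a cluster.
n_dbscan = len(set(dbscan_labels) - {-1})
print("k-means clusters:", n_kmeans)
print("DBSCAN clusters:", n_dbscan)
```

With k-means the analyst fixes the number of clusters in advance, whereas with DBSCAN the number of clusters follows from the density settings, which is one reason the appropriate choice depends on the data set.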

4.2. Introduction to Geodemographic Analysis

Geodemographic classification has been defined as ‘the analysis of people by where they live’;

it involves categorical summary measures that aim to capture the multidimensional characteristics

of small geographical areas [2].

The idea that census outputs could serve to identify and to characterize the geographies of

cities gathered momentum with the increased availability of national census data and the

computational ability to look for patterns in such data [4]. Of particular importance to the emerging

geodemographic industry was the development of clustering techniques to group statistically

similar neighborhoods into classes on a 'like with like' basis. More recently, data have become

available at finer geographical resolutions (such as postal units), often originating from private

commercial (i.e. non-governmental) sources.
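
The study in [1] groups areas with fuzzy c-means, in which each area receives a degree of membership in every class rather than a single label. The sketch below is a minimal NumPy version of the standard fuzzy c-means update rules, written for illustration only; it is not the implementation used in [1]. The function name `fuzzy_c_means`, its parameters and the synthetic two-group data are assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Return (centers, U), where U[i, k] is the membership of point i in cluster k."""
    rng = np.random.default_rng(seed)
    # Start from random memberships, normalised so each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Centers are means of the points weighted by membership^m.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # avoid division by zero
        # Memberships are proportional to dist^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic stand-in data: two separated groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
])
centers, U = fuzzy_c_means(X, n_clusters=2)
hard_labels = U.argmax(axis=1)  # collapse memberships to one label per point
print(centers.round(2))
print(U[:3].round(3))
```

The fuzzifier `m` controls how soft the memberships are: values close to 1 approach the hard assignments of k-means, while larger values spread membership more evenly across clusters.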

4.3. Introduction to PyCharm

PyCharm is an Integrated Development Environment (IDE) used for programming in Python.

It provides code analysis, a graphical debugger, an integrated unit tester, integration with version

control systems (VCSes), and supports web development with Django. PyCharm is developed by

the Czech company JetBrains [5].

It is cross-platform, working on Windows, Mac OS X and Linux. PyCharm has a Professional
Edition, released under a proprietary license, and a Community Edition, released under the
Apache License.

4.4. Introduction to QGIS

QGIS is a cross-platform free and open-source desktop geographic information system (GIS)

application that provides data viewing, editing, and analysis [6]. QGIS allows users to create maps

with many layers using different map projections. Maps can be assembled in different formats and

for different uses. QGIS allows maps to be composed of raster or vector layers. As is typical
for this kind of software, vector data is stored as point, line, or polygon features.
Different kinds of raster images are supported, and the images can be georeferenced.

4.5. Reading List

The reading list so far is as follows:

(1) Brunsdon, C., & Singleton, A. (Eds.). (2015). Geocomputation: A Practical Primer. SAGE. Chapter 8, Geodemographic Analysis.

(2) De Sabbata, S. (2016). Exploring Volunteered Geographic Information using Geodemographics.

(3) scikit-learn documentation

(4) A comparison between fuzzy c-means and k-means

(5) QGIS user guide

(6) Machine Learning in Action


5. Work Plan
Serial No.  Task  End Date

1.  Gather knowledge about clustering algorithms and develop a basic understanding of the project.  16/06/2016

2.  Project Description  16/06/2016

3.  Detailed research about algorithms and visualization technologies.  28/06/2016

4.  Preliminary Report  01/07/2016

5.  Start of the development process; implement algorithms with the provided data.  07/07/2016

6.  Load the algorithms' output into QGIS to explore it through visualization.  22/07/2016

7.  Write the interim report and research methods for comparing different clustering algorithms.  29/07/2016

8.  Interview with the second marker and compare algorithms; analyze more algorithms if necessary for the final report.  05/08/2016

9.  Compare algorithms and visualizations and generate the report for the results.  10/08/2016

10.  Final-report Template  12/08/2016

11.  Implement optional requirements  26/08/2016

12.  Final Report  09/09/2016

13.  Viva  16/09/2016

6. Risk Plan
Risk: During the research process, the software could crash because of technical or external
factors, leading to the loss of files.

Solution: Taking regular backups of the project documents and committing to SVN on a daily
basis will reduce the risk.

7. References
1. De Sabbata, S. (2016). Exploring Volunteered Geographic Information using Geodemographics.

2. Brunsdon, C., & Singleton, A. (Eds.). (2015). Geocomputation: A Practical Primer. SAGE.

3. https://en.wikipedia.org/wiki/Cluster_analysis

4. https://en.wikipedia.org/wiki/Geodemography

5. https://en.wikipedia.org/wiki/PyCharm

6. https://en.wikipedia.org/wiki/QGIS
