Pl152 Preliminary Report
Machine Learning
Pengyuan Liu
CFS ID:pl152
Table of Contents
1. Introduction
2. Requirements
7. References
CO7201 MSc Individual Project Department of Computer Science
DECLARATION
All sentences or passages quoted in this report, or computer code of any form whatsoever
used and/or submitted at any stages, which are taken from other people’s work have been
specifically acknowledged by clear citation of the source, specifying author, work, date and
page(s). Any part of my own written work, or software coding, which is substantially based upon
other people’s work, is duly accompanied by clear citation of the source, specifying author, work,
date and page(s). I understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this module and the degree examination as a whole.
Date: 30/06/2016
1. Introduction
1.1. Motivation
Geodemographic classification has been defined as ‘the analysis of people by where they
live’ [2]; it involves categorical summary measures that aim to capture the multidimensional
A recently published study [1] on the use of geodemographic classification for
exploring the quality of Volunteered Geographic Information (VGI) provides a good example
of applying machine learning techniques to data analysis and visualization. The
study [1] mainly uses the fuzzy c-means clustering algorithm to classify the content of
fostering further analysis in other social science areas with clustering methods.
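To make the fuzzy c-means idea mentioned above concrete, the following is a minimal sketch in NumPy of the textbook form of the algorithm. It is not the implementation used in [1]; the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the stopping tolerance, and the synthetic two-blob data are all assumptions made for illustration.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Textbook fuzzy c-means (illustrative sketch); returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalised to sum to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # floor avoids division by zero
        # Membership update: u_ik is proportional to d_ik^(-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic data (an assumption): two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])

centers, U = fuzzy_c_means(X, n_clusters=2)
hard_labels = U.argmax(axis=1)  # optional "hardening" of the fuzzy memberships
```

Unlike k-means, each observation receives a degree of membership in every cluster, which is why the membership matrix `U` is returned alongside the centers.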
1.2. Objectives
(1) Produce a research report that builds a deep understanding of Machine Learning,
(2) Implement multiple clustering algorithms in Python, mainly with the scikit-learn package
algorithms.
(3) Produce the visualizations in QGIS, an open-source Geographic Information System,
from the output of the clustering algorithms. Compare the visualizations with the
researcher’s output.
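One possible way to hand clustering output to QGIS for objective (3) is to write the cluster labels to a CSV table keyed by an area identifier, which could then be joined to a boundary layer on that field. The sketch below uses only the Python standard library; the column names `oa_code` and `cluster`, the example codes, and the helper `write_cluster_table` are hypothetical and not taken from the project data.

```python
import csv
import io

# Hypothetical area codes and cluster labels, for illustration only.
oa_codes = ["OA_0001", "OA_0002", "OA_0003"]
labels = [0, 2, 1]


def write_cluster_table(oa_codes, labels, handle):
    """Write one row per area: its identifier and its cluster label."""
    writer = csv.writer(handle)
    writer.writerow(["oa_code", "cluster"])
    for code, label in zip(oa_codes, labels):
        writer.writerow([code, int(label)])


# An in-memory buffer keeps the example self-contained; in practice the
# handle would be a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_cluster_table(oa_codes, labels, buffer)
csv_text = buffer.getvalue()
```

Keeping the labels in a separate table, rather than editing the boundary file itself, means each algorithm's output can be swapped in without touching the map geometry.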
1.3. Challenges
implemented algorithms could be quite challenging in itself. The project will need to be proper
algorithms with QGIS and realizing the visualization could be challenging. The structure of the data
that needs to be clustered is the main concern when implementing the algorithms.
2. Requirements
2.1. Aim
The aim of this project is to research the use and comparison of clustering algorithms for
2.2. Essentials
These requirements represent the basic aim of the project; without any one of them, the project
might not illustrate the results of the research properly or meet its intended purpose, which is to
(1) Implement multiple clustering algorithms in Python with the scikit-learn package
(2) Build a map in QGIS for Leicestershire with the provided data and 2011 UK Census OA
(3) Realize the visualization in QGIS with the implemented algorithms and provided data.
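As a rough illustration of requirement (1), the sketch below fits three scikit-learn clustering estimators to the same standardised data through their shared `fit_predict` interface. The choice of algorithms, the parameter values (`n_clusters=3`, `eps=0.3`, `min_samples=5`), and the synthetic data from `make_blobs` are assumptions made for the example; they are not the algorithms or data fixed by this project.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real attribute table (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
# Standardise so that every variable contributes comparably to distances.
X = StandardScaler().fit_transform(X)

# Three estimators with different notions of what a cluster is.
models = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=0.3, min_samples=5),
}

# fit_predict returns one integer label per row; DBSCAN marks noise as -1.
results = {name: model.fit_predict(X) for name, model in models.items()}
```

Because every estimator exposes the same `fit_predict` call, further algorithms can be added to the dictionary without changing the rest of the pipeline.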
2.3. Recommended
These requirements add extra analysis to the project, making the research more reliable and
more substantial.
explore the differences between the researcher’s output [1] and the output in this
project.
(2) Compare the algorithms against chosen criteria, such as efficiency and accuracy for
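A comparison along the lines of requirement (2) could be sketched as follows. Efficiency is approximated here by wall-clock fitting time, and, since clustering has no ground-truth labels in general, the silhouette score is swapped in as a stand-in for "accuracy". Both measures, the two candidate algorithms, and the synthetic data are assumptions for illustration rather than the criteria adopted by this project.

```python
import time

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

comparison = {}
for name, model in candidates.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    comparison[name] = {
        "seconds": elapsed,                           # efficiency proxy
        "silhouette": silhouette_score(X, labels),    # cluster-quality proxy in [-1, 1]
    }

for name, row in comparison.items():
    print(f"{name}: {row['seconds']:.4f} s, silhouette = {row['silhouette']:.3f}")
```

Timings from a single run are noisy, so a fuller comparison would repeat each fit several times and report an average.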
2.4. Optional
These requirements are optional and will not affect the research output of the project.
(1) Choose other social science data that can be visualized in QGIS and clustered with a
suitable clustering algorithm, chosen on the basis of the earlier research
results.
(2) Analyze and illustrate the research on the data with the visualization to show the
3. Technical Specification
For the purposes of the required research, the following are recommended for the project:
4. Background Study
4.1. Introduction to Clustering Algorithms
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other than
to those in other groups (clusters) [3]. It is a main task of exploratory data mining, and a common
technique for statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data compression, and computer
graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a cluster
and how to efficiently find them. Popular notions of clusters include groups with small distances
among the cluster members, dense areas of the data space, intervals or particular statistical
The appropriate clustering algorithm and parameter settings (including values such as the distance
function to use, a density threshold or the number of expected clusters) depend on the individual
Geodemographic classification has been defined as ‘the analysis of people by where they live’;
it involves categorical summary measures that aim to capture the multidimensional characteristics
The idea that census outputs could serve to identify and to characterize the geographies of
cities gathered momentum with the increased availability of national census data and the
computational ability to look for patterns in such data [4]. Of particular importance to the emerging
similar neighborhoods into classes on a 'like with like' basis. More recently, data have become
available at finer geographical resolutions (such as postal units), often originating from private
It provides code analysis, a graphical debugger, an integrated unit tester, integration with version
control systems (VCSes), and supports web development with Django. PyCharm is developed by
Edition, released under a proprietary license and a Community Edition released under the Apache
License.
QGIS is a cross-platform free and open-source desktop geographic information system (GIS)
application that provides data viewing, editing, and analysis [6]. QGIS allows users to create maps
with many layers using different map projections. Maps can be assembled in different formats and
for different uses. QGIS allows maps to be composed of raster or vector layers. As is typical for this kind
of software, vector data is stored as point, line, or polygon features. Different kinds of
Geodemographics.
5. Work Plan
Serial No.    Task    End Date
6. Risk Plan
Risk: During the research process, a software crash could happen due to technical/outer
Solution: Take regular backups of the project documents and make svn commits on a daily basis
7. References
1. De Sabbata, S. (2016). Exploring Volunteered Geographic Information using Geodemographics.
2. Brunsdon, C., & Singleton, A. (Eds.). (2015). Geocomputation: A Practical Primer. SAGE.
3. https://en.wikipedia.org/wiki/Cluster_analysis
4. https://en.wikipedia.org/wiki/Geodemography
5. https://en.wikipedia.org/wiki/PyCharm
6. https://en.wikipedia.org/wiki/QGIS