Demographics of Crime Trends in London
Akhilesh Pandey
Abstract: Criminal activity has a negative impact on
society. It affects the people in the area and can influence
the economic and social growth of the locality. It is hence important
to explore and understand the causes as well as the impact
of crime on the locality. We propose that crime is affected
by local demographics such as employment rate, education
level, poverty, etc. This paper aims to analyze the effects of
demographics on crime patterns in the region of London
using various regression and classification methods.
I. INTRODUCTION
Crime is defined as any action which is unlawful and
hinders the working of society [6]. Controlling crime is
hence considered an important
duty of the government. Various methods are implemented
to achieve this goal. These have been broadly classified into
two groups: dissuasion techniques and policing techniques.
While the former focus on a stricter system of punishment, the
latter focus on the prevention of crime. Crime Analysis, the
systematic study of crime within the field of criminal
justice [1], is a part of the latter group.
While this is a wide field of study, Crime Mapping focuses
on the demographics of the crime incidents. With the availability
of strong analytic tools and access to large
amounts of data, it has become an important part of policing
in any city, as it can help security officials understand
the crime trends at various locations. It can also help with
administrative issues such as staffing and deploying
countermeasures to tackle issues in a pre-emptive manner.
Criminal activities in a region also affect other aspects of
society, such as development activities in the region, the
cost of property, etc. Therefore, it is interesting from an
economic perspective to understand the relation between
the different crimes in a region and the factors affecting them.
This paper looks into the data of crime incidents that have
been recorded in London. We will also look at the demographics
of different regions and try to identify how different
factors can be associated with the number of incidents
occurring in them.
II. BACKGROUND
Crime Analysis has been used to understand the crime
mapping of different regions. The terms pertaining
to this paper are:
LSOA Codes: These are the Lower Layer Super Output
Area Codes, which are used for reporting statistics for
small areas in the United Kingdom. Each borough in
London has been divided into subdivisions. The
London area has 4584 LSOA regions.
Crime Types: This covers the different types of criminal
activities which have been recorded. Our data set has
categorized all the criminal activities into 11 different
categories.
Index of Deprivation: This is an index which is a
measure of the deprivation of an area. It was introduced
by the UK government in 2007. These rankings are
available for different boroughs as well as LSOAs. It
is a cumulative ranking of a region based on various
factors of poverty.
Demographics: These are the sets of attributes which
we will be using to understand the socioeconomic
statistics of LSOAs. These attributes include various
properties, such as Average Income, Unemployment
Rate, Education Distribution, Population Distribution, etc.
III. THE DATA SETS
To observe the relationship, three data sets were used: the
Demographics Data, the Street Level Crime Data and the
Adjusted IMD Scores.
A. The Demographics Records
This data set contains the records for the various attributes
that will be required in our analyses. The data set has been
extracted from the Census 2011 data set and is available
at various levels. We will be using the data given at the
LSOA level. The census data set is available at the website
of the Office for National Statistics [4]. This data file is based on
Census 2011 and hence we will use the crime statistics of
this year for our analysis. A 16 MB Excel file is available
for download. This data is available for every
LSOA region in London. The important attributes are the
Population, Racial Distribution, Education Distribution and Mean
Income of the residents of each LSOA. These represent the
Demographics discussed in the previous section.
B. The Crime Records
This data set contains the records of the crime incidents
that have been recorded by the Police. It is available at the Police
Data website [5]. The data can be extracted by selecting
the Forces, from among the different police forces present
in the UK, and the range of dates for which data is required. We have
selected the City of London Police and the Metropolitan Police
Service, and all the months of the year 2011,
for our analysis. Combining the data for the different
months generated an Excel file of 63 MB. The data records
the criminal events and classifies them into 11
different categories. The date of each event is presented in
mm/yy format and the events have been broken down to LSOA level.
The location of the incident is also presented in the form of
geo coordinates.
C. Adjusted IMD Scores 2010
This data set provides the Deprivation Score for all the
LSOAs of London. This is a cumulative ranking of various
factors such as poverty, health, etc. The data set is available
at the website of the ONS [4]. A 4.7 MB Excel file is generated
for the LSOA level Deprivation Index data for London.
We will use this data to understand how poverty and other
factors affect the crime rate/Number of Instances in a city.
Although the data set is for 2010, we assume that the
IMD scores do not vary appreciably within one year and hence
that the data set is usable. This data set contains the LSOA codes for which the
deprivation index value has been provided, and hence it can be
easily combined with the other data sets.
IV. LITERATURE REVIEW
Crime Analysis has been explored extensively and many
studies have been published to explain the factors that influence crime
in a region.
A. Mapping crime: Understanding hotspots
In this paper, Eck et al. have analyzed the distribution of
crime incidents across different regions in the US. They explain
that the crime in a large region is always clustered in
specific regions [2]. They have named these regions the
hotspots in the map.
B. Residential Burglaries and Neighbourhood Socioeconomic Context in London, Ontario
In this paper, Malczewski and Poetz have analyzed the burglary
events in the city of London, Ontario. They explain that the burglaries
in different regions are affected by the neighbourhood
regions [3] and have tried to develop a model to predict the
risk of burglaries based on various demographic factors.
V. HYPOTHESIS
We will try to extend the model proposed by Malczewski and
Poetz [3] and apply it to the overall crime rate in
London. We hypothesize that the crime in a region is affected
by the unemployment rate, the overall number of people of
working age, the population density, education, the
median income of the population and the racial composition
of the locality. It is also affected by the ratio of the youth
population to the overall population in the region. Specifically:
1) The areas with a higher Deprivation Index may have a
higher crime rate.
2) The areas with a lower unemployment rate, a lower
percentage of primary school qualified people and more
people of working age may have a higher crime rate.
3) We will try to study the relationship between racial
distribution and crime rate.
4) We will try to understand the effects of population and
total area on the crime rate.
VI. TOOLS
We will be using various tools for our analyses. The three
data sets were received in Excel format and were converted
into .csv format for analysis.
A. Weka
Weka is a powerful Graphical User Interface tool for data
analysis from the University of Waikato. We will be using it for
handling outliers as well as for classification and regression.
B. R
R is an open source statistical tool. It has many packages
which can be used to implement different functionalities. We
will be using R for mapping our data, joining the different
data sets and further analyses.
C. Microsoft Excel
Excel is a commonly used spreadsheet tool which is used
for storing data as well as calculating parameters of a
distribution. We will be using Excel for combining the data from
different months for the Crime Data set. We will also be using
Excel for data formatting, data visualization and basic data
manipulation.
D. SQL
SQL (Structured Query Language) was originally developed at IBM. It is used
for handling data which is structured and
highly organized. We will use it for data cleaning
as well as some basic querying of the data.
VII. DATA PRE-PROCESSING AND CLEANING
All the Excel files were converted to .csv format as
this is the most used format in our tools. The two data
sets Census and Adjusted IMD scores were clean and
ready to be processed but the crime data had issues and
some modifications were required for Crime records 2011.
The data cleaning and pre-processing procedure has been
explained below:
A. The Crime Records 2011
We removed the entries which had no Longitude and
Latitude values, since the coordinates are needed to plot the regions
of crime. This data set had some of the incidents reported
twice under the same category: some entries
had the same coordinates as well as the same crime type and
month. We can be reasonably sure that these are duplicate records
rather than distinct events at the same location, as the coordinates of the incidents
were the same up to two decimal places and the other details
matched as well. It is possible that the duplication was caused by
the recording of incidents at different stations. Another possible
reason could be that the crime involved many people and
hence was recorded more than once. The duplicate data was removed by
using SQL queries. The query used was
SELECT DISTINCT "Longitude", "Latitude",
    "Crime.Type", "Date", "LSOA.Code", "LSOA.Name"
FROM CrimeData
GROUP BY "Longitude", "Latitude",
    "Crime.Type", "Date", "LSOA.Code", "LSOA.Name"
This reduced the number of incidents significantly.
The CrimeId, Context and Last Outcome columns were
removed as they do not provide any significant information.
We were then able to plot this data on the map of Greater
London using R.
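The two cleaning steps above (dropping rows without coordinates and removing duplicate rows) can also be sketched outside SQL. The following is a minimal Python illustration, not the procedure used in this paper. The sample rows and LSOA codes are made up for the example; only the column names follow the query above.

```python
import csv
import io

# Hypothetical sample rows; only the column names follow the query above.
SAMPLE = """Longitude,Latitude,Crime.Type,Date,LSOA.Code,LSOA.Name
-0.1357,51.5101,Burglary,2011-03,E01004736,Westminster 013B
-0.1357,51.5101,Burglary,2011-03,E01004736,Westminster 013B
,,Robbery,2011-03,E01004736,Westminster 013B
-0.0983,51.5145,Robbery,2011-04,E01000001,City of London 001A
"""

KEY_COLUMNS = ["Longitude", "Latitude", "Crime.Type",
               "Date", "LSOA.Code", "LSOA.Name"]


def clean_crime_rows(rows):
    """Drop rows without coordinates, then keep one copy of each duplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        # Rows with a missing longitude or latitude cannot be plotted.
        if not row["Longitude"] or not row["Latitude"]:
            continue
        # Two rows are duplicates when every key column matches.
        key = tuple(row[col] for col in KEY_COLUMNS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean_crime_rows(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

Like SELECT DISTINCT, this treats rows as duplicates only when all listed columns match exactly. Matching coordinates to two decimal places, as discussed above, would require rounding them before building the key.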
We then removed the Crime Types in order to aggregate
the data for each LSOA. This gave us the total count of all
the criminal activities recorded in an LSOA. Fig. 1 shows
the number of different crime events recorded in London in
2011.
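The aggregation described above amounts to two counts: one per crime type and one per LSOA once the crime type is ignored. A minimal Python sketch follows; the records are hypothetical and serve only to show the shape of the computation.

```python
from collections import Counter

# Hypothetical cleaned records as (LSOA code, crime type) pairs.
records = [
    ("E01000001", "Burglary"),
    ("E01000001", "Robbery"),
    ("E01000002", "Burglary"),
    ("E01000002", "Burglary"),
    ("E01000003", "Vehicle crime"),
]

# Number of events per crime type.
events_per_type = Counter(crime_type for _code, crime_type in records)

# Total number of events per LSOA, ignoring the crime type.
events_per_lsoa = Counter(code for code, _crime_type in records)

print(events_per_type.most_common())
print(events_per_lsoa.most_common())
```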
Fig. 1: Different Crime Events recorded in 2011
B. The Census Data 2011
The Census data was largely ready for use and needed little
pre-processing. The column names had to be modified to
be used in Weka as it does not accept special symbols. The
missing values present in different attributes were replaced
with 0.
The Age Distribution was given as the number of people in
different age categories. This data is not very useful on its own as it
does not describe the relative distribution of age, and hence we
added a column giving the percentage to understand the
distribution better.
The Census data set was combined with the other two data
sets by using the common LSOA Code and Name values in the
case of the Crime data set and the LSOA code in the case of the IMD Scores data
set. A simple change in the column names was required, as
the labels were LSOA.names and LSOA.codes in one data set and
Codes and Names in the other. We will analyze
the three data sets for our investigation.
C. Adjusted IMD Scores 2010
This data set consists of three columns, including the LSOA code
and a Deprivation Index, and hence it is a fairly small data
set. It is ready for use and does not require any cleaning or
pre-processing.
D. London Data
This data set is formed by merging the three data sets.
All three data sets have the LSOA code, which serves as the
primary key for the data set. Hence, we can easily merge
the data sets based on this value. This operation is performed
by using the plyr package in R. After removing the redundant
columns created by the merge, our data set is prepared.
The population and area of each LSOA are different and this
can directly affect the number of crime incidents in the area.
So we add a column, Crime Rate, which gives us the
number of crime incidents per person.
VIII. DATA ANALYSIS
A. Crime Data set
The Crime Data set gives us the locations of the incidents
that have taken place. Once the data cleaning
was performed, we plotted the data on a map of London
in order to visualize the distribution of crime incidents. The
data was plotted with the help of spatstat in R. Fig. 2 shows
the different crime incidents and their locations on the London
map.
Fig. 2: Distribution of Crime Events Across various regions in Greater London
It can easily be seen that most of the data points are
concentrated near the Westminster region, and hence we
can say it has the highest number of crime events recorded.
One of the reasons could be a higher rate of reporting of
incidents than in other areas. Another reason could
be that this region is highly populated. We will generalize our
data set from LSOA regions to Boroughs in order to present
the exact figures of crimes recorded in different regions. Fig.
3 shows the number of incidents that were recorded during
2011 for different Boroughs.
This table is consistent with our observations from the
crime mapping, as Westminster records the highest number
of crime incidents reported.
Fig. 3: Crime events recorded during 2011 for different Boroughs
B. Crime Rate
The Crime Rate of an LSOA is the simple ratio of the number
of crime events to the total population of the region. We
will examine how the Crime Rate varies across
the LSOAs. The Crime Rate has a very wide range of
values (0.0012 - 1.559), with a mean of 0.0493, a median
of 0.0426 and a standard deviation
of 0.041. Fig. 4 shows the cumulative distribution of Crime
Rates among different LSOAs.
Fig. 4: Cumulative Distribution of Crime Rate
The distribution is a lognormal distribution and the Crime
Rate increases sharply after the 90th percentile, at which the
value is approximately 0.122. We can infer that there are
some LSOAs where the concentration of crime is very high
compared to other LSOAs. This result is also consistent with
our mapping of crime events, which were also concentrated in
certain regions.
C. London Data Set
1) Age: The London Data gives us information about the
age distribution across various LSOAs. We will look at the
regions with a high percentage of youth population, i.e., the
people aged between 16 and 45. The average percentage
of youth population was 23%. The average
Crime Rate for regions with a higher than average youth
percentage was 0.061, and hence a positive relation between the two
factors can be observed.
2) Qualification: The London Data provides us with
information about the population by highest level of
qualification. We will consider the
lowest level of education, which is given as Highest Level
of Education: Level 1. We have converted this data to a
percentage of the total population. The average value of this
data was 10.89%. The average Crime Rate in regions with
less than the average qualification value was 0.057.
3) Unemployment Rate: This gives us the information
about the unemployment rate in an LSOA, given as a
percentage of total youth in the working age. The average
unemployment rate across various LSOAs was 7.43%.
The average Crime Rate for regions with a higher than average
unemployment rate was 0.055.
4) Mean Income of Household: This gives us the
information related to the mean income of households in an
LSOA. The average of Mean Income across all the LSOAs
was 46184. The average Crime Rate for LSOAs below the
average employment rate was 0.045.
5) Deprivation Index: This gives us the information about
the overall lifestyle of people in a particular area. The
average value of the deprivation index for LSOAs in London
was 0.18, and the average value of the Crime Rate for
LSOAs with a deprivation index less than the average value was
0.050.
6) Racial Distribution: We will be using the Non-White
percentage in an LSOA for our analysis. This can
be easily calculated as 100 - White Percentage in our
data set. The average value of the Non-White percentage was
39.28%, and the average value of the Crime Rate for LSOAs with a
higher than average percentage of Non-Whites was 0.051.
7) Area: This gives us the information regarding the
area in hectares of the different LSOAs in our table. The
average area of LSOAs in London was 32.547 hectares.
The average Crime Rate for LSOAs with more than the average
area was 0.054.
IX. DATA MODELING
We will create different models to understand the influence
of the attributes on the crime rate. These models are created
using Weka.
A. Correlation
Correlation measures how strongly one attribute is related
to one or more other attributes. It is a powerful
statistical tool which can be used to predict the behaviour
of an attribute based on the values of other attributes. The
correlation value reported here lies between 0 and 1, where 0 signifies no
relation and 1 signifies complete dependence. We
run a Multiple Linear Regression model on our data set
to predict the Number of Incidents
from the demographic attributes which we have discussed
above. We built the model using a 66% split,
in which the data set is divided into two parts: two
thirds are used for building the model and one third is
used for testing.
The model has a correlation coefficient of 0.5602. This
means there is only a weak relationship between the set of
attributes and the Number of Incidents.
Fig. 5: Multiple Linear Regression Output from our model in Weka
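The evaluation procedure for the regression model (fit on roughly two thirds of the data, then correlate predictions with actual values on the remaining third) can be illustrated with a short Python sketch. This is not Weka's implementation and does not reproduce the results above. The data is synthetic, the two predictors merely stand in for attributes such as unemployment rate and deprivation index, and all function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(xs, ys):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


# Synthetic data: two made-up predictors and a noisy linear response.
rng = random.Random(0)
xs = [(rng.uniform(0, 20), rng.uniform(0, 1)) for _ in range(300)]
ys = [5.0 + 2.0 * u + 30.0 * d + rng.gauss(0, 10) for u, d in xs]

# 66% of the rows build the model; the rest are held out for testing.
split = int(len(xs) * 0.66)
coef = fit_linear(xs[:split], ys[:split])
predicted = [predict(coef, x) for x in xs[split:]]
r = pearson(predicted, ys[split:])
print("correlation on held-out data:", round(r, 4))
```

The sketch takes the first 66% of rows as the training set, which is adequate here because the synthetic rows are already in random order. Real data should be shuffled before splitting.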
B. Classification
In order to create a classification model, we add a
Class attribute. This is a binary attribute which
is set to 1 for a Crime Rate above the average
value and 0 for a Crime Rate below the average value. A
similar process was carried out for the other numeric attributes
as well: values less than the average
value were given a Low value and those more than the
average value were given a High value.
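The thresholding step above can be sketched in a few lines of Python. The crime rates below are hypothetical, and the handling of values exactly equal to the average (assigned to the lower class here) is an assumption, since the text does not specify it.

```python
def label_by_mean(values, low, high):
    """Label each value as `high` if it exceeds the mean, otherwise `low`."""
    mean = sum(values) / len(values)
    # Assumption: values exactly equal to the mean fall in the lower class.
    return [high if value > mean else low for value in values]


# Hypothetical crime rates for four LSOAs.
crime_rates = [0.02, 0.04, 0.05, 0.13]

class_labels = label_by_mean(crime_rates, low=0, high=1)
text_labels = label_by_mean(crime_rates, low="Low", high="High")
print(class_labels)
print(text_labels)
```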
A K-Nearest Neighbor model assumes
that instances which lie close to each other behave in a similar way, and hence
assigns an instance the class most common among its K nearest neighbors. K-nearest
neighbor models were built for different values of k. The highest
number of correctly classified instances was achieved for
k=11. This model was able to classify 71.13% of instances
correctly. The True Positive (TP) rate of a class is a
measure of classification quality which gives the fraction of
the instances actually belonging to that class (0 or 1)
that the model assigned to it, while the precision gives the
fraction of the instances assigned to a class that actually
belong to it. Our
model gave a TP Rate of 0.865 for 0 with a precision of
0.719 and a TP Rate of 0.526 for 1 with a precision of
0.656. However, the Kappa statistic (κ) value of this model
was low (κ = 0.3888).
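The nearest-neighbor rule described above can be illustrated with a minimal Python sketch. It is not Weka's implementation. The training points and the use of Euclidean distance are assumptions made for the example, and k=3 is used only because the toy data set is tiny.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two scaled attributes and a binary class.
train = [
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.0, 0.3), 0),
    ((0.9, 0.8), 1), ((1.0, 1.0), 1), ((0.8, 0.9), 1),
]

print(knn_predict(train, (0.15, 0.15), k=3))
print(knn_predict(train, (0.95, 0.90), k=3))
```

An odd k avoids tied votes in a two-class problem. Attributes measured on very different scales should be normalized first so that no single attribute dominates the distance.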
The κ value is a measure of the robustness of a model. It
compares the measured accuracy of a model with the accuracy
expected by chance. A Kappa statistic value of 1 signifies
that the predicted classes and the actual classes
are in complete agreement, whereas a
value of 0 means the agreement is no better than chance.
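The measures discussed above can all be derived from a confusion matrix. With observed accuracy p_o and chance agreement p_e, computed from the row and column totals, the Kappa statistic is κ = (p_o - p_e) / (1 - p_e). The Python sketch below uses a hypothetical 2x2 matrix chosen for easy arithmetic. It is not one of the confusion matrices reported in this paper.

```python
def classification_report(matrix):
    """Compute accuracy, kappa, per-class TP rate and per-class precision.

    matrix[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    # Observed agreement: fraction of instances on the diagonal.
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    tp_rate = [matrix[i][i] / row_totals[i] for i in range(n_classes)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n_classes)]
    return accuracy, kappa, tp_rate, precision


# Hypothetical confusion matrix: rows are actual classes 0 and 1.
example = [[40, 10],
           [20, 30]]
accuracy, kappa, tp_rate, precision = classification_report(example)
print(accuracy, round(kappa, 4), tp_rate, precision)
```

In this example the accuracy is 0.7 while the chance agreement is 0.5, so κ is 0.4. This shows how a model can classify most instances correctly and still have a modest κ.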
Fig. 6: Confusion Matrix for K-Nearest Neighbor Classifier
A J48 decision tree uses the C4.5 algorithm to build a
decision tree. It builds the tree by splitting the data based
on the values of different attributes such that the information
gain is maximum at each split. The J48 tree, when applied
to our data set, was able to classify
69.55% of the data correctly. The TP Rate for 0 of this model
was 0.818 with a precision of 0.757 and that for 1 was
0.576 with a precision of 0.662. The κ value for this model
was 0.410.
Fig. 7: Confusion Matrix for J48 tree Classifier
However, the J48 tree suffers from a drawback, as it is prone to
overfitting when there is a large number of attributes, as in the case of our
data set. This would mean that although the model seems correct
for the given data, it might not be able to give the same
result for another random data set. Fig. 8 shows the J48 tree
created.
Fig. 8: Decision tree created for J48 tree Classifier
A Logistic Classifier aims to predict the response of a
decision variable based on the values of other variables. This
is particularly useful in cases of binary decision variables
such as Pass/Fail, High/Low, etc. We have constructed such a
binary decision variable. The Logistic model for our data
set was able to classify 73.32% of the data correctly, with a TP
Rate of 0.841 and a precision of 0.755 for 0 and a TP Rate of
0.559 and a precision of 0.686 for 1. The κ value for this
model was 0.434.
Fig. 9: Confusion Matrix for Logistic Classifier
X. CONCLUSION
The aim of the paper was to study the effects of demographics
on the Crime Rate/Number of Incidents for a
region. We have developed a linear model which showed only a
weak relation with the demographic patterns across various
regions. It can be concluded that the Crime Rate over a
region is also affected by factors beyond the demographics.
We plotted the number of crime incidents across different
boroughs and were able to determine the regions with a high number of
events. We were able to map the crime events over the region
of London and identify regions which have more incidents
of crime than other regions. The cumulative
distribution of the Crime Rate also shows that the crime events
in certain areas were higher than in others.
Although the deprivation index is an indicator of poverty in a
location, it fails to capture the disparity in income
within an LSOA. Including the disparity in income would help us improve
the model, as it is one of the biggest factors
affecting crime and is therefore a better indicator of
poverty in a region. This would increase the correlation
between the attributes of a region and its crime rate.
The mapping also does not take into account that the cost of living
differs between regions. This is important as it can
tell us whether the average income in a region is good or
bad financially.
XI. EXTENSIONS
This model is a framework for an approach to crime analysis.
We have considered the demographics of London.
The following extensions to the research are possible:
1) We can profile different localities by their commercial
significance, e.g., industrial areas such as IT parks,
factories, etc. are less susceptible to crime incidents.
However, these areas may have a higher population density,
as people from different regions might temporarily
move to such places.
2) We can also study the effect of the police profile associated
with different regions, e.g., how the number of police
officers per 1000 people affects the crime rates for
different regions.
3) With the improvement in technology and the availability
of the internet, a new crime trend has emerged. Cyber
crime pertains to criminal activities which
are committed over the internet. Can this model be
extended to cover such activities?
REFERENCES
[1] Santos, Rachel Boba. Crime analysis with crime mapping. Sage,
2012. pp. 1.
[2] Eck, John, et al. Mapping crime: Understanding hotspots. (2005):
1-71.
[3] J. Malczewski and A. Poetz, Residential Burglaries and Neighborhood Socioeconomic Context in London, Ontario: Global and Local
Regression Analysis*, The Professional Geographer, vol. 57, no. 4,
pp. 516-529, 2005.
[4] Ons.gov.uk, Office for National Statistics (ONS) - ONS, 2015.
[Online]. Available: http://www.ons.gov.uk/ons/index.html. [Accessed:
13- Dec- 2015].
[5] Data.police.uk, Home | data.police.uk, 2015. [Online]. Available:
http://data.police.uk. [Accessed: 13- Dec- 2015].
[6] C. Block and R. Block, "Crime Definition, Crime Measurement, and
Victim Surveys", Journal of Social Issues, vol. 40, no. 1, pp. 137-159,
1984.
