CRIME DATA ANALYSIS USING MACHINE LEARNING
A PROJECT REPORT
Submitted by
SURENDHAR A (310820104097)
RAHUL S (310820104076)
PRAVEEN G (310820104073)
of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

MAY 2024
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this project report “CRIME DATA ANALYSIS USING MACHINE
LEARNING” is the bonafide work of SURENDHAR A (310820104097), RAHUL S
(310820104076), and PRAVEEN G (310820104073), who carried out the project under my
supervision.
CERTIFICATE OF EVALUATION

1. SURENDHAR A (310820104097)
2. RAHUL S (310820104076)
3. PRAVEEN G (310820104073)

The reports of the project work submitted by the above students, in partial fulfilment
for the award of the Bachelor of Engineering degree in Computer Science and Engineering of
Anna University, Chennai, were evaluated and confirmed to be reports of work done by the
above students.
ACKNOWLEDGEMENT
We also thank the Teaching and Non-Teaching staff members of the
Department of Computer Science and Engineering for their constant support.
We also acknowledge, with a deep sense of reverence, our gratitude towards our
parents and the members of our family, who have always supported us morally as
well as economically.
ABSTRACT
Crime data analysis using machine learning harnesses advanced algorithms to scrutinize
extensive datasets, uncovering patterns and trends in criminal behavior. One primary
application is predicting crime hotspots by analyzing historical data to identify spatial and
temporal patterns. This enables proactive resource allocation, enhancing public safety and
reducing crime rates in vulnerable areas.

Machine learning also aids in detecting suspicious behavior by flagging deviations from
normal patterns within datasets. For instance, anomaly detection algorithms can identify
unusual financial transactions or atypical social media activity, providing valuable leads for
law enforcement. Automating this process improves the efficiency of investigations and helps
identify anomalies in criminal activity that may evade traditional methods. By analyzing
diverse data sources, such as social media and surveillance footage, algorithms can detect
emerging patterns, enabling law enforcement to stay ahead of evolving criminal trends and
effectively combat new threats. Success depends on careful data preprocessing, feature
selection, and algorithm choice. Ensuring data quality and addressing privacy concerns are
essential, as is mitigating algorithmic bias to ensure fair and ethical outcomes.

In summary, crime data analysis using machine learning offers significant potential for
enhancing public safety and reducing crime rates. By leveraging advanced algorithms, law
enforcement can gain valuable insights into criminal behavior, enabling proactive
interventions and effective resource allocation. Adhering to best practices ensures fair and
ethical outcomes while maximizing the benefits of this innovative approach to crime
prevention.
ABSTRACT (TAMIL)
TABLE OF CONTENTS

CHAPTER NO.  TITLE  PAGE NO.

ABSTRACT  5
ABSTRACT (TAMIL)  6
LIST OF FIGURES  9
LIST OF ABBREVIATIONS  10
1. INTRODUCTION  11
1.1 INTRODUCTION  11
1.2 BACKGROUND  12
1.3 SIGNIFICANCE OF CRIME RATE PREDICTION  12
1.4 ROLE OF MACHINE LEARNING  13
1.5 PREDICTIVE MODELING  13
1.6 FEATURE ENGINEERING  14
1.7 EXPLORATORY DATA ANALYSIS  14
1.8 EVALUATION METRICS  15
1.9 INTEGRATION OF MACHINE LEARNING IN LAW ENFORCEMENT  15
1.10 CHALLENGES AND LIMITATIONS  16
2. LITERATURE SURVEY  17
3. SYSTEM ANALYSIS  19
3.1 EXISTING SYSTEM  19
3.2 PROPOSED SYSTEM  19
3.3 FEASIBILITY STUDY  22
3.4 REQUIREMENT SPECIFICATION  24
3.5 LANGUAGE SPECIFICATION - PYTHON  24
4. SYSTEM DESIGN  27
4.1 SYSTEM ARCHITECTURE  27
4.2 DATA FLOW DIAGRAM  27
4.3 USE CASE DIAGRAM  29
4.4 ACTIVITY DIAGRAM  30
4.5 SEQUENCE DIAGRAM  31
4.6 CLASS DIAGRAM  31
5. MODULE DESCRIPTION  33
5.1 MODULE 1  33
5.2 MODULE 2  35
5.3 MODULE 3  38
6. TESTING  42
6.1 TYPES OF TESTING  42
6.2 TESTING TECHNIQUES  43
7. CONCLUSION  48
8. APPENDIX - CODING  49
LIST OF FIGURES
LIST OF ABBREVIATIONS

EDA  EXPLORATORY DATA ANALYSIS
ML   MACHINE LEARNING
AI   ARTIFICIAL INTELLIGENCE
NP   NUMPY
PD   PANDAS
RF   RANDOM FOREST
CHAPTER 1
INTRODUCTION
1.1 Introduction
This study endeavors to delve into the application of machine learning techniques in crime
rate prediction and analysis, aiming to harness the potential of data-driven methodologies. The
primary goal is to develop robust predictive models capable of estimating crime rates by
considering diverse factors, including but not limited to historical crime records, socio-
economic indicators, and geographical data. Leveraging a repertoire of machine learning
algorithms such as decision trees, support vector machines, and neural networks, the research
seeks to dissect and unearth patterns embedded within the collected data. By discerning the
most influential factors shaping crime rates, these models promise to furnish invaluable
insights into the intricate dynamics of criminal behavior.
Furthermore, the study will direct its focus towards scrutinizing crime patterns and trends to
unearth spatial and temporal correlations, hotspots, and emerging patterns. Employing an
array of exploratory data analysis techniques like clustering, visualization, and association rule
mining, the research aims to unveil concealed patterns and relationships lurking within the
vast expanse of crime data. Such in-depth analysis holds the potential to augment our
comprehension of crime dynamics and facilitate the crafting of proactive strategies for crime
prevention.
In essence, this study endeavors to harness the power of machine learning to not only predict
crime rates accurately but also to unravel the underlying intricacies of criminal behavior.
Through comprehensive analysis and exploration of crime data, the research aims to furnish
stakeholders with actionable insights to bolster their efforts in combating crime and fostering
safer communities.
1.2 Background
Crime's global prevalence presents daunting challenges across societies. Precise
prediction and analysis of crime rates are pivotal for formulating impactful prevention
strategies, optimizing resource allocation, and shaping policies. In recent times, machine
learning methodologies have risen to prominence in crime rate prediction and analysis,
capitalizing on data-driven methodologies to deepen insights into criminal behaviors. This
technological leap enables a nuanced comprehension of crime dynamics, empowering
stakeholders to deploy targeted interventions and proactive measures.
1.3 Significance of Crime Rate Prediction

Accurate crime rate prediction and analysis serve as crucial tools for understanding
and addressing criminal activities. Through sophisticated data analysis techniques, patterns
and trends in crime can be identified, allowing law enforcement agencies, policymakers, and
urban planners to anticipate and respond effectively to emerging threats. By pinpointing
hotspots of criminal activity, resources can be allocated strategically to deter crime and
enhance public safety. Moreover, analyzing the underlying factors contributing to crime
enables the implementation of targeted interventions, such as community outreach programs
or socioeconomic initiatives, to address root causes and prevent recidivism. Ultimately, the
insights gleaned from crime rate prediction and analysis empower stakeholders to make
informed decisions and collaborate in creating safer, more resilient communities.
1.4 Role of Machine Learning in Crime Analysis
Machine learning algorithms are revolutionizing the way we approach crime analysis
by harnessing the power of data. These algorithms can sift through vast amounts of
information, ranging from past crime records to socio-economic factors and geographical
data. By doing so, they uncover intricate patterns and correlations that might elude traditional
methods. This holistic approach enables more accurate predictions of crime rates and trends,
aiding law enforcement agencies in proactive measures and resource allocation. Moreover,
machine learning algorithms facilitate comprehensive analysis by considering multifaceted
aspects of crime, such as its underlying causes and spatial distribution. This nuanced
understanding allows for targeted interventions and tailored strategies to address specific
challenges within communities. As technology continues to advance, these algorithms hold
promise in not only predicting and analyzing crime but also in preventing it through early
detection and strategic planning.
1.6 Feature Engineering

Feature engineering is akin to sculpting raw data into a refined form, crucial for
understanding complex phenomena like crime rates. By carefully selecting and transforming
features, such as demographic profiles, weather patterns, or social media metrics, analysts can
unravel hidden insights. Demographics offer a window into the socioeconomic fabric of
communities, shedding light on vulnerability and behavioral trends. Weather conditions serve
as environmental factors influencing human behavior, affecting crime patterns in tangible
ways. Meanwhile, the pervasive influence of social media activity unveils subtle shifts in
societal dynamics, offering predictive cues. Through advanced techniques like dimensionality
reduction or polynomial transformations, feature engineering refines these variables,
extracting nuanced relationships. This refined data not only enhances the accuracy of
predictive models but also empowers policymakers with actionable insights for targeted
interventions. Thus, feature engineering stands as a cornerstone in deciphering the
multifaceted puzzle of crime dynamics, enabling proactive measures for safer, more resilient
communities.
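To make the techniques above concrete, the following minimal Python sketch applies a polynomial transformation followed by dimensionality reduction to a small, purely illustrative feature matrix; the column meanings are assumptions for the example, not fields from the project's dataset.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA

# Hypothetical feature matrix: [median_income, unemployment_rate, avg_temperature]
X = np.array([[42.0, 6.1, 18.5],
              [31.5, 9.4, 22.1],
              [55.2, 4.0, 15.3],
              [28.7, 11.2, 25.0]])

# Polynomial transformation captures interaction effects between raw features
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Standardize before PCA so that no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(X_poly)

# Dimensionality reduction keeps the components that explain most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                 # (4, 2)
print(pca.explained_variance_ratio_)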
1.7 Exploratory Data Analysis

Exploratory data analysis (EDA) techniques are essential tools for uncovering intricate
patterns within crime data. Through clustering methods, such as k-means or hierarchical
clustering, similar spatial or temporal crime occurrences can be grouped together, revealing
underlying structures and trends. Visualization plays a crucial role in EDA, allowing analysts
to represent complex data in intuitive formats, such as heatmaps, choropleth maps, or time
series plots. These visualizations facilitate the identification of hotspots, spatial distributions,
and temporal fluctuations in criminal activities. Association rule mining further enriches the
analysis by identifying frequent co-occurrences or correlations among different types of
crimes or variables, aiding in understanding underlying relationships and potential
contributing factors. By leveraging these techniques, law enforcement agencies and
policymakers can gain actionable insights into crime dynamics, enabling targeted resource
allocation, strategic planning, and proactive crime prevention efforts aimed at mitigating risks
and enhancing public safety.
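As an illustration of the clustering techniques mentioned above, the sketch below runs k-means on synthetic incident coordinates; the two hotspot locations are invented for the example.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic incident coordinates scattered around two hypothetical hotspots
hotspot_a = rng.normal(loc=[13.04, 80.24], scale=0.01, size=(100, 2))
hotspot_b = rng.normal(loc=[13.10, 80.29], scale=0.01, size=(100, 2))
incidents = pd.DataFrame(np.vstack([hotspot_a, hotspot_b]), columns=['lat', 'lon'])

# k-means groups nearby incidents; cluster centers approximate hotspot locations
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(incidents)
incidents['cluster'] = kmeans.labels_
print(kmeans.cluster_centers_)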
1.8 Evaluation Metrics for Predictive Models
Evaluation metrics such as accuracy, precision, recall, and F1 score are crucial for
assessing the performance of predictive models in predicting crime rates. Accuracy measures
the overall correctness of predictions, while precision quantifies the proportion of correctly
predicted positive cases out of all cases predicted as positive. Recall, also known as
sensitivity, gauges the proportion of true positive cases that were correctly identified. The F1
score balances precision and recall, providing a single metric that combines both. In the
context of predicting crime rates, these metrics help gauge how well the model identifies
actual instances of crime and minimizes false predictions. For example, a high precision score
indicates fewer false positives, which means fewer resources wasted on investigating non-
existent crimes. Conversely, a high recall score suggests the model effectively captures most
actual crime instances, aiding in crime prevention and law enforcement efforts. Overall, these
metrics provide a comprehensive assessment of the predictive model's performance and its
practical utility in addressing real-world challenges related to crime prediction and
prevention.
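The sketch below computes these four metrics with scikit-learn on a small set of hypothetical labels, where 1 marks a high-crime observation; it illustrates the definitions above rather than the project's actual evaluation run.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 1 = high-crime area, 0 = low-crime area
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))   # overall correctness
print("Precision:", precision_score(y_true, y_pred))  # few false alarms
print("Recall   :", recall_score(y_true, y_pred))     # few missed hotspots
print("F1 score :", f1_score(y_true, y_pred))         # balance of both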
1.9 Integration of Machine Learning in Law Enforcement

The integration of machine learning in law enforcement holds immense potential for
revolutionizing traditional approaches to crime prevention and response. By harnessing
advanced algorithms and real-time data analytics, law enforcement agencies can optimize
their decision-making processes. Machine learning techniques enable the analysis of vast
amounts of data to identify patterns, trends, and anomalies, thus facilitating proactive crime
prevention efforts. Predictive models can forecast crime hotspots, enabling law enforcement
to allocate resources strategically and deploy patrols to high-risk areas. Additionally, machine
learning can aid in the optimization of patrol routes, minimizing response times and
maximizing coverage. Moreover, by continuously learning from new data, these systems can
adapt and evolve to changing crime patterns, enhancing the overall effectiveness of law
enforcement efforts. However, it's crucial to address ethical considerations such as bias and
privacy concerns to ensure the responsible and equitable use of these technologies in law
enforcement practices.
1.10 Challenges and Limitations
Machine learning's potential in crime rate prediction and analysis is undeniable, yet
navigating its challenges is crucial. Data quality and availability often hinder accurate
predictions, as crime data can be incomplete or biased. Algorithm bias further complicates
matters, as models may perpetuate societal inequalities if not carefully constructed.
Interpretability poses a significant hurdle; understanding how models reach conclusions is
essential for trust and accountability. Moreover, ethical dilemmas arise from the use of
sensitive personal data, raising concerns about privacy and consent. Addressing these
challenges demands interdisciplinary collaboration, incorporating expertise from data science,
law, ethics, and social sciences. Only through rigorous attention to these issues can machine
learning truly realize its potential in crime prediction while upholding fairness, transparency,
and ethical principles.
CHAPTER 2
LITERATURE SURVEY
1. Mohler, G. O., Short, M. B., Brantingham, P. J., Schoenberg, F. P., & Tita, G. E. (2011). Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493), 100-108.
The authors demonstrate the effectiveness of this modeling approach in capturing crime patterns and predicting future criminal activity.

2. Wang, P., Liu, W., Li, D., & Zhang, L. (2013). Crime prediction based on criminal behavior similarity. Expert Systems with Applications, 40(12), 4912-4919.
The paper presents a crime prediction approach that leverages the similarity in criminal behaviors.

3. Ashfaq, R., Zhang, X., & Ahmad, M. O. (2018). Crime prediction using machine learning techniques. Expert Systems with Applications, 40(12), 4912-4919.
It analyzes the strengths and limitations of these techniques and discusses the key factors influencing their performance.

4. Wu, Z., Yin, H., & Zhu, Y. (2019). Crime rate prediction based on machine learning. In 2019 5th International Conference on Control, Automation and Robotics (ICCAR) (pp. 476-480). IEEE.
The authors propose a predictive model that utilizes historical crime data and employs machine learning algorithms for accurate crime rate estimation.

5. Ribeiro, F. N., de Souza, R. C., & Pereira, C. A. (2019). Crime prediction using machine learning algorithms. In 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) (pp. 1569-1574). IEEE.
The authors explore the application of various machine learning techniques, including decision trees, support vector machines, random forests, and artificial neural networks, for crime prediction tasks.

6. Yuan, Y., Wang, F., Zheng, Y., Xie, X., & Sun, G. (2020). Predicting criminal hotspots using deep learning: a spatiotemporal approach. Applied Geography, 121, 102233.
The study leverages convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to capture spatial and temporal patterns in crime data.
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM

The existing system for crime rate prediction and analysis using machine learning
involves various approaches and methodologies. Traditionally, crime rate analysis relied on
statistical techniques and manual data processing, which limited the accuracy and efficiency
of predictions. However, with advancements in machine learning and data analytics, more
sophisticated systems have been developed.
The existing system typically starts with data collection, which involves gathering
crime-related data such as historical crime records, socio-economic indicators, demographic
information, and geographical data. These datasets are preprocessed to clean and transform
the data into a suitable format for analysis.
Machine learning algorithms are then employed to create predictive models. Various
techniques, including decision trees, support vector machines, random forests, and neural
networks, are used to train the models on the collected data. Feature engineering techniques
are also applied to identify the most relevant features affecting crime rates and enhance the
performance of the models.
Once the predictive models are developed, they are evaluated using appropriate
performance metrics such as accuracy, precision, recall, and F1 score. The models are then
used to predict future crime rates based on the identified patterns and trends in the data.
3.2 PROPOSED SYSTEM

The proposed system for crime rate prediction and analysis using machine learning is
poised to revolutionize traditional crime analysis methodologies. By harnessing cutting-edge
algorithms and techniques, it aims to elevate accuracy, efficiency, and effectiveness in
understanding and forecasting criminal activities. Through robust data processing and
predictive modeling, this system can identify patterns and trends within vast datasets,
enabling law enforcement agencies to proactively allocate resources and strategize crime
prevention measures. Furthermore, by leveraging machine learning, it can adapt and evolve
with dynamic shifts in criminal behavior, continuously refining its predictive capabilities.
Ultimately, this system holds the potential to not only optimize resource allocation but also
enhance community safety by empowering authorities with actionable insights to combat
crime more effectively. The system involves the following key components:
Feature Engineering
Feature engineering is a critical step in data preprocessing where relevant features are
selected or created to enhance the performance of machine learning models. This process
involves analyzing the dataset to identify valuable insights and relationships, which may not
be immediately apparent. Techniques such as imputation, encoding categorical variables, and
scaling numerical features are commonly employed to ensure the data is suitable for
modeling. Additionally, feature extraction methods like Principal Component Analysis (PCA)
or dimensionality reduction techniques might be applied to reduce the complexity of the
dataset while preserving important information. The ultimate goal of feature engineering is to
improve the predictive power and generalization ability of the model by providing it with the
most informative and discriminative features.
Model Training

Machine learning algorithms are trained on the engineered features to
develop predictive models that generalize well to new data and yield valuable insights across
various domains.
Prediction
The trained models are then used to predict future crime rates based on new input data.
These predictions consider the identified patterns and relationships between crime rates and
various factors.
3.3 FEASIBILITY STUDY
With an eye towards gauging the project's viability and improving server performance,
a business proposal defining the project's primary goals and offering some preliminary cost
estimates is offered here. Your proposed system's viability may be assessed once a
comprehensive study has been performed. It is essential to have a thorough understanding of
the core requirements of the system at hand before beginning the feasibility study. The
feasibility research includes mostly three lines of thought:
• Economical feasibility
• Technical feasibility
• Operational feasibility
• Social feasibility
1. ECONOMICAL FEASIBILITY
The study's findings provide valuable insights for upper management to assess
potential cost savings from implementing the technology. With finite resources, the
corporation must carefully allocate funds, ensuring each dollar spent has a justified purpose.
Leveraging predominantly open-source and free technologies significantly reduced
infrastructure costs, aligning with budget constraints. Emphasizing customizable products
was pivotal, enabling tailored solutions without excessive spending. This strategic approach
not only optimizes resource utilization but also fosters agility and adaptability within the
system. By minimizing expenses without compromising quality or functionality, the
corporation maximizes ROI and ensures sustainable growth. Furthermore, investing in
adaptable infrastructure lays a foundation for future scalability and innovation, positioning the
company for long-term success in a competitive landscape. Effective cost management fosters
financial stability, empowering the corporation to navigate uncertainties and capitalize on
emerging opportunities. Thus, the study's findings not only inform decision-making but also
contribute to the organization's overall efficiency and resilience.
2. TECHNICAL FEASIBILITY
The research's primary objective is to ascertain the technical feasibility of the system
to facilitate its seamless development. The intention is to integrate additional systems without
overburdening the IT staff, thereby mitigating any undue anxiety for the buyer. Given the low
probability of requiring adjustments during installation, simplicity in design is paramount. By
prioritizing simplicity, the system can streamline processes and minimize potential
complications. This approach not only enhances efficiency but also reduces the risk of errors
and downtime. Moreover, a straightforward design fosters ease of maintenance and
troubleshooting, bolstering long-term sustainability. Through meticulous evaluation and
testing, the research endeavors to identify potential bottlenecks and address them proactively.
Ultimately, the aim is to ensure a robust and scalable system that can adapt to evolving needs
without causing undue strain on resources or stakeholders.
3. OPERATIONAL FEASIBILITY
Ensuring user engagement and satisfaction is paramount in technology adoption. A
crucial step involves guiding users through the optimal utilization of the resource, minimizing
any sense of intimidation or threat. By positioning the system as a necessary tool rather than
an adversary, users are more likely to embrace its functionalities. Effective training and
orientation sessions lay the groundwork for swift adoption, empowering users to navigate the
system confidently. Building trust in the system is key, fostering an environment where users
feel comfortable providing constructive feedback. As users gain faith in the system's
reliability and utility, they become more inclined to offer valuable insights, enriching the
development process. Ultimately, this iterative cycle of user empowerment and feedback
integration accelerates the system's evolution and enhances its effectiveness.
4. SOCIAL FEASIBILITY
In the social feasibility analysis, understanding how a project might impact the community is
paramount. This involves assessing potential shifts in demographics, employment
opportunities, and social dynamics. One critical aspect is recognizing how the project aligns
with existing cultural norms and institutional frameworks. These frameworks often dictate the
availability of certain types of workers within a community. If the project requires specialized
skills or expertise that are not prevalent in the community, it could face challenges in finding
qualified personnel. This scarcity might necessitate strategies for recruitment, training, or even
relocation incentives to attract the needed workforce. Additionally, it's essential to consider
the long-term effects on local employment patterns and the overall socio-economic fabric. By
addressing these factors early in the feasibility analysis, project planners can better anticipate
and mitigate potential hurdles, ensuring smoother integration and acceptance within the
community.
3.4 REQUIREMENT SPECIFICATION

SOFTWARE REQUIREMENTS

Operating system : Windows 7 (with Service Pack 1), 8, 8.1 and 10
Language : Python
3.5 LANGUAGE SPECIFICATION - PYTHON

ADVANTAGES OF USING PYTHON
Reliability
Python is a favorite among software developers for its ethos of simplicity and
consistency. Its concise syntax and readability make code presentation a breeze, fostering
clarity and maintainability. In comparison to its counterparts, Python allows developers to
write code swiftly, thanks to its straightforward syntax and extensive standard library.
Moreover, the vibrant Python community offers invaluable feedback, enabling developers to
refine their products and applications continuously. Its simplicity also renders it an ideal
choice for beginners, who can grasp its fundamentals relatively quickly, setting a solid
foundation for their programming journey.
For seasoned developers, Python's simplicity serves as a springboard for innovation. With a
robust ecosystem for machine learning and data science, they can focus on devising
groundbreaking solutions to real-world problems, leveraging Python's stability and reliability
to create trustworthy applications. This emphasis on innovation propels Python to the
forefront of technological advancement, driving progress across various domains.
Easily Executable
CHAPTER 4
SYSTEM DESIGN
4.2 DATA FLOW DIAGRAM

A data-flow diagram (DFD) contains no control rules or loops, as the flow of information is
entirely one-way. A flowchart can be used to illustrate the steps used to accomplish a certain
data-driven task. Several different notations exist for representing data-flow graphs. Each data
flow must have a process that acts as either the source or the target of the information
exchange. Rather than utilizing a data-flow diagram, users of UML often substitute an activity
diagram. Site-oriented data-flow plans are a subset of data-flow plans. Nodes in a data-flow
diagram and a Petri net can be thought of as counterparts, since the semantics of data memory
are represented by the places in the net. A data-flow model (DFM) includes processes, flows,
stores, and terminators.
Process

A process takes in data as input and returns results as output.
Data Store
In the context of a computer system, the term "data stores" is used to describe the
various memory regions where data can be found. In other cases, "files" might stand in for
data.
Data Flow

Data flows are the pathways that information takes to get from one place to another.
Each arrow is labelled to describe the nature of the data being conveyed.
External Entity
In this context, "external entity" refers to anything outside the system with which the
system has some kind of interaction. These are the starting and finishing positions for inputs
and outputs, respectively.
The whole system is shown as a single process in a Level 0 DFD. Each step in the
system's assembly process, including all intermediate steps, is recorded here. The "basic
system model" consists of this and the Level 1 and Level 2 data-flow diagrams.
4.4 ACTIVITY DIAGRAM
An activity diagram, in its most basic form, is a visual representation of the sequence
in which tasks are performed. It depicts the sequence of operations that make up the overall
procedure. They are not quite flowcharts, but they serve a comparable purpose.
4.5 SEQUENCE DIAGRAM
These are another type of interaction-based diagram used to display the workings of
the system. They record the conditions under which objects and processes cooperate.
4.6 CLASS DIAGRAM

Figure 4.6 Class Diagram
CHAPTER 5
MODULE DESCRIPTION
5.1 MODULE 1:
Data Collection and Pre-processing
The Data Collection and Preprocessing module acts as the foundation for crime rate
prediction and analysis, sourcing data from a multitude of channels such as law enforcement
databases, public records, and even social media. It meticulously sifts through the gathered
data, employing techniques like data cleaning and normalization to ensure accuracy and
consistency. Through this process, disparate datasets are harmonized and prepared for
seamless integration into machine learning algorithms. This module serves as the gateway to
unlocking valuable insights into crime patterns and trends, laying the groundwork for
effective predictive models and informed decision-making in law enforcement and urban
planning.
Data Gathering
The data collection module integrates diverse sources such as law enforcement
databases, government records, and relevant datasets. It aggregates historical crime records,
socio-economic metrics, demographic details, and geographical data. Through a well-
designed system, it ensures efficient retrieval and compilation of this information. The
module prioritizes accuracy and comprehensiveness, ensuring a representative dataset. By
harmonizing disparate sources, it fosters a holistic understanding of the underlying dynamics.
This integrated approach aids in generating insights crucial for informed decision-making in
various domains, from public policy to law enforcement strategies.
Feature Extraction
Feature extraction in data preparation for crime rate analysis entails identifying and
extracting pertinent factors from collected data. This includes temporal elements like time of
day and day of the week, spatial aspects such as geographical coordinates and neighborhood
attributes, as well as socio-economic variables like income and education levels. Accurate
prediction of crime rates relies heavily on the selection of relevant features, as they directly
influence the model's ability to capture underlying patterns and dynamics within the data.
Therefore, careful consideration and selection of appropriate features are paramount to ensure
the effectiveness and accuracy of crime rate predictions.
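A minimal sketch of the temporal and spatial extraction described above is given below; the column names and the coarse grid-cell scheme are illustrative assumptions, not the project's actual schema.

import pandas as pd

# Hypothetical raw incident records; column names are illustrative
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-06 23:40', '2023-01-07 02:15',
                                 '2023-01-09 14:05']),
    'lat': [13.06, 13.08, 13.04],
    'lon': [80.24, 80.27, 80.21],
})

# Temporal features: hour of day, day of week, weekend flag
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])

# Simple spatial feature: coarse grid cell derived from the coordinates
df['grid_cell'] = (df['lat'].round(1).astype(str) + '_'
                   + df['lon'].round(1).astype(str))
print(df[['hour', 'day_of_week', 'is_weekend', 'grid_cell']])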
Data Preprocessing
After feature extraction, preprocessing is crucial for refining data quality. Techniques
like data transformation, normalization, and scaling ensure data is suitable for analysis by
adjusting its range and distribution. Categorical variables are often encoded using methods
like one-hot encoding or label encoding to enable their integration into machine learning
models. Preprocessing aims to optimize data for accurate model training and robust
performance, ultimately enhancing the effectiveness of subsequent analysis tasks. It
streamlines the data pipeline, making it more manageable and conducive to extracting
meaningful insights. This phase is pivotal in ensuring the reliability and efficiency of machine
learning algorithms by addressing data irregularities and preparing it for further analysis and
modeling.
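The following sketch shows one way to combine imputation, scaling, and one-hot encoding in a single scikit-learn pipeline, using invented columns; it illustrates the steps above rather than the project's exact preprocessing code.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset with one numeric and one categorical column
df = pd.DataFrame({'income': [42.0, None, 31.5, 55.2],
                   'area_type': ['urban', 'rural', 'urban', 'suburban']})

preprocess = ColumnTransformer([
    # Numeric: fill missing values with the median, then scale
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), ['income']),
    # Categorical: one-hot encode, ignoring unseen categories at predict time
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['area_type']),
])
X = preprocess.fit_transform(df)
print(X.shape)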
Data Splitting
Splitting the preprocessed data into training and testing datasets is crucial for
evaluating machine learning models effectively. The training dataset is utilized to train the
models, while the testing dataset serves to assess their predictive capabilities on unseen data.
It's essential to establish an appropriate split ratio, typically ranging from 70-80% for training
and the remainder for testing. Randomly shuffling the data before splitting is vital to prevent
any inherent patterns or biases from influencing the training process, ensuring the models
generalize well to new data.
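A minimal sketch of such a split, with shuffling and a reproducible random seed, might look as follows (placeholder data stands in for the preprocessed crime dataset):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)                 # placeholder feature matrix
y = np.random.default_rng(0).integers(0, 2, 50)   # placeholder labels

# 80/20 split; shuffle=True (the default) avoids ordering bias,
# and a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
print(len(X_train), len(X_test))  # 40 10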
5.2 MODULE 2:
Machine Learning Modeling
The Machine Learning Modeling module plays a crucial role in the crime rate
prediction system by harnessing advanced algorithms to analyze preprocessed data.
Leveraging diverse machine learning techniques, it discerns patterns and trends from
historical crime data. By assimilating this knowledge, it generates predictive models capable
of forecasting future crime rates with precision. These models serve as invaluable tools for
law enforcement agencies and policymakers in devising proactive strategies to address and
mitigate potential crime hotspots. Through continual refinement and optimization, this
module empowers decision-makers with actionable insights to enhance public safety
measures effectively.
Model Selection
Selecting the right machine learning algorithm for crime rate prediction is crucial.
Decision trees are intuitive and easy to interpret, making them suitable for understanding
complex relationships in data. Support vector machines excel in handling high-dimensional
data and can capture intricate patterns. Random forests offer robustness against overfitting
and can handle large datasets efficiently. Neural networks, particularly deep learning
architectures, are adept at capturing nonlinear relationships but may require substantial
computational resources. Considering factors like problem complexity, model interpretability,
computational demands, and data availability is essential for making an informed choice.
Model Training
During training, the selected algorithm(s) iteratively adjust their internal parameters
based on the preprocessed data. By analyzing historical crime data, the model(s) identify
patterns, relationships, and dependencies between crime rates and factors like time, location,
socio-economic indicators, and demographics. The training process involves optimizing the
model(s) to minimize prediction errors, ensuring accuracy in forecasting future crime rates.
Through this iterative approach, the model(s) gradually improve their ability to generalize and
make reliable predictions across different scenarios. Regular evaluation against the testing
dataset helps gauge the model's performance and fine-tune its parameters further. Continuous
refinement ensures that the model(s) effectively capture the complexities of crime dynamics,
aiding law enforcement and policymakers in making informed decisions.
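As an illustration of this training process, the sketch below fits a random forest on a synthetic stand-in for the preprocessed dataset; the model choice and parameters are assumptions for the example, not the project's final configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed crime dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a random forest; training iteratively reduces prediction error
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))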
Hyperparameter Tuning

The choice of hyperparameter tuning strategy, such as grid search or random search, depends
on factors like the size of the hyperparameter space, computational resources, and desired
model performance.
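A small grid search along these lines might look as follows; the grid itself is deliberately tiny and illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Small, illustrative grid; real grids grow combinatorially,
# which is why grid size and compute budget drive the choice of strategy
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring='f1')
search.fit(X, y)
print(search.best_params_, search.best_score_)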
Model Evaluation
Evaluation of trained models is crucial for assessing their generalization capabilities.
During this phase, the testing dataset, segregated during preprocessing, proves pivotal.
Metrics like accuracy, precision, recall, F1 score, and area under the ROC curve offer insights
into the model's performance. Techniques like k-fold cross-validation enhance robustness by
validating across various data splits. By leveraging these evaluation methods, researchers
ensure models perform well on unseen data, a hallmark of effective machine learning
systems. Regular evaluation and refinement cycles contribute to the continual improvement of
model accuracy and reliability, essential for real-world deployment and decision-making
processes.
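The sketch below shows k-fold cross-validation with scikit-learn on synthetic data, reporting the mean and spread of the F1 score across folds.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0)

# 5-fold cross-validation: each fold serves once as the validation split,
# so the F1 estimate is less sensitive to any single train/test partition
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(scores.mean(), scores.std())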
Model Optimization
After evaluating the models, it's crucial to refine them further for better performance.
Techniques like ensemble learning, where multiple models collaborate, can leverage diverse
perspectives for more accurate predictions. Regularization methods help prevent overfitting,
enhancing the model's generalization to unseen data. Feature selection ensures that only the
most relevant aspects are considered, reducing noise and improving prediction quality. This
optimization journey aims to fortify the models, ensuring they reliably and accurately forecast
crime rates, ultimately contributing to more effective crime prevention strategies.
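One possible combination of feature selection and ensembling, sketched on synthetic data; the specific estimators are illustrative choices, not the project's final configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           random_state=0)

# Feature selection keeps only the most informative columns,
# and a soft-voting ensemble combines two complementary models
pipeline = Pipeline([
    ('select', SelectKBest(f_classif, k=5)),
    ('ensemble', VotingClassifier([
        ('rf', RandomForestClassifier(random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ], voting='soft')),
])
pipeline.fit(X, y)
print(pipeline.score(X, y))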
Model Deployment
The integration process involves meticulous testing to ensure seamless compatibility
with existing system components. Engineers establish clear input and output interfaces,
facilitating smooth communication between the model and the broader crime rate prediction
system. Technical considerations such as scalability, reliability, and computational efficiency
are addressed to optimize deployment performance. Rigorous validation procedures are
conducted to verify the model's accuracy and robustness across diverse datasets. Continuous
monitoring mechanisms are implemented to track the model's performance in real-time and
address any emerging issues promptly. Deployment documentation is prepared to provide
comprehensive guidance for system administrators and users. Collaboration between data
scientists, software engineers, and domain experts ensures alignment with operational
requirements and strategic objectives. Feedback loops are established to gather insights from
end-users, driving iterative improvements to the deployed model. Overall, the deployment
phase marks the culmination of efforts to transition the machine learning model from
development to practical application within the crime rate prediction and analysis system.
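A minimal sketch of one common hand-off pattern for this step, persisting and reloading a trained model with joblib; the file name is an illustrative assumption.

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist the trained model so the prediction service can load it
joblib.dump(model, 'crime_rate_model.joblib')

# Inside the serving component: load once, then answer prediction requests
loaded = joblib.load('crime_rate_model.joblib')
print(loaded.predict(X[:3]))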
5.3 MODULE 3:
Crime Pattern Analysis
The Crime Pattern Analysis module serves as a cornerstone in the realm of crime rate
prediction and analysis, leveraging machine learning algorithms. Its primary objective is to
delve into crime data, meticulously scrutinizing patterns, trends, and interrelations. Through
advanced exploratory data analysis and statistical methodologies, it uncovers pivotal insights
essential for devising effective crime prevention strategies and optimizing resource allocation.
By deciphering the intricate dynamics of criminal activities, this module empowers law
enforcement agencies to anticipate and mitigate potential threats proactively, fostering safer
communities. Its robust analytical capabilities enable the identification of hotspots, modus
operandi, and emerging patterns, thereby facilitating targeted interventions and enhancing
overall public safety measures.
Data Exploration
In this phase, we delve into the preprocessed crime data to unveil its nuanced traits.
Through a spectrum of statistical tools like descriptive statistics, histograms, and box plots,
we dissect the distribution, central tendencies, and variability of crime rates and pertinent
variables. This meticulous examination furnishes an introductory panorama of the data, laying
the groundwork for more intricate analyses. By scrutinizing these metrics, we aim to unravel
patterns, outliers, and potential correlations that will underpin subsequent analytical
endeavors.
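A brief sketch of such descriptive exploration on synthetic district-level counts; the districts and counts are invented for the example.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic monthly crime counts for three hypothetical districts
df = pd.DataFrame({'district': rng.choice(['A', 'B', 'C'], 120),
                   'crime_count': rng.poisson(30, 120)})

# Descriptive statistics: central tendency and spread
print(df['crime_count'].describe())

# Distribution per district, a quick check for outliers and skew
print(df.groupby('district')['crime_count'].agg(['mean', 'std', 'max']))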
Spatial Analysis
Spatial analysis techniques play a crucial role in understanding the spatial dynamics of
crime. By overlaying crime incidents onto maps, patterns and trends emerge, enabling
authorities to identify hot spots and allocate resources effectively. Heat maps visually
represent concentrations of crime, while kernel density estimation helps in pinpointing areas
with high densities of incidents. Clustering algorithms further aid in identifying spatial
patterns and potential crime clusters. This analytical approach empowers law enforcement to
adopt targeted interventions, enhancing crime prevention and public safety strategies.
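The sketch below draws a kernel density estimate over synthetic incident coordinates with seaborn; the coordinates and the hotspot location are invented for the example.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic incident coordinates clustered around one hypothetical hotspot
lat = rng.normal(13.06, 0.02, 300)
lon = rng.normal(80.25, 0.02, 300)

# Kernel density estimate: a smooth surface whose peaks mark dense areas
ax = sns.kdeplot(x=lon, y=lat, fill=True, cmap='Reds')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.title('Estimated density of incidents')
plt.savefig('hotspot_density.png')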
Temporal Analysis
Temporal analysis is a crucial tool in criminology, delving into the intricate
fluctuations of crime rates over time. Through techniques like decomposition, autocorrelation,
and trend analysis, it dissects data to unveil seasonal variations, enduring trends, and sudden
shifts in criminal activity. By unraveling these temporal patterns, researchers gain insights
into the dynamics of crime, enabling the creation of tailored intervention approaches. This
methodical examination facilitates the anticipation of peaks and troughs in criminal behavior,
empowering law enforcement agencies and policymakers to deploy resources effectively.
Ultimately, temporal analysis serves as a strategic compass, guiding efforts to mitigate crime
and enhance public safety across diverse temporal landscapes.
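As an illustration, the sketch below decomposes a synthetic monthly crime series into trend, seasonal, and residual components using statsmodels.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
# Synthetic monthly series: upward trend + yearly seasonality + noise
months = pd.date_range('2015-01', periods=96, freq='MS')
counts = (100 + 0.5 * np.arange(96)
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
          + rng.normal(0, 3, 96))
series = pd.Series(counts, index=months)

# Decompose into trend, seasonal, and residual components
result = seasonal_decompose(series, model='additive', period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))   # the recurring within-year pattern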
Correlation Analysis
Correlation analysis serves as a vital tool in unraveling the intricate web of
connections between crime rates and a myriad of influencing factors, including socio-
economic indicators, demographic variables, and environmental conditions. By employing
correlation coefficients, scatter plots, and regression analysis, researchers can effectively
quantify and visualize the degree and direction of these relationships. This analytical
approach not only highlights the factors significantly linked to crime rates but also aids
policymakers in devising targeted interventions and evidence-based strategies. Through
rigorous correlation analysis, key insights emerge, guiding informed decision-making aimed
at addressing the root causes and mitigating the prevalence of crime within communities.
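A minimal sketch of a correlation matrix over synthetic district-level variables, where unemployment is deliberately constructed to co-vary with the crime rate:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic district-level data; unemployment is built to co-vary with crime
unemployment = rng.normal(8, 2, 50)
crime_rate = 5 + 1.5 * unemployment + rng.normal(0, 2, 50)
df = pd.DataFrame({'unemployment': unemployment,
                   'median_income': rng.normal(40, 5, 50),
                   'crime_rate': crime_rate})

# Pearson correlation matrix: strength and direction of linear relationships
print(df.corr().round(2))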
Predictive Analytics
Predictive analytics in this module offer a forward-looking perspective, leveraging
historical crime data and discernible patterns to forecast future crime rates. Time series
forecasting methods enable the projection of crime trends over designated time frames, while
regression models and machine learning algorithms analyze various factors to predict crime
rates accurately. These predictive insights empower stakeholders in resource allocation,
aiding in strategic policy formulation and proactive crime prevention strategies. By
anticipating potential spikes or declines in crime, authorities can optimize deployment of law
enforcement resources, enhance community safety, and preemptively address emerging
criminal threats.
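One simple way to produce such a forecast is an ARIMA model over a monthly series, sketched below on synthetic data; the model order is an illustrative choice.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
months = pd.date_range('2018-01', periods=60, freq='MS')
counts = pd.Series(80 + 0.4 * np.arange(60) + rng.normal(0, 4, 60),
                   index=months)

# Fit a simple ARIMA model and project crime counts six months ahead
model = ARIMA(counts, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=6)
print(forecast.round(1))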
Reports and visualizations play a pivotal role in conveying these insights effectively.
Clear, informative reports and interactive visualizations provide stakeholders with a digestible
overview of the analysis results, facilitating their interpretation and utilization in devising
evidence-based strategies. This enhances not only crime prevention measures but also
resource allocation and policy formulation.
CHAPTER 6
TESTING
Testing serves as the vigilant gatekeeper ensuring the integrity and reliability of
software. Its essence lies in meticulously uncovering and rectifying flaws within the final
product. Whether scrutinizing a comprehensive system or a minute component, testing is the
beacon illuminating potential pitfalls. Stress testing stands out as a crucial facet, validating
the resilience of software even amidst the harshest conditions. Within the realm of testing,
myriad approaches await exploration, catering to the diverse array of evaluation needs. From
unit tests to integration tests, regression tests to performance tests, the landscape brims with
opportunities to scrutinize functionality and robustness. Each test bears its unique
significance, collectively contributing to the overarching goal of fortifying software against
vulnerabilities. The efficacy of testing lies not only in its breadth but also in its depth, delving
into every nook and cranny to unearth imperfections. In the realm of software development,
testing emerges as an indispensable ally, safeguarding against the perils of inadequate quality
control. Through meticulous examination and relentless scrutiny, testing endeavors to uphold
the standard of excellence expected from modern software solutions.
The final testing phase in the incremental model assesses the complete application, ensuring
integration and functionality across all increments. Both models emphasize the importance of
testing but differ in their timing and approach within the software development life cycle.
Unit Testing
The term "unit testing" refers to a specific kind of software testing in which discrete
elements of a program are investigated. The purpose of this testing is to ensure that the
software operates as expected.
Test Cases
1. Test that the pre-processing stage correctly handles missing values, outliers,
and inconsistencies in the crime-related data (a unit-test sketch for this case follows the list).
2. Test that the feature engineering techniques accurately identify relevant
features and transform the data to extract meaningful patterns and relationships.
3. Test that the machine learning model selection and training process properly
integrates with the preprocessed data, ensuring compatibility and accurate model training.
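As a sketch of the first test case above, the following pytest-style test exercises a hypothetical median-imputation helper; both the helper and its name are assumptions made for illustration.

import numpy as np
import pandas as pd

def fill_missing_with_median(df, column):
    """Hypothetical helper from the pre-processing stage."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out

def test_missing_values_are_filled():
    df = pd.DataFrame({'crime_count': [10.0, np.nan, 30.0]})
    cleaned = fill_missing_with_median(df, 'crime_count')
    assert cleaned['crime_count'].isna().sum() == 0
    assert cleaned.loc[1, 'crime_count'] == 20.0  # median of 10 and 30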
Integration Testing
Integration testing is the crucial phase where the program undergoes rigorous
examination to ensure seamless cohesion among its combined components. It serves as the
ultimate trial for the software in its finalized form, meticulously scrutinizing every interaction
point for potential issues or glitches. By subjecting the program to various scenarios and
inputs, testers strive to unearth any hidden flaws that could disrupt its functionality. This
phase demands meticulous attention to detail, as even minor discrepancies in component
interactions can have cascading effects on the overall system performance. Through
systematic testing methodologies, integration testers meticulously simulate real-world usage
scenarios to validate the robustness and reliability of the software. The objective is to detect
and rectify any inconsistencies or incompatibilities that may arise when different parts of the
system come together. Ultimately, successful integration testing lays the foundation for a
cohesive and dependable software product ready for deployment.
Test Cases
1. Test the integration of data collection modules with the preprocessing modules
to ensure that the collected crime-related data is correctly processed and prepared for analysis.
2. Test the integration of feature engineering techniques with the preprocessing
modules to verify that relevant features are properly identified and transformed.
3. Test the integration of the machine learning algorithms with the prepared data
to ensure accurate training and prediction.
Functional Testing
Functional testing is a vital phase in software development where the system's
functionality is rigorously examined against predefined requirements and specifications. It
begins by feeding inputs into the functions under scrutiny, then meticulously analyzing the
resultant outputs. Unlike other testing methodologies, functional testing prioritizes the
correctness of outcomes over the intricacies of processing methods. By executing a series of
test cases, it meticulously scrutinizes the system's behavior, ensuring that it aligns seamlessly
with the specified criteria. This meticulous approach validates the accuracy and integrity of
the system's functionalities, assuring stakeholders of its reliability and adherence to intended
functionalities.
Test Cases
1. Test the functionality of the data collection module to ensure that it successfully
collects and stores crime-related data from various sources.
2. Test the functionality of the preprocessing module to verify that it properly handles
missing values, outliers, and inconsistencies in the crime data.
3. Test the feature engineering techniques to ensure that they correctly identify relevant
features and transform the data accordingly.
Test Cases
1. Test the system by providing a set of known crime-related data and verify that the
predicted crime rates match the expected values.
2. Test the system with different types of crime datasets, including varying sizes and
formats, to ensure that the system can handle and process the data accurately.
3. Test the system with simulated real-time data updates and verify that the predictions
are consistently updated and reflect the changes in the input data.
White Box Testing

White Box Testing, also known as clear box testing or structural testing, is an
approach where the tester has access to the internal workings and code of the software being
tested. With this knowledge, test cases are designed based on an understanding of the code's
logic, paths, and structure. Unlike Black Box Testing, where the tester doesn't have visibility
into the internal workings, White Box Testing allows for a more thorough examination of the
software's behavior, focusing on specific paths and conditions within the code. By examining
the inner workings, testers can identify potential errors or vulnerabilities that might not be
evident from an external perspective. This method is particularly useful for uncovering logic
errors, boundary cases, and ensuring code coverage. The term "white box" originates from the
analogy of a transparent box, where the internal contents are visible to anyone observing it.
This transparency enables testers to scrutinize the code comprehensively, hence the name
White Box Testing.
Test Cases
1. Test the preprocessing module by verifying that missing values are correctly handled
through techniques such as imputation or removal.
2. Test the feature engineering module by examining the transformed features and
ensuring they capture meaningful patterns and relationships in the data.
3. Test the machine learning algorithms by assessing the model's accuracy on a known
training dataset and verifying that the model is not overfitting or underfitting.
CHAPTER 7
CONCLUSION
Crime rate prediction and analysis through machine learning techniques represents a
powerful tool for advancing our comprehension of criminal activities and bolstering crime
prevention strategies. By amalgamating machine learning algorithms with thorough data
analysis and predictive modeling, we unlock avenues for more precise and proactive
approaches to estimating crime rates. Through the utilization of these algorithms, historical
crime records, socio-economic indicators, and geographical data can be synthesized to
construct predictive models that offer insights into crime rates, pinpoint high-risk areas, and
unveil spatial and temporal patterns of criminal behavior. The analysis of crime patterns and
trends serves to deepen our understanding of the dynamics of criminal activities, facilitating
the identification of hotspots, correlations, and emerging patterns. This analytical depth is
invaluable for crafting targeted crime prevention strategies, optimizing resource allocation,
and informing policy-making decisions. By integrating machine learning techniques into
crime rate prediction and analysis, we equip law enforcement agencies, policymakers, and
urban planners with data-driven insights to make informed choices and proactively address
safety concerns within communities. Nevertheless, there exist challenges that warrant
attention, including issues related to data quality, algorithmic bias, and ethical considerations.
Ensuring the responsible and effective implementation of machine learning techniques in this
context necessitates a concerted effort to address these challenges. Further research and
development efforts are indispensable for refining and broadening the capabilities of crime
rate prediction and analysis systems powered by machine learning. Such endeavors are
crucial in our collective pursuit of reducing crime rates and fostering safer societies through
evidence-based approaches.
APPENDIX 1: CODING

!unzip /content/drive/MyDrive/CRIMEANALYSIS/crime.zip
!unzip /content/drive/MyDrive/CRIMEANALYSIS/map.zip

import numpy as np               # linear algebra
import pandas as pd              # data processing
import geopandas as gpd          # shapefiles and geographic joins
import matplotlib.pyplot as plt
import seaborn as sns            # added: sns.set() below requires seaborn

sns.set()
# Keywords recognised in the user's query (the opening of this list was
# lost in the original report and is reconstructed from the branches below)
my_list = ['rape', 'harassment', 'human rights', 'torture', 'extortion',
           'atrocities', 'arrest', 'fake encounter', 'false implication',
           'property stolen', 'property', 'stolen', 'auto', 'auto theft',
           'death', 'killer', 'murder']

penalties = {
    'rape': 'Imprisonment for 7 years to life and fine',
    'harassment': 'Imprisonment up to 3 years and/or fine',
    'human rights': 'Imprisonment up to 7 years and/or fine',
    'torture': 'Imprisonment up to 10 years and/or fine',
    'extortion': 'Imprisonment up to 3 years and/or fine',
    'atrocities': 'Imprisonment up to 10 years and/or fine',
    'arrests': 'Imprisonment up to 3 years and/or fine',
    'fake encounter': 'Life imprisonment',
    'false implication': 'Imprisonment up to 7 years and/or fine'
}

for item in my_list:
    if item in input.lower():
        if item == 'rape' or item == 'harassment':
            st.write(victims)
            st.header('VICTIMS OF INCEST RAPE')
            rape_victims = victims[victims['Subgroup'] == 'Victims of Incest Rape']
            st.write(rape_victims)
            g = pd.DataFrame(rape_victims.groupby(['Year'])['Rape_Cases_Reported'].sum().reset_index())
            st.header('YEAR WISE CASES')
            st.write(g)
            fig = px.bar(g, x='Year', y='Rape_Cases_Reported', color_discrete_sequence=['blue'])
            st.plotly_chart(fig)
            st.header('AREA WISE CASES')
            g1 = pd.DataFrame(rape_victims.groupby(['Area_Name'])['Rape_Cases_Reported'].sum().reset_index())
            g1.replace(to_replace='Arunachal Pradesh', value='Arunanchal Pradesh', inplace=True)
            st.write(g1)
            g1.columns = ['State/UT', 'Cases Reported']
            shp_gdf = gpd.read_file('/content/India_States/Indian_states.shp')
            merge = shp_gdf.set_index('st_nm').join(g1.set_index('State/UT'))
            fig, ax = plt.subplots(1, figsize=(10, 10))
        elif (item == 'human rights' or item == 'torture' or item == 'extortion'
              or item == 'atrocities' or item == 'arrest' or item == 'fake encounter'
              or item == 'false implication'):
            x = item
            st.header(x.upper() + ' CRIME')
            g2 = pd.DataFrame(police_hr.groupby(['Area_Name'])['Cases_Registered_under_Human_Rights_Violations'].sum().reset_index())
            st.write(x)
            st.write(g2)
            st.header('YEAR WISE CASES')
            g3 = pd.DataFrame(police_hr.groupby(['Year'])['Cases_Registered_under_Human_Rights_Violations'].sum().reset_index())
            g3.columns = ['Year', 'Cases Registered']
            # the assignment target below was lost in the original listing;
            # "g4 =" is reconstructed from the st.write(g4) call that follows
            g4 = pd.DataFrame(police_hr.groupby(['Year'])[['Policemen_Chargesheeted', 'Policemen_Convicted']].sum().reset_index())
            st.write(g4)
            year = ['2001', '2002', '2003', '2004', '2005',
                    '2006', '2007', '2008', '2009', '2010']
            fig = go.Figure(data=[
                go.Bar(name='Policemen Chargesheeted', x=year,
                       y=g4['Policemen_Chargesheeted'], marker_color='purple'),
                go.Bar(name='Policemen Convicted', x=year,
                       y=g4['Policemen_Convicted'], marker_color='red')
            ])
            fig.update_layout(barmode='group', xaxis_title='Year',
                              yaxis_title='Number of policemen')
            st.plotly_chart(fig)
            st.header(x + ' STATE WISE REPORTS')
            g2.columns = ['State/UT', 'Cases Reported']
            st.write(g2)
            g2.replace(to_replace='Arunachal Pradesh', value='Arunanchal Pradesh', inplace=True)
            colormaps = ['RdPu', 'viridis', 'coolwarm', 'Blues', 'Greens', 'Reds', 'PuOr',
                         'inferno', 'magma', 'cividis', 'cool', 'hot', 'YlOrRd', 'YlGnBu']
        # the branch header for the property-related keywords was lost in the
        # original listing; it is reconstructed from the keyword list above
        elif item == 'property' or item == 'stolen' or item == 'property stolen':
            st.write(stats)
            plt.bar(['Recovered', 'Stolen'],
                    [df['Cases_Property_Recovered'][0], df['Cases_Property_Stolen'][0]])
            plt.title('Cases of Property Recovered and Stolen')
            plt.xlabel('Type of Property')
            plt.ylabel('Number of Cases')
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            labels = ['Recovered', 'Stolen']
            sizes = [df['Value_of_Property_Recovered'][0], df['Value_of_Property_Stolen'][0]]
            colors = ['green', 'red']
            plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')
            plt.title('Property Recovered and Stolen')
            plt.axis('equal')
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            group_data = df.groupby('Group_Name').agg({'Cases_Property_Recovered': 'sum',
                                                       'Cases_Property_Stolen': 'sum'})
            group_data.plot(kind='bar')
            plt.title('Cases of Property Recovered and Stolen by Group Name')
            plt.xlabel('Group Name')
            plt.ylabel('Number of Cases')
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            cases_by_area_year = df.pivot_table(values=['Cases_Property_Recovered',
                                                        'Cases_Property_Stolen'],
                                                index='Area_Name', columns='Year', aggfunc='sum')
            st.write(cases_by_area_year)
            plt.scatter(df['Value_of_Property_Recovered'], df['Value_of_Property_Stolen'])
            plt.title('Value of Property Recovered vs. Stolen')
            plt.xlabel('Value of Property Recovered')
            plt.ylabel('Value of Property Stolen')
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            top_stolen = df.sort_values(by='Cases_Property_Stolen',
                                        ascending=False).head(5)[['Sub_Group_Name', 'Cases_Property_Stolen']]
            top_stolen.rename(columns={'Sub_Group_Name': 'Sub-group',
                                       'Cases_Property_Stolen': 'Number of Cases Stolen'}, inplace=True)
            top_stolen.reset_index(drop=True, inplace=True)
            top_stolen.index += 1
            st.write(top_stolen)
            sub_group_cases = df[['Sub_Group_Name', 'Cases_Property_Stolen']].copy()
            sub_group_cases.set_index('Sub_Group_Name', inplace=True)
            st.write(sub_group_cases)
            plt.hist([df['Value_of_Property_Recovered'], df['Value_of_Property_Stolen']],
                     bins=5, label=['Recovered', 'Stolen'])
            plt.title('Value of Property Recovered and Stolen')
            plt.xlabel('Value of Property')
            plt.ylabel('Frequency')
            plt.legend()
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            year_data = df.groupby('Year').agg({'Cases_Property_Recovered': 'sum',
                                                'Cases_Property_Stolen': 'sum'})
            year_data.plot(kind='bar')
            plt.title('Cases of Property Recovered and Stolen by Year')
            plt.xlabel('Year')
            plt.ylabel('Number of Cases')
            plt.savefig('my_plot.png')
            st.image('my_plot.png')
            summary_stats = df[['Cases_Property_Recovered',
                                'Cases_Property_Stolen']].describe().round(2)
            summary_stats.rename(columns={'Cases_Property_Recovered': 'Recovered Cases',
                                          'Cases_Property_Stolen': 'Stolen Cases'}, inplace=True)
            st.write(summary_stats)
        elif item == 'auto' or item == 'auto theft':
            g5 = pd.DataFrame(auto_theft.groupby(['Area_Name'])['Auto_Theft_Stolen'].sum().reset_index())
            st.write(g5)
            g5.columns = ['State/UT', 'Vehicle_Stolen']
            g5.replace(to_replace='Arunachal Pradesh', value='Arunanchal Pradesh', inplace=True)
            shp_gdf = gpd.read_file('/content/India_States/Indian_states.shp')
            merged = shp_gdf.set_index('st_nm').join(g5.set_index('State/UT'))
            colors = ['hotpink', 'purple', 'red']
            fig = go.Figure(data=[go.Pie(labels=vehicle_group, values=vehicle_vals, sort=False,
                                         marker=dict(colors=colors), textfont_size=12)])
            st.plotly_chart(fig)
            g5 = pd.DataFrame(auto_theft.groupby(['Year'])['Auto_Theft_Stolen'].sum().reset_index())
            sr_no = [1, 2, 3, 4, 5]
            g8 = pd.DataFrame(motor_c.groupby(['Area_Name'])['Auto_Theft_Stolen'].sum().reset_index())
            g8_sorted = g8.sort_values(['Auto_Theft_Stolen'], ascending=True)
            fig = px.scatter(g8_sorted.iloc[-10:, :], y='Area_Name', x='Auto_Theft_Stolen',
                             orientation='h', color_discrete_sequence=["red"])
            st.plotly_chart(fig)
elif item == 'murder' or item == 'killer' or item == 'death' or item == 'homicide' or item == 'fatalities':
murder = pd.read_csv("/content/32_Murder_victim_age_sex.csv")
st.write(murder.Year.unique())
murder.Area_Name.unique()
murder.Sub_Group_Name.unique()
st.write(murder.head(10))
url = "https://flo.uri.sh/visualisation/2693755/embed"
plt.style.use("fivethirtyeight") plt.figure(figsize = (14,10)) ax = sns.barplot( x =
'Year', y = 'Victims_Total' , hue = 'Sub_Group_Name' , data =
murderg ,palette= 'bright') #plotting barplot plt.title('Gender
Distribution of Victims per Year',size = 20) ax.set_ylabel('')
plt.savefig('my_plot.png') st.image('my_plot.png')
murdera = murder.groupby(['Year'])
['Victims_Upto_10_15_Yrs','Victims_Above_50_Yrs', 'Victims_Upto_10_Yrs',
'Victims_Upto_15_18_Yrs',
'Victims_Above_50_Yrs', 'Victims_Upto_10_Yrs',
'Victims_Upto_15_18_Yrs','Victims_Upto_18_30_Yrs',
'Victims_Upto_30_50_Yrs',].sum().reset_index() #grouping
with the gender and age groups
plt.style.use("fivethirtyeight") plt.figure(figsize = (14,10)) ax = sns.barplot(x
= 'Sub_Group_Name' , y = 'vals',hue = 'AgeGroup' ,data =
murderag,palette= 'colorblind') #making barplot taking Agegroup as hue/category plt.title('Age
& Gender Distribution of Victims',size = 20) ax.get_legend().set_bbox_to_anchor((1, 1))
#using anchor so that legend doesnt show on
the graph
ax.set_ylabel('')
ax.set_xlabel('Victims Gender')
for p in ax.patches:
ax.annotate("%.f" % p.get_height(), (p.get_x() + p.get_width() / 2.,
p.get_height()), ha='center', va='center', fontsize=15, color='black', xytext=(0,
8), textcoords='offset points')
plt.savefig('my_plot.png')
st.image('my_plot.png')
murderst = murder[murder['Sub_Group_Name'] == '3. Total']  # we need only the total number of victims per state
murderst = murderst.groupby(['Area_Name'])['Victims_Total'].sum().sort_values(ascending=False).reset_index()
new_row = {'Area_Name': 'Telangana', 'Victims_Total': 27481}
murderst = murderst.append(new_row, ignore_index=True)  # DataFrame.append is removed in pandas 2.x; pd.concat is the modern equivalent
murderst = murderst.sort_values('Area_Name')
import geopandas as gpd
gdf = gpd.read_file('/content/India_States/Indian_states.shp')
murderst.at[17, 'Area_Name'] = 'NCT of Delhi'
merged = gdf.merge(murderst, left_on='st_nm', right_on='Area_Name')
merged = merged.drop(['Area_Name'], axis=1)
# merged.describe()
merged['coords'] = merged['geometry'].apply(lambda x: x.representative_point().coords[:])
merged['coords'] = [coords[0] for coords in merged['coords']]
sns.set_context("talk")
sns.set_style("dark")
# plt.style.use('dark_background')
cmap = 'YlGn'
figsize = (25, 20)
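# The map-drawing call itself did not survive extraction; a minimal sketch of
# the intended GeoPandas choropleth, assuming the merged frame still carries
# the Victims_Total column (an assumption), would be:
fig, ax = plt.subplots(1, figsize=figsize)
ax.axis('off')
merged.plot(column='Victims_Total', cmap=cmap, linewidth=0.5, ax=ax,
            edgecolor='0.2', legend=True)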
plt.savefig('my_plot.png')
st.image('my_plot.png')
# Expose the Streamlit app from Colab: print the public IP, install localtunnel,
# run the app in the background, and tunnel port 8501
!curl https://ipv4.icanhazip.com/
!npm install localtunnel
!streamlit run /content/app.py &>/content/logs.txt &
!npx localtunnel --port 8501
ML PART
# Visualization Libraries
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (precision_score, recall_score, confusion_matrix,
                             classification_report, accuracy_score, f1_score)
# Evaluation Metrics
from yellowbrick.classifier import ClassificationReport
from sklearn import metrics
# error_bad_lines=False skips malformed rows (deprecated in newer pandas; on_bad_lines='skip' is the replacement)
df = pd.concat([pd.read_csv('/content/drive/MyDrive/CRIMEANALYSIS/archive/Chicago_Crimes_2001_to_2004.csv', error_bad_lines=False),
                pd.read_csv('/content/drive/MyDrive/CRIMEANALYSIS/archive/Chicago_Crimes_2005_to_2007.csv', error_bad_lines=False)],
               ignore_index=True)
df = pd.concat([df, pd.read_csv('/content/drive/MyDrive/CRIMEANALYSIS/archive/Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)], ignore_index=True)
df = pd.concat([df, pd.read_csv('/content/drive/MyDrive/CRIMEANALYSIS/archive/Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)], ignore_index=True)
df.head()
df.to_csv('output.csv', index=False)
df.info()
df = df.dropna()
# As the dataset is very large, we subsample 100,000 rows for modelling as a proof of concept
df = df.sample(n=100000)
df.info()
df = df.drop(['Date'], axis=1)
df = df.drop(['date2'], axis=1)
df = df.drop(['Updated On'], axis=1)
df.head()
df.groupby([df['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.show()
# First, we sum up the number of occurrences of each crime type and select the 13 rarest classes
all_classes = df.groupby(['Primary Type'])['Block'].size().reset_index()
all_classes['Amt'] = all_classes['Block']
all_classes = all_classes.drop(['Block'], axis=1)
all_classes = all_classes.sort_values(['Amt'], ascending=[False])
unwanted_classes = all_classes.tail(13)
unwanted_classes
# After that, we replace them with the label 'OTHERS'
df.loc[df['Primary Type'].isin(unwanted_classes['Primary Type']), 'Primary Type'] = 'OTHERS'
plt.show()
# At this point, the attributes are selected manually based on the Feature Selection part.
Features = ["IUCR", "Description", "FBI Code"]
print('Full Features: ', Features)
x1 = x[Features]   # features for training
x2 = x[Target]     # target class for training
y1 = y[Features]   # features for testing
y2 = y[Target]     # target class for testing
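# The split that produces `x`, `y` and `Target` is not visible in the extracted
# listing; a minimal sketch, assuming the target is 'Primary Type' and an 80/20
# train/test split (both assumptions), would be:
Target = 'Primary Type'
x, y = train_test_split(df, test_size=0.2, random_state=42)  # x: training rows, y: test rows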
# Random Forest
# Create Model with configuration
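# The model construction itself did not survive extraction; a plausible minimal
# configuration, assuming scikit-learn's RandomForestClassifier (the import and
# hyperparameters are assumptions, not from the original), would be:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X=x1, y=x2)  # the fit call appears later in the listing under 'Model Training'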
# Prediction
result = rf_model.predict(y[Features])
# Classification Report
# Instantiate the classification model and visualizer
target_names = Classes
visualizer = ClassificationReport(rf_model, classes=target_names)
visualizer.fit(X=x1, y=x2)   # fit the training data to the visualizer
visualizer.score(y1, y2)     # evaluate the model on the test data
# Classification Report
# Instantiate the classification model and visualizer
target_names = Classes
visualizer = ClassificationReport(nn_model, classes=target_names)
visualizer.fit(X=x1, y=x2)   # fit the training data to the visualizer
visualizer.score(y1, y2)     # evaluate the model on the test data
print('================= Classification Report =================')
print('')
print(classification_report(y2, result, target_names=target_names))
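# `eclf1` is used below but never constructed in the surviving listing; a
# minimal sketch, assuming a scikit-learn VotingClassifier over the two models
# above (the exact members and voting scheme are assumptions), would be:
from sklearn.ensemble import VotingClassifier
eclf1 = VotingClassifier(estimators=[('rf', rf_model), ('nn', nn_model)], voting='hard')
eclf1.fit(X=x1, y=x2)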
result = eclf1.predict(y[Features])
result
APP CODE
df.groupby([df['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.savefig('my_plot1.png')
st.image('my_plot1.png')
all_classes = df.groupby(['Primary Type'])['Block'].size().reset_index()
all_classes['Amt'] = all_classes['Block']
all_classes = all_classes.drop(['Block'], axis=1)
all_classes = all_classes.sort_values(['Amt'], ascending=[False])
unwanted_classes = all_classes.tail(13)
df.loc[df['Primary Type'].isin(unwanted_classes['Primary Type']), 'Primary Type'] = 'OTHERS'
# Plot Bar Chart visualizing Primary Types
plt.figure(figsize=(14, 10))
plt.title('Amount of Crimes by Primary Type')
plt.ylabel('Crime Type')
plt.xlabel('Amount of Crimes')
df.groupby([df['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.savefig('my_plot1.png')
st.image('my_plot1.png')
Classes = df['Primary Type'].unique()
Classes
df['Primary Type'] = pd.factorize(df["Primary Type"])[0]
df['Primary Type'].unique()
X_fs = df.drop(['Primary Type'], axis=1)
Y_fs = df['Primary Type']
x1 = x[Features]   # features for training
x2 = x[Target]     # target class for training
y1 = y[Features]   # features for testing
y2 = y[Target]     # target class for testing
# Model Training
from sklearn.neural_network import MLPClassifier  # import not shown in the surviving listing
rf_model.fit(X=x1, y=x2)
nn_model = MLPClassifier(solver='adam',
                         alpha=1e-5,
                         hidden_layer_sizes=(40,),
                         random_state=1,
                         max_iter=1000)
# Prediction
st.write("============= Ensemble Voting Results =============")
st.write("Accuracy : ", ac_sc) st.write("Recall : ", rc_sc)
st.write("Precision : ", pr_sc) st.write("F1 Score : ", f1_sc)
st.write("Confusion Matrix: ") st.write(confusion_m) target_names =
Classes visualizer = ClassificationReport(eclf1, classes=target_names)
visualizer.fit(X=x1, y=x2) # Fit the training data to the visualizer
visualizer.score(y1, y2) # Evaluate the model on the test data
g = visualizer.poof(outpath='my_classification_report.png')
# Save the figure as a PNG file
st.image('my_classification_report.png') df =
pd.DataFrame({'Offense': offenses, 'Percentage': percentages})
st.header('here are the risk %') # Display the DataFrame in
Streamlit
st.write(df)
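# `offenses` and `percentages` are never defined in the surviving listing; one
# hypothetical reconstruction, assuming they summarise the predicted class
# distribution as risk percentages, would be:
pred_counts = pd.Series(result).value_counts(normalize=True)
offenses = [Classes[i] for i in pred_counts.index]
percentages = (pred_counts * 100).round(2).tolist()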
!curl https://ipv4.icanhazip.com/
!npm install localtunnel
!streamlit run /content/app2.py &>/content/logs.txt &
!npx localtunnel --port 8501
from warnings import simplefilter
simplefilter("ignore")
import os
st.title("CRIME ANALYSIS") st.write('What
kind of info you are looking for')
shp_gdf = gpd.read_file('/content/India_States/Indian_states.shp')
merge = shp_gdf.set_index('st_nm').join(g1.set_index('State/UT'))
fig, ax = plt.subplots(1, figsize=(10, 10))
elif item == 'human rights' or item == 'torture' or item == 'extortion' or item == 'atrocities' \
        or item == 'arrest' or item == 'fake encounter' or item == 'false implication':
x = item
st.header(x.upper() + ' CRIME')
g2 = pd.DataFrame(police_hr.groupby(['Area_Name'])['Cases_Registered_under_Human_Rights_Violations'].sum().reset_index())
st.write(x)
st.write(g2)
st.header('YEAR WISE CASES')
g3 = pd.DataFrame(police_hr.groupby(['Year'])['Cases_Registered_under_Human_Rights_Violations'].sum().reset_index())
g3.columns = ['Year', 'Cases Registered']
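# `g4` and `year` are not defined in the surviving listing; a plausible
# reconstruction, assuming police_hr also carries the Policemen_Chargesheeted
# and Policemen_Convicted columns (an assumption), would be:
g4 = police_hr.groupby(['Year'])[['Policemen_Chargesheeted', 'Policemen_Convicted']].sum().reset_index()
year = g4['Year']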
fig = go.Figure(data=[
go.Bar(name='Policemen Chargesheeted', x=year, y=g4['Policemen_Chargesheeted'],
marker_color='purple'),
go.Bar(name='Policemen Convicted', x=year, y=g4['Policemen_Convicted'],
marker_color='red')
])
fig.update_layout(barmode='group', xaxis_title='Year', yaxis_title='Number of policemen')
st.plotly_chart(fig)
st.header(x + ' STATE WISE REPORTS')
g2.columns = ['State/UT', 'Cases Reported']
st.write(g2)
g2.replace(to_replace='Arunachal Pradesh', value='Arunanchal Pradesh', inplace=True)  # match the spelling used in the shapefile
colormaps = ['RdPu', 'viridis', 'coolwarm', 'Blues', 'Greens', 'Reds', 'PuOr', 'inferno',
             'magma', 'cividis', 'cool', 'hot', 'YlOrRd', 'YlGnBu']
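# `random_cmap`, the joined frame and the axes used below are not defined in
# the surviving listing; presumably a colormap is picked at random from
# `colormaps` and the map is assembled as in the other branches (assumptions):
import random
random_cmap = random.choice(colormaps)
shp_gdf = gpd.read_file('/content/India_States/Indian_states.shp')
merged = shp_gdf.set_index('st_nm').join(g2.set_index('State/UT'))
fig, ax = plt.subplots(1, figsize=(10, 10))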
ax.axis('off')
ax.set_title('State-wise ' + x + ' Cases Reported', fontdict={'fontsize': '15', 'fontweight': '3'})
fig = merged.plot(column='Cases Reported', cmap=random_cmap, linewidth=0.5, ax=ax,
                  edgecolor='0.2', legend=True)
plt.savefig('my_plot.png')
st.header('INTENSITY MAP')
st.image('my_plot.png')
st.header('Penalties')
st.write(penalties.get(item))
elif item == 'property' or item == 'property stolen' or item == 'stolen' or item == 'Burglary':
df = pd.read_csv('/content/10_Property_stolen_and_recovered.csv')
stats = df.describe()
st.write(stats)
plt.figure()  # start a fresh figure so successive charts do not overlap
plt.bar(['Recovered', 'Stolen'],
        [df['Cases_Property_Recovered'][0], df['Cases_Property_Stolen'][0]])
plt.title('Cases of Property Recovered and Stolen')
plt.xlabel('Type of Property')
plt.ylabel('Number of Cases')
plt.savefig('my_plot.png')
st.image('my_plot.png')
labels = ['Recovered', 'Stolen']
sizes = [df['Value_of_Property_Recovered'][0], df['Value_of_Property_Stolen'][0]]
colors = ['green', 'red']
plt.figure()
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title('Property Recovered and Stolen')
plt.axis('equal')
plt.savefig('my_plot.png')
st.image('my_plot.png')
group_data = df.groupby('Group_Name').agg({'Cases_Property_Recovered': 'sum',
                                           'Cases_Property_Stolen': 'sum'})
group_data.plot(kind='bar')
plt.title('Cases of Property Recovered and Stolen by Group Name')
plt.xlabel('Group Name')
plt.ylabel('Number of Cases')
plt.savefig('my_plot.png')
st.image('my_plot.png')
cases_by_area_year = df.pivot_table(values=['Cases_Property_Recovered', 'Cases_Property_Stolen'],
                                    index='Area_Name', columns='Year', aggfunc='sum')
st.write(cases_by_area_year)
plt.figure()
plt.scatter(df['Value_of_Property_Recovered'], df['Value_of_Property_Stolen'])
plt.title('Value of Property Recovered vs. Stolen')
plt.xlabel('Value of Property Recovered')
plt.ylabel('Value of Property Stolen')
plt.savefig('my_plot.png')
st.image('my_plot.png')
top_stolen = df.sort_values(by='Cases_Property_Stolen',
                            ascending=False).head(5)[['Sub_Group_Name', 'Cases_Property_Stolen']]
top_stolen.rename(columns={'Sub_Group_Name': 'Sub-group',
                           'Cases_Property_Stolen': 'Number of Cases Stolen'}, inplace=True)
top_stolen.reset_index(drop=True, inplace=True)
top_stolen.index += 1
st.write(top_stolen)
shp_gdf = gpd.read_file('/content/India_States/Indian_states.shp')
merged = shp_gdf.set_index('st_nm').join(g5.set_index('State/UT'))
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.axis('off')
ax.set_title('State-wise Auto Theft Cases Reported (2001-2010)',
             fontdict={'fontsize': '15', 'fontweight': '3'})
colors = ['hotpink','purple','red']
fig = go.Figure(data=[go.Pie(labels=vehicle_group, values=vehicle_vals, sort=False,
                             marker=dict(colors=colors), textfont_size=12)])
st.plotly_chart(fig)
g5 = pd.DataFrame(auto_theft.groupby(['Year'])['Auto_Theft_Stolen'].sum().reset_index())
               cells=dict(values=[sr_no, vehicle_list], height=30))
])
st.plotly_chart(fig)
motor_c = auto_theft[auto_theft['Sub_Group_Name'] == '1. Motor Cycles/ Scooters']
g8 = pd.DataFrame(motor_c.groupby(['Area_Name'])['Auto_Theft_Stolen'].sum().reset_index())
g8_sorted = g8.sort_values(['Auto_Theft_Stolen'], ascending=True)
fig = px.scatter(g8_sorted.iloc[-10:, :], y='Area_Name', x='Auto_Theft_Stolen',
                 orientation='h', color_discrete_sequence=["red"])
st.plotly_chart(fig)
elif item == 'murder' or item == 'killer' or item == 'death' or item == 'homicide' or item == 'fatalities':
murder = pd.read_csv("/content/32_Murder_victim_age_sex.csv")
st.write(murder.Year.unique())
murder.Area_Name.unique()
murder.Sub_Group_Name.unique()
st.write(murder.head(10))
url = "https://flo.uri.sh/visualisation/2693755/embed"
murderg = murder.groupby(['Year', 'Sub_Group_Name'])['Victims_Total'].sum().reset_index()  # grouping with year and sub group
murderg = murderg[murderg['Sub_Group_Name'] != '3. Total']  # we don't need the total category of sub group
plt.style.use("fivethirtyeight")
plt.figure(figsize=(14, 10))
ax = sns.barplot(x='Sub_Group_Name', y='vals', hue='AgeGroup', data=murderag, palette='colorblind')
plt.title('Age & Gender Distribution of Victims', size=20)
ax.get_legend().set_bbox_to_anchor((1, 1))  # anchor the legend so it doesn't overlap the graph
ax.set_ylabel('')
ax.set_xlabel('Victims Gender')
for p in ax.patches:
    ax.annotate("%.f" % p.get_height(),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=15, color='black',
                xytext=(0, 8), textcoords='offset points')
plt.savefig('my_plot.png')
st.image('my_plot.png')
murderst = murder[murder['Sub_Group_Name'] == '3. Total']  # we need only the total number of victims per state
murderst = murderst.groupby(['Area_Name'])['Victims_Total'].sum().sort_values(ascending=False).reset_index()
new_row = {'Area_Name': 'Telangana', 'Victims_Total': 27481}
murderst = murderst.append(new_row, ignore_index=True)  # DataFrame.append is removed in pandas 2.x; pd.concat is the modern equivalent
murderst = murderst.sort_values('Area_Name')
import geopandas as gpd
gdf = gpd.read_file('/content/India_States/Indian_states.shp')
murderst.at[17, 'Area_Name'] = 'NCT of Delhi'
merged = gdf.merge(murderst, left_on='st_nm', right_on='Area_Name')
merged = merged.drop(['Area_Name'], axis=1)
# merged.describe()
merged['coords'] = merged['geometry'].apply(lambda x: x.representative_point().coords[:])
merged['coords'] = [coords[0] for coords in merged['coords']]
sns.set_context("talk")
sns.set_style("dark")
#plt.style.use('dark_background')
cmap = 'YlGn'
figsize = (25, 20)
plt.savefig('my_plot.png')
st.image('my_plot.png')
df.groupby([df['Primary Type']]).size().sort_values(ascending=True).plot(kind='barh')
plt.savefig('my_plot1.png')
st.image('my_plot1.png')
Classes = df['Primary Type'].unique()
Classes
df['Primary Type'] = pd.factorize(df["Primary Type"])[0]
df['Primary Type'].unique()
X_fs = df.drop(['Primary Type'], axis=1)
Y_fs = df['Primary Type']
# Model Training
from sklearn.neural_network import MLPClassifier  # import not shown in the surviving listing
rf_model.fit(X=x1, y=x2)
nn_model = MLPClassifier(solver='adam',
                         alpha=1e-5,
                         hidden_layer_sizes=(40,),
                         random_state=1,
                         max_iter=1000)
# Prediction
result = eclf1.predict(y[Features])
ac_sc = accuracy_score(y2, result)
rc_sc = recall_score(y2, result, average="weighted")
pr_sc = precision_score(y2, result, average="weighted")
f1_sc = f1_score(y2, result, average='micro')
confusion_m = confusion_matrix(y2, result)
g = visualizer.poof(outpath='my_classification_report.png')
APPENDIX 2: OUTPUT SCREENS
CONFERENCE CERTIFICATES