
ABSTRACT

Brain O Vision Solutions (INDIA) Pvt Ltd is a leading company that specializes in providing innovative solutions to complex problems. As a Data Science intern at Brain O Vision, you will have the opportunity to work with a team of experienced data scientists and gain hands-on experience in the field of Data Science.
The Data Science internship at Brain O Vision is a four-month program designed to provide students and recent graduates with a comprehensive understanding of Data Science principles and techniques. During the internship, you will work on real-world projects and be exposed to various tools and technologies used in the industry.

Road accidents pose a significant threat to public safety worldwide, causing immense human suf-
fering and economic losses. Predictive analytics using machine learning (ML) techniques has
emerged as a promising approach to mitigate these risks by identifying patterns and factors con-
tributing to accidents. In this study, we propose a predictive model implemented in Python to
forecast road accidents based on historical data.

The methodology involves several steps. Firstly, we collect and preprocess a comprehensive
dataset comprising various features such as weather conditions, road characteristics, traffic
volume, and historical accident records. Next, we employ exploratory data analysis techniques to
gain insights into the data distribution and relationships between different variables. Sub-
sequently, we utilize supervised ML algorithms, including decision trees, random forests, sup-
port vector machines, and neural networks, to train and evaluate the predictive models.

The performance of each model is assessed using metrics such as accuracy, precision, recall, and
F1-score. We also conduct feature importance analysis to identify the most influential factors af-
fecting accident occurrence. Additionally, we explore ensemble techniques to combine multiple
models for improved prediction accuracy and robustness.

The experimental results demonstrate the effectiveness of the proposed approach in accurately
predicting road accidents. By leveraging ML algorithms, our model can assist transportation au-
thorities and policymakers in proactively identifying high-risk areas and implementing targeted
interventions to reduce accident rates. Furthermore, the Python implementation ensures scalabil-
ity, flexibility, and ease of integration into existing systems.

CONTENTS
Certificate
Internship Certificate
Acknowledgement
Abstract
Chapter 1 Introduction
1.1 Introduction to System
1.2 Problem Definition
1.3 Aim
1.4 Objective
1.5 Need of the System
Chapter 2 Hardware & Software Requirements
2.1 System Environment
Chapter 3 Techniques of Prediction
3.1 Prediction Models
3.2 Feature Engineering Approaches
Chapter 4 API Development
4.1 Introduction to RESTful API
4.2 Designing APIs with Django REST Framework
Chapter 5 Implementation Issues
5.1 Python
5.2 Database
Chapter 6 Testing
6.1 Testing Objectives
6.2 Types of Testing
6.3 Testing Methods Used
6.4 Testing Results & Analysis
Chapter 7 Coding
Chapter 8 Future Scope
Conclusion

CHAPTER 1
INTRODUCTION
1.1 Introduction to System:
Road accident prediction offers a critical avenue for improving road safety and mitigating
the loss of life and property caused by traffic accidents. This research field combines
various disciplines such as data science, machine learning, transportation engineering, and
urban planning to develop models and tools that can anticipate and prevent road accidents.
Road accidents continue to be a significant global concern, exacting a heavy toll in terms
of human lives, economic losses, and societal well-being. Despite advancements in
vehicle safety technologies and traffic management systems, the World Health
Organization (WHO) reports that road traffic injuries remain a leading cause of death
worldwide, particularly among young adults aged 15-29 years. Furthermore, road
accidents impose a substantial economic burden, costing countries billions of dollars
annually in healthcare expenses, property damage, and lost productivity.
In light of these challenges, there is a growing imperative to develop proactive measures
to mitigate the occurrence and severity of road accidents. One promising approach is the
use of predictive modeling techniques to forecast where and when accidents are likely to
happen. By leveraging available data on traffic patterns, road conditions, weather
forecasts, and historical accident records, researchers and policymakers can identify high-
risk areas and implement targeted interventions to enhance road safety.
This research project aims to contribute to the ongoing efforts in road accident prevention
by developing advanced predictive models capable of forecasting accidents with
improved accuracy and reliability. Through the integration of machine learning
algorithms, statistical analysis, and spatial modeling techniques, we seek to uncover
underlying patterns and risk factors associated with road accidents. By understanding the
complex interplay of variables influencing accident occurrence, we aim to provide
actionable insights for policymakers, transportation agencies, and other stakeholders to
implement evidence-based interventions and reduce the incidence of road accidents.
Significance of the Issue: Road accidents are a significant global concern due to their
devastating impact on human lives, economies, and societies. Despite advancements in
vehicle safety technology and traffic management systems, the number of road traffic
injuries and fatalities remains high. This highlights the urgency of finding effective
solutions to reduce the occurrence and severity of road accidents.
Objective of the Research: The primary objective of the research is to develop advanced
predictive models for road accident prediction. These models will leverage data on
various factors such as traffic patterns, road conditions, weather forecasts, and historical
accident records to forecast where and when accidents are likely to occur. By accurately
predicting accident hotspots and risk factors, the research aims to facilitate the
implementation of targeted interventions to improve road safety.
Methodology: The research will employ a multidisciplinary approach, integrating
techniques from data science, machine learning, statistical analysis, and spatial modeling.
Machine learning algorithms will be utilized to uncover patterns and relationships within
the data, while statistical analysis will help identify significant risk factors associated with
road accidents. Spatial modeling techniques will enable the visualization and analysis of
geographic patterns in accident occurrence.
Findings and Analysis: The research will present findings derived from the developed
predictive models and analyze the underlying factors influencing road accidents. This
analysis may include identifying high-risk areas, assessing the impact of specific variables
(such as road infrastructure, traffic volume, and weather conditions) on accident
occurrence, and evaluating the effectiveness of existing road safety measures.
Recommendations for Future Research and Applications: The research will conclude by
offering recommendations for future research directions and practical applications. This
may include suggestions for refining predictive models, collecting additional data sources,
implementing targeted interventions, and integrating predictive analytics into existing
traffic management systems. The ultimate goal is to contribute to the advancement of road
safety initiatives and create safer transportation systems for all road users.
1.2 Problem Definition:
Prediction of road accidents is a pressing requirement of the current era. Many scholars have worked
in this domain and proposed various models, but the accuracy of this work still needs to be improved.
Hence, prediction accuracy is the basic issue in this type of work. Furthermore, earlier work has paid
little attention to improving the input dataset through pre-processing steps, even though better
pre-processing improves the learning of the model. To resolve these issues, this work introduces a new
learning method with improved results.

1.3 Aim:

“To improve the prediction accuracy of Road Accident Prediction.”

1.4 Objective:

To perform a thorough analysis of the research area.
To study the problems in the system through fact-finding techniques.
To develop conceptual, logical and physical models for the system.
To enhance the learning model's detection accuracy.
To propose a Road Accident Prediction model with a trained dataset.
To document our efforts and analysis in a proper, comprehensible manner.

Goal:
The project is targeted at those people who want to predict road accidents.
Improve the quality of the input training dataset.
The system should be adaptive for large datasets, irrespective of the training model.
To make a trained model that is consistent, reliable and secure.
To develop a well-organized information prediction model.
To make good documentation so as to facilitate possible future enhancements.

1.5 Need of the System:


At present, almost everything is available online, which saves people both time and money. This
online data can be used to train machine learning models, and the resulting predictions can help
the world.
The need for data prediction, particularly in contexts like stock market analysis, is paramount
due to the inherently uncertain and volatile nature of financial markets. Predictive models serve
as invaluable tools for investors, financial analysts, and institutions seeking to anticipate future
trends and make informed decisions. In the realm of stock market analysis, data prediction
enables stakeholders to forecast price movements, identify potential opportunities for investment,
and manage risk more effectively. By leveraging historical market data, fundamental indicators,
economic factors, and market sentiment, predictive models can uncover patterns, correlations,
and trends that may not be readily apparent to human analysts. Moreover, in an era marked by
the proliferation of big data and advanced analytical techniques, the need for accurate and timely
predictions has become more pronounced than ever. Data prediction empowers market
participants to adapt swiftly to changing market conditions, optimize portfolio performance, and
capitalize on emerging opportunities, thereby enhancing overall competitiveness and financial
resilience. In essence, data prediction plays a pivotal role in navigating the complexities of the
stock market landscape, enabling stakeholders to stay ahead of the curve and make data-driven
decisions that drive success and mitigate risks.
CHAPTER 2
HARDWARE & SOFTWARE REQUIREMENTS
Introduction:

In this chapter we describe the software and hardware requirements that are necessary for
successfully running this system. A major element in building systems is selecting compatible
hardware and software. The system analyst has to determine which software package is best suited
for “Road Accident Prediction” and, where software is not an issue, the kind of hardware and
peripherals needed for the final conversion.

2.1 System Environment:

After analysis, some resources are required to convert the abstract system into a real one. All
the resources that make up a robust software and real system are identified. The hardware and
software selection begins with requirement analysis, followed by a request for proposal and
vendor evaluation. According to the provided functional specification, all the technologies and
their capacities are identified. Basic functions, procedures and methodologies are prepared for
implementation. Some of the basic requirements, such as hardware and software, are described as
follows:
Hardware and Software Specification

Software Requirements:
• Microsoft Windows 7/8
• HTML/Python
• MS Office package

Hardware Requirements:
• Intel processor, 2.0 GHz or above
• 2 GB RAM or more
• 160 GB hard disk drive or larger

CHAPTER 3
TECHNIQUES OF PREDICTION

3.1 Prediction Models:

Prediction models are used across various domains to forecast future outcomes based on
historical data and relevant features. Here are some commonly used techniques for building
prediction models:

Linear Regression:

Linear regression is a simple yet powerful technique used for predicting a continuous target
variable based on one or more input features.
It assumes a linear relationship between the input features and the target variable.

Logistic Regression:

Logistic regression is used for binary classification tasks, where the target variable has two
possible outcomes.
It models the probability that a given input belongs to a particular class using a logistic function.

Decision Trees:

Decision trees are versatile models that can be used for both classification and regression tasks.
They partition the feature space into regions and make predictions based on the majority class or
mean value within each region.

Random Forests:

Random forests are ensemble learning methods that combine multiple decision trees to improve
prediction accuracy and robustness.
They introduce randomness during the training process by bootstrap sampling and feature
selection.

Gradient Boosting Machines (GBMs):

GBMs are another ensemble learning technique that builds a series of weak learners sequentially
to minimize a loss function.
They fit each new model to the residual errors of the previous model, thereby reducing the
overall prediction error.

Support Vector Machines (SVM):

SVM is a supervised learning algorithm that can be used for classification or regression tasks.
It finds the optimal hyperplane that separates classes or predicts the target variable with
maximum margin.

Neural Networks:

Neural networks are highly flexible models inspired by the structure of the human brain.
They consist of interconnected layers of neurons and can be used for a wide range of prediction
tasks, including classification, regression, and time series forecasting.
Time Series Models:

Time series models are specifically designed to predict future values based on past observations
in a time-dependent manner.
Techniques such as Autoregressive Integrated Moving Average (ARIMA), Exponential
Smoothing (ETS), and Long Short-Term Memory (LSTM) networks are commonly used for time
series forecasting.

K-Nearest Neighbors (KNN):

KNN is a non-parametric method used for classification and regression tasks.


It predicts the target variable by averaging the values of the k-nearest neighbors in the feature
space.

Ensemble Methods:

Ensemble methods combine multiple individual models to produce a single prediction, often
achieving better performance than any single model alone.
Techniques such as bagging, boosting, and stacking are commonly used for ensemble learning.

When choosing a prediction model technique, it's essential to consider factors such as the nature
of the data, the complexity of the problem, interpretability requirements, computational
resources, and the trade-off between bias and variance. Additionally, model evaluation and
validation are crucial to ensure the selected technique performs well on unseen data and
generalizes effectively to new instances.
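As a small illustration of how several of these techniques can be trained and compared in Python, the sketch below uses scikit-learn on a synthetic dataset; the chosen models and settings are illustrative assumptions, not the project's final configuration.

# Minimal sketch: comparing a few of the prediction models described above.
# The synthetic dataset and hyperparameters are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    # 5-fold cross-validation accuracy gives a quick, comparable score per model
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

In practice the scores, not any single default configuration, guide which family of models is worth tuning further for the accident dataset.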

Feature Engineering Techniques


Feature engineering is a crucial step in the process of building predictive models. It involves
transforming raw data into a format that is suitable for machine learning algorithms, extracting
relevant information, and creating new features that can improve prediction performance. Here
are some common feature engineering techniques:

Imputation:

Imputation is used to handle missing values in the dataset. Missing values can be filled using
statistical measures such as mean, median, or mode of the feature, or through more advanced
techniques like K-nearest neighbors imputation or predictive imputation using regression
models.

Normalization and Standardization:

Normalization scales numerical features to a standard range, typically between 0 and 1, making
them comparable across different scales.
Standardization transforms numerical features to have a mean of 0 and a standard deviation of 1,
ensuring that features are centered around zero and have a uniform scale.
One-Hot Encoding:

One-hot encoding is used to convert categorical variables into a numerical format that can be
used by machine learning algorithms.
It creates binary dummy variables for each category, with a value of 1 indicating the presence of
the category and 0 indicating absence.

Encoding Ordinal Variables:

Ordinal variables have a natural order or ranking, such as low, medium, and high.
Encoding ordinal variables involves mapping each category to a numerical value according to its
order, preserving the ordinal relationship between categories.

Feature Scaling:

Feature scaling ensures that all numerical features have a similar scale, preventing features with
large values from dominating those with smaller values during model training.
Techniques include Min-Max scaling, where features are scaled to a specific range, and Z-score
scaling, where features are standardized to have a mean of 0 and a standard deviation of 1.

Binning:

Binning involves dividing numerical features into discrete bins or intervals based on their values.
Binning can help capture non-linear relationships between features and the target variable and
reduce the impact of outliers.

Feature Interaction:

Feature interaction involves creating new features by combining two or more existing features.
This can help capture complex relationships between features and improve model performance.

Dimensionality Reduction:

Dimensionality reduction techniques such as Principal Component Analysis (PCA) or Singular
Value Decomposition (SVD) can be used to reduce the number of features in high-dimensional
datasets while preserving as much information as possible.
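The techniques above are often combined into a single preprocessing pipeline. The sketch below shows one way to do this with scikit-learn's SimpleImputer, StandardScaler and OneHotEncoder inside a ColumnTransformer; the column names and toy data are placeholders, not the project's actual schema.

# Sketch: imputation, scaling and one-hot encoding combined in one ColumnTransformer.
# Column names ("Speed_limit", "Weather_Conditions", ...) are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ["Speed_limit", "Number_of_Vehicles"]
categorical_cols = ["Weather_Conditions", "Road_Type"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),        # fill missing values with the median
    ("scale", StandardScaler()),                         # standardize to mean 0, std 1
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # binary dummy variables per category
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

df = pd.DataFrame({
    "Speed_limit": [30, 60, np.nan, 40],
    "Number_of_Vehicles": [2, 1, 3, np.nan],
    "Weather_Conditions": ["Fine", "Raining", np.nan, "Fine"],
    "Road_Type": ["Single", "Dual", "Roundabout", "Single"],
})
X = preprocessor.fit_transform(df)
print(X.shape)

Fitting the preprocessor only on the training split and reusing it on the test split avoids leaking information from unseen data into the transformations.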

3.2 Feature Engineering Approaches

Feature Selection:

Feature selection techniques identify the most relevant features that contribute the most to
prediction performance while removing irrelevant or redundant features.
Methods include filter methods (e.g., correlation-based feature selection), wrapper methods (e.g.,
recursive feature elimination), and embedded methods (e.g., Lasso regression).

Time Series Features:

For time series data, additional features such as lagged variables, rolling statistics (e.g., moving
averages), and seasonal indicators can be engineered to capture temporal patterns and trends.
Effective feature engineering can significantly impact the performance of predictive models,
allowing them to better capture underlying patterns and relationships in the data. It requires
domain knowledge, creativity, and experimentation to identify the most informative features for
a given prediction task.
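As a small sketch of the filter and wrapper approaches mentioned above, the snippet below applies SelectKBest and recursive feature elimination with scikit-learn on synthetic data; the choice of k and the estimator are assumptions made only for illustration.

# Sketch: filter-style selection (SelectKBest) and wrapper-style selection (RFE).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter method: keep the 5 features with the highest ANOVA F-score
filter_selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = filter_selector.fit_transform(X, y)
print("Filter method kept columns:", filter_selector.get_support(indices=True))

# Wrapper method: recursive feature elimination around a logistic regression model
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("RFE kept columns:", [i for i, keep in enumerate(rfe.support_) if keep])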

CHAPTER 4
API DEVELOPMENT

4.1 Introduction to RESTful API:

A RESTful API, short for Representational State Transfer Application Programming Interface, is
a type of web service designed to follow the principles of REST architecture. REST, or Repre-
sentational State Transfer, is an architectural style for creating networked applications, particu-
larly web services, emphasizing simplicity, scalability, and reliability. RESTful APIs enable com-
munication between clients and servers over the internet, facilitating the exchange of data and re-
sources in a stateless manner.

At the core of RESTful APIs lies the client-server architecture, which separates the client and
server components, enabling them to operate independently. This separation of concerns allows
for greater scalability and flexibility in designing and deploying web applications. The client
sends requests to the server, which processes these requests and returns responses containing the
requested data or performing the requested actions. This stateless nature ensures that each re-
quest contains all the information necessary for the server to understand and process it, without
relying on any previous interactions.

One of the fundamental principles of RESTful APIs is their resource-based nature. Resources,
such as data entities or functionality, are identified by unique URLs (Uniform Resource Loca-
tors) and can be accessed and manipulated using standard HTTP methods. These methods, in-
cluding GET, POST, PUT, and DELETE, correspond to CRUD (Create, Read, Update, Delete)
operations and form the foundation of interacting with RESTful APIs. Resources are typically
represented using standardized formats like JSON (JavaScript Object Notation) or XML (eXten-
sible Markup Language), allowing for interoperability and ease of integration across different
systems.

RESTful APIs follow a uniform interface for communication, using standard HTTP methods and
status codes to convey information between clients and servers. Each interaction with the API is
self-contained and stateless, ensuring that the server does not retain any client-specific informa-
tion between requests. Furthermore, RESTful APIs leverage hypermedia as the engine of appli-
cation state (HATEOAS), allowing clients to navigate the API dynamically by following hyper-
media links provided in resource representations. This enables a more flexible and discoverable
API design, enhancing the overall user experience and facilitating system integration.
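To make this concrete, the following sketch shows how a client could call such an API from Python with the requests library; the base URL and the "accidents" resource are hypothetical and exist only to illustrate the standard HTTP methods.

# Sketch: consuming a hypothetical RESTful API with standard HTTP methods.
# The base URL and the "accidents" resource are illustrative assumptions.
import requests

BASE_URL = "https://example.com/api"

# GET: read a collection of resources (Read in CRUD)
response = requests.get(f"{BASE_URL}/accidents/", params={"severity": "serious"})
print(response.status_code)   # e.g. 200 OK
print(response.json())        # JSON representation of the requested resources

# POST: create a new resource (Create in CRUD)
payload = {"longitude": -0.12, "latitude": 51.5, "severity": "slight"}
created = requests.post(f"{BASE_URL}/accidents/", json=payload)

# PUT: update an existing resource; DELETE: remove it
requests.put(f"{BASE_URL}/accidents/1/", json=payload)
requests.delete(f"{BASE_URL}/accidents/1/")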

4.2 Designing APIs with DJANGO REST framework:

Designing APIs with Django REST Framework (DRF) involves leveraging the powerful features
and tools provided by Django, a high-level Python web framework, to create RESTful web ser-
vices. Django REST Framework is a popular toolkit built on top of Django, specifically designed
for building APIs.
1. Installation and Setup: To get started with Django REST Framework, you first need to install
Django and Django REST Framework using pip, Python's package manager. Once installed, you
can create a new Django project and configure it to use Django REST Framework by adding it to
the installed apps in the project's settings.

2. Creating API Endpoints: Django REST Framework allows you to define API endpoints using
Django's URL routing mechanism. You can create views, known as viewsets or API views, to
handle requests to these endpoints and define the logic for processing and responding to them.
Viewsets provide a convenient way to organize related API endpoints for CRUD operations on
model instances, while API views offer more flexibility for customizing the behavior of individ-
ual endpoints.

3. Serializers: Serializers are a key component of Django REST Framework that facilitate the
conversion of complex data types, such as Django model instances, into JSON or other content
types suitable for transmission over the network. Serializers define the structure and format of
the data returned by API endpoints and handle the serialization and deserialization of data during
request and response processing.

4. Authentication and Authorization: Django REST Framework provides built-in support for vari-
ous authentication and authorization mechanisms to secure your APIs. You can configure authen-
tication backends to authenticate users based on credentials like username and password, tokens,
or OAuth2. Similarly, you can define permissions to control access to API endpoints based on
user roles and permissions.

5. Pagination, Filtering, and Ordering: Django REST Framework offers built-in support for pagi-
nation, filtering, and ordering of API querysets to efficiently handle large datasets and improve
the performance of your APIs. You can customize the behavior of pagination, filtering, and or-
dering using built-in classes and mixins provided by DRF or implement custom logic as needed.
6. Documentation and Testing: Django REST Framework provides tools for automatically gener-
ating API documentation based on your API views, serializers, and other components. You can
use tools like Django REST Swagger or Django REST Framework's built-in documentation gen-
erator to generate interactive API documentation that helps developers understand and use your
APIs. Additionally, DRF includes support for writing unit tests and integration tests to ensure the
reliability and correctness of your API endpoints.
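A minimal sketch of how these pieces fit together is shown below; the Accident model, its fields and the app name are hypothetical placeholders, not this project's actual code.

# Sketch: a serializer, a viewset and URL routing with Django REST Framework.
# The Accident model and its fields are hypothetical placeholders.
from rest_framework import serializers, viewsets, routers
from myapp.models import Accident  # assumed Django model


class AccidentSerializer(serializers.ModelSerializer):
    class Meta:
        model = Accident
        fields = ["id", "longitude", "latitude", "severity", "speed_limit"]


class AccidentViewSet(viewsets.ModelViewSet):
    """Provides list/retrieve/create/update/delete endpoints for Accident records."""
    queryset = Accident.objects.all()
    serializer_class = AccidentSerializer


# urls.py: the router generates the CRUD URL patterns automatically
router = routers.DefaultRouter()
router.register(r"accidents", AccidentViewSet)
urlpatterns = router.urls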

CHAPTER 5
IMPLEMENTATION ISSUES
5.1 Python

Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code reuse.
The Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed. Often, programmers fall in
love with Python because of the increased productivity it provides. Since there is no compilation
step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or
bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error,
it raises an exception. When the program doesn't catch the exception, the interpreter prints a
stack trace. A source level debugger allows inspection of local and global variables, evaluation of
arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on.
The debugger is written in Python itself, testifying to Python's introspective power. On the other
hand, often the quickest way to debug a program is to add a few print statements to the source:
the fast edit-test-debug cycle makes this simple approach very effective.

Implementing machine learning models in Python can encounter various issues, ranging from
technical challenges to conceptual hurdles. Here are some common issues and considerations
when implementing machine learning models in Python:

Data Preprocessing:

Data cleaning: Dealing with missing values, outliers, and inconsistencies in the dataset.
Data transformation: Scaling numerical features, encoding categorical variables, and handling
imbalanced datasets.
Feature engineering: Creating new features, selecting relevant features, and transforming
variables for better model performance.

Model Selection and Tuning:

Choosing the appropriate algorithm: Selecting the right machine learning algorithm based on the
problem type (e.g., classification, regression, clustering) and data characteristics.
Hyperparameter tuning: Optimizing the hyperparameters of the selected model using techniques
like grid search, random search, or Bayesian optimization.
Model evaluation: Assessing model performance using appropriate evaluation metrics and
validation techniques such as cross-validation.
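For instance, hyperparameter tuning with grid search and cross-validation might look like the following sketch; the parameter grid and scoring choice are illustrative assumptions rather than the project's exact settings.

# Sketch: tuning a Random Forest with grid search and 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="f1_weighted",   # weighted F1 copes better with class imbalance than accuracy
    n_jobs=-1,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated score:", round(search.best_score_, 3))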

Overfitting and Underfitting:

Overfitting occurs when the model learns the training data too well, capturing noise instead of
the underlying patterns, leading to poor generalization to unseen data.
Underfitting happens when the model is too simple to capture the underlying structure of the
data, resulting in high bias and low model performance.
Techniques to address overfitting include regularization (e.g., L1/L2 regularization), cross-
validation, and early stopping.

Computational Resources:

Memory constraints: Large datasets or complex models may require significant memory
resources, leading to memory errors or slowdowns.
CPU/GPU utilization: Some machine learning algorithms, especially deep learning models, may
benefit from GPU acceleration for faster training.
Optimization techniques: Batch processing, data streaming, and parallelization can help improve
computational efficiency for large-scale datasets.

Interpretability and Explainability:

Interpreting model predictions: Understanding how the model makes predictions and explaining
its decisions to stakeholders.
Model transparency: Using interpretable models (e.g., linear models, decision trees) or post-hoc
interpretability techniques (e.g., SHAP values, LIME) to gain insights into model behavior.
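As a small sketch of one such technique, the snippet below computes permutation feature importance with scikit-learn on synthetic data; it illustrates the idea rather than reproducing the project's reported feature-importance results.

# Sketch: ranking features by permutation importance on a held-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the test score drops
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")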

Deployment and Integration:

Model deployment: Integrating the trained model into production systems or applications for
real-world use.
Deployment infrastructure: Choosing the right deployment platform (e.g., cloud-based services,
containerization) and optimizing model inference for low latency and high throughput.
Model monitoring: Monitoring model performance and drift over time to ensure continued
accuracy and reliability in production environments.

Scalability and Maintenance:

Scalability concerns: Ensuring that the implemented solution can scale to handle increasing data
volumes or user loads.
Model maintenance: Updating models periodically with new data and retraining them to adapt to
changing patterns or drift in the data distribution.

Addressing these issues requires a combination of technical expertise, domain knowledge, and
iterative experimentation. Python's rich ecosystem of machine learning libraries (e.g., scikit-
learn, TensorFlow, PyTorch) and tools can help streamline the implementation process and
overcome many of these challenges. Additionally, leveraging best practices, documentation, and
community support can facilitate successful implementation and deployment of machine learning
models in Python.

5.2 Database:

Databases are crucial components of backend development, responsible for storing, organizing,
and managing data for applications. In backend development with Python, several database
options are available, each catering to different use cases and preferences. Databases play a
crucial role in managing and accessing data efficiently, enabling applications to store, retrieve,
and manipulate data effectively to support various business processes and functionalities.

What is SQL?

SQL (pronounced "ess-que-el") stands for Structured Query Language. SQL is used to
communicate with a database. According to ANSI (American National Standards Institute), it is
the standard language for relational database management systems.

SQL statements are used to perform tasks such as update data on a database, or retrieve data
from a database. Some common relational database management systems that use SQL are:
Oracle, Sybase, Microsoft SQL Server, Microsoft Access, Ingres, etc.

Although most database systems use SQL, most of them also have their own additional
proprietary extensions that are usually only used on their system. However, the standard SQL
commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to
accomplish almost everything that one needs to do with a database.

Terminology of SQL:

Here is some terminology related to SQL:


SQL: Structured Query Language is a programming language used to store and process data in a
relational database.

Relational database:
Stores data in tables with rows and columns that represent data attributes and relationships.

SQL processing:
The process of parsing, optimizing, generating row sources, and executing a SQL statement.

Data Manipulation Language (DML):


A subset of operations used to insert, delete, and update data in a database. DML is often a
sublanguage of a more extensive language like SQL.

Data Definition Language (DDL):


Used to define the structure of new database objects.

Data Control Language (DCL):


One of the main components of SQL, used to control access rights and permissions on data, typically through commands such as GRANT and REVOKE.

Transaction Log:
Files in the database that store a history of the changes made in the database.

TempDB:
Used to store temporary objects required for database operations that won't fit in memory.

Backup:
A point-in-time copy of the data in your database.

Keywords:
Words that have significance in SQL, such as SELECT, DELETE, or BIGINT. Certain keywords
are reserved and require special treatment for use as identifiers such as table and column names.

SQL Commands:

These SQL commands are mainly categorized into five categories:

DDL – Data Definition Language.


DQL – Data Query Language.
DML – Data Manipulation Language.
DCL – Data Control Language.
TCL – Transaction Control Language.

SQL Joins:

The SQL JOIN statement is used to combine data or rows from two or more tables based on a
common field between them. The different types of joins are as follows:

• INNER JOIN
• LEFT JOIN
• RIGHT JOIN
• FULL JOIN
• NATURAL JOIN
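The sketch below demonstrates a join from Python using the built-in sqlite3 module; the accidents and vehicles tables are hypothetical examples created only to show the syntax.

# Sketch: DDL, DML and an INNER JOIN with Python's built-in sqlite3 module.
# The two tables are hypothetical and exist only to demonstrate the join.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the tables
cur.execute("CREATE TABLE accidents (id INTEGER PRIMARY KEY, severity TEXT)")
cur.execute("CREATE TABLE vehicles (id INTEGER PRIMARY KEY, accident_id INTEGER, type TEXT)")

# DML: insert some rows
cur.executemany("INSERT INTO accidents VALUES (?, ?)", [(1, "Slight"), (2, "Serious")])
cur.executemany("INSERT INTO vehicles VALUES (?, ?, ?)",
                [(1, 1, "Car"), (2, 1, "Bicycle"), (3, 2, "Lorry")])

# DQL: INNER JOIN on the common accident_id field
cur.execute("""
    SELECT a.id, a.severity, v.type
    FROM accidents AS a
    INNER JOIN vehicles AS v ON v.accident_id = a.id
""")
for row in cur.fetchall():
    print(row)

conn.close()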

SRF and MRF:

SRF and MRF stand for single-row functions and multiple-row functions, which are described
below. A related distinction in SQL is between correlated and non-correlated subqueries: a
correlated subquery is not an independent query, but it can refer to columns of a table listed in
the FROM list of the main query, whereas a non-correlated subquery is an independent query whose
output is substituted into the main query.

SQL, or Structured Query Language, is a standard database language used to access and
manipulate data in databases. SQL can create, update, delete, and retrieve data in databases like
MySQL and Oracle by executing queries.

Single-row functions (SRF) return a single result row for every row of a queried table or view.
These functions can appear in select lists, WHERE clauses, START WITH and CONNECT BY
clauses, and HAVING clauses. For example, length and case conversion functions are SRF.

Multiple-row functions (MRF) work upon a group of rows and return one result for the complete
set of rows. They are also known as Group Functions.

Normalization:

Normalization is a multi-step process in SQL that improves data integrity and removes data
redundancy. It also helps organize data in a database. The four types of normalization are:

First Normal Form (1NF):

A table is in 1NF if the atomicity of the table is 1, which means that a single cell can only hold a
single-valued attribute. 1NF disallows multi-valued attributes, composite attributes, and their
combinations.

Second Normal Form (2NF):

A table is in 2NF if it is in 1NF and every column that is not part of the primary key depends on
the whole primary key.

Third Normal Form (3NF):

A table is in 3NF if it is in 2NF and every non-key column depends only on the primary key (and
on no other columns), i.e., there are no transitive dependencies.

Boyce-Codd Normal Form (BCNF):

A relation is in BCNF if it is in 3NF and, for every functional dependency X → Y, X is a superkey.

Advantages of Normalization:
• Normalization helps to minimize data redundancy.
• Greater overall database organization.
• Data consistency within the database.
• Much more flexible database design.

Disadvantages of Normalization:
• You cannot start building the database before knowing what the user needs.
• It is very time-consuming and difficult to normalize relations of a higher degree.

CHAPTER 6
TESTING
6.1 Testing Objectives:

Testing meets 3 objectives:


1. Identification of Errors:
These are obvious anomalies that show up in the behavior of a program, a unit or a component.
Behavior such as the following is considered an error:

• Wrong totals
• Misalignments
• Messages that say the wrong thing
• Actions that do not execute as promised: the Delete button does not delete, the update
menu does not update properly, etc.

2. Conformance to Requirements:
These errors are the result of testing the functions in the software against the Requirement
Definition Document to ensure that every requirement, functional or non-functional, is in the
system and that it operates properly. This is often called an Operational Qualification.
However, note that even if some of the requirements do not seem to be “operational”, this
is an operational test. For example, the software may be required to display a
copyright message on all acquired components.

3. Performance Qualification:
These are not “errors” as such but failures to conform to performance requirements. Technically,
they can be part of the second type. However, Performance Qualification (PQ) became a standard
way of testing for historical reasons. Some systems will perform differently under different
loads and conditions. For example, a citizen lookup application may need to operate within a
specific response time for a load of up to 300 queries an hour. The software application may be
designed properly, i.e., it may pass the Operational Qualification, but may fail to meet the
required loads because of poor programming or too many database calls.

6.2 Types of Testing:

Software testing is the process of running software with the intent of finding errors.
Software testing assures the quality of the software and represents the final review of the other
phases of software development, such as specification, design and code generation.

Unit Testing:
Unit testing emphasizes the verification effort on the smallest unit of software
design, i.e., a software component or module. Unit testing is a dynamic method of
verification, where the program is actually compiled and executed. Unit testing is performed
in parallel with the coding phase. Unit testing tests units or modules, not the whole software.
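As a hypothetical sketch of what a unit test for this project's preprocessing step could look like with Python's unittest framework (the scale_features helper shown here is illustrative, modeled loosely on the Chapter 7 code rather than taken from it):

# Sketch: unit-testing a small preprocessing helper with Python's unittest module.
# scale_features is a hypothetical helper, similar in spirit to the Chapter 7 code.
import unittest
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_features(df, columns):
    """Return a copy of df with the given numeric columns standardized."""
    out = df.copy()
    out[columns] = StandardScaler().fit_transform(out[columns])
    return out


class TestScaleFeatures(unittest.TestCase):
    def test_scaled_columns_have_zero_mean(self):
        df = pd.DataFrame({"Speed_limit": [30.0, 40.0, 60.0], "hour": [8.0, 12.0, 18.0]})
        result = scale_features(df, ["Speed_limit", "hour"])
        for col in ["Speed_limit", "hour"]:
            self.assertAlmostEqual(result[col].mean(), 0.0, places=7)

    def test_original_dataframe_is_not_modified(self):
        df = pd.DataFrame({"Speed_limit": [30.0, 40.0, 60.0], "hour": [8.0, 12.0, 18.0]})
        scale_features(df, ["Speed_limit"])
        self.assertEqual(df["Speed_limit"].tolist(), [30.0, 40.0, 60.0])


if __name__ == "__main__":
    unittest.main()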

I have tested each view/module of the application individually. As the modules


were built up testing was carried out simultaneously, tracking out each and every kind of
input and checking the corresponding output until module is working correctly.
The functionality of the modules was also tested as separate units. Each of the
three modules was tested as separate units. In each module all the functionalities were
tested in isolation.

In the Shop Products Module when a product has been added to cart it has been
made sure that if the item already exists in the shopping cart then the quantity is increased
by one else a new item is created in the shopping cart. Also the state of the system after a
product has been dragged in to the shopping cart is same as the state of the system if it
was added by clicking the add to cart button. Also it has been ensured that all the images
of the products displayed in the shop products page are draggable and have the product
property so that they can be dropped in the cart area.
In the Product Description Module it has been tested that all the images are
displayed properly. Users can add reviews and, as soon as a user adds a review, it is
updated in the view customer review tab. It has been checked to see if the whole page
refreshes or a partial page update happens when a user writes a review.
In the Cart Details it has been tested that when a user edits a quantity or removes
a product from the cart, the total price is updated accordingly. It has been checked to see
if the whole page refreshes or a partial page update happens when a user edits the cart.
Visual Studio 2008 has in built support for testing the application. The unit testing
can be done using visual studio 2008 without the need of any external application.
Various methods have been created for the purpose of unit testing. Test cases are
automatically generated for these methods. The tests run under the ASP.NET context
which means settings from Web.config file are automatically picked up once the test case
starts running.
Methods were written to retrieve all the manufacturers from the database,
strings that match a certain search term, products that match certain filter criteria, all
images that belong to a particular product etc.

Integration Testing:
In integration testing a system consisting of different modules is tested for
problems arising from component interaction. Integration testing should be developed
from the system specification. Firstly, a minimum configuration must be integrated and
tested.
In my project I have done integration testing in a bottom up fashion i.e. in this
project I have started construction and testing with atomic modules. After unit testing the
modules are integrated one by one and then tested the system for problems arising from
component interaction.

Validation Testing:
It provides final assurances that software meets all functional, behavioral & performance
requirement. Black box testing techniques are used.
There are three main components
- Validation test criteria (no. in place of no. & char in place of char)
- Configuration review (to ensure the completeness of s/w configuration.)
- Alpha & Beta testing-Alpha testing is done at developer’s site i.e. at home & Beta
testing once it is deployed. Since I have not deployed my application, I could not do the
Beta testing.

Test Cases- I have used a number of test cases for testing the product. There were
different cases for which different inputs were used to check whether desired output is
produced or not.
1. Addition of a new product to the cart should create a new row in the shopping cart.
2. Addition of an existing product to the cart has to update the quantity of the product.
3. Any changes to items in the cart have to update the summary correctly.
4. Because same page is inserting data into more than one table in the database atomicity of the
transaction is tested

6.3 Testing Methods Used:

BLACK BOX TESTING:

Also known as functional testing. Software testing technique whereby the internal workings of
the item being tested are not known by the tester. For example, in a black box test on software
design the tester only knows the inputs and what the expected outcomes should be and not how
the program arrives at those outputs. The tester does not ever examine the
programming code and does not need any further knowledge of the program other than its
specifications.
The advantages of this type of testing include:
The test is unbiased because the designer and the tester are independent of each other.
The test is done from the point of view of the user, not the designer.
Test cases can be designed as soon as the specifications are complete.
The disadvantages of this type of testing include:
The test can be redundant if the software designer has already run a test case.
The test cases are difficult to design.
Testing every possible input stream is unrealistic because it would take an inordinate amount of
time; therefore, many program paths will go untested.

WHITE BOX TESTING:

The purpose of any security testing method is to ensure the robustness of a system in the face of
malicious attacks or regular software failures. White box testing is performed based on the
knowledge of how the system is implemented. White box testing includes analyzing data flow,
control flow, information flow, coding practices, and exception and error handling within the
system, to test the intended and unintended software behavior. White box testing can be
performed to validate whether code implementation follows intended design, to validate
implemented security functionality, and to uncover exploitable vulnerabilities.
White box testing requires access to the source code. Though white box testing can be performed
any time in the life cycle after the code is developed, it is a good practice to perform white box
testing during the unit testing phase.
White box testing requires knowing what makes software secure or insecure, how to think like an
attacker, and how to use different testing tools and techniques. The first step in white box testing
is to comprehend and analyze source code, so knowing what makes software secure is a
fundamental requirement. Second, to create tests that exploit software, a tester must think like an
attacker. Third, to perform testing effectively, testers need to know the different tools and
techniques available for white box testing.

GREY BOX TESTING:

Gray box testing is a software testing technique that uses a combination of black box testing and
white box testing. Gray box testing is not black box testing, because the tester does know some
of the internal workings of the software under test. In gray box testing, the tester applies a
limited number of test cases to the internal workings of the software under test. In the remaining
part of the gray box testing, one takes a black box approach in applying inputs to the software
under test and observing the outputs.
Gray box testing is a powerful idea. The concept is simple; if one knows something about how
the product works on the inside, one can test it better, even from the outside. Gray box testing is
not to be confused with white box testing; i.e. a testing approach that attempts to cover the
internals of the product in detail. Gray box testing is a test strategy based partly on internals. The
testing approach is known as gray box testing, when one does have some knowledge, but not the
full knowledge of the internals of the product one is testing.
In gray box testing, just as in black box testing, you test from the outside of a product, but you
make better-informed testing choices because you know how the underlying software components
operate and interact.

6.4 Testing Results and analysis:

The application can be used for any Ecommerce application. It is easy to use, since it
uses the GUI provided in the user dialog. User friendly screens are provided. The
application is easy to use and interactive making online shopping a recreational
activity for users. It has been thoroughly tested and implemented.

Fig. 1 Block diagram of the proposed model: Training Dataset → Pre-processing → Feature
Engineering → Training Features and Desired Output → Random Forest Training → Trained Model.

Machine learning data preprocessing is a crucial step in the model development pipeline. It in-
volves transforming raw data into a format that is suitable for training machine learning models.
Here's an overview of key techniques and concepts involved in data preprocessing:
Handling Missing Data: Identify and handle missing data: Determine if there are missing
values in the dataset and decide on appropriate strategies for handling them, such as imputation
or deletion.

Data Cleaning: Remove irrelevant or redundant features: Identify features that do not contribute
useful information to the model and remove them from the dataset.

Data Transformation: Scaling: Scale numerical features to a similar range to prevent features
with larger magnitudes from dominating the learning process.

Reduce the number of features:


Use techniques such as Principal Component Analysis (PCA) or feature selection to reduce the
dimensionality of the dataset while preserving as much information as possible.
Train Learning Model “Random Forest”
Random Forest is an ensemble learning technique that builds a collection of decision trees and
aggregates their predictions to make more accurate and robust predictions.

Random Sampling of Data:


Random Forest starts by randomly selecting subsets of the training data through a process
called bootstrapping. This means that for each tree in the forest, a new sample is drawn from the
original dataset with replacement. This creates multiple diverse subsets of the data, ensuring that
each tree sees a different portion of the training data.

Building Decision Trees:


For each subset of the data, a decision tree is constructed. Each tree is trained independently of
the others.
At each node of the decision tree, a random subset of features is considered for splitting. This
randomness ensures that each tree in the forest learns different aspects of the data and reduces
correlation between trees.

Growing Full Trees:


Random Forest typically grows full trees without pruning. This means that each tree is grown
until every leaf node is pure (i.e., contains instances of only one class in the case of
classification, or has a small number of samples in the case of regression).

Voting or Averaging:
Once all trees are grown, predictions are made by aggregating the predictions of individual
trees. For classification tasks, each tree "votes" for a class, and the class with the most votes is
chosen as the final prediction (majority voting). For regression tasks, the predictions of
individual trees are averaged to obtain the final prediction.

Reducing Overfitting:

Random Forest mitigates overfitting by averaging the predictions of multiple trees. Since each
tree is trained on a different subset of the data and features, they capture different aspects of the
underlying patterns in the data.

Additionally, the randomness introduced during tree construction helps decorrelate the trees and
reduces the likelihood of overfitting to noise in the data.
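The sketch below mimics this idea by hand on synthetic data: it draws bootstrap samples, trains one decision tree per sample with a random feature subset considered at each split, and combines the predictions by majority voting. It is illustrative only; the project itself uses scikit-learn's RandomForestClassifier, as shown in Chapter 7.

# Sketch: bootstrap sampling + majority voting, the core idea behind Random Forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrapping: sample the training data with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" mimics the random feature subset considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Majority voting: the class predicted by most trees wins
all_preds = np.array([t.predict(X_test) for t in trees])  # shape (n_trees, n_samples)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("Ensemble accuracy:", round(accuracy_score(y_test, votes), 3))

RandomForestClassifier automates exactly these two sources of randomness (bootstrap samples and per-split feature subsets) and the final vote.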

Dataset
Unnamed: 0: This appears to be an index column, likely auto-generated by the system when the
dataset was created. It may not contain any meaningful information for analysis.

Accident_Index:

A unique identifier assigned to each accident in the dataset. It serves as a primary key to
uniquely identify each accident record.
Longitude: The geographic coordinate that specifies the east-west position of an accident
location on the Earth's surface, measured in degrees.
Latitude: The geographic coordinate that specifies the north-south position of an accident
location on the Earth's surface, measured in degrees.
Police_Force: The code or identifier for the police force that reported the accident.
Accident_Severity: Indicates the severity of the accident, usually categorized into different levels
such as slight, serious, or fatal.
Number_of_Vehicles: The total number of vehicles involved in the accident.
Number_of_Casualties: The total number of casualties (both injured and killed) in the accident.
Day_of_Week: Indicates the day of the week when the accident occurred, typically represented
as an integer (e.g., 1 for Sunday, 2 for Monday, etc.).
Local_Authority_(District): The local authority district where the accident occurred.
1st_Road_Class: The classification of the first road involved in the accident (motorway, A road,
B road).
1st_Road_Number: The number of the first road involved in the accident.
Road_Type: Describes the type of road where the accident occurred (e.g., single carriageway,
dual carriageway, roundabout).
Speed_limit: The speed limit for the road where the accident occurred, in miles per hour (mph).
Junction_Detail: Provides details about the junction where the accident occurred (e.g.,
roundabout, crossroads, T or staggered junction).

Junction_Control:
Specifies the type of junction control (e.g., traffic signals, give way or stop sign, not at
junction or within 20 meters).
2nd_Road_Class: The classification of the second road involved in the accident (if applicable).
2nd_Road_Number: The number of the second road involved in the accident (if applicable).

Pedestrian_Crossing-Human_Control:
Indicates if there was human control at any pedestrian crossing near the accident site.

Pedestrian_Crossing-Physical_Facilities:
Describes the physical facilities available at any pedestrian crossing near the accident site.

Light_Conditions:
Describes the lighting conditions at the time of the accident (e.g., daylight, darkness with
street lights, darkness without street lights).

Weather_Conditions:
Describes the weather conditions at the time of the accident (e.g., fine no high winds, raining
no high winds, snowing no high winds, fine high winds).

Road_Surface_Conditions:
Describes the road surface conditions at the time of the accident (e.g., dry, wet or damp, snow,
ice).

Special_Conditions_at_Site:
Indicates if there were any special conditions at the accident site (e.g., roadworks, oil or diesel,
mud).

Carriageway_Hazards:
Describes any hazards present on the carriageway at the time of the accident (e.g., none, other
object on road, pedestrian in carriageway).

Urban_or_Rural_Area:
Indicates whether the accident occurred in an urban or rural area.

Did_Police_Officer_Attend_Scene_of_Accident:
Indicates whether a police officer attended the scene of the accident.
hour: The hour of the day when the accident occurred.
minute: The minute of the hour when the accident occurred.
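A short sketch of how these columns can be loaded and inspected with pandas is shown below, assuming the cleaned CSV (clean_df.csv) that the Chapter 7 training script uses.

# Sketch: loading the accident dataset and inspecting the fields described above.
# Assumes the cleaned CSV ("clean_df.csv") used by the training script in Chapter 7.
import pandas as pd

df = pd.read_csv("clean_df.csv")

# Derive hour/minute from the HH:MM 'Time' column, as the training code does
df[["hour", "minute"]] = df["Time"].str.split(":", expand=True).astype("int32")

print(df.shape)                                   # number of rows and columns
print(df["Accident_Severity"].value_counts())     # class distribution of the target
print(df[["Speed_limit", "Number_of_Vehicles", "Number_of_Casualties"]].describe())
print(df.isna().sum().sort_values(ascending=False).head())   # missing values per column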

Use Case Diagram:

Use case diagram consists of use cases and actors and shows the interaction between them.
The key points are:
The main purpose is to show the interaction between the use cases and the actor.
To represent the system requirement from user’s perspective.
The use cases are the functions that are to be performed in the module.

Fig. 5.1 Use case diagram for model training (actor: Admin; use cases: input dataset, proposed
model, training).

Fig. 5.2 Use case diagram for model testing (actor: User; use cases: testing dataset, proposed
model, prediction).

CHAPTER 7
CODING
Python Source File:
import numpy as np
import pandas as pd
import time
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline


def evaluate_classification(model, name, X_train, X_test, y_train, y_test):
    # Evaluate the fitted model on both the training and the test split
    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)

    train_accuracy = accuracy_score(y_train, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)

    train_precision = precision_score(y_train, train_predictions, average='weighted')
    test_precision = precision_score(y_test, test_predictions, average='weighted')

    train_recall = recall_score(y_train, train_predictions, average='weighted')
    test_recall = recall_score(y_test, test_predictions, average='weighted')

    print("Training Set Metrics:")
    print("Training Accuracy {}: {:.2f}%".format(name, train_accuracy * 100))
    print("Training Precision {}: {:.2f}%".format(name, train_precision * 100))
    print("Training Recall {}: {:.2f}%".format(name, train_recall * 100))

    print("\nTest Set Metrics:")
    print("Test Accuracy {}: {:.2f}%".format(name, test_accuracy * 100))
    print("Test Precision {}: {:.2f}%".format(name, test_precision * 100))
    print("Test Recall {}: {:.2f}%".format(name, test_recall * 100))


def preprocess_data(df):
    # Standardize the numerical columns (zero mean, unit variance)
    scaler = StandardScaler()
    numerical_features = ['longitude', 'latitude', 'Speed_limit', 'hour', 'minute']
    df[numerical_features] = scaler.fit_transform(df[numerical_features])
    return df


def train_and_save_model(data_file, model_file, num_rows=None):
    start_time = time.time()
    print("Loading the dataset...")
    df = pd.read_csv(data_file)
    if num_rows is not None:
        # Optionally limit the number of rows used for training
        df = df.head(num_rows)

    # Split the 'Time' column (HH:MM) into separate hour and minute features
    df[['hour', 'minute']] = df['Time'].str.split(':', expand=True).astype('int32')

    print(f"Value counts for Accident_Severity:\n{df['Accident_Severity'].value_counts()}")
    # print(f"Columns: {df.columns}")

    features = ['longitude', 'latitude', 'Speed_limit', 'hour', 'minute',
                'Number_of_Vehicles', 'Number_of_Casualties', 'Day_of_Week',
                'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions',
                '2nd_Road_Class', '1st_Road_Class', 'Carriageway_Hazards']
    X = df[features]
    y = df['Accident_Severity']

    # Pipeline: standardize the features, oversample minority classes with SMOTE,
    # then train a Random Forest classifier
    pipeline = ImbPipeline([
        ('preprocess', StandardScaler()),
        ('sampling', SMOTE(random_state=20)),
        ('classifier', RandomForestClassifier(n_estimators=100))
    ])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

    print("Model Training...")
    # Training the model on the training dataset
    pipeline.fit(X_train, y_train)

    print("Saving the model...")
    # Save the trained pipeline (preprocessing + SMOTE + classifier)
    with open(model_file, "wb") as f:
        pickle.dump(pipeline, f)

    end_time = time.time()
    print(f"Model training and saving took {end_time - start_time:.2f} seconds")
    evaluate_classification(pipeline, "Random Forest", X_train, X_test, y_train, y_test)


if __name__ == "__main__":
    data_file = "clean_df.csv"
    model_file = "random_forest_model_smote_train1.pkl"
    num_rows = None  # Set the number of rows for training (e.g., num_rows = 1000000)
    train_and_save_model(data_file, model_file, num_rows)

Jupyter Source File:

import pandas as pd
import numpy as np
from sklearn import metrics
from joblib import load
from sklearn.metrics import accuracy_score
import time
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline as ImbPipeline


# Function to preprocess and scale numerical columns
def preprocess_and_scale(df, numerical_features):
    scaler = StandardScaler()
    df[numerical_features] = scaler.fit_transform(df[numerical_features])
    return df


def print_classification_report(y_true, y_pred):
    print("\n\n------------------result----------------\n\n")
    accuracy = accuracy_score(y_true, y_pred)
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(metrics.classification_report(y_true, y_pred))


def test_model(num_records=50):  # intended number of records for testing (not currently applied)
    # Load the trained model
    model = load("random_forest_model_smote_train1.pkl")

    # Load the test data
    print("Loading dataset...")
    df = pd.read_csv("clean_df.csv")
    # Split the 'Time' column (HH:MM) into separate hour and minute features
    df[['hour', 'minute']] = df['Time'].str.split(':', expand=True).astype('int32')

    features = ['longitude', 'latitude', 'Speed_limit', 'hour', 'minute',
                'Number_of_Vehicles', 'Number_of_Casualties', 'Day_of_Week',
                'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions',
                'Carriageway_Hazards']
    X = df[features]
    y = df['Accident_Severity']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

    # Define the pipeline: oversample with SMOTE, then apply the loaded model
    pipeline = ImbPipeline([
        ('sampling', SMOTE(random_state=12)),
        ('classifier', model)
    ])

    # Fit the pipeline and make predictions on the test data
    start_time = time.time()
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    end_time = time.time()
    print(f"Prediction took {end_time - start_time:.2f} seconds")

    print_classification_report(y_test, y_pred)


if __name__ == "__main__":
    test_model()

CHAPTER 8
FUTURE SCOPE
FUTURE ENHANCEMENT

o This project can operate on any browser/OS.
o Higher security features can be included in this software.
o Program scheduling can also be included in this project.
o Automation through other models can be implemented in this project for a few features.

CONCLUSION
In conclusion, prediction models play a crucial role in various domains by enabling forecasts of
future outcomes based on historical data and relevant features. Throughout this exploration,
we've delved into the fundamentals and complexities of prediction modeling, covering a
spectrum of techniques and considerations. In this model effective data preprocessing, feature
engineering, model selection, and evaluation are integral components of the prediction of Road
Accident Prediction. These steps are critical for preparing the data, extracting meaningful
information, optimizing model performance, and ensuring reliable predictions. This work
improves the working efficiency to above 90%. This is achieved through the feature engineering
methods used and the selection of the learning model.

Advantages of “Road Accident Prediction”:

• Predicts the data as per the input features.
• Takes less time for learning.
• Takes constant time for prediction.
• Trained model transfer is easy and secure.

Limitations of “Road Accident Prediction”:

Besides the above achievements and the successful completion of the project, we still feel
the project has some limitations, listed below:

• It is not a large-scale system.
• Only limited dataset formats are allowed for training the system.
• It does not work on multiple files of data.

