PREFACE
BACKGROUND KNOWLEDGE
1.4.2. Datasets .......................................................... 15
CONCLUSION
REFERENCES
PREFACE
Reasons for choosing the topic
Nowadays, security cameras are widely used in military agencies and units
to ensure safety in high-traffic areas and critical locations, such as warehouses.
These locations must be protected against potential threats to human safety,
assets, and security, such as explosions, unauthorized individuals, and
unfamiliar vehicles. Guards are employed to keep watch and patrol as needed to
maintain safety.
Observation in practice shows that units are adopting digital solutions and
new technologies to enhance operations. Some areas have installed basic camera
systems, but these lack advanced image analysis and intelligent monitoring. As
a result, monitoring relies on humans and lacks timely automated alerts for
safety issues. Commercial intelligent features are expensive and not easily
customizable, and data security and software integration remain concerns.
Stemming from the above reasons, I chose the topic: Researching and
building an intelligent monitoring system utilizing artificial intelligence.
Objects of the study: in this work, the objects of study are license plate
recognition technologies, facial recognition technologies, fire alarms, intrusion
detection, and large-scale monitoring and surveillance systems.
Scope of the study: in this work, our system takes an image or video as
input, applies image analysis, and exports images or a video stream.
Research methods
Studying the basic image processing, machine learning, and deep learning
algorithms used for license plate recognition, face recognition, fire alarm,
intrusion detection, and monitoring and surveillance systems.
To complete this project, I would like to thank the teachers at the Institute of
Information Communication Technology – Military Technical Academy for
teaching me both general and specialized subjects, giving me a solid theoretical
foundation and supporting me throughout my studies. In particular, I would like
to express my deep gratitude to Ph.D. Nguyen Quoc Khanh, who directly guided
me, enthusiastically instructed me, and created favorable conditions during the
time I completed this project. Finally, I would like to thank my family and
friends for always supporting, helping, and encouraging me during my studies
and the completion of my graduation thesis.
Despite my efforts, mistakes are inevitable due to limited time and
knowledge. I therefore look forward to receiving suggestions and guidance from
my teachers so that the project can be made more complete.
I sincerely thank you!
Chapter 1
1. BACKGROUND KNOWLEDGE
1.1. Convolutional neural networks (CNN)
A convolutional neural network (CNN) is a neural network that has one or
more convolutional layers and is used mainly for image processing,
classification, and segmentation, as well as for other autocorrelated data.
More and more diverse and interesting uses are being found for CNN
architectures. An example of a non-image application is "The Unreasonable
Effectiveness of Convolutional Neural Networks in Population Genetic
Inference" by Lex Flagel et al., where CNNs are used to detect selective sweeps,
find gene flow, infer population size changes, and infer recombination rates.
Other researchers use CNNs as generative models in single-cell genomics for
disease identification, and CNNs are also used in astrophysics to interpret radio
telescope data and predict the likely visual image.
CNNs can detect complex patterns such as handwritten digits because the
architecture performs detections on top of previous detections (the input of an
inner layer is the output of the previous one). How do we decide which
convolutions to apply for a specific problem? The neural network learns them
during training: given a set of training images, each with an associated
tag/category, the network finds the combinations of convolutions that best
classify the training dataset.
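The core operation of a convolutional layer can be sketched in a few lines. The following is a minimal pure-Python illustration (real frameworks vectorize and batch this); `conv2d`, the toy image, and the Sobel kernel are illustrative names, not part of the thesis code.

```python
# A single "valid" 2D convolution, the core operation a CNN layer applies.
# Pure-Python sketch for illustration; real frameworks vectorize this.

def conv2d(image, kernel):
    """Slide `kernel` over `image` (both lists of lists) and return
    the valid cross-correlation map, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0.0
            for u in range(kh):
                for v in range(kw):
                    s += image[i + u][j + v] * kernel[u][v]
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel responds where pixel values change horizontally.
img = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
print(conv2d(img, sobel_x))
```

During training, the network learns the kernel values themselves; the fixed Sobel kernel here only shows what a single learned filter computes.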
Techniques like Canny edge detection and contour finding can be used to
identify boundaries of objects. However, these methods often require additional
processing to distinguish between objects and handle occlusions.
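As a toy illustration of the boundary idea (not the actual Canny/contour pipeline, which in practice would use OpenCV's cv2.Canny and cv2.findContours), the sketch below marks foreground pixels of a binary mask that touch the background:

```python
def object_boundaries(mask):
    """Return the set of (row, col) foreground pixels that touch the
    background: a crude stand-in for contour finding on a binary mask."""
    h, w = len(mask), len(mask[0])
    boundary = set()
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if ni < 0 or ni >= h or nj < 0 or nj >= w or not mask[ni][nj]:
                    boundary.add((i, j))
                    break
    return boundary

mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(sorted(object_boundaries(mask)))  # the 3x3 block's border, not its centre
```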
* Blob Detection:
* Color-Based Methods:
* Adaptive Thresholding:
Two-stage detectors first propose potential object regions and then classify
and refine those regions. They typically offer higher accuracy and are effective
for complex scenes.
Export mode is used for exporting a YOLOv8 model to a format that can be
used for deployment. In this mode, the model is converted to a format that can
be used by other software applications or hardware devices. In this study, I
utilized YOLOv8s for training and inference, then converted the trained model
to the .onnx format. This format is useful when deploying the
model to production environments. This approach ensures that the model can
seamlessly interface with various software applications or hardware devices,
making it well-suited for real-world deployment scenarios.
This technique will not be suitable in cases such as: cameras mounted on
moving vehicles, cameras with continuous panning, observing underwater
environments, and more.
Each operation receives the original image and a structuring element (kernel)
that determines the nature of the operation.
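For example, binary erosion with a rectangular structuring element can be sketched as follows; this is an illustrative pure-Python version of what cv2.erode performs efficiently:

```python
def erode(mask, kernel):
    """Binary erosion: a pixel survives only if the structuring element,
    centred on it, fits entirely inside the foreground."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            fits = True
            for u in range(kh):
                for v in range(kw):
                    if not kernel[u][v]:
                        continue
                    ni, nj = i + u - ph, j + v - pw
                    if ni < 0 or ni >= h or nj < 0 or nj >= w or not mask[ni][nj]:
                        fits = False
                        break
                if not fits:
                    break
            out[i][j] = 1 if fits else 0
    return out

mask = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(erode(mask, kernel))  # only the centre pixel survives a 3x3 kernel
```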
* Contour detection
DeepFace (by Facebook): One of the earliest deep learning models for face
recognition, DeepFace employs a deep neural network with more than 120
million parameters. It uses a 3D alignment approach for a face and then feeds
the aligned face through the network. On the LFW dataset, DeepFace achieved
an accuracy that was almost human-level.
FaceNet (by Google): FaceNet introduced the concept of using a triplet loss
function, which takes in three inputs – an anchor, a positive of the same class as
the anchor, and a negative of a different class. The model aims to minimize the
distance between the anchor and the positive and maximize the distance between
the anchor and the negative. FaceNet set new records in terms of accuracy on
standard benchmarks.
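The triplet objective can be sketched directly from this description. The snippet below is an illustrative pure-Python version using squared Euclidean distances, as in the FaceNet paper; the function and variable names are mine, not from the thesis:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss for one (anchor, positive, negative)
    triple: the positive must sit closer to the anchor than the
    negative does, by at least `margin`."""
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)

anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # same identity: close to the anchor
negative = [1.0, 0.0]   # different identity: far from the anchor
print(triplet_loss(anchor, positive, negative))  # 0.0, constraint satisfied
print(round(triplet_loss(anchor, negative, positive), 2))  # positive loss when violated
```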
ResNet: While not exclusively designed for face recognition, the deep
residual networks (ResNet) architecture has been adapted for the task. By
introducing skip connections (or shortcuts) to jump over some layers, ResNet
can train very deep networks without the problem of vanishing gradients. This
depth allows for the learning of intricate facial features.
1.4.2. Datasets
Datasets play a crucial role in training and evaluating face recognition
models. Here's an overview of some of the most prominent datasets used in the
field of face recognition.
Labeled Faces in the Wild (LFW): One of the most widely used datasets,
LFW contains more than 13,000 images of faces collected from the web. Each
face is labeled with the name of the person pictured, and 1,680 of the people
pictured have two or more distinct photos in the dataset. LFW has been a
benchmark for evaluating face verification and recognition algorithms.
CelebA: With over 200,000 celebrity images, CelebA has large diversities,
large quantities, and rich annotations, including 40 attribute labels per image. It's
suitable for attribute recognition tasks in addition to face recognition.
The final stage is the recognition of the characters using pattern matching
techniques or classifiers such as neural networks and fuzzy classifiers. However,
the main downside of separating detection from recognition is its impact on the
accuracy and efficiency of the overall recognition process, mainly due to
imperfections in detection, such as flaws in the bounding box prediction. For
instance, if the detection process misses part of a license plate, the overall
recognition accuracy is reduced.
Thus, in a multi-stage approach, it is important to achieve satisfying results in
each stage. Figure 2.1 shows the main processing stages in a multi-stage plate
recognition system.
In these systems, two separate sub-networks were used for license plate
detection and license plate OCR. However, using only one deep neural network
will make it more difficult to intervene in every step of the processing.
Therefore, it is difficult to apply this approach to Vietnamese LPs that include
both 1-line and 2-line plate styles.
Chapter 2
[Table: face recognition benchmark results for a model trained on WebFace600K
(Arxiv20); column headers lost in extraction: 71.865, 69.449, 80.454, 73.394,
51.026, 99.70, 98.00, 96.58, 95.02]
At its core, Redis is an in-memory key-value store. This means that all its
data resides in memory, leading to exceptionally fast read and write operations.
However, Redis isn't just limited to simple key-value pairs. It supports a rich set
of data types, including strings, hashes, lists, sets, sorted sets, bitmaps, and
hyperloglogs. This versatility allows developers to model their data in various
ways, depending on the application's requirements.
covering the license plate area. Vehicle speed, lighting conditions, and weather
also greatly affect recognition accuracy. Therefore, within the scope of this
study, data was collected and labeled, and the recognition model was retrained
to achieve more accurate results.
Their proposed license plate recognition model consists of three main parts:
car detection, license plate detection and license plate OCR, as illustrated in
Figure 2.13.
The license plate detection module detects the license plate. The OCR
module performs optical character recognition and runs an OCR check on the
obtained results to confirm whether the prediction is valid. If valid, the object
completes the identification process and the results are stored in the database.
Otherwise, the object is considered not yet recognized and continues to be
processed in the following frames.
For each detected region, the objects update the tracking algorithm we
propose. The tracking algorithm is based on the Euclidean distance between the
centers of detected objects (i.e., between newly discovered regions and
previously detected regions), and discards stored objects after they have stopped
appearing in the frame. Objects that have not been successfully identified
continue to be passed through the license plate detection step; the rest are
ignored. Objects that have not been successfully identified include: newly
appearing objects; objects whose license plate has not yet been captured
successfully; and objects whose OCR results do not conform to the regulations
for Vietnamese license plates issued by the government.
The tracking algorithm we propose serves three purposes. First, car objects
captured by traffic cameras in Vietnam move in only one direction, so a simple
tracking algorithm is sufficient to solve the problem while conserving
computational resources. Second, each object is identified successfully exactly
once, increasing overall performance. Third, objects that have not been
successfully recognized get the opportunity to be re-recognized in later frames,
which increases the accuracy and robustness of the system.
Figure 2.14. Our LP detection is inspired by the top-down HPE approach.
For the license plate detection part, we approach the problem as keypoint
detection: each corner of the license plate is considered a keypoint. Based on the
keypoints, we can align license plates that are deformed or tilted at different
angles, something methods that use plain rectangular bounding boxes for object
detection cannot resolve.
The problem of detecting license plate keypoints is very close to the human
pose estimation problem. There are two traditional ways to solve this problem:
2.3.2.1. Preprocessing
All vehicle images were normalized to the standard normal distribution, and
then resized to fixed sizes matching the input shapes of the respective model
architectures. The purpose of this preprocessing is to bring every input image
into the same range of values before feeding it to the CNN model; it also speeds
up the convergence of the model.
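The per-image standardization described above amounts to subtracting the mean and dividing by the standard deviation. A minimal sketch on a flat list of pixel values (illustrative only; real pipelines do this per channel with library routines):

```python
import math

def standardize(pixels):
    """Normalize a flat list of pixel values to zero mean and unit
    variance: the per-image standardization described above."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(var) or 1.0   # guard against constant images
    return [(p - mean) / std for p in pixels]

out = standardize([0, 50, 100, 150, 200, 250])
print(round(sum(out), 6))                            # ~0.0: zero mean
print(round(sum(v * v for v in out) / len(out), 6))  # ~1.0: unit variance
```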
Y_{xyc} = exp( -((x - k'_x)^2 + (y - k'_y)^2) / (2 r_k^2) )    (2.1)
In which r_k is the radius of the keypoint k in Y. To diminish the quantization
error δ = k/R - k', all C classes share two offset regression heads
O ∈ R^{(W/R) × (H/R) × 2}, in which (O_{xy1}, O_{xy2}) = (δ_x, δ_y) if
Y_{xyc} > 0, 1 ≤ c ≤ C. We choose R = 4, r_k = 2.
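Equation (2.1) can be rendered directly. The sketch below generates the Gaussian heatmap for one keypoint on a downsampled grid; the function name and grid sizes are illustrative:

```python
import math

def keypoint_heatmap(width, height, kx, ky, r_k=2.0):
    """Render equation (2.1): an unnormalized Gaussian centred on the
    (downsampled) keypoint (kx, ky) with radius r_k."""
    return [
        [math.exp(-((x - kx) ** 2 + (y - ky) ** 2) / (2 * r_k ** 2))
         for x in range(width)]
        for y in range(height)
    ]

Y = keypoint_heatmap(8, 8, kx=3, ky=4, r_k=2.0)
print(Y[4][3])            # 1.0 at the keypoint itself
print(round(Y[4][5], 4))  # falls off with distance from the keypoint
```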
We found that, in addition to detecting the object's keypoints, Zhou [18] also
detects the center of the object and regresses offsets to the object's keypoints as
a way to associate keypoints with objects; this is not necessary in our problem,
because each detected car has only one license plate, and we can rely on the
shape of the plate to filter the keypoints if more than 4 are detected. Therefore,
we detect only the 4 keypoints needed to solve the problem, limiting the error
rate incurred when the model has to learn more information. Furthermore,
previous studies consider keypoints as different classes; however, our
experiments show that the keypoints of license plates can be treated as a single
class.
Figure 2.16. Keypoint designs for a license plate (colors correspond to classes).
From left to right: designs 1c, 2c, 4c, and 1c3.
Experiments show that design 1c gives the best result. This contradicts human
pose estimation, where all keypoints belong to different classes [7].
The first type of architecture gives good accuracy on this task; HRNet [7]
achieves the state of the art on the COCO test-dev human pose estimation
dataset with AP = 74.4. We find that these architectures are closely related to
segmentation network architectures, so we adapt segmentation networks to suit
the problem.
The automatic license plate recognition system for traffic cameras requires
processing to be as fast as possible, because vehicles are constantly moving at
high speed. Therefore, we prefer testing lightweight architectures including
DLA-34 [8], HRNet-18 [7], and ResNet18 [2]. For the semantic segmentation
architecture, we choose DDRNet [10] and its variants (Figure 2.17).
2.3.3.2. Preprocessing
All LP images are normalized to the standard normal distribution and then
resized to fixed sizes matching the input shapes of the respective model
architectures. For 2-line LPs, we use the X-Y Cut algorithm [6] to find where
to split a 2-line LP into two 1-line LPs.
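The idea of the X-Y Cut split can be sketched with a horizontal projection profile: cut the plate at the row with the least foreground between the two text lines. This is a simplified illustration, not the exact algorithm of [6]:

```python
def split_two_line_plate(mask):
    """Split a binary character mask of a 2-line plate into two 1-line
    masks, X-Y-Cut style: cut at the row with minimal foreground
    (ideally empty) between the two text lines."""
    h = len(mask)
    profile = [sum(row) for row in mask]  # horizontal projection profile
    # search the middle band only, where the inter-line gap must lie
    band = range(h // 4, 3 * h // 4)
    cut = min(band, key=lambda r: profile[r])
    return mask[:cut], mask[cut:]

mask = [
    [1, 1, 0, 1],   # line 1
    [1, 0, 1, 1],
    [0, 0, 0, 0],   # empty gap between the two lines
    [1, 1, 1, 0],   # line 2
    [0, 1, 1, 1],
]
top, bottom = split_two_line_plate(mask)
print(len(top), len(bottom))  # 2 3
```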
2.3.3.3. Architecture
The input of the VOCR is an image of a 1-line LP with size 32 × 140. The
first part of VOCR is VOCR-CNN, which extracts local features based on
VGG19 [19] (Figure 2.21). Due to the small input size of 32 × 140, we choose
average pooling instead of max pooling to avoid loss of information.
Additionally, the kernel size and stride were chosen small to compress the
information appropriately. The output of VOCR-CNN has 256 feature maps;
each feature map has size 1 × 70. We consider the number of dimensions of
each feature map to be the number of time steps, thus T = 70. At each time step
t, a feature vector of size 1 × 256 contains the features of all feature maps at
that time step.
34
The second part of the VOCR is the VOCR-RNN (Figure 2.22), an
encoder-decoder using gated recurrent units (GRU) combined with the attention
mechanism [19].
In which h_e, h'_e, h_d are the hidden outputs at each time step of the
encoder and decoder. In the encoder, we choose a bidirectional GRU to extract
interdependencies in the input sequence. The output of the encoder at each time
step is the concatenation [h_e, h'_e] with dimension 512; it is passed through a
fully connected layer to reduce its dimensionality to that of h_d. The attention
mechanism is inspired by [12]: each h_d at each decoder time step acts as the
query, and all [h_e, h'_e] act as keys and values. Since the dimensions of the
keys and the query differ, we use additive attention [4] as the scoring function:
a(Q, K) = w_v^T tanh(W_q Q + W_k K) ∈ R^{1×T}    (2.2)
In which Q ∈ R^{q×1}, W_q ∈ R^{h×q}, K ∈ R^{k×T}, W_k ∈ R^{h×k},
w_v ∈ R^{h×1}. In this work, Q = h_d, K = [h_e, h'_e], h = 256, q = 256,
k = 512, T = 70. Therefore, the context vector C is
w = softmax(a(Q, K)) = [w_1, w_2, …, w_T] ∈ R^{1×T}    (2.3)
C = V w^T ∈ R^{v×1}
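Equations (2.2) and (2.3) can be checked with a small numeric sketch. The implementation below is a pure-Python illustration of additive attention with toy dimensions (the thesis uses T = 70, q = 256, k = 512, h = 256); all names are mine:

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_attention(q, K, V, Wq, Wk, wv):
    """Additive (Bahdanau) attention as in (2.2)/(2.3):
    score_t = wv . tanh(Wq q + Wk k_t); a softmax over the T scores
    then weights the value vectors into one context vector."""
    Wq_q = matvec(Wq, q)
    scores = []
    for k_t in K:                       # one score per time step
        hidden = [math.tanh(a + b) for a, b in zip(Wq_q, matvec(Wk, k_t))]
        scores.append(sum(w * h for w, h in zip(wv, hidden)))
    m = max(scores)                     # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    dim = len(V[0])
    context = [sum(w[t] * V[t][i] for t in range(len(V))) for i in range(dim)]
    return w, context

random.seed(0)
T, q_dim, k_dim, h_dim = 4, 3, 5, 6    # toy sizes only
rnd = lambda r, c: [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
q = [random.uniform(-1, 1) for _ in range(q_dim)]
K = rnd(T, k_dim)
V = K                                   # keys double as values, as in the text
Wq, Wk = rnd(h_dim, q_dim), rnd(h_dim, k_dim)
wv = [random.uniform(-1, 1) for _ in range(h_dim)]
weights, context = additive_attention(q, K, V, Wq, Wk, wv)
print(round(sum(weights), 6))  # softmax weights sum to 1
```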
We train the decoder using the teacher forcing strategy (Figure 2.23). The
decoder network takes as input h_d at the previous time step and the
concatenation [context vector c, embedding] of the label at the current time
step. The output of the decoder is the concatenation [h_d, c, embedding], which
goes through a fully connected layer with output dimension N = 36
(corresponding to the number of characters appearing in Vietnamese LPs plus 3
special characters <sos>, <eos>, <pad>). The character with the highest score is
taken as the decoded output.
To optimize the loss function L, we update the parameters by gradient
descent with the Adam algorithm, which automatically adjusts the learning rate
and helps avoid getting stuck in local minima.
Table 2-4. Results of license plate recognition with Vietnamese license plates
LPRnet
92.3% 96.9% 94.1% 96.4% 94.8% 96.8%
STN (VGG19)
Our proposed ALPR system was tested on a single NVIDIA GeForce RTX
2080 GPU without any inference optimization techniques such as TensorRT.
We recorded the following stage processing speeds: 38.28 FPS for car detection
using YOLOv2; 103.5 FPS for LP detection with DDRNet23sh; and 36 FPS for
LP recognition with VOCR. Consequently, our ALPR system runs at 15.73 FPS
on a single NVIDIA GeForce RTX 2080 without any inference optimization
techniques. Based on these results, we believe our ALPR system can run in real
time when deployed with additional inference optimization techniques.
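The 15.73 FPS figure follows from the stage speeds: per-frame latencies of sequential stages add, so the end-to-end rate is the reciprocal of the summed reciprocals.

```python
# End-to-end throughput of a sequential pipeline: per-frame latencies add,
# so the combined FPS is the reciprocal of the summed reciprocals.
stage_fps = {
    "car detection (YOLOv2)": 38.28,
    "LP detection (DDRNet23sh)": 103.5,
    "LP recognition (VOCR)": 36.0,
}
total_latency = sum(1.0 / fps for fps in stage_fps.values())
pipeline_fps = 1.0 / total_latency
print(round(pipeline_fps, 2))  # 15.73, matching the measured system rate
```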
For example, a self-checkout system with a large number of users can lead
to long lines and wait times. This can be a frustrating experience for customers,
and it can also reduce the efficiency of the system.
Real-world conditions can change rapidly and unpredictably. This can make
it difficult for computer vision systems to maintain accuracy and performance.
For example, a license plate recognition system must be able to handle changes in
lighting, weather, and traffic conditions.
In the next section, I will propose the Restreaming technique to address the
above problems.
- Create a buffer for each camera, storing the frames read from that camera
- There is also a cache that stores images after they have been analyzed by the
modules (including bounding boxes and annotations).
In this case, the processing server reads frames directly from local memory at
very high speed, similar to playing back a pre-existing video file on our computer.
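The per-camera buffer can be sketched with a bounded deque per camera; this is an illustrative design (the class name and buffer size are assumptions, not the system's actual code):

```python
from collections import deque

class CameraBuffers:
    """Bounded per-camera frame buffers for the restreaming scheme above:
    capture threads append frames, and the processing server always reads
    the freshest frame at local-memory speed."""

    def __init__(self, maxlen=30):      # keep roughly 1 second at 30 FPS
        self.maxlen = maxlen
        self.buffers = {}

    def push(self, camera_id, frame):
        self.buffers.setdefault(camera_id, deque(maxlen=self.maxlen)).append(frame)

    def latest(self, camera_id):
        buf = self.buffers.get(camera_id)
        return buf[-1] if buf else None

bufs = CameraBuffers(maxlen=3)
for i in range(5):                      # old frames are evicted automatically
    bufs.push("cam-1", f"frame-{i}")
print(bufs.latest("cam-1"))             # frame-4
print(list(bufs.buffers["cam-1"]))      # ['frame-2', 'frame-3', 'frame-4']
```

The bounded deque keeps the reader from ever falling behind a live camera: stale frames are silently dropped instead of accumulating.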
Firstly, let the image sequentially pass through each independent model.
This method has the advantage of keeping the models independent of each other.
However, in terms of speed, it will decrease significantly with an increasing
number of models.
Secondly, merging the source code of the models together. This method is
approximately 30% faster than the previous one. However, it is not always
feasible to implement, due to library conflicts between models, which reduces
the flexibility of combining models.
APIs serve not just as a bridge between the computer vision model and other
applications, but also facilitate more flexible and efficient deployment,
management, and scaling of the model. They offer a standardized interface
through which other developers can interact with the model without needing to
understand its underlying technical details. This helps to mitigate risks, enhance
security, and simultaneously increase the availability and readiness of the
service.
REST APIs are stateless: the server does not retain any information about
the client's state between requests, enhancing scalability
and reliability. RESTful communication also allows for multiple representations
of resources, including JSON, XML, or HTML, catering to the diverse needs of
clients. Clients can specify their preferred representation using HTTP headers,
and servers respond accordingly. Moreover, REST APIs use self-descriptive
messages, where each response includes metadata and headers that describe the
content and how to interpret it. This self-description simplifies client
understanding and interaction with the API. In summary, REST APIs provide a
structured, stateless, and uniform way for software systems to communicate and
exchange data over the web. They have become a fundamental component of
modern web and mobile application development, enabling seamless integration
and data exchange between various services and platforms.
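A minimal stateless JSON endpoint in the REST style can be sketched with Python's standard library alone; the route and payload here are hypothetical, and a real deployment would use a framework such as FastAPI or Flask:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class InferenceAPI(BaseHTTPRequestHandler):
    """Minimal stateless JSON endpoint sketching the REST style described
    above; the /v1/health route and payload fields are illustrative."""

    def do_GET(self):
        if self.path != "/v1/health":
            self.send_error(404)
            return
        body = json.dumps({"status": "ok", "model": "hypothetical-detector"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep the example quiet
        pass

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), InferenceAPI)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
```

Every response is self-describing (Content-Type, Content-Length) and the handler keeps no per-client state, which is exactly the property that makes REST services easy to scale horizontally.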
Although REST APIs are widely used and have many advantages such as
flexibility and ease of integration, there are some limitations when applied in a
microservices environment. First, the JSON data format commonly used in
REST can increase latency and reduce performance due to the need for
serialization and deserialization. REST also has further limitations in this
setting.
MQTT client libraries are available in multiple languages such as C, C++, Go,
Java, C#, PHP, Python, Node.js, and Arduino.
Broker: The broker is a software component that receives all messages from
publishing clients and forwards them to subscribing clients. It maintains
persistent connections with clients and can be scaled depending on the
implementation.
Message: The message is the data that a device receives when it subscribes
to a topic or sends when publishing to a topic.
Publish: is the process by which a device sends its message to the broker.
Subscribe: is the action a device takes to retrieve messages from the broker.
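The publish/subscribe decoupling can be illustrated with a toy in-memory broker; this sketch only mimics MQTT's pattern (a real system would use an MQTT broker such as Mosquitto with a client library like paho-mqtt), and all names are illustrative:

```python
from collections import defaultdict

class ToyBroker:
    """In-memory sketch of MQTT's publish/subscribe pattern: clients
    subscribe callbacks to topics, and publishers never see subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

broker = ToyBroker()
received = []
broker.subscribe("cameras/cam1/alerts", received.append)
broker.publish("cameras/cam1/alerts", "intrusion detected")
print(received)  # ['intrusion detected']
```

Because the publisher only names a topic, new consumers (a logger, a dashboard, an alert service) can be attached without touching the publishing code.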
Comparison with HTTP
Slower: HTTP uses larger data packets to communicate with the server.
Cost: HTTP requires opening and closing the connection for each request,
while MQTT stays online, keeping the channel between the broker's server and
the client always open.
Energy consumption: because HTTP takes more time and more data packets,
it uses more energy.
Chapter 3
monitoring and recording all activities around the clock, ensuring that no
incident goes unnoticed or unmonitored.
These systems go beyond simply providing users with images and videos.
They have the capability to directly analyze data from these images and videos
using complex deep learning models. They can recognize objects, classify them,
and even predict behavior in real-time.
* Hardware limitations
* Real-time problem
User management
Receive results from microservices, draw bounding boxes on the image, and
send it back to the client-side software.
Add users
Update user information
Remove user from database
Add role
Update role information
Remove role
Figure 3.33. Use case model for the system manager (admin).
- Receive and process data sent back from camera management servers;
handle data storage and database management.
- Process restreaming
CONCLUSION
Some limitations
- There has not been sufficient time or conditions to conduct tests on the
military-scale QS-Net network, or with an extremely high volume of
requests, to assess the system's load capacity.
Future work
Maintain the system and fix errors that arise during long-term deployment.
REFERENCES
[1] Alejandro Newell, Kaiyu Yang, and Jia Deng, "Stacked Hourglass
Networks for Human Pose Estimation," 2016.
[2] Bin Xiao, Haiping Wu, and Yichen Wei, "Simple Baselines for Human
Pose Estimation and Tracking," 2018.
[3] T. D. Duan, T. L. H. Du, and T. V. Phuoc, "Building an Automatic Vehicle
License-Plate Recognition System," International Conference in Computer
Science, 2020, pp. 323-329.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Y. Bengio, "Neural Machine
Translation by Jointly Learning to Align and Translate," 2014.
[5] A. Einstein, B. Podolsky, and N. Rosen, "Can Quantum-Mechanical
Description of Physical Reality Be Considered Complete?," Phys. Rev.,
1935, pp. 777-780.
[6] Faisal Shafait and Thomas Breuel, "The Effect of Border Noise on the
Performance of Projection-Based Page Segmentation Methods."
[7] Feng Zhang et al., "Distribution-Aware Coordinate Representation for
Human Pose Estimation," 2019.
[8] Fisher Yu et al., "Deep Layer Aggregation," 2019.
[9] Jia Guo et al., "Sample and Computation Redistribution for Efficient Face
Detection," 2021.
[10] Khanh Nguyen, Dan Pham Van, and Van Pham Thi Bich, "An Efficient
Method to Improve the Accuracy of Vietnamese Vehicle License Plate
Recognition in Unconstrained Environment," MAPR, 2021.
[11] Mark Everingham et al., "The Pascal Visual Object Classes (VOC)
Challenge," 2010.
[12] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective
Approaches to Attention-based Neural Machine Translation," 2015.
[13] Olga Russakovsky et al., "ImageNet Large Scale Visual Recognition
Challenge," 2014.
[14] R. Laroca et al., "A Robust Real-Time Automatic License Plate
Recognition Based on the YOLO Detector," pp. 58-66, 2018.
[15] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton, "When Does
Label Smoothing Help?," 2019.
[16] Tsung-Yi Lin et al., "Microsoft COCO: Common Objects in Context,"
2015.
[17] Jianyu Xiao, Guoli Jiang, and Huanhua Liu, "A Lightweight Face
Recognition Model Based on MobileFaceNet for Limited Computation
Environment," 2021.
[18] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl, "Objects as Points,"
2019.
[19] Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional
Networks for Large-Scale Image Recognition," 2014.