
LIST OF ACRONYMS

No.  Acronym  Full form

1    AI       Artificial Intelligence
2    ALPR     Automatic License Plate Recognition
3    API      Application Programming Interface
4    CNN      Convolutional Neural Network
5    FPS      Frames Per Second
6    HPE      Human Pose Estimation
7    LP       License Plate
8    LSTM     Long Short-Term Memory
9    OCR      Optical Character Recognition
10   RNN      Recurrent Neural Network
11   YOLO     You Only Look Once
TABLE OF FIGURES

Figure 1.1. One and two stage detectors.........................................................8


Figure 1.2. The difference between one-stage and two-stage detectors.............9
Figure 1.3. The development of YOLO........................................................10
Figure 1.4. Background subtraction..............................................................11
Figure 1.5. Background subtraction when background is not static.............11
Figure 1.6. Result of CNT method................................................................12
Figure 1.7. Morphological transformations..................................................13
Figure 1.8. Contour detection........................................................................13
Figure 1.9. Main stages in a multi-stage license plate recognition system...17
Figure 1.10. The overall structure of single-stage model for ALPR.............18
Figure 2.1. New database method used in face recognition...........................23
Figure 2.2. Some images in fire dataset........................................................24
Figure 2.3. The flow of ALPR core model...................................................25
Figure 2.4. Our LP detection was inspired by the top-down HPE approach.. 27
Figure 2.5. Keypoints encoding and local offset illustration........................29
Figure 2.6. Keypoints designs for a license plate (colours correspond to
classes). From left to right: designs 1c, 2c, 4c, 1c3..........................................29
Figure 2.7. Original DDRNet architecture....................................................30
Figure 2.8. DDRNetsup Network, “RB” denotes sequential residual basic
blocks. “RBB” denotes single residual bottleneck block. “DAPPM” denotes the
Deep Aggregation Pyramid Pooling Module......................................................31
Figure 2.9. Size of old and new Vietnamese LP (unit mm)..........................32
Figure 2.10. Color and text formats of Vietnamese license plate.................33
Figure 2.11. VOCR-CNN architecture..........................................................34
Figure 2.12. VOCR-RNN architecture..........................................................34
Figure 2.13. Teacher forcing training strategy in Machine Translation.......36
Figure 2.14. License plate images in MTAVLP dataset...............................37
Figure 2.15. Results of using DDRNet23sh and VOCR...............................40
Figure 2.16. Single analytic..........................................................................46
Figure 2.17. Multiple analytics.....................................................................47
Figure 2.18. Sequentially process through each single analytic...................48
Figure 2.19. An example of protobuf script..................................................53
Figure 2.20. MQTT components...................................................................55
Figure 2.21. An example for MQTT usage...................................................55
Figure 2.22. Compare MQTT and HTTP......................................................56
Figure 3.1. Use case model for the system manager (admin).......................64
Figure 3.2. Login interface............................................................................64
Figure 3.3. Grid camera view interface.........................................................65
Figure 3.4. Alert interface.............................................................................65
Figure 3.5. Camera list interface...................................................................66
Figure 3.6. Interface for adding and updating camera information..............66
Figure 3.7. Video playback interface............................................................67
Figure 3.8. Single camera view interface.....................................................67
Figure 3.9. Vehicle tracking interface...........................................................67
Figure 3.10. Result interface of number plate recognition by camera..........68
Figure 3.11. System model............................................................................69
TABLE OF TABLES

Table 2-1. Comparison of face detection models............................................20


Table 2-2. Results in different datasets.........................................................22
Table 2-3. License plate data collected from traffic cameras.......................37
Table 2-4. Results of license plate recognition with Vietnamese license
plates....................................................................................................................38
Table 2-5. Compare gRPC API and REST API...........................................52
CONTENTS

PREFACE__________________________________________________________

BACKGROUND KNOWLEDGE_______________________________________

1.1. Convolutional neural networks (CNN)_________________________4

1.1.1. Common uses for CNNs________________________________4

1.1.2. The core of CNN: the convolution_________________________5

1.1.3. CNNs multilayered architecture___________________________5

1.1.4. The future of CNNs____________________________________6

1.2. Object detection problem___________________________________6

1.2.1. Traditional object detection method________________________6

1.2.2. Deep learning-based object detection_______________________8

1.2.3. You Only Look Once (YOLO) and its variants_______________9

1.3. Motion detection for realtime monitoring system________________11

1.3.1. Background subtraction________________________________11

1.3.2. Count (CNT) - fastest background subtraction______________12

1.3.3. Techniques for image processing_________________________12

1.4. Face recognition problem__________________________________14

1.4.1. A brief history of face recognition models__________________14

1.4.2. Datasets____________________________________________15

1.5. Car license plate recognition problem_________________________16

1.5.1. Multi-stage license plate recognition systems_______________17

1.5.2. Single-stage license plate recognition systems______________18

1.6. Fire alerts problem_______________________________________19

1.7. Conclusion for chapter 1___________________________________19


EXPERIMENTAL RESULTS AND ASSESSMENT______________________

2.1. Face recognition module___________________________________20

2.1.1. Detection model______________________________________20

2.1.2. Recognition model____________________________________21

2.1.3. Database for facial features_____________________________22

2.2. Fire alert module_________________________________________23

2.3. Car license plate recognition module_________________________24

2.3.1. Car detection________________________________________26

2.3.2. License plate detection_________________________________27

2.3.3. License plate optical character recognition_________________32

2.3.4. Overall performance___________________________________37

2.4. Deploy computer vision services to production_________________40

2.4.1. Deploying computer vision services______________________40

2.4.2. Flask framework______________________________________41

2.4.3. The challenges when deploying computer vision systems in real-


world settings________________________________________________42

2.4.4. Restreaming technique_________________________________43

2.5. Setting up environment for AI services and designing architecture for


multiple analytics on video streaming_______________________________44

2.5.1. Setting up driver for hardware and required libraries_________44

2.5.2. Multiple analytics and single analytic_____________________46

2.5.3. Design architecture for AI module APIs___________________48

2.5.4. MQTT protocol______________________________________53

2.6. Conclusion for chapter 2___________________________________56


LARGE-SCALE INTELLIGENT MONITORING SYSTEM________________

3.1. Introduction about intelligent monitoring system based on camera__57

3.2. A large-scale intelligent monitoring system____________________60

3.2.1. The main components, functions and actors of the system_____60

3.2.2. System structure._____________________________________62

3.2.3. Use case diagram_____________________________________63

3.2.4. Architectural design___________________________________64

3.2.5. User interface design__________________________________64

3.3. Model of deployment._____________________________________68

3.3.1. Database server______________________________________69

3.3.2. Routing server_______________________________________69

3.3.3. Analytics server______________________________________70

3.4. Conclusion for chapter 3___________________________________70

CONCLUSION____________________________________________________

REFERENCES_____________________________________________________

PREFACE
Reasons for choosing the topic

Nowadays, security cameras are widely used in military agencies and units
to ensure safety in high-traffic areas and critical locations, such as warehouses.
These locations are crucial to protect against potential threats to human safety,
assets, and security, such as explosions, unauthorized individuals, and
unfamiliar vehicles. Guards are employed to keep watch and walk around as
needed to maintain safety.

Guarding and patrolling present several challenges. The area to cover is


often wide and supervision can be difficult, particularly during adverse weather
conditions or at night. Guards may become distracted, leading to lapses in
vigilance and increased safety risks. Given the importance of these critical areas,
any safety issues could have serious consequences for people, property, and
even political stability. Therefore, modern technologies must be employed to
enhance monitoring capabilities and strengthen security measures, ensuring
effective surveillance and safety in vital areas.

Surveillance based on image processing technology is widely used in civilian and


military fields. It helps monitor security in public places such as airports,
buildings, offices, shops, and supermarkets using IP cameras. Surveillance
systems are also used at immigration checkpoints at airports and borders to
record images and retrieve them when needed. Such technology can provide
the functions of a VMS (Video Management System) or
intelligent image analysis and processing functions. Investing in advanced
features such as facial recognition or license plate detection will be expensive
and heavily dependent on the developer. Therefore, depending on the investment
scale and purpose of use, surveillance systems may only be equipped with a few
advanced features.

Through real observations, it's evident that units are adopting digital
solutions and new technologies to enhance operations. Some areas have installed
basic camera systems, but they lack advanced image analysis and intelligent
monitoring. As a result, monitoring relies on humans and doesn’t have timely
automated alerts for safety issues. These intelligent features are expensive and
not easily customizable. Also, data security and software integration remain
concerns.

Stemming from the above reasons, I chose the topic: Researching and
building an intelligent monitoring system utilizing artificial intelligence.

Purpose of the study

Research license plate recognition technologies, facial recognition


technologies, fire alarm and intrusion detection techniques, and apply them to build a large-
scale monitoring and surveillance system for critical areas.

Objects and scope of the study

Objects of the study: In this work, the objects of the study are license plate
recognition technologies, facial recognition technologies, fire alarms, intrusion
detection, and large-scale monitoring and surveillance systems.

Scope of the study: In this work, our system takes an image or video as
input, applies image analysis, and exports images or video streams.

Researching methods

Researching the basic image processing, machine learning and deep learning
algorithms used for license plate recognition, face recognition, fire alarms,
intrusion detection, and monitoring and surveillance systems.

Conducting experiments to evaluate the algorithms.

Setting up and deploying our monitoring and surveillance system in real-world


conditions.

To complete this project, I would like to thank the teachers at the Institute of
Information Communication Technology – Military Technical Academy for
teaching me general as well as specialized subjects, helping me build a solid
theoretical foundation, and supporting me throughout my studies. In particular,
I would like to express my deep gratitude to Ph.D. Nguyen Quoc Khanh, who
directly guided me, cared, enthusiastically instructed me and created all
favorable conditions during the time I completed this project. Finally, I would
like to thank my family and friends for always supporting, caring for, helping
and encouraging me during my studies and the completion of my graduation thesis.

Despite many efforts, due to limited time and limited knowledge, mistakes
are inevitable. Therefore, I look forward to receiving suggestions and guidance
from teachers to make my project more complete.

I sincerely thank you!

Chapter 1

1. BACKGROUND KNOWLEDGE
1.1. Convolutional neural networks (CNN)
A convolutional neural network (CNN) is a neural network that has one or
more convolutional layers and is used mainly for image processing,
classification, segmentation and also for other autocorrelated data.

A convolution is essentially sliding a filter over the input. Rather than


looking at an entire image at once to find certain features it can be more
effective to look at smaller portions of the image.

1.1.1. Common uses for CNNs


The most common use for CNNs is image classification, for example
identifying satellite images that contain roads or classifying handwritten letters
and digits. There are other quite mainstream tasks, such as image segmentation
and signal processing, at which CNNs perform well. CNNs have been used
for language understanding in Natural Language Processing (NLP) and for speech
recognition, although for NLP Recurrent Neural Networks (RNNs) are often used. A
CNN can also be implemented as a U-Net architecture, which is essentially
two almost mirrored CNNs resulting in a network whose architecture can be
drawn in a U shape. U-Nets are used where the output needs to be of a similar
size to the input, such as segmentation and image enhancement.

More and more diverse and interesting uses are being found for CNN
architectures. An example of a non-image based application is “The
Unreasonable Effectiveness of Convolutional Neural Networks in Population
Genetic Inference” by Lex Flagel et al., in which CNNs are used to detect selective
sweeps, find gene flow, infer population size changes, and infer the rate of
recombination. There are researchers using CNNs for generative models in
single cell genomics for disease identification. CNNs are also being used in
astrophysics to interpret radio telescope data to predict the likely visual image to

represent the data. Deepmind’s WaveNet is a CNN model for generating


synthesized voice, used as the basis for Google’s Assistant’s voice synthesizer.

1.1.2. The core of CNN: the convolution


The core of a CNN is the convolution, but what’s a convolution? In
mathematics a convolution is a process which combines two functions on a set
to produce another function on the set. We can think of images as two-
dimensional functions. Many important image transformations are convolutions
where you convolve the image function with a very small, local function called a
“kernel.” The kernel slides to every position of the image and computes a new
pixel as a weighted sum of the pixels it floats over, in order to process the spatial
structure of the image.

Convolutions are used to detect simple patterns such as edge detection in an


input image. For example, we can detect edges by taking the values −1 and 1 on
two adjacent pixels, and zero everywhere else. That is, we subtract two adjacent
pixels. When side by side pixels are similar, this gives us approximately zero.
On edges, however, adjacent pixels are very different in the direction
perpendicular to the edge.
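As a minimal illustration of this idea, the sketch below applies the [-1, 1] difference kernel described above to a toy image using OpenCV (which the rest of this work also relies on); the image values are purely illustrative.

```python
import cv2
import numpy as np

# Toy 6x6 grayscale image: dark left half, bright right half.
image = np.zeros((6, 6), dtype=np.float32)
image[:, 3:] = 9.0

# Kernel with the values -1 and 1 on two adjacent pixels, as described above.
kernel = np.array([[-1.0, 1.0]], dtype=np.float32)

# Slide the kernel over every pixel position (filter2D performs correlation):
# flat regions give roughly zero, the vertical edge gives a strong response.
edges = cv2.filter2D(image, -1, kernel)
print(edges)
```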

1.1.3. CNNs multilayered architecture


A CNN has a multilayered architecture where an image is used as input
and passed through multiple layers of convolutions which create feature maps. In inner
layers, a pooling process is performed in order to reduce the resolution of the
subsequent feature maps and increase their quantity. The final feature
maps are passed through a linear classifier, producing a one-dimensional array,
called the output layer, containing the tags the image could be associated with.
When an element of this one-dimensional array is activated, it means the image was
classified with its associated tag.

CNNs can detect complex patterns such as handwritten numbers because the
architecture can perform detections on top of previous detections (the input of an
inner layer is the output of the previous one). How do we define which
convolutions to apply to solve a specific problem? The neural network does it
during the training process, where we have a set of training images with a
tag/category associated with each of them. The neural network then decides
which combinations of convolutions are best to apply in order to classify the
training dataset successfully.

1.1.4. The future of CNNs


Ever since the breakthrough of CNNs in 2012, the evolution of Deep
Learning techniques has accelerated exponentially, ranging from image
classification models used by self-driving vehicles to text generation models like
ChatGPT (GPT-4) that produce human-like text. For that reason, we can
expect that Deep Learning tools will continue evolving over the next 5 to 10 years
to the point where this kind of technology is widely democratized and becomes a
standard tool for everyone, with a ubiquitous presence in our daily lives.

1.2. Object detection problem


Object detection is a fundamental task in computer vision that involves
identifying and localizing objects within images or videos. Unlike image
classification, where the goal is to classify an entire image into predefined
categories, object detection goes a step further by not only recognizing what
objects are present but also specifying where they are located in the image.

1.2.1. Traditional object detection method


Traditional object detection methods refer to approaches that were used
before the rise of deep learning and convolutional neural networks. These
methods often rely on handcrafted features, heuristics, and machine learning
algorithms to detect objects in images. Here's an overview of traditional object
detection techniques.

* Histogram of Oriented Gradients (HOG):



HOG is a feature descriptor that represents the distribution of local edge


directions and gradients in an image. It's commonly used in combination with a
support vector machine (SVM) classifier to detect objects like pedestrians and
vehicles.

* Edge Detection and Contour Analysis:

Techniques like Canny edge detection and contour finding can be used to
identify boundaries of objects. However, these methods often require additional
processing to distinguish between objects and handle occlusions.

* Blob Detection:

Blob detection algorithms identify regions in an image that have uniform


intensity levels. They can be used to detect objects with distinctive color or
intensity patterns.

* Color-Based Methods:

Some objects can be detected based on specific color properties. Color


thresholding and region growing are examples of color-based techniques.

* Adaptive Thresholding:

Adaptive thresholding techniques adjust threshold values based on local


characteristics of the image, improving object detection in varying lighting
conditions.

Traditional object detection methods, while effective in some scenarios,


have limitations in handling complex scenes, varying object appearances, and
scale changes. The advent of deep learning has significantly surpassed the
performance of traditional techniques, especially in cases where object detection
requires high accuracy and robustness. However, traditional methods still have
their place in situations where computational resources are limited or when
dealing with simpler tasks.

1.2.2. Deep learning-based object detection


Deep learning-based object detection has revolutionized the field of
computer vision by significantly improving the accuracy and robustness of
object detection tasks. It involves training neural networks, particularly
convolutional neural networks (CNNs), to learn representations of objects
directly from data.

Figure 1.1. One and two stage detectors

Deep learning-based object detection architectures can be categorized into


one-stage detectors and two-stage detectors:

* One-Stage Detectors (e.g., YOLO, SSD):

One-stage detectors directly predict object classes and bounding box


coordinates from a single pass through the network. They are faster but might
have slightly lower accuracy compared to two-stage detectors.

* Two-Stage Detectors (e.g., Faster R-CNN, Mask R-CNN):



Two-stage detectors first propose potential object regions and then classify
and refine those regions. They typically offer higher accuracy and are effective
for complex scenes.

Figure 1.2. The difference between one-stage and two-stage detectors

Deep learning-based object detection has achieved state-of-the-art


performance on a wide range of tasks, from common object detection to more
complex challenges like detecting objects in cluttered scenes, handling
occlusions, and even performing instance-level segmentation.

1.2.3. You Only Look Once (YOLO) and its variants


YOLO (You Only Look Once), a popular object detection and image
segmentation model, was developed by Joseph Redmon and Ali Farhadi at the
University of Washington. Launched in 2015, YOLO quickly gained popularity
for its high speed and accuracy.

Figure 1.3. The development of YOLO

YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge,


state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous
versions, introducing new features and improvements for enhanced
performance, flexibility, and efficiency. YOLOv8 supports a full range of vision
AI tasks, including detection, segmentation, pose estimation, tracking, and
classification. This versatility allows users to leverage YOLOv8's capabilities
across diverse applications and domains.

Export mode is used for exporting a YOLOv8 model to a format that can be
used for deployment. In this mode, the model is converted to a format that can
be used by other software applications or hardware devices. In this study, I
utilized YOLOv8s for training and inference. Subsequently, I converted the
trained model into the .onnx format. This format is useful when deploying the
model to production environments. This approach ensures that the model can
seamlessly interface with various software applications or hardware devices,
making it well-suited for real-world deployment scenarios.
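A minimal sketch of this workflow with the Ultralytics API is shown below; the dataset YAML name and the training hyperparameters are illustrative placeholders, not the exact settings used in this study.

```python
from ultralytics import YOLO

# Start from the pretrained YOLOv8s weights (downloaded automatically).
model = YOLO("yolov8s.pt")

# Fine-tune on a custom dataset; "dataset.yaml", epochs and image size are placeholders.
model.train(data="dataset.yaml", epochs=100, imgsz=640)

# Export the trained model to ONNX for deployment in production.
model.export(format="onnx")
```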

1.3. Motion detection for realtime monitoring system


1.3.1. Background subtraction
Background subtraction is any technique that allows the foreground of the
image to be extracted for further processing. This technique finds applications in
various domains, including: surveillance, motion detection, people counting,
traffic analysis, and more. It is also fast and easy to implement.

Figure 1.4. Background subtraction

This technique will not be suitable in cases such as: cameras mounted on
moving vehicles, cameras with continuous panning, observing underwater
environments, and more.

Figure 1.5. Background subtraction when background is not static

Therefore, for this technique to be effective, the camera must be mounted
stably so that the background portion remains static.

1.3.2. Count (CNT) - fastest background subtraction


CNT (Count) is a method that uses only the pixel values in the previous
nFrames and some additional information based on frame counting. The CNT
algorithm is much faster than any other background subtraction solution in OpenCV.
The basic reason is that it is very simple and thoroughly optimized (profiled with
Valgrind). It is simple because the algorithm was developed with simplicity in mind,
trying to capture the essence of the background subtraction process as performed in
human vision. The implementation of a simple algorithm is fast as is, but it was
further optimized with practical software development methods. To handle background
lighting changes, especially an outdoor partly clouded sky, the algorithm
includes some field-tested hard-coded thresholds to account for these situations.

Figure 1.6. Result of CNT method

After experimenting with various background subtraction algorithms, we


selected the CNT algorithm for our system.
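The sketch below shows how the CNT subtractor can be created and applied with OpenCV (it lives in the cv2.bgsegm module of opencv-contrib-python); the parameter values, video source and motion threshold are illustrative, not the exact settings used in our deployment.

```python
import cv2

# CNT background subtractor (requires opencv-contrib-python); parameters are illustrative.
subtractor = cv2.bgsegm.createBackgroundSubtractorCNT(
    minPixelStability=15,        # frames a pixel must stay stable to become background
    useHistory=True,
    maxPixelStability=15 * 60,   # roughly one minute of history at 15 FPS
    isParallel=True,
)

cap = cv2.VideoCapture("camera_stream.mp4")  # illustrative video source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)        # white pixels mark moving regions
    motion_ratio = cv2.countNonZero(fg_mask) / fg_mask.size
    if motion_ratio > 0.01:                  # illustrative motion trigger threshold
        pass                                 # hand the frame to the detection pipeline
cap.release()
```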

1.3.3. Techniques for image processing


* Pre-processing: Morphological transformations

Morphological transformations receive the original image and a structuring
element (kernel) that decides the nature of the operation.

Figure 1.7. Morphological transformations

* Contour detection

Contour detection is used to identify the boundaries of objects or distinct


regions within an image. It involves tracing the outline of objects by detecting
continuous curves that connect points of similar intensity or color. Contour
detection is particularly useful when you want to extract object shapes, identify
objects based on their boundaries, or perform tasks like object counting and
shape analysis.

Figure 1.8. Contour detection
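A minimal sketch combining both pre-processing steps is shown below: morphological opening/closing cleans a foreground mask, and contour detection then extracts the object boundaries. The input file name, kernel size and area threshold are illustrative.

```python
import cv2

mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)  # illustrative input mask

# Structuring element (kernel) that decides the nature of the operation.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# Opening removes small specks of noise; closing fills small holes inside blobs.
cleaned = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)

# Contour detection traces the boundary of each remaining region.
contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    if cv2.contourArea(contour) > 500:           # ignore tiny regions (illustrative)
        x, y, w, h = cv2.boundingRect(contour)   # bounding box of the detected shape
        print(x, y, w, h)
```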



1.4. Face recognition problem


Face recognition is the task of making a positive identification of a face in a
photo or video image against a pre-existing database of faces. It begins with
detection - distinguishing human faces from other objects in the image, then
works on identification of those detected faces. Face recognition technology has
witnessed significant advancements over the past few years, driven by deep
learning and the availability of vast amounts of data. This technology has
evolved from simple algorithms to sophisticated neural network architectures
that can recognize and differentiate between millions of faces with high
accuracy.

1.4.1. A brief history of face recognition models


Several models have been published in the realm of face recognition. Deep
learning has revolutionized the field of face recognition, leading to
unprecedented accuracy levels and robustness against various challenges.
Several deep learning architectures have emerged as frontrunners in this domain:

DeepFace (by Facebook): One of the earliest deep learning models for face
recognition, DeepFace employs a deep neural network with more than 120
million parameters. It uses a 3D alignment approach for a face and then feeds
the aligned face through the network. On the LFW dataset, DeepFace achieved
an accuracy that was almost human-level.

FaceNet (by Google): FaceNet introduced the concept of using a triplet loss
function, which takes in three inputs – an anchor, a positive of the same class as
the anchor, and a negative of a different class. The model aims to minimize the
distance between the anchor and the positive and maximize the distance between
the anchor and the negative. FaceNet set new records in terms of accuracy on
standard benchmarks.
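As an illustration of the idea only (not FaceNet's actual training code), a toy triplet loss on embedding vectors can be sketched as follows; the margin value and embedding size are illustrative.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Toy triplet loss on embedding vectors (margin value is illustrative)."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to the same identity
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to a different identity
    return max(d_pos - d_neg + margin, 0.0)    # push d_pos below d_neg by the margin

rng = np.random.default_rng(0)
print(triplet_loss(rng.normal(size=128), rng.normal(size=128), rng.normal(size=128)))
```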

VGGFace & VGGFace2: Developed by the Visual Geometry Group


(VGG) at the University of Oxford, these models are based on the famous VGG

architecture. While VGGFace was trained on 2.6 million images of 2,622


celebrities, VGGFace2, an improved version, was trained on a much larger and
more diverse dataset, leading to better performance.

ArcFace: ArcFace introduced an additive angular margin loss to improve


the discriminative power of the embeddings. This approach focuses on
enhancing the margin between classes in the feature space, leading to better
separability and, consequently, higher recognition accuracy.

MobileFaceNets: With the rise of mobile and edge computing, there's a


need for lightweight models that don't compromise much on accuracy.
MobileFaceNets is designed for such scenarios. It's efficient in terms of
computation and memory without a significant drop in performance.

ResNet: While not exclusively designed for face recognition, the deep
residual networks (ResNet) architecture has been adapted for the task. By
introducing skip connections (or shortcuts) to jump over some layers, ResNet
can train very deep networks without the problem of vanishing gradients. This
depth allows for the learning of intricate facial features.

These models, trained on vast datasets, have learned rich representations of


faces, capturing nuances and variations. They can handle challenges like pose
variations, lighting changes, occlusions, and age-related changes, making them
suitable for real-world applications.

1.4.2. Datasets
Datasets play a crucial role in training and evaluating face recognition
models. Here's an overview of some of the most prominent datasets used in
face recognition.

Labeled Faces in the Wild (LFW): One of the most widely-used datasets,
LFW contains more than 13,000 images of faces collected from the web. Each
face has been labeled with the name of the person pictured, and there are 1,680

of the collected people that have two or more distinct photos in the dataset. LFW
has been a benchmark for evaluating face verification and recognition
algorithms.

YouTube Faces (YTF): This dataset consists of 3,425 videos of 1,595


different people. The dataset was designed for studying the problem of
unconstrained face recognition in videos. The videos are collected from
YouTube, and all the videos contain only one of the 1,595 subjects.

VGGFace & VGGFace2: Developed by the Visual Geometry Group at the


University of Oxford, these datasets are extensive. While VGGFace contains 2.6
million images of 2,622 celebrities, VGGFace2 is larger and more diverse, with
over 3 million face images of 9,131 subjects.

CASIA-WebFace: This dataset was collected from IMDb and contains


about 500,000 images of 10,575 subjects, making it one of the largest publicly
available datasets for face recognition.

MS-Celeb-1M: A dataset released by Microsoft, MS-Celeb-1M contains


roughly 10 million images of 100,000 celebrities collected from the internet. It's
designed for the training of deep networks for face recognition tasks.

MegaFace: MegaFace contains over 1 million images, and its primary


challenge is to recognize one individual accurately from a million distractor
images. It's designed to evaluate the performance of face recognition algorithms
at the extreme scales.

CelebA: With over 200,000 celebrity images, CelebA has large diversities,
large quantities, and rich annotations, including 40 attribute labels per image. It's
suitable for attribute recognition tasks in addition to face recognition.

1.5. Car license plate recognition problem


There are two main approaches for ALPR: multi-stage and single-stage
license plate recognition.

1.5.1. Multi-stage license plate recognition systems


Most of the existing solutions for the ALPR task have considered the multi-
stage method, which consists of three main steps. The first stage is the license
plate detection or extraction. Existing algorithms use traditional computer vision
techniques and deep learning methods with object detection to locate the license
plate in an image. Traditional computer vision techniques are mainly based on
the features of the license plate such as shape, color, symmetry, texture, etc. In
the second stage, the license plate is segmented and the characters are extracted
using some common techniques such as mathematical morphology, connected
components, relaxation labelling, and vertical and horizontal projection.
However, the character segmentation stage is not necessarily performed in every
multi-stage ALPR system, because there are some segmentation-free algorithms
in which this stage is omitted.

Figure 1.9. Main stages in a multi-stage license plate recognition system

The final stage is the recognition of the characters using pattern matching
techniques or classifiers like neural networks and fuzzy classifiers. However, the
main downside of separating detection from recognition is its impact on the
accuracy and efficiency of the overall recognition process. This happens mainly
due to the imperfection of the detection process such as flaws in the bounding
box prediction. For instance, if the detection process misses a part of a license
plate, it will reduce the overall accuracy of the recognition process.
Thus, in a multi-stage approach, it is important to achieve satisfying results in

each stage. Figure 1.9 shows the main processing stages in a multi-stage license plate
recognition system.

1.5.2. Single-stage license plate recognition systems


While most research has focused on multi-stage processes in license plate
recognition, recently there have been some successful studies on single-stage
processes. This can be understood as using a single deep neural network trained for
end-to-end detection, localization and recognition of the license plate in a single
forward pass. This allows the model to share parameters and have fewer
parameters than multi-stage models, thus being faster and more efficient. One such
approach used VGG16 as a feature extractor, modifying VGG16 by removing
three layers, and the output was fed into a Region Proposal Network
(RPN). It used two separate sub-networks for license plate detection and
vehicle number recognition. The license plate detection subnet gives bounding
box coordinates and probabilities; to avoid character segmentation, recognition was
modeled as a sequence labeling problem using bidirectional RNNs (Figure 1.10).

Figure 1.10. The overall structure of single-stage model for ALPR.

In these systems, two separate sub-networks were used for license plate
detection and license plate OCR. However, using only one deep neural network
will make it more difficult to intervene in every step of the processing.
Therefore, it is difficult to apply this approach to Vietnamese LPs that include
both 1-line and 2-line plate styles.

1.6. Fire alerts problem


Fires in storage facilities and warehouses can lead to catastrophic losses,
both in terms of valuable assets and potential business disruptions. Such areas
often house a concentration of goods, equipment, and sometimes hazardous
materials, making the impact of a fire or explosion particularly severe. Beyond
the immediate destruction of inventory, fires can damage infrastructure, leading
to long-term operational challenges and significant financial burdens. The loss
of unique or hard-to-replace items can also disrupt supply chains, affecting both
the business and its clients. Furthermore, fires in these settings pose a risk to
personnel, potentially leading to tragic loss of life or injury. Given the severe
implications of fires, the importance of early fire and explosion warnings cannot
be overstated. Early detection systems, such as smoke detectors and fire
detectors, play a crucial role in mitigating the damage caused by fires. Early
warnings can help in containing fires before they spread uncontrollably,
preserving both life and property.

1.7. Conclusion for chapter 1


In this chapter, we have covered some background techniques that are used in
our intelligent monitoring system.

Chapter 2

2. EXPERIMENTAL RESULTS AND ASSESSMENT


2.1. Face recognition module
2.1.1. Detection model
In the detection phase, I used SCRFD, which was introduced in [9]. Although
tremendous strides have been made in uncontrolled face detection, efficient face
detection with a low computation cost as well as high precision remains an open
challenge. In that paper, they pointed out that training data sampling and
computation distribution strategies are the keys to efficient and accurate face
detection. Motivated by these observations, they introduced two simple but
effective methods: Sample Redistribution (SR), which augments training samples
for the most needed stages, based on the statistics of benchmark datasets; and
Computation Redistribution (CR), which reallocates the computation between
the backbone, neck and head of the model, based on a meticulously defined
search methodology. Extensive experiments conducted on WIDER FACE
demonstrate the state-of-the-art efficiency-accuracy trade-off for the proposed
SCRFD family across a wide range of compute regimes. In particular, SCRFD-
34GF outperforms the best competitor, TinaFace, by 3.86% (AP at hard set)
while being more than 3× faster on GPUs with VGA-resolution images.

Table 2-1. Comparison of face detection models.

Method                Backbone          Easy    Medium  Hard    Params(M)  Flops(G)  Infer(ms)
DSFD (CVPR19)         ResNet152         94.29   91.47   71.39   120.06     259.55    55.6
RetinaFace (CVPR20)   ResNet50          94.92   91.90   64.17   29.50      37.59     21.7
HAMBox (CVPR20)       ResNet50          95.27   93.76   76.75   30.24      43.28     25.9
TinaFace (Arxiv20)    ResNet50          95.61   94.25   81.43   37.98      172.95    38.9
ResNet-34GF           ResNet50          95.64   94.22   84.02   24.81      34.16     11.8
SCRFD-34GF            Bottleneck Res    96.06   94.92   85.29   9.80       34.13     11.7
ResNet-10GF           ResNet34x0.5      94.69   92.90   80.42   6.85       10.18     6.3
SCRFD-10GF            Basic Res         95.16   93.87   83.05   3.86       9.98      4.9
ResNet-2.5GF          ResNet34x0.25     93.21   91.11   74.47   1.62       2.57      5.4
SCRFD-2.5GF           Basic Res         93.78   92.16   77.87   0.67       2.53      4.2
MobileNet-0.5GF       MobileNetx0.25    90.38   87.05   66.68   0.37       0.507     3.7
SCRFD-0.5GF           Depth-wise Conv   90.57   88.12   68.51   0.57       0.508     3.6

Overall, SCRFD is a novel approach designed to improve the efficiency and


accuracy of face detection tasks. It offers a balanced solution that addresses both
the computational and accuracy challenges in face detection, making it a robust
and efficient choice for real-time requirements.

2.1.2. Recognition model


After researching various models from large to small and considering
practical requirements, I chose MobileFaceNet [17] for recognition.
MobileFaceNets use fewer than 1 million parameters and are specifically
tailored for high-accuracy, real-time face verification on mobile and embedded
devices. I used a model pretrained on the WebFace600K dataset. The accuracy
measured on the LFW (Labelled Faces in the Wild) dataset reached 99.7%.

Table 2-2. Results in different datasets

Dataset       MR-ALL  African  Caucasian  South Asian  East Asian  LFW    CFP-FP  AgeDB-30  IJB-C(E4)
WebFace600K   71.865  69.449   80.454     73.394       51.026      99.70  98.00   96.58     95.02
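As a sketch of how the detection and recognition stages can be combined in practice, the snippet below uses the insightface FaceAnalysis helper, which bundles a SCRFD detector with an embedding model; the model pack name, execution provider and input file are illustrative and depend on the installed insightface release.

```python
import cv2
from insightface.app import FaceAnalysis

# FaceAnalysis bundles a SCRFD detector with an embedding model.
app = FaceAnalysis(name="buffalo_s", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

frame = cv2.imread("entrance_frame.jpg")     # illustrative input frame
for face in app.get(frame):
    box = face.bbox.astype(int)              # detection box from the SCRFD stage
    embedding = face.normed_embedding        # 512-d feature vector for recognition
    print(box, embedding.shape)
```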

2.1.3. Database for facial features


A type of database commonly used to store feature vectors for AI models is
Milvus. Milvus also supports algorithms to optimize the search process.
However, a drawback of Milvus is that it only supports vector data. Therefore,
an additional database like SQLite is needed to store information such as names,
ages, genders, etc., of individuals. Thus, to draw a bounding box accompanied
by a person's name on the result image, two databases are required. To simplify
deployment and optimize the recognition process, I use Redis to store both the
names of individuals and their corresponding features.

Redis, an acronym for "Remote Dictionary Server," is an open-source, in-


memory data structure store that can function as a database, cache, and message
broker. Its blazing-fast performance, versatility, and ease of use have made it
one of the most popular NoSQL databases in the world.

At its core, Redis is an in-memory key-value store. This means that all its
data resides in memory, leading to exceptionally fast read and write operations.
However, Redis isn't just limited to simple key-value pairs. It supports a rich set
of data types, including strings, hashes, lists, sets, sorted sets, bitmaps, and
hyperloglogs. This versatility allows developers to model their data in various
ways, depending on the application's requirements.
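A minimal sketch of this design is shown below: each person is stored as one Redis hash holding both the name and the raw feature bytes, and identification is a linear scan with cosine similarity. The key prefix, vector size and similarity threshold are illustrative, not the exact values used in our system.

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)   # connection details are illustrative

def enroll(person_id: str, name: str, embedding: np.ndarray) -> None:
    """Store the person's name and feature vector together in one Redis hash."""
    r.hset(f"face:{person_id}", mapping={
        "name": name,
        "embedding": embedding.astype(np.float32).tobytes(),
    })

def identify(query: np.ndarray, threshold: float = 0.4):
    """Linear scan over enrolled vectors using cosine similarity."""
    query = query / np.linalg.norm(query)
    best_name, best_score = None, threshold
    for key in r.scan_iter("face:*"):
        entry = r.hgetall(key)
        emb = np.frombuffer(entry[b"embedding"], dtype=np.float32)
        score = float(np.dot(query, emb / np.linalg.norm(emb)))
        if score > best_score:
            best_name, best_score = entry[b"name"].decode(), score
    return best_name, best_score
```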

Figure 2.1. New database method used in face recognition

2.2. Fire alert module


To address the pressing concern of providing early warnings for potential
fires and explosions, I have chosen to utilize the YOLOv8 model. This particular
model stands out due to its simplicity, efficiency, and the ease with which it can
be deployed, especially in the realm of object detection. The internet proves to
be a valuable resource, as data pertaining to smoke and fire are readily available.
After a thorough search, I managed to gather a substantial amount of images that
depict fire. These images were then meticulously labeled according to the
YOLO format. To further enhance the efficiency of the system and reduce the
time it takes for the model to process data, I exported the model in the .onnx
format.
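A minimal sketch of running the exported .onnx detector with ONNX Runtime is shown below; the file names, input size and pre-processing are illustrative, and the raw output still needs the usual YOLO decoding and non-maximum suppression downstream.

```python
import cv2
import numpy as np
import onnxruntime as ort

# Load the exported fire/smoke detector (file name is illustrative).
session = ort.InferenceSession("fire_yolov8s.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

frame = cv2.imread("warehouse_frame.jpg")                      # illustrative input image
blob = cv2.resize(frame, (640, 640)).astype(np.float32) / 255.0
blob = blob.transpose(2, 0, 1)[np.newaxis, ...]                # HWC -> NCHW batch of one

outputs = session.run(None, {input_name: blob})
print(outputs[0].shape)                                        # raw predictions to decode
```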

Figure 2.2. Some images in the fire dataset

2.3. Car license plate recognition module


Detecting and recognizing vehicle license plates is a problem that has been
researched for quite some time and has achieved certain results. Notable studies
include paper [14] on the recognition of car license
plates worldwide and paper [3] by Vietnamese authors on the recognition of
Vietnamese license plates. While these studies have produced many feasible
results with high recognition accuracy, the models that work well with foreign
license plates often make many mistakes when applied to Vietnamese license
plates. This is due to the unique characteristics of the size of the license plate
area, the size and font of the license plate, and Vietnamese Unicode characters.
Research on Vietnamese license plates mainly focuses on plates when vehicles
enter parking lots, toll stations, apartment buildings, etc., where conditions such
as distance, camera angle, lighting, vehicle speed, and image quality are all
ensured, thus yielding good results. However, when it comes to recognizing
license plates from surveillance cameras in various areas, many cameras are
positioned at tilted angles, and many vehicles overlap, partially or completely

covering the license plate area. Vehicle speed, lighting conditions, and weather
also greatly affect recognition accuracy. Therefore, within the scope of this
study, we have collected data, labeled it, and retrained the recognition model to
achieve more accurate results.

Our proposed license plate recognition model consists of three main parts:
car detection, license plate detection and license plate OCR, as illustrated in
Figure 2.3.

Figure 2.3. The flow of the ALPR core model.

The input is a frame obtained from a surveillance camera on a national
highway, for which the region of interest (ROI) can be predetermined. Motion
detection is performed first, using a frame-difference algorithm. If there is
movement, the car detection module detects the bounding boxes of the cars. Each
detected bounding box is then updated by the tracking algorithm. The tracking
algorithm indicates whether the object has already been processed in a previous
frame; if not, the license plate detection module detects the license plate. The OCR
module then performs optical character recognition, and the result goes through an
OCR check to confirm whether the prediction is valid. If it is valid, the object
completes the identification process and the result is stored in the database.
Otherwise, the object is considered not yet recognized and continues to be
processed in the following frames.

2.3.1. Car detection


Cars are one of the basic object classes present in many large datasets, such as
PASCAL-VOC [11], ImageNet [13] and COCO [16], so we decided not to train
a detector from scratch, and instead chose a known model to perform car
detection. The main criterion is a high recall rate, since mistakenly rejecting a car
leads to an overall error; the precision rate also needs attention, as wrongly
detecting non-automobile objects consumes processing time because they have
to go through the following steps. Based on these criteria, we decided to use the
YOLOv2 network due to its fast execution speed of about 70 FPS and its
accuracy of 76.8% mAP on the PASCAL-VOC dataset. We have not
made any changes or improvements to YOLOv2.

Each detected region is used to update the tracking algorithm we propose.
The tracking algorithm is based on the Euclidean distance between the centers
of detected objects (i.e. between newly detected regions and previously tracked
regions), and discards a stored object after it has not appeared in the frame for a
period of time. Objects that have not been successfully identified continue to be
passed through the number plate detection step; the rest are ignored. Objects that
have not been successfully identified include: newly appearing objects; objects
whose license plate has not been captured successfully; and objects whose OCR
result does not conform to the regulations for Vietnamese number plates issued
by the government.
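A minimal sketch of this centroid-distance tracker is shown below; the distance threshold and the number of allowed missed frames are illustrative, not the values used in the deployed system.

```python
import numpy as np

class CentroidTracker:
    """Minimal sketch of the distance-based tracker described above."""

    def __init__(self, max_distance=80.0, max_missed=10):
        self.next_id = 0
        self.tracks = {}          # id -> {"centroid": (x, y), "missed": int, "recognized": bool}
        self.max_distance = max_distance
        self.max_missed = max_missed

    def update(self, detections):
        """detections: list of (x, y) box centres from the car detector."""
        assigned = set()
        for track_id, track in list(self.tracks.items()):
            best_j, best_d = None, self.max_distance
            for j, (cx, cy) in enumerate(detections):
                if j in assigned:
                    continue
                d = np.hypot(cx - track["centroid"][0], cy - track["centroid"][1])
                if d < best_d:
                    best_j, best_d = j, d
            if best_j is not None:
                track["centroid"] = detections[best_j]
                track["missed"] = 0
                assigned.add(best_j)
            else:
                track["missed"] += 1
                if track["missed"] > self.max_missed:
                    del self.tracks[track_id]   # drop objects that left the scene
        for j, centroid in enumerate(detections):
            if j not in assigned:               # a new object entering the scene
                self.tracks[self.next_id] = {"centroid": centroid, "missed": 0,
                                              "recognized": False}
                self.next_id += 1
        return self.tracks
```

Tracks whose "recognized" flag is still False are the ones passed on to the license plate detection step in later frames.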

The proposed tracking algorithm has three benefits. First, cars captured by
traffic cameras in Vietnam move in only one direction, so a simple tracking
algorithm is sufficient to solve the problem while keeping resource usage low.
Second, each object is identified successfully exactly once, which increases
overall performance. Third, objects that have not been successfully recognized
get the opportunity to be re-recognized in later frames, which increases the
accuracy and robustness of the system.

2.3.2. License plate detection

Figure 2.4. Our LP detection was inspired by the top-down HPE approach.

For the license plate detection part, we approach the problem as keypoint
detection: each corner of the license plate is considered a keypoint.
Based on the keypoints, we can align license plates that are deformed
or tilted at different angles. This cannot be handled by methods that use plain
rectangular bounding boxes for object detection.

The problem of detecting license plate keypoints is very close to the human
pose estimation problem. There are two traditional ways to solve this problem:

coordinate regression and heatmap regression. The heatmap regression approach
achieves state-of-the-art results on the benchmark datasets, whereas training a
regression model directly on the keypoint coordinates is more difficult. Therefore,
we follow the second way, heatmap regression, with steps similar to those of the
human pose estimation problem (Figure 2.4). The training and inference processes
are performed as in [18].

2.3.2.1. Preprocessing
All vehicle images were normalized to the standard normal distribution, and
then resized to fixed sizes according to input shapes of the model architectures
respectively. The purpose of this pre-processing is to obtain the same range of
values for each input image before feeding to the CNN model. It also helps to
speed up the convergence of the model.

2.3.2.2. Keypoints encoding


In our system, the heatmap regression is used for license plate keypoints
detection. Consider an input image $X \in \mathbb{R}^{W \times H \times 3}$ with width $W$ and height $H$.
Then, the output is a heatmap $Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, in which $R$ is the resolution
decrease and $C$ is the number of classes. Each ground truth keypoint $k \in \mathbb{R}^2$ of class
$c$ has an equivalent position $k' = \left\lfloor \frac{k}{R} \right\rfloor$ in $Y$. The output at $(x, y)$ of a class $c$ in heatmap $Y$ is

$$Y_{xyc} = \exp\left(-\frac{(x - k'_x)^2 + (y - k'_y)^2}{2 r_k^2}\right) \qquad (2.1)$$

in which $r_k$ is the radius of the keypoint $k$ in $Y$. To compensate for the discretization
error $\delta = \frac{k}{R} - k'$, all $C$ classes share a common offset regression head
$O \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$, in which $(O_{xy1}, O_{xy2}) = (\delta_x, \delta_y)$ if $Y_{xyc} > 0$, $1 \le c \le C$. We choose $R = 4$, $r_k = 2$.
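A small NumPy sketch of this encoding step is shown below; it writes the Gaussian of Eq. (2.1) and the local offsets for a single keypoint, with function and variable names chosen purely for illustration.

```python
import numpy as np

def encode_keypoint(heatmap, offsets, kp, R=4, r_k=2):
    """Write one ground-truth keypoint into the target heatmap and offset map.

    heatmap: (H/R, W/R) target for the single keypoint class,
    offsets: (H/R, W/R, 2) target for the local x/y offsets,
    kp:      (x, y) keypoint position in the input image.
    """
    kx, ky = kp[0] / R, kp[1] / R
    cx, cy = int(kx), int(ky)                     # low-resolution position k'
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * r_k ** 2))  # Eq. (2.1)
    np.maximum(heatmap, gaussian, out=heatmap)    # keep the strongest response per cell
    offsets[cy, cx, 0] = kx - cx                  # delta_x
    offsets[cy, cx, 1] = ky - cy                  # delta_y

# Example: a 512x512 input gives a 128x128 heatmap at R = 4 (sizes illustrative).
heatmap = np.zeros((128, 128), dtype=np.float32)
offsets = np.zeros((128, 128, 2), dtype=np.float32)
encode_keypoint(heatmap, offsets, kp=(202.5, 317.0))
```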

We found that, in addition to detecting the object's keypoints, Zhou [44] also
detects the center of the object and regresses offsets from the center to the object's
keypoints as a way to associate the keypoints with one another; this is not
necessary in our problem, because each car we detect has only one license plate,
and we can rely on the shape of the number plate to select the keypoints if more
than 4 keypoints are detected. Therefore, we only detect 4 keypoints, which is
enough to solve the problem and limits the error rate when the model has to learn
more information. At the same time, previous studies consider the keypoints as
different classes; however, our experiments show that the keypoints of license
plates can be considered as a single class.

Figure 2.5. Keypoints encoding and local offset illustration.

The design of the number of keypoints for an LP and the number of classes C is
illustrated in Figure 2.6.

Figure 2.6. Keypoints designs for a license plate (colours correspond to
classes). From left to right: designs 1c, 2c, 4c, 1c3.

The design 1c considers all keypoints to belong to one class. The design 2c
considers the 2 upper keypoints to belong to one class while the 2 lower keypoints
belong to another class. The design 4c considers the 4 keypoints to belong to four
different classes. The design 1c3 considers the 2 lower keypoints and the centre of
the LP to belong to the same class, ignoring the two upper keypoints as their
background varies a lot across different types of vehicles. Experiments showed
that the design 1c gives the best result. This contradicts human pose estimation,
where all keypoints belong to different classes [7].

2.3.2.3. Architecture for license plate detection.


The studies that used the heatmap regression approach had two main
architectures. The first type of architecture maintains high resolution with
various feature fusion techniques, such as HRNet [7] and the DLA network [8]. The
second type of architecture takes the form of high-to-low-to-high resolution, using
symmetric down-convolution and up-convolution combined with lateral
connections, such as ResNet-based architectures [2] and the Hourglass Network [1].

Figure 2.7. Original DDRNet architecture.

The first type of architecture gives good accuracy in this task: HRNet [7]
achieves the state of the art on the COCO test-dev human pose estimation
dataset with AP = 74.4. We find that the above architectures are closely related
to segmentation network architectures, so we adapt segmentation network
architectures to suit the problem.

An automatic license plate recognition system for traffic cameras requires
processing time to be as short as possible, because the vehicles are constantly
moving at high speed. Therefore, we prefer testing lightweight architectures,
including DLA-34 [8], HRNet-18 [7] and ResNet18 [2]. For the semantic
segmentation architecture, we choose DDRNet [10] and its variants (Figure 2.7),
as DDRNet offers a state-of-the-art trade-off between accuracy and speed when
tested on Cityscapes and CamVid, without using any inference optimization
techniques or additional training data.

To apply it to the LP keypoint detection problem, we made some changes
to the architecture of DDRNet [10]. To propagate information from lower feature
maps, we used residual connections instead of training and collecting information
from separate stages as in the original architecture. In addition, the size of the
output heatmap that we choose is proportional to the input at R = 4, whereas
DDRNet maintains a high-resolution feature map at R = 8 relative to the input
image. We therefore made two modifications, corresponding to two new
architectures, in order to suit the keypoint detection problem. The first architecture,
DDRNetsh, was obtained by changing the level of down-convolution at the first
convolution block. The second architecture, DDRNetup, was obtained by adding
one up-convolution block at the end of the original architecture (Figure 2.8).

Figure 2.8. DDRNetsup network. “RB” denotes sequential residual basic
blocks, “RBB” denotes a single residual bottleneck block, and “DAPPM” denotes
the Deep Aggregation Pyramid Pooling Module.

2.3.2.4. Keypoints decoding


Having obtained the feature maps, we apply a max pooling operator with a kernel
size of $3 \times 3$ to the feature maps of the $C$ classes. We remove points with a
score less than 0.3 and choose the top 4 points with the highest scores. These are
the 4 keypoints of the LP we need to find. Then, the position of a keypoint in the
original image is $(x_{ori}, y_{ori}) = ((x + O_x) \times R, (y + O_y) \times R)$, in which $(x, y)$ is its
position in the output feature maps.
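A minimal decoding sketch following these steps is shown below (NumPy/SciPy); the names and the exact peak-picking details are illustrative.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def decode_keypoints(heatmap, offsets, R=4, score_thr=0.3, top_k=4):
    """Recover LP corner positions from a single-class heatmap and its offset map.

    heatmap: (H/R, W/R) scores in [0, 1]; offsets: (H/R, W/R, 2) local offsets.
    """
    # A 3x3 max filter keeps only local peaks (a cheap non-maximum suppression).
    peaks = (heatmap == maximum_filter(heatmap, size=3)) & (heatmap >= score_thr)
    ys, xs = np.where(peaks)
    order = np.argsort(heatmap[ys, xs])[::-1][:top_k]      # top-4 scoring peaks
    keypoints = []
    for i in order:
        x, y = xs[i], ys[i]
        ox, oy = offsets[y, x]
        keypoints.append(((x + ox) * R, (y + oy) * R))      # back to input-image scale
    return keypoints
```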

2.3.3. License plate optical character recognition


2.3.3.1. The characteristics of Vietnamese license plates
Vietnamese vehicle LPs are designed according to government regulations
(Figure 2.9). Vietnamese vehicle LP colors include four main types: blue
background or red background with white text, and yellow background or white
background with black text.

Figure 2.9. Sizes of old and new Vietnamese LPs (unit: mm).

The characters in the number plate are in a set of 33 characters


{ABCDEFGHKLMNPQSTUVXYZ0123456789-.}. These characters are not
randomly arranged, but all belong to one of five template types (Figure 2.10).
1-line LPs are obtained from 2-line LPs by adding a “-” sign when combining the
first and second lines of a 2-line number plate.

Figure 2.10. Color and text formats of Vietnamese license plates.

2.3.3.2. Preprocessing
All LP images are normalized to the standard normal distribution and then
resized to fixed sizes according to the input shapes of the respective model
architectures. For 2-line LPs, we use the X-Y Cut algorithm [6] to find where
to split a 2-line LP into two 1-line LPs.
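A simplified sketch of such a split, based on the horizontal projection profile, is shown below; the actual X-Y Cut algorithm in [6] is more general, and the search window here is illustrative.

```python
import cv2
import numpy as np

def split_two_line_plate(plate_gray):
    """Split an 8-bit grayscale 2-line LP crop into two 1-line crops."""
    _, binary = cv2.threshold(plate_gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    row_profile = binary.sum(axis=1)                 # amount of "ink" on each row
    h = plate_gray.shape[0]
    middle = slice(h // 4, 3 * h // 4)               # look for the gap near the centre
    cut = middle.start + int(np.argmin(row_profile[middle]))
    return plate_gray[:cut, :], plate_gray[cut:, :]
```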

2.3.3.3. Architecture
The input of the VOCR is an image of a 1-line LP of size 32 × 140.
The first part of VOCR is VOCR-CNN, which extracts local features based on
VGG19 [19] (Figure 2.11). Due to the small input image size of 32 × 140, we
choose average pooling instead of max pooling to avoid losing information.
Additionally, the kernel size and stride were chosen to be small in order to
compress the information appropriately. The output of VOCR-CNN has 256
feature maps; each feature map has size 1 × 70. We consider the number of
dimensions of each feature map to be the number of time steps, thus T = 70. At
each time step t, a feature vector of size 1 × 256 contains the features of
all feature maps at that time step.

Figure 2.11. VOCR-CNN architecture.

The second part of the VOCR is the VOCR-RNN (Figure 2.12), in the form
of an encoder-decoder using Gated Recurrent Unit (GRU) networks combined
with the attention mechanism [19].

Figure 2.12. VOCR-RNN architecture.



Here $h_e$, $h'_e$ and $h_d$ denote the hidden states at each time step of the encoder
(forward and backward directions) and of the decoder, respectively. In the encoder
part, we choose a bidirectional GRU to extract interdependencies in the input
sequence. The output of the encoder at each time step is the concatenation of
$[h_e, h'_e]$ with dimension 512. The output of the encoder is passed through a fully
connected layer to reduce its number of dimensions to that of $h_d$. The attention
mechanism is inspired by [12]: each $h_d$ at each time step acts as the query, and all
$[h_e, h'_e]$ act as keys and values. Since the dimensions of the keys and the query are
not the same, we use additive attention [4] as the attention scoring function:

$$a(Q, K) = w_v^{T} \tanh(W_q Q + W_k K) \in \mathbb{R}^{1 \times T} \qquad (2.2)$$

in which $Q \in \mathbb{R}^{q \times T}$, $W_q \in \mathbb{R}^{h \times q}$, $K \in \mathbb{R}^{k \times T}$, $W_k \in \mathbb{R}^{h \times k}$ and $w_v \in \mathbb{R}^{h}$. In this work,
$Q = h_d$, $K = [h_e, h'_e]$, $h = 256$, $q = 256$, $k = 512$ and $T = 70$. Therefore, the context vector $c$
can be synthesized as below:


$$w_i = \frac{e^{a_i}}{\sum_{j=1}^{T} e^{a_j}} \in \mathbb{R} \qquad (2.3)$$

$$w = [w_1, w_2, \ldots, w_T] \in \mathbb{R}^{1 \times T}$$

$$c = V \times w^{T} \in \mathbb{R}^{v \times 1}$$

in which $w$ contains the attention weights of the time steps and $V = K$, $v = k$. The


synthesized context vector $c$ contains the important information from the
input sequence that the current decoding step needs to attend to. This makes a lot of
sense: for example, when predicting the first character of the number plate, the
attention weights $w$ will be higher for the time steps containing
information about that character.
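A small NumPy sketch of one attention step following Eqs. (2.2)–(2.3) is shown below; the weight matrices are random placeholders used only to check the shapes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(h_d, K, W_q, W_k, w_v):
    """One decoding step of the additive attention in Eqs. (2.2)-(2.3).

    h_d: (q,) decoder hidden state used as the query,
    K:   (k, T) encoder outputs acting as both keys and values.
    """
    scores = w_v @ np.tanh(W_q @ h_d[:, None] + W_k @ K)   # (T,) scores, Eq. (2.2)
    weights = softmax(scores)                              # (T,) attention weights
    context = K @ weights                                  # (k,) context vector c
    return context, weights

# Toy shapes matching the text: q = 256, k = 512, h = 256, T = 70 (random weights).
rng = np.random.default_rng(0)
q, k, h, T = 256, 512, 256, 70
context, weights = additive_attention(
    rng.normal(size=q), rng.normal(size=(k, T)),
    rng.normal(size=(h, q)), rng.normal(size=(h, k)), rng.normal(size=h))
print(context.shape, weights.shape)   # (512,) (70,)
```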

Figure 2.13. Teacher forcing training strategy in machine translation.

We train the decoder using the teacher forcing training strategy (Figure
2.23). The decoder network takes as input $h_d$ from the previous time step and the
concatenation [context vector $c$, embedding] of the label at the current time step.
The output of the decoder is the concatenation $[h_d, c, \text{embedding}]$, which goes
through a fully connected layer with output dimension N = 36 (corresponding to
the number of characters appearing in Vietnamese LPs plus 3 special tokens
<sos>, <eos>, <pad>). The character with the highest score is the decoded
character. When the <eos> token is produced, the decoding process is complete.
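The decoding loop with teacher forcing can be sketched as follows. The module names (decoder_gru, attention, embedding, classifier) are hypothetical placeholders for the components described above; at every step the ground-truth character, not the previous prediction, is fed to the decoder.

```python
import torch

def teacher_forcing_step(decoder_gru, attention, embedding, classifier,
                         encoder_outputs, target_ids, h_d):
    """One pass over a label sequence with teacher forcing (sketch only)."""
    logits_per_step = []
    for t in range(target_ids.size(1)):
        # Embedding of the ground-truth label at the current time step.
        emb = embedding(target_ids[:, t])                         # (B, E)
        # Context vector from additive attention over encoder outputs.
        context, _ = attention(h_d.squeeze(0), encoder_outputs)   # (B, k)
        # Decoder consumes [context, embedding] and its previous hidden state.
        rnn_in = torch.cat([context, emb], dim=1).unsqueeze(1)    # (B, 1, k+E)
        out, h_d = decoder_gru(rnn_in, h_d)                       # out: (B, 1, H)
        # Classify from [h_d, context, embedding], as in the text (N = 36).
        feats = torch.cat([out.squeeze(1), context, emb], dim=1)
        logits_per_step.append(classifier(feats))                 # (B, 36)
    return torch.stack(logits_per_step, dim=1), h_d               # (B, T_out, 36)
```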

2.3.3.4. Loss function


The loss function used is a combination of the cross-entropy loss and the label
smoothing technique [15] to increase the generalization of our model.
$L = -\sum_{i=1}^{N} \left( l_i \times y_i \times \log(\hat{y}_i) \right)$    (2.4)

In which $y_i$ is the ground truth, $\hat{y}_i$ is the prediction of the model, and $l_i = |y_i - \delta|$ is the
label smoothing weight for class $i$. We choose $\delta = 0.1$ based on experimentation with
all the different thresholds proposed in [15].
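A minimal PyTorch sketch of a label-smoothed cross-entropy is given below. It follows the common formulation of label smoothing from [15], which may differ in small details from the weighting written in Eq. (2.4); δ = 0.1 matches the choice above.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, delta=0.1):
    """Cross-entropy with label smoothing.

    Sketch assuming `logits` of shape (B, N) and integer class targets of
    shape (B,).
    """
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)             # log(y_hat)
    # Smoothed ground-truth distribution: 1 - delta on the true class,
    # delta spread uniformly over the remaining classes.
    smooth = torch.full_like(log_probs, delta / (num_classes - 1))
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - delta)
    return -(smooth * log_probs).sum(dim=1).mean()
```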

To optimize the loss function L, we use a parameter update method based
on gradient descent with the Adam algorithm, which automatically adjusts the
learning rate and helps avoid getting stuck in poor local minima.

2.3.4. Overall performance


From videos captured by traffic surveillance cameras, we have constructed
the Vietnamese License Plate (MTAVLP) database to serve the training of the
models for detecting and recognizing Vietnamese car license plates. The image
database was collected under various weather conditions, times, lighting,
resolutions, angles, directions, and types of plates, ensuring diversity in the data.
Data was continuously collected over three weeks, totaling more than 700 hours.
These videos were then segmented into shorter clips for preprocessing and
labeling of license plates, resulting in a detailed database as shown in Table 2-3:

Table 2-3. License plate data collected from traffic cameras

Number of rows    White     Yellow    Blue     Red      Total
1 row             6017      454       247      243      6961
2 rows            12181     634       387      323      13525
Total             18198     1088      634      566      20486

Figure 2.24. License plate images in MTAVLP dataset.



Vietnamese license plates are designed according to government


regulations. The background colors of the license plates come in four main
types: blue or red with white letters, and yellow or white with black letters.

To evaluate the license plate recognition model, we divided the license


plate dataset into three sets for training, validation, and testing with ratios of
0.75 : 0.10 : 0.15. Various architectures with a non-segmentation approach were
studied for comparison with VOCR, such as LPRnet-STN and CRNN - a
combination of CNN and RNN. Additionally, we introduced a CRNN (VGG19)
architecture based on CRNN, using VGG19 as the backbone [19].

The evaluation metric used is the sequence-level license plate


recognition accuracy $Acc_{seq}$, where a license plate is considered correctly
recognized if all OCR characters are correct without any additional characters.
However, in cases where only one character is misrecognized, this criterion
may not be very informative. Therefore, we also use $Acc_{char}$, the
character-level accuracy, for further evaluation. A character is deemed correctly
recognized if its position in the label and in the predicted result is the same.
A short sketch of both metrics is given below.
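This is a minimal illustration, assuming predictions and labels are lists of plate strings; how extra or missing characters are penalized is an assumption here (positions beyond the shorter string simply count as wrong).

```python
def sequence_accuracy(predictions, labels):
    """Acc_seq: a plate counts only if the whole string matches exactly."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

def character_accuracy(predictions, labels):
    """Acc_char: a character counts if it matches at the same position."""
    correct, total = 0, 0
    for pred, label in zip(predictions, labels):
        total += len(label)
        correct += sum(p == c for p, c in zip(pred, label))
    return correct / total
```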

The tables of results below show that VOCR exhibits superior


performance compared with the other models, achieving $Acc_{seq}$ = 99.28% and
$Acc_{char}$ = 99.7% on the test set.

Table 2-4. Results of license plate recognition with Vietnamese license plates

Model            Train set               Val set                 Test set
                 Acc_seq    Acc_char     Acc_seq    Acc_char     Acc_seq    Acc_char
VOCR             99.34%     99.8%        99.25%     99.9%        99.28%     99.7%
LPRnet-STN       92.3%      96.9%        94.1%      96.4%        94.8%      96.8%
CRNN             92.9%      96.9%        93.6%      97.2%        93.4%      97.1%
CRNN (VGG19)     92.3%      95.5%        92.5%      95.4%        93.0%      96.2%

Our proposed ALPR system was tested using a single NVIDIA GeForce
RTX 2080 GPU, without using any inference optimization techniques
such as TensorRT. We recorded the processing speeds of the stages as follows: the
processing speed was 38.28 FPS in the car detection stage using YOLOv2; in
the LP detection stage, DDRNet23sh ran at 103.5 FPS; and in the LP
recognition stage, we got 36 FPS with VOCR. Consequently, our ALPR system
runs at 15.73 FPS on a single NVIDIA GeForce RTX 2080 without any
inference optimization techniques. Based on these results, we believe that
our ALPR system can run in real time when deployed with additional
inference optimization techniques.
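For reference, the end-to-end figure follows from summing the per-stage latencies, assuming the three stages run sequentially on every frame (a back-of-the-envelope check, not a benchmark script):

```python
stage_fps = [38.28, 103.5, 36.0]   # car detection, LP detection, LP recognition
total_latency = sum(1.0 / fps for fps in stage_fps)   # seconds per frame
print(round(1.0 / total_latency, 2))                  # ~15.73 FPS end to end
```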

Figure 2.25. Results of using DDRNet23sh and VOCR.

2.4. Deploy computer vision services to production


2.4.1. Deploying computer vision services
Developing a model is one thing, but serving a model in production is a
completely different task. After researching and experimenting with various
computer vision models, the next step is to deploy them in a way that they can
provide services for real-world use. In recent years, computer vision has
transformed the way we interact with technology, enabling machines to interpret
and understand the visual world just like humans do. From self-driving cars to
facial recognition systems, computer vision applications are becoming
increasingly prevalent in our daily lives. However, the transition from building a
successful computer vision model to deploying it in a real-world production
environment comes with its own unique set of challenges and considerations.

Deploying computer vision services in production involves taking a trained


model from a development environment and integrating it into a live system that
can handle real-time image data, process it accurately, and provide valuable
insights or actions. This process requires careful planning, optimization, and an
understanding of various factors that affect the model's performance and
reliability.

Deploying computer vision models in production is a complex endeavour


that requires a holistic approach encompassing data, models, infrastructure,
and processes. In the development phase, our primary concern is whether we can
solve the given problem at all. This could be achieved using traditional
techniques with OpenCV or more complex methods involving machine learning
and deep learning algorithms. As we move into the deployment phase, the
ability to meet practical requirements becomes even more crucial. There are
methods that may achieve high efficiency of around 98-99% when
published, but in real-world deployment they might not be usable due to various
factors. Scaling computer vision models is also a key challenge.

2.4.2. Flask framework


Flask, a lightweight and versatile Python web framework, empowers
developers to create robust and efficient web applications and APIs. Flask's
simplicity and flexibility make it an excellent choice for building APIs, which
serve as the backbone of modern software systems, allowing applications to
communicate and exchange data seamlessly over the internet.

Flask APIs enable developers to expose functionalities, resources, or data to


external consumers, whether they are web browsers, mobile apps, or other
services. These APIs are essential for various use cases, including serving
machine learning models, processing images, managing databases, and much
more.
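A minimal Flask sketch of such an API is shown below. The route name, the multipart field name, and the run_model helper are illustrative assumptions rather than the actual service implemented in this work.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/v1/detect", methods=["POST"])
def detect():
    """Minimal sketch of a computer-vision endpoint."""
    if "image" not in request.files:
        return jsonify({"error": "missing 'image' file field"}), 400
    image_bytes = request.files["image"].read()
    detections = run_model(image_bytes)        # hypothetical model wrapper
    return jsonify({"detections": detections})

def run_model(image_bytes):
    # Placeholder: decode the image and run the actual detector here.
    return []

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

A client can then call the endpoint with a multipart POST containing an image file and receive the detections as JSON.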

In this digital age, where interconnectedness drives innovation, Flask APIs


play a pivotal role in enabling applications to interact and collaborate
effectively. By harnessing Flask's capabilities, developers can create APIs that
are not only powerful and scalable but also intuitive and easy to maintain. This
opens the door to a wide array of possibilities for integrating services,
automating processes, and enhancing user experiences.

2.4.3. The challenges when deploying computer vision systems in


real-world settings
Computer vision is a rapidly growing field with a wide range of applications
in real-world settings. However, there are still a number of challenges that need
to be addressed before computer vision systems can be deployed at scale.

* Large user load



Computer vision systems are often used in applications where there is a


large number of users, such as in self-checkout systems or facial recognition
applications. This can put a strain on the system's resources, and it can be
difficult to ensure that the system can handle the load without experiencing
performance degradation.

For example, a self-checkout system with a large number of users can lead
to long lines and wait times. This can be a frustrating experience for customers,
and it can also reduce the efficiency of the system.

* Changing real-world conditions

Real-world conditions can change rapidly and unpredictably. This can make
it difficult for computer vision systems to maintain accuracy and performance.
For example, a car license plate recognition system must be able to handle changes
in lighting, weather, and traffic conditions.

* Hardware and software limitations

Computer vision systems often require specialized hardware and software.


This can make it difficult and expensive to deploy these systems on a wide range
of devices. For example, facial recognition or license plate recognition often
require high-resolution cameras and powerful processors.

It is important to be aware of these challenges when deploying computer


vision systems in real-world settings. By taking steps to address these
challenges, you can help to ensure that your systems are reliable and effective.

In the next section, I will propose the Restreaming technique to address the
above problems.

2.4.4. Restreaming technique


During my research and testing, I found that IP cameras used in traffic
monitoring are limited in the number of simultaneous connections for
live viewing. For example, the Hikvision camera model DS-2DE5432IW-AE
allows a maximum of 10 simultaneous connections. This is understandable due
to hardware limitations on IoT devices. Handling connection requests
simultaneously for a long time and continuously affects the lifespan of the
camera device. To handle more requests, one can upgrade to high-end cameras,
but in practice this approach is not a priority. Compared to the cameras, the
computer servers have much more powerful hardware, which is also easy to
upgrade. Therefore, I propose this technique to transfer the task of handling
live-view requests from the camera to a high-end computer server. This reduces
the load on the cameras in particular and on IoT devices in general. At the same
time, when many clients request the live-view images or the images with
intelligent analysis, the server only needs to process them once, saving the
already limited GPU resources.

In particular, this technique has the following functions:

- Create a buffer for each camera, storing the frames read from that camera.

- When a user submits a request to watch a live view, the image is

retrieved from this buffer instead of asking the camera to initiate a new
connection.

- A cache also stores images after they have been analyzed by the AI modules
(including bounding boxes and annotations).

- The number of simultaneous live-view connections is no longer limited.

With this technique, we have solved the problem of shared processing


resources in the whole system. A minimal sketch of the buffering idea is given below.
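The sketch assumes OpenCV is used to read the RTSP stream; the class and method names are illustrative, not the exact implementation of the system.

```python
import threading
import cv2

class CameraRestreamer:
    """Per-camera frame buffer for restreaming.

    A single reader thread keeps exactly one RTSP connection to the camera
    and stores the latest frame; any number of clients read from this
    buffer instead of opening their own connection to the camera.
    """
    def __init__(self, rtsp_url):
        self._capture = cv2.VideoCapture(rtsp_url)
        self._lock = threading.Lock()
        self._latest_frame = None
        self._analyzed_frame = None          # cache for annotated frames
        threading.Thread(target=self._reader, daemon=True).start()

    def _reader(self):
        while True:
            ok, frame = self._capture.read()
            if ok:
                with self._lock:
                    self._latest_frame = frame

    def get_frame(self):
        """Called by every live-view client; no extra camera connection."""
        with self._lock:
            return None if self._latest_frame is None else self._latest_frame.copy()

    def set_analyzed(self, frame):
        """Store a frame after the AI modules have drawn their results."""
        with self._lock:
            self._analyzed_frame = frame

    def get_analyzed(self):
        with self._lock:
            return None if self._analyzed_frame is None else self._analyzed_frame.copy()
```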

2.5. Setting up environment for AI services and designing


architecture for multiple analytics on video streaming
For optimal support, I use the Ubuntu 20.04 LTS operating system to deploy
artificial intelligence models for the system. These models are implemented
using the Python language.

2.5.1. Setting up driver for hardware and required libraries


GPUs (Graphics Processing Units) play a crucial role in the field of
Artificial Intelligence (AI) and machine learning. Their computational power
significantly accelerates model training processes, particularly in deep learning,
where complex mathematical calculations are involved. GPUs efficiently handle
matrix operations, a fundamental component of many AI models, and their
multitasking capabilities allow for concurrent execution of multiple tasks.

Popular AI frameworks today, such as PyTorch and TensorFlow, offer GPU


support, providing significantly faster computation compared to CPUs.
However, the setup for utilizing GPUs can vary depending on the versions of
these frameworks. Detailed instructions on how to set up GPU support can
usually be found on their official websites. Most machine learning and deep
learning models heavily rely on GPUs to accelerate processing, making this step
critically important for achieving real-time performance.

CuDNN, or CUDA Deep Neural Network library, is a vital component in the


realm of artificial intelligence (AI) and deep learning. Developed by NVIDIA,
CuDNN is a specialized software library designed to optimize the performance
of deep neural networks on NVIDIA GPUs (Graphics Processing Units). Its
primary role is to enhance the execution of deep learning operations, and its
significance in running AI models cannot be overstated.

Every deep learning model relies on pre-existing libraries and packages to


perform complex computations. To ensure compatibility and avoid conflicts
between these libraries and packages, specific versions are often tightly

specified. Each version of a library can be fine-tuned to work with particular


versions of other libraries it depends on. This poses a challenge when managing
environments, especially when we want to deploy multiple machine learning
models on the same server or resource-constrained device.

In cases where resources are limited, such as budget-constrained hardware


purchases, we need to employ environment isolation techniques to manage
different versions of libraries that various models may require. One of the
popular applications for this purpose is Anaconda (or its lightweight counterpart,
Miniconda) or using virtualenv.

Anaconda is a powerful development and package management environment


designed for data science and machine learning. It allows you to create isolated
virtual environments, each capable of containing different versions of libraries
and packages. This helps maintain compatibility between models and libraries
effortlessly while conserving server resources.

Utilizing environment isolation techniques like Anaconda or virtualenv is a


crucial part of environment management in AI and machine learning projects,
especially when dealing with multiple models and diverse version requirements
for complex libraries.

In my system, I used Miniconda to create isolated environments for AI


modules.

2.5.2. Multiple analytics and single analytic


A single analytic means using only one model to analyze an input
image or a sequence of images (video). Examples include face recognition,
counting the number of people entering and exiting, counting the number of
vehicles, etc. With a single analytic, we typically design the processing in an
offline manner, meaning the input image (or video) is located on the localhost.
In this case, the processing server reads directly from the machine at a very fast
speed. This is similar to playing a pre-existing video on our computer.

Figure 2.26. Single analytic

However, in practice, there arises a need to apply multiple models with


independent functions to analyze the same image, referred to as multiple
analytics. For instance, consider two models: one for face recognition and
another for fire detection. An image captured by a camera might contain a
person named 'Nguyen_Van_Tri' and a blazing fire. With multiple analytics, the
resulting image will have a bounding box with the name 'Nguyen_Van_Tri' and,
simultaneously, another bounding box indicating the area where the fire is
occurring. In contrast, with a single analytic, the result would either be from the
face recognition or from the fire alert, rather than a combination of both. Thus,
multiple analytics provides a more comprehensive visualization for users.

Figure 2.27. Multiple analytics

To implement the multiple analytics technique, we can follow two


approaches: sequentially process the image through each single analytic, or
parallelize the single analytics and then combine the results.

Sequentially process through each single analytic:

For this approach, we have two program implementation directions:

Firstly, let the image sequentially pass through each independent model.
This method has the advantage of keeping the models independent of each other.
However, in terms of speed, it will decrease significantly with an increasing
number of models.

Figure 2.28. Sequentially process through each single analytic

Secondly, merging the source code of the models together. This method is
approximately 30% faster than the previous one. However, the drawback is that
it's not always feasible to implement this way due to library conflicts between
models, thus reducing the flexibility in combining models.

Parallelizing the single analytics and then combining the results:

The advantage of this method is that it maintains the independence of each


model, making the deployment process easier. The downside is the need for a
solution to synchronize the processing streams of the different models. For instance,
if face recognition runs at 30 FPS and car license plate recognition at 25 FPS,
the result from the face recognition model arrives slightly earlier than that of the
car license plate model. Therefore, to combine the results of the different models,
the same input image must be provided to all of them, as in the sketch below.
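The parallel approach can be sketched as follows, assuming each analytic is a callable that takes a frame and returns its detections; the module names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_multiple_analytics(frame, analytics):
    """Run several independent analytics on the *same* frame in parallel.

    `analytics` is a dict of hypothetical callables, e.g.
    {"face": face_module, "plate": plate_module, "fire": fire_module}.
    """
    with ThreadPoolExecutor(max_workers=len(analytics)) as pool:
        # Every module receives the identical frame, so results stay in sync.
        futures = {name: pool.submit(fn, frame) for name, fn in analytics.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    # Combined result: e.g. bounding boxes from all modules on one image.
    return results
```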

2.5.3. Design architecture for AI module APIs


After developing and optimizing computer vision models, the next
consideration is the actual deployment and real-world application of these
models. This is especially important when the model needs to be integrated into
customer-facing products or services. In this context, the development of APIs
(Application Programming Interfaces) that provide computer vision services
becomes critically essential.

APIs serve not just as a bridge between the computer vision model and other
applications, but also facilitate more flexible and efficient deployment,
management, and scaling of the model. They offer a standardized interface
through which other developers can interact with the model without needing to
understand its underlying technical details. This helps to mitigate risks, enhance
security, and simultaneously increase the availability and readiness of the
service.

By constructing APIs, organizations can create a modular system in which


individual components can be developed, updated, and managed independently.
This not only enhances flexibility in product development but also creates
favorable conditions for collaboration and cooperation with business partners or
other development teams.

2.5.3.1. REST API


REST API (Representational State Transfer Application Programming
Interface) is a foundational concept in modern web development and
distributed systems. It provides a structured and efficient way for software
applications to communicate and interact with each other over the internet. At its
core, REST is built on a few key principles that make it a powerful and widely
adopted architectural style. One of the central ideas is the notion of resources,
which can be anything from data objects to services. Each resource is identified
by a unique URL, making it easily accessible via HTTP requests. REST APIs
make extensive use of HTTP methods, such as GET for retrieving data, POST
for creating new resources, PUT for updating existing resources, and DELETE
for removing them. This uniformity in the use of HTTP methods simplifies the
interaction with resources and enables a consistent approach across different
API endpoints. Furthermore, REST APIs emphasize statelessness, which means
that each request from a client to a server must contain all the necessary
information to understand and process the request. The server does not retain

any information about the client's state between requests, enhancing scalability
and reliability. RESTful communication also allows for multiple representations
of resources, including JSON, XML, or HTML, catering to the diverse needs of
clients. Clients can specify their preferred representation using HTTP headers,
and servers respond accordingly. Moreover, REST APIs use self-descriptive
messages, where each response includes metadata and headers that describe the
content and how to interpret it. This self-description simplifies client
understanding and interaction with the API. In summary, REST APIs provide a
structured, stateless, and uniform way for software systems to communicate and
exchange data over the web. They have become a fundamental component of
modern web and mobile application development, enabling seamless integration
and data exchange between various services and platforms.

Certainly, for services related to image classification and object detection,


APIs are typically built to provide endpoints for clients, often following the
RESTful architecture. This approach is not just an engineering necessity but also
a critical business decision that can shape how the service is consumed, scaled,
and even monetized.

In RESTful APIs, each function related to image classification or object


detection is usually exposed through specific HTTP endpoints. Clients can
interact with these endpoints by sending HTTP requests and receiving responses
in return, typically in JSON format. This modular and stateless architecture
simplifies scaling and allows for more straightforward integration with various
platforms and technologies.
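As an illustration, a client might call such a detection endpoint as follows; the URL, the field name, and the response layout are assumptions for the example, not the actual interface of this system.

```python
import requests

# Minimal sketch of a client calling a hypothetical detection endpoint.
with open("car.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/api/v1/detect",
        files={"image": f},
        timeout=10,
    )
resp.raise_for_status()
print(resp.json())   # e.g. {"detections": [...]}
```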

Although REST API is widely used and has many advantages such as
flexibility and ease of integration, there are some limitations when applied in a
microservices environment. First, the JSON data format commonly used in
REST can increase latency and reduce performance due to the need for
serialization and deserialization. Additionally, REST has limitations in

supporting data streaming, which reduces its applicability in real-time


applications that require continuous interaction between services. For
monitoring systems, the real-time factor is highly emphasized. Therefore,
limitations in performance, features, and management can become significant
weaknesses to consider.

2.5.3.2. gRPC and gRPC API


gRPC, which stands for gRPC Remote Procedure Calls, is a cutting-edge
remote communication protocol spearheaded by Google. Built on the high-
performance and reliable HTTP/2 transport layer, gRPC employs Protocol
Buffers or "protobufs" as its preferred data format. This combination of HTTP/2
and protobufs creates an environment that is not only highly performant but also
consistent and scalable, making it an ideal choice for complex architectures like
microservices and distributed applications.

One of the standout features of gRPC is its support for full-duplex


streaming, which allows two-way data exchange in real-time. This is
complemented by the protocol's multiplexing capabilities, enabling multiple
requests and responses to be handled concurrently over a single connection.
Additionally, gRPC is pluggable, meaning it can easily incorporate advanced
features such as security protocols and load balancing through the use of
plugins.

The protocol also places a strong emphasis on consistency and scalability.


This is evident in how services and APIs are explicitly defined using Protocol
Buffer files, which not only ensures a uniform interface but also facilitates code
generation. Given that it's designed to be efficient across a broad spectrum of
operational scales, from small single-machine applications to vast distributed
systems, gRPC offers a level of scalability that is hard to match.

When it comes to security, gRPC is not found lacking. It robustly supports


advanced security measures, including Mutual Transport Layer Security
(mTLS), which makes it a secure choice for various application landscapes.

In summary, gRPC's versatility makes it a perfect fit for a variety of use


cases. Whether you're dealing with a microservices architecture that demands
high consistency and scalability, or real-time applications requiring rapid data
transmission and low latency, gRPC offers a compelling set of features designed
to meet these challenges. Moreover, its compatibility with multiple
programming languages like C++, Java, Python, Go, and Ruby adds another
layer of flexibility, making it a universally appealing choice for developers.

Therefore, when evaluating communication protocols for your system, it's


essential to consider your specific requirements and objectives. gRPC, with its
high performance, consistency, and scalability, is often a strong candidate that
can meet a wide range of application needs.

gRPC API (gRPC Remote Procedure Calls Application Programming


Interface) is a framework that leverages the features of gRPC for building highly
performant, secure, and scalable applications. Built on HTTP/2 and Protocol
Buffers, the gRPC API offers numerous advantages, such as reduced latency and
full-duplex communication. It is particularly useful for image transmission
tasks, where reducing latency is a priority. Therefore, I used gRPC APIs to
implement my AI modules in order to achieve real-time image processing; a
minimal client sketch is given after Figure 2.29.

Table 2-5. Compare gRPC API and REST API

Features            gRPC API              REST API
Transport Layer     HTTP/2                HTTP/1.x
Data Format         Protocol Buffers      JSON/XML
Streaming           Full-duplex           Limited
Latency             Lower                 Higher
Multiplexing        Supported             Not supported
Load Balancing      Advanced              Basic
Security            mTLS supported        TLS supported

Figure 2.29. An example of protobuf script
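A minimal Python client sketch for such a service is shown below. The analytics_pb2 modules, the service name, and the message fields are hypothetical stand-ins for code generated from a .proto definition like the one in Figure 2.29, not the actual modules of this system.

```python
import grpc
# Hypothetical modules generated from a .proto definition (see Figure 2.29).
import analytics_pb2
import analytics_pb2_grpc

def detect_plate(image_bytes, host="localhost:50051"):
    """Minimal sketch of calling an AI module over gRPC."""
    with grpc.insecure_channel(host) as channel:
        stub = analytics_pb2_grpc.LicensePlateStub(channel)
        request = analytics_pb2.ImageRequest(image=image_bytes)
        reply = stub.Detect(request, timeout=5.0)
        return reply  # e.g. bounding boxes and recognized plate strings

if __name__ == "__main__":
    with open("frame.jpg", "rb") as f:
        print(detect_plate(f.read()))
```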

2.5.4. MQTT protocol


2.5.4.1. Components of MQTT
MQTT consists of the following components:

Client: The client can be either a publisher or a subscriber that connects to a


centralized broker over a network. Clients can be persistent or transient,
depending on whether they maintain a session with the broker. MQTT client
libraries exist for many languages, such as C, C++, Go, Java, C#, PHP, Python,
Node.js, and Arduino.

Broker: The broker is a software component that receives all messages from
publishing clients and forwards them to subscribing clients. It maintains
connections with continuous clients and can be scalable depending on the
implementation.

Topic: A topic in MQTT serves as an endpoint to which clients connect. It


acts as a central distribution point for publishing and subscribing to messages.
Topics are simple hierarchical strings, encoded in UTF-8 and separated by
slashes.

Message: The message is the data that a device receives when it subscribes
to a topic or sends when publishing to a topic.

Publish: the process by which a device sends its message to the broker.

Subscribe: the action a device takes to retrieve messages from the broker.

2.5.4.2. How MQTT work


Like any other internet protocol, MQTT is based on clients and a server.
Similarly, the server is responsible for handling the clients' requests for
receiving or sending data between the client and the server. The MQTT server is
called a broker, and the clients are simply connected devices. Therefore:

- When a device (a client) wants to send data to the broker, this activity is
called 'publish.'

- When a device (a client) wants to receive data from the broker, this activity
is called 'subscribe.'

Figure 2.30. MQTT components

Figure 2.31. An example for MQTT usage

Additionally, these clients can both publish and subscribe to topics.


Therefore, the brokers here have the responsibility of handling the
publish/subscribe actions with the target topics.
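A minimal publish/subscribe sketch using the paho-mqtt Python library is shown below; the broker address and the topic name are illustrative assumptions, not the actual configuration of this system.

```python
import json
import paho.mqtt.client as mqtt

BROKER, TOPIC = "localhost", "cameras/cam01/alerts"

def on_message(client, userdata, msg):
    # Called by the client loop whenever a message arrives on a subscribed topic.
    print("alert received:", json.loads(msg.payload))

# Subscriber: connects to the broker and listens on the topic.
subscriber = mqtt.Client()
subscriber.on_message = on_message
subscriber.connect(BROKER, 1883)
subscriber.subscribe(TOPIC)
subscriber.loop_start()                       # background network loop

# Publisher: sends an alert message to the same topic.
publisher = mqtt.Client()
publisher.connect(BROKER, 1883)
publisher.publish(TOPIC, json.dumps({"type": "fire", "camera": "cam01"}))
```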


2.5.4.3. Compare with HTTP


HTTP is slower, more expensive, and consumes more energy than MQTT:

Slower: because it uses larger data packets to communicate with the server.

Cost: HTTP requires opening and closing the connection for each request,
while MQTT stays online, keeping the channel always open between the broker
and the client.

Energy consumption: because it takes more time and more data packets, it
therefore uses more energy.

Figure 2.32. Compare MQTT and HTTP

2.6. Conclusion for chapter 2


In this chapter, I have presented the deep learning-based computer
vision models that I used in the system and explained the reasons for adjusting
these models to meet real-world requirements. In addition, I also proposed
technical solutions, which I successfully tested, to address common performance-
related issues encountered during the deployment of systems integrating
multiple artificial intelligence models. In the following chapter, I will develop a
large-scale intelligent monitoring system based on cameras, using the
knowledge presented in this chapter.

Chapter 3

3. LARGE-SCALE INTELLIGENT MONITORING SYSTEM


3.1. Introduction about intelligent monitoring system based on
camera
Due to the growing demand for automation and the impetus to make
everything "smart", IoT sensors are increasingly being deployed, and there has
been rapid technological advancement to utilize raw sensor data and generate
actionable insights from them. The global smart sensor market size is expected
to grow from $37 billion in 2020 to $90 billion by 2027 at a CAGR of 14% [5].
Among these IoT sensors, video cameras are very popular, since they act as our
“eyes" to the world and many important video analytics tasks, which we refer to
as Analytical Units (AUs), are performed on the raw visual data feed from video
cameras. Nowadays, camera surveillance systems have become incredibly
prevalent in our daily lives. The convenience and benefits these systems offer
have turned them into a crucial element in ensuring security and managing
important areas.

The development of camera surveillance systems has marked a significant


stride in capturing and monitoring events and activities in real-world
environments. From their humble beginnings, modern camera surveillance
systems have evolved to become more complex and intelligent. They are not
limited to merely displaying images from cameras; they have the ability to
analyze data, recognize objects, and even provide automatic alerts when
unforeseen events occur. One of the critical applications of camera surveillance
systems is ensuring security in densely populated areas and important locations
such as warehouses. These are places where the protection of human safety,
assets, and security is of utmost importance. These systems are capable of

monitoring and recording all activities around the clock, ensuring that no
incident goes unnoticed or unmonitored.

Camera-based monitoring systems have undergone significant development.


However, currently, surveillance systems are limited in their capabilities.
Generally, these systems only allow users to view recorded video footage. While
some camera devices have additional features such as motion detection or
people counting, these features often come at a significant cost, making it
difficult for these systems to be adopted widely due to the associated expenses.
Additionally, these systems typically lack the ability for users to customize them
to suit the specific needs of their applications. This can lead to situations where
the surveillance system does not meet the requirements of the user, resulting in
inadequate protection.

With the recent advancements in deep learning, we have witnessed the


emergence and significant development of surveillance systems that integrate
deep learning-based analytics. This has brought forth tremendous promise for
the field of surveillance and security.

These systems go beyond simply providing users with images and videos.
They have the capability to directly analyze data from these images and videos
using complex deep learning models. They can recognize objects, classify them,
and even predict behavior in real-time.

This opens up new opportunities for enhancing real-time monitoring and


event prediction. For instance, these systems can easily identify abnormal
behaviors, such as trespassing into restricted areas or threatening activities. This
can improve overall security and response.

However, along with these benefits come certain challenges.

* High computational resources



Integrated surveillance systems with deep learning require significant


computational resources due to the complexity of deep learning models and the
need for substantial data for training. This entails powerful servers or computing
clusters with GPUs, ample storage, high-speed networks, high bandwidth, and
optimization software. These resources are vital for successful deployment.

* Hardware limitations

In practice, IoT devices in general, and IP cameras in particular, often come


with hardware limitations. Through real-world testing of various types of IP
cameras, ranging from budget to high-end, it is observed that most cameras limit
the number of concurrent live video streams over the RTSP protocol (for
example, Hikvision cameras allow around 10 concurrent streams). Smaller
cameras tend to have even more stringent limitations on the number of live
video streams, rendering the system incapable of handling a large number of
requests to a single camera device. Furthermore, processing video streams
directly on the camera puts significant stress on these devices, leading to a
reduced lifespan. To serve a larger number of requests, one would need to
upgrade to higher-end cameras. However, this approach can be costly and may
not yield a significant improvement in capacity.

* Real-time problem

The issue of real-time capability in surveillance systems is an extremely


important factor. Quick and accurate response time is a decisive element for the
effectiveness of such systems. A specific example is when a fire occurs; the alert
must be triggered immediately within 1-2 seconds after the event. This allows
on-duty personnel or automated systems a significant amount of time to respond
and manage the situation effectively. Furthermore, in other emergency situations
such as unauthorized intrusion into restricted areas, medical emergencies, or
threatening behaviors, real-time responsiveness is equally critical. The
difference of mere seconds can determine whether an incident is prevented or

leads to serious consequences. Therefore, the ability to provide real-time


information and alerts is an indispensable factor in ensuring the safety and
performance of surveillance systems.

3.2. A large-scale intelligent monitoring system


3.2.1. The main components, functions and actors of the system
The system is built with 3 components:

 Client-side software: used to manage the system and monitor
important places through cameras.
 The server side: manages and controls the data flow from the
cameras and handles requests from the client to the AI modules.
 3 microservices: the face recognition module, the fire alert module, and the
car license plate recognition module, which return the results to the
client-side software.

3.2.1.1. Main components and functions


a. Client-side software.

Manage camera devices (IP address, Port, connection string, Camera


location, camera type, ...)

 Camera list and status


 Add new camera
 Edit camera information
 Remove camera from database
 Setting up the ROI for image monitoring and analysis.
 Select one or more analytics modules to run on this camera.

User management

 Look up user by name


 Add new user to database
 Update user’s role

Live view camera images as a normal grid or in multiple analytics mode.

 Live view camera
 Choose the AI models to apply on the camera
 Show unauthorized faces/license plates
 Show the fire area

Alert by sound and pop-up

 Alert when fire or smoke is detected in the ROI
 Alert when a car that is not on the registered list enters, showing a cropped
image of the car and its license plate
 Alert when an unregistered face appears in a protected zone
b. The server manages and controls the data flow from the cameras,
handles requests from client to AI modules

Receive and process requests from client-side software.

Route requests to the necessary microservices (AI modules)

Receive results from micro services, draw bounding boxes on the image, and
send it back to the client-side software.

Working with database

c. 3 micro services for image analytics

- The license plate recognition module using gRPC API

- The face recognition module using gRPC API

- The fire alert module using gRPC API

Each module is responsible for processing data sent from the server.

Return results in JSON format to the server



3.2.1.2. System actors


Based on the purpose of the system, we identify two actors:

System-wide manager (Admin): Admin has the role of general management


of the entire system. Manage user access rights. Perform all operations on
system software.

Operating staff: In charge of the implementation of all the basic functions of


the system.

3.2.2. System structure.


3.2.2.1. Camera management subsystem.
View live images transmitted from the camera in a grid or view each camera
in detail with functions:

 Control panel with camera controls: Pan, Tilt, Zoom, etc.


 Video operations such as: Play, stop, turn on display with license plate
recognition mode, run video playback.
 Display the image of the license plate recognized on that camera.

Comprehensive and complete management of camera information such as:


IP address, connection ports, access information, connection string, camera
location on the map. Includes functions:

 Add a camera device with the following contents: camera name, IP
address, connection ports, camera type, login information (username,
password), connection URL string, and location of the camera with
digital map support.
 Update camera device information with the following contents: camera
name, IP address, connection ports, camera type, login information
(username, password), connection URL string, and location of the camera
with digital map support.
 Delete camera information.

3.2.2.2. Person management subsystem


- Add person to database

- View person information and images

- Delete person from database

- View history of entered person.

3.2.2.3. Car management subsystem


- Add car and license plate information to the database

- View car information

- Delete car object from database

- View history of entered car with license information and image.

3.2.2.4. User management subsystem


Manage user information including information such as usernames,
passwords, roles with the following functions:

 Add users
 Update user information
 Remove user from database

Manage user roles with permissions for each role:

 Add role
 Update role information
 Remove user roles

3.2.3. Use case diagram

Figure 3.33. Use case model for the system manager (admin).

3.2.4. Architectural design


The system uses Client-server architecture consisting of 3 main components:

 3 modules: face recognition, car license plate recognition, fire alert


 Main server manages and processes data
 Client-side software

3.2.5. User interface design


3.2.5.1. Login interface

Figure 3.34. Login interface.



3.2.5.2. Grid camera view interface

Figure 3.35. Grid camera view interface.

3.2.5.3. Unregistered license plate alert interface

Figure 3.36. Alert interface.



3.2.5.4. Unregistered face alert interface

Figure 3.37. Camera list interface.

3.2.5.5. Interface for adding and updating camera information.

Figure 3.38. Interface for adding and updating camera information

3.2.5.6. Video playback interface



Figure 3.39. Video playback interface.

3.2.5.7. Single camera view interface

Figure 3.40. Single camera view interface

3.2.5.8. List of persons stored in the database interface

Figure 3.41. Vehicle tracking interface.



3.2.5.9. Interface for viewing history of car license plate recognition


module

3.2.5.10. Interface for viewing history of face recognition module

Figure 3.42. Result interface of number plate recognition by camera.

3.3. Model of deployment.


The system is divided into 3 main components:

- 3 AI modules: face recognition, car license plate recognition, fire alert



- Main server manages and processes data


- Client-side software

Figure 3.43. System model.

3.3.1. Database server


Database server is responsible for synchronizing data from all clients:

- Receive and process data sent back from camera management servers, data
storage and database management.

- Return query results for client users

3.3.2. Routing server


- Receive request from client-side software.

- Transmit image from client-side to Analytic server

- Processing restreaming

3.3.3. Analytics server


- Face recognition API

- Fire alert API

- Car license plate recognition API

3.4. Conclusion for chapter 3


In this chapter, we have built a system model of an intelligent monitoring
system based on IP cameras, built three AI modules based on machine learning
techniques, and built a user interface so that users can easily perform system
operations instead of interacting with a bare console.

CONCLUSION

The results that have been achieved

We have built a system model of an intelligent monitoring system based on


IP cameras, built multiple analytics based on machine learning and deep
learning techniques, visually displayed on a digital map, and built a user interface
so that users can easily perform system operations. Specifically:

- The program is easy to use and meets the objective needs of an intelligent
monitoring system based on IP cameras.
- A basic program was designed with basic functions that meet user requirements.
- The interface is attractive to users and does not cause boredom during
operation.
- Support functions are suitable for information search and disaster incident
information system management.
- A restreaming technique was proposed to save bandwidth and resources for
advanced processing.
- Multiple techniques were applied to work around limited hardware on IoT devices.

Some limitations

Although this work has achieved certain results, a number of limitations


have not yet been resolved due to limited capabilities and time. Some
limitations are as follows:

- We consider the real-time camera-based monitoring problem only in the
daytime, so in low-light conditions or at night time, the accuracy of our
methods will decrease.
- When traffic is congested and vehicles move close together, the system
will be slow.

- There has not been sufficient time and conditions to conduct tests on the
military-scale QS-Net network, or with an extremely high volume of
requests, to assess the system's load-bearing capacity.

Future work

Continue to refine and add functionalities to the user interface.

Improve the artificial intelligence models to perform better under various


weather conditions.

Maintain and fix errors arising when deploying the system for a long time.

Complete and update more functions to improve the user experience, as well as


ensure reasonable use of system resources and reasonable data management.

REFERENCES

[1] Alejandro Newell, Kaiyu Yang, and Jia Deng, "Stacked Hourglass
Networks for Human Pose Estimation," 2016.

[2] Bin Xiao, Haiping Wu, and Yichen Wei, "Simple Baselines for Human
Pose Estimation and Tracking," 2018.
[3] Duan. T.D., Du. T.L.H., Phuoc. T.V, Building an Automatic Vehicle
License-Plate Recognition System, International Conference in Computer
Science, 2020, p.323-329.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Y. Bengio, "Neural Machine
Translation by Jointly Learning to Align and Translate," 2014.
[5] Einstein, A., B. Podolsky, and N. Rosen, 1935, “Can quantum-mechanical
description of physical reality be considered complete?”, Phys. Rev. 2019,
p.777-780.
[6] Faisal Shafait and Thomas Breuel, "The Effect of Border Noise on the
Performance of Projection-Based Page Segmentation Methods," 23.
[7] Feng Zhang et al., "Distribution-Aware Coordinate Representation for
Human Pose Estimation," 2019.
[8] Fisher Yu et al., "Deep Layer Aggregation," 2019.
[9] Guo, Jia, et al. "Sample and computation redistribution for efficient face
detection.", 2021.
[10] Khanh Nguyen, Dan Pham Van, Van Pham Thi Bich, "An efficient method
to improve the accuracy of Vietnamese vehicle license plate recognition in
unconstrained environment," MAPR, 2021.
[11] Mark Everingham et al., "The Pascal Visual Object Classes (VOC)
Challenge," 2010.
[12] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective
Approaches to Attention-based Neural Machine Translation," 2015.
[13] Olga Russakovsky et al., "ImageNet Large Scale Visual Recognition
Challenge," 2014.
[14] R. Laroca et al., A Robust Real-Time Automatic License Plate Recognition
Based on the YOLO Detector, p. 58-66, 2018.

[15] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton, "When Does
Label Smoothing Help?," 2019.
[16] Tsung-Yi Lin et al., "Microsoft COCO: Common Objects in Context,"
2015.
[17] Xiao, Jianyu, Guoli Jiang, and Huanhua Liu. "A lightweight face
recognition model based on mobilefacenet for limited computation
environment.", 2021.
[18] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl, "Objects as
Points," 2019.
[19] Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional
Networks for Large-Scale Image Recognition," 2014.
