1 Design and Implementation of Real Time Object Detection System

2 Based on SSD and OpenCV


3 Fazal Wahab1†, Inam Ullah2†, Anwar Shah3, Rehan Ali Khan4, Ahyoung Choi5,*, Muhammad
4 Shahid Anwar5,*
5 1 College of Computer Science and Technology, Northeastern University, Shenyang, China
6 2 BK21 Chungbuk Information Technology Education and Research Center, Chungbuk National University, South Korea
7 3 School of Computing, National University of Computer and Emerging Sciences, Faisalabad, Pakistan
8 4 Department of Electrical Engineering, University of Science and Technology Bannu, 28100, Pakistan
9 5 Department of AI and Software, Gachon University, Seongnam-si 13120, Republic of Korea
10 * Correspondence:
11 Ahyoung Choi (email: aychoi@gachon.ac.kr) and Muhammad Shahid Anwar (email: shahidanwar786@gachon.ac.kr)

12 Co-first authors: Fazal Wahab and Inam Ullah contributed equally to this work as first authors.

13

14 Keywords: Computer Vision, Deep Learning, Image Recognition, Object Detection, Object
15 Recognition, Single Shot Detector

16 Abstract

17 Computer vision (CV) and human-computer interaction (HCI) are essential in many technological
18 fields. Researchers in computer vision are particularly interested in real-time object detection
19 techniques, which have a wide range of applications, including inspection systems. In this study, we
20 design and implement real-time object detection and recognition systems using the Single Shot
21 Detector (SSD) algorithm and deep learning techniques with pre-trained models. The system can
22 detect static and moving objects in real-time and recognize the object's class. The primary goals of
23 this research were to investigate and develop a real-time object detection system that employs deep
24 learning and neural systems for real-time object detection and recognition. In addition, we evaluated
25 the freely available pre-trained models with the SSD algorithm on various types of data sets to
26 determine which models have high accuracy and speed when detecting an object. Moreover, the
27 system is required to be operational on reasonable equipment. We tried and evaluated several deep
28 learning structures and techniques during the coding procedure and developed and proposed a highly
29 accurate and efficient object detection system. This system utilizes freely available datasets such as
30 MS Common Objects in Context (COCO), PASCAL VOC, and KITTI. We evaluated our system's
31 accuracy using various metrics such as precision and recall. The proposed system achieved a high
32 accuracy of 97% while detecting and recognizing real-time objects.
33
34 1. Introduction
35 The computer vision field is often the best solution when one considers its various application
36 areas. A significant number of these applications involve tasks that require working in a
37 dangerous environment, a large processing capacity, access to and use of massive databases, or
38 tedious routines that are impractical for individuals to complete. The settings in which computer vision
39 systems are used range from manufacturing plants to hospital surgical suites. For example, computer
40 vision is frequently used for quality control in assembly lines. In this application area, the computer vision
41 system scans manufactured items on the assembly line to identify
42 defects and issues control signals to a robotic controller to automatically remove flawed
43 parts. Systems that automatically diagnose skin tumors and assist neurosurgeons during complex procedures,
44 such as brain surgery, are more recent examples of medical systems created with computer
45 vision techniques.

46 The process of recognizing a moving or nonstationary object in a video sequence is known as
47 object detection. This is the initial and most crucial step in tracking moving objects. To gain a
48 thorough understanding of images, we not only classify them but also attempt to precisely
49 estimate the classes and locations of the objects contained in each image. Object detection [1] is the name
50 given to this task, which is divided into subtasks such as skeleton detection, face detection, and
51 pedestrian detection. Object detection is a challenging technology that is related to
52 computer vision and image processing. It deals with identifying instances of semantic objects of a
53 certain type (such as people or cars) in digital images and videos. Well-researched areas of
54 object detection include face detection and pedestrian detection. Object detection has
55 applications in many areas of computer vision, including video surveillance and image
56 retrieval.

57 Object detection is one of the most difficult problems in computer vision, and it is the first step in
58 several computer vision applications. An object detection system's goal is to identify all
59 occurrences of objects of a known category in an image, which is a particularly difficult task in
60 computer vision [2]. Broadly, object detection seeks to develop computational
61 techniques that provide one of the most basic pieces of information required by computer vision applications:
62 where are the objects located? As one of the fundamental problems of computer vision, object
63 detection serves as the foundation for many other computer vision tasks, such as instance
64 segmentation [3], image captioning, object tracking, and so on.

65 The study of "computer vision," or CV for short, aims to develop methods that allow
66 computers to "see" and understand the content of digital images such as pictures and videos.
67 Because people, particularly children, solve the problem of computer vision trivially, it
68 appears to be a simple one. It is, however, a largely unresolved problem, owing both to a limited
69 understanding of biological vision and to the complexity of visual perception in a dynamic and
70 constantly changing physical world. The significance of computer vision lies in the problems it can
71 address. It is one of the most cutting-edge technologies, enabling communication between the
72 developed and developing worlds. Computer vision allows self-driving cars to understand their
73 surroundings. Cameras at various locations around the vehicle record video and feed it to a computer
74 vision program, which processes the images in real time to identify traffic signs and road boundaries.

75 A traffic surveillance system can benefit from computer vision techniques [35]. Traffic
76 surveillance systems that identify and recognize moving objects are an important topic in computer
77 vision [4]. Evaluating the sequence of frames extracted from the video allows for a better
78 understanding of how moving objects behave. It eliminates the issues associated with traditional
79 methods that rely on human operators. Depending on the degree of manual involvement, these
80 systems are classified as fully automatic, semi-automatic, or manual. The most important and critical
81 component of computer vision applications is moving object detection [10, 39]. The importance of
82 computer vision cannot be overstated in today's world. Many of our critical organizations, including
83 security organizations, place a high priority on computer vision applications.

84 From our investigation and research, we can clearly state that several frameworks are already used
85 to detect objects. Object detection is not a brand-new innovation, nor is the broader computer
86 vision field, and it has been used to develop a number of related systems. Moreover, in computer
87 vision, data is drawn from large and small datasets and processed into
88 useful information for later use or for training purposes. Some of the modern object
89 detection techniques include CNN, R-CNN, Faster R-CNN, Fast R-CNN, and YOLO [5, 18]. These
90 are the deep learning methods that are currently used for real-time object detection [6, 7].
91 We can also use these techniques for other purposes such as health monitoring, action detection, etc.

92 The study's main contribution is the design and implementation of real-time object detection and
93 recognition systems using the SSD algorithm and deep learning techniques with a pre-trained model.
94 Our proposed system can detect static and moving objects in real time and classify them. The primary
95 goals of this study were to investigate and develop a real-time object detection system that uses deep
96 learning and neural systems to detect and recognize objects in real-time. Furthermore, we tested the
97 free, pre-trained models with the SSD algorithm on various types of datasets to determine which
98 models have high accuracy and speed when detecting an object. Besides this, the system must be
99 operable on reasonable equipment. During the coding procedure, we evaluated various deep learning
100 structures and techniques and developed and proposed a highly accurate and efficient object detection
101 system.

102 The rest of the paper is structured as follows: Section 2 reviews the related work, while Section
103 3 presents the system model. Section 4 describes the system design, and Section 5 reports the experimental results
104 and evaluation. Finally, Section 6 concludes this work.

105

106 2. Related Work


107 The significance of computer vision lies in the problems it can address. It is a cutting-edge
108 technology that enables communication between developed and developing countries. Self-
109 driving cars can understand their surroundings thanks to computer vision. Cameras capture video
110 from various points around the vehicle and feed it to a computer vision program, which then processes
111 the images in real time to find road boundaries and read traffic signs. A traffic
112 surveillance system can benefit from computer vision techniques, where the detection and recognition
113 of nonstationary or moving objects is an important topic. Analyzing the frame sequence extracted
114 from live video gives us more information about how moving objects behave. It eliminates the
115 issues associated with traditional methods that rely on human operators. Depending on the degree of
116 manual involvement, these systems are classified as fully automatic, semi-automatic, or manual. The
117 most important and critical component of computer vision applications is moving object detection
118 [10].

119 Technology has advanced rapidly in recent years. Artificial intelligence and computer vision are
120 making significant strides in modern times, thanks to the development of powerful microprocessors
121 and high-quality cameras. With the use of these technologies, computer vision-based real-time object
122 detection can detect, locate, and trace an object from an image or a video. The authors propose a
123 method for integrating this real-time system into a web page in [11]. The TensorFlow object
124 detection API is used to detect objects, and live video is streamed through a password-protected login


125 web page that can be accessed from any device. The system draws a box around the object and
126 displays the detection accuracy. Similarly, the authors of [12, 13] presented a variety of approaches
127 for computer vision, including OpenCV and SSD-MobileNet, object recognition, and so on.

128

129 Several recent deep learning approaches can be used to localize, classify, and detect objects.
130 Each of these methods detects and recognizes objects using a different mechanism. In this section
131 [14-16], we will discuss a few of them that are currently used for object detection and recognition [8,
132 9]. CNN, R-CNN, Fast R-CNN, Single Shot Detector (SSD), and Faster R-CNN are the most
133 common [17]. Because the Faster R-CNN is a member of the CNN family, we will explain it in
134 detail, along with the R-CNN and the Fast R-CNN, and then we will discuss the SSD, which will be
135 used in our proposed system [19, 20]. According to the literature, the SSD is the fastest among these
136 object detection techniques [26]. Faster R-CNN (region-based Convolutional Neural Network)
137 [22, 30] uses an alternating training scheme to learn the shared features: it
138 first initializes the weights of the RPN, extracts suitable region proposals from the
139 training dataset, and then trains the Faster R-CNN model with those proposals repeatedly until the
140 results converge [25, 32].

141

142 3. System Model


143 The primary goal of this system is to detect a static object from an image and a moving object
144 from a video and to display the object's class. The functional requirements describe what the
145 system does. The main functional requirements for our proposed system include both static object
146 recognition and moving object recognition [21]. These functional requirements are the data
147 processing module, deep learning module, static object detection module, moving object tracking
148 module [37], pre-defined objects module, and object recognition module. The proposed system takes
149 an image from the camera, matches it against the data set classes, runs the pre-
150 trained models, and finally boxes the object and displays the object instance with its accuracy level.
151 The system modules are the functional requirements. We have a total of six modules in our system.

152 Figure 1 depicts the system modules and explains the individual operations of each module. We
153 will explain each module of the system in detail, including its figure and operating procedure, before
154 combining these modules into the proposed system.


155

156 Figure 1: The functional requirements system modules.

157

158 3.1 Data Processing Module Analysis


159 In this section, we will discuss image datasets, which are used to train and benchmark various
160 object detection procedures. This section will go over the datasets that will be used in our real-time
161 object detection system. There are numerous free datasets available on the internet that can be used in
162 deep learning techniques. DNN requires a large amount of labeled data (structured data) for training
163 the model; currently, the most used datasets for object detection are ImageNet, PASCAL VOC, and
164 MS Common Objects in Context (COCO).

165 (1) Kitti Dataset

166 KITTI is a dataset captured with stereo cameras and lidar scanners in rural, urban, and highway
167 driving settings. It is divided into categories such as vans, small cars, trucks, sitting people, pedestrians,
168 cyclists, miscellaneous, trams, and "don't care" [38]. The images are 1382x512 pixels in size, and 7,500 of
169 them provide 40,000 object labels classified as easy, moderate, or hard based on how heavily the
170 objects are occluded and truncated [40].

171

172 (2) MS COCO Dataset

173 MS COCO stands for Common Objects in Context. Microsoft sponsors COCO, and its annotations
174 comprise categories, position information, and a semantic text description of each image. The
175 open-sourcing of the COCO dataset has also contributed to the advancement of object detection. Microsoft-sponsored
176 COCO is a newer image segmentation, recognition, and captioning dataset. Its open-source release
177 has driven significant advances in semantic segmentation in recent years [33, 34], and it has become a
178 "standard" dataset for benchmarking image semantic understanding, with COCO posing a unique
179 challenge [41].


180

181 (3) PASCAL VOC Dataset

182 PASCAL VOC provides a standard image annotation and evaluation framework. The PASCAL
183 VOC image dataset comprises 20 classes; the dataset features high-quality, fully annotated
184 images, which is especially useful for analyzing algorithm performance. PASCAL VOC (pattern
185 analysis, statistical modelling, and computational learning visual object classes) provides
186 standardized image datasets for object class recognition as well as a common set of tools for
187 accessing the datasets and annotations.

188

189 3.1.1 Training and Testing Data


190 The datasets are then divided into train and test subsets. During the experiment, we
191 randomly divide our dataset in an 80-20 ratio. The results section will explain the relationship
192 between the amount of training data and the detector's performance. A train-test split can be achieved in several
193 ways. Even so, prediction improves if the distribution of classes in both subsets is sufficiently
194 balanced. After dividing the dataset into two subsets, we must convert the annotated XML files to a
195 TensorFlow-compatible file format.
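As a minimal sketch of this 80-20 split (the annotations/ directory and file names are hypothetical, not from the paper):

```python
import random
from pathlib import Path

# Hypothetical location of the annotated XML files (one per image).
annotation_files = sorted(Path("annotations").glob("*.xml"))

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(annotation_files)

split = int(0.8 * len(annotation_files))  # 80-20 train/test ratio
train_files = annotation_files[:split]
test_files = annotation_files[split:]

print(f"{len(train_files)} training files, {len(test_files)} test files")
```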

196 TensorFlow performs batch operations using its native file format, TFRecord (.record).
197 Unlike other platforms, which do most of the batching procedure directly
198 from the images, TensorFlow uses a single file for the batch operation. In the TFRecord,
199 images are converted into NumPy arrays. This format for training on large datasets
200 combines the dataset and the network pipeline, and it can process large datasets that do
201 not fit into memory. It is a record-based binary format that is applied
202 for training and testing data in various TensorFlow applications. There are numerous options
203 available for data pre-processing. The train and test split can be done before converting the annotated
204 dataset to TFRecord (by splitting the XML files) or after creating the TFRecord files. Here, the
205 XML files for training and testing had to be converted into the TFRecord
206 format. The data conversion to TF Record is shown in Figure 2.

207

208

209 Figure 2: Data conversion to TF record.
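A hedged sketch of writing images into a TFRecord file; the feature keys here are illustrative stand-ins, not the exact schema used by the TensorFlow Object Detection API:

```python
import tensorflow as tf

def image_to_tf_example(image_path, label):
    # Read the raw encoded image bytes; decoding happens at training time.
    with open(image_path, "rb") as f:
        encoded_image = f.read()
    features = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded_image])),
        "image/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

# Write a small dataset into a single .record file.
with tf.io.TFRecordWriter("train.record") as writer:
    for path, label in [("img1.jpg", 0), ("img2.jpg", 1)]:  # hypothetical files
        writer.write(image_to_tf_example(path, label).SerializeToString())
```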

210

211 3.2 Object Detection Deep Learning Module


212 Deep learning, a branch of machine learning and artificial intelligence, focuses on the training of
213 computational models composed of multi-layered artificial neural networks. An ANN with multiple
214 layers is referred to as a deep neural network (DNN). Deep neural networks have more than two
215 layers of hidden neurons between the input and output layers, which corresponds to the network's
216 depth. State-of-the-art accuracy in speech recognition, object detection, image classification, language
217 translation, and other areas has been significantly enhanced by deep neural networks. Deep learning
218 strategies rely on learning representations (features) from data, such as text, images,
219 or videos, instead of implementing task-specific algorithms. Learning can be either unsupervised
220 or supervised. However, most practical systems employ supervised
221 learning to exploit the advantages of deep learning [28]. Supervised learning, fundamentally,
222 means learning from labelled data.

223

224 3.3 Pre-defined Objects Module


225 Pre-defined objects are objects that we have already defined and labelled, that is, the
226 labelled datasets. In modern computer vision, object detection is considered alongside object
227 recognition (classification), localization, tracking, and information extraction from the object. Object
228 detection is inextricably linked to these tasks. The goal of classification is to determine the object's
229 class or identify its nature. The location of the object(s) within the image or the frame is determined
230 by localization. The movement and status of the object can be followed through object tracking
231 in video or live recording. The goal of the object detection framework is to classify and locate all
232 objects displayed in an image. The detector's input is an image containing the object, and the output is a
233 list of bounding boxes. Classification, localization, and instance segmentation are shown in
234 Figure 3.

235

236 Figure 3: Classification, localization and instance segmentation.

237 3.4 Static Object Detection Module


238 The key aim of object detection is to find all instances of objects from a known class in an
239 image, such as people, cars, or faces. A static object is a stationary object that is not moving. The
240 performance of deep learning methods increases with the amount of training
241 data, unlike traditional learning techniques whose performance saturates; this characteristic makes
242 deep learning methods scalable. Because a deep neural network is made up of many layers, it
243 learns representations with varying degrees of complexity and abstraction. As previously
244 stated, the early layers learn the low-level features and then pass them on
245 to the subsequent layers. The subsequent layers then build higher-level features from the
246 previously learned lower-level features.

247 a) Age: Age is a parameter that keeps track of the number of frames in which the object has not
248 moved.
249 b) Type: Type primarily represents the object's status. The object can be new, matched, or
250 occluded.
251 c) New objects: Before processing the next frame, the new object is added to the previous frame.

252

253 3.5 Moving Object Tracking Module


254 In this section, we will discuss moving objects such as moving persons or cars. The detection of
255 the physical movement of an object in a location or region is known as moving object detection.
256 The movement of moving objects can be tracked and analyzed by separating
257 moving objects from stationary regions. Cameras detect moving objects around the
258 vehicle when it is stopped or slowly maneuvering; in our framework, the camera will identify and
259 recognize the moving object, such as a person on the road or a car on the street. The framework then
260 warns the driver visually and audibly, as in many smart systems. There are two such systems: one
261 makes use of the Around View Monitor and four cameras mounted on the front, back, and
262 sides of the vehicle, while the other makes use of a single camera mounted on the rear. The four-
263 camera system can alert drivers in three different situations: stopping or shifting into
264 neutral, moving forward, and backing up. The front and rear cameras recognize moving objects
265 independently as the vehicle moves forward or reverses.

266 The framework uses a simulated bird's-eye view image to identify moving objects around the car
267 when it is in park or neutral. A single rear-view camera on a vehicle allows it to detect
268 moving objects behind it [23, 24, 27]. With the help of the cameras, the framework creates video
269 imagery that it uses to locate moving objects. The framework that uses the Around View Monitor has
270 been designed to analyze video signals from the four cameras that are attached to the front, rear, and
271 both side-view mirrors of the car. Then, it can instantly distinguish moving objects around the
272 vehicle. Depending on the transmission state, it can decide which of the three options
273 applies: moving forward, stopping, or backing up. The flow chart of the deep learning module is
274 shown in Figure 4.


275

276 Figure 4: The flow chart of the deep learning module.

277

278 3.5.1 Background Subtraction


279 Background subtraction is used to separate a foreground object from its surroundings. The
280 basic strategy of this procedure is to create a background model that represents the scene.
281 The background model functions as a kind of reference and should therefore be
282 regularly updated and contain no moving objects. Each frame is then compared with the
283 background model so that changes within the picture can be detected. By
284 comparing each video frame against the background model, it is possible to see moving
285 objects as deviations from the reference model.

286 The algorithms used for background subtraction are simple and straightforward to use, but the
287 technique is also very sensitive to changes in the environment. Background subtraction techniques can
288 be divided into two groups: recursive techniques and non-recursive techniques. Recursive
289 techniques base the background model on each video frame by recursively updating the
290 background model. The consequence of this is that the model can be influenced by input frames
291 processed in the distant past. Compared with non-recursive techniques, this approach requires less
292 memory storage, but possible errors in the background model can linger for a longer period of time. Non-
293 recursive techniques store a buffer with the most recent video frames.
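As a minimal OpenCV sketch of this idea, using the library's built-in MOG2 subtractor (a recursive mixture-of-Gaussians method); the video file name is hypothetical:

```python
import cv2

cap = cv2.VideoCapture("traffic.mp4")  # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Each frame updates the background model and yields a foreground mask:
    # pixels deviating from the background model are marked as foreground.
    fg_mask = subtractor.apply(frame)
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:  # stop on Esc
        break

cap.release()
cv2.destroyAllWindows()
```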

294


295 3.5.2 Feature Extractor


296 A crucial element of the object detection model, used to extract object features from data, is the
297 feature extractor. Figure 5 depicts the meta-architecture, feature extractor, and classifier that
298 make up the object detection structure. As shown in Figure 5, the input picture is passed
299 through the feature extractor, which extracts features from the image. The classifier then determines
300 the class and the location of the object within the input image using the features extracted from
301 the image.

302

303

304 Figure 5: Feature extractor and classifier in the meta-architecture.

305 The feature extractor's deep architecture can be used to improve accuracy while reducing
306 computational complexity. In object detection meta-structures, popular feature extractors such as
307 AlexNet, Inception, MobileNet, NAS, ResNet, and VGG can be used. We will use a pre-trained
308 model for feature extraction because MobileNet is more compatible with SSD.
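An illustrative sketch of using a pre-trained MobileNet as a feature extractor in Keras by dropping its classification head; this is a stand-in for the pipeline described above, not the exact code of our system:

```python
import numpy as np
import tensorflow as tf

# MobileNet pre-trained on ImageNet, without the fully connected top,
# so its output is a spatial feature map rather than class scores.
extractor = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in for a real frame
features = extractor(image)
print(features.shape)  # (1, 7, 7, 1024): the feature map fed to the detection head
```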

309 3.5.3 Image Scaling


310 Through pixel replication or interpolation, the image is zoomed. Scaling can be used as a low-
311 level preprocessor in a multi-stage image processing chain that operates on scaled features, to alter
312 the visual appearance of an image, to alter the amount of information stored in a scene representation,
313 or for both purposes. Scaling is the process of compressing or expanding an image along its coordinate
314 directions, and there are various methods for zooming and subsampling.
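A small OpenCV sketch of both operations (the file name is hypothetical); INTER_AREA generally suits shrinking, while INTER_LINEAR or INTER_CUBIC suit zooming:

```python
import cv2

image = cv2.imread("frame.jpg")  # hypothetical input image

# Downscale to the fixed input size expected by the detector (e.g. 300x300 for SSD).
small = cv2.resize(image, (300, 300), interpolation=cv2.INTER_AREA)

# Upscale by a factor of 2 along each coordinate direction.
big = cv2.resize(image, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)

print(small.shape, big.shape)
```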

315 3.6 Object Recognition Module


316 Object recognition entails recognizing and identifying the object class. In reality, we have
317 various object classes such as humans, cars, cups, bottles, and so on. The task at hand is to locate a
318 specific object in an image or even a video sequence. Unlike humans, object recognition
319 engines still struggle to distinguish and recognize the wide range of objects in images or
320 recordings that may differ in perspective, color, or size, or that are partially occluded, so this
321 remains a serious vision problem. The task of identifying an object in an image is posed as a
322 labelling problem based on well-known object models. In essence, given a generic image containing
323 the objects of interest and a set of names corresponding to a set of models accessible within the
324 framework, the framework should be able to appropriately assign the names to each region within
325 the image. Figure 6 shows how this difficult task of identifying an object in an image is defined as
326 a labelling problem based on recognized object models: given a nonspecific image containing the
327 objects of interest and a set of names, the framework should be able to properly assign the names
328 to the specific regions within the picture by comparison with the set of models accessible within the framework.


329

330 Figure 6: Object recognition steps.

331

332 3.7 Models Analysis


333 In this section, we will go over pre-trained and fine-tuned models. We will explain which
334 models we will use in our project to create a successful real-time object detection system based on
335 SSD.

336 3.7.1 Pre-Trained Model


337 A model that has been previously trained is referred to as pre-trained. A pre-trained model can
338 be used as a starting point or for out-of-the-box inference instead of creating and training a model
339 from scratch. Even though a pre-trained model is time-efficient, it is not always 100% accurate.
340 Table 1 below shows some common pre-trained models. There are numerous pre-trained models available for
341 free in the TensorFlow model zoo. We will use the pre-trained SSD_MobileNet_v1_coco
342 model, the SSD_MobileNet_v2 model, and the VGG16 model, which we will set up in Python and
343 use with the SSD algorithm. We will examine which pre-trained models have high accuracy
344 with the SSD and draw conclusions.

345

346 Table 1: Common pre-trained Models.

Name                         Speed (ms)   COCO mAP   Outputs
SSD_MobileNet_v1_coco        30           21         Boxes
Faster_RCNN_Inception_v2     58           28         Boxes
Faster_RCNN_resnet_v2        89           30         Boxes
Faster_RCNN_resnet_v1        106          32         Boxes
SSD_mobilenet_v2_coco        31           22         Boxes
SSD_inception_v2_coco        42           24         Boxes

347

348 4. System Design


349 In this section, we go over the entire design process of our proposed system. We design various modules such
350 as the data processing module, the deep learning module, and the object detection module. We
351 explain each module with the help of figures that depict the design process.

352 4.1 System Architecture

353 A system architecture is a conceptual representation of a framework's structure, behavior, and
354 other aspects. An architecture description is a formal depiction and representation of a system,
355 organized in a way that supports reasoning about the framework's structures and behaviors. The
356 system architecture is essentially the overall design of the system, describing how the system will
357 function. In our case, the main goal of this work is to demonstrate that, given an image
358 (a live image from video) as input, the system must be capable of detecting the object and
359 recognizing the object's class. To accomplish this, we must first train the SSD with many input
360 images. These images are taken from the data sets and processed in line with the prerequisites for
361 SSD input mentioned earlier. After the training phase is finished, a second phase begins in which,
362 given an input image to the pre-trained model, the system must output the original image along
363 with precise bounding boxes and a pertinent description around the object. The
364 objective is to have an interactive testing layer during the live phase to test system-wide metrics like
365 mean average precision. The proposed system architecture and the deep learning methods applied for
366 the real-time object detection system are shown in Figure 7.

367 The single shot detector algorithm that we will use in our proposed system is depicted in
368 Figure 7, along with the training and testing data sets required for the
369 development of our system. We also display evaluation
370 metrics like recall, precision, and accuracy. When the system detects an object image, it goes through
371 several steps: taking an image with a web camera, extracting the features, classifying the
372 class of the object, testing against the data set, running the detection algorithm, and finally displaying
373 the output, such as an image with a bounding box. Figure 7 depicts how the system will
374 capture and detect the object.


385 Figure 7: The system architecture.

386 4.2 Design of Data Processing Module


387 This section goes over the data processing procedure. As shown in Figure 8, we must have our
388 dataset, because we divide the data set into training and testing data. When the system starts, it
389 checks for the available dataset; if the system finds the data set, it proceeds to the next step, such as
390 preparing training and test data. If the system does not find the data set, it will look again; otherwise, an error
391 message will be displayed. As shown in the activity diagram, once the data set is established, the
392 system will take a large portion of the data as training data, such as 80%, and the remainder as testing data,
393 such as 20%. The system will then proceed to the pre-trained model and detect the object, followed
394 by a final evaluation.

395

396 Figure 8: Activity diagram of the data processing module.

397

398

399 4.3 Class Diagram of the Proposed System


400 Class diagrams depict the connections and source code dependencies that exist between classes. A
401 class describes the methods and attributes of an object, which in this case is a specific component of
402 a program or the unit of code that corresponds to that entity. A class diagram is a static
403 representation of an application. A class diagram depicts the different types of objects in the
404 framework as well as the different types of connections between them. A class is used to represent at least one
405 object in object-oriented programming. While each object is built from a single class, a single
406 class can be used to instantiate multiple objects. The class diagram represents the
407 static view of a framework or application. The class diagram of our system includes the base network,
408 which will be used for feature extraction, and the SSD, which will be used to localize and recognize
409 the object class. Image scaling, data set usage, and object attributes are all considered here. Our
410 system's main task is to train the models on the given dataset so that they can successfully detect and
411 recognize the object in a video sequence.

412 4.4 System Implementation

413 In the data processing module, we discuss how the data will be processed and how we will
414 practically implement it using Python code. We discuss the practical implementation of the
415 data processing module in this section, in which we must consider the train and test data, as
416 well as the evaluation, to determine the accuracy. The pre-trained model and SSD will be trained on
417 the training data first. When the system boots up, it loads the training data first, followed by the trained model,
418 and then the test data is passed to the trained model for further evaluation to measure accuracy. Figure
419 9 depicts the implementation procedure, demonstrating how it works.

420

421

422 Figure 9: System implementation.

423 We convert the raw detection dataset to TFRecord for object detection. Converting the detection
424 dataset to TFRecords with a standard format allows the dataset to be used to train object detectors. The raw
425 dataset can be downloaded from the internet. The detection dataset contains 1,481 training images. Using
426 this code with the default settings sets aside the first 500 images as a validation set. This can be
427 altered using the flags.

428 4.4.1 Implementation of Deep Learning Module


429 In this module, we discuss how to implement the deep learning module and how it
430 works in our implementation. In the deep learning module, we have the basic proposed algorithm, which is
431 SSD, and the pre-trained models, for example the SSD_MobileNet_v1_coco model. In the deep
432 learning module implementation, whenever the webcam opens, it takes an image of the object,
433 and this image is matched against the training and testing dataset. In the next phase, the pre-trained model
434 is activated and prepared, and the images and model are passed to the deep learning
435 technique, which is SSD in our case. In this way, the object is detected and recognized. The deep
436 learning module for object detection basically includes several steps: the data set
437 processing, the training of the models, and the prediction. Prediction is for the recognition of the
438 objects. The training phase is coded in this module, which is the model training on the available
439 dataset. Figure 4 represents the flow chart of the deep learning module. The system starts
440 from the input image captured with the web camera, and then the further steps take place.

441 The deep learning module of our proposed system includes the SSD base detection algorithm
442 as well as the SSD_MobileNet_v1_coco pre-trained model. In this section, we write code for the
443 single shot detector as well as MobileNet, which serves as the base network and is used to
444 extract features.
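A hedged sketch of this step using OpenCV's DNN module to run a frozen SSD_MobileNet_v1_coco graph; the two file names follow the TensorFlow model zoo export convention and are assumed to be downloaded locally:

```python
import cv2

# Frozen inference graph and its OpenCV config, assumed exported from the
# TensorFlow model zoo for ssd_mobilenet_v1_coco.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v1_coco.pbtxt")

image = cv2.imread("test.jpg")  # hypothetical test image
h, w = image.shape[:2]

# SSD MobileNet expects a 300x300 input blob; swapRB converts BGR to RGB.
blob = cv2.dnn.blobFromImage(image, size=(300, 300), swapRB=True)
net.setInput(blob)
detections = net.forward()  # shape (1, 1, N, 7)

for det in detections[0, 0]:
    _, class_id, score, x1, y1, x2, y2 = det
    if score > 0.5:  # confidence threshold
        box = (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))
        print(f"class {int(class_id)} score {score:.2f} box {box}")
```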

445 4.4.2 Implementation of Static Object Detection Module


446 We discuss the static object detection procedure and its implementation in this module.
447 Object detection, as mentioned in the design section, is the process of object recognition
448 (classification), localization (bounding boxes), tracking, and extracting object information. These
449 processes are inextricably linked to object detection. The main goal of the classification stage is to
450 determine the object's class or recognize what the object is. The class of the object is identified here.
451 Localization is the process of defining the location of an object or objects in an image, or localizing
452 an object within the frame. In this module, the image is taken as input from the web camera and
453 converted to greyscale. Later, a cascade classifier is applied to the image to find the object. If it is
454 found successfully, the next phase starts; otherwise, it does not proceed. Whenever the
455 object is detected, it is displayed with a bounding box.
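A minimal sketch of the greyscale-plus-cascade step, using one of OpenCV's bundled Haar cascades purely to illustrate the mechanism (our actual detector is the SSD described elsewhere):

```python
import cv2

# OpenCV ships several pre-trained Haar cascades; the face cascade is used
# here only to illustrate the cascade mechanism.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)  # web camera
ok, frame = cap.read()
if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # convert to greyscale
    objects = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in objects:  # draw a bounding box per detection
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("detected.jpg", frame)
cap.release()
```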

456 4.4.3 Implementation of Pre-Defined Objects Module


457 In this module, we will implement the pre-defined object module that we designed and discussed
458 in the system design section's pre-defined object detection section. Pre-defined objects are essentially
459 datasets, as we have already defined the objects (labelled data) and trained the model to select objects
460 from the pre-defined dataset. If the input data is valid, the pre-trained model will be invoked, the
461 image will be compared with the pre-defined object images, and the closest matching object
462 images with their names will be displayed. The pre-defined object detection module is responsible for
463 this.

464 4.4.4 Implementation of Moving Object Tracking Module


465 Object tracking in video or live recording is the process of determining the movement and status
466 of an object. The purpose of an object detection system is to categorize and locate (localize) all
467 objects in an image. In our system, the basic input for the detector is an image comprising the object,
468 and the output is a list of rectangular bounding boxes. This procedure involves background
469 subtraction techniques. It also includes the processes of localization and classification, just like static
470 object detection. The moving object tracking module is mostly defined for vehicle
471 detection on roads. This type of module is used for traffic purposes. It detects a moving object,
472 such as a person on the street or on the road, as well as the vehicles that are moving on the road. The
473 system takes an image of the moving object via the web camera and applies the background
474 subtraction techniques. Similarly, the image from the live stream is detected and the system
475 follows the next step.

476 The system then proceeds to apply the SSD algorithm to the trained dataset as
477 well as the pre-trained models in the following phase, image processing. The next phase is the object
478 recognition phase, in which the object is recognized and the results are displayed. Here we use
479 the same code as in the static object detection case. By using the two terms static object detection
480 and moving object detection, we mean that our system is capable of detecting objects in both scenarios.
481 Some systems do not detect moving objects, but our system can track a moving object,
482 such as a person walking on the street or a vehicle moving on the road, and the system can detect
483 them.

484 Detecting objects in a moving scene is the first step in video analysis. An object detection
485 mechanism is used when a moving object appears in a video scene. The object detection method
486 relies on information in a single frame. Motion detection is a significant and difficult task when
487 compared to static object detection [42, 43]. When an object moves, the system takes an image of it
488 and then applies the object detection algorithm, which is the single shot detector algorithm. During
489 the tracking phase, the system also uses the feature extraction technique, which in our
490 case is MobileNet, providing the high-level features for the detection algorithm.
491 In this way, the base network MobileNet and the detection network SSD are
492 combined to detect and track the moving object.
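A sketch combining the two networks as described, gating the SSD detector with the background-subtraction motion mask; the model files and the motion threshold are assumptions for illustration:

```python
import cv2

net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v1_coco.pbtxt")
subtractor = cv2.createBackgroundSubtractorMOG2()
cap = cv2.VideoCapture(0)  # live web camera stream

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Run the detector only when the background model reports enough motion.
    motion = cv2.countNonZero(subtractor.apply(frame))
    if motion > 500:  # hypothetical motion threshold in foreground pixels
        h, w = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(frame, size=(300, 300), swapRB=True)
        net.setInput(blob)
        for det in net.forward()[0, 0]:
            if det[2] > 0.5:  # confidence score
                x1, y1, x2, y2 = [int(v) for v in det[3:7] * [w, h, w, h]]
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # stop on Esc
        break
cap.release()
```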

493 4.4.5 Implementation of Object Recognition Module


494 In this section, we will implement our system's object recognition module. Essentially, object
495 recognition entails identifying the object class and displaying the object with its class name and
496 bounding boxes. Object recognition consists of several steps, including image input, detection,
497 classification, recognition, and finally the result.

498 In reality, we have various object classes such as human, car, cup, bottle, and so on. The task at
499 hand is to locate a specific object in an image or even a video sequence. It is a fundamental vision
500 issue because object recognition engines continue to struggle, in contrast to humans, who can detect
501 and identify a wide variety of objects in images or videos that differ in viewpoint, colour, size, or
502 even when the object is partially obstructed [29]. In the object recognition module's implementation
503 phase, static object recognition is mostly used for cars that have been parked in a non-parking
504 area, that is, an area that is not designated for parking.

505 Detection and recognition are two distinct concepts. The detection process is the first step
506 toward recognition. Detection is the process of locating semantic objects in a video scene, such as
507 humans, cars, and animals. The object detection algorithm is primarily used to extract features that
508 can be used to identify the object class. Detection merely establishes that something is in front of the web
509 camera. The purpose of recognition is to recognise and identify the class of the detected object,
510 that is, whether the detected object is a human, a car, or something else.


511 4.4.6 Critical Algorithm and Pseudo Code


512 As previously stated, the critical algorithm in our case is SSD. In this section, we briefly
513 explain the critical algorithm and how we can code and implement it to build a real-time object
514 detection system. SSD handles objects of various sizes by feeding feature maps from
515 various convolutional layers to the classifier. This meta-architecture is faster than others, but it loses some
516 detection precision because it completes everything in a single pass.

517 To meet our requirements, we train the SSD algorithm alongside pre-trained models. On
518 our local laptop, we implement the coding process, and we successfully run the code with
519 the required results. The SSD layer is built around a feed-forward CNN, which generates a fixed-size
520 collection of bounding boxes and scores for the object class instances within those boxes.
521 The input image is passed through several convolutional layers before being examined by the SSD
522 layer. The SSD design is built on the venerable VGG-16 architecture but without the fully connected
523 layers. The VGG-16 network was chosen as the base network due to its excellent performance in
524 tasks requiring the classification of high-quality images and its track record in problems where
525 transfer learning can aid research advancement.
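As a hypothetical Keras sketch of this idea, small convolutional heads are attached to feature maps at several scales, each predicting per-anchor class scores and box offsets; the toy base network below stands in for VGG-16 or MobileNet and is not our exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def ssd_head(feature_map, num_anchors, num_classes):
    # One 3x3 conv predicts class scores, another predicts box offsets,
    # for every anchor at every spatial location of this feature map.
    cls = layers.Conv2D(num_anchors * num_classes, 3, padding="same")(feature_map)
    loc = layers.Conv2D(num_anchors * 4, 3, padding="same")(feature_map)
    cls = layers.Reshape((-1, num_classes))(cls)
    loc = layers.Reshape((-1, 4))(loc)
    return cls, loc

# Toy base network; SSD attaches heads to feature maps of several
# resolutions to handle objects of various sizes in a single pass.
inp = layers.Input((300, 300, 3))
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inp)
f1 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
f2 = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(f1)

heads = [ssd_head(f, num_anchors=4, num_classes=21) for f in (f1, f2)]
cls = layers.Concatenate(axis=1)([h[0] for h in heads])
loc = layers.Concatenate(axis=1)([h[1] for h in heads])
model = tf.keras.Model(inp, [cls, loc])
model.summary()
```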

526

527 5. Experimental Results and Evaluation


528 The evaluation metrics are precision, recall, accuracy, mean average precision (mAP), and mean average recall (mAR), defined as follows:

Precision: $P = \frac{TP}{TP + FP}$

Recall: $R = \frac{TP}{TP + FN}$

Accuracy: $AC = \frac{TP}{TP + FN + FP}$

Mean Average Precision: $mAP = \text{Average Precision}\left(\frac{TP}{TP + FP}\right)$

Mean Average Recall: $mAR = \text{Average Recall}\left(\frac{TP}{TP + FN}\right)$

529

530 Precision refers to exactness or quality, whereas recall refers to completeness or quantity. High
531 precision means that an algorithm returned substantially more relevant results than irrelevant ones,
532 whereas high recall means that an algorithm returned most of the relevant results, as shown in
533 Figure 10.
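A small sketch computing these metrics from raw counts, mirroring the formulas above (the counts are hypothetical):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and accuracy as defined above (no true negatives,
    since 'background everywhere else' is not counted in detection)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = tp / (tp + fn + fp)
    return precision, recall, accuracy

# Hypothetical counts from one evaluation run.
p, r, acc = detection_metrics(tp=97, fp=2, fn=1)
print(f"precision={p:.2f} recall={r:.2f} accuracy={acc:.2f}")
```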


534 [Bar chart: accuracy of ssd_mobilenet_v1-coco, ssd_mobilenet_v2-coco, and vgg16_ssd on the MS COCO data set.]

535 Figure 10: Models' accuracy levels on the MS COCO data set.

536 Figure 10 clearly shows that the pre-trained model SSD_MobileNet_v1_coco outperforms the
537 other two models on the MS COCO data set. The accuracy level changes as we change the
538 data set size, because the size of the data set can affect the prediction algorithm's accuracy
539 level. The accuracy levels on the KITTI data set and the PASCAL VOC data set are represented by the next
540 two graphs. The PASCAL VOC data set is the largest dataset in our system. The most important thing
541 we noticed during the evaluation phase is that the accuracy increases over the epochs from start to
542 finish, across the defined 24 epochs, as shown in Figures 11 and 12.

543 [Bar chart: accuracy of ssd_mobilenet_v1-coco, ssd_mobilenet_v2-coco, and vgg16_ssd on the KITTI data set.]

544 Figure 11: Models' accuracy levels on the KITTI data set.

545 The accuracy levels of the pre-trained models on the KITTI data set are depicted in Figure 11. To
546 determine which model has the highest accuracy on the KITTI data set, all three models were trained on it.
547 On the KITTI dataset, SSD_MobileNet_v1 has the highest accuracy.


548 [Bar chart: accuracy of ssd_mobilenet_v1-coco, ssd_mobilenet_v2-coco, and vgg16_ssd on the PASCAL VOC data set.]

549 Figure 12: Models' accuracy levels on the PASCAL VOC data set.

550 The accuracy levels of the pre-trained models on the PASCAL VOC data set are also represented
551 in Figure 12. All three models were trained on the PASCAL VOC dataset to determine which models
552 perform well on this data set. On this dataset, SSD_MobileNet_v1 again achieves the highest
553 accuracy. The PASCAL VOC dataset is a very large data set that contains thousands of
554 images, but our laptop cannot support such a large dataset, so we must reduce the size of
555 the data set. In some cases, we consider at least 1,000 images.

556 [Bar chart: amount of prediction error for ssd_mobilenet_v1-coco, ssd_mobilenet_v2-coco, and vgg16_ssd.]

557 Figure 13: Models' prediction error.

558 The precision-recall curves of the various classes are depicted in Figure 15. As previously stated, we
559 have various object classes such as humans, cars, cups, doors, and so on. Labelling techniques have
560 been used to label these classes and objects with their names. The figure depicts the precision-
561 recall curve as well as the average precision of each class. The average precision of the bike, clock,
562 door, drawer, and socket classes is much lower than that of other object classes. The trained detector
563 identifies these as the five most difficult classes in the dataset. The model overlearned the specific time shown
564 on the clock (location of the hour and minute hands) rather than the clock's structure, which explains
565 the lower accuracy for the clock class. There are various types of handles and backgrounds for doors and
566 drawers. The wall's background color makes it difficult to detect the switch. We compared the
567 performance of the models on the MS COCO data set in terms of precision, recall, and accuracy. On
568 the COCO data set, SSD_MobileNet_v1 has the highest accuracy.

569 [Grouped bar chart: precision, recall, and accuracy of ssd_mobilenet_v1-coco, ssd_mobilenet_v2-coco, and vgg16_ssd on the MS COCO dataset.]

570 Figure 14: Models' evaluation metrics on the MS COCO dataset.

571
572 As illustrated in Figure 16, the number of images in the training data has a significant impact
573 detector performance. The detector's precision and recall on unseen evaluation datasets improve as
574 the size of the training samples grows. Even though the increase in accuracy is not smooth, we can
575 generalise that detectors trained on more samples have higher accuracies than detectors trained on
576 fewer samples. Because the dataset's object density is unbalanced, the random split produces
577 unbalanced subsets. The root cause of accuracy fluctuations is an unbalanced distribution of objects
578 on the training dataset.

579 5.1 System Testing

580 Systems, particularly those involving the internet, must be tested before being made public to
581 address issues. Different problems, such as software or hardware issues, as well as concerns such as
582 system security, basic functionality, user accessibility, and adaptability to a wide range of
583 desktops, devices, and operating systems, should be tested and evaluated. The following are the main
584 goals and objectives of the testing phase:

585 a) The system should be tested to ensure that it meets the requirements for design and
586 development.
587 b) The system will be tested to ensure that it responds correctly to all types of inputs.
588 c) Perform its functions accurately and within a reasonable time frame.


589 d) Installed and used in the intended environments.


590 e) Obtain the overall desired and pre-defined results.

591 5.1.1 System Functionality Testing

592 System functionality testing entails testing the system's primary functions. This section looks at
593 the overall framework's utility, from the smallest module to the entire framework. We
594 examine whether the modules work as expected and whether they can do the job for which they were
595 designed. The system's capabilities are built on various modules. In any case, all modules collaborate
596 to carry out the object detection task. Figure 15 depicts the functionality of our system, which is
597 fully functional and operational. The main goal of system functionality testing is to see the system's
598 functions in action and to observe the results of the system's functionality. The system
599 functionality testing yielded the desired results.

600
601 Figure 15: System functionality Testing.

602

603 The figure above depicts the overall functionality testing of the system. It clearly demonstrates
604 that our system is fully operational and functional. The percentage represents the detection and
605 recognition accuracy level.

606 5.1.2 Module Testing

607 Modules are distinct parts of a framework that can eventually be coordinated to create a fully
608 functional framework. We have a deep learning module, a static object detection module, a moving
609 object tracking module, a pre-defined objects module, and an object recognition module in our
610 system [31, 36]. Following implementation, we thoroughly tested and ran these modules to ensure
611 that no bugs remained. These modules function properly. If any of the modules contained
612 bugs or programming errors, the system would not function properly and would not be operational.

613 5.1.3 System Performance and Accuracy Testing on Live Detection

614 As previously stated, our system performs well in live application testing. One of the
615 requirements was that the system be operational on a local PC. We ran the system through its paces
616 on our laptop. The system successfully detected various objects and can identify the class of the
617 detected objects. The bounding boxes are used to separate the objects. The boxes can be used to
618 locate an object in a video sequence. Figure 16 is a demonstration of live testing. Because the object
619 is not moving in this type of static object detection module testing, the system successfully detected
620 the objects and recognized the detected object classes [44-46].

621

622
623 Figure 16: Live objects detection and recognition and accuracy testing.

624 Figure 16 depicts system functionality on a live stream. Each detected object has an accuracy
625 associated with it. The bounding boxes are used to separate the objects. The moving object tracking
626 functionality is depicted in the figure below.

627

628

629 Figure 17: Moving object tracking.

630 The object in Figure 17 is not static, but rather continues to move; this is an example of a
631 moving object tracking module. The system detected the moving object and identified the detected
632 moving object class. The class name is displayed in a green color box, along with the accuracy level.

633

634 6. Conclusion

635 The primary goals of this research were to investigate deep learning and its various
636 techniques and structures, followed by the development of a real-time object detection system
637 that uses deep learning and neural systems for object detection and recognition. Similarly, the
638 system had to be kept running on reasonable equipment. Several deep learning structures were
639 tried and evaluated during the coding procedure. The main contribution of this paper is to test the
640 pre-trained models with SSD on various types of datasets to determine which model is more
641 accurate in detecting and recognizing the object, as well as which model performs best on which
642 data set. As a result, on the MS COCO dataset, we concluded that the pre-trained model
643 SSD_MobileNet_v1_coco outperformed the others.


644 We achieved good results and found that we had successfully designed and developed a real-time
645 object detection system. During the system testing phase, we also tested the various
646 modules of our proposed system and the detection accuracy results. We also assessed the system's
647 functionality. Graphs have been used to represent the evaluation results. We also tested the
648 datasets with pre-trained models to see which models have high accuracy under different
649 conditions.

650 This work can also be extended to detect the action of the objects, such as detecting what the
651 object (person) is doing and whether the person is using a mobile phone or a laptop. In other
652 words, the system should act intelligently in order to detect the action of a specific object. If the
653 person is driving, the system should detect and display this information. It will also be very
654 interesting to expand this system to track the movement of vehicles on the road. The velocity of
655 the moving object will be calculated for this purpose using some type of programming, and the
656 output will be displayed on the screen. On the other hand, the CCTV camera should be
657 programmed so that it can use this system to calculate the motion (velocity) of moving vehicles
658 on the road.

659 7 Conflict of Interest


660 The authors declare no conflict of interest.

661 8 Author Contributions


662 Conceptualization, F.W.; methodology, F.W. and I.U.; software, F.W. and A.S.; validation, F.W.,
663 I.U., and R.A.K.; formal analysis, F.W., I.U., and A.C.; investigation, F.W.; resources, I.U. and
664 M.S.A.; data curation, F.W., I.U., R.A.K., A.C., and M.S.A.; writing—original draft preparation,
665 F.W. and I.U.; writing—review and editing, I.U. and A.S.; visualization, I.U. and R.A.K.;
666 supervision, I.U. and A.C.; project administration, I.U., A.C., and M.S.A.; funding acquisition, A.C.
667 All authors have read and agreed to the published version of the manuscript.

9 Funding

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2021R1F1A1062181).

10 References

[1] Murugan, V., Vijaykumar, V.R. and Nidhila, A., 2019, April. A deep learning RCNN approach for vehicle recognition in traffic surveillance system. In 2019 International Conference on Communication and Signal Processing (ICCSP) (pp. 0157-0160). IEEE.
[2] Zhao, Z.Q., Zheng, P., Xu, S.T. and Wu, X., 2019. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), pp.3212-3232.
[3] Mansoor, A., Porras, A.R. and Linguraru, M.G., 2019, April. Region proposal networks with contextual selective attention for real-time organ detection. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (pp. 1193-1196). IEEE.
[4] Biswas, D., Su, H., Wang, C. and Stevanovic, A., 2019. Speed estimation of multiple moving objects from a moving UAV platform. ISPRS International Journal of Geo-Information, 8(6), p.259.
[5] Wei, H. and Kehtarnavaz, N., 2019. Semi-supervised faster RCNN-based person detection and load classification for far field video surveillance. Machine Learning and Knowledge Extraction, 1(3), p.44.
[6] Chandan, G., Jain, A. and Jain, H., 2018, July. Real time object detection and tracking using Deep Learning and OpenCV. In 2018 International Conference on Inventive Research in Computing Applications (ICIRCA) (pp. 1305-1308). IEEE.
[7] Asadi, K., Ramshankar, H., Pullagurla, H., Bhandare, A., Shanbhag, S., Mehta, P., Kundu, S., Han, K., Lobaton, E. and Wu, T., 2018. Building an integrated mobile robotic system for real-time applications in construction. arXiv preprint arXiv:1803.01745.
[8] Mao, H., Yao, S., Tang, T., Li, B., Yao, J. and Wang, Y., 2016. Towards real-time object detection on embedded systems. IEEE Transactions on Emerging Topics in Computing, 6(3), pp.417-431.
[9] Shin, D.K., Ahmed, M.U. and Rhee, P.K., 2018. Incremental deep learning for robust object detection in unknown cluttered environments. IEEE Access, 6, pp.61748-61760.
[10] Runz, M., Buffier, M. and Agapito, L., 2018, October. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (pp. 10-20). IEEE.
[11] Singh, G., Yadav, A., Bhardwaj, I. and Chauhan, U., 2021, December. Web-page interfaced real-time object detection using TensorFlow. In 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N) (pp. 1439-1441). IEEE.
[12] Bian, X., Chen, Y., Wang, S., Cheng, F. and Cao, H., 2021, March. Medical waste classification system based on OpenCV and SSD-MobileNet for 5G. In 2021 IEEE Wireless Communications and Networking Conference Workshops (WCNCW) (pp. 1-6). IEEE.
[13] Martinez-Alpiste, I., Golcarenarenji, G., Wang, Q. and Alcaraz-Calero, J.M., 2022. Smartphone-based real-time object recognition architecture for portable and constrained systems. Journal of Real-Time Image Processing, 19(1), pp.103-115.
[14] Tufail, A.B., Ullah, I., Khan, W.U., Asif, M., Ahmad, I., Ma, Y.K., Khan, R. and Ali, M., 2021. Diagnosis of diabetic retinopathy through retinal fundus images and 3D convolutional neural networks with limited number of samples. Wireless Communications and Mobile Computing, 2021.
[15] Khan, R., Yang, Q., Ullah, I., Rehman, A.U., Tufail, A.B., Noor, A., Rehman, A. and Cengiz, K., 2022. 3D convolutional neural networks based automatic modulation classification in the presence of channel noise. IET Communications, 16(5), pp.497-509.
[16] Tufail, A.B., Ullah, I., Khan, R., Ali, L., Yousaf, A., Rehman, A.U., Alhakami, W., Hamam, H., Cheikhrouhou, O. and Ma, Y.K., 2021. Recognition of Ziziphus lotus through aerial imaging and deep transfer learning approach. Mobile Information Systems, 2021.
[17] Hung, J. and Carpenter, A., 2017. Applying faster R-CNN for object detection on malaria images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 56-61).
[18] Du, J., 2018, April. Understanding of object detection based on CNN family and YOLO. In Journal of Physics: Conference Series (Vol. 1004, No. 1, p. 012029). IOP Publishing.
[19] Manana, M., Tu, C. and Owolawi, P.A., 2018, December. Preprocessed faster RCNN for vehicle detection. In 2018 International Conference on Intelligent and Innovative Computing Applications (ICONIC) (pp. 1-4). IEEE.
[20] Nalla, B.T., Sharma, T., Verma, N.K. and Sahoo, S.R., 2018, July. Image dehazing for object recognition using faster RCNN. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-7). IEEE.
[21] Yang, B., Luo, W. and Urtasun, R., 2018. PIXOR: Real-time 3D object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7652-7660).
[22] Ren, Y., Zhu, C. and Xiao, S., 2018. Object detection based on fast/faster RCNN employing fully convolutional architectures. Mathematical Problems in Engineering, 2018.
[23] Manana, M., Tu, C. and Owolawi, P.A., 2017, December. A survey on vehicle detection based on convolution neural networks. In 2017 3rd IEEE International Conference on Computer and Communications (ICCC) (pp. 1751-1755). IEEE.
[24] Zhiqiang, W. and Jun, L., 2017, July. A review of object detection based on convolutional neural network. In 2017 36th Chinese Control Conference (CCC) (pp. 11104-11109). IEEE.
[25] Xu, Y., Yu, G., Wang, Y., Wu, X. and Ma, Y., 2017. Car detection from low-altitude UAV imagery with the faster R-CNN. Journal of Advanced Transportation, 2017.
[26] Zhou, X., Gong, W., Fu, W. and Du, F., 2017, May. Application of deep learning in object detection. In 2017 IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS) (pp. 631-634). IEEE.
[27] Shi, K., Bao, H. and Ma, N., 2017, December. Forward vehicle detection based on incremental learning and fast R-CNN. In 2017 13th International Conference on Computational Intelligence and Security (CIS) (pp. 73-76). IEEE.
[28] Saqib, M., Khan, S.D., Sharma, N. and Blumenstein, M., 2017, August. A study on detecting drones using deep convolutional neural networks. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-5). IEEE.
[29] Lee, Y., Kim, H., Park, E., Cui, X. and Kim, H., 2017, June. Wide-residual-inception networks for real-time object detection. In 2017 IEEE Intelligent Vehicles Symposium (IV) (pp. 758-764). IEEE.
[30] Zhang, R. and Yang, Y., 2017, November. Merging recovery feature network to faster RCNN for low-resolution images detection. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP) (pp. 1230-1234). IEEE.
[31] Han, K., Sun, M., Zhou, X., Zhang, G., Dang, H. and Liu, Z., 2017, December. A new method in wheel hub surface defect detection: Object detection algorithm based on deep learning. In 2017 International Conference on Advanced Mechatronic Systems (ICAMechS) (pp. 335-338). IEEE.
[32] Risha, K.P. and Kumar, A.C., 2016. Novel method of detecting moving object in video. Procedia Technology, 24, pp.1055-1060.
[33] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2015. Region-based convolutional networks for accurate object detection and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), pp.142-158.
[34] Wang, J.G., Zhou, L., Pan, Y., Lee, S., Song, Z., Han, B.S. and Saputra, V.B., 2016, June. Appearance-based brake-lights recognition using deep learning and vehicle detection. In 2016 IEEE Intelligent Vehicles Symposium (IV) (pp. 815-820). IEEE.
[35] Mishra, P.K. and Saroha, G.P., 2016, March. A study on video surveillance system for object detection and tracking. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom) (pp. 221-226). IEEE.
[36] Hoang Ngan Le, T., Zheng, Y., Zhu, C., Luu, K. and Savvides, M., 2016. Multiple scale faster-RCNN approach to driver's cell-phone usage and hands on steering wheel detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 46-53).
[37] Shilpa, P.H.L., 2016. A survey on moving object detection and tracking techniques. International Journal of Engineering and Computer Science, 5(5), pp.16376-16382.
[38] Zhao, X., Li, W., Zhang, Y., Gulliver, T.A., Chang, S. and Feng, Z., 2016, September. A faster RCNN-based pedestrian detection system. In 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall) (pp. 1-5). IEEE.
[39] Luo, W., Yang, B. and Urtasun, R., 2018. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3569-3577).
[40] Salvador, A., Giró-i-Nieto, X., Marqués, F. and Satoh, S.I., 2016. Faster R-CNN features for instance search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 9-16).
[41] Chen, Y., Li, W., Sakaridis, C., Dai, D. and Van Gool, L., 2018. Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3339-3348).
[42] Ahmad, S., Ullah, T., Ahmad, I., Al-Sharabi, A., Ullah, K., Khan, R.A., Rasheed, S., Ullah, I., Uddin, M. and Ali, M., 2022. A novel hybrid deep learning model for metastatic cancer detection. Computational Intelligence and Neuroscience, 2022.
[43] Ahmad, I., Ullah, I., Khan, W.U., Ur Rehman, A., Adrees, M.S., Saleem, M.Q., Cheikhrouhou, O., Hamam, H. and Shafiq, M., 2021. Efficient algorithms for E-healthcare to solve multiobject fuse detection problem. Journal of Healthcare Engineering, 2021.
[44] Shafiq, M., Tian, Z., Bashir, A.K., Du, X. and Guizani, M., 2020. CorrAUC: A malicious bot-IoT traffic detection method in IoT network using machine-learning techniques. IEEE Internet of Things Journal, 8(5), pp.3242-3254.
[45] Wahab, F., Zhao, Y., Javeed, D., Al-Adhaileh, M.H., Almaaytah, S.A., Khan, W., Saeed, M.S. and Kumar Shah, R., 2022. An AI-driven hybrid framework for intrusion detection in IoT-enabled e-health. Computational Intelligence and Neuroscience, 2022.
[46] Shafiq, M. and Gu, Z., 2022. Deep residual learning for image recognition: A survey. Applied Sciences, 12(18), p.8972.

11 Data Availability Statement


The datasets for this study are available from the corresponding author on request.

