Recognizing Traffic Signs With Synthetic Data and Deep Learning
Avaz Naghipour and Rahim Pasbani
Abstract. Recently, deep learning has surpassed other machine learning (ML) algorithms in computer vision and object classification tasks. Like other ML algorithms, deep learning requires a dataset for training. In most real cases, developing an appropriate dataset is expensive and time-consuming, and in some situations collecting the data is unsafe or even impossible. In this paper, we propose a novel framework for traffic sign recognition using synthetic data and deep learning. The main feature of the proposed method is its independence from real-life data, which nevertheless leads to high accuracy on a real test dataset. Creating synthetic data one by one is more labor-intensive and costlier than collecting real data. To tackle this issue, the proposed framework uses a procedural method, which makes it possible to generate countless high-quality images that are close enough to real data. Due to its procedural nature, the framework can also be easily edited and tuned.
Keywords: Deep Learning, Convolutional Neural Networks, Computer Graphics, Synthetic Data, Traffic Sign Recognition.
few datasets, there has been no access to CAD models yet. Considering the issue of making CAD data, it is perceptible that capturing real images may be cheaper than developing synthetic ones. Modeling even a very simple 3D model is a time-consuming process that requires experts to accomplish.

Another problem with most synthetic images is their dissimilarity to real objects: these images are far from real-world references in terms of appearance. By browsing some accessible CAD datasets, it can be explicitly seen that the objects do not have the proper lighting that exists in the real world. Another significant issue that makes CAD models look rough is the texture and material of the models; instead of resembling real materials such as wood, fibers, and metals, these models seem to be made of solid clay. If an ML application is trained on such unrealistic images, the result will not give adequate accuracy in ground-truth cases. This is why researchers sometimes choose to mix them with some real images to improve the performance of the models [10].

To overcome these challenges, an efficient approach to developing synthetic traffic signs is proposed in the present work. To this end, a procedural method is used to provide the desired dataset without any quantity limitation. In the proposed method, every small detail is taken into account to make the images look real, so that they can hardly be distinguished from real images. Various ML classifiers were then trained on the developed dataset. The outline of the paper is as follows: Section 2 discusses related works. Section 3 presents synthetic image generation. Image processing filters and image augmentation are studied in Sections 4 and 5, respectively. Section 6 describes setting up the deep convolutional neural network (DCNN) architecture. In Section 7, experiments and results are reported. Section 8 concludes this paper.
RELATED WORKS

There are various algorithms for classifying traffic signs. The most regarded algorithms in this field are based on ML methods; however, some research projects employ the color and shape of the signs. These characteristics cannot be used directly to classify traffic signs, but they can remarkably help the actual classifier.

The support vector machine (SVM) has always been a classification method at hand. In ref. [11], Maldonado et al. used SVM for automatic multiclass traffic sign detection and classification using a one-vs-all approach with a Gaussian kernel. In another attempt [12], considering the limitations in the shape and color of signs, the authors used a color segmentation and shape matching approach, and the dataset was then classified using SVM; the obtained results are promising. In the method suggested in ref. [13], after detection of a sign via the MSER procedure, HSV-HOG-LBP features are extracted, and a random forest is used to finalize the recognition process. Ref. [14] has tried to prove the effectiveness of the random forest algorithm in both accuracy and speed for traffic sign recognition. Nowadays, modern computer vision classifiers mostly deploy CNNs for recognition tasks. In ref. [15], a densely connected CNN is used for traffic sign detection. In ref. [16], the authors have shown the results of different CNN architectures on the same recognition problem.

In the area of synthetic data deployment, some remarkable works have been presented so far. The authors in ref. [17] suggested a 2D synthetic text generator engine, which places text onto random backgrounds and employs the obtained data to train a CNN-based classifier to recognize text in graphics. Furthermore, a 3D synthetic dataset was used in ref. [18] to predict hand gestures in an image; the accuracy rose after adding some real images to the training dataset. CAD data have been used in several studies for classification and object detection tasks [19, 20]. Another important challenge in computer vision is viewpoint prediction; in ref. [21], rendered images were used to train a CNN, and the result was excellent. Most of the relevant methods have used premade online CAD libraries, such as Trimble 3D Warehouse, TurboSquid, Yobi3D, and ShapeNet. Using premade datasets is handy for testing new models or for benchmark competitions, but in real applications, each problem needs its own exclusively developed dataset.
SYNTHETIC IMAGE GENERATION

The proposed method in this research contains two main steps. In the first step, a procedural virtual 3D scene is created, and in the second step, a specific DCNN architecture is designed to be trained on the dataset generated in the previous step.

First of all, a virtual 3D scene needs to be set up. Then, procedurally, feasible variations and randomizations are made in 3D world space, so that every state forms a believable arrangement of the scene's objects and components. With every run, a specific arrangement is made and rendered. In the next step, to make extra purposeful variations, some image-processing-based filters and manipulations are applied to the rendered images. Finally, image augmentation is employed to generalize the proposed classifier as well as to increase the size of the training dataset.

The aim was to set up a versatile virtual scene that can automatically develop a feasible arrangement of the objects. Such a system requires several components: 3D objects, lights, a sky object, and several controllers that control the properties of the scene components. The controllers provide mathematical relationships among all objects in the scene, and to avoid infeasible cases, some constraints are considered. In fact, the constraints are small pieces of code written in Python.
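The generator code itself is not included in this excerpt; the following is a minimal sketch of how such a controller-and-constraint run loop could look, assuming a generic renderer. The parameter names, value ranges, and the render() call are illustrative assumptions, not the authors' actual tooling.

```python
import random

# Hypothetical controllers: each run samples a new scene state.
def sample_scene_state():
    return {
        "sun_elevation_deg": random.uniform(5.0, 85.0),   # daylight only
        "sun_azimuth_deg": random.uniform(0.0, 360.0),
        "sign_yaw_deg": random.gauss(0.0, 15.0),          # sign roughly faces the camera
        "camera_distance_m": random.uniform(4.0, 25.0),
        "background_id": random.randrange(100),           # one of 100 backdrop images
    }

# Constraints: small Python checks that reject infeasible arrangements.
CONSTRAINTS = [
    lambda s: abs(s["sign_yaw_deg"]) < 60.0,      # sign face must stay visible
    lambda s: s["camera_distance_m"] > 3.0,       # avoid clipping through the sign
]

def next_valid_state():
    # Rejection sampling over scene states until all constraints pass.
    while True:
        state = sample_scene_state()
        if all(check(state) for check in CONSTRAINTS):
            return state

for i in range(24000):                            # one rendered image per run
    state = next_valid_state()
    # render(state, f"sign_{i:05d}.png")          # renderer call is tool-specific
```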
• Illumination: One of the main issues in rendering photo-realistic images is the correct lighting of the scene. In CG, the lighting procedure is usually divided into two main parts, i.e., direct lighting and indirect lighting. Direct light takes charge of the main illumination and usually casts sharp shadows on objects; indirect light, on the other hand, is environmental light that results from bouncing rays. Since traffic signs are almost always placed outdoors, they are mostly illuminated by the sun and the sky. The sun is considered a direct light, and the sky is responsible for indirect lighting. In the real world, the sun angle and the sky color are intertwined [22]: the sky color gradient varies according to the sun's position and some other factors such as haze and aerosols in the air. Simulating the sun as an infinite (direct) light source is simple in most CG applications. The technique usually used for indirect lighting is image-based lighting (IBL), in which a big sphere or hemisphere surrounds the scene and its texture sends light rays into the scene. In this research, the Preetham sky model was used to light the scene.
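As an illustration of this intertwined sun-and-sky idea, the snippet below samples correlated daylight parameters before a render. The value ranges and the sine-based sun strength are rough assumptions standing in for a Preetham-style model, not the paper's actual setup.

```python
import math
import random

def sample_daylight():
    """Sample sun and sky parameters together so they stay physically consistent."""
    elevation = random.uniform(5.0, 85.0)     # degrees above the horizon
    azimuth = random.uniform(0.0, 360.0)
    turbidity = random.uniform(2.0, 8.0)      # haze/aerosol amount (clear .. hazy)
    # Lower sun -> dimmer overall light; a crude stand-in for a physical sky model.
    sun_strength = max(0.05, math.sin(math.radians(elevation)))
    return {
        "sun_elevation": elevation,
        "sun_azimuth": azimuth,
        "turbidity": turbidity,
        "sun_strength": sun_strength,
    }

print(sample_daylight())
```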
• Signboard Damages and Imperfections: Road signs are usually exposed to physical damage and strikes, which often cause deformation. To mimic this effect, some deformers have been deployed. Deforming is usually done using displacement maps: gray-scale noisy images that are projected onto the object's UV coordinates and push polygons up or down along the polygon normal vectors, corresponding to the brightness of the projected map. With every run, this map is regenerated with a different noisy pattern.
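A minimal sketch of this displacement idea, assuming the mesh data is available as NumPy arrays; the nearest-neighbour UV sampling and the strength value are illustrative.

```python
import numpy as np

def displace(vertices, normals, uvs, height_map, strength=0.01):
    """Push each vertex along its normal by the brightness of a noisy height map."""
    h, w = height_map.shape
    # Sample the gray-scale map at each vertex UV (nearest neighbour for brevity).
    px = (uvs[:, 0] * (w - 1)).astype(int)
    py = (uvs[:, 1] * (h - 1)).astype(int)
    offsets = (height_map[py, px] - 0.5) * strength   # centred so dents and bumps both occur
    return vertices + normals * offsets[:, None]

# A fresh noise pattern per run gives every sign its own dents.
noise = np.random.rand(256, 256)
```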
• Dust and Scratches: Rain, storms, snow, dust, and other natural phenomena may dirty the signs and make them unclear. Some controllers are designed to simulate these types of effects by adding random patterns onto the sign textures. For additional detail, diverse noisy images are deployed to fake dirtiness on the sign boards, and some mask textures specify the areas where this dirtiness should appear.
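The dirt-and-mask compositing can be sketched as a simple alpha blend; the array shapes and the amount parameter below are assumptions.

```python
import numpy as np

def add_dirt(texture, dirt, mask, amount=0.4):
    """Blend a noisy dirt image onto the sign texture, restricted by a gray-scale mask."""
    alpha = mask[..., None] * amount                  # where and how strongly dirt appears
    return (1.0 - alpha) * texture + alpha * dirt

texture = np.random.rand(256, 256, 3)                 # placeholder sign texture
dirt = np.random.rand(256, 256, 3)                    # noisy dirt pattern, regenerated per run
mask = np.random.rand(256, 256)                       # mask texture: 0 = clean, 1 = dirty
dirty_sign = add_dirt(texture, dirt, mask)
```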
• Backdrop and Environment: Each season has its own visual characteristics; in rendered images, these effects can be realized by changing the background image. An important factor to take into account is that neural networks can learn unwanted patterns such as backgrounds, so repetitive backdrops should be avoided as much as possible. To prevent this side effect, the proposed method uses one hundred different images; however, the risk is still present once every 100 runs. Hence, for every run, the position, scale, and rotation of the background image are changed randomly. This guarantees that the final rendered images will never have the same background. These randomizations are controlled by controllers in order to prevent infeasible images.
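One possible way to randomize the backdrop framing per run is sketched below with Pillow; the rotation, scale, and crop ranges are assumptions.

```python
import random
from PIL import Image

def random_backdrop(path, out_size=(80, 80)):
    """Give every render a differently framed background so no two backdrops repeat."""
    img = Image.open(path).convert("RGB")
    img = img.rotate(random.uniform(-10, 10), expand=True)     # small random tilt
    scale = random.uniform(1.0, 1.8)
    img = img.resize((int(img.width * scale), int(img.height * scale)))
    left = random.randint(0, max(0, img.width - out_size[0]))  # random crop position
    top = random.randint(0, max(0, img.height - out_size[1]))
    return img.crop((left, top, left + out_size[0], top + out_size[1]))
```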
• Shadows: The sign objects are placed in different positions and may receive any type of shadow cast by other objects. These shadows, when analyzed numerically, have a significant effect on the signs' appearance and color. To implement these shadows in the virtual world, several objects with different shapes and sizes have been placed in the scene. With each run, some properties of these objects, such as position, rotation, distance, and visibility, become random. Shadows play a significant role in the overall look of any image. For more clarity, in Figure 2 the same scene has been rendered four times, changing only the received shadows, and the histogram of each image has been plotted. As seen in these plots, most pixel values change while, semantically, all of these images represent the same sign. In addition to light and shadows, any change in position, rotation, scale, shear, color, and other properties will overturn the pixel data. Thus, designing a classifier that remains invariant to all these variations is a big challenge in both the image processing and computer vision fields. So the goal is to provide a comprehensive dataset that includes the road signs in any condition; this makes the classifier behave invariantly toward unnecessary information.

Figure 2. Histogram of an object in four different shadow situations. Shadows cast by the environment change the entire distribution of the pixel data.
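The histogram comparison behind Figure 2 can be reproduced with a few lines of NumPy; the file names below are placeholders for the four shadow renders.

```python
import numpy as np
from PIL import Image

def gray_histogram(path, bins=64):
    """Histogram of pixel intensities for one rendered image."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0), density=True)
    return hist

# Same sign, four shadow situations: the label is identical, the pixel statistics are not.
renders = ["shadow_0.png", "shadow_1.png", "shadow_2.png", "shadow_3.png"]  # placeholders
hists = [gray_histogram(p) for p in renders]
for a, b in zip(hists, hists[1:]):
    print("L1 distance between histograms:", float(np.abs(a - b).sum()))
```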
IMAGE PROCESSING FILTERS

Pictures captured by ordinary cameras often contain some noise, which is mostly visible in low-light situations. In rendering, due to indirect illumination, all rendered images are inherently noisy as well; nevertheless, for emphasis, some subtle noise is randomly added to the rendered images.

By browsing the GTSRB data more closely, it can be seen that some images are very dark and some are very bright. Despite the different lighting situations already covered in the 3D scene rendering step, some extra darkening and brightening filters are applied to a portion of the rendered images. As mentioned before, motion blur and defocus can be faked by 2D filters; in this step, these effects are likewise applied to randomly selected rendered images. After many trials and errors, it was found that the mentioned filters improve the final result and accuracy.
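A sketch of how such 2D filters might be applied to randomly selected renders, using Pillow; the probabilities and parameter ranges are assumptions, and a Gaussian blur stands in for the defocus effect.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def apply_random_filters(img, p=0.5):
    """Randomly darken/brighten, add subtle noise, and defocus a rendered RGB image."""
    if random.random() < p:                                  # darkening / brightening
        img = ImageEnhance.Brightness(img).enhance(random.uniform(0.4, 1.6))
    if random.random() < p:                                  # subtle sensor-like noise
        arr = np.asarray(img, dtype=np.float32)
        arr += np.random.normal(0.0, 8.0, arr.shape)
        img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if random.random() < p:                                  # defocus stand-in
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 2.0)))
    return img
```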
IMAGE AUGMENTATION

At the final step of dataset preparation, all the rendered images are candidates for augmentation. In this step, all the variations applied to the rendered images are offline, and there is no access to the 3D scene anymore.

Figure 3. Heavy augmentation applied to rendered images. Most of the image properties, such as position, scale, crop, distortion, and color, have been changed during this operation.
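The paper does not name its augmentation tooling in this excerpt; one way to obtain variations similar to Figure 3 is Keras' ImageDataGenerator, as sketched below with assumed ranges.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Offline-style augmentation of the rendered 80x80 images: position, scale,
# shear, and colour jitter, roughly matching what Figure 3 shows.
augmenter = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.15,
    height_shift_range=0.15,
    zoom_range=0.25,
    shear_range=10.0,
    brightness_range=(0.6, 1.4),
    channel_shift_range=30.0,
    fill_mode="nearest",
)
# batches = augmenter.flow(x_rendered, y_labels, batch_size=64)  # x_rendered: (N, 80, 80, 3)
```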
Figure 5. Schematic diagram of the proposed model's layers. The architecture comprises convolution, pooling, batch normalization, dropout, and fully connected layers. Input images have 80 pixels in both height and width.
Rectified Linear Units (ReLUs). These blocks finally end with two fully connected layers, which are used to collect and optimize the scores for each class. The first fully connected layer contains 128 neurons and is followed by batch normalization and dropout. The last layer is the second fully connected layer with only 12 neurons, and softmax is used as its activation function; this layer decides which class the input image belongs to. The architecture is plotted schematically in Figure 5.
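The excerpt does not state the number of convolution blocks or their filter counts; the sketch below is one possible Keras layout consistent with the description and Figure 5 (conv/pool/batch-norm/dropout blocks with ReLU, a 128-neuron dense layer followed by batch normalization and dropout, and a 12-way softmax on 80x80 inputs). Filter counts and dropout rates are assumptions.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(80, 80, 3), n_classes=12):
    """Conv/pool/batch-norm/dropout blocks ending in a 128-neuron layer and a 12-way softmax."""
    m = models.Sequential([layers.InputLayer(input_shape=input_shape)])
    for filters in (32, 64, 128):                          # filter counts are assumptions
        m.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        m.add(layers.BatchNormalization())
        m.add(layers.MaxPooling2D())
        m.add(layers.Dropout(0.25))
    m.add(layers.Flatten())
    m.add(layers.Dense(128, activation="relu"))            # first fully connected layer
    m.add(layers.BatchNormalization())
    m.add(layers.Dropout(0.5))
    m.add(layers.Dense(n_classes, activation="softmax"))   # one output per traffic sign class
    return m

model = build_model()
model.summary()
```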
The proposed model is ready to receive the provided synthetic images as input to begin the training process, but to achieve optimum weights and biases, a proper loss function must be established. Suppose x is an instance image vector and s_k(x) is the score of class k that is fed to the softmax; there is a linear relationship between x and the score, as below [29]:

s_k(\mathbf{x}) = \mathbf{x}^{T} \theta^{(k)}    (1)

In Equation (1), θ^(k) represents the parameter vector of class k. We need the probability of belonging to class k, so the softmax function at the end of the model chain calculates this probability, \hat{p}_k [29]:

\hat{p}_k = \frac{e^{s_k(\mathbf{x})}}{\sum_{j=1}^{K} e^{s_j(\mathbf{x})}}    (2)

where K is the number of classes.
Since the softmax predicts only one class at a time, it is suitable for our case, as every sign belongs to exactly one class. Cross entropy is a proven way to define a loss function for classification problems [29]:

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \hat{p}_k^{(i)}    (3)

The cost function J(Θ) is obtained by forming Eq. (3). In this equation, y_k^{(i)} is the true label of instance i with respect to class k: it is 1 if the ith instance belongs to class k and 0 otherwise. To obtain the gradient vector for class k, the gradient of the cost function is calculated with respect to the kth parameter vector θ^(k) [29]:

\nabla_{\theta^{(k)}} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) \mathbf{x}^{(i)}    (4)

Using one of the gradient descent family of optimizers, the model then finds the parameters Θ that minimize the cost function. In fact, these parameters are the filters and the other types of learnable variables [29].
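Equations (1) through (4) can be checked numerically with a few lines of NumPy; the data shapes and values below are purely illustrative.

```python
import numpy as np

def softmax_scores(X, Theta):
    """Eq. (1)-(2): class scores s_k(x) = x^T theta^(k) and their softmax probabilities."""
    s = X @ Theta                                     # (m, K) scores
    e = np.exp(s - s.max(axis=1, keepdims=True))      # stabilised exponentials
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, Y):
    """Eq. (3): J(Theta) = -(1/m) * sum_i sum_k y_k^(i) log p_k^(i)."""
    m = Y.shape[0]
    return -np.sum(Y * np.log(P + 1e-12)) / m

def gradient(X, P, Y):
    """Eq. (4): grad_k J = (1/m) * sum_i (p_k^(i) - y_k^(i)) x^(i), all classes at once."""
    m = X.shape[0]
    return X.T @ (P - Y) / m                          # (n_features, K)

# Tiny check with random data (m=4 instances, 5 features, K=3 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
Y = np.eye(3)[rng.integers(0, 3, size=4)]             # one-hot true labels
Theta = np.zeros((5, 3))
P = softmax_scores(X, Theta)
print(cross_entropy(P, Y), gradient(X, P, Y).shape)
```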
EXPERIMENTS AND RESULTS

The designed synthetic data generator is capable of generating any number of images at any required dimension. A size of 80 pixels for both height and width was chosen, and a total of 24,000 images were used to train the proposed DCNN model. Figure 6 (1 & 2) shows the training and validation loss/accuracy over 200 epochs.

As seen in Figure 6, around epoch 200 the model has almost converged, and the validation loss and accuracy are in an acceptable situation in terms of overfitting. For the test dataset, the corresponding classes from the GTSRB are used.
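Continuing the Keras sketch above, training could be set up as follows. The optimizer, batch size, and validation split are assumptions; the excerpt only states 24,000 synthetic 80x80 images and 200 epochs. Here x_train and y_train stand for the rendered images and their one-hot labels.

```python
# Optimizer, batch size, and validation split are assumptions, not the paper's settings.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",      # the cross-entropy cost of Eq. (3)
              metrics=["accuracy"])
history = model.fit(x_train, y_train,               # x_train: (24000, 80, 80, 3) rendered images
                    validation_split=0.1,
                    epochs=200,
                    batch_size=64)
```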
In the machine learning field, and especially in supervised machine learning, the confusion matrix is considered one of the most significant visualization methods for statistical classification tasks [30]. The confusion matrix for our proposed model on the test dataset is depicted in Figure 7; classes that are close to each other are confused with similar classes.

Figure 7. Normalized confusion matrix for the German Traffic Sign Recognition Benchmark (GTSRB) dataset.

By referring to the plotted confusion matrix in Fig. 7 (1 & 2), it is clear that predicting classes 3 and 4 leads to higher errors than the others. The first reason refers to the appearance of the two mentioned classes: they are very similar to each other.
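A normalized confusion matrix like the one in Figure 7 can be produced with scikit-learn, as sketched below; x_test and y_test stand for the corresponding GTSRB images and labels, and model is the trained classifier from the earlier sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_test: GTSRB test labels, y_pred: model predictions (both integer class ids 0..11).
y_pred = model.predict(x_test).argmax(axis=1)         # x_test from the matching GTSRB classes
cm = confusion_matrix(y_test, y_pred, labels=np.arange(12))
cm_normalized = cm / cm.sum(axis=1, keepdims=True)    # row-normalise: per-class recall
print(np.round(cm_normalized, 2))
```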
CONCLUSION

Deploying machine learning in industry-level production requires providing an exclusive dataset that meets the requirements. Providing labeled ground-truth datasets for computer vision tasks is usually expensive, time-consuming, and labor-intensive. Furthermore, there are some cases in which creating a real dataset is not safe or is practically impossible. Using CAD models is another option, but creating the desired models one by one in most cases becomes more expensive than providing real datasets.

To cover this challenge and develop a synthetic dataset for the traffic sign recognition task, a procedural method was used in this paper. Using computer graphics tools, the proposed method facilitates generating numerous images that are closely analogous to real-life instances. Moreover, a well-structured DCNN architecture was set up that decently fulfilled the classification task. Without seeing any real data, this classifier could categorize the real-world GTSRB dataset with more than 91.91% accuracy.

The provided dataset has more details than the GTSRB dataset requires. We took many details into account that might not be necessary, but they made the classifier more reliable in complicated situations. Rendered images and real pictures captured by a camera intrinsically contain many dissimilarities, and using synthetic images to train machine learning models requires narrowing this similarity gap. Augmentation and other image processing filters are helpful in enhancing accuracy; additionally, without augmentation and dropout, overfitting and generalization issues would be prominent. For future research, we intend to utilize this procedure for more complicated tasks such as road and street object detection. Clearly, such a procedure requires greater effort to set up a system that can provide credible rendered images.

CONFLICT OF INTEREST

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS

RP conducted the initial literature review and data collection, performed the experiments, prepared the results, and drafted the manuscript. AN helped in writing, editing, and conceptualization, analyzed the results, and contributed to supervision. Both authors read and approved the final manuscript.
REFERENCES

[1] Li, L., Huang, W., Liu, Y., Zheng, N., Wang, F. (2016). Intelligence Testing for Autonomous Vehicles: A New Approach, IEEE Transactions on Intelligent Vehicles, 1(2), 158–166.
[2] Gidado, U. M., Chiroma, H., Aljojo, N., Abubakar, S., Popoola, S. I., Al-Garadi, M. A. (2020). A Survey on Deep Learning for Steering Angle Prediction in Autonomous Vehicles, IEEE Access, 8, 163797–163817.
[3] Arnold, E., Al-Jarrah, O. Y., Dianati, M., Fallah, S., Oxtoby, D., Mouzakitis, A. (2019). A Survey on 3D Object Detection Methods for Autonomous Driving Applications, IEEE Transactions on Intelligent Transportation Systems, 20(10), 3782–3795.
[4] Gjoreski, H., Ciliberto, M., Wang, L., Morales, F. J. O., Mekki, S., Valentin, S., Roggen, D. (2018). The University of Sussex-Huawei Locomotion and Transportation Dataset for Multimodal Analytics with Mobile Devices, IEEE Access, 6, 42592–42604.
[5] Wang, T., Wu, D. J., Coates, A., Ng, A. Y. (2012). End-to-End Text Recognition with Convolutional Neural Networks, Proceedings of the 21st International Conference on Pattern Recognition, Tsukuba, 3304–3308.
[6] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazırbas, C., Golkov, V. (2015). FlowNet: Learning Optical Flow with Convolutional Networks, IEEE International Conference on Computer Vision, 2758–2766.
[7] Mayer, N., Ilg, E., Hausser, P., Fischer, P. (2016). A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation, IEEE Conference on Computer Vision and Pattern Recognition, 4040–4048.
[8] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A. M. (2016). The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes, IEEE Conference on Computer Vision and Pattern Recognition, 3234–3243.
[9] Handa, A., Patraucean, V., Badrinarayanan, V., Stent, S., Cipolla, R. (2016). Understanding Real World Indoor Scenes with Synthetic Data, IEEE Conference on Computer Vision and Pattern Recognition, 4077–4085.
[10] Tsai, C., Tsai, S., Hsu, Y., Wu, Y. (2017). Synthetic Training of Deep CNN for 3D Hand Gesture Identification, International Conference on Control, Artificial Intelligence, Robotics & Optimization, 165–170.
[11] Maldonado-Bascon, S., Lafuente-Arroyo, S., Gil-Jimenez, P., Gomez-Moreno, H., Lopez-Ferreras, F. (2007). Road-Sign Detection and Recognition Based on Support Vector Machines, IEEE Transactions on Intelligent Transportation Systems, 8(2), 264–278.
[12] Wali, S. B., Hannan, M. A., Hussain, A., Samad, S. A. (2015). An Automatic Traffic Sign Detection and Recognition System Based on Colour Segmentation, Shape Matching, and SVM, Mathematical Problems in Engineering, 1–11.
[13] Kuang, X., Fu, W., Yang, L. (2018). Real-Time Detection and Recognition of Road Traffic Signs Using MSER and Random Forests, International Journal of Online Engineering, 14(3), 34–51.
[14] Ellahyani, A., Ansari, M. E., Jafari, I. E. (2016). Traffic Sign Detection and Recognition Based on Random Forests, Applied Soft Computing, 46, 805–815.
[15] Liang, Z., Shao, J., Zhang, D., Gao, L. (2019). Traffic Sign Detection and Recognition Based on Pyramidal Convolutional Networks, Neural Computing and Applications, 32(11), 6533–6543.
[16] Shustanov, A., Yakimov, P. (2017). CNN Design for Real-Time Traffic Sign Recognition, Procedia Engineering, 201, 718–725.
[17] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A. (2014). Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition, arXiv:1406.2227.
[18] Tsai, C., Tsai, Y., Hsu, S., Wu, Y. (2017). Synthetic Training of Deep CNN for 3D Hand Gesture Identification, International Conference on Control, Artificial Intelligence, Robotics & Optimization, 165–170.
[19] Peng, X., Sun, B., Ali, K., Saenko, K. (2015). Learning Deep Object Detectors from 3D Models, IEEE International Conference on Computer Vision, 1278–1286.
[20] Sun, B., Saenko, K. (2014). From Virtual to Reality: Fast Adaptation of Virtual Object Detectors to Real Domains, Proceedings of the British Machine Vision Conference.
[21] Su, H., Qi, C. R., Li, Y., Guibas, L. J. (2015). Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views, IEEE International Conference on Computer Vision, 2686–2694.
[22] Satilmis, P., Bashford-Rogers, T., Chalmers, A., Debattista, K. (2017). A Machine-Learning-Driven Sky Model, IEEE Computer Graphics and Applications, 37(1), 80–91.
[23] Jung, J., Lee, J. Y., Kweon, I. S. (2019). One-Day Outdoor Photometric Stereo Using Skylight Estimation, International Journal of Computer Vision, 127(8), 1126–1142.
[24] Bilal, A., Jourabloo, A., Ye, M., Liu, X., Ren, L. (2018). Do Convolutional Neural Networks Learn Class Hierarchy?, IEEE Transactions on Visualization and Computer Graphics, 24(1), 152–162.
[25] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11), 2278–2324.
[26] Bjorck, J., Gomes, C., Selman, B., Weinberger, K. Q. (2018). Understanding Batch Normalization, Advances in Neural Information Processing Systems, 7694–7705.
[27] Krizhevsky, A., Sutskever, I., Hinton, G. E. (2017). ImageNet Classification with Deep Convolutional Neural Networks, Communications of the ACM, 60(6), 84–90.
[28] Zhang, Z., Wang, H., Liu, S., Xiao, B. (2018). Consecutive Convolutional Activations for Scene Character Recognition, IEEE Access, 6, 35734–35742.
[29] Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2nd ed.
[30] Stehman, S. V. (1997). Selecting and Interpreting Measures of Thematic Classification Accuracy, Remote Sensing of Environment, 62(1), 77–89.
[31] German Traffic Sign Benchmarks, https://benchmark.ini.rub.de/gtsrb_results_ijcnn.html.