
Where’s Wally?

—Using an RCNN Model to Find Wally


Pin-Kuei Huang, Chung-Fan Tang, Chua-Wei Chao, Ssu-Yi Cheng
Department of Computer Science and Information Engineering
National Taiwan Normal University
Taipei, Taiwan
pierce123455@gmail.com, frank051284@gmail.com, ga891111@gmail.com, 80947001s@gapps.ntnu.edu.tw

Abstract—"Where's Wally?" (Where's Waldo?) is a


worldwide popular children's book created by the British
illustrator Martin Handford. The book's goal is to find a
specific character - Wally - in a picture of a sea of people.
There are simple versions of “Where's Wally?”, but some
versions are complicated for the human eye to spend much
time looking for Wally. In our project, we try to identify the
characters through RCNN model, and we try to generate a new
“Where's Wally?” image using the GAN model. After 3000
epochs, the RCNN result of the loss function is reasonable, and
we tested some images and got a significant identification. In
the demo image of finding Wally, we got 98~100% accuracy,
which is a good result for the characters identified in the
illustrations using RCNN model. But the GAN result is not Fig. 2. The difficult version of “Where's Wally? “
qualified without enough dataset. However, we can find that From “Where's Wally?” In Hollywood
the GAN-generated images are close to the original image in
terms of composition and color.
II. PROJECT PURPOSE
Keywords—RCNN, Where’s Wally, GAN
In this project, we want to try to identify the characters
through a deep learning model, using the “Where's Wally?”
children's book sketchbook as a dataset and visualizing the
I. INTRODUCTION process of finding Wally. Finally, we hope to generate a new
“Where's Wally?” image using the GAN model.
“Where's Wally?”( Where's Waldo?) is a set of children's
books created by British illustrator Martin Handford.
Moreover, it has been continuously published in various
countries since 1987, and it is still a popular children's book III. RELATED WORK
in 2022. A. RCNN model
The book's goal is to find a specific character - Wally - in In this project, we used the R-CNN model for finding
a picture of a sea of people. There are simple versions of Wally. Unlike working on many regions, the RCNN
“Where's Wally?” (see Fig. 1), but some versions are algorithm proposes creating multiple bounding boxes in the
complicated for the human eye to spend much time looking image and checking whether these boxes contain the target
for Wally (see Fig. 2). objects. RCNN uses selective search to extract these
bounding boxes from a single image. The steps for detecting
a target object with RCNN are as follows. (see Fig. 4~8)
1. We first take a pre-trained convolutional neural network.
2. The network's last layer is trained according to the
number of target classes to be detected.
3. We obtain the Region of Interest for each image and
reshape these regions to meet the input size requirements
of the CNN.
4. After obtaining these regions, we train a Support Vector
Machine (SVM) to discriminate between the target object
and the background. For each class, we train a binary
Fig. 1. The simple versions of “Where's Wally? “ SVM.
From Where’s Waldo? in the Social Distancing Age

5. Finally, we train a linear regression model to generate more accurate bounding boxes for each recognized object.
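To make the five steps concrete, below is a minimal Python sketch of this pipeline. It is our illustration rather than the project's exact training code: it assumes opencv-contrib-python for selective search, a pre-trained MobileNetV2 from Keras as a stand-in backbone, and scikit-learn for the per-class SVM; the helper names (`propose_regions`, `region_features`, `train_wally_svm`) are ours, and step 2 (retraining the network's last layer) is omitted for brevity.

```python
# Minimal sketch of the RCNN steps above (illustrative only, not our exact code).
import cv2                      # needs opencv-contrib-python for selective search
import numpy as np
import tensorflow as tf
from sklearn.svm import LinearSVC

# Step 1: take a pre-trained convolutional neural network (MobileNetV2 stand-in).
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3), weights="imagenet")

def propose_regions(image, max_regions=200):
    """Selective search extracts candidate bounding boxes (x, y, w, h)."""
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()
    return ss.process()[:max_regions]

def region_features(image, boxes):
    """Step 3: crop each region, reshape to the CNN input size, extract features."""
    crops = [cv2.resize(image[y:y + h, x:x + w], (224, 224)) for (x, y, w, h) in boxes]
    batch = tf.keras.applications.mobilenet_v2.preprocess_input(
        np.array(crops, dtype=np.float32))
    return backbone.predict(batch, verbose=0)

def train_wally_svm(feats, labels):
    """Step 4: a binary SVM per class (here Wally = 1 vs. background = 0)."""
    svm = LinearSVC(C=1.0)
    svm.fit(feats, labels)
    return svm

# Step 5 would fit a linear regressor (e.g. sklearn's LinearRegression) from
# region features to box offsets, refining each positive bounding box.
```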
Fig. 4. Input image.

Fig. 5. Selective search finds regions of interest; these regions are input into the CNN and passed through the convolutional network.

Fig. 6. The CNN extracts features for each region, and an SVM classifies these regions into different categories.

Fig. 7. Bounding box regression predicts the position of the bounding box for each region.

Fig. 8. The RCNN method for detecting target objects.
B. Generative Adversarial Networks (GAN) model

In this project, we used the GAN model to generate a new Wally image. Generative adversarial networks (GANs), an emerging technique for both semi-supervised and unsupervised learning, were proposed by Ian Goodfellow in 2014 [1].

They can be characterized by training a pair of networks in competition with each other. The generator G creates forgeries, and the discriminator D receives both forgeries and real images and aims to tell them apart (see Fig. 9). Both are trained simultaneously and in competition with each other.

Fig. 9. Construction of the GANs. From [2].

The cost of training is evaluated with a value function V(G, D) that depends on both the generator G and the discriminator D. The formula is as follows:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

The evaluation criterion is that D(x) should be as large as possible (the closer to 1, the better) and D(G(z)) as small as possible [5].
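To ground the formula, here is a minimal TensorFlow sketch (our illustration, not the project's training code) of the two losses that implement V(G, D): the discriminator maximizes log D(x) + log(1 − D(G(z))), while the generator, in the common non-saturating variant, maximizes log D(G(z)) instead of minimizing log(1 − D(G(z))).

```python
# Minimal sketch of the GAN value function V(G, D) as two cross-entropy losses.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(d_real, d_fake):
    # Maximizing log D(x) + log(1 - D(G(z))) == minimizing this cross-entropy sum:
    # push D(x) toward 1 and D(G(z)) toward 0.
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    # Non-saturating form: push D(G(z)) toward 1 for stronger early gradients.
    return bce(tf.ones_like(d_fake), d_fake)
```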

IV. MODEL DESIGN


A. Dataset Preparation
We use the dataset from GitHub (https://github.com/vc1492a/Hey-Waldo), which contains 19 original images labeled at 256x256 pixels. Furthermore, we also collected 41 new images from the web and relabeled them (see Fig. 10). The steps are as follows (a packing sketch follows the list):

1. Collect the training set of Where's Wally puzzles.
2. Store the labels as scalars and categorical strings.
3. Pack the labels and images into a binary file (.tfrecord).
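A minimal sketch of step 3 in TensorFlow is shown below. This is our illustration: the file name, image paths, and feature keys are assumptions rather than the exact schema our pipeline used.

```python
# Minimal sketch: pack labels and encoded images into a .tfrecord file.
import tensorflow as tf

def make_example(image_path, label):
    image_bytes = tf.io.read_file(image_path).numpy()  # raw encoded image bytes
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/object/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter("waldo_train.tfrecord") as writer:
    for path, label in [("images/1.jpg", 1)]:  # label id 1 = Wally (assumed paths)
        writer.write(make_example(path, label).SerializeToString())
```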

Fig. 10. Labeled images.



B. Model design, preparation and training

We used the labeled training data for model training and combined the object recognition and test data for the model of finding Wally. Our model architecture (see Fig. 11) is as follows:

Fig. 11. Model design.

We used the TensorFlow Object Detection API with pre-trained models, the transfer learning technique, and the RCNN model. The goal is to look for the Wally label (id: 1); we retrain the model until the loss falls below 0.01 and then stop training. The parameter configuration and settings are as follows (see Tables I and II; a Keras sketch of the training settings follows Table II).
Parameter settings of the object recognition: kernel size: 3, regularization weight: 0.00004, activation: ReLU, depth: 16, decay: 0.9997, loss type: CLASSIFICATION.

TABLE I. PARAMETER SETTINGS OF THE OBJECT RECOGNITION

Object recognition
    Number of classes: 90
    Kernel size: 3
    Regularization weight: 0.00004
Feature extractor
    Depth: 16
    Activation: ReLU
    Decay: 0.9997
Hard example miner
    Number of hard examples: 3000
    Loss type: CLASSIFICATION
    Max negatives per positive: 3
    Min negatives per image: 0

Parameter settings of the training: batch size: 24, optimizer: RMSprop, momentum optimizer value: 0.9, decay: 0.9, initial learning rate: 0.004.

TABLE II. PARAMETER SETTINGS OF THE TRAINING

Training
    Batch size: 24
    Optimizer: RMSprop
        Momentum optimizer value: 0.9
        Decay: 0.9
    Exponential decay learning rate
        Initial learning rate: 0.004
        Decay steps: 800720
        Decay factor: 0.95
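As a rough illustration of how the Table II values map onto code (the project itself set them through the Object Detection API's pipeline config rather than through Keras), the settings correspond to the following Keras optimizer:

```python
# Table II training settings expressed as a Keras optimizer (our illustration).
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.004,  # Table II: initial learning rate
    decay_steps=800720,           # Table II: decay steps
    decay_rate=0.95)              # Table II: decay factor

optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule,
    rho=0.9,                      # Table II: decay
    momentum=0.9)                 # Table II: momentum optimizer value

BATCH_SIZE = 24                   # Table II: batch size
```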
V. RESULT

After 3000 epochs, the result of the loss function is reasonable (see Fig. 12), and we tested some images and got significant identification (see Fig. 13). In the demo image of finding Wally, we got 98-100% accuracy, which is a good result for the characters identified in the illustrations using the RCNN model.

Fig. 12. The result of the loss function.

Fig. 13. Finding Wally.

We used the dataset and the GAN model to generate a new Wally image, but the result is not satisfactory because the dataset is too small (see Fig. 14). However, we can see that the GAN-generated images are close to the original image in terms of composition and color.

Fig. 14. GAN result.

Fig. 15. Demo result.

VI. CONCLUSION

In this project, we trained an RCNN model to find the illustrated character "Wally", using the popular children's book "Where's Wally?" as the dataset.

Finding fixed illustrated characters is an exciting project. However, compared with common face recognition training, this project has the problem of an insufficient dataset. Even so, it is still possible to effectively train the model and identify fixed characters through RCNN.

In the GAN generation of new images, however, the insufficient dataset cannot be overcome, and a sufficient amount of data is still needed to generate new images. We can still get some interesting results with a small dataset in GAN training; for example, we can still see dataset-like features in the composition and color of the generated images.

REFERENCES
[1] I. Goodfellow et al., "Generative adversarial nets," Advances in Neural Information Processing Systems, 2014.
[2] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta et al., "Generative adversarial networks: An overview," IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 53-65.
[3] Q. Xu, G. Huang, Y. Yuan, C. Guo, Y. Sun, F. Wu, and K. Weinberger, "An empirical study on evaluation metrics of generative adversarial networks," arXiv preprint arXiv:1806.07755.
[4] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich, "To transfer or not to transfer," in NIPS'05 Workshop, Inductive Transfer: 10 Years Later, 2005.
[5] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
[6] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[7] H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran, "On the limitation of convolutional neural networks in recognizing negative images," in 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2017, pp. 352–358.

Revised from: https://diglib.eg.org/bitstream/handle/10.2312/cgvc20211313/027-031.pdf?sequence=1&isAllowed=y
Label src: https://github.com/tadejmagajna/HereIsWally/blob/master/trained_model/labels.txt
Graph src: https://arxiv.org/pdf/1311.2524.pdf
https://github.com/tensorflow/models/edit/master/research/object_detection/samples/configs/ssd_inception_v2_coco.config
