SurfCNN: A Descriptor Accelerated Convolutional Neural Network for Image-Based Indoor Localization
April 8, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2981620
ABSTRACT The convolutional neural network (CNN) is a powerful tool for many data applications. However, its high dimensional nature, large network size, computational complexity and need for large amounts of training data make it challenging to use in edge computing applications, which are becoming increasingly popular, relevant and important. In this paper, we propose a descriptor based approach to accelerate convolutional neural network training and to reduce the input dimension and network size, which greatly facilitates the use of CNN for edge computing and even cloud computing. By using image descriptors to extract features from original images, we report a simpler convolutional neural network with fast training speed, low memory usage and outstanding accuracy, without the need for a pre-trained network as opposed to the state of the art. In indoor localization, our SURF descriptor accelerated CNN (SurfCNN) can reach an average position error of 0.28 m and orientation error of 9.2°. Compared to the conventional CNN that uses original images as input, our algorithm reduces the dimension of the input features by a factor of 48 without impairing the accuracy. Further, at an extreme feature reduction of 14,400 times, our model still retains an average position error of 0.41 m and orientation error of 14°.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
59750 VOLUME 8, 2020
A. M. Elmoogy et al.: SurfCNN: Descriptor Accelerated CNN for Image-Based Indoor Localization
People normally spend over 87% of their daily lives indoors [1]. Consequently, indoor localization is important to many applications ranging from indoor navigation to virtual reality [2]. Unlike outdoor localization, where the global positioning system (GPS) is prevalent, indoor localization relies on readings from sensors such as the received signal strength indicator (RSSI), light detection and ranging (LIDAR), inertial measurement unit (IMU) and images from cameras, as GPS signals are too weak [3]. In particular, image based localization is popular in many fields including robotics [4] and simultaneous localization and mapping (SLAM) [5] for its high accuracy and low cost.

Image based localization methods can be categorized into two sub-categories: feature based and direct methods [6]. In direct methods, all the image pixels are used in the location estimation process, making it computationally expensive despite its capability of generating dense maps [7]. On the other hand, feature based methods extract features from the images and estimate the location using these features; consequently, they are much faster than direct methods. For feature based methods, there have been various feature detectors for feature extraction, including the scale-invariant feature transform (SIFT) [8], sped up robust features (SURF) [9], features from accelerated segment test (FAST) [10], binary robust independent elementary features (BRIEF) [11] and oriented FAST and rotated BRIEF (ORB) [12], etc. A typical procedure for feature based localization is as follows: features are first extracted using one of these algorithms. Then, descriptors are estimated to identify the area around each feature for matching. Following that, either the descriptors of consecutive images are matched to find the relative pose (position and orientation) between them [13], or each image's descriptors are matched against a large database of key frames or landmarks with known locations to identify the closest match and infer the current pose of the camera [14].

Although good performance has been achieved by these classical methods, they have drawbacks. For example, the accuracy of all these methods relies on a good estimate of the initial pose. In addition, the pose accuracy is determined by the matching between descriptors; therefore, wrong correspondences can accumulate errors. Finally, a large database is required for image retrieval methods, with a size proportional to the area of the scene.

Recently, the convolutional neural network (CNN) [15], which is capable of extracting features from high dimensional data including images, videos, etc., has become a popular choice for indoor localization. Multiple CNN architectures including AlexNet [16], GoogleNet [17] and ResNet [18] achieve state of the art performance in multiple applications. For image based indoor localization, a CNN is typically trained in an end to end architecture to learn the global pose of a single image with respect to a known reference frame [19]. However, CNN training is a high dimensional optimization procedure whose training time is long when the input size is large. Furthermore, a CNN needs to be re-trained or fine tuned when the testing scene is significantly different from the training scene.

To reduce the complexity of CNN, we propose, for the first time to the authors' knowledge, to use the image features' descriptors instead of the image itself as input to a CNN to learn the global image pose. Among all feature descriptors and detectors, SURF descriptors are used extensively in computer vision applications such as face recognition [20], visual simultaneous localization and mapping (SLAM) [21] and object detection [22]. In addition, [23] demonstrates that the SURF descriptor reaches the highest accuracy in image matching for indoor localization. Typically, SURF can convert an image with around 1 million pixels into a feature set with fewer than 20 thousand values by taking the strongest 300 features' descriptors, sorted by the corresponding Hessian threshold. This significantly reduces the data dimension without noticeable loss of image information. The proposed method aims to combine the advantages of both the classical methods and the direct CNN method while mitigating their disadvantages. Our system does not require an initial pose, correspondences between descriptors or a large database. Moreover, it reduces the complexity of the direct CNN by reducing the input dimensions to the CNN, yielding model simplification, easier training and re-training in different environments, and faster inference. This greatly facilitates its use in edge and cloud computing with less demanding data storage, transmission and computation requirements.

from an indoor localization application, we combine both technologies by first using an image descriptor to extract features from images. The feature set, which has a significantly reduced dimension compared to the images, is input to a CNN to extract more useful features with further reduced dimension. The combined techniques result in significantly fewer parameters in the CNN and reduced training and testing time, which substantially lowers the required communication and computing resources and the overall latency.

II. RELATED WORK

A. DESCRIPTORS RELATED CNN MODEL
CNN has been used to extract the descriptors of an input image, where the input is either a full image or a pair of non corresponding patches of the image. For example, a Siamese CNN is proposed in [24] to learn the descriptors of small patches of images, achieving patch matching at a 95% recall rate. The same approach is employed in [25] but using pairs of non corresponding patches, where the network learns 128-D descriptors whose Euclidean distances reflect patch similarity. Reference [26] uses a similar idea, learning a similarity score from a pair of patches. Further, unsupervised learning [27] with the VGG-16 network architecture [28] is introduced in [29] to learn compact binary descriptors for efficient visual object matching.

B. IMAGE-BASED LOCALIZATION
Using images for localization is a part of visual SLAM [6]. In the past, most classical visual SLAM systems used either
handcrafted local features (feature-based methods) or the whole image without extracting any features (direct methods). In feature-based methods, [13], [30] employ ORB descriptor matching to find the correspondences between consecutive frames and then use the 8-point algorithm [31] with bundle adjustment [32] to find the relative pose between frames. Other techniques [33], [34] rely on finding the 2D-3D correspondence via descriptor matching between the 2D image and a 3D model built using, e.g., structure from motion (SfM) [35].

Paper [7] is a direct method that uses the full image information. It provides a denser map without affecting the pose estimation significantly compared to [13]. In addition, [36] uses image retrieval techniques to search for the similarity between the current image and images in a database; consequently, the position at which the current image was taken can be estimated.

With the boom of deep learning, neural networks are used in visual SLAM and image based localization by mapping directly from the single image space to the absolute position while circumventing challenges in classical methods. For example, transfer learning [37] and the pretrained GoogLeNet [38] are employed in [19] to regress the pose of the camera. In later publications, the same authors improve the accuracy by modifying the loss function according to the geometry, i.e., the relation between position and orientation [39], or the uncertainty calculation [40]. Further, [41] generates uncertainty for the camera poses using Gaussian process regressors. Several orientation representations and data augmentations are introduced in [42], along with a complex network with shared convolutions to learn the position and orientation. Recently, [43] adds long short-term memory (LSTM) [44] to a similar model to memorize good features. In [18], a pretrained ResNet-34 is applied to regress the camera pose; it adopts an encoder-decoder design and uses skip connections to move features from the early layers to the output layers.

Despite the outstanding performance of the neural network based systems, the training of these networks is time consuming. The complexity of training increases if the model needs to be fine tuned when the testing images are taken in completely different environments from the training images. To reduce the training complexity, we propose a novel technique to reduce the input dimensions and consequently the parameter size of the CNN.

III. NETWORK MODEL
In this section, we apply our model to the image based indoor localization application. The task is that, given an image I taken by a camera with unknown intrinsic parameters and the corresponding SURF descriptor vector D, the network learns the camera pose in the form of a global Cartesian position [x, y, z]^T and an orientation in quaternion form [qw, qx, qy, qz]^T. Note that, in contrast to the rotation matrix and Euler angles, the quaternion is a stable form to represent orientation. Consequently, both position and orientation can be combined to form the output pose vector P = [x, y, z, qw, qx, qy, qz]^T of size 7 of the camera pose.

A. DATASET
Since our system is designed to work in indoor environments, the training and testing are performed on 3 well known indoor datasets:
1) Microsoft RGB-D 7 scenes dataset [45]: It is composed of 7 scenes with constantly changing views and varying camera heights, as seen in Fig. 1. It contains both the RGB image and the depth map with the corresponding pose.
2) RGB-D SLAM Dataset and Benchmark [46]: It is a benchmark for the evaluation of visual odometry and visual SLAM systems. It contains RGB and depth images with the ground truth trajectory. We choose to work with 4 scenes, each of which provides sequences for training with ground truth poses and others for testing without ground truth positions. An online tool is available for evaluation on the testing scenes.
3) The ICL-NUIM dataset [47]: It is a recently developed benchmark, similar to the RGB-D SLAM benchmark dataset, for RGB-D, visual odometry and SLAM algorithms. We work on the office room scene with 4 sequences, one sequence for training and the other 3 for testing.

B. SURF DESCRIPTORS
The SURF algorithm [9] starts with computing the interest points or features. To find the correspondences between these features, a description of the area around each feature is computed for robust matching; this is called the descriptor. In the original SURF implementation, there are two ways to find the descriptor of a detected feature, depending on its rotational invariance. To make the descriptors invariant to rotations, the dominant orientation of a neighbourhood around the feature is calculated using the Haar-wavelet transform in both the x and y directions. Subsequently, the area around the feature is divided into a 4 × 4 grid along the dominant orientation. As opposed to its computationally expensive counterpart SIFT [8], the Haar-wavelet transform [48] is obtained for both vertical and horizontal directions. The absolute values of all vertical and horizontal responses are summed to form the descriptor vector of all the sub-regions around the feature, with size 64. In the case where the orientation information needs to be maintained within the SURF descriptor, the dominant orientation calculation step is skipped and the resulting vector is called the Upright SURF (U-SURF) descriptor. In our models, we use both SURF and U-SURF descriptors and demonstrate that using U-SURF does enhance the orientation calculation compared to the ordinary SURF descriptor.

Fig. 2 displays the histogram of features for images of the 7 scenes dataset. As shown, the number of features ranges from 100 to 4000. To make the input dimension feasible, we choose to work with at most 300 features. Here, each feature has 64 values and the maximum input vector size is 300 × 64. Compared to the original image of 480 × 640 × 3 pixels, use of the descriptor reduces the CNN input size by at least a factor of 48.

Further, the locations of the 300, 100, 50 and 10 most important SURF features are shown in Fig. 1 as red dots. It is seen that the SURF algorithm focuses on feature points such as edges and corners, making them good representatives of images that uniquely describe every feature's surrounding area. Consequently, the majority of the information in the image is retained in the SURF descriptors.

FIGURE 1. Illustrations of image feature extraction using SURF descriptors. Red dots represent the locations of (from left to right:) 10, 50, 100 and 300 SURF features of (from top to bottom:) office, stairs, fire, pumpkin, heads and chess scenes.
FIGURE 2. The histogram of features for Stairs, Office, Fire, Chess, Heads and Pumpkin scenes.
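The fixed-size descriptor input described above (the strongest descriptors sorted by their Hessian-based response, truncated to at most 300) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code; in particular, the zero-padding of images that yield fewer than 300 features is an assumption made here for a fixed input shape.

```python
import numpy as np

def build_descriptor_input(descriptors, responses, n_features=300):
    """Form a fixed-size CNN input from a variable number of SURF
    descriptors: keep the n_features strongest descriptors ranked by
    detector response, zero-padding when fewer are available (the
    padding scheme is an assumption for illustration).

    descriptors: (M, 64) array of SURF descriptors for one image.
    responses:   (M,) array of the corresponding detector responses.
    Returns an (n_features, 64) float32 array.
    """
    descriptors = np.asarray(descriptors, dtype=np.float32)
    responses = np.asarray(responses, dtype=np.float32)
    # Sort strongest-first and truncate to the top n_features.
    order = np.argsort(responses)[::-1][:n_features]
    top = descriptors[order]
    # Zero-pad images that produced fewer than n_features descriptors.
    out = np.zeros((n_features, 64), dtype=np.float32)
    out[:top.shape[0]] = top
    return out

# Dimension reduction quoted in the text: a 480 x 640 x 3 image
# versus a 300 x 64 descriptor matrix gives a factor of 48.
reduction = (480 * 640 * 3) / (300 * 64)
```

With one descriptor per image (a 1 × 64 input), the same arithmetic gives the extreme reduction factor of 480 × 640 × 3 / 64 = 14,400 discussed in the abstract.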
TABLE 1. Dimensions of CNN layers, where N is the number of SURF features.

TABLE 2. The number of layers for the current work SurfCNN with 300 × 64 input features, the average number of parameters and the median error in position, compared with PoseNet [19], Pose-LSTM [43] and Pose-Hourglass [52].
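Table 1 gives the actual SurfCNN layer dimensions. As a rough illustration of how a convolutional layer can consume the N × 64 descriptor matrix, the sketch below runs a toy forward pass in NumPy; the filter count, filter width and the random untrained weights are illustrative assumptions, not the dimensions reported in Table 1.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """Slide each kernel along the feature axis with valid padding.
    x: (N, 64) descriptor matrix; kernels: (C, k, 64).
    Returns an (N - k + 1, C) feature map."""
    n, d = x.shape
    c, k, d2 = kernels.shape
    assert d == d2
    out = np.zeros((n - k + 1, c))
    for i in range(n - k + 1):
        window = x[i:i + k]  # (k, 64) slice of consecutive descriptors
        # Contract the (k, 64) window against every kernel at once.
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def surfcnn_forward_sketch(x, rng):
    """Toy forward pass: conv -> ReLU -> global average pool -> linear
    head regressing the 7-D pose [x, y, z, qw, qx, qy, qz].
    Weights are random (untrained); sizes are illustrative only."""
    k1 = rng.standard_normal((32, 5, 64)) * 0.01  # 32 filters, width 5
    h = np.maximum(conv1d_valid(x, k1), 0.0)      # (N - 4, 32) after ReLU
    pooled = h.mean(axis=0)                       # (32,) global average
    w = rng.standard_normal((7, 32)) * 0.01
    b = np.zeros(7)
    return w @ pooled + b                         # (7,) pose vector
```

The point of the sketch is the shape bookkeeping: whatever the layer sizes, the parameter count of such a network scales with the N × 64 input rather than with 480 × 640 × 3 pixels.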
FIGURE 5. The bar plot for the position error (left) and orientation error (right) of U-SURF (in brown) and SURF (in blue) for the 7 scenes dataset.
number of features for our SurfCNN. It illustrates that the relation between the number of features and the training time is nearly linear, with a minimum time of 15 minutes for one feature and a maximum time of 90 minutes for 300 features. Additionally, the inverse relationship between the average error and the training time is shown, where a maximum average error of 0.41 m is obtained with only 15 minutes of training and 0.28 m with 90 minutes. The training was done using an NVidia TITAN XP GPU and the Adam solver [53] with a learning rate of 0.0001. Using this visualization, we can choose the number of features needed for SurfCNN according to the target training time, the computational resource availability and the maximum allowable error. Overall, SurfCNN is superior to other work that uses the full images as input.

The median position/orientation errors for the U-SURF descriptors with feature sizes 300, 100, 50, 10 and 1, along with the state of the art previous work including PoseNet [19], G-Posenet [41], Posenet-U [40], Pose-L [43], BranchNet [42], Pose-H [52], VidLoc [54], RelocNet [55] and Mobile-PoseNet [56], for the 7 scenes dataset are shown in Table 3. As shown, with 300 U-SURF features, our system displays a 0.28 m position error and a 9.17° orientation error. Compared to the previous works, the accuracy of our model is comparable, but we have the advantage of adopting only half the number of parameters and a much smaller input dimension. Among all other models, Pose-H [52] demonstrates a better accuracy with a 5 cm error margin, whilst Pose-H employs an enormously complex network and needs to be trained separately in every scene. VidLoc achieves 0.25 m positional error but uses up to 400 frames, which is very hard to feed to the network with a frame size of 224 × 224 × 3 and requires very expensive computational power. Further, when the number of features is reduced to as low as one feature with a training time around 15 minutes, an average positional and orientation error of 0.41 m and 14.54° is achieved. This is better than PoseNet and Posenet-U, with an order-of-magnitude reduction in the input and the number of parameters. The error is acceptable in indoor environments while the input dimension is only 1 × 64. This opens the door for on-line training or training on embedded systems, as the agent with the camera mounted on it has to train every time it goes to a different scene.

Among all individual scenes, our system outperforms all the previous work in the heads and fire scenes. This is because the images in these scenes have distinctive features and a low field of view, as shown in Fig. 1. Consequently, as few as 200 features with input dimension 200 × 64 are needed to outperform all the previous work that used an input size of 224 × 224 × 3 after cropping the input images. In the stairs scene, the system has very close performance to [52]. By analyzing the histogram of the stairs scene in Fig. 2, it is found that the maximum number of features in this scene is 500, which makes choosing 300 features sufficient for learning. However, the chess, office and pumpkin scenes are wide view angle images and the number of features is large; therefore, 300 features are insufficient to represent the input. Nevertheless, in these 3 scenes, SurfCNN still has a median error that is acceptable in indoor localization and competitive with the other work, with a substantial reduction in input dimension and lower training time.

TABLE 3. The median error in position (m)/orientation (degrees) for the 7 scenes, with 5 input sizes 300 × 64, 100 × 64, 50 × 64, 10 × 64 and 1 × 64, compared with PoseNet [19], G-Posenet [41], Posenet-U [40], Pose-L [43], BranchNet [42], Pose-H [52], VidLoc [54], RelocNet [55] and Mobile-PoseNet [56].

We further demonstrate that our model is valid for datasets other than the 7 scenes dataset by comparing with PoseNet [19]. Firstly, we validate our system with the RGB-D SLAM Dataset and other public datasets. We also extend this work to our locally generated data using a robot mounted with multiple sensors including
camera, LIDAR, odometer and SONAR to collect RGB images with ground truth poses. As shown in Table 4, the median translational/rotational errors are consistent with the previous results on the 7 scenes dataset, and our system outperforms PoseNet in all scenes.

FIGURE 6. The training time (in blue) and the average error (in orange) versus the number of features.

TABLE 4. The median translational (m)/rotational (degrees) error for the RGB-D dataset (fr1/xyz, fr2/xyz, fr1/rpy and fr2/rpy) and our own measured data.

We further extend the comparison to traditional classical systems that do not depend on learning. Here, SurfCNN is compared with two state of the art systems, the feature based method ORB-SLAM [13] and the direct method LSD-SLAM [7]. As shown in Table 5, for the scenes from the ICL-NUIM dataset and the RGB-D SLAM benchmark, SurfCNN

TABLE 5. The median error of position (m) for the ICL-NUIM dataset (office room 0, 1 and 2) and the RGB-D dataset (RGBD-1: fr3/long office household), compared to SurfCNN.

V. CONCLUSION
We have implemented an image based indoor localization network (SurfCNN) that uses SURF descriptors to reduce the input dimension of the CNN. Taking advantage of both SURF descriptors and CNN, our network has a competitive performance with the state of the art without the need for a pretrained network. Given a sufficient number of features, SurfCNN reaches the same accuracy as the state of the art with only half the parameters and without a pretrained network. This advantage is essential in real time applications where memory size is limited, and in edge/cloud computing applications. The proposed approach is also versatile for other CNN related applications in addition to indoor localization.

REFERENCES
[1] N. E. Klepeis, W. C. Nelson, W. R. Ott, J. P. Robinson, A. M. Tsang, P. Switzer, J. V. Behar, S. C. Hern, and W. H. Engelmann, "The national human activity pattern survey (NHAPS): A resource for assessing exposure to environmental pollutants," J. Exposure Sci. Environ. Epidemiol., vol. 11, no. 3, pp. 231–252, Jul. 2001.
[2] A. Yassin, Y. Nasser, M. Awad, A. Al-Dubai, R. Liu, C. Yuen, R. Raulefs, and E. Aboutanios, "Recent advances in indoor localization: A survey on theoretical approaches and applications," IEEE Commun. Surveys Tuts., vol. 19, no. 2, pp. 1327–1346, 2nd Quart., 2017.
[3] S. Miura, L.-T. Hsu, F. Chen, and S. Kamijo, "GPS error correction with pseudorange evaluation using three-dimensional maps," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 6, pp. 3104–3115, Dec. 2015.
[4] H. Liu, H. Darabi, P. Banerjee, and J. Liu, "Survey of wireless indoor positioning techniques and systems," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 37, no. 6, pp. 1067–1080, Nov. 2007.
[5] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[6] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55–81, Jan. 2015.
[7] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 834–849.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[9] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, Jun. 2008.
[10] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[11] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Proc. Eur. Conf. Comput. Vis. Springer, 2010, pp. 778–792.
[12] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[13] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
[14] Y. Kalantidis, G. Tolias, E. Spyrou, P. Mylonas, and Y. Avrithis, "Visual image retrieval and localization," in Proc. 7th Int. Workshop Content-Based Multimedia Indexing, Athens, Greece, 2009, pp. 1–20.
[15] Y. LeCun et al., "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10. 1995, p. 1995.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2938–2946.
[20] G. Du, F. Su, and A. Cai, "Face recognition using SURF features," Proc. SPIE, vol. 7496, Oct. 2009, Art. no. 749628.
[21] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard, "Real-time 3D visual SLAM with a hand-held RGB-D camera," in Proc. RGB-D Workshop 3D Perception Robot. Eur. Robot. Forum, Västerås, Sweden, vol. 180, 2011, pp. 1–15.
[22] R. Chincha and Y. Tian, "Finding objects for blind people based on SURF features," in Proc. IEEE Int. Conf. Bioinf. Biomed. Workshops (BIBMW), Nov. 2011, pp. 526–527.
[23] E. Bayraktar and P. Boyraz, "Analysis of feature detector and descriptor combinations with a localization experiment for various performance metrics," 2017, arXiv:1710.06232. [Online]. Available: http://arxiv.org/abs/1710.06232
[24] L. Chen, F. Rottensteiner, and C. Heipke, "Invariant descriptor learning using a siamese convolutional neural network," in Proc. 23rd ISPRS Congr., Commission, vol. 3. Göttingen, Germany: Copernicus GmbH, 2016, pp. 11–18.
[25] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, "Discriminative learning of deep convolutional feature point descriptors," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 118–126.
[26] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4353–4361.
[27] J. A. Hertz, Introduction to the Theory of Neural Computation. Boca Raton, FL, USA: CRC Press, 2018.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[29] K. Lin, J. Lu, C.-S. Chen, and J. Zhou, "Learning compact binary descriptors with unsupervised deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1183–1192.
[30] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[31] R. I. Hartley, "In defense of the eight-point algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 6, pp. 580–593, Jun. 1997.
[32] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[33] T. Sattler, B. Leibe, and L. Kobbelt, "Fast image-based localization using direct 2D-to-3D matching," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 667–674.
[34] T. Sattler, B. Leibe, and L. Kobbelt, "Efficient & effective prioritized matching for large-scale image-based localization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1744–1756, Sep. 2017.
[35] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Upper Saddle River, NJ, USA: Prentice-Hall, 2011.
[36] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys, "Hyperpoints and fine vocabularies for large-scale location recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2102–2110.
[37] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1717–1724.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[39] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5974–5983.
[40] A. Kendall and R. Cipolla, "Modelling uncertainty in deep learning for camera relocalization," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 4762–4769.
[41] M. Cai, C. Shen, and I. Reid, "A hybrid probabilistic model for camera relocalization," in Proc. Brit. Mach. Vis. Conf., 2018, p. 8.
[42] J. Wu, L. Ma, and X. Hu, "Delving deeper into convolutional neural networks for camera relocalization," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 5644–5651.
[43] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-based localization using LSTMs for structured feature correlation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 627–637.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[45] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, "Real-time RGB-D camera relocalization," in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Oct. 2013, pp. 173–179.
[46] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[47] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2014, pp. 1524–1531.
[48] C. F. Chen and C. H. Hsiao, "Haar wavelet method for solving lumped and distributed-parameter systems," IEE Proc.-Control Theory Appl., vol. 144, no. 1, pp. 87–94, Jan. 1997.
[49] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[50] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jan. 2014.
[51] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," 2016, arXiv:1605.07678. [Online]. Available: http://arxiv.org/abs/1605.07678
[52] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, "Image-based localization using hourglass networks," 2017, arXiv:1703.07971. [Online]. Available: http://arxiv.org/abs/1703.07971
[53] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[54] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, "VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6856–6864.
[55] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, "Camera relocalization by computing pairwise relative poses using convolutional neural network," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 929–938.
[56] C. Cimarelli, D. Cazzato, M. A. Olivares-Mendez, and H. Voos, "Faster visual-based localization with mobile-PoseNet," in Proc. Int. Conf. Comput. Anal. Images Patterns. Springer, 2019, pp. 219–230.
AHMED M. ELMOOGY received the B.Sc. degree in electronics and telecommunication engineering from Mansoura University, Egypt, in 2012. He is currently pursuing the Ph.D. degree in electrical engineering with the University of Victoria, Victoria, Canada. He worked as a Research and Teaching Assistant at Mansoura University, Egypt, from 2012 to 2016. His current research interests include simultaneous localization and mapping (SLAM) using camera and LIDAR data with different machine learning algorithms.

TAO LU received the B.Sc. degree from the Department of Physics, University of Manitoba, in 1995, the M.Sc. degree from the Department of Electrical and Computer Engineering, Queen's University at Kingston, in 1998, and the Ph.D. degree from the Department of Physics, University of Waterloo, in 2005. He has worked in industry with various companies, including Nortel Networks, Kymata Canada, and Peleton, on optical communications. Before joining UVic, he was a Postdoctoral Fellow with the Department of Applied Physics, California Institute of Technology, from 2006 to 2008. His research interests include optical microcavities and their applications to ultra-narrow linewidth laser sources, and bio and nano photonics. His current research includes machine learning algorithms with applications to spectral analysis, the Internet of Things, and indoor localization.