
Received February 28, 2020, accepted March 6, 2020, date of publication March 24, 2020, date of current version April 8, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2981620

SurfCNN: A Descriptor Accelerated Convolutional Neural Network for Image-Based Indoor Localization

AHMED M. ELMOOGY1, XIAODAI DONG1, (Senior Member, IEEE), TAO LU1, ROBERT WESTENDORP2, AND KISHORE REDDY TARIMALA2


1 Electrical and Computer Engineering, University of Victoria, Victoria, BC V8P 5C2, Canada
2 Fortinet, Burnaby, BC V5C 6C6, Canada
Corresponding authors: Xiaodai Dong (xdong@ece.uvic.ca) and Tao Lu (taolu@ece.uvic.ca)
This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) Collaborative Research and Development Grant under Grant CRDPJ-52098-2017, in part by the NSERC Discovery Grant under Grant RGPIN-2015-06515, and in part by the NVidia Corporation TITAN-X GPU Grant.

ABSTRACT Convolutional neural network (CNN) is a powerful tool for many data applications. However, its high-dimensional nature, large network size, high computational complexity, and the need for a large amount of training data make it challenging to use in edge computing applications, which are becoming increasingly popular, relevant and important. In this paper, we propose a descriptor based approach to accelerate convolutional neural network training and to reduce the input dimension and network size, which greatly facilitates the use of CNN for edge computing and even cloud computing. By using image descriptors to extract features from the original images, we obtain a simpler convolutional neural network with fast training speed, low memory usage and outstanding accuracy, without the need for a pre-trained network as opposed to the state of the art. In indoor localization, our SURF descriptor accelerated CNN (SurfCNN) can reach an average position error of 0.28 m and an orientation error of 9.2°. Compared to the conventional CNN that uses the original images as input, our algorithm reduces the dimension of the input features by a factor of 48 without impairing the accuracy. Further, at an extreme feature reduction of 14,400 times, our model still retains an average position error of 0.41 m and an orientation error of 14°.

INDEX TERMS CNN, localization, SURF, SLAM.

The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.

I. INTRODUCTION
Internet of things (IoT) has found widespread use in different industries by integrating computing, communications and control functions into a network formed by sensor nodes, edge devices and remote servers. The raw data collected by sensors are eventually sent to cloud servers with or without processing at the sensor and edge nodes. The distribution of computing to IoT network components is tailored based on the capacity of communications links, the computing resources at each level, and the delay requirement of the applications. As more intelligence is being built into IoT, neural networks play an important role in data processing. Among various deep learning tools, convolutional neural networks (CNN) are capable of extracting features from high dimensional data such as images, videos, etc., that suit application specific tasks. Such a neural network, however, requires a high dimensional optimization procedure in which the training time is significantly longer when the input dimension is large. This poses significant challenges to communications links that transport a large amount of raw data to a cloud server for CNN training and inference. If inference is shifted to edge and/or sensor nodes, the computing resources are usually very limited in comparison to the CNN computational complexity. Therefore, ways to reduce the CNN dimension and complexity are necessary to facilitate the use of CNN in IoT applications. It is well known that image descriptors extract features from images through deterministic means that are orders of magnitude faster than CNN. The drawback of a descriptor is that the output feature size is usually large compared to those from CNN, as most of the image information is retained during extraction regardless of whether it is needed for the target application. Inspired by the characteristics of CNN and image descriptors, here, through the demonstration


from an indoor localization application, we combine both technologies by first using an image descriptor to extract features from images. The feature set, which has significantly reduced dimension compared to the images, is input to a CNN to extract more useful features with further reduced dimension. The combined techniques result in significantly fewer parameters in the CNN and reduced training and testing time, which substantially lowers the required communications and computing resources and the overall latency.

People normally spend over 87% of their daily lives indoors [1]. Consequently, indoor localization is important to many applications ranging from indoor navigation to virtual reality [2]. Unlike outdoor localization where the global positioning system (GPS) is prevalent, indoor localization relies on readings from sensors such as the received signal strength indicator (RSSI), light detection and ranging (LIDAR), the inertial measurement unit (IMU) and images from cameras, as GPS signals are too weak [3]. In particular, image based localization is popular in many fields including robotics [4] and simultaneous localization and mapping (SLAM) [5] for its high accuracy and low cost.

Image based localization methods can be categorized into two sub-categories: feature based and direct methods [6]. In direct methods, all the image pixels are used in the location estimation process, making it computationally expensive despite its capability of generating dense maps [7]. On the other hand, feature based methods extract features from the images and estimate the location using these features. Consequently, they are much faster than direct methods. For feature based methods, there have been various feature detectors for feature extraction, including the scale-invariant feature transform (SIFT) [8], sped up robust feature (SURF) [9], features from accelerated segment test (FAST) [10], binary robust independent elementary features (BRIEF) [11] and oriented FAST and rotated BRIEF (ORB) [12], etc. A typical procedure for feature based localization is as follows: features are first extracted using one of these algorithms. Then, descriptors are estimated to identify the area around each feature for matching. Following that, either the descriptors of consecutive images are matched to find the relative pose (position and orientation) between them [13], or each image's descriptors are matched against a large database of key frames or landmarks with known locations to identify the closest match and infer the current pose of the camera [14].

Although good performance has been achieved by these classical methods, they have drawbacks. For example, the accuracy of all these methods relies on a good estimate of the initial pose. In addition, the pose accuracy is determined by the matching between descriptors. Therefore, wrong correspondences can accumulate errors. Finally, a large database is required for image retrieval methods, with a size proportional to the area of the scene.

Recently, the convolutional neural network (CNN) [15], which is capable of extracting features from high dimensional data including images, videos, etc., has been a popular choice for indoor localization. Multiple CNN architectures including AlexNet [16], GoogleNet [17] and ResNet [18] achieve state of the art performance in multiple applications. For image based indoor localization, it is typically implemented by training an end to end architecture to learn the global pose of a single image with respect to a known reference frame [19]. However, CNN requires a high dimensional optimization procedure in which the training time is long when the input size is large. Furthermore, the CNN needs to be re-trained or fine tuned when the testing scene is significantly different from the training scene.

To reduce the complexity of CNN, we propose, for the first time to the authors' knowledge, to use the image features' descriptors instead of the image itself as input to the CNN to learn the global image pose. Among all feature descriptors and detectors, SURF descriptors are used extensively in computer vision applications such as face recognition [20], visual simultaneous localization and mapping (SLAM) [21] and object detection [22]. In addition, [23] demonstrates that the SURF descriptor reaches the highest accuracy in image matching for indoor localization. Typically, SURF can convert an image with around 1 million pixels into a feature set with fewer than 20 thousand values by taking the strongest 300 features' descriptors, sorted by the corresponding Hessian threshold. This significantly reduces the data dimension without noticeable loss of image information. The proposed method aims to combine the advantages of both the classical methods and the direct CNN method while mitigating their disadvantages. Our system does not require an initial pose, the correspondences between descriptors, or a large database. Moreover, it reduces the complexity of the direct CNN by reducing the input dimensions to the CNN for model simplification, easier training and re-training in different environments, and faster inference. This greatly facilitates its use in edge and cloud computing with less demanding data storage, transmission and computation requirements.

II. RELATED WORK

A. DESCRIPTORS RELATED CNN MODEL
CNN has been used to extract the descriptors of an input image, which is either a full image or a pair of non corresponding patches of the image. For example, a Siamese CNN is proposed in [24] to learn the descriptors of small patches of images to achieve patch matching at a 95% recall rate. The same approach is employed in [25] but using pairs of non corresponding patches, where the network learns 128-D descriptors whose Euclidean distances reflect patch similarity. Reference [26] uses a similar idea, learning a similarity score from pairs of patches. Further, unsupervised learning [27] with the VGG-16 network architecture [28] is introduced in [29] to learn compact binary descriptors for efficient visual object matching.

B. IMAGE-BASED LOCALIZATION
Using images for localization is a part of visual SLAM [6]. In the past, most classical visual SLAM systems used either


handcrafted local features (feature-based methods) or the whole image without extracting any features (direct methods). In feature-based methods, [13], [30] employ ORB descriptor matching to find the correspondences between consecutive frames and then use the 8-point algorithm [31] with bundle adjustment [32] to find the relative pose between frames. Other techniques [33], [34] rely on finding the 2D-3D correspondences via descriptor matching between the 2D image and a 3D model built using, e.g., structure from motion (SfM) [35].

The work in [7] is a direct method that uses the full image information. It provides a denser map without significantly affecting the pose estimation compared to [13]. In addition, [36] uses image retrieval techniques to search for the similarity between the current image and images in the database. Consequently, the position at which the current image was taken can be estimated.

With the boom of deep learning, neural networks are used in visual SLAM and image based localization by mapping directly from the single image space to the absolute position, while circumventing challenges in the classical methods. For example, transfer learning [37] and the pretrained GoogLeNet [38] are employed in [19] to regress the pose of the camera. In later publications, the same authors improve the accuracy by modifying the loss function according to the geometry, the relation between position and orientation [39], or the uncertainty calculation [40]. Further, [41] generates uncertainty for the camera poses using Gaussian process regressors. Several orientation representations and data augmentation are introduced in [42], along with a complex network with shared convolutions to learn the position and orientation. Recently, [43] adds long short-term memory (LSTM) [44] to a similar model to memorize good features. In [18], a pretrained ResNet-34 is applied for regressing the camera pose. It adopts an encoder-decoder design and uses skip connections to move the features from the early layers to the output layers.

Despite the outstanding performance of the neural network-based systems, the training of these networks is time consuming. The complexity of training increases if the model needs to be fine tuned when the testing images are taken in completely different environments from those in which the training images were taken. To reduce the training complexity, we propose a novel technique to reduce the input dimensions and consequently the parameter size of the CNN.

III. NETWORK MODEL
In this section, we apply our model to the image based indoor localization application. The task is that, given an image I taken by a camera with unknown intrinsic parameters and its corresponding SURF descriptor vector D, the network learns the camera pose in the form of the global Cartesian position [x, y, z]T and the orientation in quaternion form [qw, qx, qy, qz]T. Note that, in contrast to the rotation matrix and Euler angles, the quaternion is a stable form to represent orientation. Consequently, both position and orientation can be combined to form the output pose vector P = [x, y, z, qw, qx, qy, qz]T of size 7 for the camera pose.

A. DATASET
Since our system is designed for working in indoor environments, the training and testing are performed on 3 well known indoor datasets:
1) Microsoft RGB-D 7 scenes dataset [45]: It is composed of 7 scenes with constantly changing views and varying camera heights, as seen in Fig. 1. It contains both the RGB image and the depth map with the corresponding pose.
2) RGB-D SLAM Dataset and Benchmark [46]: It is a benchmark for the evaluation of visual odometry and visual SLAM systems. It contains RGB and depth images with the ground truth trajectory. We choose to work with 4 scenes, each of which provides sequences for training with ground truth poses and others for testing without ground truth positions. An online tool is available for evaluation on the testing scenes.
3) The ICL-NUIM dataset [47]: It is a recently developed benchmark, similar to the RGB-D SLAM benchmark dataset, for RGB-D, visual odometry and SLAM algorithms. We work on the office room scene with 4 sequences, one sequence for training and the other 3 for testing.

B. SURF DESCRIPTORS
The SURF algorithm [9] starts with computing the interest points or features. To find the correspondences between these features, a description of the area around each feature is computed for robust matching. This is called the descriptor. In the original SURF implementation, there are two ways to find the descriptor of a detected feature, depending on its rotational invariance. To make the descriptors invariant to rotations, the dominant orientation of a neighbourhood around the feature is calculated using the Haar-wavelet transform in both the x and y directions. Subsequently, the area around the feature is divided into a 4 × 4 grid along the dominant orientation. As opposed to its computationally expensive counterpart SIFT [8], the Haar-wavelet transform [48] is obtained for both the vertical and horizontal directions. The absolute values of all vertical and horizontal responses are summed to form the descriptor vector of all the sub-regions around the feature, with size 64. In the case where the orientation information needs to be maintained within the SURF descriptor, the dominant orientation calculation step is skipped and the resulting vector is called the Upright SURF (U-SURF) descriptor. In our models, we use both SURF and U-SURF descriptors and demonstrate that using U-SURF does enhance the orientation estimation compared to the ordinary SURF descriptor.

Fig. 2 displays the histogram of features for images of the 7 scenes dataset. As shown, the number of features ranges from 100 to 4000. To make the input dimension feasible, we choose to work with at most the 300 strongest features. Here, each feature has 64 values and the maximum input


FIGURE 1. Illustrations of image feature extraction using the SURF descriptor. Red dots represent the locations of (from left to right) 10, 50, 100 and 300 SURF features of (from top to bottom) the office, stairs, fire, pumpkin, heads and chess scenes.

vector size is 300 × 64. Compared to the original image of 480 × 640 × 3 pixels, the use of the descriptor reduces the CNN input size by at least a factor of 48. Further, the locations of the 300, 100, 50 and 10 most important SURF features are shown in Fig. 1 as red dots. It is seen that the SURF algorithm focuses on feature points such as edges and corners, making them good representatives of the images that uniquely describe every feature's surrounding area. Consequently, the majority of the information in the image is retained in the SURF descriptors.
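As a concrete illustration of this input pipeline, the following is a minimal Python sketch of how a fixed-size descriptor matrix could be produced with OpenCV's contrib SURF implementation. The 300-feature limit, the 64-value descriptor length, the Hessian-based ranking and the upright (U-SURF) option follow the description above; the function name, the Hessian threshold of 400 and the zero-padding of images that yield fewer than 300 features are illustrative assumptions, not the authors' released code.

import cv2
import numpy as np

def surf_descriptor_input(image_path, n_features=300, upright=True):
    # upright=True gives U-SURF (no dominant-orientation step);
    # upright=False gives the rotation-invariant SURF descriptor.
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # extended=False keeps the 64-dimensional descriptor used in this work.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400,
                                       extended=False, upright=upright)
    keypoints, descriptors = surf.detectAndCompute(gray, None)
    out = np.zeros((n_features, 64), dtype=np.float32)
    if descriptors is None:
        return out
    # Rank features by their Hessian response and keep the strongest ones.
    order = np.argsort([-kp.response for kp in keypoints])[:n_features]
    out[:len(order)] = descriptors[order]
    # A 300 x 64 matrix is 48 times smaller than a 480 x 640 x 3 image;
    # a single 1 x 64 descriptor corresponds to the 14,400-fold reduction.
    return out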


FIGURE 2. The histogram of features for Stairs, Office, Fire, Chess, Heads and Pumpkin scenes.
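The per-image feature counts summarized in Fig. 2 can be collected directly from the same detector. Below is a small sketch, assuming the OpenCV contrib SURF module and a hypothetical 7-Scenes style directory layout, neither of which is specified by the paper.

import cv2
import glob
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400, extended=False)  # assumed threshold
counts = []
for path in sorted(glob.glob("chess/seq-01/*.color.png")):  # hypothetical file layout
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    counts.append(len(surf.detect(img, None)))
# The spread of these counts motivates capping the input at 300 features.
print(np.min(counts), int(np.median(counts)), np.max(counts))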

C. THE LOSS FUNCTION
We choose to represent the output pose as one 7 dimensional vector instead of having two outputs for position and orientation as in [19]. Having one output vector makes it easy to train using one loss function, instead of having two parts of the loss function with a weight factor between them that needs to be tuned for every scene. Consequently, the objective function is

loss(D) = ||P̂ − P||₂    (1)

where D is the image descriptor vector, P̂ is the predicted pose, and P is the ground truth pose.

FIGURE 3. The architecture of SurfCNN.

D. ARCHITECTURE
Our model is composed of as few as 7 layers as shown in Fig. 3. Here, 5 convolutional layers with max pooling layers are adopted to reduce the dimensions of the layers' outputs, together with the ReLU activation function and batch normalization [49], resulting in faster learning. The first convolutional layer has strides of (2,2), and the remaining 4 layers have strides of (1,1).
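For concreteness, the following is a minimal sketch of such a model and of the loss in Eq. (1), written with tf.keras; the framework choice is an assumption, as the paper does not name one. The 7 × 7, 5 × 5 and 3 × 3 kernels, the 64/128/256 filter counts, the (2,2) stride in the first layer, the pooling and batch normalization, the 2,048-neuron FC layer with 0.5 dropout and the 7-neuron linear output follow this subsection, the remainder of the architecture description below and Table 1; the filter counts of the last two convolutional layers, the treatment of the 300 × 64 descriptor matrix as a one-channel map, and the exact pooling sizes are assumptions. The Adam optimizer with learning rate 0.0001 is the setting reported in Section IV.

import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters, kernel, strides=(1, 1)):
    # Convolution + batch normalization + ReLU + max pooling, as described above.
    x = layers.Conv2D(filters, kernel, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(pool_size=(2, 2), padding="same")(x)

def build_surfcnn(n_features=300, desc_len=64):
    inp = layers.Input(shape=(n_features, desc_len, 1))  # descriptor matrix as a 1-channel map
    x = conv_block(inp, 64, (7, 7), strides=(2, 2))      # first layer: 7 x 7, stride (2,2)
    x = conv_block(x, 128, (5, 5))
    x = conv_block(x, 256, (3, 3))
    x = conv_block(x, 256, (3, 3))                       # filter count assumed
    x = conv_block(x, 256, (3, 3))                       # filter count assumed
    x = layers.Flatten()(x)
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(7, activation="linear")(x)        # [x, y, z, qw, qx, qy, qz]
    return models.Model(inp, out)

def pose_loss(y_true, y_pred):
    # Eq. (1): Euclidean norm of the 7-dimensional pose error.
    return tf.norm(y_pred - y_true, axis=-1)

model = build_surfcnn()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=pose_loss)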

TABLE 1. Dimensions of the CNN layers, where N is the number of SURF features.

TABLE 2. The number of layers for the current work SurfCNN with 300 × 64 input features, the average number of parameters, and the median error in position, compared with PoseNet [19], Pose-LSTM [43] and Pose-Hourglass [52].

FIGURE 4. The effect of increasing the number of convolutional layers on the positional error for the Heads scene of the 7 scenes dataset.

They are followed by a fully connected (FC) layer of 2,048 neurons with the ReLU activation function. To prevent over-fitting, dropout with probability of 0.5 [50] is adopted. The output layer has 7 neurons with linear activation for the pose [x, y, z, qw, qx, qy, qz]T. The details of the architecture are shown in Table 1. The hyper-parameter tuning procedure used to arrive at the architecture is summarized as follows. We begin the first layer with filter size 7 × 7 and then decrease it subsequently to 5 × 5 and 3 × 3 as the dimension of the feature map decreases along the network. The number of filters in the first layer is 64 and is then increased to 128 and 256 in the subsequent layers, following the well known CNN architectures [51]. The number of layers is chosen based on experiments on the positional error. Moreover, the trade-off between the number of convolutional layers and the positional error for the Heads scene of the 7 scenes dataset is shown in Fig. 4. As shown, increasing the number of convolutional layers up to 5 layers leads to a reduction of the error to 0.17 m; however, the error then starts to increase with the addition of more layers, reaching 0.65 m with 10 convolutional layers. One reason for the error increase is that every convolutional layer has a pooling layer which downsamples the feature maps, so with 10 convolutional layers the input shrinks to a size where the important features are lost. Moreover, using SURF descriptors with low dimensions helps reduce the number of layers compared to other CNN-based algorithms that use pretrained networks, as the descriptors represent the features extracted from the image, which makes the learning process easier than using the raw images. Unlike the previous works [19], [40], [41], our system does not require any pre-processing such as cropping or subtracting the mean from the images. The extraction of the SURF descriptors shown in Fig. 3 only takes 0.9 seconds per image on average and can be done off-line or in real time.

IV. PERFORMANCE ANALYSIS
To compare the SURF and the U-SURF descriptors, we plot the position and orientation errors for the 7 scenes dataset as a bar plot in Fig. 5. As shown, the U-SURF descriptors, having the orientation information of the detected features, give more precise orientation estimates than the SURF descriptors, with an average error margin of 6°. In addition, using U-SURF enhances the position estimation, not only to the same extent as the orientation but also with an average error difference of 4.5 cm. Evidently, using the U-SURF descriptors is better than using the SURF descriptors for the localization application, where the orientation of the features plays an important role. As U-SURF descriptors are rotationally variant, every rotation in the image will give different descriptors depending on the orientation of the area around the feature, which benefits localizing the image and especially finding the orientation of the image, as shown in Fig. 5.

The main advantage of our SURF accelerated CNN (SurfCNN) is the increase in the speed of training and testing and the reduction in the memory requirements for storing both the input data and the network parameters, which accordingly lowers the amount of data for transmission in edge computing and cloud computing applications. Regarding the input data, we reduce the input image dimensions from 480 × 640 × 3 to descriptor vectors with sizes ranging from 300 × 64 to 1 × 64, i.e., a reduction factor ranging from 48 to 14,400. This substantial reduction is critical for low memory systems as the SURF descriptor extraction process can be done efficiently. Further, Table 2 shows the number of layers and the number of learning parameters of our SurfCNN with feature sizes ranging from 1 to 300 in comparison with the previous work. As shown, SurfCNN reduces the number of layers to 7, compared to 24 or more layers in the previous work. In particular, a reduction factor of 1.6 or more is achieved for the number of learning parameters when all 300 features are adopted, and 6.2 or more when as few as 1 feature is used. Furthermore, in previous works the pre-trained network may take weeks to train, while our model does not require pre-training.

Fig. 6 shows the trade-off between training time and average position error for the 7 scenes dataset with various


FIGURE 5. The bar plot for the position error (left) and orientation error (right) of U-SURF (in brown) and SURF (in blue) for the
7 scenes dataset.

numbers of features for our SurfCNN. It illustrates that the relation between the number of features and the training time is nearly linear, with a minimum time of 15 minutes when using 1 feature and a maximum time of 90 minutes for 300 features. Additionally, the inverse relationship between the average error and the training time is shown, where a maximum average error of 0.41 m with only 15 minutes of training time and 0.28 m for 90 minutes are obtained. The training was done using an NVidia TITAN XP GPU and the Adam solver [53] with a learning rate of 0.0001. Using this visualization, we can choose the number of features needed for SurfCNN according to the target training time, the computational resource availability and the maximum allowable error. Overall, SurfCNN is superior to other work that uses the full images as input.

The median position/orientation errors for the U-SURF descriptors with feature sizes 300, 100, 50, 10 and 1, along with the state of the art previous work including PoseNet [19], G-Posenet [41], Posenet-U [40], Pose-L [43], BranchNet [42], Pose-H [52], VidLoc [54], RelocNet [55] and Mobile-PoseNet [56], for the 7 scenes dataset are shown in Table 3. As shown, with 300 U-SURF features, our system displays 0.28 m position error and 9.17° orientation error. Compared to the previous works, the accuracy of our model is comparable, but we have the advantage of adopting only half the number of parameters and a much smaller input dimension. Among all other models, Pose-H [52] demonstrates a better accuracy with a 5 cm error margin, whilst Pose-H employs an enormously complex network and needs to be trained separately in every scene. VidLoc achieves 0.25 m positional error but uses up to 400 frames, which is very hard to feed to the network with a frame size of 224 × 224 × 3 and requires very expensive computational power. Further, when the number of features is reduced to as low as 1 feature with a training time of around 15 minutes, an average positional and orientation error of 0.41 m and 14.54° is achieved. This is better than PoseNet and Posenet-U with an order-of-magnitude reduction in the input and the number of parameters. The error is acceptable in indoor environments while the input dimension is only 1 × 64. This opens the door for on-line training or training on embedded systems, as the agent with the camera mounted on it has to train every time it goes to a different scene.

Among all individual scenes, our system outperforms all the previous work in the Heads and Fire scenes. This is because the images in these scenes have distinctive features and a low field of view, as shown in Fig. 1. Consequently, as few as 200 features with input dimension 200 × 64 are needed to outperform all the previous work that used an input size of 224 × 224 × 3 after cropping the input images. In the Stairs scene, the system has very close performance to [52]. By analyzing the histogram of the Stairs scene in Fig. 2, it is found that the maximum number of features in this scene is 500, which makes choosing 300 features sufficient for learning. However, the Chess, Office and Pumpkin scenes are wide view angle images and the number of features is large. Therefore, 300 features are insufficient to represent the input. Nevertheless, in these 3 scenes, SurfCNN still has a median error that is acceptable for indoor localization and competitive with the other work, with a substantial reduction in input dimension and lower training time.

We further demonstrate that our model is valid for datasets other than the 7 scenes dataset by comparing with PoseNet [19]. Firstly, we validate our system with the RGB-D SLAM Dataset and other public datasets.

We also extend this work to our locally generated data using a robot mounted with multiple sensors including


camera, LIDAR, odometer and SONAR to collect RGB images with ground truth poses. As shown in Table 4, the median translational/rotational errors are consistent with the previous results on the 7 scenes dataset, and our system outperforms PoseNet in all scenes.

We further extend the comparison to traditional classical systems that do not depend on learning. Here, SurfCNN is compared with two state of the art systems, the feature based method ORB-SLAM [13] and the direct method LSD-SLAM [7]. As shown in Table 5, for the scenes from the ICL-NUIM dataset and the RGB-D SLAM benchmark, SurfCNN on average outperforms both ORB-SLAM and LSD-SLAM. This is also in agreement with the comparison with the deep learning-based methods. Moreover, SurfCNN can still work with a low number of features as shown in Table 5, while other feature matching methods like ORB-SLAM will fail to match images with a low number of features. Also, SurfCNN uses a sparse representation of the image, which is faster than minimizing the photometric error over all the image pixels as done by LSD-SLAM.

TABLE 3. The median error in position (m)/orientation (degrees) for the 7 scenes, with 5 input sizes 300 × 64, 100 × 64, 50 × 64, 10 × 64 and 1 × 64, compared with PoseNet [19], G-Posenet [41], Posenet-U [40], Pose-L [43], BranchNet [42], Pose-H [52], VidLoc [54], RelocNet [55] and Mobile-PoseNet [56].

FIGURE 6. The training time (in blue) and the average error (in orange) versus the number of features.

TABLE 4. The median translational (m)/rotational (degrees) error for the RGB-D dataset (fr1/xyz, fr2/xyz, fr1/rpy and fr2/rpy) and our own measured data.

TABLE 5. The median error of position (m) for the ICL-NUIM dataset (office room 0, 1 and 2) and the RGB-D dataset (RGBD-1: fr3/long office household) compared to SurfCNN.

V. CONCLUSION
We have implemented an image based indoor localization system (SurfCNN) that uses SURF descriptors to reduce the input dimension of the CNN. Taking advantage of both SURF descriptors and CNN, our network has a competitive performance with the state of the art without the need for a pretrained network. Given a sufficient number of features, SurfCNN reaches the same accuracy as the state of the art with only half the parameters and without a pretrained network. This advantage is essential in real time applications where memory size is limited and in edge/cloud computing applications. The proposed approach is also versatile for other CNN related applications in addition to indoor localization.

REFERENCES
[1] N. E. Klepeis, W. C. Nelson, W. R. Ott, J. P. Robinson, A. M. Tsang, P. Switzer, J. V. Behar, S. C. Hern, and W. H. Engelmann, "The national human activity pattern survey (NHAPS): A resource for assessing exposure to environmental pollutants," J. Exposure Sci. Environ. Epidemiol., vol. 11, no. 3, pp. 231–252, Jul. 2001.
[2] A. Yassin, Y. Nasser, M. Awad, A. Al-Dubai, R. Liu, C. Yuen, R. Raulefs, and E. Aboutanios, "Recent advances in indoor localization: A survey on theoretical approaches and applications," IEEE Commun. Surveys Tuts., vol. 19, no. 2, pp. 1327–1346, 2nd Quart., 2017.
[3] S. Miura, L.-T. Hsu, F. Chen, and S. Kamijo, "GPS error correction with pseudorange evaluation using three-dimensional maps," IEEE Trans. Intell. Transp. Syst., vol. 16, no. 6, pp. 3104–3115, Dec. 2015.
[4] H. Liu, H. Darabi, P. Banerjee, and J. Liu, "Survey of wireless indoor positioning techniques and systems," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 37, no. 6, pp. 1067–1080, Nov. 2007.
[5] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[6] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55–81, Jan. 2015.
[7] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 834–849.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.


[9] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Understand., vol. 110, no. 3, pp. 346–359, Jun. 2008.
[10] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 8, pp. 1532–1545, Aug. 2014.
[11] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in Proc. Eur. Conf. Comput. Vis. Springer, 2010, pp. 778–792.
[12] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[13] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
[14] Y. Kalantidis, G. Tolias, E. Spyrou, P. Mylonas, and Y. Avrithis, "Visual image retrieval and localization," in Proc. 7th Int. Workshop Content-Based Multimedia Indexing, Athens, Greece, 2009, pp. 1–20.
[15] Y. LeCun et al., "Convolutional networks for images, speech, and time series," in The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10. 1995, p. 1995.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[17] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[19] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2938–2946.
[20] G. Du, F. Su, and A. Cai, "Face recognition using SURF features," Proc. SPIE, vol. 7496, Oct. 2009, Art. no. 749628.
[21] N. Engelhard, F. Endres, J. Hess, J. Sturm, and W. Burgard, "Real-time 3D visual SLAM with a hand-held RGB-D camera," in Proc. RGB-D Workshop 3D Perception Robot. Eur. Robot. Forum, Västerås, Sweden, vol. 180, 2011, pp. 1–15.
[22] R. Chincha and Y. Tian, "Finding objects for blind people based on SURF features," in Proc. IEEE Int. Conf. Bioinf. Biomed. Workshops (BIBMW), Nov. 2011, pp. 526–527.
[23] E. Bayraktar and P. Boyraz, "Analysis of feature detector and descriptor combinations with a localization experiment for various performance metrics," 2017, arXiv:1710.06232. [Online]. Available: http://arxiv.org/abs/1710.06232
[24] L. Chen, F. Rottensteiner, and C. Heipke, "Invariant descriptor learning using a siamese convolutional neural network," in Proc. 23rd ISPRS Congr., Commission, vol. 3. Göttingen, Germany: Copernicus GmbH, 2016, pp. 11–18.
[25] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer, "Discriminative learning of deep convolutional feature point descriptors," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 118–126.
[26] S. Zagoruyko and N. Komodakis, "Learning to compare image patches via convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4353–4361.
[27] J. A. Hertz, Introduction to the Theory of Neural Computation. Boca Raton, FL, USA: CRC Press, 2018.
[28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[29] K. Lin, J. Lu, C.-S. Chen, and J. Zhou, "Learning compact binary descriptors with unsupervised deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1183–1192.
[30] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[31] R. I. Hartley, "In defense of the eight-point algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 6, pp. 580–593, Jun. 1997.
[32] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[33] T. Sattler, B. Leibe, and L. Kobbelt, "Fast image-based localization using direct 2D-to-3D matching," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 667–674.
[34] T. Sattler, B. Leibe, and L. Kobbelt, "Efficient & effective prioritized matching for large-scale image-based localization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1744–1756, Sep. 2017.
[35] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Upper Saddle River, NJ, USA: Prentice-Hall, 2011.
[36] T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys, "Hyperpoints and fine vocabularies for large-scale location recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2102–2110.
[37] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1717–1724.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[39] A. Kendall and R. Cipolla, "Geometric loss functions for camera pose regression with deep learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5974–5983.
[40] A. Kendall and R. Cipolla, "Modelling uncertainty in deep learning for camera relocalization," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2016, pp. 4762–4769.
[41] M. Cai, C. Shen, and I. Reid, "A hybrid probabilistic model for camera relocalization," in Proc. Brit. Mach. Vis. Conf., 2018, p. 8.
[42] J. Wu, L. Ma, and X. Hu, "Delving deeper into convolutional neural networks for camera relocalization," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2017, pp. 5644–5651.
[43] F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers, "Image-based localization using LSTMs for structured feature correlation," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 627–637.
[44] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[45] B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, "Real-time RGB-D camera relocalization," in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Oct. 2013, pp. 173–179.
[46] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[47] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2014, pp. 1524–1531.
[48] C. F. Chen and C. H. Hsiao, "Haar wavelet method for solving lumped and distributed-parameter systems," IEE Proc.-Control Theory Appl., vol. 144, no. 1, pp. 87–94, Jan. 1997.
[49] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[50] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, Jan. 2014.
[51] A. Canziani, A. Paszke, and E. Culurciello, "An analysis of deep neural network models for practical applications," 2016, arXiv:1605.07678. [Online]. Available: http://arxiv.org/abs/1605.07678
[52] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu, "Image-based localization using hourglass networks," 2017, arXiv:1703.07971. [Online]. Available: http://arxiv.org/abs/1703.07971
[53] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980. [Online]. Available: http://arxiv.org/abs/1412.6980
[54] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, "VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6856–6864.
[55] Z. Laskar, I. Melekhov, S. Kalia, and J. Kannala, "Camera relocalization by computing pairwise relative poses using convolutional neural network," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 929–938.
[56] C. Cimarelli, D. Cazzato, M. A. Olivares-Mendez, and H. Voos, "Faster visual-based localization with mobile-PoseNet," in Proc. Int. Conf. Comput. Anal. Images Patterns. Springer, 2019, pp. 219–230.


AHMED M. ELMOOGY received the B.Sc. degree in electronics and telecommunication engineering from Mansoura University, Egypt, in 2012. He is currently pursuing the Ph.D. degree in electrical engineering with the University of Victoria, Victoria, Canada. He worked as a Research and Teaching Assistant at Mansoura University, Egypt, from 2012 to 2016. His current research interests include simultaneous localization and mapping (SLAM) using camera and LIDAR data with different machine learning algorithms.

XIAODAI DONG (Senior Member, IEEE) received the B.Sc. degree in information and control engineering from Xi'an Jiaotong University, China, in 1992, the M.Sc. degree in electrical engineering from the National University of Singapore, in 1995, and the Ph.D. degree in electrical and computer engineering from Queen's University, Kingston, ON, Canada, in 2000. She was a Canada Research Chair (Tier II), from 2005 to 2015. From 2002 to 2004, she was an Assistant Professor with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. From 1999 to 2002, she was with Nortel Networks, Ottawa, ON, Canada, where she worked on the base transceiver design of the third-generation (3G) mobile communication systems. Since January 2005, she has been with the University of Victoria, Victoria, Canada, where she is currently a Professor with the Department of Electrical and Computer Engineering. Her research interests include 5G, mmWave communications, radio propagation, the Internet of Things, machine learning, terahertz communications, localization, wireless security, e-health, smart grid, and nano-communications. She served as an Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, from 2009 to 2014, the IEEE TRANSACTIONS ON COMMUNICATIONS, from 2001 to 2007, and the Journal of Communications and Networks, from 2006 to 2015. She is currently an Editor of the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY and the IEEE Open Journal of the Communications Society.

TAO LU received the B.Sc. degree from the Department of Physics, University of Manitoba, in 1995, the M.Sc. degree from the Department of Electrical and Computer Engineering, Queen's University at Kingston, in 1998, and the Ph.D. degree from the Department of Physics, University of Waterloo, in 2005. He has worked in industry with various companies, including Nortel Networks, Kymata Canada, and Peleton, on optical communications. Before joining UVic, he was a Postdoctoral Fellow with the Department of Applied Physics, California Institute of Technology, from 2006 to 2008. His research interests include optical microcavities and their applications to ultra narrow linewidth laser sources, bio and nano photonics. His current research includes machine learning algorithms with applications to spectral analysis, the Internet of Things, and indoor localization.

ROBERT WESTENDORP received the Dipl.Eng. degree in electrical engineering and the Ph.D. degree from the Technical University of Munich, Germany. After focusing on automation and control, he continued researching distributed operating systems for industrial measurement and automation, for which he received the Ph.D. degree. His professional career led into designing hardware and software for video surveillance and intrusion detection. He is currently managing a product portfolio, including video systems and authentication servers, with Fortinet Canada Inc.

KISHORE REDDY TARIMALA received the B.Tech. degree in electronics and communication engineering from Jawaharlal Nehru Technological University, India. His professional career has spanned designing hardware and software for the world market in test and measurement instruments, consumer electronics in analog and digital video, and leading large chip design projects in semiconductors. He is currently a VP of engineering for enterprise class WiFi products at Fortinet India.

