Contents lists available at ScienceDirect

Computer Vision and Image Understanding


journal homepage: www.elsevier.com/locate/cviu

Landmark Recognition and Retrieval Using ResNet50 and DELF


P. Nikhil Chandra a, M. Kalyan b,*, B. Rishi Ram Naik c, K. L. Sailaja d, Ramesh Kumar P e
a Department of Computer Science and Engineering, Velagapudi Ramakrishna Siddhartha Engineering College, Kanuru, Vijayawada

Abstract

Image recognition and retrieval is a useful yet difficult task in computer vision. Landmark recognition predicts
landmark labels directly from image pixels, and landmark retrieval finds similar images in a large database. In this
project, a Graphical User Interface (GUI) is created that allows the user to upload an image, query its landmark and
get similar landmark images from the dataset. The project is done in two modules. The first module creates a model
that recognises the correct landmark (if any) in a set of difficult test photographs, allowing users to better comprehend
and organise their photo collections. A dataset of historic monuments of India, with around 100 classes and 1227
images, is created for this project. The first module, landmark recognition, is itself divided into three sub-modules.
Data augmentation is performed on the labelled dataset. The ResNet50 model is trained by replacing its output layer
with one for our classes, initialising it with ImageNet weights and fitting it to the training dataset. When a sample
image is provided to the ResNet50 model, the model gives the top 10 labels of the sample image. Two images for each
label are then taken from the training dataset, and DELF is applied to these images and to the sample image. The
inliers for these images are calculated using the RANSAC algorithm, and the label of the image that has the most
inliers with the sample image is taken as the landmark of that sample image. This is especially crucial for query
photographs with landmarks, which make up a significant portion of what people want to capture. In the second module,
the ResNet50 model is applied to the image data and the output is saved in a data frame. A sample image is given as
input, and the top 10 labels are obtained using the ResNet50 model. The relevant images of those top 10 labels are
retrieved from the data frame, and the 10 images are compared with the input image by their inliers. The 9 images
with the most inliers are shown to the user in the GUI window.

© 2017 Elsevier Inc. All rights reserved.


Keywords: Data Augmentation, Deep Local Features(DELF), Landmark Recognition, Ransac algorithm, ResNet50.

1. Introduction

A landmark is a prominent, easily identifiable geographic point (such as a mountain, a cliff, or a river).
A general landmark recognition technology can be applied in three steps: user-side image capture, server-side feature
extraction and categorization, and return of the retrieved data for display on mobile terminals. A CNN [1] is a deep
learning model, mainly used for image classification, whose neural network architecture consists of layers and
filters. Data augmentation [2] is a method of artificially increasing the size of a training set by creating modified
copies of the images in the dataset. It acts as a regularizer and helps to avoid overfitting when training a machine
learning model. The convolutional neural network ResNet50 [3] is a deep learning model, pre-trained with
ImageNet weights, that is used for image classification. Unlike plain networks, residual neural networks use skip
connections, or shortcuts, which allow ResNet50 to be built deeper than other types of networks. DELF [4] uses a CNN
to learn semantically equivalent local features from instance-level annotations. It is a descriptor that gives a
complete description of an image: its locations, scores and descriptors.

P. Nikhil Chandra et al., Landmark Recognition and Retrieval using ResNet50 and DELF, Computer Vision and Image
Understanding (2022),
http://dx.doi.org/10.1016/j.cviu.2017.00.000

Electronic copy available at: https://ssrn.com/abstract=4082868

2. Related Works

Wang, Yuheng, et al. [5] created a ResNet50 classification model using transfer learning. After training on the TrashNet
dataset, the model reached an accuracy of 92.4 percent, and a detection rate of 48.4 percent on a large-scale mixed
waste object image.
Aji and Kusuma [6] found that the best CNN method was the ResNet50 model with Adamax optimization, with an accuracy
of 95.67%. The ResNet50 model showed the best performance, with a small number of errors compared to other models
under the autoscaling method on deployment, as well as with the most concurrent users.
Lathuilière, Stéphane, et al. [7] built a hybrid model by layering linear regression on top of a CNN model. They
evaluated and compared results while changing the optimizer and network architecture and applying some data
preprocessing techniques, and observed that a general-purpose model, rather than a more complex one, gives the
required results.
Chu, Tianyou, et al. [8] describe how DELF works on images and find the nearest location of a query image based on
the image locations in the dataset. Going through the locations of all features is costly, so they constructed a CNN
model to extract dense features and used the grid feature-point selection (GFS) method to select the features with
high scores in each local region, which improves retrieval efficiency.
The authors of [9] used the Vector of Locally Aggregated Descriptors (VLAD) to propose an image retrieval pipeline
that takes advantage of local convolutional features and combines them to create a global descriptor. They
propose a query expansion technique and find that the local convolutional features and global representation
outperform other systems even without training. They conclude that reducing dimensionality
speeds up retrieval computation.
Wang, Jiangning, et al. [10] proposed a technique to identify insect specimen photos at the order level. The
methodologies of digital image processing, pattern recognition, and taxonomy theory were used to construct several
relative features. They used artificial neural networks (ANNs) and a support vector machine (SVM) for pattern
recognition. During tests with an artificial neural network on nine frequent orders and suborders, the system showed
good stability and an accuracy of 93 percent.
Nishant Nimbare, Parth Shah, Shail Shah, Ramachandra Mangrulkar et al. [11] set out to create a deep learning model
for classifying landmarks taken from the Google landmark dataset. They developed four models: VGG16, InceptionV3
and ResNet50 using transfer learning, and a pure CNN model built from scratch. They also describe combining ResNet50
with Deep Local Features to get better accuracy than the other models. Table 1 shows the training and test accuracy
of the different CNN models.
Table 1. Accuracy of different models

Model            Training Accuracy (%)   Testing Accuracy (%)
VGG16            99.8                    89.1
Inception V3     27.3                    18.6
ResNet50         100                     90.4
Pure CNN         97.55                   89.67
ResNet50+DELF    100                     95.67


Ruchi Jha, Perna Jain, Sandeep Tayal, Ashish Sharma et al. [12] proposed a solution to the Google Landmark
Recognition problem. The challenge, according to this paper, is a massive and extensive dataset categorised into
an enormous number of classes to predict. The model aims to predict landmarks using a CNN, preconfigured as a VGG16
neural network with transfer learning from ImageNet, in the presence of a certain amount of "junk" photos, broken
links, and images with multi-class predictions. The evaluation is carried out using the Global Average Precision (GAP)
measure, which takes into account the confidence score for each predicted landmark label. Deep Local Features uses
a CNN to determine semantically equivalent local features by training with instance-level annotations.
Hao Wu and Min Chen [13] proposed landmark recognition using a convolutional neural network and transfer
learning. The aim is to automatically separate landmark categories with subtle visual differences, to help users better
understand landmarks and organize their photo collections. For preprocessing they use image cropping and data
augmentation techniques. The modelling step extracts different numbers of landmarks and images from the dataset to
build the landmark recognition models using ResNet and VGG.
Maxence Dutreix, Nathan Hatch, Raghav Kuppan, Pranav Shenoy Kasargod Pattanashetty, Anirudha Sundaresan et al. [14]
proposed a transfer learning technique using deep residual networks. Because of the excellent representational
capabilities of these networks, the performance of image categorization, object identification, and face recognition
has improved. ResNet makes it possible to train very deep networks and still get good results. They augment the data
to make the training more robust, and the batch size is determined by the system design.
The approach uses a fine-tuned CNN to extract features from database photos and then uses principal component
analysis (PCA) to reduce their dimensionality, making retrieval easier.
Ramesh Kumar P, K. L. Sailaja et al. [15] presented a paper that uses GPS altitude, GPS latitude, GPS longitude, and
GPS position to examine image metadata and identify the location of an individual by nation, city, route, and street.

3. Methodology

In this project, two datasets are considered. The first dataset contains different historic monuments of India. Each
historic monument is stored as a single class, and the dataset contains 100 classes. The second dataset contains a
group of unclassified landmark images.

Data augmentation is performed on the labelled dataset, which is divided into training, testing and validation sets
in the ratio 0.8, 0.1 and 0.1 respectively. The ResNet50 model is trained by replacing its output layer with one for
our classes, initialising it with ImageNet weights and fitting it to the training dataset. The trained ResNet50 model
is then applied to the dataset of 1227 images, and the label of each image is stored in the data frame.
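
The 0.8 / 0.1 / 0.1 split can be illustrated with a minimal sketch. This is not the authors' code: the function name `split_dataset`, the fixed shuffle seed and the file names are assumptions made for illustration, and a real pipeline would usually split each class separately.

```python
import random


def split_dataset(items, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle items and split them into train, test and validation lists.

    Whatever remains after the train and test portions goes to validation,
    so the three parts always cover every item exactly once.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


# Hypothetical file names standing in for labelled monument images.
names = [f"img_{i}.jpg" for i in range(100)]
train, test, validation = split_dataset(names)
print(len(train), len(test), len(validation))
```

Assigning the remainder to the validation set avoids losing images to rounding when the dataset size is not a multiple of ten.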

When the user provides an image and clicks the get label button, the image is passed to the ResNet50 model, which
returns the top 10 labels of the image. Two images are taken from every label and compared with the input image using
DELF, and the inliers of every image with the input image are calculated. The label of the image that has the most
inliers with the input image is taken as the final label of the input image.
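
The first step of this paragraph, picking the top 10 labels from the classifier output, can be sketched as follows. The helper `top_k_labels`, the class names and the probabilities are hypothetical; in the real system the probability vector would come from the trained ResNet50 model.

```python
def top_k_labels(probabilities, class_names, k=10):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(class_names, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up class names and probabilities, for illustration only.
class_names = ["Charminar", "Qutub Minar", "Taj Mahal", "Hampi"]
probabilities = [0.05, 0.15, 0.70, 0.10]
top = top_k_labels(probabilities, class_names, k=2)
print(top)
```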


Fig. 1. Methodology.

When the user provides an image and clicks the getcluster button, the image is passed to the ResNet50 model, which
returns the top 10 labels of the image. The images with those top 10 labels are fetched from the data frame, and each
is compared with the input image using DELF to calculate its inliers with the input image. The images that have the
most inliers with the input image are considered similar to the input image.
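
The retrieval step can be sketched as a filter followed by a ranking. This is an illustration under assumptions: `retrieve_similar` is a hypothetical helper, a list of (image, label) pairs stands in for the data frame, and `count_inliers` is a caller-supplied function that would wrap the DELF and RANSAC comparison in practice.

```python
def retrieve_similar(query, rows, top_labels, count_inliers, k=9):
    """Return the k stored images with the most inliers against the query.

    rows is an iterable of (image, label) pairs standing in for the data frame;
    only rows whose label is in top_labels are compared with the query.
    """
    wanted = set(top_labels)
    scored = [(count_inliers(query, image), image)
              for image, label in rows if label in wanted]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image for _, image in scored[:k]]


# Made-up inlier counts replace a real DELF + RANSAC comparison.
fake_inliers = {"a.jpg": 12, "b.jpg": 40, "c.jpg": 7, "d.jpg": 25}
rows = [("a.jpg", "Taj Mahal"), ("b.jpg", "Taj Mahal"),
        ("c.jpg", "Charminar"), ("d.jpg", "Qutub Minar")]
similar = retrieve_similar("query.jpg", rows, ["Taj Mahal", "Qutub Minar"],
                           lambda q, image: fake_inliers[image], k=2)
print(similar)
```

Filtering by the predicted labels first keeps the number of expensive pairwise comparisons small.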

4. Algorithm

Algorithm for ResNet50 + DELF


Step 1: Create an instance of the ResNet50 model with ImageNet weights.
Step 2: Replace the output layer of the model with one for our classes.
Step 3: Fit the model to train_data and validation_data for 25 epochs.
Step 4: Save the model.
Step 5: Pass the input image to the ResNet50 model and get the top 10 labels with the highest probability for the input image.
Step 6: Take the images with those labels from the dataset.
Step 7: Compare the 2 images of each label with the input image using DELF and calculate the inliers. The image with
the most inliers is the DELF match image, and the label of that image is the final label of the input image.
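
Steps 6 and 7 amount to choosing the label whose reference image has the most inliers with the input. The sketch below shows only that decision rule: `pick_label` is a hypothetical name, the file names and inlier counts are made up, and `count_inliers` is assumed to wrap the DELF feature matching and RANSAC verification.

```python
def pick_label(query, candidates, count_inliers):
    """Return (label, inliers) for the reference image that best matches the query.

    candidates maps each candidate label to its reference images;
    count_inliers(query, reference) is supplied by the caller and returns the
    number of inliers between the two images.
    """
    best_label, best_inliers = None, -1
    for label, references in candidates.items():
        for reference in references:
            inliers = count_inliers(query, reference)
            if inliers > best_inliers:
                best_label, best_inliers = label, inliers
    return best_label, best_inliers


# Two hypothetical reference images per candidate label, with made-up counts.
fake_inliers = {"taj_1.jpg": 35, "taj_2.jpg": 52,
                "char_1.jpg": 9, "char_2.jpg": 14}
candidates = {"Taj Mahal": ["taj_1.jpg", "taj_2.jpg"],
              "Charminar": ["char_1.jpg", "char_2.jpg"]}
label, inliers = pick_label("query.jpg", candidates,
                            lambda q, ref: fake_inliers[ref])
print(label, inliers)
```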


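The RANSAC algorithm is used for calculating the inliers and outliers in the images. To calculate the inliers, some
model parameters must be provided to fit the data. The number of iterations is

$$k = \frac{\log(1 - p)}{\log(1 - w^{n})}$$

Here k is the number of iterations, p is the desired probability that at least one sampled set of n points contains
only inliers, w is the fraction of inliers in the data, and n is the number of points in each sample. Equivalently,
(1 - w^n)^k = 1 - p is the probability that the algorithm never selects a set of n points that are all inliers.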
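
As a worked example of the formula, the sketch below evaluates it under the usual RANSAC reading, which is an assumption here: k is the number of iterations, p the desired probability of drawing at least one all-inlier sample, w the fraction of inliers, and n the sample size. The function name and the numeric values are illustrative.

```python
import math


def ransac_iterations(p, w, n):
    """Smallest whole k with (1 - w**n)**k <= 1 - p.

    p: desired probability of drawing at least one all-inlier sample
    w: fraction of inliers in the data
    n: number of points drawn per sample
    """
    if not (0 < p < 1 and 0 < w < 1 and n >= 1):
        raise ValueError("need 0 < p < 1, 0 < w < 1 and n >= 1")
    return math.ceil(math.log(1 - p) / math.log(1 - w ** n))


# With half the matches being inliers and 4 points per sample,
# 99% confidence needs 72 iterations.
k = ransac_iterations(p=0.99, w=0.5, n=4)
print(k)
```

The iteration count grows quickly as the inlier fraction drops, which is why reducing the number of candidate images before matching matters.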

5. Result and Analysis

Table 2 has three columns: the input image, the landmark label, and the similar images. An image
is taken from the user and passed to the ResNet50 and DELF models. The landmark label of
that image and the similar images are given as output.

Table 2. Label and similar images of input images


Input Image Landmark label Similar Images


Table 3. Accuracy and time taken for different numbers of labels.

S. No   No. of Labels   Accuracy (%)   Time taken (in sec)
1       5               90.89          16
2       6               91.40          29
3       7               92.58          36
4       8               93.98          37
5       9               94.04          39
6       10              95.93          39
7       15              95.93          78
8       20              96.01          127

Fig. 2 Accuracy graph for the different no.of labels

To get the best accuracy when predicting landmark labels, the top 10 labels with the highest probability for the image
are sent to DELF. Fig. 2 shows the accuracy and time taken for different numbers of labels.
With fewer than 10 labels, the accuracy is lower and less time is taken. With more than 10 labels, the accuracy
improves only slightly while the time taken is high. With 10 labels, the accuracy is
high and the time taken is low, which gives the best performance.
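
The trade-off can be made concrete by computing the change in accuracy and time between consecutive rows of Table 3. The values below are copied from the table; the helper name `marginal_changes` is an assumption.

```python
# (number of labels, accuracy in %, time in seconds), copied from Table 3.
TABLE_3 = [(5, 90.89, 16), (6, 91.40, 29), (7, 92.58, 36), (8, 93.98, 37),
           (9, 94.04, 39), (10, 95.93, 39), (15, 95.93, 78), (20, 96.01, 127)]


def marginal_changes(rows):
    """Return (from_labels, to_labels, accuracy_gain, extra_seconds) per step."""
    changes = []
    for (prev_n, prev_acc, prev_t), (n, acc, t) in zip(rows, rows[1:]):
        changes.append((prev_n, n, round(acc - prev_acc, 2), t - prev_t))
    return changes


changes = marginal_changes(TABLE_3)
for prev_n, n, d_acc, d_time in changes:
    print(f"{prev_n:>2} -> {n:>2} labels: {d_acc:+.2f} % accuracy, {d_time:+d} s")
```

Going from 9 to 10 labels adds 1.89 points of accuracy at no extra time, whereas going from 10 to 15 adds no accuracy and 39 more seconds.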

6. Conclusion & Future work

In this paper, the model predicts the landmark label of an image using ResNet50 and DELF. The main objective
is to find the optimal method for finding the landmark label of an image from its pixels and retrieving the
images similar to it. To identify the landmark label, the image is given to the ResNet50 model, which passes the
top 10 labels with the highest probability to DELF. DELF finds the image that has the most inliers with the given
image, and the label of that image is returned. For retrieval, a dataset of images of different landmarks is given
to the ResNet50 model, which finds the landmark label of every image in the dataset and stores each image and its
label in the data frame. When an input image is given, it is sent to the ResNet50 model to get the top 10
labels, and the images with those labels are fetched from the data frame. These images are compared with the input
image using DELF, and the top 9 images with the most inliers are returned.
In future work, these images will be plotted on Google Maps, so that when the user selects a place on the map, the
images of that landmark are retrieved.

References

[1] Albawi, Saad, Tareq Abed Mohammed, and Saad Al-Zawi. "Understanding of a convolutional neural network." 2017
International Conference on Engineering and Technology (ICET). IEEE, 2017.
[2] Shorten, Connor, and Taghi M. Khoshgoftaar. "A survey on image data augmentation for deep learning." Journal of Big Data 6.1 (2019): 1-48.
[3] Targ, Sasha, Diogo Almeida, and Kevin Lyman. "Resnet in resnet: Generalizing residual architectures." arXiv preprint arXiv:1603.08029
(2016).
[4] Noh, Hyeonwoo, et al. "Large-scale image retrieval with attentive deep local features." Proceedings of the IEEE international conference on
computer vision. 2017.
[5] Wang, Yuheng, et al. "Recyclable Waste Identification Using CNN Image Recognition and Gaussian Clustering." arXiv preprint
arXiv:2011.01353 (2020).
[6] Aji, Indra Prasetya, and Gede Putra Kusuma. "Landmark classification service using convolutional neural network and kubernetes."
International Journal 9.3 (2020).
[7] Lathuilière, Stéphane, et al. "A comprehensive analysis of deep regression." IEEE transactions on pattern analysis and machine intelligence
42.9 (2019): 2065-2081.
[8] Chu, Tianyou, et al. "A grid feature-point selection method for large-scale street view image retrieval based on deep local features." Remote
Sensing 12.23 (2020): 3978.
[9] Imbriaco, Raffaele, et al. "Aggregated deep local features for remote sensing image retrieval." Remote Sensing 11.5 (2019): 493.
[10] Wang, Jiangning, et al. "A new automatic identification system of insect images at the order level." Knowledge-Based Systems 33 (2012):
102-110.
[11] Nimbare, Nishant, et al. "A Hybrid Approach for Landmark Recognition using Deep Local Features and Residual Network-50." ITM Web of
Conferences. Vol. 40. EDP Sciences, 2021.
[12] Jha, Ruchi, et al. "Landmark Recognition Using VGG16 Training." Smart and Sustainable Intelligent Systems (2021): 17-39.
[13] Wu, Hao, and Min Chen. "Chinese Landmark Recognition." 2020 International Conference on Computing, Networking and Communications
(ICNC). IEEE, 2020.
[14] Dutreix, Maxence, et al. "Google Landmark Recognition and Retrieval Challenges." (2018).
[15] Kumar, P. Ramesh, Ch Srikanth, and K. L. Sailaja. "Location Identification of the Individual based on Image Metadata." Procedia Computer
Science 85 (2016): 451-454.
