
Proceedings of the Second International Conference on Innovative Mechanisms for Industry Applications (ICIMIA 2020)

IEEE Xplore Part Number: CFP20K58-ART; ISBN: 978-1-7281-4167-1

Object Detection using Ryze Tello Drone with Help of Mask-RCNN

K V V Subash (kvvsubash1@gmail.com), Mr. M. Venkata Srinu, M.R.V. Siddhartha, N.C. Sri Harsha, Praneeth Akkala
Department of Electronics and Communication Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, India

Abstract: Unmanned aerial vehicles (UAVs) now incorporate image and video segmentation and object detection methods for applications such as reconnaissance, surveillance, tracking, and search and rescue. In recent years the use of convolutional neural networks has increased, leading to the development of R-CNN (region-based convolutional neural networks), which has gained interest in the research community for its reliability and robustness in recognizing image content. R-CNN was followed by state-of-the-art object detection methods such as Fast R-CNN, Faster R-CNN and Mask R-CNN, which are known for their speed and accuracy. Mask R-CNN was developed as the successor of Faster R-CNN, and in this paper we use it to demonstrate object detection and segmentation in images and video. Mask R-CNN achieves good results on a range of detection and segmentation tasks, but in aerial images or video the object is often obscured by environmental conditions, which makes it challenging to detect and classify objects accurately. This paper describes the implementation of Mask R-CNN to locate, detect and classify objects in images and video taken by a Ryze Tello drone. There are further constraints: Mask R-CNN requires considerable computational power, and a sufficient number of training samples must be provided to obtain accurate results.

Keywords: Ryze Tello, Mask R-CNN, deep neural networks, object detection

I. INTRODUCTION

UAVs combined with object detection methods find applications in various fields, for purposes such as tracking, surveillance, reconnaissance, and search and rescue. Object recognition and tracking are also used for navigation and obstacle avoidance in autonomous drones. Object detection therefore becomes a crucial component of a UAV. Nevertheless, object detection is not an easy task. The captured image or video is often noisy because of the motion of the drone, its limited stability in windy conditions, and the low-resolution camera it carries, all of which result in obscured images. For real-time deployment in various fields, object detection methods must be robust and accurate, their output must be unambiguous, and the hardware must be portable and low cost. Some object detection methods are capable of finding multiple target objects in the captured image or video, and they find application in real-time settings where hardware is used effectively. In the last few years there has been ongoing research on convolutional neural networks, especially for image recognition and segmentation. Region-based convolutional neural networks have proved to be more robust and accurate.

In real-time operation the developed system needs to obtain accurate results, so we need more powerful algorithms such as deep neural networks, that is, R-CNN (region-based convolutional neural networks). These networks are complex and demand a lot of computational power, and the overall speed depends on the processing speed of the region-based network. Although they consume a good amount of computational power and need a good number of training samples, they produce accurate results.

UAV systems that must be lightweight but are not autonomous may not require high processing power in their onboard hardware; no high-end GPU (graphics processing unit) is needed to perform the large amount of computation. Instead, the video footage is sent to the ground station from which the UAV is controlled, and the object detection model is applied there as backend video processing. A fully autonomous UAV, in contrast, requires a high-performance GPU on board in order to make complex decisions for obstacle

978-1-7281-4167-1/20/$31.00 ©2020 IEEE 484



collision avoidance and autonomous navigation on its own, without intervention from a ground station, so most of the video footage is processed by onboard hardware. Examples include self-driving cars and the UCAV developed by the US Navy.

II. LITERATURE SURVEY

Neural networks are among the most prominent machine learning algorithms today. Neural networks and deep learning have demonstrated that they can exceed other algorithms in accuracy and speed, although they must process a colossal amount of data, as in object detection. A neural network is built from neurons and activation functions, which are its basic building blocks. To understand a neural network we first look at its layers. Each layer is a collection of neurons that produces outputs when fed with inputs. A network generally has one input layer and one output layer, and the layers of nodes in between are known as hidden layers.

A neural network with more than one hidden layer is generally referred to as a deep neural network. Applications that require high processing power, such as image processing and object identification, paved the way for deep neural networks such as convolutional neural networks (CNNs). A CNN is named after its hidden layers, which comprise convolutional layers, pooling layers, normalization layers and fully connected layers. A general CNN can describe the class of the objects present in a scene, and it is possible to regress bounding boxes from the CNN. One downside is that this works for only one object at a time and may not give information about where the objects are located. When a scene contains a cluster of objects, bounding box regression may not work well because of the interference between them.

In R-CNN, the CNN is made to concentrate on a single region of the image or video frame at a time, which reduces interference to a large extent. A selective search algorithm divides the image or frame into nearly 2000 region proposals, and the regions are rescaled to the same size before being fed to the CNN for classification and bounding box regression. The CNN is applied to each region separately because only a single object of interest dominates a given region. Compared with R-CNN, Fast R-CNN runs a single convolutional network per image instead of the roughly 2000 network passes, one per region, used by R-CNN. Fast R-CNN still relies on selective search for region proposals, but it replaces the SVM classifier used by R-CNN with a softmax classifier, which performs better, and it employs a multi-task loss when training the deep convolutional network to increase recognition accuracy. Fast R-CNN and Faster R-CNN were thus developed after R-CNN.

In Faster R-CNN the image is given as input to a CNN, which produces a convolutional feature map. Faster R-CNN is faster than Fast R-CNN and R-CNN because it does not use the selective search algorithm to identify region proposals. Instead, a separate network predicts the region proposals, which are then reshaped by an ROI pooling layer and used to classify the image within each proposed region and to predict the offset values of the bounding boxes. Mask R-CNN is similar to Faster R-CNN, with some improvements: ROI pooling is replaced by ROI Align, and a separate branch is added after it. The added branch outputs a binary mask in which each pixel answers whether it belongs to the mask or not: a value of one means the pixel is masked and a value of zero means it is not. After the feature map is obtained as in Faster R-CNN, we move to the masking step.

III. RELATED WORK

In [6] the authors detect cars in aerial images using Faster R-CNN and YOLO (you only look once). YOLOv3 and Faster R-CNN are compared on different parameters in order to determine which is suitable for particular environments and applications. Both were given the same datasets, and YOLOv3 surpassed Faster R-CNN in average precision; it is also known for its speed in object detection. Faster R-CNN is one of the prominent R-CNN algorithms. The comparison measures execution speed and accuracy on various datasets and takes into account external factors, such as the environment, that disturb the field of view. By comparing state-of-the-art algorithms on different performance parameters, their suitable applications can be determined.

In [3] the authors developed a model that adapts and scales down the object detection module, specifically for vehicles, in order to reduce inference time when predicting more than one object in a region. They added a small component to Faster R-CNN, a search area reduction module, which splits the input image into regions. Regions that do not contain vehicles are filtered out, which substantially reduces inference time in the later stages. With this approach, state-of-the-art detection results are obtained on a publicly available dataset while the inference time of the object detection module is greatly reduced.

In [2] the authors introduced a model that outperforms typical methods in the precise detection of multiple objects, such as vehicles, in aerial images. Because objects in aerial images are very small, models such as VGG-16 or shallow networks are used to obtain a sufficiently high feature map resolution, but such feature maps carry only a small amount of semantic and contextual information. This leads to inaccurate detection and triggers false alarms for different objects of similar shape. To keep the resolution high enough to localize small objects, a deconvolutional module is added to Faster R-CNN that up-samples the low-dimensional


feature maps of the deep layers and merges the up-sampled features with the features of the shallow layers.

In [1] the authors introduced a model that improves instance segmentation in very high resolution (VHR) remote sensing images. They proposed a precise Mask R-CNN (PMRCNN) for object detection and instance segmentation. PMRCNN runs on VHR remote sensing images and produces a bounding box and a segmentation mask for every object instance in an image. In contrast to ROI (region of interest) Align, where the preconfigured sample points do not adapt to the size of the bin, precise ROI pooling performs a second-order integral over the continuous feature map to avoid loss of precision. On the NWPU VHR-10 dataset, PMRCNN outperformed other neural networks in accuracy for object detection and instance segmentation. It showed promising results, proved robust on VHR images, and can be deployed for real-time applications on VHR images taken by satellite.

IV. HARDWARE PLATFORM

We use Ryze Tello drones as a low-cost hardware platform to test our cloud-based recognition approach. The Ryze Tello drone costs about US$100 and is small and lightweight (98 mm × 92.5 mm × 41 mm; propellers: 3 inches; built-in functions: range finder, barometer, LED, vision system, Wi-Fi 802.11n 2.4 GHz, 720p live view; port: Micro USB charging port). It can be operated indoors and outdoors. The drone is fitted with a single camera; an IMU that includes a 3-axis gyroscope, a 3-axis accelerometer and a 3-axis magnetometer; and a pressure- and IR-based altitude sensor. The front-facing camera has a resolution of 8 MP and can record video at 1280 × 720 at 30 fps.

Fig 1 Tello drone

V. METHODOLOGY

Mask R-CNN consists of three networks: a feature pyramid network (FPN), a region proposal network (RPN) and Faster R-CNN. The FPN is a neural network that serves as a feature extractor designed to represent objects at multiple scales; it was introduced by authors who also worked on Mask R-CNN. The initial layers detect low-level features such as edges and corners, and the following layers detect high-level features such as a person or an aircraft. A second pyramid is added that enhances the feature extraction of the default pyramid. For a sample RGB image of resolution 1024 × 1024 × 3, passing through the backbone network yields a feature map of 32 × 32 × 2048. Backbone networks such as ResNet-50 and ResNet-101 are used to extract features from the image or video frame, and these features serve as inputs to the following layers. To obtain 256 channels, a 1 × 1 convolution is applied to the feature map generated at layer 2. The output of the previous iteration is up-sampled and added element-wise. A 3 × 3 convolutional layer is then applied to the outputs of the previous step to generate the four feature maps; the fifth feature map is obtained by max pooling the last feature map, P5. The (w/32, h/32) feature map is the smallest one involved in up-sampling.

Fig 2 FPN extraction
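The feature-map sizes quoted for the pyramid can be checked with simple shape bookkeeping. The sketch below is illustrative only and is not the implementation used in this work: it tracks shapes rather than running convolutions. The stage names C2 to C5, their strides, and the channel counts other than 2048 are assumptions based on common ResNet backbones, and the function names are hypothetical.

```python
# Shape bookkeeping for a feature pyramid (no deep-learning framework needed).

# Assumed down-sampling factor of each backbone stage (typical for ResNet).
BACKBONE_STRIDES = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}
# Assumed channel counts; only the 2048 channels of C5 are stated in the text.
BACKBONE_CHANNELS = {"C2": 256, "C3": 512, "C4": 1024, "C5": 2048}
FPN_CHANNELS = 256  # every pyramid level is projected to 256 channels


def backbone_shapes(height, width):
    """Return (h, w, channels) of each backbone stage for an input image."""
    return {
        name: (height // stride, width // stride, BACKBONE_CHANNELS[name])
        for name, stride in BACKBONE_STRIDES.items()
    }


def fpn_shapes(height, width):
    """Return (h, w, channels) of pyramid levels P2..P6."""
    if height % 64 or width % 64:
        raise ValueError("this sketch assumes image sides divisible by 64")
    stages = backbone_shapes(height, width)
    shapes = {}
    previous = None
    for level in (5, 4, 3, 2):  # top-down: coarsest level first
        h, w, _ = stages[f"C{level}"]
        lateral = (h, w, FPN_CHANNELS)  # 1x1 convolution: channels -> 256
        if previous is not None:
            # 2x up-sampling of the coarser level must match the lateral
            # shape before the element-wise addition.
            upsampled = (previous[0] * 2, previous[1] * 2, FPN_CHANNELS)
            if upsampled != lateral:
                raise ValueError("shape mismatch in top-down merge")
        shapes[f"P{level}"] = lateral  # the 3x3 convolution keeps the shape
        previous = lateral
    # Extra level obtained by stride-2 max pooling of P5.
    h5, w5, _ = shapes["P5"]
    shapes["P6"] = (h5 // 2, w5 // 2, FPN_CHANNELS)
    return shapes


if __name__ == "__main__":
    for name, shape in sorted(fpn_shapes(1024, 1024).items()):
        print(name, shape)
```

For a 1024 × 1024 input, the coarsest backbone stage comes out as 32 × 32 × 2048, in agreement with the figure given in the text, and every pyramid level carries 256 channels.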

The main task of the RPN is to scan the feature maps in a sliding-window fashion and to identify the regions that contain objects.
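A sliding-window scan of this kind can be pictured as placing a fixed set of reference boxes (anchors) at every cell of a feature map. The following is a minimal sketch under stated assumptions, not the code used in this work: the aspect ratios (0.5, 1, 2), the scale of 64 pixels, the (y1, x1, y2, x2) box layout and the function name `make_anchors` are all chosen for illustration.

```python
import math


def make_anchors(feat_h, feat_w, stride, scale, ratios=(0.5, 1.0, 2.0)):
    """Place one anchor per aspect ratio at the centre of every feature cell.

    Boxes are returned as (y1, x1, y2, x2) in image pixels. `ratio` is
    width / height, and every anchor keeps an area of scale * scale.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, expressed in image pixels.
            cy = (row + 0.5) * stride
            cx = (col + 0.5) * stride
            for ratio in ratios:
                h = scale / math.sqrt(ratio)
                w = scale * math.sqrt(ratio)
                anchors.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return anchors


if __name__ == "__main__":
    boxes = make_anchors(feat_h=32, feat_w=32, stride=32, scale=64)
    print(len(boxes), "anchors on a 32 x 32 feature map")
```

Because the boxes depend only on the feature-map size, they can be generated once and reused for every frame of the same resolution.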


The regions that the RPN scans are known as anchors. They overlap and cover the image; typically there is a very large number of anchors of different sizes and aspect ratios, and for Mask R-CNN we use large images and additional anchors. Scanning them one by one would take a long time, but because of the design of the RPN the sliding window is handled by convolution, which allows all regions to be scanned in parallel on a GPU. To avoid redundant computation, the RPN reuses the features that have already been extracted. Each anchor produces two outputs: an anchor class, which is either foreground or background, and a bounding box refinement. A 3 × 3 convolutional layer is applied to every feature map obtained from the previous process. The output then passes through two branches, one for the bounding box regressor and one for the objectness scores. Three anchor ratios and a single anchor stride are used for each level of the feature pyramid, so 12 channels (3 anchors × 4 box coordinates) are allocated to the bounding box regressor and three channels to objectness.

Fig 3 RPN head

There is one problem to solve before we continue: classifiers do not handle variable input sizes well and usually need a fixed input size. The next step is proposal generation. Anchors are selected according to their objectness scores. The dimensions of the resized image are taken as the width and height. Bounding boxes that have any coordinate outside the image are then eliminated, and non-maximum suppression is used to remove boxes that overlap by more than a threshold. The steps from anchor selection to non-maximum suppression are executed for every level of the feature pyramid produced by the FPN backbone. Afterwards the anchor boxes, expressed in the coordinate system of the rescaled image, are grouped together, but the information about which FPN level produced each ROI is not stored.

We then move to the box head, which begins with FPN ROI mapping. Depending on its area, each ROI is associated with the appropriate FPN feature map by referring to the pooler scales. All feature maps from P2 to P5 are used in the RPN to produce proposals. A level-assignment equation gives the integer pyramid level for a specific ROI.

The most prominent change in Mask R-CNN compared with Faster R-CNN is the substitution of ROI Align for ROI pooling. Both ROI Align and ROI Pool produce a uniform P × P matrix for every ROI, but ROI Pool does not perform well in instance segmentation because its many quantization steps affect the masks that are produced. Next, the output obtained for every ROI is reshaped and passed through a fully connected layer whose number of output channels is a hyperparameter known as the representation size, and then through another fully connected layer with an equal number of input and output channels.

Fig 4 Box head

In the next step, the ROI vectors pass through a predictor with two branches. The first branch yields the bounding box regression values, and the second branch predicts the object class.
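The proposal filtering described in this section (discard boxes that leave the image, rank by objectness, then apply non-maximum suppression) can be sketched in a few lines. This is a minimal illustration, not the code used in this work: the (y1, x1, y2, x2) box layout, the default IoU threshold of 0.7 and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (y1, x1, y2, x2)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def inside_image(box, height, width):
    """True when every coordinate of the box lies within the image."""
    y1, x1, y2, x2 = box
    return y1 >= 0 and x1 >= 0 and y2 <= height and x2 <= width


def filter_proposals(boxes, scores, height, width, iou_threshold=0.7):
    """Return indices of the boxes kept after filtering and suppression."""
    # 1. Drop boxes with any coordinate outside the image.
    candidates = [i for i, box in enumerate(boxes)
                  if inside_image(box, height, width)]
    # 2. Visit the remaining boxes from highest to lowest objectness score.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    # 3. Greedy non-maximum suppression: keep a box only if it does not
    #    overlap an already kept box by more than the threshold.
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (-5, 0, 5, 10)]
    demo_scores = [0.9, 0.8, 0.7, 0.95]
    print(filter_proposals(demo_boxes, demo_scores, 100, 100, iou_threshold=0.5))
```

In the demo, the highest-scoring box is discarded because it extends beyond the image, which shows why the boundary check runs before the ranking step.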


Fig 5 Bounding box processing

The mask branch is a convolutional network that takes the positive regions selected by the ROI classifier and produces masks for them. The masks are low resolution, 28 × 28 pixels, but they are soft masks represented by floating-point numbers, so they carry more information than binary masks. The small mask size helps keep the mask branch light. During training, the ground-truth masks are scaled down to 28 × 28 to compute the loss; during inference, the predicted masks are scaled up to the size of the ROI bounding box, which gives the final masks, one per object. The generated masks can then be rescaled to the dimensions of the input image. The mask tensors are padded to avoid boundary effects caused by the up-sampling in the previous step. The bounding box coordinates are rescaled to the image dimensions according to the new mask, using bilinear interpolation. Finally, a mask of the object with the height and width of the image is obtained with the help of a hyperparameter known as the mask threshold: for every pixel, if the mask value is greater than 0.5 the object is expected to be present in that pixel, and if it is less than 0.5 the object is expected to be absent.

Fig 6 Processing of masks

Fig 7 Architecture of Mask R-CNN

VI. DEVELOPMENT OF MODEL

In this project we use libraries such as NumPy, SciPy, Cython, h5py, Pillow, Matplotlib and OpenCV, after which the Keras Mask R-CNN implementation and the TensorFlow framework must be installed, as they are vital for backend processing. The object detection model we developed for the drone uses these frameworks and the hardware efficiently and runs on different operating systems. We provide a dataset that comprises 100 classes of objects and contains 10000 images with XML annotations. Next, a model configuration object must be defined. This is a new class that extends the mrcnn.config.Config class and defines properties of the prediction problem, such as the name and the number of classes, as well as properties of the training algorithm, such as the learning rate.

The configuration must define its name via the NAME attribute; this is used to save details and models to file during the run. It must also define the number of classes in the prediction problem via the NUM_CLASSES attribute. Finally, the number of samples (images) used in each training epoch must be specified; this is the number of images in the training dataset, 100 in this case. We trained the approach on one GPU with a batch size of 2 for 100 epochs. The first step is to download the model architecture and weights of the pre-trained Mask R-CNN model; the network pre-trained on the COCO dataset is used for fine-tuning in the first 40 epochs. We trained the network


with a learning rate of 0.001. The Mask R-CNN template can be used as a starting point.
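The training settings reported in this section can be collected in one place. The sketch below is a stand-in written with a plain Python dataclass so that it runs without TensorFlow; it is not the mrcnn.config.Config subclass used in this work. The attribute names follow the conventions of the Keras Mask R-CNN package, while the configuration name `tello_objects`, the extra background class in NUM_CLASSES, and the fields EPOCHS and FINE_TUNE_EPOCHS are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DroneConfig:
    """Plain stand-in for a subclass of mrcnn.config.Config."""

    NAME: str = "tello_objects"   # hypothetical name, used when saving models
    NUM_CLASSES: int = 1 + 100    # assumed: background + 100 object classes
    GPU_COUNT: int = 1            # one GPU, as reported in the text
    IMAGES_PER_GPU: int = 2       # gives the reported batch size of 2
    STEPS_PER_EPOCH: int = 100    # samples per epoch, as reported in the text
    LEARNING_RATE: float = 0.001  # as reported in the text
    EPOCHS: int = 100             # hypothetical field: total training epochs
    FINE_TUNE_EPOCHS: int = 40    # hypothetical field: epochs of fine-tuning

    @property
    def batch_size(self):
        """Effective batch size across all GPUs."""
        return self.GPU_COUNT * self.IMAGES_PER_GPU


if __name__ == "__main__":
    config = DroneConfig()
    print(config.NAME, config.batch_size, config.LEARNING_RATE)
```

With the real package, the same attributes would be set as class attributes on a subclass of mrcnn.config.Config and the resulting object passed to the model at construction time.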
VII. RESULTS AND DISCUSSION

The algorithm was tested on the configuration described above, although other combinations are likely to work as well, provided that a compatible combination of TensorFlow and Keras versions is used. OpenCV 3.4 suffices. A GPU with CUDA acceleration gives better performance. On a Legion laptop with a GTX 1650 GPU, an i5-9300H CPU and 8 GB of RAM, we obtained 7.12 FPS. The best detection accuracy was 0.99 and the mean accuracy was 0.956 (95.6 percent). To run the model on a Tello, make sure the Ryze Tello drone is switched on and connected to your Wi-Fi network.

Fig 8 Mask R-CNN detecting TV and persons

Fig 9 Mask R-CNN detecting multiple chairs and persons

Fig 10 Mask R-CNN detecting laptop, keyboard and person

Fig 11 Mask R-CNN detecting TV and bed

VIII. CONCLUSION AND FUTURE WORK

Object detection and classification are broad problems. Obstacle avoidance, search and rescue, reconnaissance, navigation and mapping of unknown territories are some of the capabilities seen in autonomous drones and UAVs. Real-time deployment of autonomous UAVs in various fields requires embedded object detection methods that are robust and accurate, so we use a deep neural network, Mask R-CNN, to detect and classify objects in images and video frames. There are some disadvantages as well: Mask R-CNN demands a considerable amount of processing, and a large dataset in which all classes are uniformly distributed across several altitude or zoom thresholds would significantly increase accuracy. The relatively small number of epochs trained also limits detection accuracy. In addition, the range of the drone is short, the video resolution of


the drone is 720p and the camera has 8 MP. Some applications require a high-resolution camera, and our drone does not have optical image stabilization, although it has electronic image stabilization. Environmental conditions are also responsible for obscured images, and our drone cannot fly in strong winds. It is, however, cost-effective at about one hundred dollars. Our future work will be controlling the drone with hand gestures and human pose estimation with the help of Mask R-CNN.

REFERENCES
[1] Su, H., Wei, S., Yan, M., Wang, C., Shi, J., & Zhang, X. (2019). Object Detection and Instance Segmentation in Remote Sensing IEEE International Geoscience and Remote Sensing Symposium. doi:10.1109/igarss.2019.8898573
[2] Sommer, L., Schumann, A., Schuchert, T., & Beyerer, J. (2018). Multi Feature Deconvolutional Faster R-CNN for Precise Vehicle Detection in Aerial Imagery. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). doi:10.1109/wacv.2018.00075
[3] Sommer, L., Schmidt, N., Schumann, A., & Beyerer, J. (2018). Search Area Reduction Fast-RCNN for Fast Vehicle Detection in Large Aerial Imagery. 2018 25th IEEE International Conference on Image Processing (ICIP). doi:10.1109/icip.2018.8451189
[4] K. He, et al. "Mask r-cnn," in Computer Vision (ICCV), 2017 IEEE International conference on, IEEE, 2017, pp.2980-2988.
[5] R. Girshick. "Fast r-cnn," in Proceedings of the IEEE international conference on computer vision, 2015, pp.1440-1448.
[6] Adel Ammar, Anis Koubaa, Mohammed Ahmed, Abdulrahman Saad. "Aerial Image Processing for car Detection using Convolutional Neural Networks: Comparison between Faster R-CNN and YoloV3," in arXiv:1910.07234v1 [cs.CV], 16 Oct 2019.
[7] A. Borji, et al. "Salient object detection: A benchmark," in IEEE transactions on image processing, vol. 24, no. 12, 2015, pp.5706-5722.
[8] M. Bhaskaranand and J. D. Gibson, "Low-complexity video encoding for UAV reconnaissance and surveillance," in Proc. IEEE Military Communications Conference (MILCOM), pp. 1633-1638, 2011.
[9] Anantharaman, R., Velazquez, M., & Lee, Y. (2018). Utilizing Mask R-CNN for Detection and Segmentation of Oral Diseases. 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). doi:10.1109/bibm.2018.8621112
[10] Open Source Computer Vision (OpenCV): http://opencv.org/ (access 28.05.2017).
[11] R. Baran, A. Glowacz, and A. Matiolanski. "The efficient real-and non-real-time make and model recognition of cars," in Multimedia Tools and Applications, vol 74, no. 12, 2015, pp.4269-4288.
[12] H. Chung-Hsien, et al. "A hybrid moving object detection method for aerial images," in Pacific-Rim Conference on Multimedia, Springer, Berlin, Heidelberg, 2010, pp.357-368.
[13] T.Y. Lin, et al. "Microsoft coco: Common objects in context," in European conference on computer vision, Springer, Cham, 2014, pp.740-755.
[14] S. Nadim and B. Bhanu. "Physical models for moving shadow and object detection in video," in IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 8, 2004, pp.1079-1087.
[15] M. Nagao, T. Matsuyama, and Y. Ikeda. "Region extraction and shape analysis in aerial photographs," in Computer Graphics and Image Processing, vol. 10, no. 3, 1979, pp.195-223.
[16] S. Ren, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in neural information processing systems, 2015, pp.91-99.
[17] M. Abadi et al. (2016). "TensorFlow: Large-scale machine learning on heterogeneous distributed systems." [Online]. Available: https://arxiv.org/abs/1603.04467
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[19] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian, 2016. Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
