Smart Home Security Using Telegram Chatbot
PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
OF THE APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
DECLARATION
We, the undersigned, hereby declare that the project report "Smart Home Security Using Telegram Chatbot", submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology of the APJ Abdul Kalam Technological University, Kerala, is bona fide work done by us under the supervision of Prof. Krishnakumar. This submission represents our ideas in our own words and, where ideas or words of others have been included, we have adequately and accurately cited and referenced the original sources. We also declare that we have adhered to the ethics of academic honesty and integrity and have not misrepresented any data, idea, fact or source in our submission. We understand that any violation of the above will be cause for disciplinary action by the institute and/or the University and can also invoke penal action from sources which have not been properly cited or from whom proper permission has not been obtained. This report has not previously formed the basis for the award of any degree, diploma or similar title of any other university.
CERTIFICATE
This is to certify that the report entitled "Smart Home Security Using Telegram Chatbot" submitted by 'Abijith Kuruppath (ATP16CS001), Navneeth Suresh (ATP16CS033), Sanjay PS (ATP16CS040), Vinayan V (ATP16CS049)' to the APJ Abdul Kalam Technological University in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering is a bona fide record of the project work carried out by them under the guidance and supervision of Prof. Krishnakumar. This report in any form has not been submitted to any other university or institute for any other purpose.
Signature Signature
ACKNOWLEDGEMENT
First and foremost, we would like to thank GOD ALMIGHTY for His infinite grace and help, without which this project would not have reached its completion.
We would like to express our sincere gratitude to the MANAGEMENT for their timely support extended throughout the project.
We would like to express our sincere gratitude to Dr MAHADEVAN PILLAI, the Honorable
Director for his timely support extended throughout the project.
We would like to express our sincere gratitude to Dr MAHADEVAN PILLAI, the Honorable
Principal for his timely support extended throughout the project.
We would like to express our sincere thanks to our HOD Dr S. GUNASEKARAN for his help
and guidance through the project.
Words cannot express the motivation, guidance and inspiration that our guide Prof. KRISHNAKUMAR rendered us throughout the entire project, without whom this would not have been possible.
No project can succeed without its reference material. In this context we wish to express a profound sense of gratitude to all the teaching and non-teaching staff of our college for giving us a supportive environment.
ABSTRACT
In real-world applications, home security has become a pressing issue and has gained importance due to the occurrence of unwanted events in our surroundings. We need a surveillance system that can analyse and explain the behaviour of objects, both static and moving, to improve object detection and video tracking. This paper focuses on detecting moving objects in a video surveillance system, tracking the detected objects in the scene, classifying them, and, in addition, sending intruder alerts as chat messages through Telegram. This calls for fast, reliable and robust algorithms for moving-object detection and tracking. The algorithm applies background subtraction to the image sequence, thus detecting the moving objects in the foreground. In addition, motion detection works in tandem with facial recognition to filter out false positives and give only accurate intrusion alerts. The system also records footage in real time and sends it to the user via the Telegram chatbot, where the video clips are stored safely on the Telegram server.
With the help of a neural network, we classify the target object against a preloaded data set. We perform face detection, extract face embeddings from each face using deep learning, train a face recognition model on the embeddings, and finally recognize faces in both images and video streams with OpenCV. For all these requirements we integrate our system with a Telegram bot, both as a remote control and to receive notifications from the system regarding the surveillance. Telegram being open source makes home-security integration that much easier, and all messages sent from the surveillance system to the Telegram bot are end-to-end encrypted, meaning they are secure and cannot be eavesdropped on.
CONTENTS
DECLARATION II
CERTIFICATE III
ACKNOWLEDGEMENT IV
ABSTRACT V
LIST OF FIGURES VI
LIST OF ABBREVIATIONS IX
1 INTRODUCTION 1
2 EXISTING SYSTEM 3
3 LITERATURE SURVEY 4
4 PROPOSED SYSTEM 19
4.1 Architecture 19
5 IMPLEMENTATION 24
7 CONCLUSION 40
REFERENCES 42
APPENDIX-1 44
APPENDIX-2 46
LIST OF FIGURES
6.2 Size of trainer file vs Loading time 38
LIST OF TABLES
LIST OF ABBREVIATIONS
TP True positive
TN True negative
CHAPTER 1
INTRODUCTION
Home Security Bot is an ambitious project aimed at solving a common but overlooked
problem in people's daily lives. Security is a primary concern everywhere, at all times, and
for everyone. The main aim is to protect individuals and property from theft and loss.
Although modern technological solutions exist, people tend to avoid them for the sole reason
of high expense, as well as one of the major problems that occurs in most systems: false
alarms. Some existing systems alert the user even though no intruder has entered the house.
These systems operate on the basis of motion detection alone, which means false alerts
are likely to slip through. They cannot classify the objects detected in the frame and will
send out an alert regardless. Because they rely on detection alone, these systems suffer from
major flaws: almost all algorithms that work on motion change tend to detect motion when
lights turn on or when lighting changes suddenly.
The system we introduce addresses these issues by including strong object detection
and tracking algorithms. Not only will our system detect movement, it can also identify who
is in the frame, which helps avoid sending out false alerts. The system is a chatbot-integrated
motion detection and recognition system, implemented as a simple Python script.
The system will be able to detect any object movement, distinguish the occupants of the
house from intruders in daylight or darkness, take a picture, record footage and automatically
send the data to a smartphone via Wi-Fi using the Telegram chatbot application. The data
include the picture and the notification message "Intruder Alert".
The advantages of this system are energy saving, cost saving and home security. It is
also a simple system, able to work at any time, in daylight or darkness. The alert message
sent to the user is secured through Telegram's end-to-end message encryption.
The elementary principles for such a system to work are segmentation, object detection
and object tracking. Object detection detects and identifies relevant objects in the video.
Segmentation helps in finding the relations between objects, as well as their context; face
recognition, number-plate identification and satellite image analysis are examples. Object
tracking involves figuring out an object's path as it moves. Object tracking is challenging
because of noise, object occlusions and complex object structures.
Major changes in the level of brightness may occur, preventing a correct and clear
image from being obtained; the level of brightness in darkness is totally different from that
in daylight. Object occlusion may also occur in some setups: parts of the scene may cover
the object of interest, blocking it so that the obtained image is not usable by the user. Noise
may also appear in the images or videos recorded by the security camera, which is a major
problem because it can prevent the user from identifying the person who entered the house.
Recording and sending the image and footage to the user should be fast, and the user should
be alerted as soon as possible; existing systems contain flaws here, since they send false
alerts and their response time can be very slow.
In our work, we propose a method that trims down the challenges and limitations which
occur during real-time operation and that runs fairly effectively and efficiently in terms of
results and response time.
CHAPTER 2
EXISTING SYSTEM
CCTV is one of the most commonly used security systems. Such systems are highly
influential in crime prevention, industrial processes, traffic monitoring and so on. CCTV is a
passive monitoring device which requires constant, continuous human supervision.
Monitoring such situations continuously is a complex task, as all the captured footage must
be watched manually, which requires a lot of patience. This technique is costly, and the
collected information often gets corrupted. Some cameras cannot even differentiate between
an intruder and a person living nearby; because of this, false alarms are sent to the user,
which may lead to misunderstanding. Since files may get corrupted, there should be a clean
storage space where photos and videos are stored securely without corruption.
The existing system has major disadvantages such as false alarms and corruption of
files inside the storage system. This may lead to an unstable security system, and major
changes are required in the face detection and object detection algorithms used. The main
problem arises when the camera has to detect faces at night, so more accurate algorithms
should be used to detect faces at night, reduce false alarms and get a usable snapshot of the
intruder. The inherent issue with smart surveillance systems that exist today is that they cost
a premium to set up and, in addition, charge a monthly subscription for the storage of
surveillance footage. Most surveillance systems are trained to detect only changes in motion,
which may be caused by several things other than intruders, such as changes in lighting or
shadows. Only top-of-the-line systems implement facial recognition, and they are not made
for household use. The major challenge is face detection at night, which can be addressed by
using more accurate algorithms.
CHAPTER 3
LITERATURE SURVEY
3.1 Convolutional Neural Networks for Image Recognition - Samer Hijazi, Rishi
Kumar, Chris Rowen
Four types of layers are most common: convolution layers, pooling/subsampling layers,
non-linear layers, and fully connected layers.
The system is pre-trained with a set of images of faces of all users. The convolution
operation extracts different features of the input. The first convolution layer extracts low-level
features like edges, lines, and corners. Higher-level layers extract higher-level features. The
input is of size N x N x D and is convolved with H kernels, each of size k x k x D separately.
Convolution of an input with one kernel produces one output feature, and with H kernels
independently produces H features. Starting from the top-left corner of the input, each kernel
is moved from left to right, one element at a time. Once the top-right corner is reached, the
kernel is moved one element in a downward direction, and again the kernel is moved from left
to right, one element at a time. This process is repeated until the kernel reaches the bottom-
right corner.
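The sliding-window movement described above can be sketched in a few lines of NumPy. This is an illustrative single-channel, single-kernel, stride-1 version with no padding, not the full N x N x D, H-kernel case:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel left-to-right, top-to-bottom over the image,
    producing one output feature map (valid convolution, stride 1)."""
    n, k = image.shape[0], kernel.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for i in range(out.shape[0]):          # kernel moved downward one element
        for j in range(out.shape[1]):      # kernel moved left to right
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A 3x3 kernel applied to a 5x5 image yields a 3x3 feature map.
img = np.arange(25, dtype=float).reshape(5, 5)
feat = convolve2d(img, np.ones((3, 3)))
```

With H kernels applied independently, the same loop would simply run H times, producing H such feature maps.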
The pooling/subsampling layer reduces the resolution of the features and makes them
robust against noise and distortion. There are two ways to pool: max pooling and average
pooling. For average pooling, the average of the four values in the region is calculated; for
max pooling, the maximum of the four values is selected.
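Both pooling variants over non-overlapping 2x2 regions can be sketched as follows (a minimal NumPy illustration, assuming even feature-map dimensions):

```python
import numpy as np

def pool2x2(feature, mode="max"):
    """Reduce resolution by taking the max or average of each
    non-overlapping 2x2 region of the feature map."""
    h, w = feature.shape
    blocks = feature[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))   # max pooling
    return blocks.mean(axis=(1, 3))      # average pooling

f = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
```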
The ReLU function is an activation function which lies between convolution and pooling.
A ReLU implements the function y = max(x,0), so the input and output sizes of this layer are
the same. It increases the nonlinear properties of the decision function and of the overall
network without affecting the receptive fields of the convolution layer. In comparison to the
other nonlinear functions used in CNNs, the advantage of a ReLU is that the network trains
many times faster.
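The ReLU described above is a one-line element-wise operation; note that the output has exactly the same size as the input:

```python
import numpy as np

def relu(x):
    """y = max(x, 0), applied element-wise; input and output sizes match."""
    return np.maximum(x, 0)

out = relu(np.array([-2.0, -0.5, 0.0, 1.5]))
```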
Fully connected layers are often used as the final layers of a CNN. These layers
mathematically sum a weighting of the previous layer of features. In the case of a fully
connected layer, all the elements of all the features of the previous layer get used in the
calculation of each element of each output feature.
To conclude what we learned from this journal: CNNs give the best performance in
pattern/image recognition problems and even outperform humans in certain cases. Since the
number of parameters in a CNN is drastically reduced, training time is proportionately reduced.
Also, assuming perfect training, we can design a standard neural network whose performance
would be the same as a CNN. But in practical training, a standard neural network equivalent to
CNN would have more parameters, which would lead to more noise addition during the training
process. Hence, the performance of a standard neural network equivalent to a CNN will always
be poorer.
This literature discusses the phases of the basic video surveillance system model, described in the following subsections.
3.2.1 Image Acquiring Phase
With the help of a camera we can acquire images easily. When a camera captures an
image, its initial form is the raw format, which contains minimally processed data from the
camera's image sensor. At this stage the raw image has a lot of noise and blurriness, so it is
not yet ready for further use. Camera quality therefore plays an important role in the
resulting image: if the quality is high, elements of interest can be easily highlighted and
identified, giving accurate and clear output. A raw image is influenced by factors such as
noise, shadow, light, image quality and contrast. This raw image is converted into frames
during the pre-processing phase.
In this phase the raw-format image is converted into frames and processed. Image
pre-processing is a technique which uses computer algorithms to process an image.
Processing eliminates noise and other factors which affect image quality, thereby enhancing
the quality of the frames, since video frames carry a lot of noise from the camera,
illumination, reflection and so on.
This is a major phase of focus, because effective object detection and classification
depend directly on it.
Segmentation is the process of subdividing an image in order to analyse each part, so
that the acquired image data can be used for application activities such as video surveillance.
Here the image is classified into two parts: background and foreground. The
segmentation of the foreground is often obtained by applying a threshold, whose parameter
usually depends on the camera noise. Once the objects of interest are identified, we use
various techniques for their clear identification, such as edge-based, feature-based, colour-
and pattern-based, and region-based extraction mechanisms.
Fig 3.2.3 Current image and segmented image
Feature selection deals with various feature extraction techniques based on spatial,
transform, edge and boundary, colour, shape and texture features.
Spatial features of an object are characterized by its grey level, amplitude and spatial
distribution.
The histogram of an image refers to the intensity values of its pixels: it shows the
number of pixels in the image at each intensity value.
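Computing such a histogram is a simple counting pass over the pixels; a minimal sketch for an 8-bit grayscale image:

```python
import numpy as np

def intensity_histogram(gray, bins=256):
    """Count how many pixels take each intensity value (0..bins-1)."""
    hist = np.zeros(bins, dtype=int)
    for v in gray.ravel():
        hist[int(v)] += 1
    return hist

img = np.array([[0, 0, 255],
                [128, 128, 128]], dtype=np.uint8)
h = intensity_histogram(img)
```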
Fig 3.2.7 A histogram model
Edge detection significantly reduces the amount of data in an image and filters out
unimportant information. If an edge is identified accurately, properties such as the object's
shape and size can be measured.
Colour is a visual attribute of an object. Colour features can be derived from a histogram
of the image. It can be mainly used for object matching with the pre-loaded data.
The shape of an object refers to its physical structure. It is determined by its external
boundary abstracting from other properties such as colour, content and spatial properties.
To summarize this literature: the authors presented different methods of moving-object
detection used in video surveillance and described the various phases of video surveillance.
This gives valuable insight into the area of moving-object detection as well as the field of
computer vision in general. It is clearly stated that if all the steps of video analytics are taken
into account, an effective mechanism can be built that results in better and clearer image
capture.
3.3 A Survey on Object Detection and Tracking Algorithms, Rupesh Kumar Rout (2013):
Tracking is basically the process of mapping an object within a sequence of frames,
from its first appearance to its last; the object of interest is decided by the application. An
object can be occluded by other objects, as mentioned by the author in [11], so a tracking
system should be able to predict the position of occluded objects.
Video surveillance systems include tasks such as motion detection, tracking, and activity
recognition. Detection of moving objects is the first important step: successful segmentation
of the moving foreground objects from the background. Common approaches include:
● Frame differencing
● Background subtraction
3.3.1.3 Background Subtraction
Background subtraction is one of the most popular and common approaches for motion
detection. The basic idea is to subtract the current image from a reference background image,
which is updated over the sequence; the result gives us the non-stationary objects.
Background subtraction is highly dependent on a good background maintenance model,
because it is extremely sensitive to dynamic scene changes from lighting and other events.
The main problems with background subtraction are:
● Illumination changes
● Memory
● Shadows
● Camouflage
● Bootstrapping
Illumination Changes:
The background model should be able to adapt to gradual changes in illumination over
a period of time.
Memory:
The background module should not use many resources in terms of computing power
and memory.
Shadows:
Shadows cast by moving objects should be identified as part of the background and not
foreground.
Camouflage:
Moving objects should be detected even if their pixel characteristics are similar to those
of the background.
As we all know, images are represented as arrays of pixels. A pixel is a scalar or vector
that encodes intensity or colour. A Gaussian mixture model can be used to separate the
pixels into similar segments and to model the background pixels.
The GMM method models the intensity of each pixel with a mixture of K Gaussian
distributions; it is effective in modelling a background with repetitive motions. The
probability that a certain pixel has the value Xt at time t can be written as:
P(Xt) = Σ (k = 1 to K) ωk,t · η(Xt, μk,t, Σk,t)
where K is the number of distributions, ωk,t is the weight of the kth Gaussian in the mixture
at time t, and η(Xt, μk,t, Σk,t) is a Gaussian probability density function where μk,t is the
mean value and Σk,t is the covariance of the kth Gaussian at time t.
Every new pixel value Xt is checked against the existing K Gaussian distributions in
turn until a match is found. If no match is found, the last distribution is replaced by a new
Gaussian with the current value as its mean, an initially high variance, and a low weight
parameter.
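The match-or-replace rule for a single pixel can be sketched as follows. This is a simplified, illustrative version for grayscale intensities; the match threshold of 2.5 standard deviations and the replacement variance/weight are common choices in the literature, not values taken from this report:

```python
import numpy as np

# Per-pixel mixture state: K Gaussians, each with a weight, mean and variance.
K = 3
weights = np.array([0.6, 0.3, 0.1])
means = np.array([50.0, 120.0, 200.0])
variances = np.array([20.0, 20.0, 20.0])

def update_pixel(x, weights, means, variances, match_sigma=2.5):
    """Check a new pixel value against the K Gaussians in turn; if no
    match is found, replace the last distribution with a new Gaussian
    centred on x, with an initially high variance and a low weight."""
    for k in range(K):
        if abs(x - means[k]) <= match_sigma * np.sqrt(variances[k]):
            return k                      # index of the matched distribution
    means[-1], variances[-1], weights[-1] = x, 900.0, 0.05
    return -1                             # no match: last Gaussian replaced
```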
In the first stage, we develop the background model and then apply background
subtraction to detect the foreground object. In the second stage, only stationary pixels are
processed to construct the initial background model. The initial background for a pixel (i, j)
is represented by a three-dimensional vector: the minimum m(i, j) and maximum n(i, j)
intensity values, and the maximum intensity difference d(i, j) between consecutive frames
observed during the training period. From these values the background model is obtained.
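The training stage just described (per-pixel minimum, maximum, and maximum inter-frame difference over a training sequence) can be sketched in NumPy; this assumes grayscale frames of equal size, purely for illustration:

```python
import numpy as np

def build_background_model(frames):
    """Represent the background at each pixel (i, j) by three values:
    minimum m(i, j), maximum n(i, j), and the maximum inter-frame
    intensity difference d(i, j) over the training sequence."""
    stack = np.stack(frames).astype(float)          # shape (T, H, W)
    m = stack.min(axis=0)                           # m(i, j)
    n = stack.max(axis=0)                           # n(i, j)
    d = np.abs(np.diff(stack, axis=0)).max(axis=0)  # d(i, j)
    return m, n, d

frames = [np.array([[10.0, 50.0]]),
          np.array([[12.0, 55.0]]),
          np.array([[11.0, 40.0]])]
m, n, d = build_background_model(frames)
```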
Summarising the whole literature: we have gone through various detection methods that
showcase motion detection for object tracking, one of the basic first steps of video
surveillance. The first method is frame differencing, which uses pixel-wise differences across
multiple frames; next is background subtraction, one of the most basic and widely used; and
finally the Gaussian mixture model, which reads the image and separates pixels by similarity.
3.4 Human motion detection system (Video motion detection module) by Soo Kuo Yang,
March 2005
This literature focuses on a Video Motion Detection module: research on techniques
and methodology to detect motion and to develop a module around them. This module
records motion and passes it to a subsequent object-classification module which classifies
human and non-human objects. Thus, this literature [13] comes up with a solution that
detects motion effectively and records it, with one or more objects that are moving and
causing the motion.
In this chapter, we will look into the design of the methods and techniques implemented
in the final prototype system. Diagrams of the architecture of the motion detection
algorithms are presented as well.
Fig 3.4.1 Overview of the prototype human motion detection application system
As shown, the motion detection module gives two outputs. This means two algorithms
are implemented, namely spatial update and temporal update + edge detection, whose
outputs are denoted respectively in the figure above.
Fig 3.4.2 An overview of the motion detection algorithm implemented
The background updating model is an important issue for motion detection algorithms;
two distinct algorithms were implemented for the background updating module.
The system developed here can capture sequences of images either from real-time video
from a camera or from a recorded sequence. Image acquisition is defined as the action of
retrieving an image from some source, usually a hardware-based source, for processing. It is
the first step in the workflow because, without an image, no processing is possible.
3.4.1.2 Image Segmentation & Image Subtraction
Here, two subsequent images are compared and processed using the arithmetic
operation of subtraction on the pixel values. Since colour images are usually used, the
algorithm considers the colour data of the images: it first separates each image into three
planes or channels, performs the arithmetic subtraction on each channel, and then combines
the results back into a colour image.
The next step is to perform some image-processing operations on the result obtained
from the previous step, starting with a threshold function. This thresholded result is further
used for recognition purposes, as it filters human body shapes better than the other output.
An adaptive threshold function is implemented here.
The next step is performed on the eroded and dilated image and the thresholded image,
to identify the individual contours present in the image. Contours considered separate from
one another are displayed in different colours; those that share the same colour may be
considered connected to each other.
Some operations are done to remove overlapping boxes or rectangles when they are
drawn onto the source image. Rectangles whose edges are near or almost cross one another
are joined up to form a larger rectangle. This helps eliminate the problem of only a portion of
a human being returned for recognition.
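The box-joining step can be sketched in plain Python. The 5-pixel "nearness" margin here is a hypothetical choice for illustration, not a value from the prototype:

```python
def near_or_overlapping(a, b, margin=5):
    """Boxes are (x1, y1, x2, y2); True if they overlap or their edges
    come within `margin` pixels of crossing one another."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return (ax1 - margin <= bx2 and bx1 - margin <= ax2 and
            ay1 - margin <= by2 and by1 - margin <= ay2)

def merge_boxes(boxes, margin=5):
    """Repeatedly join nearby boxes into their common bounding rectangle,
    so a person split across several small boxes becomes one region."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if near_or_overlapping(boxes[i], boxes[j], margin):
                    a, b = boxes[i], boxes[j]
                    boxes[j] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[i]
                    merged = True
                    break
            if merged:
                break
    return boxes

result = merge_boxes([(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)])
```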
Here, the thresholded image is further enhanced to fill in the empty spaces inside the
binary region which represents the moving object. An algorithm scans through the vertical
lines and, for each line, fills up the region between its first and last white pixels.
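The vertical-line fill just described can be sketched directly in NumPy (a minimal illustration on a small binary mask):

```python
import numpy as np

def fill_vertical_runs(binary):
    """For each column, fill everything between the first and last white
    pixel, closing holes inside the moving-object silhouette."""
    out = binary.copy()
    for col in range(binary.shape[1]):
        rows = np.flatnonzero(binary[:, col])  # indices of white pixels
        if rows.size:
            out[rows[0]:rows[-1] + 1, col] = 1
    return out

mask = np.array([[0, 1],
                 [0, 0],
                 [0, 1],
                 [0, 0]], dtype=np.uint8)
filled = fill_vertical_runs(mask)
```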
3.4.1.2.6 Area Mapping
After the bounding rectangles are identified, their positions are mapped to the source
image and the rectangles drawn there. Mapping is done both for the area of the bounding
boxes drawn in the source and for the corresponding area in the binary processed image. The
areas from the processed binary image are the ones used by the recognition engines.
We implemented two algorithms to update the background image used in the
image-subtraction stage of the motion detection algorithm.
The first algorithm updates the background model based on spatial rather than temporal
data (by temporal data we refer to the time or frame number of the images in the sequence).
Using only spatial data means the background is updated based on some percentage of pixel
change in the subtraction result.
The second algorithm is a conventional motion-detection approach for updating the
background image. Compared to the first algorithm, it uses many more computation steps
and thus gives a slower response. The information used in the computations comes from
subsequent frames; it does not depend on a single subtraction result and in fact does not use
the subtraction results at all. All frames are used in the computation to produce a
running-average image. A basic running average is defined as the sum of the previous value
of a pixel at a certain location and the new value taken by the corresponding pixel in the next
frame of the sequence, with a certain factor degrading the old pixel value.
To summarise the literature: the latter algorithm of the human motion detection
prototype system is the more suitable one to implement.
CHAPTER 4
PROPOSED SYSTEM
4.1 Architecture
The proposed system is a home security system which utilizes basic motion detection
and facial recognition for accurate remote monitoring. The system is integrated with a
Telegram chatbot through which the user receives notifications about possible events. A
laptop web camera is used to demonstrate the potential of simple, effective algorithms that
make use of everyday equipment.
Telegram is a free cloud-based instant messaging service. Users can send messages and
exchange photos, videos, stickers and more through Telegram. A Telegram chatbot can be
created with the help of BotFather, itself a Telegram bot, which is used to create new
Telegram bots and control existing ones. A simple and effective algorithm works in tandem
with the open-source Telegram chatbot, which opens up a great deal of potential for this
simple surveillance system.
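A minimal sketch of how such a chatbot might push an alert through the Telegram Bot API's sendMessage method. The token and chat id below are placeholders, and a real deployment would typically use a bot library and also poll for incoming commands:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.telegram.org"

def build_send_message_url(token, chat_id, text):
    """Construct the Bot API sendMessage URL for a given bot token."""
    query = urllib.parse.urlencode({"chat_id": chat_id, "text": text})
    return f"{API_BASE}/bot{token}/sendMessage?{query}"

def send_alert(token, chat_id, text="Intruder Alert"):
    """Fire the alert; this performs a network call, so it should only
    run in a live deployment with a real BotFather-issued token."""
    with urllib.request.urlopen(build_send_message_url(token, chat_id, text)) as r:
        return json.load(r)

# Placeholder values, for illustration only:
url = build_send_message_url("TOKEN", 42, "Intruder Alert")
```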
In order to achieve motion detection, a computer vision library known as OpenCV is
used. The attached camera's frames are divided into a grid, and adjacent frames are analysed
to detect the movement of different objects. The different objects in the frames are classified
and separated from the background using a background separation algorithm. The system is
trained on a class of datasets to distinguish intruders from family members and friends.
Whenever an intrusion occurs at the building, the intended user gets a notification and a
photo taken at the time of the intrusion. Users can also view live footage of the building by
sending simple commands to the chatbot. Ample storage capacity is available to keep the
captured footage and images safely in one place.
In this proposed system a method called frame differencing is used, in which two or
three adjacent frames from a time series of images are subtracted to obtain difference
images. Its working is very similar to background subtraction: after the subtraction, it yields
the moving-target information through a threshold value. This kind of method adapts well to
dynamic scene changes, but it generally fails to detect all the relevant pixels of some types of
moving objects. The basic principle of the background subtraction technique is separating
the estimated background image from the observed image; this foreground process divides
the image into two complementary sets of pixels. There are criteria that must be satisfied by
every detection algorithm: it must adapt itself to sudden changes such as illumination
changes, motion changes, and high-frequency objects and their geometry, especially in
outdoor surveillance scenes. These include unfamiliar changes in light intensity, camera
oscillation, and objects such as trees and parked vehicles.
Let the image be represented as F(x, y, t) and the background as K(x, y, t) at time t.
Using the frame differencing method, the background frame is simply the previous frame:
K(x, y, t) = F(x, y, t − 1)
A median filter instead uses the median of the n previous frames as the background model:
K(x, y, t) = median{ F(x, y, t − i) }, where i = {0, 1, …, n − 1}
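Both the frame difference itself and its thresholding can be sketched in a few NumPy lines. The threshold of 25 intensity levels is an illustrative choice, not a value from this system:

```python
import numpy as np

def frame_difference(prev_frame, curr_frame, threshold=25):
    """Subtract adjacent frames and threshold the absolute difference
    to obtain a binary mask of moving pixels."""
    diff = np.abs(curr_frame.astype(int) - prev_frame.astype(int))
    return (diff > threshold).astype(np.uint8)

prev = np.array([[10, 10], [10, 10]], dtype=np.uint8)
curr = np.array([[12, 10], [10, 200]], dtype=np.uint8)
mask = frame_difference(prev, curr)  # only the bottom-right pixel moved
```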
Background subtraction is a commonly used technique for motion segmentation against
static backgrounds. The method detects moving regions by subtracting the current image
pixel-by-pixel from a reference background image created by averaging images over time
during an initialization period. The basic idea is to initialize a background first, and then
subtract the current frame, in which the moving object is present, from the background frame
to detect the moving object. Background subtraction methods operate on pixels
independently. In a block-based variant, all pixels are first divided into groups of N×N
blocks and every block is processed as an N²-component vector; pixels are then classified
based on the thresholded difference between the current image and the back-projection of its
PCA coefficients.
In face recognition, we have the face detector, which localizes the regions of an image
in which the distinct features of a face can be identified. Then we have the embedder, which
is responsible for extracting facial embeddings via deep learning feature extraction.
Fig 4.1(b): Data flow diagram of facial recognition
It analyses the images and returns numerical vectors that represent each detected face. First
we load the image into memory and construct a blob (Binary Large OBject), a collection of
binary data, i.e. the image in binary form. We localize the faces in the image via the
"detector" and store the results in "detections". The system then loops over all the detections
and extracts each confidence score, which is the probability that an anchor box contains an
object, as predicted by the classifier. A detection is considered a true positive (TP) if it
satisfies three conditions: the confidence score is above a threshold; the predicted class
matches the class of the desired output; and the predicted bounding box has an Intersection
over Union (IoU) with the desired output greater than a threshold. Violation of either of the
latter two conditions makes a false positive (FP). We then compare the confidence to the
minimum probability detection threshold contained in our command line "args" dictionary,
ensuring that the computed probability is larger than the minimum. From there we extract
the face's Region of Interest (ROI). For displaying the results, we construct a "text" string
containing the name and probability, draw a bounding box around the face, place the text
above the box, and finally visualize the results on the screen.
CHAPTER 5
IMPLEMENTATION
Telegram is an instant messaging application used daily for chatting with family and friends. Its free and open-source nature allowed the developers to release a set of APIs for building bots, which are applications that automate tasks. Using such a bot, it is possible to chat with home appliances from anywhere in the world. In this project we developed one such bot running on a Raspberry Pi connected to a camera. The bot receives the user's instructions and replies accordingly. We developed a security system in which the Telegram chatbot alerts the user whenever an intruder enters the house. The camera and the Raspberry Pi are connected to the Telegram chatbot; whenever there is motion, the user receives an alert signal and an image is sent to the chat. Security is ensured by generating a private token key that is unique to each user. The bot responds only to users whose token key is registered: it checks the token key and then grants the user access to the images snapped by the camera. In this implementation the camera not only detects movement but can also identify who is in the frame, which helps avoid sending false alerts.
Various mathematical and vector operations are used to analyse image patterns and features. OpenCV and the Python libraries work well together, which makes the implementation efficient.
Telegram is a cloud-based online messaging app that works much like popular messaging apps such as WhatsApp. This means we can use it to send messages to our contacts whenever we are connected to the internet via Wi-Fi or mobile data; it also offers a voice-over-IP service. Telegram claims to prioritize security and speed, making it a good alternative to other popular messaging apps. What makes Telegram stand out is that it is open source, which gives developers a great deal of freedom, and the availability of the Telegram Bot API opens up a whole array of possible project ideas.
Telepot is a Python package that lets our program talk to the Telegram Bot API. It works on Python 2.7 and Python 3; for Python 3.5+ an async version is also available. The command "pip install telepot" installs the telepot package on our machine. To use the Telegram Bot API, we first have to get a bot account by chatting with BotFather, which gives us a token. With the token in hand, we can start using telepot to access the bot account. Dlib is a C++ library, and a number of its tools can be used from Python applications. It contains many machine-learning algorithms and tools; in particular, dlib supports histograms of oriented gradients (HOG) and SVMs, which are crucial for image processing. Imutils is another Python library containing functions for image manipulation during image processing: it helps to resize, translate, and rotate video frames from the webcam. Imutils works alongside matplotlib and can also display the results of matplotlib computations. It is installed with a single command, "pip install imutils".
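A minimal telepot bot skeleton looks like the sketch below. The command names mirror those used later in this chapter, but the handler logic is illustrative and a real token from BotFather must replace the placeholder before running it:

```python
import time

def dispatch(command):
    # Illustrative command handling; the real system replies with
    # images, footage files, or status text for these commands
    replies = {
        'Footage': 'Sending surveillance footage...',
        'Video': 'Recording a 5 second clip...',
    }
    return replies.get(command, 'Unknown command')

def handle(msg):
    # Called by telepot for every incoming message
    import telepot  # imported here so dispatch() stays testable offline
    content_type, chat_type, chat_id = telepot.glance(msg)
    if content_type == 'text':
        bot.sendMessage(chat_id, dispatch(msg['text']))

if __name__ == '__main__':
    import telepot
    from telepot.loop import MessageLoop
    bot = telepot.Bot('YOUR-BOTFATHER-TOKEN')  # placeholder token
    MessageLoop(bot, handle).run_as_thread()
    while True:
        time.sleep(10)
```

Keeping the command table in a plain function means the bot's behaviour can be checked without connecting to Telegram at all.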
Scikit-learn is a machine-learning library for Python. It builds on NumPy, SciPy, and matplotlib, which help with array and algebraic operations. Scikit-learn offers various algorithms, such as SVMs and gradient boosting, that are well suited to tasks like face recognition.
5.1 Face Recognition
2. Each input batch of data includes a positive image and a negative image.
3.1 We use a Caffe-based deep-learning face detector to localize faces in an image.
3.2 The next model is Torch-based and is responsible for extracting facial embeddings via deep-learning feature extraction.
4. Detect faces in the image by passing it through the detector network.
4.1 Extract the detection with the highest confidence and check that the confidence meets the minimum probability threshold used to filter out weak detections.
5. Extract 128-d embeddings for each face.
6. Initialize the SVM model and train it.
7. Using a deep neural network, compare the input data (128-d vector) to the known dataset.
8. Recognize the face.
In face recognition, we have the face detector, which localizes the region of an image in which the distinctive features of a face can be identified.
modelPath = os.path.sep.join([args["detector"],
    "res10_300x300_ssd_iter_140000.caffemodel"])
Then we have the embedder, which is responsible for extracting facial embeddings via deep-learning feature extraction. It analyses the images and returns numerical vectors that represent each detected face.
Fig 5.1 : Facial Recognition process
First we load the image into memory and construct a blob; a blob (binary large object) is a collection of binary data, i.e. the image in binary form.
image = cv2.imread(args["image"])
(h, w) = image.shape[:2]
imageBlob = cv2.dnn.blobFromImage(
    cv2.resize(image, (300, 300)), 1.0, (300, 300),
    (104.0, 177.0, 123.0), swapRB=False, crop=False)
We then localize the faces in the image via the "detector" and store the results in "detections". After that, we loop over all the detections and extract the confidence score, which is the probability that an anchor box contains an object as predicted by the classifier. A detection is considered a true positive (TP) if it satisfies three conditions: the confidence score exceeds the threshold; the predicted class matches the class of the desired output; and the predicted bounding box has an Intersection over Union (IoU) with the desired output greater than a threshold. Violating either of the latter two conditions makes it a false positive (FP).
confidence = detections[0, 0, i, 2]
We then compare the confidence to the minimum probability detection threshold contained in our command-line "args" dictionary, ensuring that the computed probability is larger than the minimum. From there we can extract the face's region of interest (ROI). To display the results, we construct a "text" string containing the name and probability, draw a bounding box around the face, place the text above the box, and finally visualize the results on the screen.
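The confidence check and box extraction can be sketched as a small helper over the raw detections array. The (1, 1, N, 7) layout is the OpenCV DNN face-detector convention (index 2 is the confidence, indices 3:7 the box corners as fractions of the image); the function name is ours:

```python
import numpy as np

def filter_detections(detections, w, h, min_conf=0.5):
    # Keep detections above the minimum confidence and return
    # their bounding boxes scaled back to image coordinates
    boxes = []
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > min_conf:
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            boxes.append((float(confidence), box.astype(int)))
    return boxes
```

A detection at 90% confidence survives the filter while one at 20% is discarded, which is exactly the role the "args" threshold plays in the report's loop.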
Fig 5.2 Facial recognition in Video stream
9. If the contour area is larger than the minimum area, draw a bounding box surrounding the
foreground and motion region.
10. Update room status to occupied, add timestamp to frame.
Define a variable 'min_area', the minimum number of pixels that must change in a region of the image for it to be considered actual motion; this filters out false positives. The camera is started and the first frame of the video is initialised as the still reference frame. Each subsequent frame is resized, converted to grayscale, and smoothed with a Gaussian blur. The difference between the first frame and each subsequent frame is computed as the absolute value of the pixel intensity differences. If a pixel's difference is less than 25, the pixel is discarded and set to black; if it is greater than 25, the pixel is set to white.
Contour detection is then applied to find the outlines of the white regions, and we loop over each contour to filter out small, irrelevant ones. If a contour's area is larger than 'min_area', a bounding box is drawn around the foreground motion region.
(x, y, w, h) = cv2.boundingRect(c)
Fig 5.3: Bounding box drawn over moving object
Telegram provides a bot called BotFather, which is used to create new bot accounts; it can also be used to manage existing bots.
1) Open Telegram
5) Choose a unique username for the bot; the username must end with "bot".
6) You now receive a unique token that can be used to access and control the bot.
The facial recognition and motion detection algorithms work together to find possible intruders in the home. A flag variable is initialised to track movement. The facial recognition labels are stored in a pickle file named 'labels'. When motion is detected, the program looks through the labels; if no match is found, the frame is saved to disk and immediately sent to the user as an alert notification.
The system starts recording footage once surveillance begins, and the user can also request a 5-second video clip to see what is going on.
A VideoStream (from the imutils.video package) is used to read from the camera; the filename and the codec (fourcc) used for encoding the video are initialised in a cv2.VideoWriter before streaming starts. Here 'footage' is the file object the video is written to, XVID is the codec used for encoding, 30 is the frame rate of the footage, and 640x480 is the resolution at which the video is recorded.
frame = vs.read()
footage.write(frame)
Each frame read from the camera is written to the footage file. For the user to send the
surveillance footage the following script is used.
if command == 'Footage':
Receiving the surveillance footage could take some time depending upon the size of the file
and the bandwidth of the internet connection. The user can also request a 5 second clip by
sending the command ‘Video’.
if command == 'Video':
    start_time = time.time()
    # record frames until 5 seconds have elapsed
    while time.time() - start_time < 5:
        frame = vs.read()
        result.write(frame)
    print("Done recording")
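The 5-second clip logic can be isolated into a small helper with an injectable clock, which makes it easy to exercise without a camera. The helper and its names are ours, not the report's code:

```python
import time

def record_clip(read_frame, write_frame, duration=5, clock=time.time):
    # Keep reading and writing frames until `duration` seconds
    # have elapsed on the supplied clock
    frames_written = 0
    start_time = clock()
    while clock() - start_time < duration:
        write_frame(read_frame())
        frames_written += 1
    return frames_written
```

With the real clock and a 30 fps camera this writes roughly 150 frames; with a fake clock the loop count is fully deterministic.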
CHAPTER 6
EVALUATION
In this chapter the developed system is evaluated against different test cases and varying scenarios. Although the algorithm is able to draw the bounding box correctly over a detected object, it is highly dependent on the threshold applied to the subtracted frame. The system uses '.avi' files for saving video. 'Full bound' means the whole detected motion area is wrapped in a bounding box; 'adequate bound' means the bounded area is sufficient for the algorithm to recognise motion, i.e. it covers the necessary areas.
The time taken to detect a moving object is calculated based on the detection of the object's contour. Detection time depends directly on the speed and position of the object in motion. The position of the object is calculated using spatial moments.

V = |Pi − Pi−1| / (Ti − Ti−1)

where V is the speed of the object, Pi and Pi−1 are the positions of the centre of the detected object in subsequent frames (current frame i and previous frame i−1), and Ti and Ti−1 are the times of frames i and i−1.
In this experiment the objective is to find the speed limit up to which the algorithm can detect the boundary of an object and its form.
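The centre position from spatial moments (cx = M10/M00, cy = M01/M00) and the speed formula above can be sketched as follows; the helper names are ours:

```python
import math

import numpy as np

def centroid(mask):
    # Centre of a binary region from spatial moments:
    # cx = M10 / M00, cy = M01 / M00
    ys, xs = np.nonzero(mask)
    m00 = len(xs)
    return (xs.sum() / m00, ys.sum() / m00)

def speed(p_prev, p_cur, t_prev, t_cur):
    # V = |Pi - Pi-1| / (Ti - Ti-1)
    return math.dist(p_prev, p_cur) / (t_cur - t_prev)
```

For a square blob the centroid lands at the square's centre, and an object that moves 5 pixels between frames one second apart has a speed of 5 px/s.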
Table 6.1: Distance travelled, time taken and speed of object in motion.
Fig 6.1: Line chart showing speed and position analysis of detection.
In the next method, we use recall R and precision P to calculate the F1 score in order to evaluate the performance of our algorithm.

Recall = TP / (TP + FN) 6.1(a)
Precision = TP / (TP + FP) 6.1(b)
F1 = 2 × (Precision × Recall) / (Precision + Recall) 6.1(c)

FN is the number of objects in the frame which are not detected but have the desired shape. F1 reaches its best value near 1 and its worst value near 0.
Scene               Number of appearances   TP    FP   FN   R     P     F1
Objects in motion   287                     271   28   16   0.94  0.91  0.92
Fig 6.1(a): Line Chart showing values of F1, recall and precision.
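The table's metric columns can be reproduced directly from its counts (TP = 271, FP = 28, FN = 16) using equations 6.1(a) and 6.1(b):

```python
def detection_metrics(tp, fp, fn):
    # Recall = TP / (TP + FN), Precision = TP / (TP + FP),
    # F1 = 2PR / (P + R)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

r, p, f1 = detection_metrics(271, 28, 16)
# rounded to two places these match the table: 0.94, 0.91, 0.92
```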
The calculated values show that the proposed background-subtraction algorithm allows detection of moving objects with a distinct geometric shape. Despite the FPs and FNs, the F1 value is good and close to 1. False positives occur when other objects pass in front of moving objects causing occlusions, or when shadows or lighting change; this can be controlled by increasing the threshold value.
6.2 Face Recognition
For face recognition to work properly, a trainer file is essential. A model is trained on images of people's faces and saved as a trainer file with the extension '.yml'. This file is loaded each time and used by the local binary pattern histogram (LBPH) algorithm for facial recognition. The number of photos per person used for training the model is chosen carefully to balance latency against accuracy.
For the facial recognition algorithm to work properly, the face has to be visible in the frame, and the accuracy of the algorithm varies with the angle of the face in the scene. Metric evaluation is used next to calculate accuracy.
Trials   TP   FN   FP   TN   Accuracy
1-6      8    0    1    2    90.90 %
7-8      7    0    2    1    80 %
9        6    0    1    0    85.71 %
10       5    1    1    1    75 %
11       11   0    0    1    100 %
12       5    0    0    0    100 %
13       6    0    1    1    87.5 %
Where Accuracy = (TP + TN) / (TP + TN + FP + FN) and the average accuracy is 88.44%
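The per-trial accuracies and their average can be reproduced from the table's counts; the small gap between the exact mean and 88.44% comes from the table rounding each trial first:

```python
def accuracy(tp, tn, fp, fn):
    # Accuracy = (TP + TN) / (TP + TN + FP + FN), as a percentage
    return 100 * (tp + tn) / (tp + tn + fp + fn)

# (TP, FN, FP, TN) rows from the face-recognition trials table
rows = [(8, 0, 1, 2), (7, 0, 2, 1), (6, 0, 1, 0), (5, 1, 1, 1),
        (11, 0, 0, 1), (5, 0, 0, 0), (6, 0, 1, 1)]
scores = [accuracy(tp, tn, fp, fn) for tp, fn, fp, tn in rows]
average = sum(scores) / len(scores)
```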
The arrival time of intrusion alerts in Telegram depends directly on the bandwidth of the internet connection; if files are being sent, their size also affects how fast the notifications arrive.
Captured Image 1
CHAPTER 7
CONCLUSION
A smart surveillance system was designed which incorporates both the motion detection and facial recognition algorithms to improve the accuracy of the alert notifications sent to the user. The alerts are sent as notifications via a Telegram chatbot, through which the user can also send commands to the system. Telegram, being open source and free, stores the video files and images sent to users on its own servers, which makes the system extremely cost-effective.
The system is built entirely with OpenCV, which makes it efficient and fast: OpenCV is less resource-intensive and can run on low-end systems, keeping our system budget friendly.
The frame rate of the video stream is greatly affected by combining motion detection and facial recognition, although real-time performance is otherwise unaffected. When motion is detected, the system looks for a face for some time; if it cannot find one, it sends an alert. If a face is detected, it captures the image and sends it as the alert notification; when this happens there is a large spike in dropped frames while streaming, but the detection still works correctly.
Initially the algorithm picked up even the slightest change in pixels, such as shadows and lighting changes, which caused false alerts; this was controlled by increasing the threshold value. The algorithm also made false facial recognitions when the face was seen from different angles. This was improved slightly by training on a larger dataset for each person, and pre-processing the face images, for example with face alignment, brought significant improvements.
However, the accuracy of face detection was poor in low light or when the face was partially covered. Face recognition with OpenCV was less resource-hungry than YOLO or TensorFlow, which made it cost-effective, but this also affects the accuracy of the system.
One other downside is the time taken for notifications to arrive on Telegram. Regular messages such as images and text arrive very quickly, but video files, which are usually large, can take a while to arrive. The slight dip in the accuracy of the system is a trade-off between cost and performance.
REFERENCES
[1] B. Ortiz-Jaramillo, A. Kumcu “Computing contrast ratio in images using local content
information” – 2017, Vol.25, Issue.3, 6 Pages
[2] Chris Rowen, Samer Hijazi, Rishi Kumar, “Convolutional neural networks for image
recognition”- 2016, Vol.55, Issue.4, 8 Pages.
[3] Harish Kumar Sharma, Mayank Sharma “IOT based home security system with
wireless sensors and telegram messenger” Vol.25, Issue.6, 6 Pages.
[4] Hong Phat Truong, Justin Joseph, "Low-Cost Computing Using Raspberry Pi 2 Model B" - 2017, Vol.5, Issue.2, 13 Pages.
[5] Jinzhou Huang, Ming Zhou “Extracting Chatbot Knowledge from Online Discussion
Forums” – 2018, Vol.2, Issue.1, 8 Pages
[6] Kaushal Mittal et al., "Classification, Clustering and Application in Intrusion Detection" - 2014, Vol.4, Issue.1, 7 Pages.
[7] K. Sripath Roy, Bhanu Prakash “Realization of a low-cost smart home security using
telegram messenger and voice”-2017 Vol.115, Issue.5, 6 Pages.
[8] Michael Bächle, Stephen Daurera “Chatbots as a user interface for assistive
technology in the workplace”-2016, Vol.3, Issue.2, 6 Pages.
[9] P. Vigneswari, R.R. Narmatha “Automated security system using surveillance”, -2015,
Vol.5, Issue.2, 5 Pages.
[10] Priya B. Patel, Viraj M. Choksi “Smart motion detection system using raspberry pi”-
2017 Vol.10, Issue.5, 4 Pages.
[11] Rupesh Kumar Rout et al., "A survey on object detection and tracking algorithms" - 2013, Vol.2, Issue.3, 5 Pages.
[12] Samer Hijazi, Rishi Kumar, Chris Rowen “Convolutional neural networks for image
recognition”- 2016, Vol.55, Issue.4, 8 Pages.
[13] Soo Kuo Yang et al., "Human motion detection system (Video motion detection module)", March 2005, Vol.2, Issue.2, 10 Pages.
[14] https://www.elprocus.com/gsm-based-home-security-system-working-with-
applications/, referred on 5th October 2019
[15] https://www.instructables.com/id/Raspberry-Pi-as-low-cost-HD-surveillance-camera/,
referred on 26th September 2019
[16] https://www.hackster.io/hackershack/smart-security-camera-90d7bd, referred on 10th
October 2019
[17] https://machinelearningmastery.com/how-to-train-an-object-detection-model-with-
keras/, referred on 3rd November 2019
APPENDIX-1
Fig 1 (b): Code snippet of telegram alert notifications
APPENDIX-2
Fig 2 (b): Chatbot sending alerts about unknown movement