
Acoustic detection of drones:

Introduction: In recent years, the use of Unmanned Aerial Vehicles (UAVs), or drones, has increased, with applications ranging from photography and surveillance to video monitoring. However, drones are often engaged in malicious activities: armed drones can harm individuals, and drones can be used for spying.
Therefore, to address such issues, we propose a deep learning framework using CNNs (Convolutional Neural Networks) that detects an incoming drone from its acoustic signature, as acoustic detection allows for long-range detection and does not require line-of-sight access to the drone.

Advantages of Acoustic Detection:


In the case of video detection of drones, it can be difficult to capture a drone at night, and it may not be possible to distinguish drones from real-world objects that mimic them, such as birds and planes.
Mel Spectrogram:
Mel spectrograms are commonly used in audio signal processing applications
where the frequency content of a signal needs to be analysed in a way that is
more aligned with human perception.
A spectrogram is a visual representation (somewhat like a snapshot) of the spectrum of frequencies of a signal as it varies with time.
(The mel scale is a non-linear scale that models the human auditory system's response (pitch) to different frequencies. Because the mel scale is non-linear, the spacing between adjacent mel frequency bins is proportional to the perceived difference between them. This makes the mel scale more suitable for modelling human perception of sound.)
A mel spectrogram is a type of spectrogram that is commonly used in audio
signal processing and analysis. It is a representation of the spectrum of a sound
signal, where the frequency range is divided into a set of bands that are spaced
according to the mel scale.
(In a mel spectrogram, position along the vertical axis signifies frequency, while the intensity of the colour signifies the energy of the signal at that frequency and time.)
The mel scale is a perceptual scale of pitch (how we hear frequencies) that is
based on the human ear's response to sound. It is designed to reflect the way that
humans perceive differences in pitch, with equal steps in the mel scale
corresponding to equal perceived pitch intervals. This makes the mel scale more
appropriate for representing the frequency content of sound signals as perceived
by humans.
To create a mel spectrogram, the audio signal is first divided into short overlapping time frames, typically 10 to 50 ms in duration. For each frame, a Fast Fourier Transform (FFT) is performed to calculate the power spectrum (the frequency content) of the signal. The resulting spectrum is then mapped onto the mel scale (the Y-axis) using a filter bank, which groups neighbouring frequencies in the spectrum into a set of mel frequency bands, so that each frame represents the distribution of energy across the mel frequency scale at a specific time.
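As an illustration, a mel spectrogram can be computed with the librosa library. This is a minimal sketch, not part of the original pipeline; the file name and the parameter values (sample rate, FFT size, hop length, number of mel bands) are assumptions chosen for demonstration.

import librosa
import numpy as np

# Load an audio clip (the file name is hypothetical); sr=22050 resamples the signal.
y, sr = librosa.load("drone_clip.wav", sr=22050)

# Frame the signal into overlapping windows (n_fft samples per frame, hop_length
# samples between frame starts): 1024 samples at 22050 Hz is roughly 46 ms,
# within the 10-50 ms range mentioned above. The FFT power spectrum of each
# frame is then mapped onto 128 mel bands by a filter bank.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=512, n_mels=128)

# Convert power to decibels, the scale that is usually visualised.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, n_frames)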

Mel spectrograms are widely used in speech and audio processing applications,
such as speech recognition, speaker identification, and music analysis. They
provide a useful visual representation of the frequency content of an audio
signal and can help to identify patterns and features that are relevant to the
analysis task.
(Before sending the spectrograms to the CNN model, we need to optimize them to get more accurate results; a short code sketch follows this list.
1. Normalize the spectrograms: Normalize the values in the spectrograms
so that they have zero mean and unit variance. This can help to reduce
the effect of differences in amplitude and background noise on the
model's performance.
2. Apply data augmentation: Apply data augmentation techniques such as
random cropping, flipping, and shifting to generate additional training
data and reduce overfitting.
3. Resize the spectrograms: Resize the spectrograms to a fixed size before
feeding them to the CNN model. This can reduce the amount of
computation required during training and inference and improve the
model's performance.
4. Convert to grayscale: Convert the spectrograms to grayscale before
feeding them to the CNN model. This can reduce the number of input
channels required by the model and simplify the training process.
5. Use transfer learning: Use transfer learning to fine-tune a pre-trained model on the mel spectrogram dataset. This can improve the model's performance and reduce the amount of training data required.
)
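A minimal sketch of steps 1 and 3 above, assuming the spectrograms arrive as NumPy arrays; the 128x128 target size is an arbitrary illustrative choice, and the helper name preprocess is hypothetical.

import numpy as np
import torch
import torch.nn.functional as F

def preprocess(mel_db: np.ndarray, size: int = 128) -> torch.Tensor:
    """Normalise a mel spectrogram to zero mean / unit variance and
    resize it to a fixed (size x size) grid for the CNN."""
    x = (mel_db - mel_db.mean()) / (mel_db.std() + 1e-8)   # step 1: normalise
    x = torch.from_numpy(x).float()[None, None]            # shape (1, 1, H, W)
    x = F.interpolate(x, size=(size, size), mode="bilinear",
                      align_corners=False)                 # step 3: resize
    return x.squeeze(0)                                    # (1, size, size)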
Need for the mel spectrogram: Mel spectrograms are given as input to the CNN, which is best suited to image-like operations.

CNN: (Convolutional Neural Network)


Neural networks are a core part of deep learning algorithms. A CNN is a type of deep learning model that is primarily used for image recognition and classification tasks. CNNs are designed to process inputs that have a grid-like topology, such as images, and are particularly effective at identifying features that are spatially related.
CNNs usually consist of multiple layers of convolutional and pooling operations, followed by one or more fully connected layers.
Convolutional layers use filters or kernels that convolve over the input image to produce a feature map (the output of the convolution), extracting relevant features at each location.
For example, the input image may be 3-D (height x width x channels), while the kernel is a 2-D weight matrix (e.g., 3x3).

Each element of the feature map is computed from a receptive field rather than from all pixels of the input image; hence, convolutional layers are partially connected layers.
Pre-training parameters (hyperparameters) that affect the output feature map:
1) Number of kernels
2) Stride (the number of pixels over which the kernel moves when generating the feature map)
3) Padding with 0s: to obtain the desired output size, we pad the input accordingly; for example, to enlarge the output image, we can apply zero padding along the input image borders
The convolutional layer transforms images into numbers, thus allowing the neural network to analyse and extract relevant patterns.
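A minimal PyTorch sketch of how these hyperparameters shape the feature map; the channel counts, stride, and padding values are illustrative assumptions.

import torch
import torch.nn as nn

# 8 kernels of size 3x3, stride 1, and zero padding of 1 pixel on each border:
# with a 3x3 kernel, padding=1 keeps the spatial size of the output unchanged.
conv = nn.Conv2d(in_channels=1, out_channels=8,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 1, 128, 128)   # one single-channel 128x128 spectrogram
print(conv(x).shape)              # torch.Size([1, 8, 128, 128]): 8 feature maps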
Pooling layers then down-sample the output of the convolutional layers, reducing the spatial dimensions of the feature maps while preserving their essential information (this prevents overfitting and speeds up training). Pooling operates on each channel of the feature map independently, reducing the height and width of the feature map while preserving the number of channels. The size of the pooling window and the stride are hyperparameters that can be adjusted to control the degree of down-sampling.
Adaptive pooling is a commonly used deep learning technique for image and signal processing tasks. It refers to a type of pooling operation that dynamically adjusts its size and shape based on the input data.
(Pooling is a common operation in convolutional neural networks (CNNs) that
is used to downsample the feature maps, reducing their size while preserving
the most important features). The traditional approach to pooling is to use fixed-
size pooling kernels (usually of size 2x2 or 3x3) to reduce the resolution of the
feature maps. However, this fixed-size approach can lead to information loss,
particularly when processing inputs with varying sizes or aspect ratios.
Adaptive pooling overcomes this limitation by using a variable-sized pooling kernel that adapts to the input data. The most common types are adaptive average pooling and adaptive max pooling. In adaptive average pooling, the kernel size is adjusted to the size of the input feature map so that a fixed output size is produced, and each output value is the average of the values in its window. In adaptive max pooling, the kernel size is likewise adjusted, but each output value is the maximum value in its window.
(Adaptive pooling can improve the performance of deep learning models, particularly when dealing with inputs of varying sizes or aspect ratios. It can also reduce the number of parameters in the model, leading to faster training and lower memory requirements.)
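A short sketch contrasting fixed and adaptive pooling in PyTorch; the input sizes are arbitrary assumptions chosen to show that adaptive pooling yields a fixed output size regardless of input size.

import torch
import torch.nn as nn

fixed = nn.MaxPool2d(kernel_size=2)       # fixed 2x2 window halves H and W
adaptive = nn.AdaptiveAvgPool2d((4, 4))   # always produces a 4x4 output

for h, w in [(128, 128), (96, 160)]:      # two differently sized inputs
    x = torch.randn(1, 8, h, w)
    print(fixed(x).shape, adaptive(x).shape)
# fixed output depends on the input size; adaptive output is always (1, 8, 4, 4)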
Fully connected layers then use the extracted features to make predictions
about the input image.
ReLU activation functions are commonly used in convolutional and pooling layers, and SoftMax activation functions are commonly used in the output layer of neural networks for multi-class classification tasks. However, the choice of activation function depends on the specific task and network architecture; different activation functions can be used in different layers of a neural network to achieve the desired behaviour and performance.
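Putting the pieces together, a minimal CNN classifier of the kind described above might look as follows; the layer sizes, the class name DroneCNN, and the two-class (drone / no drone) output are illustrative assumptions, not an architecture prescribed here.

import torch
import torch.nn as nn

class DroneCNN(nn.Module):
    """Convolution -> ReLU -> pooling -> adaptive pooling -> fully connected."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size output for the FC layer
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(1))  # logits; SoftMax is applied at inference

model = DroneCNN()
print(model(torch.randn(1, 1, 128, 128)).shape)  # torch.Size([1, 2])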

CNNs are trained using a large dataset of labelled images and a loss function
that measures the difference between the predicted and actual labels. During
training, the weights of the filters in the convolutional layers are adjusted using
backpropagation, which calculates the gradients of the loss function with
respect to the weights.
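A hedged sketch of such a training loop, assuming the DroneCNN model from the previous sketch and a DataLoader named train_loader yielding (spectrogram, label) batches; cross-entropy is one common choice of loss, not the only one.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # measures the difference between predicted and actual labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):             # the number of epochs here is arbitrary
    for spectrograms, labels in train_loader:   # train_loader is assumed to exist
        optimizer.zero_grad()
        loss = criterion(model(spectrograms), labels)
        loss.backward()             # backpropagation: gradients of the loss w.r.t. the weights
        optimizer.step()            # adjust the weights of the filters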
CNNs have achieved state-of-the-art results in many image classification tasks,
including object recognition, face recognition, and scene understanding. They
have also been applied to other domains such as natural language processing
and speech recognition, with modifications to their architecture to suit the
specific task.
Methodology:
1. Data Collection/ Obtaining Datasets: The first step is to collect audio
data of different types of drones in different environments. The audio data
should include the sound of the drone’s motors, propellers, and any other
sounds that are unique to that drone. The data should be collected using a
high-quality microphone and stored in a database.
2. Data Pre-processing: The audio data collected needs to be pre-processed
to extract relevant features that can be fed into the CNN for training. This
involves segmenting the audio into small windows, performing a Fourier
transform to obtain the frequency spectrum, and applying various signal
processing techniques to enhance the features.
3. Training the CNN: Once the pre-processing is done, the CNN needs to be
trained using the pre-processed audio data. The CNN will learn to
identify patterns in the audio data that are unique to each drone.
4. Testing the CNN: After the CNN is trained, it needs to be tested on new
audio data to evaluate its accuracy in detecting drones. The test data
should include audio samples of different drones in different
environments.
5. Deployment: Once the CNN is trained and tested, it can be deployed for
real-time drone detection. The audio data from the microphone can be fed
into the CNN, which will analyze it and identify whether a drone is
present or not.
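A minimal real-time inference sketch for step 5, reusing the hypothetical preprocess helper and trained model from the earlier sketches; the sounddevice capture, the one-second window, and the 0.5 decision threshold are all assumptions.

import numpy as np
import sounddevice as sd
import librosa
import torch

SR = 22050
model.eval()
while True:
    audio = sd.rec(int(1.0 * SR), samplerate=SR, channels=1)  # record 1 s from the microphone
    sd.wait()                                                 # block until recording finishes
    mel = librosa.feature.melspectrogram(y=audio.squeeze(), sr=SR,
                                         n_fft=1024, hop_length=512, n_mels=128)
    x = preprocess(librosa.power_to_db(mel, ref=np.max))      # normalise and resize
    with torch.no_grad():
        prob = torch.softmax(model(x[None]), dim=1)[0, 1]     # probability of the "drone" class
    print("drone detected" if prob > 0.5 else "no drone")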
It is important to note that the accuracy of the CNN in detecting drones depends
on the quality and quantity of the training data, the complexity of the CNN
architecture, and the signal processing techniques used in the pre-processing
stage.
