
Applied Acoustics 211 (2023) 109468

Contents lists available at ScienceDirect

Applied Acoustics
journal homepage: www.elsevier.com/locate/apacoust

Technical note

Improved pitch shifting data augmentation for ship-radiated noise


classification
Xu Yuanchao ⇑, Cai Zhiming, Kong Xiaopeng
College of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China

Article history: Received 1 February 2023; Received in revised form 24 May 2023; Accepted 28 May 2023; Available online 10 June 2023.

Keywords: Ship-radiated noise classification; Data augmentation; Time stretching; Pitch shifting; Machine learning; Feature extraction

Abstract: The limited amount of ship-radiated noise data causes machine learning models to be prone to overfitting in training, and data augmentation methods could improve model generalization performance. The frequency stability of harmonic line spectra in ship-radiated noise leads to a lack of sample diversity generated by the pitch shifting method, which is overcome by the proposed improved method. Nine classification algorithms combining three time-frequency features and three classifiers are implemented and evaluated on the DeepShip and the ShipsEar datasets. The average accuracy increased by 1.67% on DeepShip and 2.25% on ShipsEar, using improved pitch shifting and time stretching augmentation methods. The constant-Q transform-convolutional neural network (CQT-CNN) algorithm performs best among these nine algorithms. Its accuracy improved from 68.33% to 74.08% on the DeepShip, and its F1 score improved from 57.92% to 61.45% on the ShipsEar. Data augmentation improves classification performance for each class of ships in different ways, suggesting that augmentation specific to the class and state of the ship would improve classification performance further.

© 2023 Elsevier Ltd. All rights reserved.

1. Introduction

Ship-radiated noise classification is a complex and difficult task. Various feature extraction methods have been used for target classification to address this challenge, including MFCC (Mel Frequency Cepstrum Coefficients) [1], wavelet transformation [2], LOFAR (Low-Frequency Analysis Recording) [3], DEMON (Detection of Envelope Modulation on Noise) [4], and mode decomposition [5]. Such features are fed to various classifiers, ranging from support vector machines (SVMs) [5-7] and neural networks [1,2] to convolutional neural networks (CNN) [3,8,9]. Integrated features combined with multiple classifiers have also been utilized [10,11]. Information theory, including entropy-based [12,13] and complexity-based [14,15] methods, was used for feature extraction, which could be combined with variational mode decomposition [16]. In recent years, deep learning techniques, represented by CNN, have been widely used for ship-radiated noise classification [17]. Relying on the powerful fitting ability of CNN, the raw audio data can be input directly to the network and features extracted automatically [18], or end-to-end learning can be achieved [19,20,21]. However, due to the complex and variable nature of ship noise, few samples, and unbalanced data, current intelligent noise recognition algorithms perform well on the training set but struggle to achieve satisfactory results on the test set [17]. The lack of samples seriously affects the generalization ability of the model; machine learning models, especially deep learning models, are prone to overfitting.

The collection and tagging of underwater target data are expensive. The total duration of the publicly available ShipsEar dataset in 2016 was less than 2 h [22]. Although 47 h of the DeepShip dataset [23] became publicly available in 2021, the current data volume is still far too limited compared to datasets in the speech and image domains. Data augmentation is an elegant solution that increases the data by deforming the original samples. Data augmentation is implemented in image classification by utilizing nonlinear geometric deformation [24], applying random perturbations to colours [25], etc. Data augmentation has been widely studied in audio recognition, such as speech recognition, acoustic scene classification, environmental sound classification, and music information retrieval. Vocal Tract Length Perturbation increases the number of samples by applying perturbations along the frequency axis to improve speech recognition performance [26]. The stochastic feature mapping method implements data augmentation in the feature space learned by deep neural networks to boost speech recognition performance [27]. The multiple-width frequency-delta method is used for the acoustic scene classification task [28]. Software for music recognition tasks was published, which implements several data augmentation methods [29].

⇑ Corresponding author. E-mail address: xycwshr@126.com (X. Yuanchao).

https://doi.org/10.1016/j.apacoust.2023.109468
0003-682X/© 2023 Elsevier Ltd. All rights reserved.

Time Stretching (TS) and Pitch Shifting (PS) were applied to environmental sound classification [30,31]. The effects of four data augmentation methods on environmental sound classification, including TS, PS, dynamic range compression, and adding background noise, were discussed [32]. With the development of deep learning, some data augmentation methods based on generative adversarial networks (GAN) have gained attention in recent years. The robustness of CNN for speech recognition could be improved by generating Mel spectra for data augmentation with GAN and conditional GAN [33]. Based on GAN, audio could be generated directly to improve the speech emotion recognition performance of convolutional recurrent neural networks (CRNN) [34]. Underwater acoustic channel modelling and transfer learning were used to augment underwater acoustic data [35].

Data augmentation methods for ship-radiated noise classification have been studied. A data augmentation method that generates spectrograms based on conditional GAN improves underwater acoustic target recognition performance [36]. The PS and TS methods and the temporal and frequency masking of the Mel spectrum were applied to generate data to improve the underwater target recognition performance of CRNN [37]. PS and TS methods transform audio data directly in the temporal domain, which is convenient and widely used; therefore, this paper focuses on the effect of these two data augmentation methods on ship-radiated noise. The literature [23], while proposing the DeepShip dataset, investigates a variety of time-frequency features, including the Mel spectrogram and constant-Q transform (CQT) features, and several classifiers, including CNN, SVM, and random forest (RF) classifiers. Inspired by this, three time-frequency features and three commonly used classifiers are selected, and nine classification algorithms are constructed to investigate the effect of data augmentation methods on ship-radiated noise classification performance. Data augmentation is carried out for the Mel spectrogram and CRNN in the literature [37]. Unlike them, this work focuses on the effectiveness of the TS and PS methods that perform transformations in the temporal domain, which could be applied to general features and classifiers. Preliminary experiments show that the PS method does not improve classification performance. Based on the characteristics of ship-radiated noise audio, we propose an improved PS method (IPS), which enriches the sample diversity and significantly improves classification performance. Moreover, inspired by the literature [32], the effects of different data augmentation methods on the classification performance of various types of ships are analyzed. The experimental results show that an augmentation method considering ship categories and states would further improve classification performance.

2. Method

2.1. Classification algorithms

Three time-frequency features, including the short-time Fourier transform (STFT), Mel spectrogram (MEL), and CQT, are extracted to examine the effectiveness of the data augmentation technique, and three classification models, including CNN, RF, and SVM, are implemented, yielding a total of nine classification algorithms. STFT and CQT use linear and logarithmic frequency coordinates, respectively, and MEL is linear up to 1000 Hz and logarithmic above 1000 Hz. These three features amplify or compress the information in different frequency bands. Among the three classifiers, CNN is the most representative deep learning model and has shown strong performance in recent studies, SVM is the most widely used traditional classifier, and RF represents ensemble learning.

The DeepShip and ShipsEar datasets are employed. The flow of the classification algorithm combined with data augmentation is given in Fig. 1. The raw noise recordings and the augmented data undergo feature extraction, slicing, and normalization and are fed into the classification model.

Fig. 1. The flow of the classification algorithm combined with data augmentation. A part of the audio recordings is used for testing, and the rest for training. The audio recordings obtained from data augmentation transformations share category labels with the original recordings during training. Three features and three classification models are combined to produce nine classification algorithms, and their respective performance scores are obtained under different data augmentation criteria.

(1) Feature extraction. The original noise recordings and the audio signals transformed by data augmentation are first downsampled to 8192 Hz, and then the time-frequency features are extracted as follows.

STFT: The window size of the fast Fourier transform (FFT) is 8192. The hop length, the number of points between adjacent STFT columns, is 4096. After obtaining the time-frequency spectrogram, the first 2048 frequency bins are picked up, and every four adjacent bins are averaged to obtain 512 bins. The frequency range is 0 to 2048 Hz with a frequency resolution of 4 Hz.

MEL: The window size of the FFT is 8192. The hop length is 4096. The number of Mel filters is 512. The frequency range is 0 to 2048 Hz. The frequency axis is linear up to 1 kHz and logarithmic above 1 kHz, containing 256 frequency bins respectively (using the default parameters of the "librosa" module in Python).

CQT: The hop length is 4096. The frequency range is 8 to 2048 Hz with 512 bins. The number of bins per octave is 64.

(2) Slicing and normalization. The hop length used for each time-frequency feature is 4096, corresponding to a duration of 0.5 s. Every eight frames are sliced into one segment, and 512 frequencies are taken for each frame to obtain a sample of size 8 × 512. The feature input of the classifier is obtained by taking the logarithm of the spectrum value for each sample and then scaling it linearly to the interval from 0 to 1. Because the ship-radiated noise characteristics are primarily low-frequency, the highest frequency of the time-frequency features is set to 2048 Hz.

(3) Classification models. The feature sample size is 8 × 512, a dimension too high for traditional classification models. Therefore, for the traditional classification models SVM and RF, features are reduced to 128 dimensions employing principal component analysis (PCA) before inputting the classifiers. The kernel function of the SVM is the radial basis function (RBF). The number of decision trees of RF is 100.
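Steps (1) and (2) can be sketched in plain NumPy. This is a minimal sketch, not the authors' code: the analysis window is an assumption (the paper does not specify one), and the log features are scaled per recording here rather than per sample.

```python
import numpy as np

def stft_feature(x, n_fft=8192, hop=4096):
    # STFT per step (1): 8192-point FFT, hop 4096, keep the first
    # 2048 bins (0-2048 Hz), average every 4 adjacent bins -> 512 bins.
    window = np.hanning(n_fft)  # window choice is an assumption
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))[:, :2048]
    return spec.reshape(n_frames, 512, 4).mean(axis=2)

fs = 8192
t = np.arange(fs * 8) / fs
x = np.sin(2 * np.pi * 100 * t)        # 100 Hz test tone, 8 s
F = stft_feature(x)                    # shape (15 frames, 512 bins)
print(np.argmax(F.mean(axis=0)) * 4)   # peak at the 100 Hz bin -> 100

# Step (2): log-compress, scale to [0, 1], slice into 8-frame samples
# of size 8 x 512 (scaling is per recording here for brevity).
S = np.log(F + 1e-9)
S = (S - S.min()) / (S.max() - S.min())
samples = [S[i:i + 8] for i in range(0, S.shape[0] - 7, 8)]
print(samples[0].shape)                # (8, 512)
```

With an 8192 Hz sampling rate, the raw 1 Hz FFT bins average down to the 4 Hz resolution stated in the text.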

The input size of the CNN is the original 8 × 512. The CNN contains seven convolutional layers followed by three fully connected layers. The number of filters in each convolutional layer is 16, 32, 64, 64, 64, 64, 64, and 128, respectively. The filter size is three. Batch normalization is employed in each convolutional layer. The convolutional block outputs 1024-dimensional features to the fully connected layers, whose numbers of neurons are 256, 64, and 4, respectively. Dropout (p = 0.2) is utilized in the fully connected layers. The activation function is ReLU. The loss function is cross-entropy. The training optimizer is Adam. The learning rate is 0.001 for DeepShip and 0.0005 for ShipsEar. A class weight, specifically the reciprocal of the number of audio recordings of each ship class, is assigned to each class in the loss function to handle the class imbalance of the ShipsEar dataset.

2.2. Data augmentation

Since the duration of the original audio recordings varies from 6 s to 25 min, the ship's state remains essentially constant within each recording. Therefore, the samples sliced from one recording are not sufficiently diverse for the ship classification task. The line spectrum in ship-radiated noise is a key feature for target classification. The line spectrum frequency varies with the vessel's working state and its relative speed to the hydrophone. The harmonic characteristic line spectra from the same noise source vary jointly with the fundamental frequency. Therefore, processing audio with PS could theoretically simulate harmonic fundamental frequency variations, thus increasing the diversity of samples.

For a long audio recording, PS processing applies an equal pitch deviation at every moment, so the diversity of the generated samples remains insufficient. We propose the IPS method: let the pitch of the audio change over time to traverse a wide range of fundamental frequencies. This method can be implemented efficiently by changing the audio sampling rate over time. Let the actual sampling rate f_s(t) of the audio vary with time t, and then use a uniform sampling rate f_p to pre-process the audio and extract time-frequency features:

f_s(t) = f_p · A^cos(2πt/T + φ)    (1)

where A, T, and φ are the maximum shifting amplitude, the variation period, and the initial phase, respectively. Since A ≥ 1, the shifting factor A^cos(2πt/T + φ) lies in [1/A, A]. The variable sampling rate is realized by interpolation. Let the audio duration be T_l; then the number of samples is N_s = f_p · T_l. Let t = T_l · n/(N_s − 1), where n = 0, 1, …, N_s − 1 denotes the serial number of the time-domain sampling points. Then

f_s(n) = f_p · A^cos(2πnT_l/(T(N_s − 1)) + φ)    (2)

is the sampling frequency at point n, and the sampling interval of the signal is T_s(n) = 1/f_s(n). Thus, the interpolation moment of the signal is

t′_s(n) = 0 for n = 0;  t′_s(n) = Σ_{m=0}^{n−1} T_s(m) for n = 1, 2, …, N_s − 1    (3)

The last interpolated moment t′_s(N_s − 1) is not necessarily equal to T_l, the duration of the original signal. Normalizing the interpolation moments so that the last one equals the original signal duration gives

t_s(n) = t′_s(n) · T_l / t′_s(N_s − 1)    (4)

Due to the time stretching of this normalization step, the actual sampling rate of the signal is f(t) = β · f_s(t) = f_p · β · A^cos(2πt/T + φ), where β = t′_s(N_s − 1)/T_l. Since the transformation aims to enrich the sample pitch diversity, the effect of β can be ignored when A is small. After obtaining the interpolation moments t_s(n), the original audio is interpolated and resampled to obtain a variable-sampling-rate time series. This is a downsampling process, since the original audio sampling rates are 32000 Hz in DeepShip and 52734 Hz in ShipsEar, while the sampling rate during pre-processing is 8192 Hz. The original signal must be low-pass filtered before interpolation, with a filter cutoff frequency of β · f_p/(2A). Since the downsampled signal will be used to generate target classification samples for data augmentation, signal fidelity is not the primary consideration; slight signal distortion may even enhance the robustness of the classification model. Hence, cubic spline interpolation is used to achieve fast downsampling. Different initial phases are chosen to generate multiple sets of samples that constitute the augmentation dataset, further increasing the samples' diversity.

Presently, TS and PS methods are mostly used for audio data augmentation. TS changes the duration of the audio while keeping the pitch constant. Conversely, PS changes the pitch of the audio while keeping the duration constant. The IPS method proposed in this paper considers the characteristics of ship-radiated noise audio and makes the audio pitch change with time while keeping the duration constant. The three methods are used for data augmentation with the following parameter settings.

TS: Accelerate or decelerate the audio while keeping the pitch constant. Four scaling factors are used for each recording: {0.89, 0.94, 1.06, 1.12}.

PS: Increase or decrease the pitch of the audio while keeping the duration constant. The pitch of each recording is shifted by four factors: {0.89, 0.94, 1.06, 1.12}, i.e. {−2, −1, 1, 2} in semitones.

IPS: Increase or decrease the pitch of the audio while keeping the duration constant. The pitch of each recording varies with time. The variation period is T = max(T_l, 400 s) for DeepShip and T = max(T_l, 100 s) for ShipsEar. The variation amplitude is A = 1.12, and the initial phase φ is set to four values: {0, π/2, π, 3π/2}.

Fig. 2 shows the spectrograms of ship #1 of the Cargo class, whose audio recording lasts about 7.6 min. The CQT spectrogram in Fig. 2(a) shows that the radiated noise of this ship contains harmonic line spectra below 2048 Hz whose frequencies remain stable over a long duration. Fig. 2(b)-(d) show the CQT, STFT, and MEL spectrograms after IPS transformation, where φ is π/2, π, and 3π/2, respectively. Fig. 2(a) shows that the CQT, with its logarithmic frequency axis, magnifies the information on the low-frequency side and compresses the information on the high-frequency side. Fig. 2(b) shows that the interval between harmonic lines in the IPS-transformed CQT spectrogram remains unchanged while the line spectrum frequency varies over time. In the STFT spectrogram with a linear frequency axis in Fig. 2(c), the intervals between harmonic lines change as the fundamental frequency changes. The MEL spectrogram in Fig. 2(d) could be regarded as a combination of CQT and STFT.

2.3. Evaluation

The augmentation methods are evaluated on the DeepShip [23] and ShipsEar [22] datasets. DeepShip consists of 609 different recordings of 4 vessel classes with a total duration of 47 h and 4 min. These classes are Cargo, Passenger, Tug, and Tanker. ShipsEar consists of 90 recordings of 11 vessel types with a total duration of 1 h and 43 min. These 11 vessel types are merged into four practical classes based on vessel size and one background noise class (see sec. 3 of ref. [22] for more details).

Fig. 2. The original and IPS-transformed spectrograms of ship #1 of the Cargo class in the DeepShip. (a) CQT spectrogram of the original audio. (b) CQT spectrogram of the audio after IPS transformation with φ = π/2. (c) STFT spectrogram of the audio after IPS transformation with φ = π. (d) MEL spectrogram of the audio after IPS transformation with φ = 3π/2. The audio recording lasts about 7.6 min and has 915 frames. Every eight adjacent frames are used as one input sample for the classifier, with a sample size of 8 × 512. The line spectral frequencies of the original audio remain almost unchanged across sample slices. After the IPS transformation, the frequency variation of the line spectrum across the samples is richer, and the original harmonic relationship is maintained.
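The IPS transformation of Eqs. (1)-(4) can be sketched as follows. This is a simplified sketch, not the authors' implementation: linear interpolation stands in for the paper's cubic spline, and the anti-aliasing low-pass filter is omitted. The first line also checks that the PS factors {0.89, 0.94, 1.06, 1.12} correspond to {−2, −1, 1, 2} semitones.

```python
import numpy as np

# PS factors are 2^(k/12) for k = -2, -1, 1, 2 semitones:
print(np.round(2.0 ** (np.array([-2, -1, 1, 2]) / 12), 2))  # [0.89 0.94 1.06 1.12]

def ips_resample(x, fs_p, A=1.12, T=400.0, phi=0.0):
    # Resample at a sinusoidally varying rate so the pitch drifts
    # over time while the total duration is preserved.
    Ns = len(x)
    Tl = Ns / fs_p                                     # audio duration T_l
    t = np.arange(Ns) * Tl / (Ns - 1)
    fs_n = fs_p * A ** np.cos(2 * np.pi * t / T + phi)  # Eq. (2)
    Ts = 1.0 / fs_n                                    # sampling intervals
    tp = np.concatenate(([0.0], np.cumsum(Ts[:-1])))   # Eq. (3): t'_s(n)
    ts = tp * Tl / tp[-1]                              # Eq. (4): normalize to T_l
    return np.interp(ts, t, x)                         # resample at moments t_s(n)

fs = 8192
tt = np.arange(fs * 10) / fs
x = np.sin(2 * np.pi * 200 * tt)                       # 200 Hz tone, 10 s
y = ips_resample(x, fs, A=1.12, T=100.0, phi=np.pi / 2)
print(len(y) == len(x))                                # True: duration kept
```

Running the same recording through several initial phases φ, as in the text, yields multiple augmented copies whose pitch trajectories differ.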

The r × k stratified cross-validation (r × k CV) is adopted. The dataset is first divided into c sets S_1, …, S_c according to ship categories. Each category set S_i is further randomly divided into k sets S_i1, …, S_ik according to recordings. Then we get k sets D_j = ∪_{i=1}^{c} S_ij, j = 1, 2, …, k. Perform k-fold CV: each set takes a turn as the test set, with the rest as the training set. Randomly and independently perform r repetitions of the k-fold CV to get r × k score estimations. The 2 × 5 CV and 5 × 10 CV are used in the experiments on DeepShip and ShipsEar, respectively. As shown in Fig. 1, the augmented and original data are jointly used to train the classification model, and only the original data are used for testing. The accuracy and F1 score are adopted as performance metrics.
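The r × k stratified split can be sketched with the standard library alone. The `(recording_id, ship_class)` input format is an assumption for illustration; the split is by whole recordings, so slices from one recording never appear in both training and test sets.

```python
import random
from collections import defaultdict

def rk_stratified_cv(recordings, k, r, seed=0):
    # r x k stratified CV over whole recordings: group by ship class,
    # split each class into k folds, and merge the j-th folds of all
    # classes into test set D_j; repeat r times independently.
    rng = random.Random(seed)
    splits = []
    for _ in range(r):
        folds = [[] for _ in range(k)]
        by_class = defaultdict(list)
        for rec, cls in recordings:
            by_class[cls].append(rec)
        for class_recs in by_class.values():
            rng.shuffle(class_recs)
            for j, rec in enumerate(class_recs):
                folds[j % k].append(rec)   # stratified assignment
        splits.extend(folds)               # each fold serves as a test set once
    return splits

recs = [(f"rec{i}", cls)
        for i, cls in enumerate(["Cargo", "Passenger", "Tug", "Tanker"] * 5)]
splits = rk_stratified_cv(recs, k=5, r=2)
print(len(splits))                         # 10 test folds -> 2 x 5 estimates
```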
The SVM and RF are fitted on the training set and then evaluated on the test set. Training of the CNN is stopped when its accuracy on the training set drops while the accuracy of the previous epoch exceeds 0.98.

The performance scores of the nine classification algorithms are reported in the results, and the differences in algorithm performance before and after data augmentation are compared via paired t-tests.

Fig. 3. Comparison of algorithm accuracy before and after data augmentation. The scatters in (a), (b), (c), and (d) are the average accuracies obtained from the 2 × 5 CV on DeepShip. The scatters in (e), (f), (g), and (h) are the average accuracies obtained from the 5 × 10 CV on ShipsEar. A scatter above the diagonal indicates an increase in algorithm accuracy after data augmentation, while one below indicates a decrease. The shape and colour of each scatter indicate the result of the paired t-test (p-value less than 0.05). Blue dots indicate a significant increase in the accuracy of the algorithm after data augmentation, red triangles indicate a significant decrease, and green squares indicate no significant difference before and after augmentation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3. Results and discussion

3.1. DeepShip

The accuracy of each classification algorithm with different data augmentation methods on the DeepShip is shown in Table 1. The "none" in the table indicates that no data augmentation is used, and IPS + TS indicates that both IPS and TS are used.

Regarding the average accuracy over the nine classification algorithms, instead of improving the classification performance, PS reduces the average accuracy by 0.58 compared to 69.30 without data augmentation (all "%" signs are omitted below). In contrast,

the IPS method proposed in this paper improves the accuracy by 1.18, and TS improves the average classification accuracy by 0.49. The average accuracy reaches an optimal 70.98 when IPS and TS are used together, an improvement of 1.67 compared to no data augmentation.

Regarding classification features (the accuracy of the three models is averaged for each feature), the PS method reduces the accuracy for each feature. IPS + TS is the best, and IPS is the second. The MEL features are always optimal for all augmentation methods, indicating that the MEL features have the best discriminability. The CQT feature has the most significant performance improvement after data augmentation among the three features, but its original accuracy is the lowest. It could be said that data augmentation compensates for the inferiority of CQT compared to the other two features.

Regarding classification models (the accuracy of the three features is averaged for each model), the PS method reduces the accuracy, except for slightly improving the performance of the CNN. IPS + TS is optimal for all three classification models. IPS outperforms TS for CNN and SVM but not for RF. The performance of CNN is always the best for all kinds of data augmentation methods, and it increases the most with IPS + TS.

Regarding each classification algorithm, all the data augmentation methods improve the algorithm classification performance except for the PS method. The CQT-CNN method has the largest increase. Its accuracy increases from the original 68.33 to the highest 74.08 with IPS + TS augmentation, which exceeds all other algorithms.

Fig. 3 compares the accuracy of each classification algorithm before and after the four data augmentation methods, where the scatters with different shapes or colours show the results of the paired t-test (p-value < 0.05). Fig. 3(a) shows that the accuracy of MEL-RF and STFT-RF is significantly reduced after the PS data augmentation; there is no significant difference in the accuracy of the other algorithms. Fig. 3(b) shows that after the TS augmentation, the accuracy of STFT-CNN is significantly reduced, five algorithms have significantly higher accuracy, and the other three show no significant difference. Fig. 3(c) shows that after IPS data augmentation, the accuracy of STFT-RF is significantly reduced, three algorithms have significantly higher accuracy, and the other five show no significant difference. Although there are more "wins" in Fig. 3(b) than in Fig. 3(c), the paired t-test between TS and IPS (not given in the figure) indicates that IPS gets four "wins" and one "loss". The improvement of CQT by IPS is significantly greater than that by TS, and IPS also improves MEL-CNN significantly more than TS does. Moreover, regarding the average over the nine algorithms in Table 1, IPS is 0.69 higher than TS. Fig. 3(d) shows that the accuracy of all algorithms is improved after the IPS + TS data augmentation, with seven of them improving significantly.
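The "wins" and "losses" above come from paired t-tests over matched CV fold scores. A minimal sketch with hypothetical fold scores (not the paper's data); the 2.262 threshold is the standard two-tailed critical value for 9 degrees of freedom at p = 0.05:

```python
import numpy as np

def paired_t(a, b):
    # Paired t statistic over matched CV fold scores.
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Hypothetical 2 x 5 CV accuracies (10 folds) without / with augmentation.
before = np.array([68.1, 69.0, 68.5, 67.9, 69.3, 68.8, 68.2, 69.1, 68.6, 68.9])
after = before + np.array([1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, 1.2])
t = paired_t(after, before)
# Two-tailed critical value for df = 9 at p = 0.05 is about 2.262.
print(t > 2.262)   # True -> the improvement counts as a "win"
```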

Table 1
Accuracy (%) of each algorithm with different data augmentations on the DeepShip dataset.

Feature Model None TS PS IPS IPS + TS


STFT CNN 72.00 70.47 71.65 72.30 73.42
RF 67.94 68.81 66.05 67.06 68.42
SVM 71.09 71.45 70.50 71.95 72.11
MEL CNN 71.93 71.17 72.38 72.80 72.69
RF 68.07 68.30 66.56 67.59 68.78
SVM 71.39 71.88 70.80 72.38 72.63
CQT CNN 68.33 71.91 68.99 73.34 74.08
RF 64.10 64.33 63.78 65.81 65.52
SVM 68.86 69.76 67.82 71.08 71.13
Avg. over all algorithms 69.30 69.79 68.73 70.48 70.98
STFT Avg. over models 70.34 70.24 69.40 70.44 71.32
MEL 70.46 70.45 69.91 70.92 71.37
CQT 67.10 68.67 66.86 70.08 70.24
Avg. over features CNN 70.75 71.18 71.01 72.81 73.40
RF 66.70 67.15 65.46 66.82 67.57
SVM 70.45 71.03 69.71 71.80 71.96

Table 2
F1 score (%) of each algorithm with different data augmentations on the ShipsEar dataset.

Feature Model None TS PS IPS IPS + TS


STFT CNN 55.38 58.69 58.07 58.63 57.75
RF 48.57 52.35 51.49 50.79 53.29
SVM 55.31 56.83 56.99 58.47 57.86
MEL CNN 59.44 59.07 59.68 60.58 60.06
RF 45.19 49.43 52.17 51.06 51.56
SVM 58.18 60.06 57.17 59.48 60.23
CQT CNN 57.92 58.6 59.18 60.26 61.45
RF 44.38 48.95 50.38 48.71 50.85
SVM 56.74 59.5 58.28 58.55 59.99
Avg. over all algorithms 53.46 55.94 55.93 56.28 57.00
STFT Avg. over models 53.09 55.95 55.52 55.96 56.30
MEL 54.27 56.19 56.34 57.04 57.28
CQT 53.01 55.68 55.95 55.84 57.43
Avg. over features CNN 57.58 58.79 58.98 59.82 59.75
RF 46.05 50.24 51.34 50.19 51.90
SVM 56.74 58.79 57.48 58.83 59.36


the IPS + TS data augmentation, with seven of them improving by 344, which further reveals that data augmentation improves
significantly. the distinction between these two types of ships by CQT-CNN.
On the other hand, the number of Tug samples misclassified as Pas-
3.2. ShipsEar senger ship samples decreased by 856. However, the number of
Passenger ship samples misclassified as Tug samples increased
The F1 score of each classification algorithm with different data by 312, which suggests that the improved performance for Tug
augmentation methods on the ShipsEar is shown in Table 2. Note classification shown in Fig. 4 comes at the sacrifice of classification
that the ShipsEar dataset is small and unbalanced, so the 510 for Passenger. The displacement of Cargo and Tanker is generally
CV and F1 scores are employed in experiments. The accuracy is also large, the sailing speed is slow, the sailing state is stable, and the
calculated. A comparison of the accuracy of each classification al- line spectrum of their radiation is stronger and frequency stable.
gorithm before and after data augmentations is shown in Fig. 3. Therefore, IPS and TS transformations are reasonable as the data
The results on ShipsEar are similar to that on DeepShip. The pro- augmentation method for Cargo and Tanker but not for Tug and
posed IPS data augmentation method improves the PS on the Passenger. Further classification performance improvements could
DeepShip and ShipsEar datasets. be achieved by utilizing data augmentation that considers the ship
Regarding the average score over nine algorithms, the proposed categories and the state of sailing in recording accordingly. This
IPS method improves the F1 by 0.35 compared with the PS. The av- idea will be explored further in future work.
erage F1 reaches an optimal 57.00 when IPS and TS are used to-
gether, an improvement of 3.54 compared to no data 3.4. Discussion
augmentation. Besides, the average accuracy increases by 2.25
when IPS and TS are used. The improvement of IPS on ShipsEar is not as significant as on
Regarding classification features, the IPS outperforms the PS for the DeepShip. The reason is that the average duration of recordings
STFT and MEL. The MEL features are always optimal for all aug-
mentation methods, which aligns with the results on DeepShip.
Regarding classification models, the IPS outperforms PS for CNN
and SVM but not for RF. The performance of CNN is always the best
for all kinds of data augmentation methods, and it increases the
most with the IPS.
Regarding each classification algorithm, the F1 of the CQT-CNN
method increases from the original 57.92 to the highest 60.26 with
IPS + TS augmentation, which exceeds all other algorithms.
Fig. 3(e) shows that the accuracy of MEL-SVM is significantly re-
duced after the PS data augmentation. Fig. 3(g) shows that after IPS
data augmentation, five algorithms have significantly higher accu-
racy, and the other four have no significant difference in accuracy.
Fig. 3(d) shows that the accuracy of all algorithms is improved after
the IPS + TS data augmentation, with six improving significantly.

3.3. Analysis of vessel types

The improvement effect of data augmentation on CQT-CNN is


particularly significant, especially on the DeepShip dataset. (The
following analysis in this subsection is based on DeepShip.) The
IPS augmentation makes a prominent contribution to classification Fig. 4. F1 scores of CQT-CNN algorithm for each ship on DeepShip. The box plot
performance improvement. The IPS-augmented CQT-CNN algo- shows the median, quartiles, and extreme values of the ten cross-validation
rithm achieves an accuracy of 73.34, slightly worse than the estimates; the white dots are the means.
IPS + TS-augmented CQT-CNN and STFT-CNN algorithms. This is
because the harmonic line spectral features in the CQT with loga-
rithmic frequencies fit the shift equivariance of the convolutional
layer of CNN. This is also shown in Fig. 3(a), where only the accu-
racy of MEL-CNN and CQT-CNN are slightly improved after PS aug-
mentation. The F1 score (suitable for indicating the classification
performance for a particular class) of the CQT-CNN algorithm for
the classification of various types of ships is further analyzed, as
shown in Fig. 4.
Regarding average scores, all data augmentation methods are beneficial for classifying all types of ships, and their gain effects are, in descending order, IPS + TS, IPS, TS, and PS. Data augmentation significantly improves classification performance for both Cargo and Tanker, while it yields a moderate improvement for Tug and a poor one for Passenger. Fig. 5 shows the classification confusion matrix of CQT-CNN and the difference before and after IPS + TS augmentation. Fig. 5(a) shows that Cargo is easily confused with Tanker, while Tug is easily confused with Passenger. After IPS + TS data augmentation, as shown in Fig. 5(b), the number of Tanker samples misclassified as Cargo decreased by 1682, and the number of Cargo samples misclassified as Tanker decreased

Fig. 5. Classification confusion matrix of CQT-CNN (sum of the ten cross-validation results) on DeepShip: (a) is the result without data augmentation; (b) is the difference before and after IPS + TS augmentation. Red on the diagonal indicates the increase in the number of correctly classified samples in (b); red and blue off the diagonal indicate the increase and decrease, respectively, in the number of misclassified samples. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


is 277.8 s in DeepShip, which is longer than that in ShipsEar (68.7 s, specifically). On the one hand, the proposed IPS increases the pitch diversity of the samples in a long-duration recording. On the other hand, the pitch of the samples varies continuously when IPS is adopted, so the samples are distributed more evenly in feature space than with PS. When the recording duration is short, the sample diversity introduced by IPS is similar to that of PS. Nonetheless, IPS is still recommended because it provides a more stable gain than PS, as shown in Fig. 3(e and g), especially when the classifier is a CNN.
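Conceptually, the contrast between PS and IPS can be sketched with plain resampling (an illustration of the idea only, not the paper's implementation; a complete pitch shifter would also time-stretch the resampled signal back to the original duration, which is omitted here):

```python
import numpy as np

def resample_ratio(x, ratio):
    # Read x at a (possibly time-varying) rate via linear interpolation.
    # A constant ratio mimics PS; a slowly drifting ratio mimics IPS.
    pos = np.cumsum(ratio)                 # warped read positions
    pos = pos[pos < len(x) - 1]
    return np.interp(pos, np.arange(len(x)), x)

sr = 8000
x = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)   # 1 s, 100 Hz test tone

# PS: one fixed ratio for the whole clip (+2 semitones everywhere).
ps = resample_ratio(x, np.full(len(x), 2.0 ** (2 / 12)))

# IPS: the ratio drifts continuously (0 -> +2 semitones), so a single
# long recording yields samples spanning a whole band of pitches.
ips = resample_ratio(x, 2.0 ** (np.linspace(0, 2, len(x)) / 12))
```

Each copy produced by the time-varying ratio covers a range of pitches rather than one shifted pitch, which is the diversity argument made above for long recordings.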
The results indicate that TS improves the classification performance. The TS is implemented based on a phase vocoder. This simplified implementation is intended primarily for data augmentation purposes. It does not attempt to handle transients and is likely to produce many audible artefacts. These artefacts are suspected to increase the robustness of the machine learning algorithms. In other words, TS augments the dataset by adding noise to the ship-radiated signals, a popular data augmentation method in speech recognition tasks. IPS and TS improve sample diversity in different ways, which is the theoretical basis for using them simultaneously. Future research will further study how TS promotes the classification algorithms and whether it can be improved in the same way as PS.
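A bare-bones phase vocoder of the kind described above can be sketched as follows (an illustrative reimplementation, not the paper's code; `n_fft` and `hop` are assumed values):

```python
import numpy as np

def time_stretch(x, rate, n_fft=1024, hop=256):
    """Stretch x by 1/rate with a minimal phase vocoder.
    rate < 1 slows the signal down; rate > 1 speeds it up.
    No transient handling, so impulsive content will smear."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    stft = np.array([np.fft.rfft(win * x[i * hop:i * hop + n_fft])
                     for i in range(n_frames)])

    # Expected phase advance per hop for each frequency bin.
    omega = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
    phase = np.angle(stft[0])
    out_frames = []
    for t in np.arange(0, n_frames - 1, rate):
        i = int(t)
        frac = t - i
        mag = (1 - frac) * np.abs(stft[i]) + frac * np.abs(stft[i + 1])
        out_frames.append(mag * np.exp(1j * phase))
        # Phase increment = expected advance + wrapped deviation.
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase = phase + omega + dphi

    # Overlap-add resynthesis with the same hop.
    y = np.zeros(len(out_frames) * hop + n_fft)
    for k, spec in enumerate(out_frames):
        y[k * hop:k * hop + n_fft] += win * np.fft.irfft(spec, n_fft)
    return y

sr = 8000
tone = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
slow = time_stretch(tone, 0.5)   # roughly twice as long, same pitch
```

With `rate = 0.5` the output is roughly twice as long at the original pitch; steady tonal machinery noise survives well, while transients are smeared into the audible artefacts discussed above.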
4. Conclusion

This paper proposes an improved PS data augmentation method for ship-radiated noise classification. The IPS promotes the classification algorithms' performance on the DeepShip and ShipsEar datasets. The average performance score improves significantly when IPS and TS are used together. Among the nine classification algorithms, CQT-CNN achieves the best classification performance with IPS + TS augmentation.

Data augmentation improves the classification of the various categories of ships differently, suggesting that algorithm performance would be further improved by applying class-conditional or state-conditional data augmentation.

IPS and TS improve sample diversity in different ways. Future research will further study how TS promotes the classification algorithms and whether it can be improved in the same way as PS.
CRediT authorship contribution statement

Xu Yuanchao: Conceptualization, Methodology, Software, Validation, Formal analysis, Visualization, Writing – original draft. Cai Zhiming: Supervision, Writing – review & editing. Kong Xiaopeng: Data curation, Writing – review & editing.
Data availability

The authors do not have permission to share data.
Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
