Anti-Forensics of Environmental-Signature-Based Audio Splicing Detection and Its Countermeasure Via Rich-Feature Classification

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIFS.2016.2543205, IEEE
Transactions on Information Forensics and Security
Hong Zhao, Yifan Chen, Member, IEEE, Rui Wang, Member, IEEE, and Hafiz Malik, Senior Member, IEEE

Hong Zhao, Yifan Chen and Rui Wang are with the Department of Electrical and Electronic Engineering, South University of Science and Technology of P.R. China. email: zhao.h@sustc.edu.cn, chen.yf@sustc.edu.cn, wang.r@sustc.edu.cn
Hafiz Malik is with the Department of Electrical and Computer Engineering, The University of Michigan – Dearborn, MI 48128; ph: +1-313-593-5677; email: hafiz@umich.edu
Copyright (c) 2013 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org
1556-6013 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—Numerous methods for detecting audio splicing have been proposed. Environmental-signature-based methods are considered to be the most effective forgery detection methods. The performance of existing audio forensic analysis methods is generally measured in the absence of any anti-forensic attack. Effectiveness of these methods in the presence of anti-forensic attacks is therefore unknown. In this paper, we propose an effective anti-forensic attack for environmental-signature-based splicing detection method and countermeasures to detect the presence of the anti-forensic attack. For anti-forensic attack, dereverberation-based processing is proposed. Three dereverberation methods are considered to tamper with the acoustic environment signature. Experimental results indicate that the proposed dereverberation-based anti-forensic attack significantly degrades the performance of the selected splicing detection method. The proposed countermeasures exploit artifacts introduced by the anti-forensic processing. To detect the presence of potential anti-forensic processing, a machine learning-based framework is proposed. Specifically, the proposed anti-forensic detection method uses a rich-feature model consisting of Fourier coefficients, spectral properties, high-order statistics of “musical noise” residuals, and modulation spectral coefficients to capture traces of dereverberation attacks. The performance of the proposed framework is evaluated on both synthetic data and real-world speech recordings. The experimental results show that the proposed rich-feature model can detect the presence of anti-forensic processing with an average accuracy of 95%.

Index Terms—Audio Splicing Detection, Anti-forensics, Spectral Features, Musical Noise, Modulation Spectrum

I. INTRODUCTION
Digital audio is one of the dominant means for communication and information sharing. With the availability of many digital data manipulation software packages, Internet users are increasingly vulnerable to malicious forgeries. Powerful, inexpensive, and easy-to-use digital media manipulation tools (such as CoolEditor and Audacity, among others) can be used to achieve a wide range of manipulations, including splicing, deletion and tampering with the recording time signature. Splicing operation refers to cutting a segment from one audio file and inserting it into another; whereas, deletion operation consists of removing a specific part of the recording to conceal the identity of the speaker; and tampering with the recording time signature requires a modification of the electric network frequency (ENF).

Audio splicing is one of the most popular and easiest attacks; in such an attack, target audio is assembled by splicing segments from multiple audio recordings. The splicing technique can be used to manipulate the voice of speakers or crime scene events in evidentiary audio. The increasing use of digital recordings as evidence in every sector of litigation and criminal justice proceedings demonstrates that there is an urgent need for integrity verification of digital audio recordings, which justifies further research in the areas of integrity authentication and tamper detection and localization.

A. Related Works
Over the past decade, various audio forensic methods have been proposed for authenticating the integrity of audio/speech recordings. These methods include post-processing detection, tamper detection, and tamper localization [1]. Existing audio forensic methods [2]–[16] can be classified into the following three main categories:
1) ENF-based methods [2]–[4]: The ENF-based forensic analysis method is one of the most reliable integrity verification methods. This method extracts the ENF signature from the evidentiary recording and compares it with a reference ENF database for verification and tamper detection [5], [6].
2) Post-processing distortion-based methods [7], [8], [10]: These methods exploit the traces of double compression left by manipulations. They generally utilize the statistical properties of modified discrete cosine transform coefficients or Mel-frequency cepstral coefficients (MFCCs) to identify the presence of double compression [9]–[11].
3) Acoustic environment-based methods [12]–[16]: These methods exploit the inconsistency in the acoustic environmental signature extracted from the evidentiary recording for forensic applications such as recording location identification and splice detection.

Recently, Malik et al. [12], [16] proposed the use of reverberation parameters for environment identification and its application to audio forensics. However, the results of this scheme have large discrepancies from reality due to its over-simplified assumption regarding the room impulse response
(RIR) model. Similarly, Moore et al. [17] utilized the early decay time and reverberation time ($T_{60}$) as roomprints for forensic audio applications. High accuracy (∼96%) was achieved only if a clean RIR was provided. However, a clean RIR model is not always available in practice. Moreover, Peters et al. [18] proposed a room classification method with underlying acoustic features (i.e., MFCCs) and a Gaussian mixture model (GMM) classification with average detection errors as high as 39% for musical signals and 15% for speech signals. In [19], the autocorrelation function features from the Gammatone filterbank response were used. Recently, Zhao et al. [13]–[15] extended the scheme in [12] by taking advantage of high-order statistics (referred to as an environmental signature) extracted from the reverberation and background noise for environment identification. In this scheme [12], classification performance has been verified in various experimental settings (e.g., different microphone types and recording locations). Zhao et al. [20], [21] extended environmental-signature-based audio forensics for applications in audio splicing detection and localization. The performance of these methods [20], [21] was evaluated using both synthetic speech and real-world data. In general, these methods yield better performance for splicing detection than existing methods [22], [23].

All of the above forensic methods [12]–[21] assume that tampered audio is generated by simply splicing segments from multiple sources and that the temporal trace of background acoustic distortion can be estimated from the evidentiary recording. Inconsistency in the estimated background acoustic distortion is used for detecting forgeries. The performance of these schemes against anti-forensic attacks has not yet been evaluated. More specifically, “it is unclear whether these schemes would be equally robust if the acoustic environment signature is tampered with.” For example, a determined attacker may devise an effective attack strategy for bypassing current forensic methods. We refer to these techniques as anti-forensic techniques, which aim to conceal the manipulation distortion.

Over the past few years, image forensics has witnessed considerable advancements in anti-forensics and its countermeasures. For instance, Valenzise et al. [24] revealed the traces of JPEG compression and proposed an anti-forensic method for eliminating these traces. Similarly, Wu et al. [25] proposed an anti-forensic method based on median filtering by adding artificial noise. However, the anti-forensic framework and its countermeasures for audio forensics remain relatively less developed. Chuang et al. [26], [27] proposed the first anti-forensic scheme and its countermeasures for ENF-based audio authentication. Chuang et al. proposed various anti-forensic operations for tampering with the underlying ENF signal while preserving the quality of the host signal. The scheme proposed in [26], [27] is specifically designed for ENF-based audio authentication. Therefore, it cannot be applied in other audio forensic methods, such as environmental-signature-based audio splicing detection.

The environmental-signature-based forensic method was first proposed in 2010 and is thus a relatively new method. It enables the authentication performance to be improved using the environmental signature and the potential countermeasures to be investigated under the situation of an anti-forensic attack. Moreover, the environmental-signature-based methods are robust against compression attacks and do not require a reference database for forgery detection. The performance of the environmental-signature-based approach has been extensively evaluated in the absence of any anti-forensic attack. Therefore, its robustness against expert attacks is unknown. Research on anti-forensics and its countermeasures can be used to (i) redefine the performance of the authentication method under anti-forensic attacks, (ii) highlight its limitations under anti-forensic attacks, and (iii) develop countermeasures for the detection of anti-forensic processing and for robust and reliable authentication. Moreover, the research on the environmental-signature-based splicing detection method, in the presence of anti-forensic attacks, complements ENF-based methods and serves as a stepping stone for the development of a reliable and robust framework for audio forensic analysis.

B. Our Contributions
This paper investigates the anti-forensic technique against the environmental-signature-based audio splicing detection method proposed in [20], [21] and proposes a framework for anti-forensic attack detection. Specifically, a rich-feature model is proposed to capture traces of anti-forensic processing. The proposed rich model includes short time Fourier transform (STFT) coefficients, spectral properties, high-order statistics of “musical noise” residuals, and perceptually relevant features (e.g., modulation spectral coefficients). A supervised machine learning method is used to learn the underlying model for anti-forensic processing detection. To address potential overfitting issues associated with the high-dimensional feature space, the effect of both linear and nonlinear dimensionality reduction methods on the performance is also evaluated. The detection performance of the individual feature sets and of the entire feature set is analyzed. The performance of the proposed framework is evaluated on both synthetic reverberant speech and real-world speech recordings. Experimental results suggest that (i) the proposed anti-forensic attack method can be used to elude splicing detection by the audio splicing detection method [20], [21]; (ii) the proposed rich-feature model combined with the supervised learning technique can detect the presence of anti-forensic processing with high accuracy; (iii) the nonlinear dimensionality reduction method outperforms the linear method; (iv) the proposed scheme has similar performance for both the source-matched case, in which the dereverberation methods used in training and testing are perfectly matched, and the source-mismatched case, in which the dereverberation methods used in training and testing are different; and (v) the proposed scheme is robust against moderate noise, median filtering and mean filtering.

The remainder of this paper is organized as follows. Section II summarizes the splicing detection method, and Section III outlines the details of the anti-forensic operations for the environmental-signature-based splicing detector. Countermeasures against anti-forensic attacks are proposed in Section IV. The experimental setup and results based on synthetic data and real-world speech recordings are presented in Section V. Finally, conclusions and future directions are discussed in Section VI.
II. REVIEW OF THE AUDIO SPLICING DETECTION METHOD
In [20], [21], Zhao et al. proposed to use the environmental signature (based on the RIR and background noise) to detect the presence of splicing and localize tampering in the queried audio signal. Experimental results demonstrated that the performance of these methods is superior to that of other related works when the frame duration is longer than 1.5 s. For the sake of completeness, a brief overview of this splicing detection method is provided as follows:
1) Divide the input audio into frames of length $L$ with 50% overlapping.
2) Estimate the environmental signature $\hat{H}_i$ from the $i$th frame.
3) Use the first $M$ frames as a reference and calculate the correlation coefficients $\rho_i$ with the following averaging,
$$\rho_i = \begin{cases} \dfrac{1}{i-1} \sum\limits_{j=1}^{i} NCC(\hat{H}_j, \hat{H}_i), & \text{if } 2 \le i \le M \\[2ex] \dfrac{1}{M} \sum\limits_{j=1}^{M} NCC(\hat{H}_j, \hat{H}_i), & \text{otherwise} \end{cases} \tag{1}$$
where $NCC(a, b)$ represents the normalized correlation coefficient between $a$ and $b$.
4) Detect the suspected frame using the raw detection method, and set the label $p_i$ of the $i$th frame as follows,
$$p_i = 1, \quad \text{if } \rho_i < T \tag{2}$$
where $T$ is the decision threshold.
5) Refine the results using the neighborhood similarity score as follows,
$$q_i = 1, \quad \text{if } \frac{\sum_{k=i-W/2}^{i+W/2} p_k - p_i}{W} > R_s \tag{3}$$
where $W$ and $R_s$ represent the window size (e.g., $W = 3$ or $5$) and similarity score threshold (e.g., $R_s \in [0.7, 0.9]$), respectively. $q_i = 1$ indicates that the $i$th frame is detected as spliced.

III. ANTI-FORENSIC SCHEME
In this section, anti-forensic techniques against the environmental-signature-based audio splicing detection are proposed. The objective of anti-forensic attacks is to conceal the trace of malicious manipulations while preserving the quality of the audio content. An attacker can alter the underlying signature to be used for integrity verification. The anti-forensic attack can be treated as a post-processing step that is likely to leave traces in the resulting audio. Therefore, any artificial degradation of the quality in evidentiary recordings can be used for anti-forensic detection.

We have shown in [20], [21] that the inconsistency in the estimated environmental signatures can be used to identify the traces of splicing. What if an attacker chooses to develop a strategy that removes the underlying environmental signature from the evidentiary recording and injects the desired environmental signature? For example, consider the case in which an anti-forensic attack aims to eliminate splicing inconsistency via the following steps:
1) Remove the environmental signature from the audio/speech signals,
2) Carefully splice these signals together to avoid any discontinuity of the speech content (i.e., the resulting signal should be realistic for listening),
3) Generate a desired environmental signature and add it on the spliced signal.

Shown in Fig. 1 is the flowchart of the proposed anti-forensic procedure. The core of the framework shown in Fig. 1 is that the proposed anti-forensic method is achieved using a dereverberation operation, which is commonly used for speech enhancement applications [28], [29]. Existing dereverberation methods can be classified into two main categories: (i) non-blind dereverberation methods [30], which exploit knowledge of the underlying RIR, and (ii) blind dereverberation methods [31], which do not have any prior knowledge of the underlying RIR. It is reasonable to assume that the attacker does not have prior knowledge of the acoustic environment where the recording is made. For this attack model, a blind dereverberation-based attack is considered here. Any existing blind dereverberation method can be used to remove the underlying environmental signature. It has been observed through extensive investigation that the effectiveness of the proposed anti-forensic attack depends on the effectiveness of the dereverberation operation. The following three blind dereverberation algorithms are considered here to illustrate the effectiveness of the proposed attack model:
• Spectrum classification and inverse filtering (SCIF) [31]: The algorithm employs RASTA-filtered MFCC as a feature set to train a GMM-based classifier. The channel response is estimated using a spectrum classification method and used for dereverberation via inverse filtering.
• Gammatone spectral domain filtering and non-negative matrix factorization (GSF-NMF) [32]: This method models the reverberation as a convolution of a clean speech and a RIR. The non-negativity and sparsity of the speech spectra are then used to estimate the clean spectra under the least-squares error criterion.
• Two-stage spectral subtraction (TSSS) [33]: This is a two-stage process. In the first stage, an inverse filter is estimated to reduce the coloration effect such that the signal-to-reverberant energy ratio is increased. Then, spectral subtraction is employed to minimize the influence of long-term reverberation.

To evaluate the effectiveness of the proposed anti-forensic attack, a forged audio is assembled from two speech recordings randomly selected from TIMIT [34]. The resulting forged recording is subjected to anti-forensic attacks based on SCIF, GSF-NMF, and TSSS. All four recordings (one forged and three anti-forensic processed) are then analyzed using the splicing detection method (described in Section II). Shown in Fig. 2 is the output of the splicing detection method for all four recordings. Here, frames marked with blue circles and red triangles belong to two different environments. If the coefficients of all of the frames are at a similar level, then the signal is authentic; otherwise, it is subjected to splicing manipulation. It can be observed from Fig. 2(a) that
the splicing detection method has a very high probability of identifying and localizing the spliced frames for the forged audio. In the presence of anti-forensic processing, however, the detection performance degrades significantly (see Figs. 2(b)-(d)). It can also be observed from Fig. 2 that SCIF-based filtering is the most effective method among all three methods. This is because the implementation of the splicing detection method described in Section II also uses the SCIF algorithm to extract the environmental signature. In other words, if the attacker has complete knowledge of the forensic analysis technique used by the forensic investigator, the attacker can use an anti-forensic attack to successfully evade the detection process. The effectiveness of the countermeasures, which will be discussed in Section IV, is evaluated for all three anti-forensic attacks, i.e., SCIF, GSF-NMF and TSSS.

Fig. 1: Flowchart of the anti-forensic technique against environmental signature-based audio splicing detection. (Blocks: Speech Samples, Input Parameters, Dereverberation, Synthetic RIR and Noise Generation, Clean Speeches, Extracted Reverberation and Background Noise, Splicing, Add Synthetic RIR to Spliced Speech, Anti-Forensics Speech.)

Fig. 2: Splicing detection performance for a reverberant signal and its anti-forensic signals processed with different dereverberation methods. (Panels: (a) results of splicing detection without anti-forensics; (b) with anti-forensics (SCIF); (c) with anti-forensics (GSF-NMF); (d) with anti-forensics (TSSS). Axes: correlation coefficient ρ̇ versus frame index.)

IV. COUNTERMEASURES OF ANTI-FORENSICS
It can be observed from Fig. 2 that anti-forensic methods can be used to successfully conceal the presence of splicing. Therefore, forensic investigators need to devise countermeasures for detecting traces of anti-forensic processing. Consider the proposed dereverberation-based anti-forensic attack. The primary objective of dereverberation is to improve the robustness of speech recognition or speaker recognition against acoustic distortion rather than improving the speech quality/intelligibility or listening experience. It was shown in [35] that many reported dereverberation methods introduce nonlinear distortion that slightly degrades speech quality. Nonlinear distortion due to dereverberation processing may be perceptible, particularly during silent regions. We propose to exploit the nonlinear distortion in the queried recording to detect the presence of dereverberation processing. More specifically, we propose using traces of dereverberation processing to detect anti-forensic attacks. For this purpose, the perceptual evaluation of speech quality (PESQ) [36] measure can be used to determine the presence of anti-forensic attacks for cases in which the reference signal is available. However, the reference signal is generally not available in practice. Hence, a statistical machine learning-based classification framework based on a rich model is proposed to learn the characteristics of distortions due to anti-forensic processing. The proposed anti-forensic detection framework is shown in Fig. 3.

Fig. 3: Flowchart of the proposed anti-forensic detection framework. (Blocks: Evidential Recording, Training Database, Feature Extraction (STFT Coefficients, Spectral Shape Features, Modulation Spectral Features, Musical Noise Features), Feature Dimensional Reduction, Model Training, Classification (SVM), Authentic? (Y/N).)

A. Rich-feature Model-based Classification
A single feature is rarely sufficient for capturing the various distortions derived from malicious manipulations. Hence, a feature vector of large dimension is considered. For feature extraction, therefore, a rich model consisting of comprehensive features derived from the signal spectrum domain and modulation spectrum domain is proposed. The elaboration of the rich model is given as follows.

1) Spectral Coefficients in the STFT domain: The main motivation for using STFT coefficients is that the STFT domain is the primary choice for many dereverberation methods because the Fourier transform operator converts convolution distortion in the time domain into multiplicative distortion in the frequency domain [32], [33]. Furthermore, some dereverberation
methods [37], [38] estimate reverberation in the Log-STFT domain by applying the log operator to the Fourier coefficients. Therefore, the distortion introduced by dereverberation is directly superimposed to the spectral coefficients or the log-spectral coefficients. Let $x(t)$ be the time domain representation of the audio/speech signal; $X(k, n)$ be the spectral coefficients in the STFT domain, where $k$ ($1 \le k \le K$) is the index of the frequency bins; and $n$ ($1 \le n \le N$) be the index of the frames. The feature vector $f_{LS}$ consists of the averaged log-spectral coefficients over frames and is given as follows.
$$f_{LS}(k) = \frac{1}{N} \sum_{n=1}^{N} \log |X(k, n)| \tag{4}$$

2) Spectral Shape Features: Spectral shape features characterize the spectral properties, including the centroid, spread, slope, and crest factor. These features are extracted from the magnitude of the spectral coefficients. These spectral properties have shown their effectiveness for many applications, including event detection, music retrieval, and sound classification [39]. A brief description of the spectral features used is provided next.
• Spectral Centroid: The center of gravity of the spectral energy.
$$f_{SC}(n) = \frac{\sum_{k=1}^{K/2} k \times |X(k, n)|^2}{\sum_{k=1}^{K/2} |X(k, n)|^2} \tag{5}$$
• Spectral Spread: The concentration of the power spectrum around the spectral centroid.
$$f_{SS}(n) = \sqrt{\frac{\sum_{k=1}^{K/2} (k - f_{SC}(n))^2 \times |X(k, n)|^2}{\sum_{k=1}^{K/2} |X(k, n)|^2}} \tag{6}$$
• Spectral Decrease: The steepness of the decrease of the spectral envelope over the frequency.
$$f_{SD}(n) = \frac{\sum_{k=1}^{K/2} \frac{1}{k} \times (|X(k, n)| - |X(0, n)|)}{\sum_{k=1}^{K/2} |X(k, n)|} \tag{7}$$
• Spectral Slope: The slope of the spectral shape.
$$f_{SSI}(n) = \frac{K \sum_{k=1}^{K/2} k \cdot |X(k, n)| - \sum_{k=1}^{K/2} k \cdot \sum_{k=1}^{K/2} |X(k, n)|}{K \sum_{k=1}^{K/2} k^2 - \left( \sum_{k=1}^{K/2} k \right)^2} \tag{8}$$
• Spectral Crest Factor: The tonalness of the audio signal.
$$f_{SCF}(n) = \frac{\max_{1 \le k \le K/2} |X(k, n)|}{\sum_{k=1}^{K/2} |X(k, n)|} \tag{9}$$
• Spectral Flatness: How noise like a signal is.
$$f_{SF}(n) = \frac{\exp\left( \frac{2}{K} \times \sum_{k=1}^{K/2} \log |X(k, n)| \right)}{\frac{2}{K} \times \sum_{k=1}^{K/2} |X(k, n)|} \tag{10}$$
• Spectral Tonal Power Ratio: The tonalness of a signal.
$$f_{STPR}(n) = \frac{E_T(n)}{\sum_{k=1}^{K/2} |X(k, n)|^2} \tag{11}$$
where $E_T(n)$ is the tonal power.
• Spectral Flux: The amount of change in the spectral shape.
$$f_{SFI}(n) = \frac{\sqrt{\sum_{k=1}^{K/2} (|X(k, n)| - |X(k, n-1)|)^2}}{K/2} \tag{12}$$
• Spectral Rolloff: The bandwidth of the analyzed frame.
$$f_{SR}(n) = i \;\Big|_{\;\sum_{k=1}^{i} |X(k, n)| \,=\, \kappa \times \sum_{k=1}^{K/2} |X(k, n)|} \tag{13}$$
where $\kappa$ is set to 0.85 or 0.95.
• Spectral Skewness: The asymmetry of the distribution of spectral coefficients.
$$f_{SSK}(n) = E\left[ \left( \frac{X(\cdot, n) - \mu}{\sigma} \right)^3 \right] \tag{14}$$
where $\mu$ and $\sigma$ represent the mean and standard deviation of $X(\cdot, n)$, respectively, and $E$ is the expectation operator.
• Spectral Kurtosis: Measures the peakedness of the distribution of spectral coefficients.
$$f_{SK}(n) = \frac{E[(X(\cdot, n) - \mu)^4]}{(E[(X(\cdot, n) - \mu)^2])^2} \tag{15}$$
• Spectral Entropy: The disorderliness of the spectral coefficients.
$$f_{SE}(n) = -\sum_{1 \le k \le K} p_{k,n} \cdot \log_2 p_{k,n}, \qquad p_{k,n} = \frac{X(k, n)}{\sum_{k=1}^{K} X(k, n)} \tag{16}$$

We anticipate that distortions due to dereverberation processing can be captured through spectral features described in (5) ∼ (16). We refer to these features as spectral shape and distribution features, which are denoted by a vector $f_{SSD}$.

3) Modulation Spectral Features: As mentioned above, dereverberation processing may degrade the perceptual quality of the input signal. The extent of degradation depends on how strongly the dereverberation method is applied. To capture traces of dereverberation processing, a perceptually relevant feature set called modulation spectral coefficients [40] is considered here. The details of modulation spectral coefficients extraction are as follows.
(i) Gammatone filterbank analysis:
The speech signal $x(t)$ is filtered by a bank of 24 critical-band gammatone filters to emulate the processing of the cochlea. The resulting output of the $i$th gammatone filter is given as,
$$x_i(t) = x(t) * h_i(t) \tag{17}$$
where $h_i(t)$ is the impulse response of the $i$th filter [41]. The center frequencies of the gammatone filters range from 125 Hz to nearly half of the sample rate. Shown in Fig. 4 are plots of the magnitude gains of 24-channel gammatone filterbank.
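As an illustration of the filtering step in (17), the following is a minimal Python sketch and not the filter design of [41]. It assumes a fourth-order gammatone impulse response with Glasberg-Moore ERB bandwidths, center frequencies equally spaced on the ERB-rate scale, an upper edge of 0.45 times the sample rate as a stand-in for "nearly half of the sample rate", and an impulse response truncated to 25 ms. All function names are hypothetical.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def erb_rate(f_hz):
    """Map frequency in Hz to the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def inv_erb_rate(e):
    """Map an ERB-rate value back to frequency in Hz."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def center_frequencies(n_channels, f_low, f_high):
    """Center frequencies equally spaced on the ERB-rate scale (assumed spacing)."""
    e_low, e_high = erb_rate(f_low), erb_rate(f_high)
    step = (e_high - e_low) / (n_channels - 1)
    return [inv_erb_rate(e_low + i * step) for i in range(n_channels)]

def gammatone_ir(fc_hz, fs_hz, duration_s=0.025, order=4):
    """Sampled h(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t), peak-normalized."""
    b = 1.019 * erb(fc_hz)
    h = []
    for k in range(int(duration_s * fs_hz)):
        t = k / fs_hz
        h.append(t ** (order - 1)
                 * math.exp(-2.0 * math.pi * b * t)
                 * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(v) for v in h)
    return [v / peak for v in h]

def convolve(x, h):
    """Direct linear convolution x * h, truncated to len(x) samples, as in (17)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * x[n - k]
        y.append(acc)
    return y

def gammatone_filterbank(x, fs_hz, n_channels=24, f_low=125.0, f_high=None):
    """Return (center frequencies, list of channel outputs x_i)."""
    if f_high is None:
        f_high = 0.45 * fs_hz  # assumption: the text only says "nearly half"
    fcs = center_frequencies(n_channels, f_low, f_high)
    return fcs, [convolve(x, gammatone_ir(fc, fs_hz)) for fc in fcs]
```

Calling `gammatone_filterbank(x, 16000)` on a list of samples returns 24 center frequencies and 24 band-limited signals of the same length as `x`. The direct convolution is chosen for readability only; a practical implementation would use recursive filters or FFT-based convolution.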
Fig. 4: Response of 24-channel gammatone filterbank. (Axes: magnitude gain (dB) versus frequency (Hz), 0 to 8000 Hz.)

Fig. 5: Block diagram of the modulation spectral-based feature extraction. (Stages: x(t), 24-channel gammatone filterbank analysis giving x_i(t), temporal envelope computation giving e_i(t), windowing and FFT giving E_i(k, n), Mel-filterbank mapping giving ξ_i(m, n), for i = 1, ..., 24.)

(ii) Envelope detection:
After filterbank analysis, the envelope, $e_i(t)$, of each channel output $x_i(t)$ is detected using the Hilbert transform $\mathcal{H}(\cdot)$ as,
$$e_i(t) = \sqrt{x_i(t)^2 + \mathcal{H}(x_i(t))^2} \tag{18}$$
(iii) Windowing and FFT:
The temporal envelope $e_i(t)$ of each channel is multiplied by a 256 ms Hamming window with a 32 ms overlap. Then, the modulation spectrum for the $i$th critical band is obtained by applying FFT (denoted as $\mathcal{F}(\cdot)$) to $e_i(t)$;
$$E_i(k, n) = \mathcal{F}(e_i(t)) \tag{19}$$
where $k$ and $n$ represent the indices of the frequency bin and frame, respectively.
(iv) Mel-filterbank mapping:
For each frame, the dimension of the modulation spectral coefficients is up to $24 \times k$ (e.g., $k = 512$). It is computationally expensive to use them directly for training. Moreover, the classifier is prone to overfitting. To reduce the feature dimension, we apply Mel-filterbank mapping [42] to the spectral coefficients of each frame. The rationale of this approach is to make the extracted features match more closely with what a human hears. The mapping can be obtained as follows,
$$\xi_i(m, n) = \mathrm{MAP}_{mel}(E_i(k, n)) \tag{20}$$
where $m$ is the index of the Mel-filterbank bins (e.g., $m = 30$). $\mathrm{MAP}_{mel}(\cdot)$ is the mapping operator. $\xi_i(m, n)$ is the modulation spectral coefficient at the $m$th frequency bin and the $n$th frame of the $i$th channel. We refer to $\xi_i$ as perceptually relevant modulation spectral features denoted by a vector $f_{PRMS}$ in the remainder of this paper.

Shown in Fig. 5 is the block diagram of the modulation spectral feature extraction from an input audio signal $x(t)$.

4) Noise Residual Features: The spectral features in the STFT domain and the modulation spectrum features can be used to capture the traces of spectral and perceptual distortion introduced by dereverberation processing. However, these distortions cannot be uniquely linked to dereverberation processing. For example, other non-malicious signal processing operations, such as signal enhancement, denoising, and down-sampling, may also cause such distortions. To differentiate between dereverberation processing and non-malicious signal processing operations, we propose to consider distortions that are uniquely attributed to dereverberation processing.

Generally, dereverberation methods start by estimating the RIR or spectrum of reverberation, followed by subtracting it from the reverberant signal. Due to the inaccurate estimation of the RIR or reverberation, the subtraction may lead to undesired results (e.g., negative values). To overcome this problem, a commonly accepted solution is to apply a nonlinear truncation to the signal after spectrum subtraction, which results in a specific residual noise, often referred to as an “artificial noise” or “musical noise”, to account for its perceptual characteristics [35]. Temporal discontinuities, window-FFT analysis and synthesis mismatch, and unbalanced spectral modification are other major sources of musical noise [43].

To illustrate the musical noise phenomenon, the spectrogram of an audio signal processed using the TSSS-based dereverberation method is shown in Fig. 6. The following observations can be made from Fig. 6:
• The musical noise in the mid/high frequencies is more pronounced (visible) than that in the low frequencies.
• The musical noise appears as isolated points in the spectrogram.

Fig. 6: Phenomenon of “musical noise” in the spectrogram of a dereverberated signal.

It can also be observed from Fig. 6 that the isolated musical noise is analogous to the “salt-and-pepper” noise in images. To track the musical noise from the input audio signal, the spectrogram of the queried audio signal is computed. This spectrogram can be treated as a 2D spatial image $\hat{X}(k, n)$. The features are extracted as follows.
(i) High-pass filtering:
Because the musical noise occurs in the mid and high frequency bins, removing the low frequency coefficients is expected to improve the robustness of the
extracted features. The filter output is expressed as

$$\hat{X}_{HP}(k, n) = \mathrm{Filter}_{HP}\big(\hat{X}(k, n)\big) \tag{21}$$

where FilterHP represents the high-pass filtering operator.

(ii) Musical noise extraction:
The second step is to extract the musical noise from the spectrogram. To achieve this goal, an image denoising method is applied to obtain a clean version of the spectrogram. Then, the musical noise residual r(k, n) can be extracted by subtracting the denoised spectrogram from X̂HP(k, n) as

$$r(k, n) = \hat{X}_{HP}(k, n) - \mathrm{Denoise}\big(\hat{X}_{HP}(k, n)\big) \tag{22}$$

where Denoise(·) represents the denoising operator. In this paper, the BM3D approach is employed due to its superior performance in image denoising applications [44].

(iii) Feature extraction:
Rather than using r(k, n) as the features, the following normalized central moments are extracted,

$$f_{CM}(d) = \frac{1}{K \times N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left| r(k, n) - \mu_r \right|^{d} \tag{23}$$

$$\mu_r = \frac{1}{K \times N} \sum_{n=1}^{N} \sum_{k=1}^{K} r(k, n) \tag{24}$$

where d = 1, ..., 10 are the orders. We call these features central moments of the musical noise residual and denote them as fCM.

B. Temporal Integration of Features

All of the feature vectors, except for fLS and fCM, are extracted on a frame-by-frame basis. Consequently, the feature dimension depends on the total length of each audio recording. To avoid feature dimension mismatch, we propose to use the first four moments for temporal feature integration. More specifically, the mean, variance, skewness, and kurtosis are calculated over all of the frames:

$$f_{\Delta,\mathrm{mean}} = \frac{1}{N} \sum_{n=1}^{N} f_{\Delta}(n) \tag{25}$$

$$f_{\Delta,\mathrm{var}} = \frac{1}{N} \sum_{n=1}^{N} \big(f_{\Delta}(n) - f_{\Delta,\mathrm{mean}}\big)^{2} \tag{26}$$

$$f_{\Delta,\mathrm{ske}} = \frac{\sum_{n=1}^{N} \big(f_{\Delta}(n) - f_{\Delta,\mathrm{mean}}\big)^{3}}{(N - 1)\sqrt{f_{\Delta,\mathrm{var}}^{3}}} \tag{27}$$

$$f_{\Delta,\mathrm{kur}} = \frac{\sum_{n=1}^{N} \big(f_{\Delta}(n) - f_{\Delta,\mathrm{mean}}\big)^{4}}{(N - 1) f_{\Delta,\mathrm{var}}^{2}} \tag{28}$$

where ∆ represents the name of the features in (5)-(16). The mean value integration operator is also applied to the modulation spectral feature vector fPRMS.

C. Feature Dimensionality Reduction

Table I lists the dimension of each feature vector after temporal integration. The combined dimension of all feature sets is up to 1035 (for K = 512), which is still quite large.

TABLE I: The Dimension of Each Individual Feature Set After Integration.
Feature Set    Dimension
fLS            K/2 + 1
fSSD           48 (12 × 4)
fPRMS          720 (24 channels × 30 mel bins)
fCM            10

In machine learning and statistics, dimensionality reduction (DR) is a process of reducing the number of features by transforming the feature vector in the high-dimensional space to a lower-dimensional space. Numerous DR methods have been proposed, which can be classified into linear and nonlinear DR approaches. The linear DR schemes include principal component analysis, linear discriminant analysis, locality preserving projections (LPP), multidimensional scaling, and so on, whereas the nonlinear DR schemes include isomap, locally linear embedding, and Laplacian eigenmaps [45]. The following approaches are used for dimensionality reduction:
• Locality preserving projections (LPP) [46], which are linear projective mappings that optimally preserve the neighborhood structure of the feature set. It is an alternative to principal component analysis (PCA) due to its superior performance.
• Laplacian eigenmaps [47], which use spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the feature lies in a low-dimensional manifold in a high-dimensional space. The algorithm provides a computationally efficient approach for nonlinear dimensionality reduction that also has locality preserving properties and a natural connection to clustering.

In general, nonlinear methods perform better than linear methods do, though at the expense of a higher computational cost. Our experimental results in this paper also reinforce this claim.

D. Summary of the Proposed Anti-Forensic Detection

For each audio recording, a feature vector is extracted, integrated, and projected into the lower dimension. After the feature vectors are processed using DR methods, a classifier such as a support vector machine (SVM) is trained. The trained classifier is then used to evaluate classification performance on a test dataset. The proposed machine learning process consists of two stages, that is, the training phase and the detection phase.

Training phase:
1) Given the training dataset, the feature vector f∗ of each recording, with ∗ representing one of the features listed in Table I, is extracted using the methods described in Section IV-A.
2) The feature vector f∗ is processed using the temporal integration method (see Section IV-B), followed by dimensionality reduction on all of the features (see Section IV-C).
3) The final feature vector f of each recording is fed into the SVM to learn the underlying model M, which consists of an optimal hyperplane C [48].

Detection phase:
1) Given the suspected forgery in the evidential recording signal y, the features are extracted (from the test dataset) using the methods in Section IV-A, followed by feature integration (see Section IV-B) and dimensionality reduction (see Section IV-C).
2) The resulting feature vector fy is then fed into model M and compared with C. If fy belongs to the positive side of C, then the recording is classified as an anti-forensic recording. Otherwise, it is classified as an authentic recording.

V. PERFORMANCE EVALUATION

In this section, the performance degradation of the splicing detection method discussed in Section II under the anti-forensic attacks is first analyzed. The effectiveness of the proposed anti-forensic detection using (i) the PESQ measure and (ii) rich model- and SVM-based classification system is evaluated on both synthetic data and real-world speech recordings.

A. Datasets and Experimental Setup

Two datasets, one synthetic and one real-world dataset, are used for the performance evaluation of the proposed framework. The speech dataset of the TIMIT corpus [34] is used to generate the synthetic dataset. The TIMIT dataset contains broadband recordings (with a sampling frequency fs = 16 kHz) of 630 speakers (438 male and 192 female). In the recording, each speaker utters 10 sentences. Each utterance is approximately 3 seconds long. The synthetic room response generated by the source-image method [49] for a rectangular room is convolved with the speech to obtain the reverberant signal, followed by adding artificial Gaussian noise. The signal-to-noise ratio (SNR) value of noise addition was set to 30 dB, which is also the default value in the simulation studies.

Similarly, a real-world dataset [15], [16] that consists of more than 1000 speech recordings is used. These speech recordings were recorded using four different types of commercial-grade external microphones in various environments, such as outdoors, small offices (predominantly furnished with carpet and drywall), stairs (predominantly ceramic tiles and concrete walls), and restrooms (predominantly furnished with ceramic tiles). In each recording environment, the content was either read by a human or played back using a loudspeaker. The audio was originally recorded with a 44.1 kHz sampling frequency and 16 bits/sample resolution and then downsampled to 16 kHz. Each recording is 7 s to 9 s long.

For generation of the forged dataset, two reverberant signals, either artificially introduced or recorded directly, are randomly selected and spliced by simply assembling them together in the time domain. Furthermore, to avoid any perceptible distortion, the splicing location is selected in the silent region of the recording.

For the classification, a binary support vector machine (SVM) [48] with a radial basis kernel function is used due to its superior performance. For each experiment, the optimal parameters for the classifier are determined using a grid-search technique with five-fold cross-validation on the training data. The parameters C (penalty parameter) and γ (kernel parameter of the radial basis function) were selected from the multiplicative grid GC × Gγ, GC = {2^a}, a ∈ {−10, 10}, Gγ = {2^b}. In each experiment, half of the speech recordings are randomly selected for training, and the remainder are used for testing. Each experiment is repeated 10 times, and the classification accuracy is averaged over all runs.

The performance of the proposed detection scheme is evaluated in terms of the true positive rate (TPR, the probability of the anti-forensic speech being correctly identified), false positive rate (FPR, the probability of the original reverberant speech being incorrectly classified as anti-forensic speech), true negative rate (TNR, the probability of the original reverberant speech being correctly identified), false negative rate (FNR, the probability of the anti-forensic speech being incorrectly classified as original speech) and accuracy (Acc), defined as the average value of TPRτ and TNRτ with the optimal decision threshold τ:

$$\mathrm{Acc} = \frac{TPR_{\tau} + TNR_{\tau}}{2} \tag{29}$$

B. Performance of Splicing Detection with Anti-forensic Attacks

The preliminary results presented in Section III indicate that the three proposed anti-forensic techniques can potentially bypass the splicing detection method proposed in [20], [21]. Further details regarding the performance evaluation under the selected anti-forensic processing are provided here. To this end, the splicing detection performance of the method proposed in [20], [21] in the presence/absence of anti-forensic attacks is evaluated on the synthetic dataset. Shown in Fig. 7 are the receiver operating characteristic (ROC) curves for the forged dataset subjected to anti-forensic processing using SCIF, GSM-NMF and TSSS. As shown in Fig. 7, in the absence of anti-forensic attacks, the selected splicing detection method achieves very high accuracy (≈ 96.99%). However, in the presence of anti-forensic attacks, the accuracy decreases to 60.43%, 88.16%, and 58.77% for SCIF, GSM-NMF and TSSS, respectively. Moreover, Fig. 7 shows that the detection performance for GSM-NMF-based dereverberation is approximately 88%, which makes it an unattractive choice for anti-forensic purposes. However, SCIF- and TSSS-based dereverberation methods significantly degrade the detection performance.
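The detection rates and the accuracy in (29), defined in Section V-A, can be illustrated with a short sketch. The following is a minimal example in plain Python, not the evaluation code used in this paper; the function name `detection_rates` and the label convention (1 for anti-forensic speech, 0 for original reverberant speech) are assumptions made for illustration, and the predictions are assumed to have already been thresholded at τ.

```python
def detection_rates(y_true, y_pred):
    """Compute TPR, FPR, TNR, FNR and Acc = (TPR + TNR) / 2 as in (29).

    Assumed label convention (illustrative only):
    1 = anti-forensic speech, 0 = original reverberant speech.
    y_pred holds hard decisions, i.e. scores already thresholded at tau.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

    positives = tp + fn  # number of anti-forensic recordings
    negatives = tn + fp  # number of original recordings
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    tpr = tp / positives
    tnr = tn / negatives
    return {
        "TPR": tpr,
        "FNR": fn / positives,
        "TNR": tnr,
        "FPR": fp / negatives,
        "Acc": (tpr + tnr) / 2,
    }


if __name__ == "__main__":
    rates = detection_rates([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
    print(rates)
```

Because Acc averages the two per-class rates, it is insensitive to class imbalance between anti-forensic and original recordings, unlike the raw fraction of correct decisions.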
[Figure 7: ROC curves (TPR versus FPR). Legend: Without Anti-forensics, With SCIF Method, With GSM-NMF Method, With TSSS Method.]
Fig. 7: Detection performance with three different anti-forensics. The frame size is set to 2 s.

C. PESQ Evaluation

In our second experiment, PESQ scores are used to evaluate audio quality degradation due to anti-forensic processing based on the selected dereverberation methods, including SCIF, GSF-NMF and TSSS. Specifically, the mean opinion score (MOS) metric is used to determine the level of the musical noise [43] introduced by dereverberation. The MOS scale is shown in Table II. To achieve this goal, the synthetic forged dataset is processed using the selected dereverberation schemes with the default parameter settings recommended in [31]–[33]. For each test signal (processed using the dereverberation method), the PESQ score (in terms of MOS) is calculated using the corresponding reference signal. Shown in Fig. 8 are the distributions of the MOS values of each anti-forensic technique used. The mean values of these MOS for SCIF, GSF-NMF and TSSS are 1.733, 2.1995, and 2.1696, respectively. The low MOS values here suggest the presence of (i) a strong difference between the original audio and that processed using the anti-forensic methods, and (ii) musical noise introduced by the anti-forensic methods.

It is important to highlight that the PESQ-based degradation calculated here reveals a degree of distortion between the reference and processed audio recordings. To evaluate distortion due to anti-forensic processing, a small-scale subjective test is also conducted. For this purpose, two graduate students (one male and one female) with normal hearing ability are enrolled to participate in the subjective evaluation. The students are asked to listen to original and processed audio recordings and provide feedback in terms of ‘unnoticeable’, ‘noticeable’, ‘noticeable but not annoying’, and ‘annoying’. The average subjective test score for the audio recordings processed using the selected anti-forensic methods indicates that processing distortion for the GSF-NMF and TSSS methods is barely noticeable, whereas the distortion for the SCIF method is noticeable but not annoying. The subjective evaluation reveals that the subjective evaluation score depends on the individual and on the audio recording used. The motivation behind selecting three anti-forensic methods is to evaluate the performance of the proposed method not only on different anti-forensic methods but also on the different distortion levels introduced by the dereverberation methods. Although the audio recordings processed using the selected anti-forensic methods exhibit low PESQ scores, our very small-scale informal subjective test indicates that a low PESQ score does not directly translate into perceptual degradation.

We also believe that the PESQ score for anti-forensic processing can be improved by fine-tuning the parameters of the dereverberation algorithms, such as the SCIF algorithm. To achieve this objective, we tune the parameters of SCIF to increase the PESQ score of the processed audio recordings to 2 and test the performance of the proposed algorithm on these processed audio recordings. The average detection performance of the proposed method on the new set of processed audio is summarized in Table III. It can be observed that for some subsets of features, the accuracy decreases slightly. However, the overall accuracy is still in the acceptable range, that is, over 97%.

TABLE II: Scale of MOS.
MOS    Quality      Impairment
5      Excellent    Imperceptible
4      Good         Perceptible but not annoying
3      Fair         Slightly annoying
2      Poor         Annoying
1      Bad          Very annoying

[Figure 8: Histograms of MOS (horizontal axis: MOS, 1.5 to 3; vertical axis: Frequency). Panels: (a) SCIF, (b) GSF-NMF, (c) TSSS.]
Fig. 8: Frequency distribution of the MOS of each anti-forensic technique: (a) SCIF, (b) GSF-NMF, and (c) TSSS

TABLE III: Average Classification Accuracy of the SCIF method with Enhanced Audio Quality
Feature Set     Average Detection Accuracy
fLS             96.34%
fSSD            98.44%
fPRMS           99.59%
fCM             92.12%
All Features    97.02%

D. Anti-Forensic Identification Using the Rich Model

The performance of the proposed rich model-based anti-forensic identification is evaluated on both the synthetic and real-world datasets. The performance is evaluated for both the source-matched and source-mismatched cases.

1) Source-Matched Cases for Synthetic Data: In our third experiment, we evaluate the performance of the proposed anti-forensic identification method for source-matched cases on the synthetic dataset. In this experiment, we assume that the forensic investigator has knowledge of the anti-forensic technique used. This is a reasonable assumption because the
long-term surveillance of suspected forgers can be used to predict their attack models. To this end, an SVM-based classifier is trained and tested using the rich-feature set extracted from the datasets generated using the selected dereverberation methods. Moreover, the identification performance is evaluated for individual feature sets, including fLS, fSSD, fPRMS, and fCM, as well as the entire feature set. Table IV, Table V, and Table VI show the classification accuracies for the individual feature sets and for the entire feature set of the proposed scheme for SCIF, GSF-NMF and TSSS. Here, the result of the entire feature set without dimensionality reduction is omitted due to its low accuracy and high training cost.

TABLE IV: Classification Accuracies of the Proposed Rich Model-based Identification for SCIF
               Dimensionality Reduction Method
Feature Set    NO        LPP       Laplacian
fLS            51.71%    83.46%    100.0%
fSSD           99.07%    89.97%    99.98%
fPRMS          70.74%    96.63%    100.0%
fCM            70.56%    93.95%    100.0%
ALL            -         99.88%    100.0%

TABLE V: Classification Accuracies of the Proposed Rich Model-based Identification for GSF-NMF
               Dimensionality Reduction Method
Feature Set    NO        LPP       Laplacian
fLS            58.91%    66.96%    100.0%
fSSD           98.19%    90.98%    99.95%
fPRMS          63.98%    96.41%    99.94%
fCM            72.62%    96.15%    99.98%
ALL            -         99.86%    100.0%

TABLE VI: Classification Accuracies of the Proposed Rich Model-based Identification for TSSS
               Dimensionality Reduction Method
Feature Set    NO        LPP       Laplacian
fLS            56.71%    63.02%    100.0%
fSSD           90.72%    94.36%    100.0%
fPRMS          64.74%    98.64%    100.0%
fCM            65.34%    96.31%    99.99%
ALL            -         99.66%    100.0%

The following observations can be made based on the results presented in Tables IV-VI:
• Generally, the individual feature sets without dimensionality reduction exhibit the worst performance (except fSSD).
• In the absence of DR, the feature vector of fSSD is the most effective for detecting the presence of the three dereverberation methods. These results indicate that among all feature sets, spectral shape and distribution features are the most efficient ones in capturing the distortions introduced by the dereverberation methods.
• The linear dimensionality reduction method (LPP) significantly improves the classification accuracy for all feature sets except for fSSD.
• The performance with the LPP method fluctuates significantly; it is a function of the feature set used and the underlying dereverberation method.
• As expected, the nonlinear dimensionality reduction method (Laplacian) outperforms both the original feature set and the LPP method. The accuracy is nearly 100% for the entire feature set.
• For the entire feature set, both the LPP and the Laplacian methods perform quite well, and the Laplacian method is shown to be superior to the LPP method.
• The LPP method degrades the classification accuracy of the feature set fSSD for the SCIF and GSF-NMF methods. The reasons can be explained as follows:
  – The classification accuracies of the feature set fSSD for the SCIF and GSF-NMF methods have already exceeded 98%. The use of the DR method may not be able to further improve the accuracy.
  – The feature vectors may not satisfy the linear assumption used by LPP. Therefore, the LPP method returns non-optimal feature subsets. We re-conduct the experiments on the feature set fSSD using linear discriminant analysis [50]. The accuracies for SCIF and GSF-NMF are 99.27% and 99.67%, respectively. The results indicate that the performance strongly depends on the choice of the linear dimensionality reduction method.
• The DR significantly improved the accuracy, which is a counterintuitive observation. To determine the reason behind this phenomenon, we repeat experiments without using DR hundreds of times. It has been observed that the training accuracies are approximately 100% on average. However, the testing accuracies are similar to the results presented in Table IV, Table V and Table VI (ranging from 55% to 70%). It can also be observed that the DR methods are the most effective for fLS and fPRMS, which have a dimensionality of several hundreds. It is therefore reasonable to conclude that the lower testing accuracies might be attributed to the overfitting of the classifier. To further investigate this issue, a machine learning method that is robust to overfitting is utilized to train the model on the original feature sets (without DR) [51]. The experimental results are shown in Table VII. Compared with the previous results, the average improvements in terms of accuracy for fLS, fPRMS and fCM were 22%, 14%, and 21%, respectively. If the DR method (Laplacian) is applied, the classification accuracy reaches 99%, which is consistent with the results presented in Table IV, Table V and Table VI (last columns). Consequently, the poor detection performance, in the absence of DR, can be attributed to the overfitting issues. Hence, the gain due to the DR can be achieved by:
  – Reducing the feature dimension and lowering the risk of overfitting.
  – Preserving the most important features and the locality properties of features.
• Finally, the nonlinear DR method always outperforms the linear DR approach. Optimal DR selection is not the focus of this paper; therefore, we have not evaluated the performance of the other DR methods. Furthermore, there is no universal optimal DR method for all cases. For the performance evaluation of the following experiments, only the nonlinear (Laplacian) DR method is used.

TABLE VII: Classification Accuracies of the Proposed Rich-feature Model with AdaBoost (without dimensionality reduction)
               Dereverberation Method
Feature Set    SCIF      GSF-NMF   TSSS
fLS            88.71%    74.80%    70.16%
fPRMS          86.81%    75.28%    78.05%
fCM            88.82%    88.35%    94.97%

2) Source-Mismatched Case on Synthetic Data: In practical applications, the forensic investigator may not have knowledge of the dereverberation techniques adopted by the attacker. It is therefore not reasonable to train the classifier on the same dereverberation method. In our fourth experiment, the performance of the proposed rich model-based anti-forensic identification method is evaluated for a source-mismatched case.

The average classification accuracies of the proposed scheme for the individual feature sets, where the classifier is trained on dereverberation method A and tested on method B (A, B ∈ {SCIF, GSM-NMF, TSSS}) are summarized in Table VIII(a). It can be observed from Table VIII(a) that the classification accuracy depends on the dereverberation algorithms. The lowest accuracy (85.5%) is obtained when A = SCIF and B = TSSS and for the feature set fLS. Feature set fCM yields the best accuracy (i.e., ∼ 100%), regardless of the dereverberation methods employed. Furthermore, the noise residual feature (fCM) is the most robust metric among the remaining feature sets for the source-mismatched scenario. The overall accuracy for all of the feature sets is 95.62%, which is equivalent to a performance deterioration of ≈ 5% compared with the source-matched case.

Likewise, shown in Fig. 9 are the plots of ROC curves along with detection accuracies of the proposed scheme for the source-mismatched case with the entire feature set. Here, “A → B” indicates that the classifier is trained on method A and tested on method B. As expected, the accuracy decreases as a result of the source mismatch. The following observations can be made from Fig. 9:
• The percentage of performance deterioration depends on the dereverberation approaches used.
• There is almost no performance deterioration if the classifier is trained on SCIF (see Figs. 9 (a) & (b)).
• Training on the GSM-NMF method and testing on the SCIF method yields the largest decrease of ≈ 6%.
• The average accuracy of the source-mismatched case is 96.2%, which indicates that the proposed scheme is also effective for the source-matched scenario.

[Figure 9: ROC curves (TPR versus FPR), one panel per train → test pair, with the accuracy annotated in each panel: (a) SCIF → GSM-NMF, 99.23%; (b) SCIF → TSSS, 97.50%; (c) GSM-NMF → SCIF, 94.13%; (d) GSM-NMF → TSSS, 96.70%; (e) TSSS → SCIF, 94.56%; (f) TSSS → GSM-NMF, 95.08%.]
Fig. 9: ROC curves of the proposed scheme on the source-mismatched case.

From the experimental results presented in Table VIII(a) and shown in Fig. 9, the following facts can be observed:
• The feature set (fCM) extracted from the musical noise residual results in the best accuracy.
• The feature set fCM is robust to the source-mismatched case because only the noise residual is used for feature extraction, which alleviates the interference of the signal content.
• The average classification accuracy for the entire feature set (96.2%) is marginally superior to the best performance for the individual feature set (e.g., 95.6%).

3) Robustness Against Noise Addition: In our fifth experiment, we investigate the robustness of the proposed method against additive white Gaussian noise (AWGN) attack. For this purpose, the performance of the proposed method is evaluated in the presence of AWGN noise with various signal-to-noise ratio (SNR) values. Shown in Fig. 10 are the detection accuracies of the proposed method for the entire feature set under various SNR values. It can be observed from Fig. 10 that there is no significant accuracy degradation for both the SCIF and GSM-NMF methods in the source-matched case. The variation in accuracy can be attributed to the random selection of the training and testing datasets. Moreover, the performance degrades for the TSSS method for SNR ≤ 15 dB. It can be observed from Fig. 10 that increasing the noise level slightly decreases the detection accuracy. Overall, the proposed method is robust to AWGN attack for SNR > 15 dB.

In our sixth experiment, we evaluate the robustness of the proposed “musical noise” feature fCM under AWGN attack. For this purpose, the performance for fCM is also evaluated in the presence of AWGN noise with various signal-to-noise ratio (SNR) values. Shown in Table IX is the corresponding detection performance under the source-matched case. It can be observed from Table IX that the selected feature set
fCM achieves an extremely high accuracy of ∼ 100% for SNR = 20 dB, and it decreases to 95% for SNR = 15 dB. The performance degradation for stronger noise can be attributed to the fact that the residual noise captures distortions due to the anti-forensic methods and background noise. The presence of strong background noise is expected to degrade the effectiveness of fCM. The detection performance shown in Fig. 10 and Table IX validates this claim.

TABLE VIII: Classification Accuracies of the Proposed Scheme on Different Feature Subsets for Source-mismatched Case
(a) Synthetic Data
Training       fLS, tested on               fSSD, tested on              fPRMS, tested on             fCM, tested on
Methods        SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS
SCIF           -        97.5%    85.5%      -        91.2%    98.6%      -        97.4%    99.0%      -        99.4%    100%
GSM-NMF        99.7%    -        96.7%      89.1%    -        90.0%      98.6%    -        96.6%      99.5%    -        97.0%
TSSS           99.4%    98.0%    -          91.4%    90.0%    -          90.82%   89.7%    -          100%     100%     -
Average Acc    96.07%                       91.72%                       95.35%                       99.32%

(b) Real-World Recordings
Training       fLS, tested on               fSSD, tested on              fPRMS, tested on             fCM, tested on
Methods        SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS       SCIF     GSM-NMF  TSSS
SCIF           -        97.9%    96.5%      -        94.4%    98.6%      -        99.9%    99.3%      -        99.9%    98.1%
GSM-NMF        96.8%    -        97.8%      95.0%    -        98.8%      96.9%    -        99.5%      99.5%    -        100%
TSSS           94.2%    97.8%    -          95.8%    97.8%    -          99.8%    99.8%    -          100%     100%     -
Average Acc    96.80%                       96.73%                       99.20%                       99.58%

[Figure 10: Detection accuracy (%) versus SNR (dB), 15 dB to 35 dB. Curves: SCIF (Matched), GSM-NMF (Matched), TSSS (Matched), Mismatched (average).]
Fig. 10: Detection accuracy of the proposed entire feature set under various noise conditions.

TABLE IX: Detection Accuracies of the Proposed Feature fCM Under Noisy Conditions.
Anti-Forensic    SNR
Methods          20 dB     15 dB
SCIF             100%      94.65%
GSM-NMF          99.93%    94.86%
TSSS             100%      96.84%

4) Performance of Real-world Recordings: In our seventh experiment, we evaluate the performance of the proposed scheme for real-world recordings. For this experiment, only the source-mismatched case is considered.

Shown in Table VIII(b) are the classification accuracies of the proposed scheme for the individual feature sets with real-world recordings. It can be observed from Table VIII(b) that, for the real-world recordings, the average accuracies for the feature sets of fLS and fCM do not degrade compared with the synthetic dataset. Moreover, marginal performance improvements over the synthetic dataset are observed for the feature sets of fSSD (5% ↑) and fPRMS (3.85% ↑). These improvements can be attributed to the fact that dereverberation processing removes both the reverberation and background noise, which leads to distinct noise levels in the original reverberant signal and anti-forensic signal. The proposed feature set also captures the difference of noise levels. Finally, the experimental results for the synthetic data and real-world recordings show that the feature (fCM) extracted from the musical noise outperforms the other features and is the most robust for the source-mismatched scenario.

Likewise, shown in Fig. 11 are plots of detection performance in terms of ROC curves of the proposed scheme for the real-world recordings for the entire feature set. As shown in Fig. 11, for the real-world data, an average accuracy of 97.78% is achieved. This result is comparable to that for the synthetic dataset (see Fig. 9). It is reasonable to conclude that the entire feature set does not result in superior performance. The underlying reason is that the DR method does not guarantee that an optimal feature subset is obtained. If any subset (such as fCM) of the entire feature set results in good performance, increasing the feature dimension does not guarantee further improvement in the classification accuracy. As a result, the musical noise feature set, fCM, significantly outperforms the other feature sets and the entire feature set. This is mainly because the musical noise feature is specifically designed to characterize the musical noise introduced by dereverberation processing.
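To make the musical noise feature fCM discussed above concrete, the residual in (22) and the central moments in (23)-(24) can be sketched in a few lines. This is a minimal illustration in plain Python and not the implementation used in this paper: the function names are hypothetical, the spectrogram is represented as a list of rows, and the denoised spectrogram is assumed to be supplied by the caller (the paper uses BM3D [44]; any denoiser substituted here is an assumption).

```python
def noise_residual(x_hp, x_denoised):
    """Residual r(k, n) = X_HP(k, n) - Denoise(X_HP(k, n)), as in (22).

    x_hp and x_denoised are lists of rows with identical shape; the denoised
    spectrogram must be computed beforehand by the caller.
    """
    if len(x_hp) != len(x_denoised) or any(
        len(a) != len(b) for a, b in zip(x_hp, x_denoised)
    ):
        raise ValueError("spectrograms must have the same shape")
    return [
        [a - b for a, b in zip(row_hp, row_dn)]
        for row_hp, row_dn in zip(x_hp, x_denoised)
    ]


def central_moment_features(residual, max_order=10):
    """Return [f_CM(1), ..., f_CM(max_order)] following (23)-(24).

    f_CM(d) is the mean of |r(k, n) - mu_r|^d over all K x N entries,
    where mu_r is the mean of the residual.
    """
    values = [v for row in residual for v in row]
    if not values:
        raise ValueError("residual must not be empty")
    count = len(values)            # K x N
    mu_r = sum(values) / count     # (24)
    return [
        sum(abs(v - mu_r) ** d for v in values) / count  # (23)
        for d in range(1, max_order + 1)
    ]


if __name__ == "__main__":
    # Toy 2 x 2 "spectrogram" and a hypothetical denoised version of it.
    x_hp = [[1.0, 3.0], [1.0, 3.0]]
    x_dn = [[1.0, 1.0], [1.0, 1.0]]
    f_cm = central_moment_features(noise_residual(x_hp, x_dn))
    print(len(f_cm), f_cm[:3])
```

With max_order = 10 the output has the dimension listed for fCM in Table I. Because the moments depend only on the residual and not on the underlying spectrogram values, they carry little information about the speech content, which is the property invoked above to explain the robustness of fCM.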
(a) SCIF→GSM-NMF (b) SCIF→TSSS TABLE XI: Classification Results of the Proposed Entire
1 1
0.9 0.9
Feature Set Against Mean Filtering
0.8 0.8
TPR
TPR
Training Identified Methods
0.7 0.7
Methods SCIF GSM-NMF TSSS
0.5 0.5 SCIF - 99.82% 98.08%
0 0.1 0.2 0.3 0 0.1 0.2 0.3
FPR FPR GSM-NMF 97.90% - 96.50%
(c)GSM-NMF→SCIF (d)GSM-NMF→TSSS
1 1
TSSS 96.76% 98.16% -
0.9 0.9
0.8 0.8 Average Acc 97.87%
TPR
TPR
0.7 0.7
0.5
0 0.1 0.2 0.3
0.5
0 0.1 0.2 0.3
6) Further Discussion on Source-Mismatched Issue: The
FPR
(e)TSSS→SCIF
FPR
(f )TSSS→GSM-NMF
performance of the proposed method has been extensively
1 1
evaluated using different datasets and subjected to a diverse
0.9 0.9
0.8 0.8
set of attacks. In this section, a considerably more challenging
TPR
TPR
0.7 0.7 issue, that of classifier training and testing, is briefly discussed.
0.6 Accuracy=93.51% 0.6 Accuracy=97.78% Here, the SVM-based classifier is trained on the synthetic
0.5
0 0.1 0.2 0.3
0.5
0 0.1 0.2 0.3
database, and then it is used to classify the real-world speech
FPR FPR recordings. Shown in Table XII are the classification accu-
Fig. 11: ROC Curves for the Real-world Recordings for racies of the proposed entire feature set. It can be observed
Source-mismatched Case. from Table XII that the performance degrades significantly as
a result of the highly mismatched sources.
TABLE XII: Classification Results of the Proposed Entire Feature Set

Training     Identified Methods
Methods      SCIF      GSM-NMF   TSSS
SCIF         83.58%    69.42%    75.86%
GSM-NMF      65.19%    79.74%    67.75%
TSSS         53.20%    72.37%    52.51%

Moreover, the performance of the proposed musical noise feature set, fCM, is also investigated. Shown in Table XIII are the classification accuracies of the proposed feature set fCM. As shown in Table VIII, the accuracies decrease moderately if the dereverberation methods for training and testing are the same (average decrease is 5.3%). However, if the dereverberation methods for training and testing are different, the average classification accuracy decreases significantly, that is, from 99% to 71.06%. Moreover, the performance across the different dereverberation methods becomes highly unreliable: the accuracy fluctuates from 99.16% to 51.23%. It can also be observed from Tables XII and XIII that fCM yields better performance than the entire feature set. This is because fCM characterizes only the “musical noise”, which is less dependent on the speech content.

TABLE XIII: Classification Results of the Proposed Feature fCM

Training     Identified Methods
Methods      SCIF      GSM-NMF   TSSS
SCIF         93.45%    71.87%    54.57%
GSM-NMF      84.02%    97.10%    51.23%
TSSS         65.78%    99.16%    91.05%

5) Robustness to Filtering Attacks: In our final experiment, we evaluate the performance of the proposed method under filtering attacks. We investigate the robustness of the proposed feature sets against median filtering and mean filtering attacks. For this purpose, a real-world dataset is processed with the selected filtering methods (median and mean filtering with a window size equal to p). The performance of the proposed method is then evaluated on the resulting two filtered datasets (one for median filtering and another for mean filtering).

Shown in Table X are the classification accuracies for the entire feature set for median filtering with p = 5 samples. It can be observed from Table X that the proposed method is robust against the median filtering attack because almost no performance deterioration is observed compared with the results presented in Fig. 11 (from 97.78% to 97.58%).

TABLE X: Classification Results Using Entire Feature Set Against Median Filtering

Training     Identified Methods
Methods      SCIF      GSM-NMF   TSSS
SCIF         -         99.05%    98.26%
GSM-NMF      97.20%    -         98.84%
TSSS         97.10%    95.00%    -
Average Acc  97.58%

Likewise, shown in Table XI are the classification accuracies for the entire feature set for the mean filtering attack with p = 5 samples. It can be observed from Table XI that the proposed method is also robust against the mean filtering attack because there is no significant performance degradation under the mean filtering attack (see Fig. 11).
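The filtering attacks described above can be sketched as follows. This is a minimal illustration of median and mean filtering with a window of p = 5 samples, not the implementation used in the experiments. The helper names `sliding_filter`, `median_filter`, and `mean_filter` are hypothetical, and the edge handling (replicating the first and last samples) is an assumption, since the boundary treatment is not specified in the text.

```python
from statistics import mean, median


def sliding_filter(signal, p, reducer):
    """Apply a length-p sliding-window filter to a 1-D sequence of samples."""
    if p < 1 or p % 2 == 0:
        raise ValueError("p must be a positive odd integer")
    samples = list(signal)
    if not samples:
        return []
    half = p // 2
    # Assumption: pad by replicating the edge samples so the output
    # has the same length as the input.
    padded = [samples[0]] * half + samples + [samples[-1]] * half
    return [reducer(padded[i:i + p]) for i in range(len(samples))]


def median_filter(signal, p=5):
    """Median filtering attack with window size p."""
    return sliding_filter(signal, p, median)


def mean_filter(signal, p=5):
    """Mean (moving-average) filtering attack with window size p."""
    return sliding_filter(signal, p, mean)


if __name__ == "__main__":
    # Toy signal with one isolated impulse (illustrative values only).
    x = [0.0, 0.0, 1.0, 0.0, 0.0]
    print(median_filter(x))  # the median removes the isolated impulse
    print(mean_filter(x))    # the mean spreads it over the window
```

Both filters smooth fine temporal detail, which is why they are natural candidates for attacking features computed from residual artifacts.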
In short, it can be observed from Fig. 11, Table X, and Table XI that the proposed method is robust against median and mean filtering attacks.

In general, the performance in this case, in which the classifier is trained on the synthetic database and tested on real-world recordings, is less reliable than that in the other cases. This result may be due to the following reasons: 1) some features, such as fLS and fSSD, are speech content-dependent, which results in poor performance in this highly source-mismatched case; 2) the synthetic channel impulse response generated by the image source model [52] does not match the actual channel impulse response in real-world recordings; and 3) the synthetic background noise is also distinct from the real background noise that is typically present in real-world speech. In practice, it is unnecessary to use the synthetic database for training because there are plenty of low-cost portable recorders for capturing sufficient real-world speech for training.

VI. CONCLUSION

In this paper, we proposed a novel framework for anti-forensic processing to attack the environmental-signature-based audio splicing detection method [21]. Three dereverberation methods are considered to illustrate the effectiveness of the proposed anti-forensic attack. We also proposed a framework for detecting anti-forensic attacks. To detect the presence of potential anti-forensic processing, both the objective quality evaluation and a machine learning-based technique are used. Specifically, a rich-feature model is used to capture traces of anti-forensic processing. The proposed rich-feature model consists of Fourier coefficients, spectral properties, high-order statistics of “musical noise” residuals, and perceptually relevant features to capture traces of dereverberation processing. The performance of the proposed framework is evaluated on synthetic and real-world datasets. The experimental results demonstrate that the proposed scheme can detect the presence of anti-forensic processing. More specifically, the tests on the individual feature sets show that the features extracted from the “musical noise” residual result in the best performance. Moreover, the effectiveness of the proposed method has also been evaluated for source-matched and source-mismatched scenarios. Finally, the robustness of the proposed method has also been evaluated for common audio processing attacks, including AWGN, median filtering, and mean filtering.

ACKNOWLEDGMENTS

This work is supported by the National Natural Science Foundation of China (Grant No. 61402219), the 2013 Guangdong Natural Science Funds for Distinguished Young Scholars (S2013050014223), a grant from the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia, award number 12-INF2634-02, and a grant from the National Science Foundation, award number CNS-1440929. Thanks are extended to Louis Wai Yip Liu for proofreading the manuscript.

REFERENCES

[1] M. C. Stamm, M. Wu, and K. Liu, “Information forensics: An overview of the first decade,” IEEE Access, vol. 1, pp. 167–200, 2013.
[2] C. Grigoras, “Digital audio recording analysis: The electric network frequency (ENF) criterion,” International Journal of Speech Language and the Law, vol. 12, no. 1, pp. 1350–1771, 2005.
[3] D. Rodriguez, J. Apolinario, and L. Biscainho, “Audio authenticity: Detecting ENF discontinuity with high precision phase analysis,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp. 534–543, 2010.
[4] A. Hajj-Ahmad, R. Garg, and M. Wu, “Spectrum combining for ENF signal estimation,” IEEE Signal Processing Letters, vol. 20, no. 9, pp. 885–888, 2013.
[5] C. Jidong, L. Yuming, Y. Zhiyong, C. Richard W., and L. Yilu, “Tampering detection of digital recordings using electric network frequency and phase angle,” in Audio Engineering Society Convention 135. Audio Engineering Society, 2013.
[6] M. M. Elmesalawy and M. M. Eissa, “New forensic ENF reference database for media recording authentication based on harmony search technique using GIS and wide area frequency measurements,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 4, pp. 633–644, 2014.
[7] Q. Liu, A. Sung, and M. Qiao, “Detection of double MP3 compression,” Cognitive Computation, Special Issue: Advances in Computational Intelligence and Applications, vol. 2, no. 4, pp. 291–296, 2010.
[8] R. Yang, Y. Shi, and J. Huang, “Detecting double compression of audio signal,” in Proc. of SPIE Media Forensics and Security II 2010, vol. 7541, 2010, pp. 75410K-1–75410K-10.
[9] R. Yang, Y. Shi, and J. Huang, “Defeating fake-quality MP3,” in Proc. of MM&Sec 2009, the 11th ACM Workshop on Multimedia and Security, 2009, pp. 117–124.
[10] T. Bianchi, A. D. Rosa, M. Fontani, G. Rocciolo, and A. Piva, “Detection and classification of double compressed MP3 audio tracks,” in Proc. of IHMMSec 2013, the first ACM Workshop on Information Hiding and Multimedia Security, 2013, pp. 159–164.
[11] D. Luo, W. Luo, R. Yang, and J. Huang, “Identifying compression history of wave audio and its applications,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 10, no. 3, p. 30, 2014.
[12] H. Malik and H. Farid, “Audio forensics from acoustic reverberation,” in Proc. of ICASSP 2010, IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, 2010, pp. 1710–1713.
[13] H. Malik and H. Zhao, “Recording environment identification using acoustic reverberation,” in Proc. of ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 2012, pp. 1833–1836.
[14] H. Zhao and H. Malik, “Audio forensics using acoustic environment traces,” in Proc. of SSP 2012, IEEE Statistical Signal Processing Workshop, Aug 2012, pp. 373–376.
[15] H. Zhao and H. Malik, “Audio recording location identification using acoustic environment signature,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1746–1759, 2013.
[16] H. Malik, “Acoustic environment identification and its application to audio forensics,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1827–1837, 2013.
[17] A. H. Moore, M. Brookes, and P. A. Naylor, “Roomprints for forensic audio applications,” in Proc. of WASPAA 2013, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[18] P. Nils, L. Howard, and F. Gerald, “Name that room: Room identification using acoustic features in a recording,” in Proc. of ACMMM 2012, the 20th ACM international conference on Multimedia, 2012, pp. 841–844.
[19] X. Valero and F. Alias, “Narrow-band autocorrelation function features for the automatic recognition of acoustic environments,” The Journal of the Acoustical Society of America, vol. 134, no. 1, pp. 880–890, 2013.
[20] H. Zhao, Y. Chen, R. Wang, and H. Malik, “Audio source authentication and splicing detection using acoustic environmental signature,” in Proc. of IHMMSec 2014, the 2nd ACM Workshop on Information Hiding and Multimedia Security, 2014, pp. 159–164.
[21] H. Zhao, Y. Chen, R. Wang, and H. Malik, “Audio splicing detection and localization using environmental signature,” 2014, arXiv:1411.7084.
[22] X. Pan, X. Zhang, and S. Lyu, “Detecting splicing in digital audios using local noise level estimation,” in Proc. of ICASSP 2012, IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, Japan, 2012, pp. 1841–1844.
[23] J. Chen, S. Xiang, W. Liu, and H. Huang, “Exposing digital audio forgeries in time domain by using singularity analysis with wavelets,” in Proc. of IHMMSec 2013, the first ACM workshop on Information Hiding and Multimedia Security, 2013, pp. 149–158.
[24] G. Valenzise, M. Tagliasacchi, and S. Tubaro, “Revealing the traces of JPEG compression anti-forensics,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 2, pp. 335–349, 2013.
[25] Z.-H. Wu, M. C. Stamm, and K. J. R. Liu, “Anti-forensics of median filtering,” in Proc. of ICASSP 2013, IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 3043–3047.
[26] W.-H. Chuang, R. Garg, and M. Wu, “How secure are power network signature based time stamps?” in Proc. of ACMCCS 2012, the ACM
conference on Computer and Communications Security, 2012, pp. 428–438.
[27] W.-H. Chuang, R. Garg, and M. Wu, “Anti-forensics and countermeasures of electrical network frequency analysis,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 12, pp. 2073–2088, 2013.
[28] P. A. Naylor and N. D. Gaubitch, Eds., Speech Dereverberation. Springer, 2010, vol. XVIII.
[29] N. Tomohiro, Y. Takuya, K. Keisuke, M. Masato, and J. Biing-Hwang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[30] C. Schuldt and P. Handel, “Noise robust integration for blind and non-blind reverberation time estimation,” in Proc. of ICASSP 2015, IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 56–60.
[31] N. D. Gaubitch, M. Brookes, and P. A. Naylor, “Blind channel magnitude response estimation in speech using spectrum classification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2162–2171, 2013.
[32] K. Kumar, R. Singh, B. Raj, and R. Stern, “Gammatone sub-band magnitude-domain dereverberation for ASR,” in Proc. of ICASSP 2011, IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4604–4607.
[33] M. Wu and D. Wang, “A two-stage algorithm for one-microphone reverberant speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 774–784, 2006.
[34] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium, Philadelphia, 1993.
[35] L. Katia, B. Jean-Marc, and D. PN, “A new method based on spectral subtraction for speech dereverberation,” Acta Acustica united with Acustica, vol. 87, no. 3, pp. 359–366, 2001.
[36] http://www.itu.int/rec/T-REC-P.862/en.
[37] M. Wolfel, “Enhanced speech features by single-channel joint compensation of noise and reverberation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 312–323, 2009.
[38] A. Sehr, R. Maas, and W. Kellermann, “Reverberation model-based decoding in the logmelspec domain for robust distant-talking speech recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 7, pp. 1676–1691, 2010.
[39] A. Lerch, An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. Wiley-IEEE Press, July 2012.
[40] T. H. Falk and W.-Y. Chan, “Modulation spectral features for robust far-field speaker identification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 90–100, 2010.
[41] http://www.acousticscale.org/wiki/index.php/The_Gammatone_Auditory_Filterbank.
[42] http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/.
[43] http://www.icsi.berkeley.edu/ftp/pub/speech/papers/gelbart-ms/mask/.
[44] A. Danielyan, V. Katkovnik, and K. Egiazarian, “BM3D frames and variational image deblurring,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1715–1728, 2012.
[45] http://lvdmaaten.github.io/drtoolbox/.
[46] X. He and P. Niyogi, “Locality preserving projections,” in Advances in Neural Information Processing Systems, vol. 16, 2003, pp. 234–241.
[47] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[48] C. Chang and C. Lin. (2012) LIBSVM: A library for support vector machines. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[49] E. A. Lehmann and A. M. Johansson, “Diffuse reverberation model for efficient image-source simulation of room impulse responses,” IEEE Transactions on Audio Speech and Language Processing, vol. 18, no. 6, pp. 1429–1439, 2010.
[50] A. J. Izenman, Linear Discriminant Analysis. Springer, 2008.
[51] http://graphics.cs.msu.ru/en/science/research/machinelearning/adaboost-toolbox.
[52] E. Lehmann and A. Johansson, “Prediction of energy decay in room impulse responses simulated with an image-source model,” The Journal of the Acoustical Society of America, vol. 1, no. 121, pp. 269–277, 2008.

Hong Zhao received his B.S. and Ph.D. degrees in information security from the Southwest Jiaotong University, Chengdu, China, in 2007 and 2013, respectively. From 2010 to 2012, he was a visiting scholar at the University of Michigan-Dearborn. He is currently a research fellow with the Department of Electrical and Electronic Engineering, South University of Science and Technology of China. His current research interests include steganalysis, audio forensics, wireless communication security, etc.

Yifan Chen received his B.Eng. (Hons I) and Ph.D. degrees in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore, in 2002 and 2006, respectively. From 2005 to 2007, he was a Research Fellow with the Singapore-University of Washington Alliance (SUWA) in Bioengineering, supported by the Singapore Agency for Science, Technology and Research (A*STAR), NTU, and University of Washington at Seattle, USA. From 2007 to 2012, he was a Lecturer and then a Senior Lecturer with the University of Greenwich, U.K., and with Newcastle University, U.K. In 2013, he was a Visiting Professor with the Singapore University of Technology and Design, Singapore. He is currently a Professor with the South University of Science and Technology of China. His current research interests include wireless and pervasive communications for healthcare, micro/nanoscale electromagnetic imaging, sensing, and information transmission, wireless propagation and network channel modeling, and cognitive wireless systems. Professor Chen received the Promising Research Fellowship in 2010 and the Early Career Research Excellence Award in 2009 from the University of Greenwich. He was also selected by China 8th Recruitment Program of Global Experts (“1000 Talent Plan”) in 2012, received Guangdong Natural Science Funds for Distinguished Young Scholars in 2013, and was accredited the Shenzhen Distinguished Overseas Talent.

Rui Wang received his Bachelor of Engineering Degree in Computer Science & Engineering from the University of Science & Technology of China (USTC) in 2004. In 2008, he obtained his Ph.D. degree in wireless communications at the Hong Kong University of Science & Technology (HKUST). In 2009, he worked as a post-doctoral research associate in HKUST. From 2009 to 2012, he worked as a senior research engineer at Huawei-HKUST Innovation Laboratory, Huawei Technology, Co., Ltd. He was selected by China “National 1000 Talent (Young Investigator)” Program in 2012 and joined the South University of Science & Technology of China (SUSTC) as an associate professor.
Hafiz Malik is an Assistant Professor in the Electrical and Computer Engineering (ECE) Department, The University of Michigan – Dearborn. His research in multimedia forensics and security, wireless sensor networks, steganography/steganalysis, and biometric security is funded by the National Academies and other agencies. He has published more than 40 papers in leading journals, conferences, and workshops. Dr. Malik has been serving as an Associate Editor for the Springer Journal of Signal, Image, and Video Processing (SIVP) since 2013; he is also on the Review Board of the IEEE Technical Committee on Multimedia Communications (MMTC). He organized the Special Track on Doctoral Dissertations in Multimedia at the 6th IEEE International Symposium on Multimedia (ISM) 2006. He is serving on several technical program committees, including the IEEE ICASSP, AVSS, ICME, ICIP, MINES, ISPA, CCNC, and ICC.
