
Snoring classified: The Munich-Passau Snore Sound Corpus

Article in Computers in Biology and Medicine · January 2018


DOI: 10.1016/j.compbiomed.2018.01.007



COMPUTERS IN BIOLOGY AND MEDICINE 1

Snoring Classified:
The Munich Passau Snore Sound Corpus
Christoph Janott, Maximilian Schmitt, Yue Zhang, Kun Qian, Vedhas Pandit, Zixing Zhang,
Clemens Heiser, Winfried Hohenhorst, Michael Herzog, Werner Hemmert, and Björn Schuller

Abstract—OBJECTIVE: Snoring can be excited in different locations within the upper airways during sleep. It was hypothesised that the excitation locations are correlated with distinct acoustic characteristics of the snoring noise. To verify this hypothesis, a database of snore sounds was developed, labelled with the location of sound excitation.
METHODS: Video and audio recordings taken during drug-induced sleep endoscopy (DISE) examinations from three medical centres were semi-automatically screened for snore events, which were subsequently classified by ENT experts into four classes based on the VOTE classification. The resulting dataset, containing 828 snore events from 219 subjects, was split into Train, Development, and Test sets. An SVM classifier was trained using low-level descriptors (LLDs) related to energy, spectral features, mel-frequency cepstral coefficients (MFCC), formants, voicing, harmonic-to-noise ratio (HNR), spectral harmonicity, pitch, and microprosodic features.
RESULTS: An unweighted average recall (UAR) of 55.8% could be achieved using the full set of LLDs including formants. The best-performing subset is the MFCC-related set of LLDs. A strong difference in performance could be observed between the permutations of the train, development, and test partitions, which may be caused by the relatively low number of subjects included in the smaller classes of the strongly unbalanced data set.
CONCLUSION: A database of snoring sounds is presented whose events are classified according to their sound excitation location, based on objective criteria and verifiable video material. With the database, it could be demonstrated that machine classifiers can distinguish different excitation locations of snoring sounds in the upper airway based on acoustic parameters.

Index Terms—Obstructive Sleep Apnea, Primary Snoring, Snore Sound Classification, Machine Learning, Drug-Induced Sleep Endoscopy

C. Janott and W. Hemmert are with the Institute for Medical Engineering, Technische Universität München, Boltzmannstr. 11, 85748 Garching, Germany.
M. Schmitt, V. Pandit, and Z. Zhang are with the Chair of Complex & Intelligent Systems, Universität Passau, Innstr. 43, 94032 Passau, Germany, and with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Eichleitnerstr. 30, 86159 Augsburg, Germany.
Y. Zhang and K. Qian are with the Machine Intelligence & Signal Processing group, MMK, Technische Universität München, Arcisstr. 21, 80333 Munich, Germany.
C. Heiser is with the Department of Otorhinolaryngology/Head and Neck Surgery, Klinikum rechts der Isar, Technische Universität München, Ismaningerstr. 22, 81675 Munich, Germany.
W. Hohenhorst is with the Clinic for ENT Medicine, Head and Neck Surgery, Alfried Krupp Krankenhaus, Alfried-Krupp-Str. 21, 45131 Essen, Germany.
M. Herzog is with the Clinic for ENT Medicine, Head and Neck Surgery, Carl-Thiem-Klinikum, Thiemstr. 111, 03048 Cottbus, Germany.
B. Schuller is with the GLAM - Group on Language, Audio & Music, Imperial College London, 180 Queens Gate, Huxley Bldg., London SW7 2AZ, UK, the Chair of Complex & Intelligent Systems, Universität Passau, Innstr. 43, 94032 Passau, Germany, and the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Eichleitnerstr. 30, 86159 Augsburg, Germany.

I. INTRODUCTION

A. Background

Approximately one out of three adults in the western world snores [1], [2]. Snoring is excited by the inspiratory airflow causing soft tissue structures in the upper airways (UA) to vibrate [3]. Primary snoring (simple snoring) is characterised by the absence of apnoeic or hypopnoeic episodes. In contrast, Obstructive Sleep Apnea (OSA) is characterised by repeated episodes of decreased (hypopnea) or completely halted (apnea) airflow despite an ongoing effort to breathe. The average number of apnoeas and hypopnoeas occurring per hour of sleep is measured by the Apnea-Hypopnea-Index

(AHI), which is a measure for the severity of the OSA syndrome. OSA is a serious health condition affecting 13% of men and 6% of women in the US population [4]. Symptoms associated with OSA include daytime sleepiness, excessive fatigue, and morning headache. It is an independent risk factor for cardiovascular diseases such as hypertension and myocardial infarction [1]. Loud snoring is a typical symptom associated with OSA in more than 80% of patients [5], [6].

While primary snoring does not directly affect the health of the snorer, it can have a negative effect on the sleep structure and quality of life of the bed partner [7]. Further, snoring can be a reason for social disturbance, e.g., when sleeping in dormitories or at camping sites, and it can affect partnerships. It is frequently mentioned as a reason for sleeping in different bedrooms and even cited as unreasonable behaviour in divorce proceedings, although reliable statistics on these aspects are not available.

The standard treatment for OSA is continuous positive airway pressure (CPAP), applied through a mask worn during sleep. While this treatment is highly effective, long-term compliance is only moderate.

Numerous other methods for the treatment of snoring and OSA have been developed, ranging from established conservative measures, such as oral appliances to advance the mandible during sleep [8], to rather unusual methods, such as didgeridoo playing as a means of oral musculature training [9]. Weight loss effectively reduces the severity of snoring and OSA in the majority of overweight patients [10].

Surgical methods to treat snoring and OSA include, for example, tonsillectomy, uvulopalatopharyngoplasty (UPPP), soft palate stiffening, tongue base reduction or tongue base suspension, and hypoglossal nerve stimulation. Some surgical treatment methods are highly effective; others are of limited efficacy, or their evidence is limited [11].

The key to improved success rates of surgical measures is careful patient selection [12]. It is easy to apprehend that, for example, treatments targeting the soft palate prove to be more successful in patients where the reason for the snoring or the OSA lies in the velar area [13], [14] and have less effect when snoring is predominantly generated by the tongue base or the posterior pharyngeal walls [15]. Vice versa, procedures primarily targeting the hypopharyngeal area might not be the first choice of treatment in purely palatal snorers [16], [17]. Therefore, knowledge of the mechanism and location of obstruction and snore sound excitation within the UA can be helpful for targeted interventions.

B. Earlier Research

Extensive research has been carried out on the acoustics of sleep-related breathing disorders since the 1980s. The acoustic properties of snoring sounds are well described; a comprehensive literature analysis has been published in [18]. Acoustic parameters of snoring sounds have been used in clinical trials to objectify the success of surgical snoring interventions and to assess the effectiveness of other diagnostic methods for the prediction of surgical outcomes [19], [20].

Snoring sounds have been assessed for their suitability as diagnostic tools. The majority of the work pursued the goal of distinguishing between primary snoring and OSA of different levels of severity, as well as the detection of apnoeic events, in order to make suitable screening systems available that are based purely or mainly on acoustic information. A literature review of publications on the acoustic identification of the presence and severity of OSA can be found in [21].

Less has been published on the identification of the location of the sound generation. In a literature research by the authors, eight papers have been identified on this subject; for details refer to [21].

Our group has investigated the classification of snore sound excitation locations using machine learning methods. The work was based on a predecessor of the Munich Passau Snore Sound Corpus, comprising snore sounds from 24 subjects labelled according to the simplified VOTE classification. Using a feature set based on the wavelet transform with a support vector machine classifier, an unweighted

average recall (UAR) of more than 70% could be achieved in this four-class problem [22], [23].

Applying an unsupervised feature learning approach that clusters feature values within a given time segment into acoustic words (bags-of-audio-words) based on wavelet features, formants, and MFCC, we could achieve a UAR of almost 80% [24].

The Munich Passau Snore Sound Corpus was first introduced as a sub-challenge in the INTERSPEECH 2017 Computational Paralinguistics Challenge [25]. It is freely available to researchers for scientific purposes.

C. Diagnostic Standards

The gold standard for the diagnosis of OSA is polysomnography (PSG), a multichannel recording of physiological parameters during natural sleep [11]. In most cases, PSG is recorded in a sleep laboratory. Cardiorespiratory screening using portable devices with fewer recording channels is often used alternatively or as an additional measure. In the past years, an increasing number of methods and applications have been researched and developed for sleep monitoring, e.g., [26], [27]. The diagnostic accuracy of these methods, however, has not yet been validated or proven in clinical trials.

PSG and cardiorespiratory screening provide a reliable diagnosis as to the type and severity of the sleep-related breathing disorder. However, they are of very limited use for identifying its underlying mechanisms.

A diagnostic procedure that has been established for the evaluation of obstruction and vibration locations and mechanisms in the UA is Drug-Induced Sleep Endoscopy (DISE). It was developed in the late 1980s and first described by Croft and Pringle in 1991 [28]. In DISE, the patient is put into artificial sleep by means of a titrated application of narcotics. When the patient is in an unconscious state, the UA are intranasally inspected by means of a flexible nasopharyngoscope. Video and audio signals are often recorded for documentation or later investigation.

DISE is increasingly used by sleep surgeons and appreciated as a useful tool to identify the location of vibration and obstruction. However, it has a number of disadvantages: DISE is cost-intensive, as it requires the attendance of personnel and appropriate equipment for the safe administration and monitoring of sedation, as well as endoscopic equipment. Further, it is time-consuming; a DISE investigation typically requires more than 20 minutes overall. Also, it cannot be performed during natural sleep, as the introduction of the endoscope would cause the patient to wake up.

D. Aim

It is therefore of interest to develop alternative methods for the identification of the excitation location of snoring sounds that do not have the mentioned limitations. A possible solution can be the acoustic analysis of snore sounds.

It was hypothesised that different excitation locations of snore sounds are correlated with distinct acoustic characteristics. The snore signal is shaped by a transfer function which depends on the cross-sectional profile of the UA from the excitation location to the nose and mouth opening [29]. The resulting sound is therefore a function of the excited wave and the shape of the upper airway. Different snoring generation mechanisms and related excitation locations go along with typical lengths of the acoustically effective part of the UA, therefore carrying characteristic acoustic properties which allow a classification into defined classes of snoring [3], [30].

In order to test this hypothesis, the Munich Passau Snore Sound Corpus (MPSSC) has been developed.

For the first time, we present a database of snore sounds labelled by their class of excitation location. Annotation of the snore events has been carried out based on simultaneous endoscopic video recordings of the upper airways and is therefore objective and independently verifiable. To our knowledge, no such database is publicly available to date. On this basis, machine learning strategies can be applied to train classifiers to distinguish snore sounds according to their source of excitation. Perspectively, these methods have the potential to complement DISE investigations or even replace them by acoustic

analysis of snore sounds in selected patients, and thus to decrease the physical strain for patients undergoing snoring diagnosis and to reduce healthcare cost.

In contrast to earlier work, we do not aim to distinguish between primary snoring and OSA or to classify OSA severity, but to identify vibration locations, no matter whether the snorer shows obstructive episodes or not.

E. Structure of this Paper

This paper is structured as follows: the process of data collection, audio pre-processing, event selection, classification, and labelling of the data is outlined in chapter II. In chapters III and IV, the resulting properties of the database and our classification experiments are described. Results are summarised in chapter V; discussion and a conclusion follow in chapters VI and VII.

II. MATERIALS AND METHODS

A. Definitions

According to the International Classification of Sleep Disorders (ICSD-3), snoring itself is not a sleep-related breathing disorder, but it can be an isolated symptom or a normal variant of other sleep-related breathing disorders [31].

A definition of snoring based on concrete acoustic parameters, and its delimitation from other nocturnal breathing sounds, does not yet exist [18]. The distinction has so far been based exclusively on the subjective assessment of human listeners. In a study by Rohrmeier et al., in which 25 human listeners were tasked to classify acoustic sequences as respiratory sounds or snoring, 16% of the events could not be assigned clearly [32].

In this work, snoring shall be distinguished from breathing sounds by the existence of predominant tonal components in the resulting sound [33], [34]. Fig. 1 shows typical examples of the time signal of a breathing sound and a snoring sound.

Fig. 1. 300 ms section of the time domain signal of a breathing event (upper diagram) and a velar snoring event (lower diagram). The periodical waves of the tonal components representing the fundamental frequency of the snoring sound can be clearly seen in the lower diagram, whereas the breathing sound in the upper diagram has a predominantly noisy character.

For the sake of consistent nomenclature, in this paper, the individual sound which is produced within one breath is called a snoring event, while a sequence of snoring events (a period of continuous snoring) is referred to as a snoring episode.

B. Data Collection

The database is derived from original endoscopic recordings of DISE examinations. The material is available in mp4 format and contains simultaneous video and audio recordings. The recordings were made during DISE examinations of patients who had undergone previous polysomnography (PSG) and were diagnosed with OSA. DISE was performed as an additional diagnostic measure in these patients for the planning of subsequent surgical interventions, for pressure titration of a continuous positive airway pressure (CPAP) system, or for the fitting of a mandibular advancement device (MAD). The material was obtained from three clinical centres which use DISE examinations as a routine diagnostic method in selected patients:
• Klinikum rechts der Isar (Technical University Munich), Munich, Germany: recordings from 38 subjects taken 2013 through 2014.
• Alfried Krupp Hospital Essen, Germany: recordings from 2090 subjects taken 2006 through 2015.
• University Hospital Halle/Saale, Germany: recordings from 46 subjects taken 2012 through 2015.

TABLE I
RECORDING SETUP AT THE CLINICAL CENTRES

Centre  Recording equipment
Munich  Storz flexible nasopharyngoscope; Storz Telepack X recording system; headset microphone
Essen   Olympus flexible nasopharyngoscope; Rehder/Partner rpSzene recording system; handheld, headset, or forehead-mounted microphone
Halle   Storz flexible nasopharyngoscope; Storz AIDA recording system; stand-mounted microphone

Table I shows the equipment used for the recording of the DISE videos.

As an example, Fig. 2 displays screenshots taken from DISE recordings of typical snoring events. The upper left image (V) shows a vibrating velum at the palatal level. In the upper right image (O), the oropharyngeal level can be seen with vibrating palatine tonsils. In the lower left image (T), the tongue base vibrates against the posterior pharyngeal wall. And the lower right image (E) shows a vibrating epiglottis. The white arrows in the images mark the respective vibrating structures.

Fig. 2. Screenshots taken from DISE video recordings showing palatal snoring (V), oropharyngeal snoring (O), tongue base snoring (T), and epiglottal snoring (E). All screenshots taken from videos of the Essen centre.

C. Pre-Processing

First, the audio signal was extracted from the mp4 files and stored in wav format (16 bit, 44 100 Hz). Subsequently, audio events were identified using an automated algorithm. Octave 3.6.1 with GCC 4.6.2 was used as the programming platform. The absolute value of the signal amplitude was averaged in 10 ms segments with no overlap, and the background noise level was determined by means of a 1024-step histogram averaging 10 s segments. The background noise level was defined as the respective maximum value of the histogram. All segments exceeding a level of two times the determined background noise level for a minimum duration of 300 ms were annotated. Adding 100 ms of signal before and after the actual onset and end of the event, the events were extracted from the original audio file, normalised, and saved as separate wav files (16 bit, 16 000 Hz). Fig. 3 illustrates the segmentation procedure. All described values were experimentally optimised during the algorithm development based on a subset of the DISE audio recordings.

Fig. 3. Illustration of the segmentation procedure based on an example of a 20 second audio signal from a DISE recording. The blue curve shows the amplitude envelope of the snore signal. The horizontal red line is the threshold amplitude of two times the noise level. The green line shows the selected audio segments identified as events of min. 300 ms length.
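The segmentation described in Sec. II-C (originally implemented in Octave) can be sketched in Python roughly as follows. This is an illustrative re-implementation based only on the textual description; details such as the exact histogram windowing are simplified assumptions.

```python
import numpy as np

def segment_events(signal, fs, frame_ms=10, noise_win_s=10,
                   thresh_factor=2.0, min_dur_ms=300, pad_ms=100):
    """Sketch of the event segmentation of Sec. II-C: 10 ms mean-absolute
    envelope, noise floor taken as the mode of a 1024-bin histogram over
    10 s windows, threshold at twice the noise floor, minimum event
    length 300 ms, 100 ms padding on both sides."""
    frame = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame
    # Mean absolute amplitude in non-overlapping 10 ms frames
    env = np.abs(signal[:n_frames * frame]).reshape(n_frames, frame).mean(axis=1)

    # Noise floor per 10 s window: mode of a 1024-bin histogram of the envelope
    frames_per_win = int(noise_win_s * 1000 / frame_ms)
    noise = np.empty(n_frames)
    for start in range(0, n_frames, frames_per_win):
        win = env[start:start + frames_per_win]
        hist, edges = np.histogram(win, bins=1024)
        k = int(np.argmax(hist))
        noise[start:start + frames_per_win] = 0.5 * (edges[k] + edges[k + 1])

    active = env > thresh_factor * noise
    min_frames = min_dur_ms // frame_ms
    pad = int(fs * pad_ms / 1000)

    # Collect runs of active frames lasting at least 300 ms, padded by 100 ms
    events, i = [], 0
    while i < n_frames:
        if active[i]:
            j = i
            while j < n_frames and active[j]:
                j += 1
            if j - i >= min_frames:
                a = max(0, i * frame - pad)
                b = min(len(signal), j * frame + pad)
                events.append((a, b))
            i = j
        else:
            i += 1
    return events
```

Normalisation and resampling of the extracted events to 16 bit / 16 000 Hz wav files, as done for the corpus, would follow as a separate step.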

D. Pre-Selection: Snore and Non-Snore Sound Events

In a next step, an experienced human listener (the first author) listened to all selected events and classified them manually as either pure snoring (snore) or other sounds (non-snore). Also, those events that contained a snore event but were disturbed by non-static background noise, such as speech or acoustic alarm signals from medical equipment, were excluded from the snore group. The same applies to snore events that were overdriven or distorted by disturbances in the recording chain, such as slack joints.

The criteria to include a sound event in the snore group were therefore subjective. A quite rigid standard was applied for a sound to pass as a snore sound. When in doubt, a sound event was rather excluded from the snore group.

A subject's recording was discarded altogether if
• no acoustic event could be extracted from the original recording,
• none of the extracted acoustic events qualified as a snore signal, or
• all of the snore events were polluted by non-static background sounds, overdriven, or distorted.

While the material from Halle/Saale and from Munich was already pre-selected (by MH and CH) for videos containing snore episodes, the material from Essen had not been pre-screened. Therefore, the yield of subjects with snore events from the Essen material was significantly lower than from the other two centres.

In total, snore events from 331 subjects were selected for subsequent annotation (Essen: 266 subjects, Munich: 31 subjects, Halle/Saale: 34 subjects). The total number of snore events was 2 261; the number of snore events per subject ranged from two to 30.

Several algorithms have been described and tested for the automated classification of snore and non-snore sounds [35], [36], [37]. Although some prove high sensitivity and specificity of more than 97% [38], for our database, a 'human' classification process was preferred over automated algorithms for the following reasons: 1. The criteria to define a sound as a snore sound are not standardised by objective acoustic parameters, and none of the snore/non-snore classification algorithms that are known to us refer to commonly accepted selection schemes. 2. From our raw material, snore sounds with overlaid non-stationary disturbing noise needed to be excluded. None of the existing algorithms are described as to their sensitivity and specificity for excluding snore sounds with such artefacts.

E. Classification

Several schemes have been suggested for the classification of the location of snoring noise and obstructions [39], [40], [13], [41]. A widely used scheme is the VOTE classification, introduced by Kezirian et al. in 2011 [42]. The VOTE classification distinguishes four structures that can be involved in airway narrowing and obstruction [43]:
• V, Velum (palate), including the soft palate, uvula, and lateral pharyngeal wall tissue at the level of the velopharynx.
• O, Oropharyngeal lateral walls, including the palatine tonsils and the lateral pharyngeal wall tissues that include muscles and the adjacent parapharyngeal fat pads.
• T, Tongue, including the tongue base and the airway posterior to the tongue base.
• E, Epiglottis, describing folding of the epiglottis due to decreased structural rigidity or due to posterior displacement against the posterior pharyngeal wall.

Fig. 4 illustrates the corresponding locations within the upper airways.

In addition, the VOTE classification contains a description of the shape of obstruction (anteroposterior, lateral, and concentric), and a qualitative assessment of the degree of airway narrowing (no, partial, or complete obstruction). The VOTE classification as introduced by Kezirian et al. is primarily used to describe airway narrowing and obstruction in OSA patients.

For our research, we introduce a simplified version of the VOTE classification in order to describe the location of vibration of the soft tissue generating the snoring noise. We do not distinguish between different levels of airway narrowing. Furthermore, only

events that create vibration of the airway structures are of interest. Therefore, our selection of samples is limited to partial narrowing according to the VOTE classification. Further, we do not distinguish different obstructive patterns. This leads us to a four-class classification described by the labels V, O, T, and E.

Fig. 4. Areas in the upper airways according to the VOTE classification.

F. Annotation to VOTE classes

For all selected sound events, the respective video files were watched by two experienced experts (CH and CJ). Based on the video findings, each snore event was assigned one of the four classes. Segments where both experts were not in agreement as to the correct class were excluded.

Vibration and obstruction in the UA are not always limited to a single level according to our classification. For this database, we excluded events in which vibration was not clearly limited to one of our four defined levels. However, during one DISE examination session, the same subject might show vibration patterns at different levels in different snore events, but limited to one vibration level per event. In this case, snore events were included and labelled accordingly. For example, one subject showed distinct velum-level snoring when the mandible was advanced using an Esmarch maneuver. Without this maneuver, snoring originated from the epiglottis level. Consequently, the database contains both V-type and E-type snoring events from this very subject.

Further, we included only those events where the vibration mechanism could be clearly seen in the DISE video recording. Samples with compromised visibility (e.g., due to saliva on the endoscope tip) were excluded, as were samples in which the video recording showed a different level of the upper airway than the location of excitation at the same point of time (e.g., observing the epiglottis during a suspected velum snore), so that the vibration mechanism could not be visually confirmed.

For the remaining audio events, the corresponding video sections of the DISE video were reviewed, classifying the vibration location according to the simplified VOTE scale. Only clear vibration patterns were selected. Those which were unclear, where multiple vibration sites were simultaneously involved, and snoring events with an obstructive event, were excluded.

From the 331 subjects included in the annotation step, a total of 112 had to be excluded altogether for the following reasons:
• none of the snore events was limited to one of our four defined levels,
• disagreement on the level of vibration between the annotators for all events,
• impaired visibility of the vibration level for all events, or
• obstruction occurred during all snoring events.

Of the remaining 219 subjects, a maximum of six snore events per subject and class were included in the database. If more than six events of the same class were available for one subject, only the first six events were used. Fig. 5 shows a summary of the selection steps taken and the number of subjects per centre included in the database after identification of snore events and after annotation.

In order to verify the reliability of the annotation, a subset of videos from 40 subjects was evaluated independently by an additional annotator (WH). The subset included all 10 subjects that were annotated to the T class, plus 30 randomly selected subjects. There was agreement for all subjects except for one

(Annotator CH: O-type snoring; Annotator WH: probably O-type, but not certain). Based on this sample of 18% of subjects from the total set, the interrater reliability according to Cohen's Kappa is κ = 0.96.¹

Interobserver agreement for the evaluation of DISE videos was studied by Vroegop et al. in 2013 [44]. For the level of collapse, interrater reliability values between κ = 0.48 for the oropharyngeal level and κ = 0.71 for the tongue base level were found for a group of seven experienced ENT surgeons. Although these results are only comparable to a limited extent (Vroegop et al. evaluated collapse instead of vibration, and they used a classification additionally comprising the hypopharynx as a fifth level), it is safe to conclude that the interrater agreement in our study offers a very high level of confidence in the annotation. Reasons for this comparatively good agreement can be that all annotators are highly experienced in the evaluation of DISE recordings, and that events with an unclear level of vibration had already been excluded in a previous step.

¹ Cohen's Kappa was calculated using ReCal2 0.1, dfreelon.org

Fig. 5. Number of subjects per centre included in the database after each data selection step. [The figure depicts the selection pipeline: raw material from DISE recordings (Munich: 38, Essen: 2090, Halle: 46; total: 2174 subjects); after automated identification and separation of audio events, manual snore/non-snore classification, and exclusion of events disturbed by non-static noise: selected snore events (Munich: 31, Essen: 266, Halle: 34; total: 331 subjects); after manual annotation according to VOTE class and exclusion of events with an unclear vibration source or compromised visibility in the video: events classified based on VOTE (Munich: 25, Essen: 164, Halle: 30; total: 219 subjects).]

G. Partitioning

In order to prepare the corpus for machine learning experiments, we stratified the data into a train, a development (dev), and a test partition. In order to create subject-disjunctive partitions, assignment is made based on subject, not event (i.e., all snore events from a subject are assigned to the same partition). To obtain this, we first sorted the subjects by class. Within each class, subjects were sorted by centre, then by gender, and then by age. Using this order, subjects were successively, one by one, assigned to the train, development, and test partitions. Fig. 6 illustrates this process. A two-tailed, unpaired t-test confirmed no significant differences between the partitions for age, gender, centre, or class (p>0.05). Table II shows the resulting number of events per class and partition. Since the number of snore events per subject differs, the partitions contain different numbers of snore events, but an equal number of subjects.

In particular, an even distribution of the data by centre reduces the risk of learning ambient acoustic characteristics instead of snore sound properties. However, of the T-type subjects, seven are from Essen, but only two from Munich, and one from Halle. For this reason, the instances from this class could not be balanced completely evenly by centre between the set splits. This should be considered when interpreting the classification results.

Fig. 6. Process of subject-disjunctive stratification. All snore events from a subject are successively assigned to the respective train, development, or test set.

III. DATABASE PROPERTIES

The resulting database contains audio samples of 828 snore events from 219 subjects.
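The subject-disjoint partitioning described in Sec. II-G can be sketched as follows. This is an illustrative re-implementation; the field names ('id', 'class', 'centre', 'gender', 'age') are assumptions for the sketch, not identifiers from the corpus release.

```python
from itertools import cycle

def partition_subjects(subjects):
    """Sketch of the subject-disjoint stratification of Sec. II-G:
    subjects are sorted by class, then centre, gender, and age, and
    dealt round-robin to the train, development, and test sets, so that
    all events of one subject end up in the same partition."""
    order = sorted(
        subjects,
        key=lambda s: (s["class"], s["centre"], s["gender"], s["age"]))
    parts = {"train": [], "devel": [], "test": []}
    for subj, name in zip(order, cycle(["train", "devel", "test"])):
        parts[name].append(subj["id"])
    return parts
```

Because assignment is per subject rather than per event, the three partitions hold (almost) equal numbers of subjects but, as Table II shows, different numbers of snore events.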

TABLE II TABLE III


N UMBER OF SNORING EVENTS PER CLASS IN THE SET SPLITS N UMBER OF SUBJECTS PER CENTRE AND CLASS

Train Devel Test Σ Centre V O T E


V 168 161 155 484 Munich 14 4 2 5
O 76 75 65 216 Essen 100 46 7 15
T 8 15 16 39
Halle 19 6 1 4
E 30 32 27 89
Σ 282 283 263 828 Total 133 56 10 24

the database are available with a sampling rate of 16 000 Hz and a resolution of 16 bit.

Average sample duration is 1.46 s (range 0.73 ... 2.75 s). Samples from the T-class are significantly shorter than those from the three other classes (p<0.001, see Fig. 7B).² Since the sample duration itself might be a descriptor for the respective class, the differences in sample length are not a sign of inhomogeneity of the database, but rather a noteworthy fact.

Average age of the subjects is 49.8 (range 24 ... 78) years, with no significant difference between the classes (p>0.10), see Fig. 7A. Further, notably, 93.6% of all subjects are male.

Fig. 7. Subjects' metadata per class. A: age per class (in years at the time of the DISE investigation). B: sample duration per class (in seconds per event).

Table III contains the number of subjects per class and centre included in the database. Note that the total over all classes in Table III is 223, whereas the total number of actually included subjects is 219. The reason for this discrepancy is that one subject showed both V and E type snoring, another subject showed both V and O type snoring, and two further subjects showed both V and T type snoring during the DISE investigation. Thus, these four subjects are counted twice.

The number of events and subjects per class in the database is strongly unbalanced, with the majority of samples belonging to the V and O classes (84.5% in total), whereas T and E type snoring samples only account for 4.7% and 10.8%, respectively, of the total number of events. This was to be expected and is in line with earlier findings from DISE evaluations. Hessel et al. described in 2003, based on DISE examinations of 380 patients, that single-level obstructive events at the hypopharyngeal level (thus, T and E type according to our classification) occurred in only 2% of patients, whereas single-level V and O type events occurred in 22% of patients, thus ten times as often [15]. Other researchers come to similar results [45].

It is important to note that certain acoustic properties of the sound samples from the three centres are distinctly different. Firstly, the acoustic characteristics of the room (ambient noise, room acoustics) differ between the three centres. Secondly, different types and models of microphones were used, resulting in differences in the frequency response of the microphone itself, as well as in the position and distance of the microphone relative to the snorer, which in turn can have a significant influence on the signal-to-noise ratio. In Munich, a headset microphone was used; in Halle, a stand-mounted microphone was deployed. In Essen, a handheld microphone, a headset microphone, and a microphone to be fixed on the forehead were available, and the type of microphone used for the audio recordings was chosen according to the preference of the surgeon performing the DISE investigation.

² All probability values calculated with a two-tailed, unpaired t-test.

Fig. 8 shows spectrograms of the background

noise in different recording settings, taken from sections of the DISE recordings in the three centres. The spectrograms show that the background noise characteristics are distinctly different. We performed a machine learning experiment in the same setup as described in the following chapter, but using the centres as classes instead of the snoring noise type. The results show that the centres can be clearly distinguished, with a UAR of 88.0% (mean UAR of all partition permutations, using the INTERSPEECH ComParE baseline feature set plus the formants subset), proving that the snore sounds indeed carry centre-specific information. In order to evaluate the impact on the performance of our classifier setup to distinguish snoring noise, we performed the machine learning experiments exactly as described in the following chapter, but using samples only from Essen, resulting in a slightly worse performance compared to the results including all three centres (53.4% UAR for Essen only vs. 55.8% for all centres, rated by mean UAR of all permutations and using the full ComParE feature set plus the formants subset).

Fig. 8. Background noise frequency spectra for different recording settings. A: Essen, using a handheld microphone. B: Essen, using a headset microphone. C: Halle, using a stand-mounted microphone. D: Munich, using a headset microphone.

These experiments show that centre-specific acoustic properties are not necessarily a weakness of the database, but can be desired for machine learning experiments. Since the task is not about distinguishing between centres, and each of the snore classes contains a balanced number of samples from all three centres, the difference in ambient sound characteristics might actually prevent the machine from learning these features, focus instead on the differences in the actual snoring noise, and thus create an even more stable classifier model. Nevertheless, care should be taken to carefully balance the number of samples from the different centres per class and per partition.

IV. MACHINE LEARNING EXPERIMENTS

A. ComParE Baseline and Challenge Contributions

The data of the MPSSC was introduced as the Snore Sub-Challenge in the INTERSPEECH 2017 Computational Paralinguistics ChallengE (ComParE). In this context, members of our group performed baseline experiments using the official INTERSPEECH ComParE baseline feature set, which includes low-level descriptors (LLDs) related to energy, spectral features, mel frequency cepstral coefficients (MFCC), voicing, harmonic-to-noise ratio (HNR), spectral harmonicity, F0 (pitch), and microprosodic features (jitter and shimmer). In addition to these LLDs, their 1st-order derivatives (deltas) are computed. In a second step, statistics of the LLDs, the so-called functionals, are obtained. They comprise statistical moments of different orders, percentiles, and extrema. An exhaustive list and description of the ComParE feature set is found in [46] and [47]. In addition, we employed a bag-of-audio-words (BoAW) approach as well as an end-to-end learning (e2e) model. The highest unweighted average recall (UAR) of 58.5% could be achieved using the ComParE functionals in combination with a Support Vector Machine (SVM). The e2e learning and BoAW models yielded inferior results. Details can be found in [48].

Seven contributions on classification experiments with the MPSSC were accepted in the context of the ComParE Snore Sub-Challenge.

Tavarez et al. [49] used i-vector representations of MFCCs, constant Q cepstral coefficients (CQCCs), and relative phase shift (RPS) features

obtained at frame level, combined with the music-related pitch class profiles, tonal centroid, and spectral contrast features, as well as suprasegmental spectral statistics, voice quality, and prosodic features, to train a cosine distance classifier on the MPSSC audio data. Late fusion of the MFCC and RPS feature sets obtained the best classification performance of 54.3% UAR on the development set and 50.6% UAR on the test set.

Nwe et al. [50] approached the snore sound classification task by fusing the results of three subsystems by majority voting. The first subsystem consists of a Bhattacharyya-based Gaussian Mixture Model (GMM) supervector in an SVM classifier, using the ComParE baseline set as input features. In the second subsystem, the ComParE baseline feature set is reduced to a subset of 53 out of the originally 6 373 features by a correlation feature selection step, subsequently training a random forest classifier. Thirdly, a convolutional neural network (CNN) is trained based on the log power spectrogram of the snore sound. Fusion of the three models achieved a UAR on the test set of 51.7%, while the Bhattacharyya-GMM-SVM subsystem alone reached a UAR of 52.4%.

A dual source-filter model simulating the acoustic transfer function of the airways was proposed by Rao et al. for feature extraction [51]. The model consists of two all-pole filters resembling the acoustic properties of two consecutive tubes: the first one ranges from the lungs to the obstruction location in the upper airways, whereas the second one models the upper airways from the obstruction location to the lips. The first filter is excited by white noise at lung level, while the second one is excited by periodic impulses at the obstruction level, resembling snoring. Parameters of the two filters are estimated in a multi-step process comprising detection of the snore beat cycle impulse location, construction of two windows to attenuate the effect of source and filter, and estimation of the filter coefficients from the windowed signal. The resulting feature set consists of the filter coefficients and their respective framewise means, variances, and medians, and is used to train SVM classifiers with linear and radial basis function (RBF) kernels, respectively, achieving a UAR of up to 52.8% on the test partition. Interestingly, comparison of the confusion matrices reveals that the classification error between V and E class samples is reduced compared to the ComParE baseline approach. Anatomically, the velum and the epiglottis, representing the excitation locations in the upper airways for these two classes, are farthest apart, which might result in significantly different filter coefficient estimates. On the other hand, V and O class instances are misclassified more often using the source-filter model approach, which can be explained by the close proximity of the velum and the oropharyngeal area, resulting in rather similar filter coefficient estimates.

Gosztolya et al. [52] extracted features at frame level using a feature set first proposed in the INTERSPEECH 2013 ComParE challenge [53], consisting of 39-dimensional MFCCs, voicing probability, harmonics-to-noise ratio, F0, and zero-crossing rate, with their respective 1st and 2nd derivatives, as well as mean and standard deviation over nine neighbouring frames. Further, each instance is divided into 10 equal-sized segments, and each of the above features is averaged within each of the segments. An SVM model was trained with this feature set, and the results were eventually fused with those of a second SVM classifier trained on the original INTERSPEECH 2017 ComParE baseline feature set, achieving a UAR of 64.0% on the test set.

Kaya et al. [54] particularly approached the unbalanced nature of the corpus by proposing a weighting scheme for kernel classifiers. The audio signal is represented by MFCCs and a RASTA perceptual linear prediction (PLP) cepstrum, complemented by the 1st and 2nd order derivatives, resulting in relatively small feature sets with dimensions of 75 and 39, respectively. Both feature sets are then fused and represented in a Fisher vector for the classification task. In parallel, the original baseline openSMILE feature set is applied. For classification, an Extreme Learning Machine (ELM) and a Partial Least Squares (PLS) classifier with linear kernels are used. Using a weight matrix counter-balancing the under-represented classes, a 'Weighted Kernel Extreme Learning Machine'

(WKELM) and a 'Weighted Kernel Partial Least Squares (PLS)' classifier are introduced and their performance compared to the unweighted models by applying a 2-fold cross-validation over the training and development partitions. The weighted classification models clearly outperformed their unweighted counterparts in three of four combinations of feature sets and folds. Notably, distinct differences in performance could be observed between the two folds. Fusing the best four combinations of feature sets and classifiers, a UAR of 64.2% on the test set could be achieved.

The following contributions did not officially participate in the challenge, since some of the co-authors were part of the challenge organisers.

Amiriparian et al. [55] generated feature vectors using deep image CNNs trained with spectrogram plots of the snoring audio data. The feature vectors, with a dimension of 4 096 features, were extracted from the first and second fully connected neuronal layers, respectively, and used to train linear kernel SVMs, achieving UARs of 44.8% on the development set and 67.0% on the test set. Notably, the choice of colour map for the spectrogram plots had a significant impact on the classification performance. Further, best results were achieved extracting the features from the second fully connected neuronal layer of the 'AlexNet' CNN. Fusion of different colour maps and layers did not yield an improvement in classification performance in this model.

Freitag et al. [56] used the same setup of spectrogram-fed CNNs and combined it with an evolutionary feature selection algorithm based on competitive swarm optimisation, which was trained using a wrapper algorithm with a linear SVM. Results show that the UAR increases during the feature selection process until the feature subset reaches a size of about 65% of the original feature set. Little improvement in UAR is achieved when the number of selected features is reduced further. With this approach, a UAR of 57.6% on the development set and 66.5% on the test set could be achieved, using a feature subset containing 55% of the features from the original deep spectrum feature set.

B. ComParE Baseline Feature Subsets

To obtain a more detailed insight into the suitability of acoustic features for the task at hand, we evaluated the performance of the different subsets of the ComParE feature set.

Please refer to Table IV for a description of the features deployed. In addition to the ComParE features (lines 1 through 13), we extracted the frequency and bandwidth of the formants F1-F3 (lines 14 through 19). The number of low-level descriptors for each feature subset is given, as well as the resulting number of features for each feature subset after calculating deltas and functionals.

The feature sets were extracted by the openSMILE feature extraction and audio analysis tool [57], [58]. All experiments were conducted using an SVM with linear kernel. We used the open-source toolkit LIBLINEAR [59]. As solver type, the default configuration (L2-regularised L2-loss support vector classification, dual) was chosen with a bias of 1. For all experiments, the complexity parameter of the SVM was optimised on the development set in the range of 2^-30, 2^-29, ..., 2^0. The complexity providing the maximum UAR was selected and divided by 2 for the training of the final model, fusing the train and dev sets. As both sets have approximately the same size, this bisection of the complexity parameter has proven to be more suitable. Standardisation of all features (to zero mean and unit standard deviation) was employed in an on-line approach, i.e., the parameters (mean and standard deviation) were derived from the train set (the fusion of train and dev set) only and then applied to the dev set (the test set). No resampling of the data was employed in any of the experiments.

In order to average out potential differences in the characteristics of the train, development, and test partitions, we carried out the experiments six times, in all possible permutations of the three partitions.

V. RESULTS

Classification results are shown in Table V for the best-performing feature sets, together with the corresponding number of LLDs and the final number of features after computing the functionals.
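The model-selection protocol described above (complexity optimised on the dev set over 2^-30 ... 2^0, the winning value halved for the final train+dev model, and standardisation parameters taken from the training data only) can be sketched as follows. The dev-set evaluation function here is a hypothetical stand-in; a real run would train a LIBLINEAR SVM and score UAR for every candidate C.

```python
def standardise(train, other):
    """On-line standardisation: mean and standard deviation are taken
    from the training data only and then applied to held-out data."""
    cols = list(zip(*train))
    means = [sum(c) / len(c) for c in cols]
    stds = [max((sum((x - m) ** 2 for x in c) / len(c)) ** 0.5, 1e-12)
            for c, m in zip(cols, means)]
    scale = lambda rows: [[(x - m) / s for x, m, s in zip(r, means, stds)]
                          for r in rows]
    return scale(train), scale(other)

def select_complexity(eval_uar_on_dev):
    """Try C = 2^-30 ... 2^0, keep the value that scores best on the
    dev set, and halve it for the final model trained on train + dev."""
    grid = [2.0 ** e for e in range(-30, 1)]
    best_c = max(grid, key=eval_uar_on_dev)
    return best_c / 2.0

# hypothetical stand-in for "train a linear SVM with this C and score
# UAR on the dev set" (a real run would call LIBLINEAR here)
fake_dev_uar = lambda c: -abs(c - 2.0 ** -5)
final_c = select_complexity(fake_dev_uar)   # 2^-5 wins, halved to 2^-6
```

Deriving the scaling parameters from the training data alone keeps the held-out partition genuinely unseen, mirroring the on-line standardisation used in the experiments.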
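The dual source-filter idea of Rao et al. [51] discussed above — white noise shaped by a lower-tube all-pole filter, plus periodic impulses shaped by an upper-tube all-pole filter — can be sketched as a small synthesis experiment. The filter coefficients, pitch, and amplitudes below are illustrative assumptions, not the parameters estimated in [51].

```python
import random

def allpole(x, a):
    """All-pole (IIR) filter: y[n] = x[n] - sum_k a[k] * y[n-1-k]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * y[n - 1 - k]
        y.append(acc)
    return y

fs = 16000                 # sampling rate of the corpus audio
n = fs // 10               # 100 ms of synthetic signal
rng = random.Random(1)

# source 1: white noise at lung level, shaped by the lower-tube filter
noise = [rng.gauss(0.0, 0.1) for _ in range(n)]
lower = allpole(noise, [-0.9])            # illustrative coefficient

# source 2: periodic impulses at the obstruction (snore beat cycle)
f0 = 100                                  # illustrative snore pitch, Hz
impulses = [1.0 if i % (fs // f0) == 0 else 0.0 for i in range(n)]
upper = allpole(impulses, [-1.6, 0.95])   # illustrative resonant filter

snore = [a + b for a, b in zip(lower, upper)]
```

In the actual model, the coefficients of both filters are estimated from the recorded signal and serve as features; this sketch only illustrates the forward direction of the two-tube, two-source structure.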

TABLE IV
BASELINE FEATURE SUBSETS. #LLDs: NUMBER OF LOW-LEVEL DESCRIPTORS (WITHOUT COEFS AND DELTAS); #Features: NUMBER OF FEATURES INCLUDING COEFS AND DELTAS.

Line  Feature type           #LLDs  #Features  Description
1     audspec                1      100        Sum of the audSpec coefficients
2     audspecRasta           1      100        audspec incl. relative spectral transform (RASTA)
3     pcm_RMSenergy          1      100        RMS energy at 20 ms frame size
4     pcm_zcr                1      100        Zero crossing rate
5     audSpec                26     2600       Perceptual linear predictive cepstral coefficients generated from the mel frequency spectrum
6     pcm_fftMag             15     1500       Spectral energy in two frequency bands plus spectral rolloff, flux, centroid, entropy, variance, skewness, kurtosis, slope, harmonicity, sharpness
7     mfcc                   14     1400       Mel frequency cepstral coefficients 1-14
8     F0final                1      83         Fundamental frequency
9     voicingFinalUnclipped  1      78         Voicing probability estimation
10    jitterLocal            1      78         Difference of period lengths
11    jitterDDP              1      78         Difference of difference of period lengths
12    shimmerLocal           1      78         Amplitude variations (shimmer)
13    logHNR                 1      78         Logarithmic harmonics-to-noise ratio
14    F1frequency            1      78         Frequency of first formant
15    F1bandwidth            1      78         Bandwidth of first formant
16    F2frequency            1      78         Frequency of second formant
17    F2bandwidth            1      78         Bandwidth of second formant
18    F3frequency            1      78         Frequency of third formant
19    F3bandwidth            1      78         Bandwidth of third formant
20    F1-F3                  6      468        Frequency and bandwidth of F1-F3
21    ALL w/o F1-F3          65     6373       Features of lines 1 through 13
22    ALL                    71     6841       Features of lines 1 through 19
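The mfcc subset in Table IV (line 7) builds on the mel frequency scale. The widely used HTK-style mapping is shown below; openSMILE's exact MFCC configuration may differ in detail, so treat the constants and the 14-band layout as illustrative.

```python
from math import log10

def hz_to_mel(f):
    """Common mel-scale mapping (HTK-style variant)."""
    return 2595.0 * log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# centre frequencies of a 14-band filterbank between 0 Hz and 8 kHz
# (the corpus audio is sampled at 16 kHz, so 8 kHz is the Nyquist limit)
lo, hi, nbands = 0.0, 8000.0, 14
mlo, mhi = hz_to_mel(lo), hz_to_mel(hi)
centres = [mel_to_hz(mlo + (mhi - mlo) * (i + 1) / (nbands + 1))
           for i in range(nbands)]
```

Spacing the bands equally on the mel scale concentrates resolution at low frequencies, which is where most of the snore sound energy lies.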

The mean unweighted average recall (UAR) over all permutations of the partitions, as well as the standard deviation between the permutations, are listed. Detailed results can be found in the appendix. ALL w/o F1-F3 with coefficients and deltas (line 7 in Table V) is the full ComParE feature set. ALL shows the results when applying the ComParE feature set plus F1-F3. For comparison, we also list the F1-F3 subset, applying the same coefficients as for jitter and shimmer.

Rated by UAR, the best classification performance could be achieved with the full feature set consisting of the ComParE features plus the formant set F1-F3, including functionals and deltas. The best-performing single subset is mfcc only coef, consisting of MFCC-related LLDs, using functionals but not deltas. Using only formant-related features (F1-F3) yielded inferior classification results. Removing the formants subset from the full feature set results in only a minor deterioration of 0.4% UAR, suggesting that formant frequencies and bandwidths do not provide significant additional information in our experiments.

It is remarkable that the results differ considerably between the permutations. The range of performance between the best and the worst performing permutation is up to 12.5% for the ComParE feature set, and still 7.1% for the full feature set. A comparison of the confusion matrices reveals that the largest differences occur in the two small classes T and E, with ranges of 18% and 28% in class-specific recall, respectively, between the permutations. Performance differences for the large classes V and O are smaller by comparison.

TABLE V
CLASSIFICATION RESULTS.
mean UAR: UNWEIGHTED AVERAGE RECALL, MEAN PERFORMANCE OF ALL PARTITION PERMUTATIONS; Range: RANGE OF RESULTS BETWEEN PARTITION PERMUTATIONS; Delta: COEFS AND/OR DELTAS USED; #LLDs: NUMBER OF LOW-LEVEL DESCRIPTORS; #Features: NUMBER OF FEATURES INCLUDING COEFS AND DELTAS.

Line  Feature type   Delta       #LLDs  #Features  mean UAR  Range
1     mfcc           coef+delta  28     1400       49.9 %    7.0 %
2     mfcc           only coef   14     756        52.9 %    10.6 %
3     mfcc           only delta  14     644        33.4 %    11.4 %
4     F1-F3          coef+delta  12     468        30.6 %    8.8 %
5     F1-F3          only coef   6      234        30.5 %    8.0 %
6     F1-F3          only delta  6      234        29.7 %    8.0 %
7     ALL w/o F1-F3  coef+delta  130    6373       55.4 %    12.5 %
8     ALL w/o F1-F3  only coef   65     3425       54.7 %    10.5 %
9     ALL w/o F1-F3  only delta  65     2948       47.1 %    9.5 %
10    ALL            coef+delta  142    6841       55.8 %    7.1 %
11    ALL            only coef   71     3659       54.7 %    10.1 %
12    ALL            only delta  71     3182       48.0 %    4.7 %
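The UAR reported in Table V and the weighted average recall (WAR) discussed in the text differ only in how the per-class recalls are averaged. A minimal sketch on an illustrative confusion matrix (not one of the matrices from Table VI):

```python
def recalls(cm):
    """Per-class recalls from a confusion matrix (rows = true class)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def uar(cm):
    """Unweighted average recall: every class counts equally."""
    r = recalls(cm)
    return sum(r) / len(r)

def war(cm):
    """Weighted average recall: recalls weighted by class frequency;
    this equals overall accuracy, so the large V/O classes dominate."""
    total = sum(sum(row) for row in cm)
    return sum(row[i] for i, row in enumerate(cm)) / total

# illustrative 4-class matrix in V, O, T, E order (not taken from Table VI)
cm = [
    [90, 20, 5, 5],   # V: 120 events
    [15, 45, 3, 2],   # O:  65 events
    [4, 6, 4, 2],     # T:  16 events
    [3, 4, 1, 24],    # E:  32 events
]
# war(cm) > uar(cm) here because the poorly recognised T class is small
```

Because the rare T and E classes are clinically just as important as V and O, UAR is the appropriate headline metric for this task.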

Table VI shows confusion matrices for all permutations, and Table VII summarizes the mean and range over all permutations of the class-specific recalls. All results are for the best-performing ALL feature set with coef and delta. Class-specific recall results of the four classes for all feature subsets can be found in the appendix.

It can be suspected that these discrepancies are a result of chance, since the number of subjects in both classes is fairly small for a machine learning task. The first author has listened to all T and E events with a 'trained human ear' and found that the snoring of the included subjects does indeed sound distinctively different. At the same time, the 'typical sound' of a tongue base or an epiglottis snorer could be discerned in all samples. This subjective judgment is based on the extensive experience of the first author in the assessment of snore sounds in several projects over a number of years. It is well possible that a machine classifier requires a larger number of samples to deduce the characteristic acoustic features of T and E snoring.

Applying the weighted average recall (WAR) as a performance measure overweights the contribution of the larger classes V and O, thereby reducing the influence of the small classes in question. With a WAR of 65.4%, the combination of all employed features (ALL) with coefs but without deltas shows the best results over all permutations.

VI. DISCUSSION

A. Classification Performance by Class

Marked differences in classification performance can be found between the permutations of the train, dev, and test partitions, mainly caused by the classes T and E. Due to the low number of subjects in these classes, misclassification of only a few events can result in a significant performance difference measured by unweighted average recall (UAR). Still, UAR should be the ultimate measure of performance in this task, as the WAR underrates the performance in the small classes. No matter how small, each of the four classes has equal importance, as a therapy decision for T or E type snorers is distinctively different from that for V or O type snoring, which occurs much more frequently. We expect more stable results with data from a higher number of subjects in the smaller classes, which will become available as data is added to the corpus over time.

B. Performance of Feature Subsets

Snoring and speech have many acoustic similarities: both are generated in the upper airway through vibrations caused by airflow, acoustically

shaped by the frequency transfer function of the upper airway and emitted through mouth and nose. The position of the tongue is of significance for shaping the different phonemes in speech and in the generation of different types of snoring, thus shaping the resulting sound in a characteristic way.

TABLE VI
CONFUSION MATRICES OF ALL PERMUTATIONS FOR THE BEST-PERFORMING FEATURE SET.
Tr: TRAIN PARTITION; De: DEVELOPMENT PARTITION; Te: TEST PARTITION

Tr+De>Te
pred ->  V    O    T    E    Recall
V        92   37   8    18   59.4 %
O        13   44   2    7    67.7 %
T        0    7    4    5    25.0 %
E        0    4    1    22   81.5 %

De+Tr>Te
pred ->  V    O    T    E    Recall
V        94   33   9    19   60.6 %
O        15   39   4    7    60.0 %
T        0    6    5    5    31.3 %
E        0    3    2    22   81.5 %

Tr+Te>De
pred ->  V    O    T    E    Recall
V        119  32   2    8    73.9 %
O        17   49   0    9    65.3 %
T        0    12   2    1    13.3 %
E        10   4    1    17   53.1 %

Te+Tr>De
pred ->  V    O    T    E    Recall
V        118  33   1    9    73.3 %
O        15   49   3    8    65.3 %
T        0    10   4    1    26.7 %
E        7    3    0    22   68.8 %

De+Te>Tr
pred ->  V    O    T    E    Recall
V        107  30   12   19   63.7 %
O        20   43   6    7    56.6 %
T        2    2    2    2    25.0 %
E        1    5    4    20   66.7 %

Te+De>Tr
pred ->  V    O    T    E    Recall
V        115  29   13   11   68.5 %
O        18   44   6    8    57.9 %
T        2    2    2    2    25.0 %
E        2    5    2    21   70.0 %

TABLE VII
MEAN, MINIMUM, MAXIMUM, AND RANGE OF CLASS-SPECIFIC RECALL OF ALL PARTITION PERMUTATIONS.

Class  Mean    Min     Max     Range
V      66.6 %  59.4 %  73.9 %  14.6 %
O      62.1 %  56.6 %  67.7 %  11.1 %
T      24.4 %  13.3 %  31.3 %  17.9 %
E      70.3 %  53.1 %  81.5 %  28.4 %

Acoustic descriptors that have proven effective in speech-related machine learning tasks are therefore likely to be well suited also for the classification of snoring noise. Our findings as well as the results from the ComParE Snore Sub-Challenge contributions underpin this assumption. The presented acoustic tube model of the upper airways [51] has yielded results that are consistent with the underlying anatomy it aims to resemble. MFCC-based features have proven most successful in classification performance in [49], and those models using feature sets based on MFCCs and the PLP cepstrum showed the best results of the challenge [52], [54]. Our own findings when investigating the performance of the INTERSPEECH ComParE feature subsets confirm this: the MFCC subset has shown a superior classification performance compared to all other single subsets. Hence, the descriptors that prove sensitive in the classification task at hand are those representing the spectral properties of the signal, which can be seen as a confirmation of the hypothesis that the upper airway transfer function is characteristic for different excitation locations of snoring sounds.

Formant characteristics have been investigated for their suitability to describe snoring sounds in earlier works. Peng et al. have found a statistically significant difference in the frequency of F2 between snoring generated by the velum versus the lateral pharyngeal walls [60]. Koo et al. looked at obstruction levels in OSAS patients and found significantly higher frequencies for F1 and F2 in snorers with retrolingual obstruction compared to those with retropalatal obstruction [61]. In our experiments, MFCCs have clearly outperformed the subset that

is based on formant characteristics alone, suggesting that formants are indeed descriptive for the excitation location of snore sounds, but inferior to MFCCs.

There are also a number of differences between speech and snoring. In speech generation, the sound is excited in a fixed location, the voice box in the glottis, whereas vowels are formed by the position of tongue, palate, mandible, and lips, altering the cross-sectional profile of the upper airway. At the same time, the total length of the acoustically effective tubes changes only marginally. Snoring, in contrast, can be generated in different locations within the UA, resulting in a variable length of the acoustically effective system for spectral shaping.

While the glottis wave in speech can be altered in pitch and loudness, in healthy speakers it has a characteristic shape. Also, the fundamental frequency range is defined for different speakers (male, female, children), and the melody of speech (so-called pitch) is mainly characterized by the prosodic content. The excitation waveform of snoring sounds, in contrast, can vary widely, and the fundamental frequency can range from as low as 10 Hz to more than 500 Hz. The pitch of a snoring event can vary in many forms.

Novel descriptors derived from those used in speech classification tasks might help to further improve classification outcomes in future snore sound classification experiments.

C. Snoring and OSA

The majority of research in the diagnosis and treatment of sleep related breathing disorders is undertaken with a focus on OSA and obstructive events. In contrast, this database includes sounds of vibration events in the UA without obstructive disposition. The VOTE classification according to Kezirian et al. defines three levels of airway narrowing (no, partial, complete obstruction). It is their observation that snoring usually occurs during a stage of partial narrowing without complete occlusion. In the symptomatic treatment of primary snoring, information is required as to the location of the snoring sound generation in order to allow targeted therapy. To what extent the excitation location of snore sounds and the obstruction sites in OSA correlate is not known and should be the subject of future studies.

D. Drug Induced and Natural Sleep

Our database is based on recordings taken during DISE examinations. It is an ongoing subject of scientific debate to what extent the vibrational and obstructive patterns observed under DISE are similar to those in natural sleep. In our case, however, this question might not be of relevance. The aim of our database is to provide material for the automatic classification of different snore sound excitation locations by means of machine learning methods. We hypothesise that the form of sleep (natural or drug induced) has no significant influence on the acoustic characteristics of snore sounds from different excitation locations. In other words: a velum snore sounds the same, no matter whether generated in natural sleep or during DISE, as long as it stems from the palate level. In turn, there will be characteristic acoustic properties for the different snore sound classes, independent of the type of sleep. Given that this hypothesis is valid, results based on our database material will be transferable to snore sound examinations during natural sleep.

E. Weaknesses and Future Work

Due to the strongly imbalanced nature of the database, the number of actual subjects with T and E type snoring is fairly small, leading to unreliable results using our machine learning models. In order to overcome this weakness, additional subjects with these rare kinds of snoring should be added to the database, which will happen gradually over time.

In addition to the four classes used here, the VOTE classification also describes different occlusion patterns at the velar and the oropharyngeal level. In future experiments, a refined labelling of the data could be undertaken, adding even more classes to the task.

Only clearly identifiable, single-level snoring events have been chosen to be included in the database. Hessel et al. report that single-level obstructions only occur in 35% of patients [15]. With a well-trained classifier, multi-level snoring events

could be added to the data probing the capability of [4] P. E. Peppard, T. Young, J. H. Barnet, M. Palta, E. W.
the classifier models in dealing with this new group Hagen, K. M. Hla, Increased prevalence of sleep-disordered
breathing in adults, American Journal of Epidemiology
of data. 177 (9) (2013) 1006–1014.
[5] K. Whyte, M. Allen, A. Jeffrey, G. Gould, N. Douglas,
VII. C ONCLUSION AND O UTLOOK Clinical features of the sleep apnoea/hypopnoea syndrome,
The Quarterly journal of medicine 72(267) (1989) 659–
For the first time, we present a database of 666.
snoring events that have been classified by the sound excitation location in the upper airways based on objective criteria and verifiable video material from several medical centres. Baseline experiments show that automatic classification models based on the acoustic properties are able to distinguish between snoring excited at the different levels of the upper airway. Adding more subjects to the database, refining the snoring classes, and developing novel descriptors for snoring sound characteristics are areas of future work to further improve the classification performance for different types of snoring, with the perspective of complementing DISE as a diagnostic measure in the targeted treatment of sleep disordered breathing.

CONFLICT OF INTEREST

The corresponding author holds a German patent on a method and system for the determination of anatomical causes of snoring noise (DE102012219128B4). All other authors declare no conflict of interest.

ACKNOWLEDGEMENT

The authors would like to thank all the colleagues involved in the collection of the labelled VOTE snoring sound data. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
