
Speaker Diarization

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology in Computer Science & Engineering

By: Avinash Kumar Pandey 2006CS50213

Under The Guidance of: Prof. K. K. Biswas Department of Computer Science IIT Delhi Email: kkb@cse.iitd.ernet.in

Department of Computer Science Indian Institute of Technology Delhi

Certificate
This is to certify that the thesis titled "Speaker Diarization" being submitted by Avinash Kumar Pandey, Entry Number: 2006CS50213, in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Science & Engineering, Department of Computer Science, Indian Institute of Technology Delhi, is a bona-fide record of the work carried out by him under my supervision. The matter submitted in this dissertation has not been admitted for an award of any other degree anywhere unless explicitly referenced. Signed: ______________________

Prof. K.K. Biswas Department of Computer Science Indian Institute Of Technology, Delhi New Delhi-110016 India

Date: ______________________


Abstract
In this document we describe a speaker diarization system, built as an internal segmentation system that uses text independent speaker identification based on Gaussian mixture models and MFCC feature vectors, aided by a Gaussian mixture model based speech activity detection system. A general speaker diarization (speaker detection and tracking) system consists of four modules: speech activity detection, speaker segmentation, speaker clustering and speaker identification. In an internal segmentation type diarization system, the functions of the segmentation, clustering and identification modules are discharged by a single speaker identification module: segmentation is done after identifying which speaker a particular audio segment belongs to, so clustering happens simultaneously. Speaker identification systems themselves come in various types; a system may impose restrictions on the text the speaker utters in order to be identified, or it may be completely unrestricted. The latter systems are called text independent speaker identification systems. Different speaker identification systems work on different types of feature vectors; text independent speaker identification systems work on low level glottal features of the speaker. The feature vectors thus obtained can be modeled using different statistical models; we have experimented mainly with two, vector quantization and Gaussian mixture models.


Acknowledgements
I would like to acknowledge the guidance of Prof. K. K. Biswas, (Department of Computer Science, IIT Delhi) whose guidance was the corner-stone of this project, without which this project would never have been possible. Thank you for your wonderful support. I would also like to express my gratitude towards Prof. S. K. Gupta and Prof. Saroj Kaushik for their guidance throughout the development of the project. I would like to gratefully acknowledge my debt to other people who have assisted in the project in different ways.

Signed: ______________________

Avinash Kumar Pandey 2006CS50213 Indian Institute Of Technology, Delhi New Delhi-110016 India

Date: ______________________


Contents

Certificate
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
 1.1 Motivation
 1.2 Definition
 1.3 History
  1.3.1 NIST Rich Transcription Framework
 1.4 Applications
 1.5 Outline of the Work
 1.6 Chapter Outline of the Thesis
2 Speaker Diarization
 2.1 Introduction
 2.2 Speech Activity Detection
 2.3 Speaker Segmentation
  2.3.1 Segmentation Using Silence
  2.3.2 Segmentation Using Divergence Measures
  2.3.3 Segmentation Using Frame Level Audio Classification
  2.3.4 Segmentation Using Direction of Arrival
 2.4 Speaker Clustering
 2.5 Speaker Identification
3 Speech Activity Detection and Implementation
 3.1 Introduction
 3.2 Gaussian Mixture Models
 3.3 Our Algorithm for SAD
  3.3.1 Observation
 3.4 Experiments
  3.4.1 Fan Noise
  3.4.2 Silence
  3.4.3 Fast Paced Speech
  3.4.4 Moderately Paced Speech
  3.4.5 White Noise
 3.5 Advantages and Drawbacks
4 Implementation of Text Independent Speaker Identification
 4.1 Introduction
 4.2 Speech Parameterization: Feature Vectors
  4.2.1 MFCC
 4.3 Statistical Modeling
  4.3.1 Vector Quantization
  4.3.2 Gaussian Mixture Models
 4.4 Bayesian Information Criterion
5 Experimental Results
6 Conclusion
Bibliography


List of Figures
Figure 1: Segmentation of an audio clip
Figure 2: Gaussian mixture models
Figure 3: Fan noise
Figure 4: Silence
Figure 5: Fast paced speech
Figure 6: Moderately paced speech
Figure 7: White noise
Figure 8: Training phase of a speaker identification system
Figure 9: Testing phase of a speaker identification system
Figure 10: General schematic to calculate PLP or MF-PLP features
Figure 11: Different modules in computation of cepstral features
Figure 12: Vector quantization


List of Tables

Table 1: Speech activity detection experiment results
Table 2: Speaker diarization experiment results

Chapter 1

Introduction
In this chapter we discuss where the problem of speaker diarization originated, which problem domains it can find application in and how much work has already been done in this area. We also briefly outline the remaining chapters of this thesis.

1.1 Motivation
Recording of speech, for several purposes, has long been in practice. There are many reasons to record one's voice: educational purposes, archival purposes, or preserving a memory through the vicissitudes of time. Recording is more automatic than scribbling down the dialogue, and it is often cheaper than video as well. Across the globe, in countless archives, there exists a huge amount of audio data. We organize our data, as in database tables, through certain keys; in audio clips, as such, we have no key. The idea is to devise an organization for these databases in order to make them easy to handle. One possible way to index audio data is by speaker identity; this thought is the key motivating idea for speaker diarization.

1.2 Definition
In a given audio clip, the task of speaker diarization essentially addresses the question of "who spoke when". The problem involves labeling the entire audio file with beginning and end times s_i and e_i for all homogeneous single-speaker segments. Portions that correspond to non-speech have to be marked explicitly too. For example, a sample output for a one minute audio file could be: 0-3 seconds non-speech, 3-25 seconds speaker 1, 25-37 seconds music, 37-60 seconds speaker 2.

1.3 History
Historically, the problem was first formulated in this way within the National Institute of Standards and Technology (NIST) Rich Transcription framework, better known as the RT framework, in 1999. Since then, until 2007, several evaluations have been held on this problem. Mainly two types of diarization problems were undertaken: 1) broadcast news speaker diarization and 2) meeting or conference room speaker diarization. In the broadcast news framework the recording system is single modal, which is to say there is only one recording device, into which all the speakers take turns to speak. The apparatus is simple, but the accuracy of the diarization system is hampered on this account. In the meeting room domain, audio clips are recorded across multiple distant microphones whose locations are not disclosed. The diarization results for these different clips are then combined to enhance the efficiency of the overall diarization engine. One aspect of multimodal scenarios that leads to enhanced diarization results is the TDOA parameter, the time delay of arrival. With different recording devices, the distance of the speaker from each microphone is bound to change with a speaker change, because two speakers will most probably not occupy the same physical position. This information, the time difference across recording devices, is a very strong tool for speaker segmentation, that is, for identifying speaker turn points. The problem we have undertaken to solve is similar to the broadcast news scenario, where different speakers take turns to speak into a single recording device. The current state of the art in speaker diarization has moved beyond audio: the idea is to record not only audio but also to take visual cues from the speakers and audiences to determine speaker status and change. Its impact has been conclusively shown in "Multimodal speaker diarization of real-world meetings using compressed-domain video features", a paper by Friedland and Hung, 2010.

1.3.1 NIST Rich Transcription Framework


There is a set of related problems undertaken under the NIST Rich Transcription framework, namely: 1) large vocabulary continuous speech recognition (LVCSR), 2) speaker diarization and 3) speech activity detection (SAD). Let us now briefly discuss each of these problems and the success that has been achieved in solving them.

Large Vocabulary Continuous Speech Recognition


This is the most common and important problem in speech processing; LVCSR is simply the technical name for the most general speech recognition system. Earlier, more limited speech recognition engines placed restrictions on the vocabulary, for example the speaker could speak only out of a fixed set of words already known to the engine, or the speaker had to speak at a certain pace, which generally meant a pace slower than normal speech. LVCSR was meant to overcome all these restrictions.

Speaker Diarization
The problem statement of speaker diarization has already been introduced; it will be taken up in fair detail, both theoretically and in terms of implementation, as it is the subject of this thesis.

Speech Activity Detection


Speech activity detection is a sub-problem in most speech processing applications, and speaker diarization is no different. We too develop an algorithm for speech activity detection. In chapter 3 we discuss our algorithm in detail, along with its experimental performance, its advantages and its drawbacks.

1.4 Applications
Speaker diarization has multifarious applications in diverse domains. Some of them follow.
1) Once an audio archive has been indexed by speaker identity, a user can quickly browse through the archive looking only at the speakers of interest, rather than manually searching for them through the entire file.
2) Speaker diarization also plays a vital role in automatic speech recognition. If we do not know the speaker identity and are trying to convert speech to text, the phonetic models we apply are rather generic, but once we know who the speaker is we can migrate to a speaker specific phonetic model, which performs better. The literature reports roughly a 10% improvement in automatic speech recognition when the identity of the speaker is known in advance.
3) If one decides to resort to manual transcription in order to avoid the inherent difficulties and inaccuracies of automatic speech recognition, even then a diary of which speaker started when will come in handy.

1.5 Outline of the Work


We created a system which is capable of producing a diary of a given audio clip, provided it has in its database training samples for all the speakers present in the clip. Our implementation fundamentally consists of three main modules: a speech activity detection (SAD) module, a Bayesian Information Criterion (BIC) module and a text independent speaker identification module. The SAD module, given a speech segment, decides whether it is speech or non-speech; the non-speech categories could be Gaussian noise, background noise, music or, most commonly, silence. The SAD module differentiates speech from these kinds of signals. After this, the Bayesian Information Criterion module narrows down a given audio segment until it is reasonably assured that the segment belongs to a single speaker.

At the end, we have the core speaker identification engine based on MFCC feature vectors. It is of the text independent type: it does not depend on the text uttered by the speaker and can identify a speaker no matter what he speaks. This means our audio clip does not have to be a fixed set of words; our speakers can talk about anything and we will still be able to produce a diary. A minimal sketch of how these modules fit together is given below.
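To make the flow of the three modules concrete, the following is a minimal Python sketch of the overall pipeline. The functions detect_speech, bic_segment and identify_speaker are hypothetical placeholders for the modules described above, not part of any library, and the real system works on MFCC frames rather than raw samples.

def diarize(audio, sample_rate, speaker_models):
    """Produce a diary: a list of (start_sec, end_sec, label) entries.

    detect_speech, bic_segment and identify_speaker are hypothetical
    stand-ins for the SAD, BIC and speaker identification modules.
    """
    diary = []
    # 1. Speech activity detection: keep only regions judged to be speech.
    for start, end in detect_speech(audio, sample_rate):
        # 2. BIC-based narrowing: split the speech region into sub-segments
        #    that are assumed to contain a single speaker each.
        for seg_start, seg_end in bic_segment(audio, sample_rate, start, end):
            # 3. Text independent speaker identification of the sub-segment.
            label = identify_speaker(audio, sample_rate,
                                     seg_start, seg_end, speaker_models)
            diary.append((seg_start, seg_end, label))
    return diary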

1.6 Chapter outline of the thesis


In the remaining chapters we first develop the idea of speaker diarization and then discuss the details associated with the different modules chapter by chapter. In chapter 2 we discuss, one by one, the theoretical ideas behind, the advances made in and the techniques popular in each of the areas of speaker segmentation, speaker clustering and speaker identification. In chapter 3 we discuss our implementation of the speech activity detection algorithm. In chapter 4 we discuss text independent speaker identification in general and our implementation of it, together with a brief treatment of the Bayesian Information Criterion. In chapter 5 we furnish results on the performance of our speaker diarization system and in chapter 6 we conclude the thesis.

Chapter 2

Speaker Diarization
2.1 Introduction
Speaker diarization, as Anguera and Wooters contended in "Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System", can, for the twin purposes of abstraction and modularity, be divided into four stages:
1) Speech activity detection
2) Speaker segmentation
3) Speaker clustering
4) Speaker identification

2.2 Speech activity detection


Speech activity detection is the first step in almost every speech processing application, the primary reason being the highly computationally intensive nature of the subsequent processing methods: one would not want to waste resources computing over parts of the audio clip that are not speech. Suppose there is a meeting room in our computer science department with a far field microphone installed that is always alert, that is to say, it always keeps recording; we do not want to set things up every time we enter the meeting room for discussions. After a certain period of time, say 24 hours, we take the output of the recorder and want to filter out the portions that are non-speech. They could be any kind of noise, music or the sound of people passing by in the gallery. This task is accomplished by the speech activity detection module. The problem of speech activity detection also goes by the name of voice activity detection. There are various possible ways to determine whether an incoming clip is speech or not. One of them is cepstral analysis, as investigated by Haigh and Mason in "A Voice Activity Detector Based on Cepstral Analysis", 1993. Later in this thesis we discuss in detail what we mean by cepstral features; for now it is sufficient to know that these are values extracted from a short, quasi-stationary portion of the audio clip.

The idea proposed by Haigh and Mason is to detect the speech end-points, that is, the points where a speech portion begins and ends, by cepstral analysis. This approach is based on static, explicit modeling of speech and non-speech: we have to train a binary classifier to differentiate speech from non-speech based on feature vectors extracted from the clip, which the authors suggest should be cepstral features. As the reader will see later in the thesis, these are the same feature vectors we use for our text independent speaker identification engine. The model used could be anything from a Gaussian mixture model to a support vector machine. This is one way to tell speech from non-speech. The earliest algorithms for voice activity detection used two common speech parameters to decide whether there is voice in a speech frame: 1) short term energy and 2) zero crossing rate. The short term energy of an audio frame can be obtained from the log energy coefficient, the 0th coefficient in the MFCC feature vector set. The zero-crossing rate is the rate of sign changes along a signal, i.e. the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval, and is defined formally as

$$ \mathrm{zcr} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\{ s_t \, s_{t-1} < 0 \}, $$

where $s$ is a signal of length $T$ and the indicator function $\mathbb{1}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise.

As can be understood, the use of these two parameters alone can still lead to error. There can be cases with high short term energy in a frame that is still not speech, for example Gaussian noise whose intensity peaks every few seconds. The same can be said for the zero crossing rate: though uncommon, there can be noises in which the change from positive to negative happens very frequently, deceiving the system. To overcome these difficulties we have come up with a new algorithm for speech activity detection, described in detail in chapter 3; a small sketch of the two classical parameters is given below.
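As an illustration of these two classical parameters, here is a small sketch (assuming NumPy) that computes the short term log energy and the zero-crossing rate of a single frame; the thresholds for calling a frame speech would have to be tuned empirically and are not shown.

import numpy as np

def frame_energy_and_zcr(frame):
    """Short term log energy and zero-crossing rate of one audio frame."""
    # log energy: log of the sum of squared samples (small floor avoids log(0))
    log_energy = np.log(np.sum(frame ** 2) + 1e-12)
    # zero-crossing rate: fraction of consecutive samples whose signs differ
    signs = np.sign(frame)
    zcr = np.mean(signs[1:] != signs[:-1])
    return log_energy, zcr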

2.3 Speaker segmentation


Given an audio clip, speaker segmentation is the task of finding the speaker turn points. Our segmenter is supposed to divide the audio clip into non-overlapping segments such that each segment contains speech from only one speaker. Non-speech segments, we assume, have already been filtered out by our speech activity detection module. The idea is better illustrated through the following figure.

Figure 1: Speaker segmentation

The figure above is the amplitude graph of an audio clip in which the number of speakers exceeds one. We begin with one speaker putting forth his point; in between he is interrupted by another speaker and there is a region of overlapping speech from which the second speaker takes over. This overlapping speech region is an instance of a turn point. There is a slight abuse of the term point here, as we are calling a whole overlap region a point. We have to identify the positions in the audio clip where the handover of speakers takes place; the single-speaker portions between them will, from now on, be called homogeneous speaker segments. There are various ways to go about solving the problem of speaker segmentation:
- Segmentation using silence.
- Segmentation using divergence measures.
- Segmentation (and clustering) by performing frame level audio classification.
- Segmentation (and clustering) using an HMM decoder.
- Segmentation (and clustering) using direction of arrival.
The last three methods are unified segmentation and clustering methods, at times also called internal segmentation methods. The first two are called external segmentation methods because the ascertaining of identity follows the merging of segments into clusters, as opposed to internal segmentation methods, where for every frame you first find out who the speaker is and thereby obtain the cluster without bothering about the segment at all. We now discuss each of these methods briefly.

2.3.1 Segmentation Using Silence


Segmentation using silence is a common sense method based on the assumption that whenever a speaker change happens there must be a portion of silence in between. This, however, does not hold in all environments; in a parliament, for example, speaker changes almost inevitably happen by one speaker forcing his way into another speaker's speech. A speaker change did happen, but there was no intervening silence, so we run the risk of losing some speaker turn points. Besides, this method is plagued by another difficulty. If we observe the amplitude graph of any speech file closely, we notice that a speaker does not just keep speaking all the time; he stops frequently, or the tonality of his voice drops to silence, so even a continuous speaker segment contains many intermediate points of silence. What this means for our purpose is that while we miss some true speaker turn points, we also generate too many unnecessary segments, and the greater the number of segments, the greater the difficulty in clustering them. So we can see why segmenting speech using silence is not such a good idea, mainly for two reasons: 1) it misses some true speaker turn points with overlapping speech, and 2) it generates a much higher number of segments, that is, false positives.

2.3.2 Segmentation Using Divergence Measures


Delacourt and Wellekens showed in "DISTBIC: A speaker-based segmentation for audio data indexing" that using divergence measures for speaker segmentation can be useful; they used the Bayesian Information Criterion as the divergence measure. Segmentation using divergence measures is the state of the art, so let us first discuss what a divergence measure is and which divergence measures one can use. A divergence measure is fundamentally a tool to determine how similar or dissimilar two things are; in our case those could be two successive audio frames or two successive windows of audio frames. Two well known divergence measures are the Kullback-Leibler divergence and the Bayesian Information Criterion. We have used the Bayesian Information Criterion in our implementation, so we discuss it in detail in chapter 4; here we discuss the Kullback-Leibler divergence.


In probability theory and information theory, the Kullback-Leibler divergence (also information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between two probability distributions P and Q. It measures the expected number of extra bits required to code samples from P when using a code based on Q rather than a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q represents a theory, model, description, or approximation of P. Although it is often intuited as a distance metric, the KL divergence is not a true metric; for example, it is not symmetric: the KL divergence from P to Q is generally not the same as that from Q to P. For probability distributions P and Q of a discrete random variable, their KL divergence is defined to be

$$ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. $$

In words, it is the average of the logarithmic difference between the probabilities P and Q, where the average is taken using the probabilities P. The KL divergence is only defined if P and Q both sum to 1 and if Q(i) > 0 for any i such that P(i) > 0. If the quantity 0 log 0 appears in the formula, it is interpreted as zero. Now that we know what a divergence measure is, we can proceed with our discussion of segmentation using divergence measures. We consider two windows and calculate their similarity or dissimilarity with the chosen divergence measure; if the dissimilarity is above a particular threshold, determined by empirical experimentation, we call that point a speaker turn point, as it separates the audio into two windows which are different from each other. A small numerical example of the KL divergence follows.
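As a small worked example of the formula above (assuming NumPy), the following computes the discrete KL divergence; note how the two orderings of the arguments give different values, illustrating the asymmetry.

import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions; assumes q(i) > 0 wherever p(i) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(i) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values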

2.3.3 Segmentation Using Frame Level Audio Classification


This is an example of an internal segmentation strategy. Frame level audio classification means that for every frame of the audio we determine what kind of audio it is: is it speech, is it music, and if it is speech, which speaker in our database does it come from? This is more or less the strategy we follow in our implementation, except that instead of looking at individual frames we look at a window of frames which has been determined to contain speech from a single speaker. During our experimentation with various statistical models we observed that looking at individual frames rarely leads to the right answer; we have to look at a certain length of audio, a collection of frames, and see which speaker the majority of points belong to. The criterion for belonging may be proximity to a particular codebook code vector, or a log-likelihood computation as in Gaussian mixture models.

2.3.4 Segmentation Using Direction of Arrival (DOA)


In a multimodal corpus where we have multiple distant microphones to record speech, two parameters assume overwhelming significance in speaker diarization or speaker identification: direction of arrival and time delay of arrival. The process has been discussed at length in "Real-time monitoring of participants' interaction in a meeting using audio-visual sensors" by Busso and Narayanan, 2008. The technique is called acoustic source localization and has been used widely in radar and sonar. The speech is recorded in a smart room containing a microphone array, which is used for acoustic source localization. The approach is based on TDOA, the time delay of arrival at the various microphones, from which a geometric inference of the source location is calculated. First, pair-wise delays are estimated between all the microphones; these delays are subsequently projected as angles into a single axis system.

2.4 Speaker clustering


Speaker clustering is the next step after speaker segmentation. Let us rewind a little. We were first given an audio clip from which, using the speech activity detection module, we filtered out the non-speech parts. Next, we used our speaker segmentation tool to generate homogeneous speaker segments. Now, as one can observe, the segments belonging to a single speaker can be strewn across the clip, and we have to label them as coming from a single source. So we proceed to speaker clustering. We build a similarity matrix over the segments in the segment list, using a distance metric to calculate the distance between segments. The segments which match best are merged, and the identities of the clusters are updated incrementally. If we already know the number of speakers present, we can choose to keep as many clusters as there are speakers; otherwise, we choose a stopping criterion, decided through experimentation, and when that criterion is met we stop merging segments. The number of clusters present at that point is taken to be the number of speakers in the audio clip. A sketch of this merging loop is given after the following list of popular approaches to speaker clustering:
- Clustering using vector quantization.
- Clustering using iterative model training and classification.
- Clustering in a hierarchical manner using divergence measures.
- Clustering and segmentation using an HMM decoder.
- Clustering and segmentation using direction of arrival.
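The merging loop can be sketched as follows; this is a simplified illustration rather than the exact system, segment_distance is a placeholder for whichever divergence measure is used (for example a BIC-based distance between two groups of segments), and the stopping criterion here is simply a known number of speakers.

def agglomerative_cluster(segments, segment_distance, n_speakers):
    """Greedy bottom-up clustering: repeatedly merge the two closest clusters
    until only n_speakers clusters remain."""
    clusters = [[seg] for seg in segments]          # start with one cluster per segment
    while len(clusters) > n_speakers:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = segment_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters

When the number of speakers is unknown, the while-condition is replaced by an empirically tuned stopping criterion on the smallest inter-cluster distance.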

2.5 Speaker Identification

The last step is speaker identification. First the training part:
1) We divided every second of the training clip into approximately 173 frames and calculated MFCC features for each frame.
2) Each MFCC feature vector consisted of 13 dimensions.
3) We used the set of these feature vectors to generate a codebook for every speaker using vector quantization.
4) The codebook of a speaker consisted of the 16 feature vectors which best modeled the set of feature vectors obtained for the clip.
Now we move on to the testing part:
1) We again divided the test audio clip into frames, 173 per second, and computed the feature vectors per frame.
2) We matched each feature vector thus computed against the individual codebooks.
3) The codebook with the maximum number of matches is declared to correspond to the identity of our test speaker.
A minimal code sketch of these steps follows.
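The following is a minimal sketch of the training and testing steps above, assuming librosa for MFCC extraction and scikit-learn's KMeans for codebook generation; the file paths are illustrative and the exact framing parameters of the real system are not reproduced.

import librosa
import numpy as np
from sklearn.cluster import KMeans


def train_codebook(wav_path, n_codevectors=16):
    """Build a 16-vector codebook from the 13-dimensional MFCC frames of a training clip."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T        # (frames, 13)
    return KMeans(n_clusters=n_codevectors, n_init=10).fit(mfcc).cluster_centers_


def identify(wav_path, codebooks):
    """Vote per test frame for the codebook containing the nearest code vector."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    speakers = list(codebooks)
    # per-frame distance to the nearest code vector of each speaker's codebook
    dists = np.stack([
        np.linalg.norm(mfcc[:, None, :] - codebooks[s][None, :, :], axis=2).min(axis=1)
        for s in speakers
    ], axis=1)                                                  # (frames, n_speakers)
    votes = np.bincount(dists.argmin(axis=1), minlength=len(speakers))
    return speakers[int(votes.argmax())]

# usage with illustrative file names:
# codebooks = {'speaker1': train_codebook('speaker1_train.wav'),
#              'speaker2': train_codebook('speaker2_train.wav')}
# print(identify('test.wav', codebooks))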


Chapter 3

Speech Activity Detection and Implementation


3.1 Introduction
Speech activity detection involves separating speech from:
1) Silence
2) Background noise in different ambiences
3) White Gaussian noise
4) Music
5) Crowd noise

The majority of algorithms used for speech activity detection fall into two categories:
1) The noise level is estimated after looking at the entire file, and anything over and above a particular decibel level is called speech.
2) Speech and non-speech training sets are taken according to the ambience, models are trained on them, and these models are then used for further classification.
Both kinds of algorithms suffer from serious drawbacks. Algorithms in the first category fail to discriminate speech from non-speech when the noise is variable, since they assume the same noise characteristics throughout the audio. Algorithms in the second category fail to perform in an unfamiliar environment and need a lot of training data. The approach we propose here overcomes both these hurdles, though it has drawbacks of its own. We want a speech detection algorithm which:
- Does not require any training at all.
- Is, nevertheless, able to grasp the difference between speech and non-speech in a dynamically changing ambience.

Since we use Gaussian mixture models in our approach, a brief review of what Gaussian mixture models are is in order.

3.2 Gaussian Mixture Models

Many times the distribution of our data cannot be accurately modeled by a single multivariate distribution; the sample points might come from two different Gaussian distributions. In that case, rather than modeling the dataset with a single multivariate distribution, it is better to model the data as a mixture of two Gaussians, where each Gaussian accounts for a certain fraction of the point set, called the mixing proportion of that Gaussian in the mixture model. The mean and covariance matrix of each Gaussian can be completely independent, or they can be constrained as the case requires; in our case we let them be completely general. The mixture density is written out formally below.
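Formally (standard notation, written out here for completeness), a K-component Gaussian mixture models a feature vector $x$ with the density

$$ p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x;\, \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} w_k = 1, $$

where the $w_k$ are the mixing proportions and each component has its own mean $\mu_k$ and covariance matrix $\Sigma_k$.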

Figure 2: Gaussian mixture models. The diagram is a case where two individual Gaussians with different means and covariances are present.


3.3 Our algorithm for Speech activity detection


The algorithm consists of three simple steps:
- Divide the entire audio clip into fixed-size intervals of 15 seconds.
- Extract MFCC features from each of these intervals.
- Cluster the feature vectors obtained for each interval individually, using a Gaussian mixture model with two components.

3.3.1 Observation
In the case of silence, repeated beats, fan noise and white Gaussian noise, the preponderance of points belongs to one of the clusters: the mixing proportion remains highly skewed. For speech frames, on the other hand, the mixing proportion stays even. This attribute of speech frames becomes the basis of our classification strategy. We want to narrow down the number of segments which could be speech; it is a negative classification strategy. Whatever is clustered evenly remains a candidate for speech, while silence, instrument noises, repeated beats and white noise can be filtered out. This effectively narrows down our search space by a significant factor. A minimal sketch of the procedure follows.
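The following is a minimal sketch of this procedure, assuming librosa for MFCC extraction and scikit-learn for the two-component GMM; the 0.9 skew threshold is an illustrative value, not the exact one used in our experiments.

import librosa
from sklearn.mixture import GaussianMixture


def speech_candidates(wav_path, interval_sec=15, skew_threshold=0.9):
    """Mark each fixed-length interval as a speech candidate when the mixing
    proportions of a 2-component GMM over its MFCC frames are not too skewed."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = int(interval_sec * sr)
    labels = []
    for start in range(0, len(y), hop):
        chunk = y[start:start + hop]
        if len(chunk) < sr:                       # skip a very short trailing stub
            break
        mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13).T
        gmm = GaussianMixture(n_components=2, covariance_type='full',
                              max_iter=100).fit(mfcc)
        skewed = gmm.weights_.max() > skew_threshold
        labels.append((start / sr, 'non-speech candidate' if skewed
                                   else 'speech candidate'))
    return labels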

3.4 Experiments

3.4.1 Experiment 1: Fan Noise


Figure 3: Fan noise

The picture above is of an audio clip consisting solely of fan noise. A set of 20 different fan noise samples was taken; the clustering proportion was always over 95% for one of the clusters. This shows that our GMM captured the fact that the preponderance of the data was coming from one source.

3.4.2 Experiment 2: Silence

Figure 4: Silence

The picture above is of a clip that is supposed to consist solely of silence, but there is no such thing as pure silence, just as there is no pure speech, so we can say this clip consists solely of ambient sound and contains no speech at all. Samples were taken with the following specifics:
- Quiet ambience.
- Silence periods larger than 10 seconds.
- A set of fifty different samples.
- The clustering proportion for the larger of the clusters was always above 90%.
Speech samples were taken in two modes:
- Fast paced speech, something like a rap song.
- Moderately paced speech, the manner in which people communicate in a meeting.

3.4.3 Experiment 3: Fast Paced Speech

Figure 5: Fast paced speech
- Over 50 samples were taken.
- The mixing proportion varied from 50:50 to 75:25.

3.4.4 Experiment 4: Moderately Paced Speech

Figure 6: Moderately paced speech
- 50 samples were taken.
- The range of mixing proportions was rather diverse, lying between 50:50 and 80:20 at times.
- Still good enough to contrast with silence.


3.4.5 Experiment 5: White Noise

Figure 7: White noise
- Software generated white noise; it might appear in our audio due to a malfunctioning recording device or network channel noise.
- The BIC value of the clustering always comes out negative.
- The algorithm often fails to converge for two components within 100 iterations.
- When it does converge, the mixing proportion is highly skewed.

Table 1: Speech activity detection experiment results (summary of the experiments above).


3.5 Advantages and Drawbacks

Advantages


- Does not require any training at all, so it can be applied even in environments where training based approaches would fail for lack of training data.
- Since it treats every segment independently, it can adjust itself dynamically to changing ambience properties.

Drawbacks

- This algorithm is computationally intensive.
- We have to calculate the feature vectors, cluster them and find the BIC for every single frame in the file.
- It works well for relatively long periods of silence, but is not as robust for shorter segments.


Chapter 4

Implementation of Text Independent Speaker Identification


4.1 Introduction
Our approach to handling the three modules other than speech activity detection can be called internal segmentation and clustering. We do not first segment and then cluster the audio clip; instead we take a portion of the clip and try to identify which speaker it belongs to. If there is confusion, that is, if the statistics revealed by our testing algorithms are not conclusive, we narrow the portion down further. We thus determine a minimum duration within which a speaker change does not happen, and within it we establish the identity of the speaker. We can also take this duration to be 1 second and then determine individually, for every 1-second segment, which speaker it belongs to. For doing so we have two tools at our disposal:
1) a text independent speaker identification engine, and
2) the Bayesian Information Criterion (BIC).

Speaker identification systems are primarily of two types, text independent and text dependent. The names are fairly self-explanatory: text independent speaker identification systems work for any test utterance, while text dependent systems work only with a fixed test utterance. The two kinds of systems classify their training data based on widely different sets of speech parameters, resulting in different accuracies under similar circumstances; accordingly, they find applications in different problem domains. An abstract layout of a speaker identification/verification engine, consisting of its various modules and stages, is shown in the following figures: Figure 8 shows the training phase and Figure 9 the testing phase.


The training phase of a general speaker identification/verification system consists of two fundamental modules, a speech parameterization module and a statistical modeling module. The raw speech data received is first processed to extract useful characteristic information through the speech parameterization module. These parameters are collected, usually at different points in the time and frequency domains, to mark the variations in the speech that are characteristic of a particular speaker. The speech parameters thus obtained are then fitted to a statistical model of choice by the statistical modeling module, in order to calculate the defining parameters of the model for that particular speaker, for example the mean and variance in a single Gaussian model. The choice of statistical model can vary; we have experimented with two models, codebooks (vector quantization) and Gaussian mixture models. The state of the art systems in speaker identification employ Gaussian mixture models with great success.

The testing phase is preceded by the collection of training samples of all the speakers in our universe; those training samples are converted into corresponding speaker models and, if the statistical modeling demands it, a universal model is formed from the training samples. Now, when we are given a test utterance, we calculate the speech parameters using the speech parameterization module and then use a decision scoring module to decide which of the available speaker models these parameters match best. The choice of decision scoring module depends on the statistical model we use. With vector quantization it is simply the number of points closest to each codebook: the codebook with the maximum number of test points closest to it, under the Euclidean distance metric, gives the output identity. When we use Gaussian mixture models as our statistical modeling tool, the log-likelihood becomes the decision score: we calculate the log-likelihood of each feature vector obtained from the test utterance and see which speaker model it corresponds to best.

So our discussion so far has established that there are fundamentally three variables in a speaker identification system:
1) Speech parameterization
2) Statistical modeling
3) Decision scoring


We will take them all one by one now, first giving the theoretical details, and then the experimental results.

Figure 8: Different modules in the training phase of a speaker identification system (input → speech parameterization module → speech parameters → statistical modeling → model).

Figure 9: Different modules in the testing phase of a speaker identification system (speech data from a given speaker → speech parameterization module → speech parameters → decision scoring against the speaker models from the database → identity).


4.2 Speech parameterization: Feature Vectors


The speech parameterization module calculates, or extracts, useful speech parameters from a raw audio clip. The popular term for the parameters thus obtained is feature vectors. The most widely used feature vectors come from a particular class called cepstral features. We will discuss briefly what exactly we mean by cepstral features and then give the specifics of the feature vector set we are using.

Cepstral features based on filter-banks. The entire process of calculating filter-bank based cepstral features is shown schematically, module by module, in Figures 10 and 11.

Figure 10: General schematic for the calculation of PLP or MF-PLP features.

Figure 11: Different modules employed in the calculation of cepstral (MFCC) features: input → pre-emphasis → windowing → FFT → filter-bank → 20·log → cepstral transform.

Now we discuss each of these modules and their relevance to our work.
1) Pre-emphasis: Emphasis is laid on a particular section of the speech signal, namely the higher frequency range of the spectrum. It is believed that the nature of speech production attenuates the higher frequencies, inducing a need to pre-emphasize the signal to make up for the loss in the production process. In the cases we studied, hardly any benefit was accrued by using pre-emphasis, so we have done away with this module.
2) Windowing: This is a crucial phase in the calculation of feature vectors. We make a stationarity assumption, which means that if we consider a small enough window of the speech signal, there will be no significant variation in the values of the feature vectors across that window. We select a window beginning at the start of the speech signal, in our case of 20 ms, then shift the starting position of the moving window by 10 ms and consider the next window of length 20 ms, so every two consecutive windows have 10 ms in common. The choice of window is again guided by experimental evidence; we went with a Hamming window, other options being the Hanning or rectangular windows.
3) FFT: The next step is simply calculating the fast Fourier transform of the windowed (and possibly pre-emphasized) signal.
4) Filter-bank: The spectrum obtained after applying the FFT still contains a lot of unnecessary detail and fluctuation that we are not interested in. In order to obtain the features we really want, we multiply the spectrum by a filter-bank, which is nothing but a collection of band pass frequency filters. In essence we filter out all the unnecessary information and keep only the frequencies that concern us; the knowledge of these particular frequencies comes from our knowledge of the process of speech production. The spectral feature set MFCC receives its name from its filter-bank, the mel scale frequency filter-bank; the mel scale is an auditory scale similar to the frequency response of the human ear.
5) Discrete cosine transform: An additional transform, which in generic terms we have called the cepstral transform, is applied to the result of the filter-bank operations; in our case it is the discrete cosine transform, and it yields the final cepstral feature vectors that are of interest to us.

Two other important features are the log energy and the delta (frame-to-frame difference) of the log energy. MFCC is a 13 dimensional feature set whose first coefficient is the log energy; we incorporated the difference of the log energy as well into our feature set, which resulted in significantly improved recognition rates. In effect we have incorporated the deltas corresponding to all 13 feature vector dimensions.

There are a number of popular feature vector sets that can be extracted from an audio clip, and different sets capture different properties of the clip. The most widely used ones are:
1) MFCC, mel frequency cepstral coefficients
2) RASTA-PLP
3) LPC, linear predictive coding
The feature vector set we are using is MFCC together with its delta set; the delta set consists of the differentials of the feature vectors, that is, it captures how the values of the MFCC feature vectors vary over time. This information, as it turns out, is also vital to characterizing a speaker.


4.2.1 MFCC
The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum of a spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the response of the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound. MFCCs are derived as follows:
1. Take the Fourier transform of a windowed excerpt of the signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
A sketch of this computation using a standard toolkit follows.
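These steps are implemented in common toolkits. The following sketch, assuming librosa, extracts 13 MFCCs per frame with 20 ms windows and a 10 ms shift and appends their first-order deltas, mirroring the feature set described above; it is an illustration rather than the exact configuration used in our system.

import librosa
import numpy as np


def mfcc_with_deltas(wav_path, n_mfcc=13):
    """Return a (frames, 2 * n_mfcc) array: MFCCs plus their first-order deltas."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.020 * sr),        # 20 ms analysis window
                                hop_length=int(0.010 * sr))   # 10 ms shift
    delta = librosa.feature.delta(mfcc)                       # frame-to-frame differentials
    return np.vstack([mfcc, delta]).T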


4.3 Statistical Modeling

Now that we have our 23 dimensional feature vectors, 173 of them for every second of the audio clip, we need to fit this data to a statistical model of choice; we want a concise representation for making sense of the data. The two most widely used statistical models are 1) vector quantization (codebook quantization) and 2) Gaussian mixture models. We will discuss the theoretical details of both, one by one, and then detail the experimental results obtained with each.


4.3.1 Vector Quantization


What is a statistical model supposed to do? Once we have extracted a number of points from the training audio clip in the 23 dimensional MFCC feature space, what do we do with these points, and more importantly, given a feature vector from a testing clip, how do we decide the odds of this particular feature vector belonging to any of the speakers in our universe, the database? These are the questions that necessitate the statistical modeling module in a speaker identification system. A speaker identification system should generate for every speaker a model that stores the attributes of that particular speaker, the probabilistic distribution of his feature space. It is not practical to store each and every point generated in the process of feature extraction, nor is it particularly profitable, because then our scoring module would become cumbersome. Vector quantization is a process that takes a huge cluster of points in a multi dimensional space and replaces this cluster with a collection of centroid points that are representative of the cluster. This method is also called codebook quantization: the set of centroid vectors representative of the overall cluster is called the codebook and the centroid points are called code vectors. The basic idea is, for each speaker's training points, to divide the n dimensional feature space into a number of partitions, each partition being equivalent to a code vector. In our case we divide the 23 dimensional feature space into 16 parts, each represented by a vector. The k-means algorithm has been used for clustering and the distance measure used is simply the Euclidean distance.

The k-means algorithm

The algorithm uses an iterative refinement technique. Given an initial set of $k$ means $m_1^{(1)}, \ldots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps.

Assignment step: assign each observation to the cluster with the closest mean (i.e. partition the observations according to the Voronoi diagram generated by the means):

$$ S_i^{(t)} = \{ x_p : \| x_p - m_i^{(t)} \| \le \| x_p - m_j^{(t)} \| \ \text{for all } j \}. $$

Update step: calculate the new means as the centroids of the observations in each cluster:

$$ m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_p \in S_i^{(t)}} x_p. $$


The algorithm is deemed to have converged when the assignments no longer change.

Voronoi Diagrams
Without going into the details, a Voronoi diagram consists of Voronoi cells. Given a set of points, the whole feature space is decomposed into a number of sections or cells, one corresponding to each point. The points lying in the Voronoi cell of a given point are those which are closer to that point than to any other point in the set.


Figure 12: Vector quantization (image source: http://www.data-compression.com/vq-2D.gif)
4.3.2 Gaussian Mixture Models
At times the representation provided by vector quantization (codebook quantization) is not adequate for modeling the variations in the tonalities of a particular speaker, so it is a good idea to allow multiple underlying representations with different probabilities to model a speaker. This can be achieved handsomely by Gaussian mixture models, which are the state of the art for speaker recognition. In statistics, a mixture model is a probabilistic model for representing the presence of sub-populations within an overall population, without requiring that an observed data set identify the sub-population to which an individual observation belongs. Formally, a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population.


However, while problems associated with "mixture distributions" relate to deriving the properties of the overall population from those of the sub-populations, "mixture models" are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population, without sub-population identity information. Some ways of implementing mixture models involve steps that do attribute postulated sub-population identities to individual observations (or weights towards such sub-populations), in which case they can be regarded as types of unsupervised learning or clustering procedures; however, not all inference procedures involve such steps. The structure of a general mixture model can be understood as follows: a typical finite-dimensional mixture model is a hierarchical model consisting of the following components.
- N random variables corresponding to observations, each assumed to be distributed according to a mixture of K components, with each component belonging to the same parametric family of distributions but with different parameters.
- N corresponding random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution.
- A set of K mixture weights, each of which is a probability (a real number between 0 and 1), all of which sum to 1.
- A set of K parameters, each specifying the parameters of the corresponding mixture component. In many cases each "parameter" is actually a set of parameters; for example, observations distributed according to a mixture of one-dimensional Gaussian distributions will have a mean and a variance for each component, while observations distributed according to a mixture of V-dimensional categorical distributions (e.g. when each observation is a word from a vocabulary of size V) will have a vector of V probabilities, collectively summing to 1.
The general mixture model can then easily be converted into a Gaussian mixture model by choosing Gaussian component distributions and adopting their parameters.

Parameter Estimation: Expectation Maximization


We have used the expectation maximization (EM) algorithm for parameter estimation, or mixture decomposition, in Gaussian mixture models.

The expectation step. With initial guesses for the parameters of our mixture model, the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point. That is, for each data point $x_j$ and distribution $Y_i$, the membership value $y_{i,j}$ is

$$ y_{i,j} = \frac{a_i \, f_Y(x_j; \theta_i)}{\sum_k a_k \, f_Y(x_j; \theta_k)}. $$

The maximization step. With expectation values in hand for group membership, plug-in estimates of the distribution parameters are recomputed. The mixing coefficients $a_i$ are the means of the membership values over the $N$ data points:

$$ a_i = \frac{1}{N} \sum_{j=1}^{N} y_{i,j}. $$

The component model parameters $\theta_i$ are also recomputed using the data points $x_j$ weighted by the membership values. For example, if $\theta_i$ is a mean $\mu_i$,

$$ \mu_i = \frac{\sum_j y_{i,j} \, x_j}{\sum_j y_{i,j}}. $$

With new estimates for the $a_i$ and the $\theta_i$, the expectation step is repeated to recompute new membership values, and the entire procedure is repeated until the model parameters converge. A practical sketch of GMM-based decision scoring follows.
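In practice we do not implement EM by hand; toolkits provide it directly. The following hedged sketch, assuming scikit-learn, trains one GMM per speaker and applies the log-likelihood decision scoring described in section 4.1; the number of components (8 here) is illustrative and would be tuned on the data.

from sklearn.mixture import GaussianMixture


def train_speaker_gmms(features_by_speaker, n_components=8):
    """Fit one GMM per speaker on that speaker's (frames, dims) feature matrix."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type='full').fit(feats)
            for spk, feats in features_by_speaker.items()}


def identify(test_features, speaker_gmms):
    """Return the speaker whose model gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_features)       # mean per-frame log-likelihood
              for spk, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get)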

4.4 Bayesian Information Criterion (BIC)


For every window of the given audio clip it becomes necessary to determine whether the window belongs to one speaker or whether there exists a segmentation point within it; this we achieve using the Bayesian Information Criterion. The metric is an indicator of the acoustic dissimilarity between the two sub-windows.


The question we fundamentally ask is whether this window is better modeled by a single speaker or whether two speakers give it a better representation; models with a higher number of independent parameters are penalized through a weight $\lambda$. Given a model $M$ described by the statistical distribution $\theta$, the BIC for a window $W$ can be defined as

$$ \mathrm{BIC}(M) = \ln p(X_W \mid \theta) - \frac{\lambda}{2} \, D \, \ln N_W, $$

where:
1) $X_W$ is the series of audio feature vectors captured in the window $W$, and $N_W$ is the number of such vectors.
2) $D$ is the number of independent parameters in $\theta$.
3) The second term is the penalty term and penalizes a model for its complexity.
4) $\lambda$ can be adjusted empirically.
5) The model with the higher BIC value is to be chosen.

BIC in Segmentation

$$ \Delta\mathrm{BIC}(t_{\text{test}}) = \mathrm{BIC}(M_1) - \mathrm{BIC}(M_0) $$

- Two models, $M_0$ and $M_1$, are defined for a candidate turn point $t_{\text{test}}$ inside the window.
- Model $M_0$ represents the scenario where $t_{\text{test}}$ is not a turn point, so the left and right sub-windows belong to a common distribution $\theta_W$.
- Model $M_1$ represents the scenario where $t_{\text{test}}$ is a turn point; the left and right sub-windows then belong to different distributions $\theta_L$ and $\theta_R$.
- It is assumed that the feature vectors follow Gaussian distributions.
- If $\Delta\mathrm{BIC}(t_{\text{test}}) > 0$, the higher-BIC model is $M_1$ and $t_{\text{test}}$ is declared a speaker turn point.
A numerical sketch of this ΔBIC computation is given below.
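For concreteness, a ΔBIC computation under the Gaussian assumption can be sketched as follows (assuming NumPy), following the standard full-covariance formulation of Chen and Gopalakrishnan (1998); a positive value favours model M1, i.e. a speaker turn at the candidate point. Both sub-windows must contain clearly more frames than feature dimensions for the covariances to be well conditioned.

import numpy as np


def delta_bic(left_frames, right_frames, lam=1.0):
    """ΔBIC between 'one Gaussian for the whole window' and 'one per sub-window'."""
    both = np.vstack([left_frames, right_frames])
    n, d = both.shape
    n_l, n_r = len(left_frames), len(right_frames)

    def logdet_cov(frames):
        # log-determinant of the full covariance; slogdet avoids overflow
        _, logdet = np.linalg.slogdet(np.cov(frames, rowvar=False))
        return logdet

    gain = 0.5 * (n * logdet_cov(both)
                  - n_l * logdet_cov(left_frames)
                  - n_r * logdet_cov(right_frames))
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - lam * penalty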


Chapter 5

Experimental Results
5.1 Datasets and Objective of the Experiments
In the previous three chapters we discussed how we went about implementing our speaker diarization system. The primary objective of the experiments was to see whether the performance of our speaker diarization system eroded as the number of speakers in a conversation increased; for this, it must be borne in mind that all other parameters were kept the same. We recorded speech samples from six different speakers, who were asked to read excerpts from Friedrich Wilhelm Nietzsche's Also Sprach Zarathustra. Five training samples of 30 to 45 seconds each were taken from each speaker, and ten testing samples were also taken. From this data, conversations were then synthesized: we take a test clip from one of the speakers, insert some silence before or after it, then append speech samples from other speakers, and keep doing so until the length and other attributes of the conversation clip are satisfactory. The results of our experiments with two to six speakers are summarized in the following table.


Number of speakers    Codebook speaker accuracy    GMM speaker accuracy
2                     85.4%                        91.3%
3                     81.3%                        86.8%
4                     74.5%                        78%
5                     73.3%                        76.5%
6                     71.2%                        74%


Chapter 7

Conclusion
The experimental results clearly demonstrate that Gaussian mixture models are much more robust than vector quantization (codebook) models at maintaining performance as the number of speakers in a conversation increases. This can be attributed to the fact that a speaker's glottis behaves differently under different conditions, such as when uttering different categories of phonetic sounds, so we need a model that accounts for the sub-populations within the population of feature points generated by a particular speaker; this mixture-model kind of representation is adequately provided by Gaussian mixture models.

There are distinct patterns or sub-populations within the set of feature vectors generated by a speaker, but they are not many. To optimize performance, the number of components in the mixture model should equal the number of such sub-populations in the speaker's feature vector set.

Moreover, even with Gaussian mixture models, the state of the art in speaker identification, the performance of our diarization system is not very encouraging. It therefore becomes imperative to incorporate two other factors into a diarization system: 1) make the corpus multi-modal and use direction of arrival and time delay of arrival to enhance performance, and 2) incorporate visual cues, as audio alone has not given significantly encouraging results.


Bibliography
Ajmera, J., & Wooters, C. 2003. A robust speaker clustering algorithm. In: ASRU 2003 - 8th IEEE Automatic Speech Recognition and Understanding Workshop.

Ajmera, J., McCowan, I., & Bourlard, H. 2004. Robust speaker change detection. IEEE Signal Processing Letters, 11(8).

Akita, Yuya, & Kawahara, Tatsuya. 2003. Unsupervised speaker indexing using anchor models and automatic transcription of discussions. In: Interspeech 2003 - 8th European Conference on Speech Communication and Technology.

Anguera, X., Wooters, C., Pardo, J., & Hernando, J. 2007. Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings. In: IEEE International Conference on Acoustics, Speech and Signal Processing 2007.

Anguera, X., Wooters, C., & Hernando, J. 2005a. Speaker diarization for multi-party meetings using acoustic fusion. In: ASRU 2005 - 9th IEEE Automatic Speech Recognition and Understanding Workshop.

Anguera, Xavier. 2005. XBIC: Real-time cross probability measure for speaker segmentation. Tech. rept., ICSI.

Anguera, Xavier. 2006b. Robust Speaker Diarization for Meetings. Ph.D. thesis, Universitat Politecnica de Catalunya.

Anguera, Xavier, Wooters, Chuck, Peskin, Barbara, & Aguilo, Mateu. 2005b. Robust speaker segmentation for meetings: The ICSI-SRI Spring 2005 diarization system. In: Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop. Edinburgh, UK: Springer LNCS 3869.

Anguera, Xavier, Wooters, Chuck, & Hernando, Javier. 2006a. Friends and enemies: A novel initialization for speaker diarization. In: Interspeech 2006 ICSLP - 9th International Conference on Spoken Language Processing.

Anguera, Xavier, Wooters, Chuck, & Pardo, Jose M. 2006b. Robust speaker diarization for meetings: ICSI RT06s meetings evaluation system. In: Rich Transcription 2006 Spring Meeting Recognition Evaluation Workshop. Bethesda, MD, USA: Springer LNCS 4299.

Barras, Claude, Zhu, Xuan, Meignier, Sylvain, & Gauvain, Jean-Luc. 2004. Improving speaker diarization. In: Rich Transcription 2004 Fall Workshop.

Barras, Claude, Zhu, Xuan, Meignier, Sylvain, & Gauvain, Jean-Luc. 2006. Multistage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5).

Betser, Michael, Bimbot, Frédéric, Ben, Mathieu, & Gravier, Guillaume. 2004. Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs. In: ICSLP 2004 - 8th International Conference on Spoken Language Processing.

Bimbot, Frédéric, & Mathan, Luc. 1993. Text-free speaker recognition using an arithmetic-harmonic sphericity measure. In: Eurospeech '93 - 3rd European Conference on Speech Communication and Technology.

Black, A., & Schultz, T. 2006. Speaker clustering for multilingual synthesis. In: ISCA Tutorial and Research Workshop on Multilingual Speech and Language Processing.

Bonastre, J.-F., Delacourt, P., Fredouille, C., Merlin, T., & Wellekens, C. 2000. A speaker tracking system based on speaker turn detection for NIST evaluation. In: IEEE International Conference on Acoustics, Speech and Signal Processing 2000.

Burges, C. J. C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2).

Cettolo, Mauro. 2000. Segmentation, classification, and clustering of an Italian broadcast news corpus. In: RIAO 2000 - 6th Conference on Content-Based Multimedia Information Access.

Chen, Jingdong, Benesty, Jacob, & Huang, Yiteng. 2006. Time delay estimation in room acoustic environments: An overview. EURASIP Journal on Applied Signal Processing.

Chen, Scott Shaobing, & Gopalakrishnan, P. S. 1998b. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In: DARPA Speech Recognition Workshop 1998.

Cheng, E., Lukasiak, J., Burnett, I. S., & Stirling, D. 2005. Using spatial cues for meeting speech segmentation. In: IEEE ICME 2005 - International Conference on Multimedia and Expo.

Cohen, A., & Lapidus, V. 1995. Unsupervised text-independent speaker classification. In: 18th Convention of Electrical and Electronics Engineers in Israel.

Cook, G. D., & Robinson, A. J. 1998. The 1997 Abbot system for the transcription of broadcast news. In: DARPA Speech Recognition Workshop 1998.

Couvreur, L., & Boite, J. 1999. Speaker tracking in broadcast audio material in the framework of the THISL project. In: ESCA Tutorial and Research Workshop on Accessing Information in Spoken Audio.

Delacourt, P., & Wellekens, C. J. 2000. DISTBIC: A speaker-based segmentation for audio data indexing. Speech Communication, 32(1-2).

Wikipedia. www.wikipedia.org.
