Attempted Speaker Identification
Florida vs. Zimmerman

Report to:
Richard W. Mantei
Assistant State Attorney
Fourth Judicial Circuit of Florida
220 E. Bay St., Jacksonville, FL 32202

March 20, 2013

Report Prepared by:
Harry Hollien, PhD
James D. Harnsberger, PhD
Senior Consultants
Forensic Communication Associates

REPORT ON ATTEMPTED SPEAKER IDENTIFICATION Florida vs. Zimmerman


INTRODUCTION

Personnel at Forensic Communication Associates (FCA) were contacted by Mr. Richard Mantei, Assistant State Attorney, Fourth Judicial Circuit of Florida, Jacksonville, regarding recordings associated with the above-cited case. It was requested that an attempt be made to discover whether the male voice found on a 9-11 recording (i.e., the unknown voice) was the same as the one recorded on an exemplar CD (the known voice). The person speaking on the exemplar was Mr. George Zimmerman. Later, exemplar recordings were requested for Mr. Trayvon Martin; the speech on them was also to be compared to the 9-11 utterances.

MATERIALS RECEIVED

Two CD recordings were received at FCA. One was of the relevant 9-11 call. It was labeled "911 witness call," an address and [illegible]. The second CD contained the voice of George Zimmerman. It was dated 2/26/12, dispatch call [illegible]. Both were labeled with an FCA number, and digital copies were made on laboratory equipment and a computer. FCA personnel then requested additional voice samples of both G. Zimmerman and T. Martin. At various later dates, three Zimmerman CDs were received (they were jail calls, 4/20/12; video interview, 2/27/12; and reenactment audio, 3/22/12). Finally, two DVDs taken from Trayvon Martin's phone (only markings) were received at yet a later date. Identifying marks were placed on these recordings also, and digital copies were made of them (i.e., via computer input). The undersigned and a senior technician listened to them in their entirety several times. Analysis CDs were constructed (of evidence-exemplar sets and pairs) and the samples they contained were processed by means of several aural-perceptual speaker identification techniques (see below and Hollien, H., Acoustics of Crime, Plenum, 1990; Hollien and Hollien, Improving Aural-Perceptual Speaker Identification Techniques, Studies in Forensic Phonetics, 1995, Wissenschaftlicher, Trier; and Hollien, H., Forensic Voice Identification, Academic Press, 2001).

THE RECORDINGS
The Evidence Recording. As expected, the samples on the 9-11 evidence recording were not at all suitable for ordinary speaker identification analyses. First, they were mostly short grunts, calls or cries; a few gave the illusion of speech, mostly "help" or "help me." Second, with only two exceptions, they were rather faint. Third, since they were recorded at a 9-11 center, other voices were heard (they were much louder, of course). In many instances, these voices obliterated and/or overlapped those in the background. Fourth, 16 utterances in all could be identified. However, only six were found to be potentially useful, and some of their extent was lost when they were extracted. Taken as a whole, only a little over 8 sec. of speech was found to be available for assessment.

Exemplars. On the other hand, the energy levels of the utterances on the exemplar recordings were sufficient and the overall quality of those produced was quite good in all instances. They were of the type suitable for speaker identification purposes; that is, they were intelligible and, although noise was intermittently present, it rarely masked the speech. The problem, of course, was that very few of the utterances they contained actually were suitable for comparison, as those involving very short samples produced under high stress were quite rare. Although a number of procedures were tried, ultimately the judgments made using the Zimmerman re-created cries as exemplars (i.e., when compared to the six 9-11 samples) were the most useful. For Martin, several of his high-frequency laughs, exclamations and mocking utterances were employed.

Selection of the Unknown Samples. As stated, the major problem was that very little speech/voice material was available for processing; a second problem was that the samples were all calls or cries; the third was that they were very faint; and the fourth, that most were at least partly masked by the speech of other talkers. In all, 16 short calls/cries were identified and very little intelligible speech was available: i.e., only one or two instances of "help" or "help me." Of these 16 samples, only six provided at least 500 ms of a clear call and, even in these instances, part of the total call had to be removed. As stated, just a little more than 8 sec. of phonation was available. Samples this brief rarely lead to attempts at speaker identification. Ordinarily, 10 words or 10 seconds of speech constitute a bare minimum. However, they were the only unknown samples available and the task involved making a determination between but two speakers (i.e., G. Zimmerman and T. Martin).

Preparation of the Recordings. Samples of the "unknown" (U) and the "known" (K) speakers were prepared for the aural-perceptual comparisons. These procedures included the selection of three sets of samples for three different analyses. The first set was mostly for familiarization purposes. It involved a compilation of all the calls/cries onto a single recording. It was compared (serially) to a group of short speech passages drawn from the several interviews/calls made by Mr. Zimmerman (K). Later on, this procedure was independently applied to samples from Mr. Martin's telephone calls (K). The second procedure was to create six separate CDs, each with a different call or cry from the 9-11 recordings, and individually compare them to a variety of short speech samples from the K recordings (first to Zimmerman samples, then separately to those by Martin). The final procedure (i.e., the third one) was the most important. These same cries/calls were individually compared to the cries/calls from the reenactment recording. In Mr. Martin's case, samples of laughter, mocking, and high-pitched exclamations were employed. As would be expected, all samples selected were of the best available quality, with noise at a minimum. Both the known and the unknown samples were band-pass filtered, with both the high-pass and low-pass filtering cutoffs set outside the speech range. This procedure was carried out in order to 1) minimize any situational differences, 2) reduce distractions and 3) eliminate some of the non-speech artifacts present on the recordings. Thus, the highest quality samples possible were made available.
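The band-pass step described above can be illustrated with a minimal sketch. The report does not state the filter design, the cutoff frequencies, or the sample rate, so everything specific here is an assumption made for illustration only: a first-order high-pass stage followed by a first-order low-pass stage, hypothetical cutoffs of 50 Hz and 6000 Hz, and the function names themselves. It is not a reconstruction of the laboratory equipment actually used.

```python
import math


def high_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC high-pass: attenuates content below cutoff_hz."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def low_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order RC low-pass: attenuates content above cutoff_hz."""
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    out = [alpha * samples[0]]
    for x in samples[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out


def band_pass(samples, low_cutoff_hz, high_cutoff_hz, sample_rate_hz):
    """Cascade the two stages so only the band between the cutoffs passes."""
    if not 0 < low_cutoff_hz < high_cutoff_hz < sample_rate_hz / 2:
        raise ValueError("need 0 < low cutoff < high cutoff < Nyquist")
    stage_one = high_pass(samples, low_cutoff_hz, sample_rate_hz)
    return low_pass(stage_one, high_cutoff_hz, sample_rate_hz)


# Hypothetical settings (not from the report): 16 kHz audio, 50-6000 Hz band.
SAMPLE_RATE_HZ = 16000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE_HZ) for n in range(3200)]
filtered = band_pass(tone, 50.0, 6000.0, SAMPLE_RATE_HZ)
```

A first-order cascade rolls off gently; a practical forensic workflow would more likely use a steeper filter, but the principle of placing both cutoffs outside the speech range is the same.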

AURAL-PERCEPTUAL SPEAKER IDENTIFICATION


The aural-perceptual speaker identification procedures employed are those where an unknown voice (U), drawn from an evidence recording, is compared to exemplars of a known voice (K). As stated above, samples of a number of U-K combinations were placed on a CD in pairs for direct and repeated comparison. The undersigned then carried out evaluations which were based on a number of heard parameters. In this instance, they only included comparisons of: 1) fundamental frequency, 2) voice quality, 3) vocal intensity (variability) patterns, 4) vowels, and 5) nasality. Subjective impressions also were logged for consideration. This entire procedure was completed, then repeated in its entirety some time later, usually the next day.

In this case, a speaker identification procedure had to be employed in which an attempt was made to match, or not match, Mr. Zimmerman's vocalizations to the six usable cries found on the 9-11 call, using utterances (as similar as possible) from the exemplar recordings. As stated, the process was carried out as follows. The greatest extent of the cry possible was isolated; all extraneous noise was removed; the cry was repeated 8-10 times. A recording of a number of Mr. Zimmerman's exemplar utterances was then compared, in turn and individually, with each of the eight samples. Voice quality, pitch, vowel quality, nasality and intensity inflections were the primary judgmental features. Finally, the entire process was repeated using the mimicked cries by Mr. Zimmerman. It then was carried out twice for Mr. Martin: first with general speech samples, then with the available stress units.

It also should be noted that, for this case, the usual procedure was further modified. Ordinarily, evaluating an identification parameter (pitch, say) was carried out by playing the pairs over and over until a decision could be made. Here, the single identification parameter remained the same but the specific cry (No. 8, say) was compared to a variety of exemplar samples (again, over and over) until the judgment was made.
It then was repeated for the five other U utterances. Thus, the process here more closely parallels six separate speaker identifications, with the product for each summed both by cry and by feature. Please note that the two investigators worked independently and did not compare results until after all evaluations had been completed.

As implied, the (individual) assessments ordinarily obtained are summarized on a continuum like the one found in Figure 1. In general, the range of scores making up each continuum can be divided roughly as follows: 1) any mean scores in the 0-3 range suggest that a match cannot be made and the samples were produced by two different individuals, 2) a scoring of 4-6 is generally neutral but somewhat on the positive side (i.e., toward a match) and 3) those that fall within the 7-10 range indicate a positive-to-strong match. It should be stressed again that the listed parameters were evaluated one at a time, with the complete procedure independently replicated a number of times. This method of presentation was adapted for these evaluations (see Figures 2-5).
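The three scoring bands just described can be expressed as a short sketch. The report gives the bands as 0-3, 4-6 and 7-10 and does not say how fractional means that fall between them (3.5 or 6.9, for example) are assigned, so the cut points at 4.0 and 7.0 below are an assumption, as are the function names and the sample judgments.

```python
from statistics import mean

# Assumed cut points: the report lists bands of 0-3, 4-6 and 7-10 but does not
# state how fractional means that fall between bands are assigned.
NEUTRAL_FLOOR = 4.0
MATCH_FLOOR = 7.0


def categorize(score):
    """Map a 0-10 similarity score onto the report's three broad bands."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must lie on the 0-10 continuum")
    if score < NEUTRAL_FLOOR:
        return "no match"
    if score < MATCH_FLOOR:
        return "neutral, leaning toward a match"
    return "positive-to-strong match"


def summarize(judgments):
    """Return (mean, lowest, highest) for a list of individual judgments."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return mean(judgments), min(judgments), max(judgments)


# Hypothetical judgments for one U-K pairing (not taken from the report).
average, lowest, highest = summarize([4.5, 6.0, 7.5])
print(average, lowest, highest, categorize(average))
```

Under these assumed cut points, a mean just below 7 stays in the neutral band even when individual judgments in its range cross into the match band, which is the situation the report describes for some cries.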

RESULTS
The prepared samples were played (repeatedly) on high quality laboratory equipment. The findings and impressions of the undersigned resulted in differing conclusions depending on which of the U samples were compared to which of those uttered by the two known (K) speakers. As stated, a maximum of only five speech/voice parameters (plus an overall judgment) could be used to permit U-K judgments.

The Bases of the Comparisons.

1. Pitch. Perceived pitch is the psychophysical correlate of fundamental frequency usage. In this case, it refers to the level of those tones produced by the speaker. It proved to be one of the weaker contrasts in this evaluation.

2. Voice Quality. This dimension is a little difficult to define but rather easy to demonstrate. Any hearing individual would have little difficulty differentiating a violin from a saxophone even though both were played at the same fundamental frequency and intensity. The relative differences among the partials (frequencies) within the complex musical sounds are what make this discrimination possible. This characteristic proved to be a major factor for these assessments as the same is true for human speakers.

3. Vocal Intensity Patterns. Absolute vocal intensity levels are very difficult to detect because even slight differences in the environmental situation, microphone position, talker distance, etc., can result in large differences in the absolute level of measured or perceived loudness. As with pitch, the intensity variability patterns proved to be one of the lesser identification features. Yet they aided in some judgments.

4. Nasality. Detection of the amount of nasality in the cries and exemplar samples proved to be helpful.

5. Vowels. In some cases, vowel formant comparisons of the calls with exemplar samples provided enough information to permit graded same-different judgments.

6. Finally, each of our evaluators provided a general overall assessment of the U-K samples. In many cases, these efforts aided in the decision making.

Specific Results. The first of the three sets of judgments (i.e., general speech) for Mr. Zimmerman was simply inconclusive and will not be included in the results. The second provided some insight, so its results are presented as Figure 2. The third (i.e., the U-K comparisons of the 9-11 calls/cries vs. those from the reenactment) was the most important; please see Figure 3. The two sets for Mr. Martin parallel those of Mr. Zimmerman to some extent (i.e., short speech samples and more relevant samples). They are presented as Figures 4 and 5. Over two thousand specific judgments were required to permit the following decisions to be made. As can be seen from consideration of the four figures, no robust matches were obtained. On the other hand, several rather strong tendencies were found. First, please note the following. Call No. 11 proved almost impossible to judge once the masking by other voices was trimmed from its borders. Accordingly, data from this sample will not be included on any figures.

Data on Mr. Zimmerman. The scores for cries/calls Nos. 1 and 8 were so low (see Figures 2 and 3) that little-to-no evidence that Mr. Zimmerman made them appeared to exist. His scores for call No. 13 were rather mixed, and they were quite variable. Thus, even though their mean was close to 5.0, the judgment had to be that they were inconclusive. That is, while they graded above the 0-3 range, they still fell far short of a match. On the other hand, the data for cry No. 14 and (more so) for cry No. 16 proved to be more toward, but in most instances not quite reaching, a match. Indeed, as may be seen from the range data, several of the individual scores exceeded the border of the match category, and the mean for No. 16 (see Figure 3) also came very close to a match. In short, there is a very good possibility that, under normal circumstances, cry No. 14 and, especially, cry No. 16 would be judged to be a match, i.e., that Mr. Zimmerman had, indeed, made one or both of those two utterances. In this instance, the confidence level only reaches about 65-70%. Nevertheless, it is much less likely that he (George Zimmerman) was not the person who made these two cries.

Data for Mr. Martin. The data for Mr. Martin are similar in extent but different in pattern. Of course, the judgments here are even more difficult to make, as they were drawn from a telephone call and, unlike those for Mr. Zimmerman, no reenactment samples were available. In judging Figures 4 and 5 (based also on two separate analyses), it can be noted 1) that no judgments were possible for call No. 11, and 2) that the data for calls Nos. 13, 14, and 16 quite clearly demonstrate that he did not make them. That is, the means of all the many hundreds of judgments usually ranged from 1.0 to 3.1. And, even though one score reached 5.5, very few of the individual judgments exceeded the non-match category. Thus, even with these restricted judgments, there is too little evidence to suggest that he uttered any of these three calls. On the other hand, there is some evidence that he was responsible for the first two calls/cries (i.e., No. 1 and No. 8). Note their mean scores on Figures 4 and 5. They range from 5.8-6.5 for the first identification run (Fig. 4) and 6.4-6.5 for the second (Fig. 5). Note also that several of the individual judgments are in the 7.0 or above (i.e., match) category. Thus, it may be concluded with a nearly 70% confidence level that Mr. Martin produced the first two calls. Again, while they did not reach the definite match category, the data do not provide any real evidence that he did not make these utterances.

DISCUSSION
While the evidence suggests that Mr. Martin produced the first two utterances and Mr. Zimmerman made the last two, the confidence level for these relationships is not very robust. Yet, conclusions of these low magnitudes are hardly surprising, given the limits and difficulty of the evaluation process. It is possible, of course, that more robust data could have been obtained if we had been supported in conducting two additional sets of procedures. The first of these procedures would have included comparative acoustic analyses of the listed U-K samples. The second would have been a perceptual experiment to compare the evidence recordings to appropriately sized samples of male speakers who were matched in age, gender, and linguistic background to, alternatively, Trayvon Martin and George Zimmerman. These two groups of speakers would produce utterances similar to those found on the 9-11 and exemplar recordings. The results of these procedures would have aided the undersigned in confirming or not confirming the findings reviewed above.

CONCLUSIONS
The opinions to follow are based primarily on the aural-perceptual evaluations described above. As was stated, even though many problems were evident, the evidence recording provided minimum-to-marginal material for identification purposes. Moreover, the exemplar recordings contained enough material to permit a number of different judgments to be made. Based on the many analyses carried out, the undersigned had to conclude that, while there is evidence to suggest that Mr. Martin made the first two calls/cries (Nos. 1 and 8) and that Mr. Zimmerman made those identified as Nos. 14 and 16, none of these conclusions reached the criterion for a match. Neither speaker could be identified as being responsible for the others.

Finally, it must be conceded that the aural-perceptual method of speaker identification, while reasonably well organized and extensive in this case, is somewhat subjective in nature and, hence, the possibility of error exists. Nonetheless, the reported data can be defended on the basis of the rigorous procedures employed and, hence, the conclusions drawn can be viewed as reasonable.

Respectfully submitted,

Forensic Communication Associates

James D. Harnsberger, PhD Senior Consultant

Harry Hollien, Ph.D. Senior Consultant

Figure 1. A sample of the type of summary figure employed in ordinary aural-perceptual speaker identification. The structuring of Figures 2-5 is patterned on this one.

FORENSIC COMMUNICATION ASSOCIATES
Aural-perceptual Approach to Speaker Identification Score Sheet

Case Name:                          FCA REF:

0 = U-K least alike; 10 = U-K most alike

                                    SCORE               RANGE
1. PITCH
   a. Level                         0 . . 5 . . 10
   b. Variability                   0 . . 5 . . 10
   c. Patterns                      0 . . 5 . . 10
2. VOICE QUALITY
   a. General                       0 . . 5 . . 10
   b. Vocal Fry                     0 . . 5 . . 10
   c. Other                         0 . . 5 . . 10
3. INTENSITY
   a. Variability                   0 . . 5 . . 10
4. DIALECT
   a. Regional                      0 . . 5 . . 10
   b. Foreign                       0 . . 5 . . 10
   c. Idiolect                      0 . . 5 . . 10
5. ARTICULATION
   a. Vowels                        0 . . 5 . . 10
   b. Consonants                    0 . . 5 . . 10
   c. Misarticulations              0 . . 5 . . 10
   d. Nasality                      0 . . 5 . . 10
6. PROSODY
   a. Rate                          0 . . 5 . . 10
   b. Speech Bursts                 0 . . 5 . . 10
   c. Other                         0 . . 5 . . 10
MEAN

Figure 2. Comparison of Mr. Zimmerman's general (but short) samples with the six cries or calls drawn from the 9-11 telephone call. Twenty such samples were compared to each cry. The data (X) on the continuum are the means of at least four of the five features plus a general assessment.

Cry Number    Perception        Mean Judgment (0-10)    Range
No. 1         wow               2.4                     1.0-4.0
No. 8         ow                1.9                     0-4.5
No. 11        (mainly cherp)    N/A                     null
No. 13        wyra              4.7                     2.5-5.5
No. 14        owa               6.1                     4.0-6.0
No. 16        swaa              6.6                     4.0-7.0

Figure 3. Comparison of Mr. Zimmerman's reenacted cries with each of the six cries/calls. Twenty such samples were matched with each cry. The data (X) on the continuum are the means of at least four of the five features plus a general assessment.

Cry Number    Perception        Mean Judgment (0-10)    Range
No. 1         wow               2.0                     0-4.0
No. 8         ow                2.1                     1.0-3.5
No. 11        (mainly cherp)    N/A                     null
No. 13        wyra              4.8                     2.0-5.5
No. 14        owa               6.0                     5.0-7.0
No. 16        swaa              6.9                     4.5-7.5

Figure 4. Comparison of Mr. Martin's general (but short) samples with the six cries or calls drawn from the 9-11 telephone call. Twenty such samples were matched with each cry. The data (X) on the continuum are the means of at least four of the five features plus a general assessment.

Cry Number    Perception        Mean Judgment (0-10)    Range
No. 1         wow               [illegible]             3.0-7.0
No. 8         ow                6.5                     4.0-7.5
No. 11        (mainly cherp)    N/A                     null
No. 13        wyra              1.2                     0-3.2
No. 14        owa               3.9                     2.5-5.5
No. 16        swaa              2.5                     1.0-4.0

Figure 5. Comparison of Mr. Martin's selected samples (shouts/cries) with the six cries or calls drawn from the 9-11 telephone call. Twenty such samples were matched with each cry. The data (X) on the continuum are the means of at least four of the five features plus a general assessment.

Cry Number    Perception        Mean Judgment (0-10)    Range
No. 1         wow               6.5                     4.5-7.5
No. 8         ow                6.4                     4.5-7.0
No. 11        (mainly cherp)    N/A                     null
No. 13        wyra              1.1                     0.5-3.2
No. 14        owa               3.1                     2.0-5.0
No. 16        swaa              2.8                     1.0-4.5
