Professional Documents
Culture Documents
Beatbox ProofMS 037302JAS
Beatbox ProofMS 037302JAS
Beatbox ProofMS 037302JAS
7 Erik Bresch
8 Philips Research, High Tech Campus 5, 5656 AE, Eindhoven, Netherlands
9 Dani Byrd
10 Department of Linguistics, University of Southern California, 3601 Watt Way, Los Angeles,
11 California 90089-1693
J. Acoust. Soc. Am. 133 (2), February 2013 0001-4966/2013/133(2)/1/12/$30.00 C 2013 Acoustical Society of America
V 1
113 form of vocal performance, we hope to shed light on broader Kick “punchy” bf ½pf’+8 ç glottalic egressive
Kick “thud” b ½p’8I glottalic egressive
114 issues of human sound production—making use of direct
Kick “808” b ½p’8U glottalic egressive
115 articulatory evidence to seek a more complete description of
Rimshot k [k’] glottalic egressive
116 phonetic and artistic strategies for vocalization.
Rimshot “K” k [khh+] pulmonic egressive
Rimshot “side K” ½8
Nk lingual ingressive
117 III. CORPORA AND DATA ACQUISITION Rimshot “sucking in” ½8
N! lingual ingressive
Snare “clap” Njw
½8 lingual ingressive
118 A. Participant _
Snare “no meshed” pf [pf’+8 ı] glottalic egressive
119 The study participant was a 27 year-old male professional Snare “meshed” ksh ½kç+ pulmonic egressive
120 singer based in Los Angeles, CA. The subject is a practitioner Hi-hat “open K” kss ½ks+ pulmonic egressive
121 of a wide variety of vocal performance styles including hip Hi-hat “open T” tss
_
½0ts + pulmonic egressive
122 hop, soul, pop, and folk. At the time of the study, he had been _
Hi-hat “closed T” ^t 0K
½0ts t pulmonic egressive
123 working professionally for 10 years as an emcee (rapper) in a ½w N
Hi-hat “kiss teeth” th 8 j lingual ingressive
124 hip hop duo, and as a session vocalist with other hip hop and Hi-hat “breathy” h ½x+w pulmonic egressive
125 fusion groups. The subject was born in Orange County, CA, to Cymbal “with a T” tsh [tˆ)+w ] pulmonic egressive
126 Panamanian parents, is a native speaker of American English, Cymbal “with a K” ksh ½kw ç+w pulmonic egressive
127 and a heritage speaker of Panamanian Spanish.
2 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing
_
FIG. 1. Articulation of a “punchy” kick drum effect as an affricated labial ejective ½pf ’+8
ç. Frame 92: starting posture; f97: lingual lowering, velic closure; f98:
fully lowered larynx, glottalic closure; f100: rapid laryngeal raising accompanied by lingual raising; f101: glottis remains closed during laryngeal raising;
f103: glottal abduction; final lingual posture remains lowered.
J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing 3
FIG. 2. Articulation of a “thud” kick drum effect as an bilabial ejective [p’8I]. Frame 84: starting posture; f89: glottal lowering, lingual retraction; f93: fully
lowered larynx, sealing of glottalic, velic and labial ports; f95: rapid laryngeal raising accompanied by lingual raising; f97: glottis remains closed during laryn-
geal raising and lingual advancement; f98: final lingual posture raised and advanced.
255 unaffricated bilabial ejective stops: a “thud kick,” and an “808 average release durations of some other Athabaskan glottalic 292
256 kick.” Image sequences acquired during production of these consonants (Hogan, 1976; McDonough and Wood, 2008). In 293
257 effects are shown in Figs. 2 and 3, respectively. The data reveal general, it appears that the patterns of coordination between 294
258 that although the same basic articulatory sequencing is used, glottal and oral closures in these effects more closely resem- 295
259 there are minor differences in labial, glottal, and lingual articu- ble those observed in North American languages, as opposed 296
260 lation which distinguish each kick drum effect. to African languages like Hausa (Lindau, 1984), where “the 297
261 In both thud and 808 kick effects, the lips can be seen to oral and glottal closures in an ejective stop are released very 298
262 form a bilabial seal (Fig. 2, frames 93–95; Fig. 3, frames 80–82), close together in time” (Maddieson et al., 2001). 299
263 while in the production of the affricated punchy effect, the
264 closure is better characterized as labio-dental (Fig. 1, frames
B. Articulation of rim shot effects 300
265 98–103). Mean upward vertical displacement of the glottis dur-
266 ing ejective production, measured over six repetitions of the thud Four different percussion effects classified as snare drum 301
267 kick drum effect, was 18.6 mm, and in five of the six tokens “rim shots” were demonstrated by the subject (Table I). Two 302
268 demonstrated, no glottal abduction was observed after comple- effects were realized as dorsal stops, differentiated by their 303
269 tion of the ejective. Vertical glottal displacement averaged over airstream mechanisms. Two other rim shot sounds were pro- 304
270 five tokens of the 808 kick drum effect, was 17.4 mm. Mean du- duced as lingual ingressive consonants, or clicks. 305
271 ration (oral to glottal release) of the 808 effect was 152 ms. The effect described as “rim shot K” was produced as a 306
272 A final important difference between the three types of voiceless pulmonic egressive dorsal stop, similar to English /k/, 307
273 kick drum effects produced by this subject concerns lingual but with an exaggerated, prolonged aspiration burst: [khh+]. 308
274 articulation. Different amounts of lingual retraction can be Mean duration of the aspiration burst (interval over which aspi- 309
275 observed during laryngeal lowering before production of ration noise exceeded 10% of maximum stop intensity), calcu- 310
276 each ejective. Comparison of the end frames of each image lated across three tokens of this effect, was 576 ms, compared 311
277 sequence reveals that each effect is produced with a different to mean VOT durations of 80 ms and 60 ms for voiceless 312
278 final lingual posture. These differences can be captured in (initial) dorsal stops in American (Lisker and Abramson, 1964) 313
279 close phonetic transcription by using unvoiced _
vowels to and Canadian English (Sundara, 2005), respectively. 314
280 represent the final posture of each effect: ½pf’+8 ç(punchy), A second effect produced at the same place of articula- 315
281 ½p’8I(thud), and ½p’8U (808). tion was realized as an ejective stop [k’], illustrated in 316
282 These data suggest that the kick drum effects produced Fig. 4—an image sequence acquired over a 480 ms interval 317
283 by this artist are best characterized as “stiff” (rather than during the production of the second token. Dorsal closure 318
284 “slack”) ejectives, according to the typological classification (frame 80) occurs well before laryngeal lowering commen- 319
285 developed by Lindau (1984), Wright et al. (2002), and ces (frame 83). Upward movement of the closed glottis can 320
286 Kingston (2005): all three effects are produced with a very be observed after the velum closes off the nasopharyngeal 321
287 long voice onset time (VOT), and a highly transient, high port, and glottal closure is maintained until after the dorsal 322
288 amplitude aspiration burst. The durations of these sound constriction is released (frame 90). 323
289 effects (152 to 160 ms) are longer than the durations reported Unlike in the labial kick drum effects, where laryngeal 324
290 for glottalic egressive stops in Tlingit (Maddieson et al., raising was accompanied by rapid movement of the tongue 325
291 2001) and Witsuwit’en (Wright et al., 2002), but resemble (Figs. 1–3), no extensive lingual movement was observed 326
FIG. 3. Articulation of an “808” kick drum effect as an bilabial ejective ½p’8 U. Frame 75: starting posture; f78: lingual lowering, velic closure; f80: fully low-
ered larynx, glottalic and labial closure; f82: rapid laryngeal raising, with tongue remaining retracted; f83: glottis remains closed during laryngeal raising; f87:
glottal abduction; final lingual posture midhigh and back.
4 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing
FIG. 4. Articulation of a rim shot effect as a dorsal ejective [k’]. Frame 80: dorsal closure; f83: laryngeal lowering, velic raising; f84: velic closure, larynx
fully lowered; f86: glottal closure; f87: rapid laryngeal raising; f90: glottis remains closed through completion of ejective and release of dorsal constriction.
327 during dorsal ejective production in any of the rim shot at the alveolar ridge and the posterior closure spread over a 364
328 tokens (frames 86–87). Mean vertical laryngeal displace- broad region of the soft palate (frames 17–20). Once again, 365
329 ment, averaged over five tokens, was 14.5 mm. Mean ejec- the velum remains lowered throughout. The same pattern of 366
330 tive duration (lingual to glottal release) was 142 ms: slightly articulation was observed in all eight repetitions of this 367
331 shorter than, but broadly consistent with, the labial ejective effect. As with the lateral click, we cannot determine exactly 368
332 effects described above. where the lingual cavity is formed in this sound effect, nor 369
333 Articulation of the effect described as a “side K rim precisely where and when it is released. Nevertheless, the 370
334 shot” is illustrated in the image sequence shown in Fig. 5, patterns of tongue movement in these data are consistent 371
335 acquired over a 480 ms interval during the fifth repetition of with the descriptions of alveolar clicks in !Xo~ o, N|uu, and 372
336 this effect. The data show that a lingual seal is created Nama, as well as in Khoekhoe (Miller et al., 2007), so this 373
337 between the alveolar ridge and the back of the soft palate effect appears to be best described as a voiceless uvular nasal 374
338 (frames 286–290), and that the velum remains lowered alveolar click: ½8
N!. 375
339 throughout. Frames 290–291 reveal that rarefaction and cav-
340 ity formation occur in the midpalatal region while anterior
C. Articulation of snare drum effects 376
341 and posterior lingual seals are maintained, suggesting that
342 the consonantal influx is lateralized, consistent with the sub- Three different snare drum effects were demonstrated 377
343 ject’s description of the click as being produced at “the side by the subject—a “clap,” “meshed,” and “no meshed” 378
344 of the mouth.” The same pattern of articulation was observed snare—each produced with different articulatory and air- 379
345 in all seven tokens produced by the subject. stream mechanisms, described in detail below. 380
346 Without being able to see inside the cavity formed Articulation of the effect described as a “clap snare” is 381
347 between the tongue and the roof of the mouth, it is difficult illustrated in the image sequence shown in Fig. 7, acquired 382
348 to locate the posterior constriction in these sounds precisely. over a 240 ms interval during the sixth repetition of this effect. 383
349 X-ray data from Traill (1985), for example, reported in As in the rim shot clicks, a lingual seal is first created along 384
350 Ladefoged and Maddieson (1996), show that back of the the hard and soft palates, and the velum remains lowered 385
351 tongue maintains a very similar posture across all five types throughout. However, in this case the anterior lingual seal is 386
352 of click in !Xo~o, despite the fact that the lingual cavity varies more anterior (frame 393) than was observed in the lateral 387
353 considerably in size and location. Nevertheless, both lingual and alveolar clicks, the point of influx occurs closer to the 388
354 posture and patterns of release in this sound effect appear to subject’s teeth (frames 394–395), and the tongue dorsum 389
355 be consistent with the descriptions of lateral clicks in !Xo~o, remains raised higher against the uvular during coronal 390
356 N|uu (Miller et al., 2009) and Nama (Ladefoged and Traill, release. Labial approximation precedes click formation and 391
357 1984). In summary, this effect appears to be best described the labial closure is released with the click. The same pattern 392
358 as a voiceless uvular nasal lateral click: ½8Nj. of articulation was observed in all six tokens demonstrated by 393
359 The final rim shot effect in the repertoire was described the subject, consistent with the classification of this sound 394
360 by the subject as “sucking in.” The images in Fig. 6 were Njw .
effect as a labialized voiceless uvular nasal dental click: ½8 395
361 acquired over a 440 ms interval during the production of the The “no mesh” snare drum effect was produced as a la- 396
362 first token of this effect. Like the lateral rim shot, a lingual bial affricate ejective, similar to the punchy _kick drum effect 397
363 seal is created in the palatal region with the anterior closure but with a higher target lingual posture: [pf’+8 ı]. The final 398
J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing 5
399 snare effect, described as “meshed or verby,” was produced hi-hat effect can be characterized as a (partially ejective) 434
400 as a rapid sequence of a dorsal stop followed by a long pala- pulmonic egressive voiceless stop-fricative sequence [k(’)s+]. 435
401 tal fricative ½kç+. A pulmonic egressive airstream mecha- Two hi-hat effects, the “open T” (SBN: tss) and “closed 436
402 nism was used for all six tokens of the meshed snare effect, T” (SBN: t), were realized as alveolar affricates, largely dif- 437
403 but with considerable variability in the accompanying laryn- ferentiated by their temporal properties. The MRI data show 438
404 geal setting. In two tokens, complete glottal closure was that both effects were articulated as laminal alveolar stops 439
405 observed immediately preceding the initial stop burst, and a with affricated releases. The closed T effect was produced as 440
406 lesser degree of glottal constriction was observed in another a short affricate truncated with a homorganic unreleased stop 441
_
407 two tokens. Upward vertical laryngeal displacement ½0tstK, in which the tongue retained a bunched posture 442
408 (7.6 mm) was observed in one token produced with a fully throughout. Mean affricate duration was 94 ms (initial stop 443
409 constricted glottis, one token produced with a partially con- to final stop, calculated over five tokens). Broadband energy 444
410 stricted glottis (5.2 mm) and in another produced with an of the short fricative burst extended from 1600 Hz up to the 445
411 open glottis (11.1 mm). These results suggest that, although Nyquist frequency (9950 Hz), with peaks at 3794 Hz and 446
412 canonically pulmonic, the meshed snare effect was variably 4937 Hz. 447
_
413 produced as partially ejective ([k’ç+]), or pre-glottalized The open T effect ½0ts+ was realized without the conclud- 448
414 ([?kç+]). ing stop gesture and prolongation of the alveolar sibilant, 449
during which the tongue dorsum was raised and the tongue 450
tip assumed a more apical posture at the alveolar ridge. 451
415 D. Articulation of hi-hat and cymbal effects Mean duration was 410 ms (initial stop to 10% threshold of 452
416 Five different effects categorized as “hi-hats” and two maximum fricative energy, calculated over five tokens). 453
417 effects categorized as cymbals were demonstrated by the Broadband energy throughout the fricative phase was con- 454
418 subject. All these sounds were produced either as affricates, centrated above 1600 Hz, and extended up to the Nyquist fre- 455
419 or as rapid sequences of stops and fricatives articulated at quency (9950 Hz), with peaks at 4883 Hz and 8289 Hz. 456
420 different places. Articulation of the hi-hat effect described as “closed: 457
421 Articulation of an “open K” hi-hat (SBN: kss) is illus- kiss teeth” is illustrated in Fig. 9. The image sequence was 458
422 trated in the sequence in Fig. 8, acquired over a 280 ms inter- acquired over a 430 ms interval during the second of six repe- 459
423 val during the fourth repetition. The rapid sequencing of a titions of this effect. An elongated constriction was first 460
424 dorsal stop followed by a long coronal fricative was similar formed against the alveolar ridge, extending from the back of 461
425 to that observed in the “meshed” snare (Sec. V C), except the upper teeth through to the hard palate (frame 98). Lingual 462
426 that the concluding fricative was realized as an apical alveo- articulation in this effect very closely resembles that of the 463
427 lar sibilant, in contrast to the bunched lingual posture of the clap snare (Figs. 5–7), except that a greater degree of labiali- 464
428 palatal sibilant in the snare effect. All seven tokens of this zation can be observed in some tokens. In all six tokens, the 465
429 hi-hat effect were primarily realized as pulmonic egressives, velum remained lowered throughout stop production, and the 466
430 again with variable laryngeal setting. Some degree of glottal effect concluded with a transient high-frequency fricative 467
431 constriction was observed in five of seven tokens, along with burst corresponding to affrication of the initial stop. In all 468
432 a small amount of laryngeal raising (mean vertical displace- tokens, laryngeal lowering was observed during initial 469
433 ment, all tokens ¼ 4.4 mm). The data suggest that the open K stop production, beginning at the onset of the stop burst, and 470
Njw . Frame 390: tongue pressed into palate; f391–392: initiation of downward
FIG. 7. Articulation of a “clap” snare drum effect as a labialized dental click ½8
lingual motion; f393: rarefaction of palatal cavity; f394–395: dental-alveolar influx resulting from coronal lenition while retaining posterior lingual seal; Note
that the velum remains lowered throughout click production.
6 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing
FIG. 8. Articulation of an “open K” hi-hat [ks+]. Frame 205: initial lingual posture; f206–209: dorsal stop production; f209–211: coronal fricative production.
471 lasting for an average of 137 ms. Mean vertical displacement different grooves were demonstrated, each performed at 510
472 of the larynx during this period was 3.8 mm. Partial three different target tempi nominated by the subject: slow 511
473 constriction of the glottis during this interval could be (88 b.p.m.), medium (95 b.p.m.), and fast (104 b.p.m.). 512
474 observed in four of six tokens. Although this effect was not Each groove was realized as a one-, two-, or four-bar repeat- 513
475 categorized as a glottalic ingressive, the laryngeal activity ing motif constructed in a common time signature (4 beat 514
476 suggests some degree of glottalization in some tokens, and is measures), demonstrated by repeating the sequence at least 515
477 consistent with the observations of Clements (2002), that three times. In the last two grooves, the subject improvised 516
478 “larynx lowering is not unique to implosives.” In summary, on the basic rhythmic structure, adding ornamentation and 517
479 this effect appears to be best described as a pre-labialized, varying the initial sequence to some extent. Between 518
480 voiceless nasal uvular-dental click ½w N8 j. two and five different percussion elements were combined 519
481 The final hi-hat effect was described as “breathy: into each groove (Table II). Broad phonetic descriptions 520
482 in-out.” Five tokens were demonstrated, all produced as have been used to describe the effects used, as the precise 521
483 voiceless fricatives. Mean fricative duration was 552 ms. realization of each sound varied with context, tempo and 522
484 Broadband energy was distributed up to the nyquist fre- complexity. 523
485 quency (9900 Hz), with a concentrated noise band located
486 between 1600 and 3700 Hz. Each repetition was articulated VI. TOWARDS A UNIFIED FORMAL DESCRIPTION OF 524
487 with a closed velum, a wide open glottis, labial protrusion, BEATBOXING PERFORMANCE 525
488 and a narrow constriction formed by an arched tongue dor-
489 sum approximating the junction between the hard and soft Having described the elemental combinatorial sound 526
490 palates. The effect may be characterized as an elongated effects of a beatboxing repertoire, we can consider formal- 527
491 labialized pulmonic egressive voiceless velar fricative isms for describing the ways in which these components are 528
493 As well as the hi-hat effects described above, the subject tion needs to be able to describe both the musical and lin- 530
494 demonstrated two cymbal sound effects that he described as guistic properties of this style—capturing both the metrical 531
495 “cymbal with a T” and “cymbal with a K.” The “T cymbal” structure of the performance and phonetic details of the con- 532
496 was realized as an elongated labialized pulmonic egressive stituent sounds. By incorporating IPA into standard percus- 533
497 voiceless alveolar-palatal affricate [tˆ)+w ]. Mean total dura- sion notation, we are able to describe both these dimensions 534
498 tion of five tokens was 522 ms, and broadband energy of the and the way they are coordinated. 535
499 concluding fricative was concentrated between 1700 and Although practices for representing non-pitched percus- 536
500 4000 Hz. The “K cymbal” was realized as a pulmonic egres- sion vary (Smith, 2005), notation on a conventional staff 537
501 sive sequence of a labialized voiceless velar stop followed typically makes use of a neutral or percussion clef, on 538
502 by a partially labialized palatal fricative ½kw ç+w . Mean total which each “pitch” represents an individual instrument in 539
503 duration of five tokens was 575 ms. Fricative energy was the percussion ensemble. Filled note heads are typically 540
504 concentrated between 1400 and 4000 Hz. used to represent drums, and cross-headed notes to annotate 541
cymbals; instruments are typically labeled at the beginning 542
of the score or the first time that they are introduced, along 543
505 E. Production of beatboxing sequences
with any notes about performance technique (Weinberg, 544
506 In addition to producing the individual percussion sound 1998). 545
507 effects described above, the subject demonstrated a number The notation system commonly used for music to be 546
508 of short beatboxing sequences in which he combined differ- performed on a “5-drum” percussion kit (Stone, 1980) is 547
509 ent effects to produce rhythmic motifs or “grooves.” Four ideal for describing human beatboxing performance because 548
J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing 7
TABLE II. Metrical structure and phonetic composition of four beatboxing with the multimedia, along with close phonetic transcriptions 578
sequences (grooves) demonstrated by the subject. and frame-by-frame annotations of each sequence. 579
549 the sound effects in the beatboxer’s repertoire typically cor- One of the most important findings of this study is that 587
550 respond to similar percussion instruments. The description all of the sounds effects produced by the beatbox artist were 588
551 can be refined and enhanced through the addition of IPA able to be described using IPA—an alphabet designed exclu- 589
552 “lyrics” on each note, to provide a more comprehensive sively for the description of contrastive (i.e., meaning encod- 590
553 description of the mechanisms of production of each sound ing) speech sounds. Although this study was limited to a 591
554 effect. single subject, these data suggest that even when the goals of 592
555 For example, the first groove demonstrated by the sub- human sound production are extra-linguistic, speakers will 593
556 ject in this experiment, entitled “Audio 2,” can be typically marshal patterns of articulatory coordination that 594
557 described using the score illustrated in Fig. 10. As in stand- are exploited in the phonologies of human languages. To a 595
558 ard non-pitched percussion notation, each instrumental certain extent, this is not surprising, since speakers of human 596
559 effect—in this case a kick drum and a hi-hat—is repre- languages and vocal percussionists are making use of the 597
560 sented on a dedicated line of the stave. The specific realiza- same vocal apparatus. 598
561 tion of each percussive element is further described on the The subject of this study is a speaker of American Eng- 599
562 accompanying lyrical scores using IPA. Either broad lish and Panamanian Spanish, neither of which makes use of 600
563 “phonemic” (Fig. 10) or fine phonetic (Fig. 11) transcrip- non-pulmonic consonants, yet he was able to produce a wide 601
564 tion of the mechanisms of sound production can be range of non-native consonantal sound effects, including 602
565 employed in this system. clicks and ejectives. The effects =˛jj==˛!==˛j= used to 603
emulate the sounds of specific types of snare drums and rim 604
shots appear to be very similar to consonants attested in 605
566 VII. COMPANION MULTIMEDIA CORPUS
many African languages, including Xhosa (Bantu language 606
567 Video and audio recordings of each of the effects and family, spoken in Eastern Cape, South Africa), Khoekhoe 607
568 beatboxing sequences described above have been made (Khoe, Botswana) and !Xo~o (Tuu, Namibia). The ejectives 608
569 available online at http://sail.usc.edu/span/beatboxing. For /p’/ and /pf’/ used to emulate kick and snare drums shares 609
570 each effect in the subject’s repertoire, audio-synchronized the same major phonetic properties as the glottalic egressives 610
571 video of the complete MRI acquisition is first presented, used in languages as diverse as Nuxaalk (Salishan, British 611
572 along with a one-third speed video excerpt demonstrating a Columbia), Chechen (Caucasian, Chechnya), and Hausa 612
573 single-token production of each target sound effect, and the (Chadic, Nigeria) (Miller et al., 2007; Ladefoged and Mad- 613
574 acoustic signal extracted from the corresponding segment of dieson, 1996). 614
575 the companion audio recording. A sequence of cropped, Without phonetic data acquired using the same imaging 615
576 numbered video frames showing major articulatory land- modality from native speakers, it is unclear how closely non- 616
577 marks involved in the production of each effect is presented native, paralinguistic sound effects resemble phonetic equiv- 617
alents produced by speakers of languages in which these 618
sounds are phonologically exploited. For example, in the 619
initial stages of articulation of all three kick drum effects 620
produced by the subject of this study, extensive lingual low- 621
ering is evident (Fig. 1, frame 98; Fig. 2, frame 93; Fig. 3, 622
frame 80), before the tongue and closed larynx are propelled 623
upward together. It would appear that in these cases, the 624
tongue is being used in concert with the larynx to generate a 625
more effective “piston” with which to expel air from the 626
vocal tract.3 It is not known if speakers of languages with 627
glottalic egressives also recruit the tongue in this way during 628
ejective production, or if coarticulatory and other constraints 629
prohibit such lingual activity. 630
FIG. 10. Broad transcription of beatboxing performance using standard
percussion notation: repeated one-bar, two-element groove entitled “Audio
More typologically diverse and more detailed data 631
2.” Phonetic realization of each percussion element is indicated beneath will be required to investigate differences in production 632
each voice in the score using broad transcription IPA “lyrics.” between these vocal percussion effects and the non-pulmonic 633
8 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing
FIG. 11. Fine transcription of beatboxing groove: two-bar, three-element groove entitled “Tried by Twelve” (88 b.p.m.). Detailed mechanisms of production
are indicated for each percussion element—“open hat” [ts], “no mesh snare” [p’f+], and “808 kick” [p’]—using fine transcription IPA lyrics.
634 consonants used in different languages. If, as it appears from C. Goals of production in paralinguistic vocalization 675
635 these data, such differences are minor rather than categorical,
A pervasive issue in the analysis and transcription of vocal 676
636 then it is remarkable that the patterns of articulatory coordi-
percussion is determining which aspects of articulation are 677
637 nation used in pursuit of paralinguistic goals appear to be
pertinent to the description of each sound effect. For example, 678
638 consistent with those used in the production of spoken
differences in tongue body posture were observed throughout 679
639 language.
the production of each of the kick drum sound effects—both 680
before initiation of the glottalic airstream and after release of 681
640 B. Sensitivity to and exploitation of fine phonetic
641 detail the ejective (Sec. V A). It is unclear which of these tongue 682
body movements are primarily related to the mechanics of pro- 683
642 Another important observation to be made from this duction—in particular, airstream initiation—and which dorsal 684
643 data is that the subject appears to be highly sensitive to ways activity is primarily motivated by sound shaping. 685
644 in which fine differences in articulation and duration can be Especially in the case of vocal percussion effects articu- 686
645 exploited for musical effect. Although broad classes of lated primarily as labials and coronals, we would expect to 687
646 sound effects were all produced with the same basic articula- see some degree of independence between tongue body/root 688
647 tory mechanisms, subtle differences in production were activity and other articulators, much as vocalic coarticula- 689
648 observed between tokens, consistent with the artist’s descrip- tory effects are observed to be pervasive throughout the pro- 690
649 tion of these as variant forms. duction of consonants (Wood, 1982; Gafos, 1999). In the 691
650 For example, a range of different kick and snare drum vocal percussion repertoire examined in this study, it appears 692
651 effects demonstrated in this study were all realized as labial that tongue body positioning after consonantal release is the 693
652 ejectives. Yet the subject appears to have been sensitive to most salient factor in sound shaping: the subject manipulates 694
653 ways that manipulation of the tongue mass can affect factors target dorsal posture to differentiate sounds and extend his 695
654 such as back-cavity resonance and airstream transience, and repertoire. Vocalic elements are included in the transcrip- 696
655 so was able to control for these factors to produce the subtle tions in Table I only when the data suggest that tongue pos- 697
656 but
_
salient differences _between the effects realized as ture is actively and contrastively controlled by the subject. 698
657 ½pf’+8
ç; ½p’8I; ½p’8
U, and [pf ’+8
ı]. More phonetic data is needed to determine how speakers 699
658 This musically motivated manipulation of fine phonetic control post-ejective tongue body posture, and the degree to 700
659 detail—while simultaneously preserving the basic articula- which the tongue root and larynx are coupled during the pro- 701
660 tory patterns associated with a particular class of percussion duction of glottalic ejectives. 702
661 effects—may be compared to the phonetic manifestation of
662 affective variability in speech. In order to convey emotional
D. Compositionality in vocal production 703
663 state and other paralinguistic factors, speakers routinely
664 manipulate voice quality (Scherer, 2003), the glottal source Although beatboxing is fundamentally an artistic activ- 704
665 waveform (Gobl and Nı Chasaide, 2003; Bone et al., 2010), ity, motivated by musical, rather than linguistic instincts, 705
666 and supralaryngeal articulatory setting (Erickson et al., sound production in this domain—like phonologically moti- 706
667 1998; Nordstrand et al., 2004), without altering the funda- vated vocalization—exhibits many of the properties of a dis- 707
668 mental phonological information encoded in the speech crete combinatorial system. Although highly complex 708
669 signal. Just as speakers are sensitive to ways that phonetic sequences of articulation are observed in the repertoire of 709
670 parameters may be manipulated within the constraints dic- the beatboxer, all of the activity analyzed here is ultimately 710
671 tated by the underlying sequences of articulatory primitives, reducible to coordinative structures of a small set of primi- 711
672 the beatbox artist is able to manipulate the production of a tives involving pulmonic, glottal, velic and labial states, and 712
673 percussion element for musical effect within the range of the lingual manipulation of stricture in different regions of 713
674 articulatory possibilities for each class of sounds. the vocal tract. 714
J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing 9
715 Further examination of beatboxing and other vocal with which to describe the vast majority of sound effects 770
716 imitation data may shed further light on the nature of compo- used by (and, we believe, potentially used by) beatboxers. 771
717 sitionality in vocal production—the extent to which the gen- Standard Beatboxing Notation has the advantage that it 772
718 erative primitives used in paralinguistic tasks are segmental, uses only Roman orthography, and appears to have gained 773
719 organic or gestural in nature, and whether these units are some currency in the beatboxing community, but it remains 774
720 coordinated using the same principles of temporal and spa- far from being standardized and is hampered by a consider- 775
721 tial organization which have been demonstrated in speech able degree of ambiguity. Many different types of kick and 776
722 production (e.g., Saltzman and Munhall, 1989). bass drum sounds, for example, are all typically transcribed 777
as “b” (see Splinter and Tyte, 2012), and conventions vary 778
as to how to augment the basic SBN vocabulary with more 779
723 E. Relationships between production and perception
detail about the effects being described. The use of IPA 780
724 Stowell and Plumbley (2010, p. 2) observe that “the (Stowell, 2012) eliminates all of these problems, allowing 781
725 musical sounds which beatboxers imitate may not sound the musician, artist, or observer to unambiguously describe 782
726 much like conventional vocal utterances. Therefore the any sequence of beatboxing effects at different levels of 783
727 vowel-consonant alternation which is typical of most use of detail. 784
728 voice may not be entirely suitable for producing a close au- The examples illustrated in Figs. 10 and 11 also demon- 785
729 ditory match.” Based on this observation, they conclude that strate how the musical characteristics of beatboxing per- 786
730 “beatboxers learn to produce sounds to match the sound pat- formance can be well described using standard percussion 787
731 terns they aim to replicate, attempting to overcome linguistic notation. In addition, it would be possible to make use of 788
732 patternings. Since human listeners are known to use linguis- other conventions of musical notation, including breath and 789
733 tic sound patterns as one cue to understanding a spoken pause marks, note ornamentation, accents, staccato, fermata, 790
734 voice… it seems likely that avoiding such patterns may help and dynamic markings to further enrich the utility of this 791
735 maintain the illusion of non-voice sound.” The results of this approach as a method of transcribing beatboxing perform- 792
736 study suggest that, even if the use of non-linguistic articula- ance. Stone (1980, pp. 205–225) outlines the vast system of 793
737 tion is a goal of production in human beatboxing, artists may extended notation that has been developed to describe the 794
738 be unable to avoid converging on some patterns of articula- different ensembles, effects and techniques used in tradi- 795
739 tion which have been exploited in human languages. The tional percussion performance; many of these same notation 796
740 fact that musical constraints dictate that these articulations conventions could easily be used in the description of human 797
741 may be organized suprasegmentally in patterns other than beatboxing performance, where IPA and standard musical 798
742 those which would result from syllabic and prosodic organi- notation is not sufficiently comprehensive. 799
743 zation may contribute to their perception as non-linguistic
744 sounds, especially when further modified by the skillful use
G. Future directions 800
745 of “close-mic” technique.
This work represents a first step towards the formal 801
study of the paralinguistic articulatory phonetics underlying 802
746 F. Approaches to beatboxing notation
an emerging genre of vocal performance. An obvious limita- 803
747 Describing beatboxing performance using the system tion of the current study is the use of a single subject. 804
748 outlined in Sec. VI offers some important advantages over Because beatboxing is a highly individualized artistic form, 805
749 other notational systems that have been proposed, such as examination of the repertoires of other beatbox artists would 806
750 mixed symbol alphabets (Stowell, 2012), Standard Beatbox- be an important step towards a more comprehensive under- 807
751 ing Notation (Splinter and Tyte, 2012) and English-based standing of the range of effects exploited in beatboxing, and 808
752 equivalents (Sinyor et al., 2005), and the use of tablature or the articulatory mechanisms involved in producing these 809
753 plain text (Stowell, 2012) to indicate metrical structure. The sounds. 810
754 system proposed here builds on two formal notation systems More sophisticated insights into the musical and pho- 811
755 with rich traditions, that have been developed, refined, and netic characteristics of vocal percussion will emerge from 812
756 accepted by international communities of musicians and analysis of acoustic recordings along with the companion 813
757 linguists, and which are also widely known amongst non- articulatory data. However, there are obstacles preventing 814
758 specialists. more extensive acoustic analysis of data acquired using cur- 815
759 The integration of IPA and standard percussion notation rent methodologies. The confined space and undamped surfa- 816
760 makes use of established methodologies that are sufficiently ces within an MRI scanner bore creates a highly resonant, 817
761 rich to describe any sound or musical idea that can be pro- echo-prone recording environment, which also varies with 818
762 duced by a beatboxer. There are ways of making sounds in the physical properties of the subject and the acoustic signa- 819
763 the vocal tract that are not represented in the IPA because ture of the scan sequence. The need for additional signal 820
764 they are unattested, have marginal status or serve only a spe- processing to attenuate scanner noise (Bresch et al., 2006) 821
765 cial role in human language (Eklund, 2008). Yet because the further degrades the acoustic fidelity of rtMRI recordings 822
766 performer’s repertoire makes use of the same vocal appara- which, while perfectly adequate for the qualitative analysis of 823
767 tus and is limited by the same physiological constraints that human percussion effects presented here, do not permit 824
768 have shaped human phonologies, the International Phonetic detailed time-series or spectral analysis. There is a need to 825
769 Alphabet and its extensions provides an ample vocabulary develop better in-scanner recording and noise-reduction 826
10 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing
827 technologies for rtMRI experimentation, especially for stud- repertoire of a human beatboxer, affording novel insights into 884
828 ies involving highly transient sounds, such as clicks, ejec- the mechanisms of production of the imitation percussion 885
829 tives, and imitated percussion sounds. effects that characterize this performance style. The data 886
830 Further insights into the mechanics of human beatbox- reveal that beatboxing performance involves the use of many 887
831 ing will also be gained through technological improvements of the airstream mechanisms found in human languages. The 888
832 in MR imaging. The use of imaging planes other than midsa- study of beatboxing performance has the potential to provide 889
833 gittal will allow for finer examination of many aspects of important insights into articulatory coordination in speech 890
834 articulation that may be exploited for acoustic effect, such as production, and mechanisms of perception of simultaneous 891
835 tongue lateralization and tongue groove formation. Since speech and music. 892
836 many beatbox effects appear to make use of non-pulmonic
837 airstream mechanisms, axial imaging could provide addi- ACKNOWLEDGMENTS 893
838 tional detail about the articulation of the larynx and glottis
This work was supported by National Institutes of 894
839 during ejective and implosive production.
Health grant R01 DC007124-01. Special thanks to our 895
840 Because clicks also carry a high functional load in the
experimental subject for demonstrating his musical and lin- 896
841 repertoire of many beatbox artists, higher-speech imaging of
guistic talents in the service of science. We are especially 897
842 the hard palate region would be particularly useful. One im-
grateful to three anonymous reviewers for their extensive 898
843 portant limitation of the rtMRI sequences used in this study
comments on earlier versions of this manuscript. 899
844 is that, unlike sagittal X-ray (Ladefoged and Traill, 1984),
845 the inside of the cavity is not well resolved during click
1 900
846 production; as a result, the precise location of the lingual- Preliminary analyzes of a subset of this corpus were originally described
in Proctor et al. (2010b) 901
847 velaric seal is not evident. Finer spatial sampling over thin- 2
All descriptions of dorsal posture were made by comparison to vowels 902
848 ner sagittal planes would provide greater insights into this produced by the subject in spoken and sung speech, and in the set of refer- 903
849 important aspect of click production. Strategic placement of ence vowels elicited using the [h_d] corpus. 904
3 905
850 coronal imaging slices would provide additional phonetic Special thanks to an anonymous reviewer for this observation.
906
851 detail about lingual coordination in the mid-oral region. Lat-
Atherton, M. (2007). “Rhythm-speak: Mnemonic, language play or song,” 907
852 eral clicks, which are exploited by many beatbox artists
in Proc. Inaugural Intl. Conf. on Music Communication Science 908
853 (Tyte, 2012), can only be properly examined using coronal (ICoMCS), Sydney, edited by E. Schubert et al., pp. 15–18. 909
854 or parasagittal slices, since the critical articulation occurs Bone, D., Kim, S., Lee, S., and Narayanan, S. (2010). “A study of intra- 910
855 away from the midsagittal plane. New techniques allowing speaker and interspeaker affective variability using electroglottograph and 911
inverse filtered glottal waveforms,” in Proc. Interspeech, Makuhari, 912
856 simultaneous dynamic imaging of multiple planes located at 913
pp. 913–916.
857 critical regions of the tract (Kim et al., 2012) hold promise Bresch, E., Kim, Y.-C., Nayak, K., Byrd, D., and Narayanan, S. (2008). 914
858 as viable methods of investigating these sounds, if temporal “Seeing speech: Capturing vocal tract shaping using real-time magnetic 915
859 resolution can be improved. resonance imaging [Exploratory DSP],” IEEE Signal Process. Mag. 25, 916
123–132. 917
860 Most importantly, there is a need to acquire phonetic 918
Bresch, E., Nielsen, J., Nayak, K., and Narayanan, S. (2006). “Synchronized
861 data from native speakers of languages whose phonologies and noise-robust audio recordings during realtime magnetic resonance 919
862 include some of the sounds exploited in the beatboxing reper- imaging scans,” J. Acoust. Soc. Am. 120, 1791–1794. 920
863 toire. MR images of natively produced ejectives, implosives Clements, N. (2002). “Explosives, implosives, and nonexplosives: The 921
linguistic function of air pressure differences in stops,” in Laboratory 922
864 and clicks—consonants for which there is little non-acoustic Phonology, edited by C. Gussenhoven and N. Warner (Mouton De 923
865 phonetic data available—would provide tremendous insights Gruyter, Berlin), Vol. 7, pp. 299–350. 924
866 into the articulatory and coordinative mechanisms involved in Eklund, R. (2008). “Pulmonic ingressive phonation: Diachronic and syn- 925
chronic characteristics, distribution and function in animal and human 926
867 the generation of these classes of sounds, and the differences
sound production and in human speech,” J. Int. Phonetic. Assoc. 38, 235– 927
868 between native, non-native, and paralinguistic production. 324. 928
869 Highly skilled beatbox artists such as Rahzel are capable Erickson, D., Fujimura, O., and Pardo, B. (1998). “Articulatory correlates of 929
870 of performing in a way which creates the illusion that the prosodic control: Emotion and emphasis,” Lang. Speech 41, 399–417. 930
Gafos, A. (1999). The Articulatory Basis of Locality in Phonology (Garland, 931
871 artist is simultaneously singing and providing their own per-
New York), pp. 272. 932
872 cussion accompaniment, or simultaneous beatboxing while Gobl, C., and Nı Chasaide, A. (2003). “The role of voice quality in commu- 933
873 humming (Stowell and Plumbley, 2008). Such illusions raise nicating emotion, mood and attitude,” Speech Comm. 40, 189–212. 934
874 important questions about the relationship between speech Hess, M. (2007). Icons of Hip Hop: An Encyclopedia of the Movement, 935
Music, and Culture (Greenwood Press, Westport), pp. 640. 936
875 production and perception, and the mechanisms of percep- 937
Hogan, J. (1976). “An analysis of the temporal features of ejective con-
876 tion that are engaged when a listener is presented with simul- sonants,” Phonetica 33, 275–284. 938
877 taneous speech and music signals. It would be of great Kapur, A., Benning, M., and Tzanetakis, G. (2004). “Query-by-beat-boxing: 939
878 interest to study this type of performance using MR Imaging, Music retrieval for the DJ,” in Proc. 5th Intl. Conf. on Music Information 940
Retrieval (ISMIR), Barcelona, pp. 170–178. 941
879 to examine the ways in which linguistic and paralinguistic 942
Kim, Y.-C., Proctor, M. I., Narayanan, S. S., and Nayak, K. S. (2012).
880 gestures can be coordinated. “Improved imaging of lingual articulation using real-time multislice 943
MRI,” J. Magn. Resonance Imaging 35, 943–948. 944
Kingston, J. (2005). “The phonetics of Athabaskan tonogenesis,” in 945
881 IX. CONCLUSIONS Athabaskan Prosody, edited by S. Hargus and K. Rice (John Benjamins, 946
Amsterdam), pp. 137–184. 947
882 Real-Time Magnetic Resonance Imaging has been Ladefoged, P., and Maddieson, I. (1996). The Sounds of the World’s Lan- 948
883 shown to be a viable method with which to examine the guages (Blackwell, Oxford), pp. 426. 949
J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing 11
950 Ladefoged, P., and Traill, A. (1984). “Linguistic phonetic descriptions of Saltzman, E. L., and Munhall, K. G. (1989). “A dynamical approach to ges- 988
951 clicks,” Language 60, 1–20. tural patterning in speech production,” Ecol. Psychol. 1, 333–382. 989
952 Lederer, K. (2005). “The phonetics of beatboxing,” BA dissertation, Leeds Scherer, K. (2003). “Vocal communication of emotion: A review of research 990
953 Univ., UK. paradigms,” Speech Comm. 40, 227–256. 991
954 Lindau, M. (1984). “Phonetic differences in glottalic consonants,” J. Pho- Sinyor, E., Rebecca, C. M., Mcennis, D., and Fujinaga, I. (2005). “Beatbox 992
955 netics 12, 147–155. classification using ACE,” in Proc. Intl. Conf. on Music Information 993
956 Lisker, L., and Abramson, A. (1964). “A cross-language study of voicing in Retrieval, London, pp. 672–675. 994
957 initial stops: Acoustical measurements,” Word 20, 384–422. Smith, A. G. (2005). “An examination of notation in selected repertoire for 995
958 Maddieson, I., Smith, C., and Bessell, N. (2001). “Aspects of the phonetics multiple percussion,” Ph.D. dissertation, Ohio State Univ., Columbus, OH. 996
959 of Tlingit,” Anthropolog. Ling. 43, 135–176. Splinter, M., and Tyte, G. (2006–2012). “Standard beatbox notation,” http: 997
960 McDonough, J., and Wood, V. (2008). “The stop contrasts of the Athabas- //www.humanbeatbox.com/tips/p2_articleid/231 (Last viewed February 998
961 kan languages,” J. Phonetics 36, 427–449. 16, 2012). 999
962 McLean, A., and Wiggins, G. (2009). “Words, movement and timbre,” in Stone, K. (1980). Music Notation in the Twentieth Century: A Practical 1000
963 Proc. Intl. Conf. on New Interfaces for Musical Expression (NIME’09), Guidebook (W. W. Norton, New York), 357 pp. 1001
964 edited by A. Zahler and R. Dannenberg (Carnegie Mellon Univ., Pitts- Stowell, D. (2008–2012). “The beatbox alphabet,” http://www.mcld.co.uk/ 1002
965 burgh, PA), pp. 276–279. beatboxalphabet/ (Last viewed February 22, 2012). 1003
966 Miller, A., Namaseb, L., and Iskarous, K. (2007). “Tongue body constriction Stowell, D. (2010). “Making music through real-time voice timbre analysis: 1004
967 differences in click types,” in Laboratory Phonology, edited by J. Cole machine learning and timbral control,” Ph.D. dissertation, School of Elec- 1005
968 and J. Hualde (Mouton de Gruyter, Berlin), Vol. 9, 643–656. tronic Engineering and Computer Science, Queen Mary Univ., London. 1006
969 Miller, A. L., Brugman, J., Sands, B., Namaseb, L., Exter, M., and Collins, Stowell, D., and Plumbley, M. D. (2008). “Characteristics of the beatboxing 1007
970 C. (2009). “Differences in airstream and posterior place of articulation vocal style,” Technical Report C4DM-TR-08-01 (Centre for Digital Music, 1008
971 among Nuu clicks” JIPA 39, 129–161. Dep. of Electronic Engineering, Univ. of London, London), pp. 1–4. 1009
972 Narayanan, S., Bresch, E., Ghosh, P. K., Goldstein, L., Katsamanis, A., Stowell, D., and Plumbley, M. D. (2010). “Delayed decision-making in real- 1010
973 Kim, Y.-C., Lammert, A., Proctor, M. I., Ramanarayanan, V., and Zhu, Y. time beatbox percussion classification,” J. New Music Res. 39, 203–213. 1011
974 (2011). “A multimodal real-time MRI articulatory corpus for speech Sundara, M. (2005). “Acoustic-phonetics of coronal stops: A cross-language 1012
975 research,” in Proc. Interspeech, Florence, pp. 837–840. study of Canadian English and Canadian French,” J. Acoust. Soc. Am. 1013
976 Narayanan, S., Nayak, K., Lee, S., Sethy, A., and Byrd, D. (2004). “An 118, 1026–1037. 1014
977 approach to realtime magnetic resonance imaging for speech production,” Traill, A. (1985). Phonetic and Phonological Studies of!Xo~ o Bushman (Hel- 1015
978 J. Acoust. Soc. Am. 115, 1771–1776. mut Buske, Hamburg, Germany), 215 pp. 1016
979 Nordstrand, M., Svanfeldt, G., Granstrm, B., and House, D. (2004). Tyte, G. (2012). “Beatboxing techniques,” www.humanbeatbox.com (Last 1017
980 “Measurements of articulatory variation in expressive speech for a set of viewed February 16, 2012). 1018
981 Swedish vowels,” Speech Comm. 44, 187–196. Weinberg, N. (1998). Guide to Standardized Drumset Notation (Percussive 1019
982 Proctor, M. I., Bone, D., and Narayanan, S. S. (2010a). “Rapid semi- Arts Society, Lawton, OK), pp. 43. 1020
983 automatic segmentation of real-time Magnetic Resonance Images for para- Wood, S. (1982). “X-Ray and model studies of vowel articulation,” in Work- 1021
984 metric vocal tract analysis,” in Proc. Interspeech, Makuhari, pp. 23–28. ing Papers in Linguistics (Dep. Linguistics, Lund Univ. Lund, Sweden), 1022
985 Proctor, M. I., Nayak, K. S., and Narayanan, S. S. (2010b). “Linguistic and Vol. 23, pp. 192. 1023
986 para-linguistic mechanisms of production in human “beatboxing”: A Wright, R., Hargus, S., and Davis, K. (2002). “On the categorization of ejec- 1024
987 rtMRI study,” in Proc. Intersinging, Univ. of Tokyo, pp. 1576–1579. tives: Data from Witsuwit’en” J. Int. Phonetics Assoc. 32, 43–77. 1025
12 J. Acoust. Soc. Am., Vol. 133, No. 2, February 2013 Proctor et al.: Mechanisms of production in human beatboxing