
Computational phonogram archiving

Michael Blaß, Jost Leonhardt Fischer, and Niko Plath

Citation: Physics Today 73, 12, 50 (2020); doi: 10.1063/PT.3.4636


Opening artwork: Colores Salsa (2018), Valerie Vescovi, www.valvescoviart.com



Michael Blaß, Jost Leonhardt Fischer, and
Niko Plath are researchers at the University
of Hamburg’s Institute of Systematic
Musicology in Hamburg, Germany.


A general approach to analyzing audio files makes it possible to re-create the sound of ancient instruments, identify cross-cultural musical properties, and more.

Music archives store vast amounts of audio data and serve different
interests. Some of the most prominent archives belong to the
commercial streaming services that deliver music on demand
to worldwide audiences. Users of Pandora, SoundCloud, Spotify,
and other streamers access music through curated or user-
generated playlists. Alternatively, they can search for an artist, song, or genre and
scroll through the results on their devices.
The streamers’ collaborative filtering algorithms also generate recommendations based on the preferences of other users. Motivated by commercial gain, major music companies promote their content by inducing the streaming services to suggest it. Their songs duly appear in heavy rotation. Combined, the filtering and the manipulation lead to so-called echo chambers, in which the recommendations in a distinct group of people gain a self-reinforcing momentum. Members end up listening to an identical subset of the music catalog.

Ethnomusicological archives are another type of audio repository. They aim to provide access to a wide variety of field recordings from different cultures around the world. The Smithsonian Folkways Recordings, the Berlin Phonogram Archive, and the Ethnographic Sound Recordings Archive (ESRA, esra.fbkultur.uni-hamburg.de), in which the three of us are involved, are just three examples. The reasons for maintaining such archives are various. Some are dedicated to preserving musical cultural heritage; others focus on education.

Repositories of a third type store the sounds of musical instruments. The institutions that maintain them focus on providing access to large sets of well-recorded sound samples.


One use of such data is for the physical modeling of musical instruments. Researchers solve the differential equations of continuum dynamics to create realistic sounds that mimic the recordings. To make that possible, archives store the responses of instrument parts to impulsive forces and imaging data—such as CT scans of instruments or computer-aided design (CAD) models of them. Also stored are textual data from bibliographical research, such as instrument descriptions, provenance, and technical drawings. The databases are valuable for investigations of historic instruments whose surviving examples are unplayable.

The increasing demand in digital humanities for big data sets has led to another type of archive as cultural institutions rush to digitize their musical recordings and other holdings. Unfortunately, when such projects are complete, the result is often a vast trove of uncurated data.

The subject of our article, computational phonogram archiving, has emerged as a unified solution to tackling some of the generic problems of music archives, such as classifying recordings, while also addressing their particular shortcomings, such as echo-chamber playlists.1 Its primary approach is to analyze, organize, and eventually understand music by comparing large sets of musical pieces in an automated manner.

As we describe below, the first step in computational phonogram archiving involves using extraction algorithms to transform music into numerical representations of melody, timbre, rhythm, form, and other properties. In the second step, machine learning algorithms derive mood, genre, meaning, and other higher-level representations of music.

From audio files to sonic content
Music archives consist of audio files and accompanying metadata in text format. Metadata typically include extra-musical information, such as the artist’s name, genre, title, year of recording, and publisher. Archives of ethnographic field recordings also collect and store metadata about the origin of the audio. That information could include geographic region, GPS-derived latitude and longitude, the ethnic identity of the performers, and additional notes about the circumstances of the recording. If an archive focuses on preserving old recordings, metadata might also include information about the condition of the original media.

By using a search engine, users can query an archive’s database through its metadata. But when the metadata are incomplete or absent, an alternative strategy comes into play.2 Algorithms applied to musical recordings compute numerical representations of the characteristics of the audio data. At the most basic level, the representations, called audio features, quantify specific signal properties, such as the number of times per second the waveform crosses the zero-level axis and the amount of energy in various frequency bands.

To compute audio features that correlate with the auditory perception of human listeners, researchers employ methods drawn from the field of psychoacoustics.3 Since Hermann von Helmholtz’s groundbreaking derivation in 1863 of the Western major scale from perceptions of musical roughness, psychoacoustical investigations have explored the diverse relationships between physical properties of sound and the effects they have on auditory perception. The perception of loudness, for example, is not only proportional to amplitude, it is also strongly dependent on frequency and on whether or not a sound’s component frequencies are integer multiples of each other.

Search and retrieval engines that interrogate audio files can directly compare data sets only when they are the same size. One way to circumvent that limitation is to resolve the natural time dependence of the audio features. To that end, several applications use central moments, such as the expected value and variance, to aggregate over the time dimension. In subsequent stages of the interrogation, those values replace the original features.
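To make the idea concrete, here is a minimal sketch, not the ESRA pipeline itself, of how two of the audio features mentioned above (zero-crossing rate and band energy) might be computed and then collapsed into a fixed-length vector with central moments. It assumes NumPy and SciPy and a recording already loaded as a waveform array; the frame size and band edges are illustrative choices.

```python
import numpy as np
from scipy.signal import stft

def basic_audio_features(signal, sample_rate, frame=2048, hop=512,
                         bands=((0, 200), (200, 800), (800, 3200), (3200, 8000))):
    """Frame-wise zero-crossing rate and band energies for one audio file."""
    # Zero-crossing rate per frame: sign changes divided by the frame duration.
    n_frames = 1 + (len(signal) - frame) // hop
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame]
        crossings = np.count_nonzero(np.signbit(chunk[1:]) != np.signbit(chunk[:-1]))
        zcr[i] = crossings / (frame / sample_rate)

    # Energy in a few frequency bands, taken from a short-time Fourier transform.
    freqs, _, spec = stft(signal, fs=sample_rate, nperseg=frame, noverlap=frame - hop)
    power = np.abs(spec) ** 2
    band_energy = np.vstack([power[(freqs >= lo) & (freqs < hi)].sum(axis=0)
                             for lo, hi in bands])
    return zcr, band_energy

def aggregate(feature_track):
    """Collapse a time-varying feature into central moments (mean and variance),
    so that recordings of different lengths become comparable fixed-size vectors."""
    return np.array([feature_track.mean(), feature_track.var()])

# Toy usage with a synthetic one-second signal standing in for an archive recording.
sr = 22_050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).normal(size=sr)
zcr, bands = basic_audio_features(x, sr)
feature_vector = np.concatenate([aggregate(zcr)] + [aggregate(b) for b in bands])
print(feature_vector.shape)  # one fixed-length vector per recording
```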

FINDING RHYTHM WITH A HIDDEN MARKOV MODEL

Researchers at the Ethnographic Sound Recordings Archive use a hidden Markov model (HMM) to derive a representation of a piece’s rhythm in terms of the succession of timbres it includes. An onset detection algorithm estimates the time point of the beginning of each note. Feature extraction is performed only around those time points.

The training procedure considers the probabilities of changing from one timbre to another and then adapts the model’s transition probability matrix. The matrix therefore embodies the internal structure of a trained rhythm. As such, it functions as a prototype of the overall “global” rhythm of a piece of music. The original time series is one instance of the global rhythmical profile.

How the process plays out is represented by the accompanying figures. The top one illustrates a drum groove as a waveform diagram. Colored rectangles mark the regions subject to feature extraction. The colors correspond to the three instruments that played the groove. Two of them, bass drum (green) and snare drum (blue), are treated individually. The third instrument, the hi-hat, is split into two: hi-hat 2 (yellow), which is the cymbal’s pure sound, and hi-hat 1 (coral pink), which is a polyphonic timbre that combines the pure hi-hat sound with the decay of the bass drum.

The lower figure shows a graphical representation of the rhythmical profile. Each node corresponds to an HMM state. Edges represent the node transitions with probabilities greater than zero. The HMM quantifies a counterintuitive finding: Timbre influences how rhythm is perceived.
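The box’s workflow can be sketched in a few lines of Python. The snippet below is a hedged illustration rather than the archive’s actual code: it assumes the third-party hmmlearn package and fabricates a matrix of timbre features, one row per detected note onset. The trained transition matrix plays the role of the rhythmical profile described above.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed third-party dependency

# Hypothetical input: one timbre-feature vector per detected note onset,
# e.g. the output of the feature extraction sketched earlier.
rng = np.random.default_rng(0)
onset_features = np.vstack([rng.normal(loc=m, scale=0.3, size=(40, 6))
                            for m in (0.0, 1.5, 3.0, 4.5)])  # four fake timbres

# One hidden state per timbre class (bass drum, snare, hi-hat 1, hi-hat 2).
model = GaussianHMM(n_components=4, covariance_type="diag", n_iter=200,
                    random_state=0)
model.fit(onset_features)

# The trained transition matrix is the piece's "global" rhythmical profile:
# entry (i, j) is the probability that timbre i is followed by timbre j.
np.set_printoptions(precision=2, suppress=True)
print(model.transmat_)

# The most likely state sequence labels each onset with a timbre class.
print(model.predict(onset_features)[:16])
```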



Another approach is to model the distribution of an audio feature by using a mixture model. In such a case, the estimated model parameters stand in for the original time series. A similar but more specialized approach is the use of a hidden Markov model (HMM).4 The parameters of a trained HMM provide insight into the process that might have generated a particular instance at hand (see the box “Finding rhythm with a hidden Markov model”). Such a representation is especially useful in archives that feature music from different cultures, because its generality allows for an unbiased, extra-cultural viewpoint on the sound.

Machine learning and music similarity
Audio features consist of vectors—that is, sets of points—in a high-dimensional space. How can such a point cloud help users navigate through an archive’s contents? The key idea is that similar feature vectors represent similar regions of the space in which they reside. The vectors’ mutual distance is therefore a natural measure of their similarity. The same is also true for the musical pieces that the vectors characterize. If their feature vectors are close together, their content is similar. Methods from artificial intelligence enable the exploration of enormous sets of feature vectors.5

A straightforward approach, at least for retrieving records that resemble a particular input, is to use the k-nearest neighbors algorithm, which assigns an object to a class based on its similarity to nearby objects in a neighborhood of iteratively adjustable radius. Another way to classify the feature space is to partition it such that the members of a particular partition are similar. Such cluster analyses are conducted using methods like the k-means and expectation–maximization family of algorithms, agglomerative clustering, and the DBSCAN algorithm.
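As a rough illustration of those retrieval and clustering steps, the following sketch (assuming scikit-learn and a hypothetical matrix of per-recording feature vectors) runs a nearest-neighbor query and two of the clustering methods named above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans, DBSCAN

# Hypothetical archive: one fixed-length feature vector per recording.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 10))

# Retrieval: find the five recordings whose features lie closest to a query.
index = NearestNeighbors(n_neighbors=5).fit(features)
distances, neighbor_ids = index.kneighbors(features[42:43])
print("recordings most similar to #42:", neighbor_ids[0])

# Clustering: partition the feature space so that members of a partition
# are similar, either with a fixed number of clusters (k-means) ...
kmeans_labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
# ... or by density, which also flags outliers as noise (label -1).
dbscan_labels = DBSCAN(eps=2.5, min_samples=5).fit_predict(features)
print(np.bincount(kmeans_labels))
```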
However, to truly explore the inherent similarities in the feature space, users need to somehow move through it. For that to be feasible, the complexity of the space has to be mitigated by reducing its dimensionality. Principal component analysis achieves that goal by projecting a data set onto a set of orthogonal vectors that account for the most variance in the data set.

An alternative approach, the self-organizing map (SOM),6 is popular for reducing dimensionality in applications that entail retrieving musical information.7 The SOM estimates the similarity in the feature space by projecting it onto a regular, two-dimensional grid composed of artificial “neurons.” Users of the archive can browse the resulting grid to explore the original feature space.

Another popular approach for reducing dimensionality is to use an embedding method, which maps discrete variables to a vector of continuous numbers. By coloring data points that correspond to the available metadata, users can cluster the embedding visually. Popular methods for doing this are t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).

Figure 1 illustrates an example of the above approach. ESRA currently holds three collections of recordings from different time spans: 1910–48, 1932, and 2005–18. When a SOM clustered the assets of ESRA by timbre, it revealed that timbre features appear to be sensitive to the noise floor introduced by different recording equipment—that is, the SOM segregates input data by the shape of the noise. Some pieces lack dates. One of the SOM’s applications is to estimate the year a piece was recorded.

FIGURE 1. A SELF-ORGANIZING MAP (SOM) trained with the timbre features of audio files in the Ethnographic Sound Recordings Archive (ESRA). Circles mark the position to which the SOM assigns the audio feature vectors of individual audio files. The circles’ color denotes which of three ESRA collections the file belongs to. The clustering of the collections reflects the audio files’ noise floor, which, in turn, is related to the recording technology. In particular, the oldest recordings (yellow dots) tend to be the noisiest (blue contours).
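A toy SOM of the kind behind figure 1 can be written directly in NumPy. The sketch below is illustrative only (the grid size, learning-rate schedule, and synthetic timbre features are assumptions): each training step pulls the best-matching neuron and its grid neighbors toward a sample, so similar recordings end up in nearby cells.

```python
import numpy as np

def train_som(features, grid=(12, 12), n_iter=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Fit a toy self-organizing map: a grid of 'neurons' whose weight vectors
    are nudged toward the training data, so that nearby grid cells end up
    representing similar regions of the feature space."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of every neuron, used by the neighborhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    # Initialize neuron weights from randomly chosen feature vectors.
    weights = features[rng.integers(0, len(features), rows * cols)].astype(float).copy()

    for t in range(n_iter):
        x = features[rng.integers(0, len(features))]
        # Decay the learning rate and the neighborhood radius over time.
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 0.5
        # Best-matching unit: the neuron closest to the sample.
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Gaussian neighborhood around the BMU on the grid.
        dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))
        # Pull every neuron toward the sample, scaled by its neighborhood weight.
        weights += lr * h[:, None] * (x - weights)

    return weights, coords

def map_position(weights, coords, x):
    """Return the grid cell to which the SOM assigns feature vector x."""
    return tuple(map(int, coords[np.argmin(np.linalg.norm(weights - x, axis=1))]))

# Toy usage: 200 hypothetical timbre-feature vectors with 20 dimensions each.
X = np.random.default_rng(1).normal(size=(200, 20))
w, c = train_som(X)
print(map_position(w, c, X[0]))
```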

Physical modeling
Physical modeling of musical instruments is a standard method in systematic musicology.8 The models themselves are typically based on two numerical methods: finite difference time domain (FDTD) and finite element method (FEM). Computed on parallel CPUs, GPUs (graphical processing units), or FPGAs (field-programmable gate arrays), the two methods can solve multiple coupled differential equations with complex boundary conditions in real time.

Solving the differential equations that embody an instrument’s plates, membranes, strings, and moving air is an iterative process that aims to reproduce musical sound. The solutions can also deepen understanding of how instruments produce sound. Instrument builders use models to estimate how an instrument would sound with altered geometry or with materials of different properties without having to resort to physical prototypes.
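A minimal FDTD example gives the flavor of such models. The sketch below, which is not the authors’ code, steps the one-dimensional wave equation for an ideal plucked string with fixed ends; the length, wave speed, and pluck position are assumed values chosen only for illustration.

```python
import numpy as np

# Minimal FDTD sketch: an ideal plucked string, i.e., the 1D wave equation
# u_tt = c^2 u_xx with clamped ends, stepped explicitly with central differences.
L = 0.65           # string length in meters (assumed value)
c = 200.0          # transverse wave speed in m/s (assumed value)
nx = 200           # number of spatial grid points
dx = L / (nx - 1)
dt = 0.9 * dx / c  # time step chosen to satisfy the CFL stability condition
r2 = (c * dt / dx) ** 2

x = np.linspace(0.0, L, nx)
# Initial condition: a triangular "pluck" one quarter of the way along the string.
pluck = 0.25 * L
u_prev = np.where(x < pluck, x / pluck, (L - x) / (L - pluck)) * 0.005
u_curr = u_prev.copy()  # start from rest: zero initial velocity
output = []             # "pickup" signal sampled near one end

for _ in range(5000):
    u_next = np.zeros_like(u_curr)
    # Update the interior points; the end points stay clamped at zero.
    u_next[1:-1] = (2 * u_curr[1:-1] - u_prev[1:-1]
                    + r2 * (u_curr[2:] - 2 * u_curr[1:-1] + u_curr[:-2]))
    u_prev, u_curr = u_curr, u_next
    output.append(u_curr[int(0.9 * nx)])  # sample displacement near the bridge

print(f"synthesized {len(output)} samples at {1/dt:.0f} Hz")
```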
Brazilian rosewood, cocobolo, and other traditional tonewoods have become scarce because of climate change and import restrictions. Another use of models is to identify substitute materials. Conversely, some historical instruments, notably organs and harpsichords, no longer sound as they used to because their wood has aged and their pipes, boards, frames, and other components have changed shape. Models can reproduce the instruments’ pristine sound.

A machine learning model can estimate both the range of possible historical instrument sounds and the performance of new materials by learning the parameter space of the physical model. After inserting the synthesized sounds that result into an archive and applying the methods of computational sound archiving, the essential sound characteristics of instruments can be estimated. Although the field is still emerging, a scientifically robust estimation of historical instrument sounds is in sight for the first time.

Resurrection of a bone flute
As simulations proliferate, the need emerges to compare virtually created sounds with huge collections of sounds of real instruments. An example of such a comparison is the investigation of the tonal quality of an unplayable musical instrument: an incomplete 15th-century bone flute from the Swiss Alps.

Like its modern relative the recorder, a bone flute consists of a hollow cylinder with holes for different notes. The sound is produced when air is blown through a narrow windway over a sharp edge, the labium. The windway and mouthpiece are formed by a block that almost fills the flute’s top opening. The bone flute in our example is unplayable because the block, which was presumably made of beeswax, has not survived.

The first step in resurrecting the flute’s sounds is to scan the object with high-resolution x-ray CT. The resulting 3D map characterizes both the object’s shape and its density. The data are then ingested into the archive, where they are subjected to segmentation. Widely used in medical imaging to identify and delineate tissue, segmentation allows substructures with homogeneous properties to be extracted from a heterogeneous object.

Each voxel (a pixel in 3D space) is allocated to exactly one of the defined subvolumes. Since substructures are generated according to their homogeneous properties, the respective surface data adequately represent the body. What’s more, replacing myriad voxels with a modest number of polygon meshes leads to a massive reduction of data.

A set of 3D meshes of instrument parts can be stored in the database for subsequent research tasks. The meshes are even small enough to be shared via email. The reconstructed polygon surface mesh is transformed into a parametric model and further modified using CAD. Once the virtual flute is made, it can be numerically fitted with a selection of differently shaped blocks that guide the air flow to strike the labium at varied angles.

And that’s just what we did. Our model considered three angles (0°, 10°, and 20°) at which the blown air left the windway and struck the labium. It also considered two speeds (5 m/s and 10 m/s) at which the blown air entered the mouthpiece. We used the initial speed because the main dynamics that characterizes the generation and formation of tone—what players of wind instruments call articulation—takes place in the initial few milliseconds of the blowing process.

The sound optimization was carried out with OpenFOAM, an open-source toolbox for computational fluid dynamics and computational aeroacoustics. We set the tools to work solving the compressible Navier–Stokes equations and a suitable description of the turbulence. The six different combinations of angle and blowing speed were calculated in parallel on the 256 cores of the Hamburg University computing cluster. Time series of relevant physical properties were sampled from data and analyzed to find the optimal articulation of the instrument.

FIGURE 2. A VISUALIZATION OF THE PRESSURE FIELD around the mouthpiece of a simulated 15th-century bone flute 2.5 ms after the virtual player starts blowing.

Figure 2 visualizes the pressure field in the generator region of the virtually intact bone flute. The simulated sound pressure-level spectra of the six combinations of angle and blowing speed revealed that one, 20° and 5 m/s, produced the best and most stable sound, a finding that was corroborated with measurements of a 3D-printed replica of the bone flute. The five other combinations led to attenuation of the transient and unstable or suboptimal sound production.
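The comparison of the six articulations can be imitated with a short post-processing sketch. The pressure signals below are synthetic stand-ins (the real ones would be sampled from the OpenFOAM runs), and the ranking criterion, the prominence of the spectral peak over the background, is an assumption used only to illustrate the idea of picking the most stable articulation from sound-pressure-level spectra.

```python
import numpy as np

P_REF = 20e-6  # standard reference pressure in pascals

def spl_spectrum(pressure, sample_rate):
    """Sound-pressure-level spectrum (dB re 20 µPa) of a pressure signal."""
    spectrum = np.fft.rfft(pressure * np.hanning(len(pressure)))
    amplitude = np.abs(spectrum) / len(pressure)
    freqs = np.fft.rfftfreq(len(pressure), d=1.0 / sample_rate)
    return freqs, 20.0 * np.log10(np.maximum(amplitude, 1e-12) / P_REF)

# Toy stand-ins for the six angle/speed combinations: sine tones with
# different amounts of turbulence-like noise.
rng = np.random.default_rng(0)
fs = 48_000
t = np.arange(0, 0.2, 1 / fs)
runs = {f"{angle} deg, {v} m/s": np.sin(2 * np.pi * 440 * t) * a + rng.normal(0, n, t.size)
        for (angle, v, a, n) in [(0, 5, 0.2, 0.05), (10, 5, 0.5, 0.05), (20, 5, 1.0, 0.02),
                                 (0, 10, 0.3, 0.2), (10, 10, 0.4, 0.2), (20, 10, 0.6, 0.1)]}

# Rank the runs by how far the spectral peak rises above the background.
def peak_prominence(signal):
    _, spl = spl_spectrum(signal, fs)
    return spl.max() - np.median(spl)

best = max(runs, key=lambda name: peak_prominence(runs[name]))
print("most stable articulation:", best)
```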
Virtually generated instrument geometries can be subjected to physical modeling, just like a real instrument. In particular, we could investigate more complex dynamics such as overblowing, a technique that Albert Ayler, John Coltrane, and other avant-garde saxophonists used to extend the instrument’s sonic landscape.

Further applications
Computational phonogram archiving can have political implications. Some multiethnic states define ethnic groups according to their languages and how or whether the languages are related to each other. Like language, music is an essential feature of ethnic identity. Using the methods presented in this article, ethnologists can investigate ethnic relations and definitions from the perspective of sound in ways less susceptible to human prejudice than traditional methods are.
It’s also possible with the tools of computational phonogram archiving to reconstruct historical migrations of music’s rhythms, timbres, and melodies. Another application is the investigation into the ways that individual musical expressions relate to universal cross-cultural musical properties such as the octave.

The doctrine of the affections is a Baroque-era theory that sought to connect aspects of painting, music, and other arts to human emotions. The descending minor second interval, for example, has always been associated with a feeling of sadness. Other, perhaps previously unrecognized associations—historical and contemporary—could be revealed through computational phonogram archiving. Composers and compilers of movie soundtracks might profit from that approach. Given a model trained on features of musical meaning, they could choose appropriate film music by matching a particular set of emotions and signatures.

The analysis of the underlying psychoacoustic features could illuminate the musical needs of historical eras. Whether a piece of music had a particular function is often subject to debate, which could be resolved through automated comparison with the contents of a music archive.

Indeed, the power and promise of computational phonogram archiving derives from its ability to address problems on different scales. It enables researchers to compare large-scale entities, like musical cultures, and small-scale entities, like differences in the use of distortion by lead guitarists.

REFERENCES
1. R. Bader, ed., Computational Phonogram Archiving, Springer (2019).
2. I. Guyon et al., eds., Feature Extraction: Foundations and Applications, Springer (2006).
3. H. Fastl, E. Zwicker, Psychoacoustics: Facts and Models, 3rd ed., Springer (2007).
4. W. Zucchini, I. L. MacDonald, Hidden Markov Models for Time Series: An Introduction Using R, Chapman & Hall/CRC (2009).
5. J. Kacprzyk, W. Pedrycz, eds., Springer Handbook of Computational Intelligence, Springer (2015).
6. T. Kohonen, Self-Organizing Maps, 3rd ed., Springer (2001).
7. M. Leman, Music and Schema Theory: Cognitive Foundations of Systematic Musicology, Springer (1995).
8. R. Bader, ed., Springer Handbook of Systematic Musicology, Springer (2018).

