Modeling and Control of Expressiveness in Music Performance
SERGIO CANAZZA, GIOVANNI DE POLI, MEMBER, IEEE, CARLO DRIOLI, MEMBER, IEEE,
ANTONIO RODÀ, AND ALVISE VIDOLIN
Invited Paper
performance, or $K_{BR}$, $K_{AD}$, and $K_{EC}$ for audio performance.

The fourth level represents the internal parameters of the expressiveness model. We will use, as expressive representation, a couple of values $(k, m)$ for every P-parameter. The meaning of these values will be explained in the next section.

The last level is the control space (i.e., the user interface), which controls, at an abstract level, the expressive content and the interaction between the user and the audio object of the multimedia product.

A. The Expressiveness Model

The model is based on the hypothesis, introduced in Section II, that different expressive intentions can be obtained by suitable modifications of a neutral performance. The transformations realized by the model should satisfy some conditions: 1) they have to maintain the relation between structure and expressive patterns and 2) they should introduce as few parameters as possible to keep the model simple. In order to represent the main characteristics of the performances, we used only two transformations: shift and range expansion/compression. Different strategies were tested. Good results were obtained [30] by a linear instantaneous mapping that, for every P-parameter and a given expressive intention $e$, is formally represented by

$$\hat p_e(i) = m_e\,\bar p_n + k_e\,[\,p_n(i) - \bar p_n\,] \qquad (1)$$

where $\hat p_e$ is the estimated profile of the performance related to expressive intention $e$, $p_n(i)$ is the value of the P-parameter of the $i$th note of the neutral performance, $\bar p_n$ is the mean of the profile computed over the entire vector, and $m_e$ and $k_e$ are, respectively, the coefficients of shift and expansion/compression related to expressive intention $e$. We verified that these parameters are very robust in the modification of expressive intentions [38]. Thus, (1) can be generalized to obtain, for every P-parameter, a morphing among different expressive intentions as

$$\hat p(i) = m\,\bar p_n + k\,[\,p_n(i) - \bar p_n\,]. \qquad (2)$$

This equation relates every P-parameter with a generic expressive intention represented by the expressive parameters $k$ and $m$ that constitute the fourth-level representation and that can be put in relation to the position $(x, y)$ of the control space.

B. The Control Space

The control space level controls the expressive content and the interaction between the user and the final audio performance. In order to realize a morphing among different expressive intentions, we developed an abstract control space, called perceptual parametric space (PPS), that is a two-dimensional (2-D) space derived by multidimensional analysis (principal component analysis) of perceptual tests on various professionally performed pieces ranging from Western classical to popular music [29], [39]. This space reflects how the musical performances are organized in the listener's mind. It was found that the axes of the PPS are correlated to acoustical and musical values perceived by the listeners themselves [40]. To tie the fifth level to the underlying ones, we make the hypothesis that a linear relation exists between the PPS axes and every couple of expressive parameters $(k, m)$

$$k = a_k x + b_k y + c_k, \qquad m = a_m x + b_m y + c_m \qquad (3)$$

where $x$ and $y$ are the coordinates of the PPS.

C. Parameter Estimation

Event, expressive, and control levels are related by (1) and (3). We will now get into the estimation process of the model parameters (see Fig. 5); more details about the relation between $k$, $m$, and audio and musical values will be given in Sections IV and V.

The estimation is based on a set of musical performances, each characterized by a different expressive intention. Such recordings are made by asking a professional musician to perform the same musical piece, each time being inspired by a different expressive intention (see Section V for details).
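The chain from control space to P-parameter profile, (2) and (3), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficient matrix `A`, offset `b`, and the neutral IOI profile are hypothetical values chosen only so that the center of the PPS maps to $k = m = 1$ (the neutral performance).

```python
import numpy as np

def morph_profile(p_neutral, k, m):
    """Linear instantaneous mapping of (2): shift the mean of a P-parameter
    profile by m and expand/compress its range around the mean by k."""
    mu = np.mean(p_neutral)              # mean over the entire vector
    return m * mu + k * (p_neutral - mu)

def pps_to_params(x, y, A, b):
    """Linear relation of (3): map PPS coordinates (x, y) to the
    expressive parameters (k, m)."""
    return A @ np.array([x, y]) + b

# Hypothetical neutral inter-onset-interval profile (seconds, one per note)
ioi_neutral = np.array([0.50, 0.48, 0.52, 0.50])
A = np.array([[0.3, 0.1],               # hypothetical coefficients a_k, b_k
              [0.05, 0.2]])             # and a_m, b_m
b = np.array([1.0, 1.0])                # PPS origin maps to k = m = 1

k, m = pps_to_params(0.0, 0.0, A, b)    # center of the control space
print(morph_profile(ioi_neutral, k, m)) # k = m = 1 returns the neutral profile
```

Moving the point $(x, y)$ continuously through the PPS varies $(k, m)$ continuously, which is what produces the morphing among expressive intentions.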
note relies on the cumulative information given by the $k$ and $m$ factors, and on the deviation induced by the Legato control considered in the next item.

Legato: This musical feature is recognized to have great importance in the expressive characterization of wind and string instrument performances. However, the processing of Legato is a critical task that would imply the reconstruction of a note release and a note attack if the notes are originally tied in a Legato, or the reconstruction of the transient if the notes are originally separated by a micropause. In both cases, a correct reconstruction requires a deep knowledge of the instrument's dynamic behavior, and a dedicated synthesis framework would be necessary. Our approach to this task is to approximate the reconstruction of transients by interpolation of amplitude and frequency tracks.

The deviations of the Legato parameter are processed by means of two synchronized actions. The first effect of a Legato change is a change in the duration of the note by $\Delta DR = (L_e - L_n)\,IOI$, where $L_n$ is the original Legato degree and $L_e$ is the Legato for the new expressive intention. This time-stretching action must be added to the one considered for the Local Tempo variation, as we will see in detail. Three different time-stretching zones are recognized within each note (with reference to Fig. 6): attack, sustain and release, and micropause. The time-stretching deviations must satisfy the following relations:

$$DR_A + DR_{SR} + DR_M = IOI, \qquad DR'_A + DR'_{SR} + DR'_M = IOI'$$

where $DR_A$, $DR_{SR}$, and $DR_M$ are the durations of the attack, sustain–release, and micropause segments, respectively, and $DR'_A$, $DR'_{SR}$, and $DR'_M$ are the new durations of these segments. Each region will be processed with a time-stretch coefficient computed from the above equations

$$\alpha_A = \frac{DR'_A}{DR_A}, \qquad \alpha_{SR} = \frac{DR'_{SR}}{DR_{SR}}, \qquad \alpha_M = \frac{DR'_M}{DR_M} \qquad (4)$$

where $\alpha_A$, $\alpha_{SR}$, and $\alpha_M$ are the time-stretching factors of the attack, sustain–release, and micropause segments, respectively. If an overlap occurs due to the lengthening of a note, the time-stretch coefficient $\alpha_M$ in (4) becomes negative. In this case, the second action involved is a spectral linear interpolation between the release and attack segments of two adjacent notes over the overlapping region (see Fig. 6). The length of the overlapping region is determined by the Legato degree, and the interpolation of partial amplitudes is performed over the whole range. The frequency tracks of the sinusoidal representation are lengthened to reach the pitch transition point. Here, a 10- to 15-ms transition is generated by interpolating the tracks of the actual note with those of the successive note. In this way, a transition without glissando is generated. Glissando effects can be controlled by varying the number of interpolated frames. This procedure, used to reproduce the smooth transition when the stretched note overlaps with the following note, is a severe simplification of instrument transients, but is general and efficient enough for real-time purposes.
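The per-segment stretch factors of (4), and the way a negative micropause coefficient signals an overlap, can be sketched as follows. The segment durations are illustrative, and keeping the attack unstretched is an assumption, not a rule stated in the text.

```python
def stretch_factors(dr_a, dr_sr, dr_m, new_dr_a, new_dr_sr, new_dr_m):
    """Time-stretch coefficients of (4): ratio of the new duration to the
    original duration for attack, sustain-release, and micropause."""
    return new_dr_a / dr_a, new_dr_sr / dr_sr, new_dr_m / dr_m

# Illustrative segment durations (seconds) for one note: DR + micropause = IOI
dr_a, dr_sr, dr_m = 0.04, 0.40, 0.06
ioi_new = 0.50                 # Local Tempo unchanged (assumption)
dr_new = 0.52                  # more Legato: new note duration exceeds IOI'

new_dr_a = dr_a                # leave the attack unstretched (assumption)
new_dr_sr = dr_new - new_dr_a
new_dr_m = ioi_new - dr_new    # negative micropause: the notes overlap

alpha_a, alpha_sr, alpha_m = stretch_factors(
    dr_a, dr_sr, dr_m, new_dr_a, new_dr_sr, new_dr_m)

if alpha_m < 0:
    # As in the text: the release and attack spectra of the two adjacent
    # notes are linearly interpolated over the overlapping region.
    print("overlap of %.2f s: cross-fade adjacent notes" % -new_dr_m)
```

Note that the new durations still satisfy the constraint $DR'_A + DR'_{SR} + DR'_M = IOI'$, even when the micropause term is negative.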
Envelope Shape: The center of mass of the energy envelope is related to the musical accent of the note, which is usually located on the attack for Light or Heavy intentions, or close to the end of the note for Soft or Dark intentions. To change the position of the center of mass, a triangular-shaped function is applied to the energy envelope, where the apex of the triangle corresponds to the new position of the accent.
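The triangular weighting can be sketched as below. The floor value at the feet of the triangle and the linear normalization are illustrative choices, not the authors' exact settings; the point is only that moving the apex moves the center of mass of the note's energy.

```python
import numpy as np

def shape_envelope(env, apex_pos):
    """Weight an energy envelope with a triangular function whose apex
    (weight 1) sits at relative position apex_pos in [0, 1]. The 0.2
    floor at the triangle's feet is an illustrative choice that avoids
    silencing the note's extremes entirely."""
    n = len(env)
    t = np.linspace(0.0, 1.0, n)
    tri = np.where(t <= apex_pos,
                   t / max(apex_pos, 1e-9),
                   (1.0 - t) / max(1.0 - apex_pos, 1e-9))
    return env * (0.2 + 0.8 * tri)

def center_of_mass(e):
    idx = np.arange(len(e))
    return (idx * e).sum() / e.sum()

env = np.ones(100)                # flat envelope for illustration
early = shape_envelope(env, 0.1)  # accent on the attack (Light/Heavy)
late = shape_envelope(env, 0.9)   # accent near the end (Soft/Dark)

print(center_of_mass(early) < center_of_mass(late))  # True
```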
Intensity and Brightness Control: The intensity and brightness of the sound frame are controlled by means of a spectral processing model relying on learning from real data the spectral transformations which occur when such a musical parameter changes. First, a perceptually weighted representation of spectral envelopes is introduced, so that the perceptually relevant differences are exploited in the comparison of spectral envelopes. Next, the parametric model used to represent spectral changes is outlined. Finally, the proposed method is applied to the purpose of modeling the intensity and brightness deviations for the control of expressiveness.

A. Representation of Spectral Envelopes

To switch from the original sinusoidal description to a perceptual domain, the original spectrum is turned into the mel-cepstrum spectral representation. The mel-frequency cepstral coefficients (mfcc) for a given sound frame are defined as the discrete-cosine transform (DCT) of the frequency-domain logarithmic output of a mel-spaced filter bank. The first $p$ mel-cepstrum coefficients $c_0, \ldots, c_{p-1}$, where $p$ is usually in the range 10–30, represent a smooth and warped version of the spectrum, as the inversion of the DCT leads to

$$\tilde S(\tilde f) = \sum_{i=0}^{p-1} c_i \cos\!\left(\pi i \,\frac{\tilde f}{\tilde f_{\max}}\right) \qquad (5)$$

where $\tilde f$ is the frequency in mel, $c_0$ is the frame energy, and $\tilde f_{\max} = \mathrm{mel}(f_s/2)$ with $f_s$ being the sampling frequency. The normalization factor $\tilde f_{\max}$ is introduced to ensure that the upper limit of the band corresponds to a value of one on the normalized warped frequency axis. The conversion from hertz to mel is given by the analytical formula $\mathrm{mel}(f) = 2595 \log_{10}(1 + f/700)$ [41]. Fig. 7 shows an example of a mel-cepstrum spectral envelope.

The above definition of mel-cepstrum coefficients usually applies to a short sound buffer in the time domain. To convert from a sinusoidal representation, alternative methods such as the discrete cepstrum method [42] are preferred: for a given sinusoidal parametrization, the magnitudes $a_k$ of the partials are expressed in the log domain and the frequencies in hertz are converted to mel frequencies $\tilde f_k$. The real mel-cepstrum parameters $c_i$ are finally computed by minimizing the following least-squares (LS) criterion:

$$\varepsilon = \sum_{k=1}^{K} \left[\, 20 \log_{10} a_k - \tilde S(\tilde f_k) \,\right]^2. \qquad (6)$$

The aim of the mel-cepstrum transformation in our framework is to capture the perceptually meaningful differences between spectra by comparing the smoothed and warped versions of spectral envelopes.

We now call $\tilde S(\tilde f_k)$ the $k$th partial magnitude (in dB) of the mel-cepstrum spectral envelope, and $\Delta\tilde S(\tilde f_k)$, with $k = 1, \ldots, K$, the difference between two mel-cepstrum spectral envelopes. By comparison of two different spectral envelopes, it is possible to express the deviation of each partial in the multiplicative form $a'_k = g_k\, a_k$, and we call conversion pattern the set $\{g_k\}$ computed by the comparison of two spectral envelopes.
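The LS fit of (6) and the derivation of a conversion pattern can be sketched as follows. This is a simplified discrete-cepstrum-style fit under assumptions of ours: the partial magnitudes are invented illustrative values, the order $p = 4$ is far below the usual 10–30, and no regularization is used.

```python
import numpy as np

def hz_to_mel(f):
    # Standard analytical hertz-to-mel conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def fit_mel_cepstrum(freqs_hz, mags_db, p, fs):
    """LS fit of (6): find c_0..c_{p-1} whose cosine expansion (5)
    approximates the partial magnitudes (dB) at normalized mel
    frequencies. Returns the coefficients and the cosine basis."""
    f_warp = hz_to_mel(freqs_hz) / hz_to_mel(fs / 2.0)  # in [0, 1]
    B = np.cos(np.pi * np.outer(f_warp, np.arange(p)))  # K x p basis
    c, *_ = np.linalg.lstsq(B, mags_db, rcond=None)
    return c, B

def envelope_db(c, B):
    """Evaluate the smoothed envelope (5) at the fitted frequencies."""
    return B @ c

fs = 44100.0
freqs = np.array([220.0, 440.0, 660.0, 880.0, 1100.0, 1320.0])
frame_a = np.array([60.0, 55.0, 48.0, 40.0, 34.0, 30.0])        # dB, illustrative
frame_b = frame_a + np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])   # brighter frame

c_a, B = fit_mel_cepstrum(freqs, frame_a, p=4, fs=fs)
c_b, _ = fit_mel_cepstrum(freqs, frame_b, p=4, fs=fs)

# Conversion pattern {g_k}: per-partial multiplicative deviation between
# the two smoothed envelopes (dB difference converted to a linear gain)
delta_db = envelope_db(c_b, B) - envelope_db(c_a, B)
g = 10.0 ** (delta_db / 20.0)
print(np.round(g, 2))
```

Applying $a'_k = g_k a_k$ to the partials of a new frame then imposes the learned brightness change on its spectrum.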
Table 4. Factor Loadings Are Assumed as Coordinates of the Expressive Performances in the PPS.
in Fig. 9) by applying the appropriate transformations to the sinusoidal representation of the neutral version. The result of this transformation is shown in Fig. 15. It can be noticed that the energy envelope changes from high to low values, according to the original performances (heavy and soft). The pitch contour shows the different behavior of the IOI parameter: the soft performance is played faster than the heavy performance. This behavior is preserved in our synthesis example.

We developed an application, released as an applet, for the fruition of fairy tales in a remote multimedia environment [38]. In these kinds of applications, an expressive identity can be assigned to each character in the tale and to the different multimedia objects of the virtual environment. Starting from the storyboard of the tale, the different expressive intentions are located in control spaces defined for the specific contexts of the tale. By suitable interpolation of the expressive parameters, the expressive content of the audio is gradually modified in real time with respect to the position and movements of the mouse pointer, using the model described above.

This application allows a strong interaction between the user and the audiovisual events. Moreover, the possibility of having a smoothly varying musical comment augments the user's emotional involvement, in comparison with the participation reachable using rigid concatenation of different sound comments. Sound examples can be found on our Web site [47].

Fig. 14. Analysis: energy envelope and pitch contour of a soft performance of Corelli's sonata op. V.

VI. ASSESSMENT

A perceptive test was realized to validate the system. A categorical approach was considered. Following the categorical approach [28], we intend to verify whether performances synthesized according to the adjectives used in our experiment are recognized. The main objective of this test is to see if a "static" (i.e., not time-varying) intention can be understood by listeners and if the system can convey the correct expression.

According to Juslin [48], forced-choice judgments and free-labeling judgments give similar results when listeners attempt to decode a performer's intended emotional expression. Therefore, it was considered sufficient to make a forced-choice listening test to assess the efficacy of the emotional communication. A detailed description of the procedure and of the statistical analyses can be found in [49]. In the following, some results are summarized.

A. Material

We synthesized different performances using our model. Given a score and a neutral performance, we obtain the five different interpretations from the control space, i.e., bright, hard, light, soft, and heavy. We did not consider the dark one, because in our previous experiments we noticed that it was confused with the heavy one, as can be seen in Fig. 9.

It was important to test the system with different scores to understand how high the correlation is between the inherent structure of the piece and the expressive recognition. Three classical pieces for piano with different sonological characteristics were selected for this experiment: "Sonatina in sol" by L. van Beethoven, "Valzer no. 7 op. 64" by F. Chopin, and K545 by W. A. Mozart.

The listeners' panel was composed of 30 subjects: 15 experts (musicians and/or conservatory graduates) and 15 non-experts (without any particular musical knowledge). No restrictions related to formal training in music listening were used in recruiting subjects. None of the subjects reported having hearing impairments.

B. Procedure

The stimuli were played by a PC. The subjects listened to the stimuli through headphones at a comfortable loudness level. The listeners were allowed to listen to the stimuli as many times as they needed, in any order. Assessors were asked to evaluate the grade of brightness, hardness, lightness, softness, and heaviness of all performances on a graduated scale (0 to 100). Statistical analyses were then conducted in order to determine whether the intended expressive intentions were correctly recognized.

C. Data Analysis

Table 5 summarizes the assessors' evaluation. The ANOVA test on the subjects' responses always yielded a
Table 5. Assessors' Evaluation Average (From 0 to 100). Rows represent the evaluation labels, and columns show the different stimuli. Legend: B = Bright, Hr = Hard, L = Light, S = Soft, Hv = Heavy, N = Neutral.
p value less than 0.001: the p values indicate that one or more populations' means differ quite significantly from the others. From data analyses, such as observation of the means and standard deviations, we notice that generally, for a given interpretation, the correct expression obtains the highest mark. One exception is the Valzer, where the light interpretation is recognized as soft, with a very slight advantage. Moreover, with K545, the heavy performance was judged close to the hard expressive intention (82.8 versus 80.3), whereas the hard performance was judged close to the bright one (68.4 versus 67.3), suggesting a slight confusion between these samples.

It is also interesting to note that listeners, in evaluating the neutral performance, did not spread their evaluation uniformly among the adjectives. Even if all the expressions are quite well balanced, we have a predominance of light and soft. The bright expression is also quite high, but no more than the average brightness of all performances.

A high correlation between hard and heavy and between light and soft can be noticed. Those expressions are well individuated in two groups. On the other hand, bright seems to be more complicated to highlight. An exhaustive statistical analysis of the data is discussed in [49], as well as the description of a test carried out by means of a dimensional approach. It is important to notice that the factor analysis returns our PPS. Automatic expressive performances synthesized by the system give a good modeling of the expressive performances realized by human performers.

VII. CONCLUSION

We presented a system to modify the expressive content of a recorded performance in a gradual way both at the