Vowel and Consonant Characterization Using Fractal Dimension in Natural Speech

Departamento de Matemática Aplicada y Estadística
Universidad Politécnica de Cartagena



Speech signals can be regarded as the output of a mechanical system with
inherently nonlinear dynamics. The purpose of this paper is to describe their
complexity using the fractal dimension of a variety of Spanish voiced sounds
(vowels, nasals) and unvoiced sounds (fricatives). In our research, the fractal
dimension was computed over recorded signals from a Spanish speech database
(AHUMADA), using the method suggested by Katz [1]. We conclude that fractal
measures extend the set of distinguishing features for characterizing voiced
and unvoiced sounds, which may lead to better speech recognition performance.

1. Introduction

Recent suggestions that speech production may be a nonlinear process have
sparked great interest in the nonlinear analysis of speech, giving rise
to many studies [2], [3], [4]. These studies are based on the natural
hypothesis that nonlinear processes occur in speech production, due to:
turbulent air flow produced in the vocal tract; nonlinear neuro-muscular
processes that should occur at the level of the vocal cords and of the larynx;
and nonlinear couplings, during speech generation, between different parts of
the vocal tract.
Speech recognition systems must map the acoustic features obtained from
the speech waveform into linguistic entities such as words or phonemes.
Thus, a speech recognition system needs a way to obtain numerically
distinct characteristics of the various phonemes (i.e. vowels and
consonants). The usual way to obtain such a characterization is through
speech production models. In these models, the speech signal is the output
of a vocal tract filter driven by an excitation signal. Classical approaches
to vowel characterization and recognition, based on formant analysis, are
relatively successful because different vowels lie in distinct regions of the
Cartesian space constructed from the first two or three formants. This is
true even if some overlap between the vowel regions occurs. However,
consonant characterization is more difficult because consonant waveforms
may be indistinguishable in the time or frequency domain.
In this paper, nonlinear dynamic methods are used to describe the
complexity of natural speech. The fractal dimension is a geometric invariant
measure used for characterizing nonlinear systems, which is of interest to us
because of its properties.
The remainder of this paper is organized as follows. Section 2 describes
the technique used in this work for estimating the fractal dimension.
Section 3 presents the experimental results, and Section 4 presents the
conclusions drawn from these results.

2. Methods

The fractal dimension (Df) can be considered as a relative measure of the
number of basic building blocks that form a pattern. This particular feature
has been used, with great success, in a variety of applications in biomedical
science for transient detection, waveform complexity estimation, pattern
recognition, etc. For this reason, we think that Df is a measure of signal
complexity that can characterize different voiced and unvoiced sounds. It
provides an alternative technique for assessing signal complexity in the time
domain, as opposed to the embedding method based on the reconstruction
of the attractor in the multidimensional phase space [5], [6]. This approach
permits a direct connection between complexity variations and changes in
the speech signal over time, providing a quick computational tool to detect
nonstationarities in these signals.
There are many algorithms in the literature for estimating the fractal
dimension of a waveform [1], [7], [8]. In our research, the fractal dimension
was computed according to the algorithm proposed by Katz [1]. Although
Katz's Df calculation is slightly slower than other methods, it is derived
directly from the waveform, which eliminates the preprocessing step. The
Df of a time sequence x(j), j = 1, 2, ..., N, can be defined as:
$$D_f = \frac{\log_{10}(L)}{\log_{10}(d)} \qquad (1)$$
where L is the sum of distances between successive points
$$L = \sum_{i=1}^{N-1} \left| x(i+1) - x(i) \right| \qquad (2)$$
and d is the diameter, estimated as the distance between the first point of
the sequence and the point of the sequence that provides the farthest distance
$$d = \max_{2 \le i \le N} \left| x(1) - x(i) \right| \qquad (3)$$

That is, the Df compares the actual number of units that compose a sequence
with the minimum number of units required to reproduce a pattern of the
same spatial extent. However, the Df computed in this fashion depends on
the measurement units used. Katz's approach solves this problem by creating a
general unit or yardstick: the average distance between successive points,
$\bar{L}$. Normalizing the distances in equation (1) by this average results in:
$$D_f = \frac{\log_{10}(L/\bar{L})}{\log_{10}(d/\bar{L})} \qquad (4)$$

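The unit dependence mentioned above can be made concrete with a short worked check. The scale factor c > 0 is a symbol introduced here only for illustration: it stands for a change of measurement units, so that every sample x(j) becomes c x(j) and every distance is multiplied by c.

$$x(j) \to c\,x(j) \;\Rightarrow\; L \to cL, \quad d \to cd, \quad \bar{L} \to c\bar{L}$$

$$\frac{\log_{10}(cL)}{\log_{10}(cd)} = \frac{\log_{10} c + \log_{10} L}{\log_{10} c + \log_{10} d} \neq \frac{\log_{10} L}{\log_{10} d} \quad \text{unless } c = 1 \text{ or } L = d$$

$$\frac{\log_{10}\left(cL / c\bar{L}\right)}{\log_{10}\left(cd / c\bar{L}\right)} = \frac{\log_{10}(L/\bar{L})}{\log_{10}(d/\bar{L})}$$

The first ratio is the unnormalized definition, which changes with c, while the normalized ratio is unchanged because c cancels inside each logarithm.
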
This procedure discretizes the sequence into N-1 intervals, so that
$N - 1 = L/\bar{L}$, and (4) can be written as
$$D_f = \frac{\log_{10}(N-1)}{\log_{10}(d/L) + \log_{10}(N-1)} \qquad (5)$$
Expression (5) summarizes Katz's approach to calculating the Df of a waveform.
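
As an illustration, the calculation in expression (5) can be sketched in a few lines of Python. This is a minimal sketch rather than the code used in this study: the function name katz_fd and the two short test sequences are assumptions made for the example, and distances are taken as absolute amplitude differences, as the definitions of L and d are read here.

```python
import math


def katz_fd(x):
    """Katz fractal dimension of a sequence (hypothetical helper name)."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    steps = n - 1  # number of intervals, N - 1
    # Total length L: sum of distances between successive points.
    total_length = sum(abs(x[i + 1] - x[i]) for i in range(steps))
    if total_length == 0:
        raise ValueError("constant sequence: L = 0, Df is undefined")
    # Diameter d: farthest distance from the first point of the sequence.
    diameter = max(abs(x[0] - x[i]) for i in range(1, n))
    # Denominator of expression (5); it equals log10(d / mean step).
    denominator = math.log10(diameter / total_length) + math.log10(steps)
    if denominator == 0:
        raise ValueError("d equals the mean step: Df is undefined")
    return math.log10(steps) / denominator


ramp = [0, 1, 2, 3, 4]    # straight line, so L = d
zigzag = [0, 1, 3, 2, 5]  # L = 7, d = 5
print(katz_fd(ramp))                       # 1.0
print(katz_fd(zigzag))
print(katz_fd([10 * v for v in zigzag]))   # rescaled copy of zigzag
```

The last two calls should print essentially the same value, since rescaling the amplitudes leaves the ratio d/L unchanged. The two guards cover inputs for which the ratio in (5) is not defined.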

3. Results

This section presents the experimental results obtained during this study.
The fractal dimension is calculated for a variety of voiced sounds (vowels
and nasals) and unvoiced sounds (fricatives). Specifically, the sounds
considered in our research are:
• The vowels: /a/, /e/, /i/, /o/, /u/.
• The nasals: /m/, /n/.
• The fricatives: /f/, /s/, /z/.

The Df parameter is obtained from the experimental signals using a
sliding window of 512 points (32 ms) shifted along the speech signal with 1
point of overlap. The experiments were carried out on recorded signals of
1024 points taken from a Spanish speech database (AHUMADA) [9].
Table 1 shows the fractal dimension estimated for each analyzed sound by
means of the procedure described in this paper.

Table 1. Summary of results for the analyzed sounds.

Sound  /a/  /e/  /i/  /o/  /u/  /m/  /n/  /f/  /s/  /z/
Df     2.2  2.2  2.2  2.1  2.1  1.8  1.8  3.3  4.9  4.1
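
The windowed analysis described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure actually run on AHUMADA: the names katz_fd and sliding_fd are hypothetical, the hop of one sample between consecutive windows is an assumed reading of the window shift, and the two 1024-point inputs are synthetic stand-ins (a 200 Hz sine and uniform noise) sampled at the 16 kHz rate implied by 512 points lasting 32 ms.

```python
import math
import random


def katz_fd(x):
    """Katz fractal dimension via expression (5) (hypothetical helper)."""
    steps = len(x) - 1
    total_length = sum(abs(b - a) for a, b in zip(x, x[1:]))
    diameter = max(abs(v - x[0]) for v in x[1:])
    return math.log10(steps) / (
        math.log10(diameter / total_length) + math.log10(steps)
    )


def sliding_fd(signal, window=512, hop=1):
    """Df of each window position; hop=1 is an assumption of this sketch."""
    last_start = len(signal) - window
    return [
        katz_fd(signal[start:start + window])
        for start in range(0, last_start + 1, hop)
    ]


fs = 16000  # Hz, implied by 512 points = 32 ms
# Synthetic stand-ins for recorded sounds (not data from the database).
tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(1024)]
rng = random.Random(0)  # fixed seed so the run is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(1024)]

fd_tone = sliding_fd(tone)
fd_noise = sliding_fd(noise)
print(len(fd_tone))  # 513 window positions for a 1024-point signal
print(sum(fd_tone) / len(fd_tone))
print(sum(fd_noise) / len(fd_noise))
```

In this sketch the noise-like signal gives a larger mean Df than the periodic one. Averaging over window positions is a choice made for the example.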

4. Conclusions

This study suggests that the Df is a useful and practical tool for characterizing vowel and
nasal sounds. The Df values obtained show that vowel and nasal sounds are
low-dimensional, whereas fricatives are high-dimensional.
In conclusion, fractal measures extend the set of distinguishing features for
characterizing voiced and unvoiced sounds, which opens the possibility of designing an
automatic system for speech recognition.


References

1. M. Katz, Comp. Biol. Med., vol. 18, no. 3, pp. 145-156, 1988.
2. M. Banbrook and S. McLaughlin, IEEE Colloquium on Exploiting Chaos
and Signals Processing, 8/1-8/10, 1994.
3. A. Kumar and S.K. Mullick, Electronics Letters, vol. 26, no. 21, pp. 1790-
1791, 1990.
4. H.N. Teodorescu et al., Int. J. of Chaos Theo. and Appl., vol. 5, no.
3, pp. 1453-145, 1996.
5. F. Takens, Lecture Notes in Mathematics 898, Springer, Berlin, pp. 366-
381, 1981.
6. P. Grassberger and I. Procaccia, Phys. Rev. Lett., vol. 50, pp. 345-349,
1983.
7. T. Higuchi, Physica D, vol. 31, pp. 277-287, 1988.
8. A. Petrosian, IEEE Symposium on Computer-Based Medical Systems,
pp. 212-217, 1995.
9. R. García and L. Gómez, Base de datos para la Identificación y
Clasificación de locutores, Universidad Politécnica de Madrid, 1997.

