doi:10.2498/cit.1001390
Lynne Billard
Department of Statistics, University of Georgia, Athens, USA
Contemporary computers bring us very large datasets, datasets which can be too large for those same computers to analyse properly. One approach is to aggregate these data (by some suitably scientific criteria) to provide more manageably-sized datasets. These aggregated data will perforce be symbolic data consisting of lists, intervals, histograms, etc. Now an observation is a p-dimensional hypercube or Cartesian product of p distributions in $\mathbb{R}^p$, instead of the p-dimensional point in $\mathbb{R}^p$ of classical data. Other data can be naturally symbolic. We give a brief overview of interval-valued data and show briefly that it is important to use symbolic analysis methodology since, e.g., analyses based on classical surrogates ignore some of the information in the dataset.

Keywords: aggregate data, symbolic data analysis, principal component analysis, p-dimensional hypercube, interval-valued data, divisive clustering method

Table 1(a). Classical Values.

Patient      Hospital      Age   Smoker
Patient 1    Hospital 1    74    heavy
Patient 2    Hospital 1    78    light
Patient 3    Hospital 2    69    no
Patient 4    Hospital 2    73    heavy
Patient 5    Hospital 2    80    light
Patient 6    Hospital 1    70    heavy
Patient 7    Hospital 1    82    heavy
Patient 8    Hospital 3    74    heavy
...          ...           ...   ...
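As an illustration of how classical records such as those in Table 1(a) might be aggregated into symbolic values, the following is a minimal sketch. It assumes only the eight rows visible in Table 1(a) (the table continues beyond them, so published aggregates may differ), and the names `records` and `aggregate_by_hospital` are hypothetical. Age is aggregated to an interval [min, max] and smoker status to a modal multi-valued list of relative frequencies.

```python
from collections import Counter, defaultdict

# The eight rows visible in Table 1(a): (patient, hospital, age, smoker).
records = [
    ("Patient 1", "Hospital 1", 74, "heavy"),
    ("Patient 2", "Hospital 1", 78, "light"),
    ("Patient 3", "Hospital 2", 69, "no"),
    ("Patient 4", "Hospital 2", 73, "heavy"),
    ("Patient 5", "Hospital 2", 80, "light"),
    ("Patient 6", "Hospital 1", 70, "heavy"),
    ("Patient 7", "Hospital 1", 82, "heavy"),
    ("Patient 8", "Hospital 3", 74, "heavy"),
]


def aggregate_by_hospital(rows):
    """Aggregate classical rows into one symbolic observation per hospital.

    Assumption for this sketch: age becomes the interval (min, max) and
    smoker becomes a dict of relative frequencies (a modal multi-valued value).
    """
    groups = defaultdict(list)
    for _patient, hospital, age, smoker in rows:
        groups[hospital].append((age, smoker))

    symbolic = {}
    for hospital, members in groups.items():
        ages = [age for age, _ in members]
        counts = Counter(smoker for _, smoker in members)
        symbolic[hospital] = {
            "age": (min(ages), max(ages)),
            "smoker": {cat: n / len(members) for cat, n in counts.items()},
        }
    return symbolic


symbolic = aggregate_by_hospital(records)
```

A hospital with a single patient yields the degenerate interval (a, a) and a one-element list, which matches the remark that classical values are special cases of symbolic data.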
way ‘Hospital 1’ had ages over the interval [70, 82]. Here, Y1 = age becomes an interval-valued symbolic value. The variable Y2 = smoker becomes a list or multi-valued symbolic value, and can be modal multi-valued as for Hospital 1 or a list as for Hospital 2. Classical values are special cases of symbolic data. Thus, the point value a ≡ [a, a], and the categorical value c ≡ {c}; see Hospital 3 in Table 1(b).

There are many possible aggregations. Clearly, those driven by relevant scientific questions are best. Medical insurance companies may not be particularly interested in your details for a specific visit to a physician, hospital, clinic, . . ., but rather are interested in the aggregate of your visits over a time period such as 5 years. Or, they may be interested in 20-year-old males, or 30-year-old females; or, in those with lung cancer collectively or by age × gender, or . . ., and so on. The automobile manufacturer is less interested in the type of car you purchased but rather in the automobile purchases of 40-year-olds (or 40-year-old males), or of purchases of cars by model and make, etc.

Some data are naturally symbolic. For example, Table 2 contains values for Y1 = pileus cap width, Y2 = stipe length, Y3 = stipe thickness, and Y4 = edibility of species of mushrooms (from Billard and Diday, 2006, Table 3). Thus, for the species arorae the cap width is Y1 = [3, 8]. Any one mushroom from that species has a specific length, e.g., Y1 = 5.2, i.e., a classical point value; but it cannot be said that all mushrooms of that species have the same Y1 (= 5.2, say) value. Notice Y4 however is a classical value for each species.

Table 2. Mushrooms.

2. Symbolic or Classical Analysis?

In the absence of techniques to analyse symbolic data directly, it is tempting to use classical surrogates. To do so, however, loses information contained in the data. Consider the three different realizations of the random variable Y = weight, viz., $Y_1 = 135$, $Y_2 = [132, 138]$, $Y_3 = [129, 141]$. Let these be three samples each of size m = 1. It is easily shown that the sample mean is $\bar{Y}_1 = \bar{Y}_2 = \bar{Y}_3 = 135$; the sample variance is $S_1^2 = 0$, $S_2^2 = 3$, $S_3^2 = 12$. [See (1) below for the definition of $S^2$ for interval-valued observations.] Thus, had we taken the midpoints as a classical surrogate for our analyses, all three samples would have produced the same answer. Yet, since $S_i^2 \neq S_{i'}^2$, $i \neq i'$, it is clear that the samples are differently valued. In this case, the differences revolve around the internal variation of symbolic observations. Classical observations have no internal variation. Therefore, the use of classical surrogates implies that these internal variations are not taken into account.

Bertrand and Goupil (2000) have shown that under the assumption that values are uniformly distributed within each interval, the sample variance (of each variable Y) of a set of observations $Y_u = [a_u, b_u]$, $u = 1, \ldots, m$, is

$$ S^2 = \frac{1}{3m} \sum_{u=1}^{m} \left( a_u^2 + a_u b_u + b_u^2 \right) - \frac{1}{4m^2} \left[ \sum_{u=1}^{m} (a_u + b_u) \right]^2 \qquad (1) $$
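The three weight samples of Section 2 can be checked numerically against Eqn (1). The sketch below is illustrative only: the function names are hypothetical, and the mean is taken as the average of the interval midpoints, which is the quantity squared in the second term of (1).

```python
def interval_mean(intervals):
    """Symbolic sample mean: average of the interval midpoints (a_u + b_u) / 2."""
    m = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * m)


def interval_variance(intervals):
    """Symbolic sample variance of Eqn (1), assuming a uniform spread in each interval."""
    m = len(intervals)
    sum_squares = sum(a * a + a * b + b * b for a, b in intervals)
    sum_endpoints = sum(a + b for a, b in intervals)
    return sum_squares / (3 * m) - sum_endpoints ** 2 / (4 * m * m)


# Three samples of size m = 1; the point value 135 is written as [135, 135].
samples = {
    "Y1": [(135, 135)],
    "Y2": [(132, 138)],
    "Y3": [(129, 141)],
}

for name, sample in samples.items():
    print(name, interval_mean(sample), interval_variance(sample))
```

All three samples share the mean 135, while the variances come out as 0, 3 and 12, which is the internal variation that a midpoint surrogate discards.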
Some Analyses of Interval Data 227
                Y1               Y2                  Y3
                Pulse Rate       Systolic Pressure   Diastolic Pressure
(a) No Rules    Ȳ1 = 72.43       Ȳ2 = 140.27         Ȳ3 = 95.67
                s1^2 = 272.50    s2^2 = 922.40       s3^2 = 1204.13
                s1 = 16.51       s2 = 30.37          s3 = 34.70
(Y2, Y3) = ([160, 190], [170, 185]), say. Now, only that portion bounded by the vertices (160, 170), (160, 185), (185, 185), (170, 170) forming a hexagon in $\mathbb{R}^2$ is a valid region. This is analogous to the baseball dataset analysed in detail in Billard and Diday (2006), q.v.

3. Principal Component Analysis

Chouakria (1998) and Billard et al. (2008) have proposed a principal component methodology based on the vertices of the data hypercubes. They show that the νth symbolic principal component is

$$ Y^*_{u\nu} = [\, y^a_{u\nu},\; y^b_{u\nu} \,], \quad \nu = 1, \ldots, s \le p, \qquad (7) $$

where

$$ y^a_{u\nu} = \min_{k \in L_u} \{ y_{u\nu k} \}, \qquad y^b_{u\nu} = \max_{k \in L_u} \{ y_{u\nu k} \} \qquad (8) $$

where $y_{u\nu k}$ is the νth principal component for the vertex k of the hypercube of observation $w_u$ and $L_u$ is the set of vertices associated with $w_u$. The resulting ν = 1 and ν = 2 principal components PCν for the blood data of Table 3 are shown in Table 5(a) and plotted in Figure 1.

In order to help clarify the visualization and interpretation of these principal components, it is further proposed to retain for use in (8) only those vertices $x_{uk}$ whose contribution

$$ \mathrm{Ctr}(x_{uk}, PC_\nu) = \frac{(y_{u\nu k})^2}{[\, d(x_{uk}, G) \,]^2} \qquad (9) $$

exceeds a specified α. In (9), $x_{uk}$ is the vertex k in $L_u$, and d(·, G) is the Euclidean distance between that vertex and the centroid G of all observations. Eqn (8) directly corresponds to α = 0. When α = 0.2, the principal components PCν, ν = 1, 2, are as shown in Table 5(b); the number of vertices nν which satisfy this condition, ν = 1, 2, is also shown. The resulting PCν, ν = 1, 2, are plotted in Figure 2.

The clusters become more apparent in Figure 2. Thus, we see that observations {w1, w2, w5, w15} = C1 constitute one cluster; {w4, w6, w7, w8, w9, w10} = C2 form a second cluster, with maybe w13 part of this second cluster if not its own cluster; {w3, w12, w14} are a single cluster C3 (or perhaps two clusters); and finally {w11} = C4 is a cluster on its own.

Figure 3 is the plot of the PC1 and PC2 obtained when the interval midpoints are used as classical surrogates. Unlike the distinctness of w3 and w14 in the symbolic analysis, these observations would seem to be part of the central cluster C2. The degree of these differences depends on the varied lengths of the intervals. More importantly, however, the relative sizes of the principal component hypercubes reflect those of the data hypercubes. For example, comparing those for w3 and w14 in Figure 1 (or in Figure 4 below), we see how the data hypercube for w4 fits “inside” that of w14 as do also the respective principal component hypercubes. This phenomenon cannot be observed in any classical analysis.
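The vertices approach of Eqns (7) to (9) can be sketched in a few lines. This is a hedged illustration, not the cited authors' implementation: it assumes a plain covariance-based PCA on the stacked vertices (no standardization), uses `>=` rather than a strict inequality so that α = 0 reproduces Eqn (8), and returns NaN when no vertex of an observation passes the filter. The function names and the toy data are hypothetical.

```python
from itertools import product

import numpy as np


def hypercube_vertices(intervals):
    """Return the 2**p vertices of one observation given p (a, b) pairs."""
    return np.array(list(product(*intervals)), dtype=float)


def vertices_pca(data, n_components=2, alpha=0.0):
    """Interval-valued principal components via hypercube vertices.

    data has shape (m, p, 2), with p >= 2 and data[u, j] = [a_uj, b_uj].
    Returns an array of shape (m, n_components, 2) holding [y^a, y^b].
    """
    data = np.asarray(data, dtype=float)
    m, p, _ = data.shape
    verts = np.vstack([hypercube_vertices(obs) for obs in data])
    owner = np.repeat(np.arange(m), 2 ** p)  # observation index of each vertex

    centroid = verts.mean(axis=0)            # G in Eqn (9)
    centred = verts - centroid
    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = centred @ eigvecs[:, order]     # y_{u nu k} for every vertex

    # Contribution of each vertex to each component, Eqn (9).
    dist2 = (centred ** 2).sum(axis=1)[:, None]
    ctr = np.divide(scores ** 2, dist2,
                    out=np.zeros_like(scores), where=dist2 > 0)

    pcs = np.full((m, n_components, 2), np.nan)
    for u in range(m):
        for v in range(n_components):
            keep = (owner == u) & (ctr[:, v] >= alpha)
            if keep.any():
                # Eqn (8): min and max over the retained vertices.
                pcs[u, v] = scores[keep, v].min(), scores[keep, v].max()
    return pcs


# Toy data (not the blood data of Table 3): the second hypercube sits inside the first.
toy = np.array([[[0.0, 4.0], [0.0, 2.0]],
                [[1.0, 3.0], [0.5, 1.5]]])
pcs_all = vertices_pca(toy)                  # alpha = 0, i.e. Eqn (8)
pcs_filtered = vertices_pca(toy, alpha=0.5)  # keep only well-represented vertices
```

In the toy data the nesting of the two data hypercubes carries over to the principal component intervals, which is the behaviour the text describes for the symbolic analysis.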
Figure 1.
Figure 2.
Figure 3.
Figure 4.
References

[1] P. BERTRAND, F. GOUPIL, Descriptive statistics for symbolic data. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds. H.-H. Bock and E. Diday). Springer-Verlag, Berlin, (2000), 103–124.
[2] L. BILLARD, Dependencies and variation components of interval-valued data. In: Selected Contributions in Data Analysis and Classification (eds. P. Brito, P. Bertrand, G. Cucumel and F. de Carvalho), Springer, (2007), 3–12.
[3] L. BILLARD, A. CHOUAKRIA-DOUZAL, E. DIDAY, Symbolic principal components for interval-valued observations, (2007).
[4] L. BILLARD, E. DIDAY, Symbolic Data Analysis: Conceptual Statistics and Data Mining, (2006). John Wiley.
[5] L. BILLARD, E. DIDAY, Descriptive statistics for interval-valued observations in the presence of rules. Computational Statistics 21, (2006), 187–210.
[6] M. CHAVENT, Analyse de Données Symboliques: Une Méthode Divisive de Classification, Thèse de Doctorat, Université Paris Dauphine, (1997).
[7] M. CHAVENT, A monothetic clustering algorithm. Pattern Recognition Letters 19, (1998), 989–996.
[8] M. CHAVENT, Criterion-based divisive clustering for symbolic data. In: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (eds. H.-H. Bock and E. Diday). Springer-Verlag, Berlin, (2000), 299–311.
[9] A. CHOUAKRIA, Extension des Méthodes d’analyse Factorielle à des Données de Type Intervalle, Thèse de Doctorat, Université Paris Dauphine, (1998).
[10] E. DIDAY, M. NOIRHOMME-FRAITURE, (EDS.) Symbolic Data Analysis and the SODAS Software, (2008). Wiley.

Contact address:
Lynne Billard
Department of Statistics
University of Georgia
Athens, GA 30602, USA
e-mail: lynne@stat.uga.edu

LYNNE BILLARD is a University Professor at the University of Georgia and an Adjunct Professor at the Australian National University. She is a member and former president of American Statistical Association, as well as of International Biometric Society, and a member of Eastern North American Region (ENAR), International Statistics Institute and Institute of Mathematical Statistics. She has received awards for her work in statistics from American Statistical Association, University of Georgia and other statistical associations. Her research interests include stochastic processes with emphasis on model building, epidemic processes including AIDS research, sequential and time series analysis, statistical inference, symbolic and complex data analysis. She is a reviewer/referee for several journals, including Mathematical Reviews and Journal of American Statistical Association and an Associate Editor for Journal of Symbolic Data Analysis. She has published over 150 papers in international scientific journals and conference proceedings as well as several books.