
Revista Colombiana de Estadística

Edición para autor(es)

AN ADAPTATION OF THE STATIS METHOD FOR NON SYMMETRICAL ANALYSIS OF QUALITATIVE VARIABLE BLOCKS

Jennyfer Combariza (1, a), Guillermo Ramírez (1, b), Maura Vásquez (1, c)

1 Postgrado en Estadística, Facultad de ciencias económicas y sociales, Universidad Central de Venezuela, Caracas, Venezuela

Abstract

One of the methods proposed for the simultaneous analysis of multiple data tables collected on several occasions is STATIS, whose purpose is to explore the similarities between structures called objects, which summarize the information on the individuals.
The problem addressed in this research is the search for a methodology, based on STATIS, that makes it possible to compare and simultaneously explain the influence of a qualitative explanatory variable x as a determinant of a categorical criterion variable y on H occasions. To this end, a Frobenius-type scalar product is defined among the objects, which allows a statistical distance between objects to be conceptualized as a function of the Goodman-Kruskal τ statistic.
An application of the proposed technique is also presented on a set of real data consisting of 8 blocks of paired data, where each block contains the measurement of two qualitative variables on 786 individuals, in order to determine whether the credit risk rating of the clients of a financial institution, based on information from the Colombian credit bureaus, is related to the risk classification estimated with information from the entity.
Key words: STATIS, Goodman-Kruskal τ, non symmetrical correspondence analysis, three-way analysis.

a Doctoral candidate. E-mail: jennyfer.combariza@gmail.com
b Full professor. E-mail: guillermo.ramirez.ucv@gmail.com
c Full professor. E-mail: mauralvasquez@gmail.com

1. Introduction
Prediction is the term usually used to refer to the estimation of a categorical criterion variable y, on H occasions, as a function of one or more independent variables x1, x2, · · · , xp, in a linear or non-linear model, which implies explaining the influence that the latter exert as determinants of the behavior of the former. In the particular case of a single explanatory variable x, where both x and y are categorical, D'Ambra and Lauro (1984, [2]) proposed the non symmetrical correspondence analysis (NSCA) of the paired information block (x, y). This technique, based on the decomposition of the Goodman-Kruskal τ statistic, obtains minimum quadratic (least-squares) estimates of y by projecting it orthogonally onto the space generated by the modalities of x. It essentially consists of projecting the vectors of conditional probabilities of y, given each of the modalities of x, onto the principal directions of the variance-covariance matrix of the estimates of y given x.
In this work we propose an adaptation of the STATIS method to the application of an NSCA on H occasions, for a data structure corresponding to the characterization of the same individuals by means of two categorical variables¹, organized in paired non symmetrical blocks (xh, yh), on H occasions.
The central problem of this research is the search for a methodology that makes it possible to simultaneously compare and explain the influence of an independent or explanatory qualitative variable x as a determinant of a categorical criterion variable y on H occasions.

1 Throughout this document, on every occasion, the rows are structured by the variable x and the columns by the variable y.


1.1. Objective of STATIS-NSCA2


The goal of STATIS-NSCA2² is to allow the simultaneous comparison of objects that identify interdistance structures corresponding to H occasions of qualitative blocks, from a non symmetrical perspective.
This requires defining the conceptual basis on which the application of STATIS rests in the non symmetrical qualitative context. First, an object representative of the interdistance structure between the individuals of a given block is defined. Next, a suitable scalar product is defined between objects, which makes it possible to compare the different blocks in terms of a statistical distance derived from that scalar product. Finally, the fundamental aspects of the analysis of the interstructure are established.

1.2. Main idea of STATIS-NSCA2


The main idea of STATIS-NSCA2 is to analyze the influence of a categorical explanatory variable x on a categorical explained variable y, over H occasions.
The well-known STATIS methodology was initially designed for the treatment of information generated by characterizing n individuals according to p continuous variables on H different occasions; in this work that methodology is adapted to the case of paired non symmetrical blocks of two qualitative variables, of the type (x, y), on H occasions.

2. Data configuration
In this research, two qualitative variables x and y are considered:
1. An explanatory variable x with p modalities A1, · · · , Ai, · · · , Ap.
2. An explained variable y with q modalities B1, · · · , Bj, · · · , Bq.
Likewise, paired non symmetrical blocks (xh, yh), h = 1, · · · , H, are considered, each containing the measurement of the two qualitative variables on n individuals.
The qualitative variables are observed on the same n individuals on H occasions³.

2.1. Disjunctive matrices


For each occasion h, the observed values of the variables x and y on the n individuals can be organized in complete disjunctive matrices of the form:

$$X_h = (x_{sih})_{1 \le s \le n,\; 1 \le i \le p}, \qquad Y_h = (y_{sjh})_{1 \le s \le n,\; 1 \le j \le q}$$

Figure 1: Disjunctive matrices on the h-th occasion. (a) Variable x, h-th occasion. (b) Variable y, h-th occasion.

where⁴

$$x_{sih} = \begin{cases} 1 & \text{if individual } s \text{ presents the } i\text{th modality of the variable } x \text{ on the } h\text{th occasion} \\ 0 & \text{otherwise} \end{cases}$$

$$y_{sjh} = \begin{cases} 1 & \text{if individual } s \text{ presents the } j\text{th modality of the variable } y \text{ on the } h\text{th occasion} \\ 0 & \text{otherwise} \end{cases}$$

2 The acronym NSCA2 means that a non symmetrical qualitative scenario is being studied, with association measured through the Goodman-Kruskal τ; the number two means that only two variables are analyzed: a criterion variable y as a function of an independent variable x.
3 In longitudinal designs, the term occasion identifies each of the possible instants of measurement, while the terms situation and condition are reserved for cases where h is not linked to the passage of time but to conceptual classifications proper to the meaning of the variables being analyzed. For theoretical development purposes we use the term occasion.
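As an illustration of the definition above, the following minimal sketch builds complete disjunctive matrices with NumPy. The function name `disjunctive` and the toy codes are hypothetical and are not part of the original paper; modalities are assumed to be coded as integers 0, ..., p − 1.

```python
import numpy as np

def disjunctive(codes, n_levels):
    """Complete disjunctive (indicator) matrix: one row per individual,
    one column per modality, with exactly one 1 in each row."""
    codes = np.asarray(codes)
    Z = np.zeros((codes.size, n_levels), dtype=int)
    Z[np.arange(codes.size), codes] = 1
    return Z

# Hypothetical toy data: n = 6 individuals, p = 3 modalities of x, q = 2 of y.
x_codes = [0, 0, 1, 1, 2, 2]
y_codes = [0, 1, 0, 0, 1, 1]
X_h = disjunctive(x_codes, 3)   # n x p
Y_h = disjunctive(y_codes, 2)   # n x q
print(X_h)
print(Y_h)
```

Each row sums to one because every individual presents exactly one modality of each variable on a given occasion.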

2.2. Required notations


Matrices are denoted by boldface uppercase letters (e.g., Xh). Below we present the basic definitions required to understand the adaptation of the STATIS method.

• Vectors of marginal frequencies of the qualitative variables x and y on the hth occasion, with $j = (1, \cdots, 1)^t$:

$$r_h = (j^t X_h)^t = (k_{1\cdot h}, \cdots, k_{i\cdot h}, \cdots, k_{p\cdot h})^t, \qquad c_h = (j^t Y_h)^t = (k_{\cdot 1 h}, \cdots, k_{\cdot j h}, \cdots, k_{\cdot q h})^t$$

• Diagonal matrices of frequencies of the qualitative variables x and y on the hth occasion:

$$X_h^t X_h = \mathrm{diag}(\cdots k_{i\cdot h} \cdots) = \mathrm{diag}(r_h) = D_{ph}, \qquad Y_h^t Y_h = \mathrm{diag}(\cdots k_{\cdot j h} \cdots) = \mathrm{diag}(c_h) = D_{qh}$$

• Contingency table that crosses the variable x (in the rows) with the variable y (in the columns): $K_h = X_h^t Y_h$

• Table of relative frequencies of the variable x (in the rows) with the variable y (in the columns): $F_{Y_h X_h} = \frac{1}{n} K_h$

• Frequency matrices of the values of the qualitative variables x and y on the hth occasion: $F_{X_h} = \frac{1}{n} D_{ph}$ and $F_{Y_h} = \frac{1}{n} D_{qh}$

• Matrix of row profiles: $R_h = D_{ph}^{-1} K_h = \left( \frac{k_{ijh}}{k_{i\cdot h}} \right) = \left( P(y = j \mid x = i, \text{ on the } h\text{th occasion}) \right)$

4 Throughout this investigation, saying that the individual s presents the ith modality of the variable x on the hth occasion is equivalent to saying that the individual s presents the modality Ai of the variable x on the hth occasion. The same applies to the modalities of the variable y: for simplicity, jth modality denotes the modality Bj on the hth occasion.


• Centering matrix P:

$$P = I - P_m \qquad (1)$$

where $P_m = \frac{j j^t}{n}$, with $j = (1, \cdots, 1)^t$ the vector of ones of dimension n × 1.

• In the space generated by the information of the variable x, for the hth occasion, the following matrix yields the minimum quadratic projections of the variable y onto the space generated by the variable x:

$$P_X = X (X^t X)^{-1} X^t \qquad (2)$$

3. Algebraic properties
3.1. Properties of the matrices $P_{X_h}$ and $P_m = \frac{1}{n} J_n$
For occasion h,

1. $P_{X_h}$ and $P_m$ are idempotent.

2. The product of the matrices $P_{X_h}$ and $P_m$ is commutative.

3. The matrix $P_{X_h} - P_m$ is idempotent.
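These properties follow from the fact that the columns of a complete disjunctive matrix sum to the vector of ones j, so that $P_{X_h} j = j$ and hence $P_{X_h} P_m = P_m P_{X_h} = P_m$. The sketch below checks them numerically on a hypothetical random disjunctive matrix; the data and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 3
codes = np.arange(n) % p          # every modality is observed at least once
rng.shuffle(codes)
X = np.eye(p)[codes]              # hypothetical n x p disjunctive matrix

P_X = X @ np.linalg.inv(X.T @ X) @ X.T   # projector onto the columns of X
P_m = np.ones((n, n)) / n                # projector onto the vector of ones

idempotent_PX = np.allclose(P_X @ P_X, P_X)
idempotent_Pm = np.allclose(P_m @ P_m, P_m)
commute = np.allclose(P_X @ P_m, P_m @ P_X)
diff = P_X - P_m
idempotent_diff = np.allclose(diff @ diff, diff)
print(idempotent_PX, idempotent_Pm, commute, idempotent_diff)
```

The check assumes that every modality of x is observed, so that $X^t X$ is invertible.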

3.2. Matrix of minimum quadratic estimates of the variable y on the hth occasion
The matrix of minimum quadratic estimates of y as a function of x, denoted by $\tilde{Y}_h$, is defined by:

$$\tilde{Y}_h = P_{X_h} Y_h = X_h (X_h^t X_h)^{-1} X_h^t Y_h = X_h D_{ph}^{-1} K_h = X_h R_h \qquad (3)$$

The matrix of minimum quadratic projections of the variable y onto the modalities of x, on the hth occasion, has generic term (s, j):

$$\sum_{i=1}^{p} x_{sih}\, P(y_h = j \mid x_h = i) = P(y_h = j \mid x_h = i),$$

for every individual s that presents the modality Ai of x.
Reordering the rows of the matrix of minimum quadratic estimates $\tilde{Y}_h$ by modality of x, and using the Kronecker product of matrices, we obtain the following representation:

$$
\tilde{Y}_h = \begin{pmatrix}
j_{k_{1\cdot h}} \otimes P(y_h = 1 \mid x_h = 1) & \cdots & j_{k_{1\cdot h}} \otimes P(y_h = j \mid x_h = 1) & \cdots & j_{k_{1\cdot h}} \otimes P(y_h = q \mid x_h = 1) \\
\vdots & & \vdots & & \vdots \\
j_{k_{i\cdot h}} \otimes P(y_h = 1 \mid x_h = i) & \cdots & j_{k_{i\cdot h}} \otimes P(y_h = j \mid x_h = i) & \cdots & j_{k_{i\cdot h}} \otimes P(y_h = q \mid x_h = i) \\
\vdots & & \vdots & & \vdots \\
j_{k_{p\cdot h}} \otimes P(y_h = 1 \mid x_h = p) & \cdots & j_{k_{p\cdot h}} \otimes P(y_h = j \mid x_h = p) & \cdots & j_{k_{p\cdot h}} \otimes P(y_h = q \mid x_h = p)
\end{pmatrix} \qquad (4)
$$

where $j_{k_{i\cdot h}}$ is the vector of ones of dimension $k_{i\cdot h} \times 1$ and $k_{i\cdot h}$ is the number of times the conditional frequency profile $y_h \mid x_h = i$ is repeated in the matrix of minimum quadratic estimates $\tilde{Y}_h$. The estimate for an individual who presents the modality Ai of the variable x, on the hth occasion, is the vector of conditional frequencies of $y_h \mid x_h = i$, as shown below:

$$\tilde{Y}_h(x_h = i) = \left( \frac{k_{i1h}}{k_{i\cdot h}}, \cdots, \frac{k_{ijh}}{k_{i\cdot h}}, \cdots, \frac{k_{iqh}}{k_{i\cdot h}} \right) \qquad (5)$$


When the matrix $\tilde{Y}_h$ is centered ($P \tilde{Y}_h = \tilde{Y}_{ch}$), $P(y_h = j)$ has to be estimated by the relation:

$$P(y_h = j) = \sum_{i=1}^{p} \frac{k_{i\cdot h}}{n}\, P(y_h = j \mid x_h = i) \qquad (6)$$
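The relations (3), (5) and (6) can be checked on a small example. The following is a minimal sketch with hypothetical toy data (n = 8 individuals, p = 3, q = 2); the variable names are illustrative assumptions and do not come from the paper.

```python
import numpy as np

# Hypothetical codes of x and y for n = 8 individuals on one occasion.
x_codes = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_codes = np.array([0, 0, 1, 1, 1, 0, 1, 1])
n = x_codes.size
X = np.eye(3)[x_codes]            # n x p disjunctive matrix
Y = np.eye(2)[y_codes]            # n x q disjunctive matrix

K = X.T @ Y                       # contingency table K_h
D_p = X.T @ X                     # diagonal matrix of row totals k_{i.}
R = np.linalg.inv(D_p) @ K        # row profiles P(y = j | x = i)

P_X = X @ np.linalg.inv(D_p) @ X.T
Y_tilde = P_X @ Y                 # minimum quadratic estimates, eq. (3)

# Each individual receives the conditional profile of its modality of x.
same_as_profiles = np.allclose(Y_tilde, X @ R)

# Marginal probabilities of y recovered through eq. (6).
marginal = (np.diag(D_p) / n) @ R
print(same_as_profiles, Y_tilde[0], marginal)
```

Here the first individual has x = 0, so its estimate is the first row profile (2/3, 1/3), and the weighted average of the profiles reproduces the marginal distribution of y, (3/8, 5/8).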

3.3. Matrix of centered minimum quadratic estimates of the variable y on the hth occasion
The matrix of centered minimum quadratic estimates of y as a function of x, denoted by $\tilde{Y}_{ch}$, is defined by:

$$\tilde{Y}_{ch} = P_{X_h} Y_{ch} = (P_{X_h} - P_m) Y_h = \tilde{Y}_h - P_m Y_h = X_h R_h - P_m Y_h \qquad (7)$$

The matrix of centered minimum quadratic projections of the variable y onto the variable x, on the hth occasion, has generic term (s, j): $P(y_h = j \mid x_h = i) - P(y_h = j)$, for every individual s that presents the modality Ai of the variable x. Reordering the matrix of centered minimum quadratic estimates $\tilde{Y}_{ch}$, and using the Kronecker product of matrices, the generic block (i, j) of the matrix $\tilde{Y}_{ch}$ is

$$j_{k_{i\cdot h}} \otimes \left( P(y_h = j \mid x_h = i) - P(y_h = j) \right) \qquad (8)$$

That is, the centered minimum quadratic estimate for an individual who presents the modality Ai of the variable x, on the hth occasion, is the vector of conditional frequencies of $y_h \mid x_h = i$ minus the marginal frequencies of $y_h$:

$$\tilde{Y}_c(x = i) = \left( \frac{k_{i1}}{k_{i\cdot}} - \frac{k_{\cdot 1}}{n}, \cdots, \frac{k_{ij}}{k_{i\cdot}} - \frac{k_{\cdot j}}{n}, \cdots, \frac{k_{iq}}{k_{i\cdot}} - \frac{k_{\cdot q}}{n} \right) \qquad (9)$$

4. Variability in the STATIS-NSCA2


In this section we examine the notion of variability in the particular case of a single explanatory variable x, where both x and y are categorical.

4.1. Total variability


Measuring the total variability of the variable y on the hth occasion in STATIS-NSCA2 means quantifying globally how close the rows of the matrix $Y_{ch}$ are to their center of gravity. For this, the trace of the matrix $Y_{ch}^t Y_{ch}$ is calculated, which turns out to be:

$$\mathrm{trace}(Y_{ch}^t Y_{ch}) = \mathrm{trace}(D_{qh}) - \mathrm{trace}(Y_h^t P_m Y_h) = n - n \sum_{j=1}^{q} f_{\cdot jh}^2 = n \left( 1 - \sum_{j=1}^{q} f_{\cdot jh}^2 \right) \qquad (10)$$
j=1 j=1


4.2. Variability explained


Measuring the explained variability of the variable y on the hth occasion in STATIS-NSCA2 consists of quantifying globally how close the conditional probabilities (rows of the matrix $\tilde{Y}_{ch}$) are to their center of gravity $g_{\tilde{Y}_{ch}}$. To do this, the trace of the matrix $\tilde{Y}_{ch}^t \tilde{Y}_{ch}$ is calculated:

$$\mathrm{trace}(\tilde{Y}_{ch}^t \tilde{Y}_{ch}) = \mathrm{trace}(Y_h^t (P_{X_h} - P_m) Y_h) \qquad (11)$$

$$= \mathrm{trace}(Y_h^t P_{X_h} Y_h) - \mathrm{trace}(Y_h^t P_m Y_h) = n \sum_{j=1}^{q} \sum_{i=1}^{p} \frac{f_{ijh}^2}{f_{i\cdot h}} - n \sum_{j=1}^{q} f_{\cdot jh}^2 . \qquad (12)$$

5. Goodman-Kruskal Index
In this section we present a statistical index, Goodman-Kruskal's τ, that measures the strength with which the variable x explains y. The quotient between the explained variability and the total variability constitutes an association index, similar to Goodman-Kruskal's, which represents the proportion of the variability of the variable y explained by the variable x (on the h-th occasion) and is denoted by $\tau_{y_h \cdot x_h}$. It has the following representation:

$$\frac{VE(h)}{VT(h)} = \frac{\sum_{j=1}^{q} \sum_{i=1}^{p} \frac{f_{ijh}^2}{f_{i\cdot h}} - \sum_{j=1}^{q} f_{\cdot jh}^2}{1 - \sum_{j=1}^{q} f_{\cdot jh}^2} = \tau_{y_h \cdot x_h} .$$

For more information about this index, see [1].
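The expressions (10), (12) and the quotient above translate directly into a few lines of code. The following is a minimal sketch, assuming a contingency table with x in the rows, y in the columns and no empty row; the function name `goodman_kruskal_tau` and the example tables are hypothetical.

```python
import numpy as np

def goodman_kruskal_tau(K):
    """Return (tau, VE, VT) for a contingency table K (rows: x, columns: y).
    Assumes every row of K has a positive total."""
    K = np.asarray(K, dtype=float)
    n = K.sum()
    F = K / n                                   # relative frequencies f_ij
    f_row = F.sum(axis=1)                       # f_i.
    f_col = F.sum(axis=0)                       # f_.j
    VT = n * (1.0 - np.sum(f_col ** 2))         # total variability, eq. (10)
    VE = n * (np.sum(F ** 2 / f_row[:, None]) - np.sum(f_col ** 2))  # eq. (12)
    return VE / VT, VE, VT

# Hypothetical tables: y fully determined by x, and y independent of x.
tau_perfect, _, _ = goodman_kruskal_tau([[5, 0], [0, 5]])
tau_indep, _, _ = goodman_kruskal_tau([[2, 4], [3, 6]])
print(tau_perfect, tau_indep)
```

The two examples illustrate the extremes of the index: it equals 1 when each modality of x determines the modality of y, and 0 when the row profiles coincide with the marginal distribution of y.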

6. Internal analysis within each block with an NSCA approach
The analytical treatment of the information consists of performing an NSCA, which means analyzing the variability of y explained by x by decomposing the Goodman-Kruskal statistic $\tau_{y \cdot x}$ (1954, [1]), according to the following steps:

• Step 1: obtaining the minimum quadratic estimates of y.

• Step 2: analysis of variability using the Goodman-Kruskal $\tau_{y_h \cdot x_h}$.

• Step 3: determination of the principal directions of the NSCA, obtained by spectral decomposition of the matrix $\tilde{Y}_h^t \tilde{Y}_h$.

• Step 4: construction of a biplot representation of the rows (conditional probabilities of y | x = i for the different modalities of the variable y) and the columns (conditional probabilities of y = j for the different modalities of the variable x) from the matrix $\tilde{Y}_h$.


7. ICI strategy in the STATIS-NSCA2


In this particular investigation, the aim is to compare the variability of the variable y as a function of the variable x, between occasions, in terms of the Goodman-Kruskal coefficient $\tau_{y_h \cdot x_h}$. To this end, a three-phase procedure is carried out that meets the following objectives:

1. Interstructure: identification of information blocks $(x_h, y_h)$, h = 1, · · · , H, that are similar to each other.

2. Intrastructure: description of the differences or similarities between individuals, using the NSCA as a fundamental piece to analyze the behavior of individuals on each occasion, and of the blocks that in some way explain the reasons for the similarities and/or differences between individuals.

3. Compromise: construction of a common framework of representation for all the information blocks $(x_h, y_h)$, h = 1, · · · , H.

8. Methodology in the STATIS-NSCA2


Next, the different elements required for the interstructure phase of the STATIS adaptation, based on the Frobenius scalar product, are described.

8.1. Theoretical elements: study

The study for the hth block is defined as the triplet $E_h = ((X_h, Y_h), M_h, D_h)$, where

1. $X_h$ and $Y_h$ are the disjunctive matrices corresponding to the qualitative variables x and y, on the hth occasion, on n individuals.

2. $M_h = (X_h^t X_h)^{-1}$ is a positive definite matrix that defines the metric used to construct distances between individuals.

3. $D_h = \frac{1}{\sqrt{VT(h)}} I_n$, where VT(h) is the total variability of y in block h.

8.2. Theoretical elements: object


Given the study $E_h$ for the hth block, the non symmetrical object is defined as:

$$W_h = X_h M_h X_h^t Y_{ch} = X_h M_h X_h^t (I - P_m) Y_h = (P_{X_h} - P_m) Y_h \qquad (13)$$

The object $W_h$ turns out to be a matrix of order (n × n) × (n × q) = n × q⁵. These objects are not square matrices, as in classic STATIS, and even less symmetric matrices.
5 Note that the order of this matrix does not depend on p, the number of modalities of the explanatory variable, but it does depend on the number of individuals under study and on the number q of modalities of the criterion variable.


8.3. Theoretical elements: scalar product


The Frobenius scalar product between two matrices $W_h$ and $W_l$, of dimension n × q, is defined as

$$\langle W_h \mid W_l \rangle_{FNSCA2} = \mathrm{trace}((D_h W_h)^t D_l W_l) = \mathrm{trace}(W_h^t D_h D_l W_l) \qquad (14)$$

which takes values on the real line, both positive and negative.
The Frobenius scalar product of an object with itself on occasion h has the following expression:

$$\langle W_h \mid W_h \rangle_{FNSCA2} = \mathrm{trace}((D_h W_h)^t D_h W_h) = \mathrm{trace}(W_h^t D_h^2 W_h) \qquad (15)$$

$$= \frac{1}{\sqrt{VT(h)\, VT(h)}}\, \mathrm{trace}(\tilde{Y}_{ch}^t \tilde{Y}_{ch}) \qquad (16)$$

$$= \frac{1}{VT(h)}\, VE(h) = \tau_{y_h \cdot x_h} \qquad (17)$$

so the scalar product of an object with itself, on the hth occasion, expresses a ratio of variabilities, that is, the association between the variables measured through $\tau_{y_h \cdot x_h}$.
The Frobenius scalar product between occasions h and l has the following expression:

$$\langle W_h \mid W_l \rangle_{FNSCA2} = \mathrm{trace}((D_h W_h)^t D_l W_l) = \mathrm{trace}(W_h^t D_h D_l W_l) = \frac{1}{\sqrt{VT(h)\, VT(l)}}\, \mathrm{trace}(\tilde{Y}_{ch}^t \tilde{Y}_{cl}) \qquad (18)$$

This product is defined as a function of the covariances between the estimates of y in block h and block l.

8.4. Theoretical elements: distance


A distance between objects is defined from the norm naturally induced by the Frobenius scalar product:

$$d^2_{FNSCA2}(W_h, W_l) = \| W_h - W_l \|^2_{FNSCA2} = \frac{1}{\sqrt{VT(y_h)\, VT(y_l)}} \langle W_h - W_l \mid W_h - W_l \rangle_{FNSCA2} = \tau_{y_h \cdot x_h} + \tau_{y_l \cdot x_l} - 2 \langle W_h \mid W_l \rangle_{FNSCA2} \qquad (19)$$

where the last term is a measure of the covariance between the estimates of y in both blocks. Thus, the greater this covariability, the smaller the distance between the representations of the blocks. That is, the distance measure $d_{FNSCA2}$ quantifies the differences between the structures defined by the minimum quadratic estimates of the criterion variable on two different occasions, which are obtained from the corresponding explanatory variables.
In summary, the proposed distance measure between the objects of interest is greater the larger the respective Goodman-Kruskal association indices (on both occasions considered), and smaller the larger the aggregate measure of the covariances between the minimum quadratic estimates of the criterion variable on occasions h and l.
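To make the scalar product (14) and the distance (19) concrete, the sketch below computes them for two hypothetical blocks observed on the same n = 8 individuals. The helper names `block_object` and `scalar_product` and the toy data are illustrative assumptions; the distance is computed here as the squared Frobenius norm of $D_h W_h - D_l W_l$, which coincides with the last expression in (19).

```python
import numpy as np

def block_object(x_codes, y_codes, p, q):
    """Return the object W_h = (P_X - P_m) Y_h and the total variability VT(h)."""
    X = np.eye(p)[np.asarray(x_codes)]          # n x p disjunctive matrix
    Y = np.eye(q)[np.asarray(y_codes)]          # n x q disjunctive matrix
    n = X.shape[0]
    P_X = X @ np.linalg.inv(X.T @ X) @ X.T
    P_m = np.ones((n, n)) / n
    W = (P_X - P_m) @ Y
    VT = n * (1.0 - np.sum(Y.mean(axis=0) ** 2))
    return W, VT

def scalar_product(W_h, VT_h, W_l, VT_l):
    """trace(W_h^t D_h D_l W_l) with D = I / sqrt(VT)."""
    return np.trace(W_h.T @ W_l) / np.sqrt(VT_h * VT_l)

# Hypothetical toy blocks: the same 8 individuals on two occasions.
W1, VT1 = block_object([0, 0, 0, 1, 1, 2, 2, 2], [0, 0, 1, 1, 1, 0, 1, 1], 3, 2)
W2, VT2 = block_object([0, 0, 1, 1, 1, 2, 2, 0], [0, 0, 1, 1, 1, 1, 1, 0], 3, 2)

tau1 = scalar_product(W1, VT1, W1, VT1)   # Goodman-Kruskal tau of block 1
tau2 = scalar_product(W2, VT2, W2, VT2)   # tau of block 2 (y determined by x)
s12 = scalar_product(W1, VT1, W2, VT2)
d2 = tau1 + tau2 - 2.0 * s12              # squared distance, eq. (19)
print(tau1, tau2, s12, d2)
```

In the second block every modality of x is associated with a single modality of y, so its scalar product with itself equals 1.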

8.5. Interstructure representation space


A matrix containing the Frobenius scalar products between the objects of the different blocks is defined.
The H × H matrix that contains the pairwise Frobenius scalar products between objects is denoted by S:

$$
S = \begin{pmatrix}
\langle W_1 \mid W_1 \rangle_{FNSCA2} & \cdots & \langle W_1 \mid W_i \rangle_{FNSCA2} & \cdots & \langle W_1 \mid W_H \rangle_{FNSCA2} \\
\vdots & & \vdots & & \vdots \\
\langle W_i \mid W_1 \rangle_{FNSCA2} & \cdots & \langle W_i \mid W_i \rangle_{FNSCA2} & \cdots & \langle W_i \mid W_H \rangle_{FNSCA2} \\
\vdots & & \vdots & & \vdots \\
\langle W_H \mid W_1 \rangle_{FNSCA2} & \cdots & \langle W_H \mid W_i \rangle_{FNSCA2} & \cdots & \langle W_H \mid W_H \rangle_{FNSCA2}
\end{pmatrix} \qquad (20)
$$

Substituting the expressions that define $W_h$ and $W_l$, the general term is:

$$S_{hl} = \frac{1}{\sqrt{VT(h)\, VT(l)}}\, \mathrm{trace}(\tilde{Y}_{ch}^t \tilde{Y}_{cl}) \qquad (21)$$

$$= \frac{1}{\sqrt{VT(h)\, VT(l)}}\, \mathrm{trace}(Y_h^t (P_{X_h} P_{X_l} - P_m) Y_l)$$

where $P_{X_h}$ is the matrix of projection onto the space generated by the modalities of the variable x on the hth occasion, and likewise $\tilde{Y}_h = (\tilde{Y}_{(1,h)}, \cdots, \tilde{Y}_{(q,h)})$ and $\tilde{Y}_l = (\tilde{Y}_{(1,l)}, \cdots, \tilde{Y}_{(q,l)})$ are the projections of y onto the space generated by the modalities of x on occasions h and l.
A very important result is that the main diagonal of the matrix S contains the Goodman-Kruskal association indices corresponding to each of the H paired blocks:

$$S_{hh} = \frac{1}{VT(h)}\, \mathrm{trace}(Y_h^t (P_{X_h} - P_m) Y_h) \qquad (22)$$
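The second equality in (21) is stated without proof. The short derivation below is a sketch added for the reader; it relies only on the symmetry of the projectors and on $P_{X_h} P_m = P_m P_{X_l} = P_m$ and $P_m P_m = P_m$ (Section 3.1):

$$
\begin{aligned}
\tilde{Y}_{ch}^t \tilde{Y}_{cl} &= Y_h^t (P_{X_h} - P_m)^t (P_{X_l} - P_m) Y_l \\
&= Y_h^t \left( P_{X_h} P_{X_l} - P_{X_h} P_m - P_m P_{X_l} + P_m P_m \right) Y_l \\
&= Y_h^t \left( P_{X_h} P_{X_l} - P_m - P_m + P_m \right) Y_l \\
&= Y_h^t \left( P_{X_h} P_{X_l} - P_m \right) Y_l .
\end{aligned}
$$

For h = l, the idempotence of $P_{X_h}$ reduces the middle factor to $P_{X_h} - P_m$, which gives (22).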

The construction of the matrix S serves two purposes: on the one hand, it yields a comparative graph of the level of association of the paired blocks; on the other hand, it is a tool that facilitates the reconstruction of distances between objects in the usual Euclidean space, with an interpretation of interest for the purposes of the analysis. The position of a paired block in that space, determined by its distance to the origin of coordinates, is measured by a variation of the Goodman-Kruskal association index of the block in question:

$$\| W_h \|^2_{FNSCA2} = \mathrm{trace}(W_h^t D_h^2 W_h) \qquad (23)$$

$$= \frac{\mathrm{trace}(Y_h^t (P_{X_h} - P_m) Y_h)}{VT(h)} = \frac{VE(h)}{VT(h)} \qquad (24)$$

$$= \tau_{y_h \cdot x_h} \qquad (25)$$

The spectral decomposition of the matrix S is carried out in order to find a representation space for the objects of the different blocks. The eigenvalues and eigenvectors are then obtained:

$$S G^{\alpha} = t_{\alpha} G^{\alpha}, \qquad \alpha = 1, 2, \cdots, H \qquad (26)$$

This factorization makes it possible to decompose the Goodman-Kruskal association index:

$$\mathrm{trace}(S) = \sum_{\alpha=1}^{H} \tau_{y_\alpha \cdot x_\alpha} = \sum_{\alpha=1}^{H} t_{\alpha} \qquad (27)$$

By diagonalizing the matrix $Y_h^t (P_{X_h} - P_m) Y_h$, it follows that the trace of the matrix S is $\sum_{\alpha=1}^{H} \tau_{y_\alpha \cdot x_\alpha}$, where each eigenvalue is associated with a principal direction of the non symmetrical correspondence analysis (NSCA). Therefore, the α-th direction of this space, $G^{\alpha}$, captures a portion equal to:

$$\frac{t_{\alpha}}{\sum_{\alpha=1}^{H} \tau_{y_\alpha \cdot x_\alpha}} \qquad (28)$$


of the overall measure of how strongly x explains y across the blocks.
On the other hand, it is possible to determine a set of H points $A_1, \cdots, A_h, \cdots, A_H$ in the Euclidean space whose directions are determined by the column vectors of the matrix G (spectral decomposition of S), which represent the positions of the objects $W_1, \cdots, W_h, \cdots, W_H$, so that this representation preserves the interdistance structure between the matrices that identify these objects. That is, the distance between the points $A_h$ and $A_l$ is the same as the distance $d_{FNSCA2}$ between the objects $W_h$ and $W_l$.
It is established in this way, using the foundations of PCA⁶, that the projection coordinates of the H objects $W_1, \cdots, W_h, \cdots, W_H$ on the α-th principal axis of the spectral decomposition of S are given by the vector $\sqrt{t_{\alpha}}\, G^{\alpha}$, α = 1, · · · , h, · · · , H. Therefore, the representation over the whole space can be written in matrix form as:

$$
G T^{1/2} = \begin{pmatrix} G^1 & \cdots & G^h & \cdots & G^H \end{pmatrix}
\begin{pmatrix}
\sqrt{t_1} & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & \sqrt{t_h} & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & \sqrt{t_H}
\end{pmatrix} \qquad (29)
$$

$$= \begin{pmatrix} \sqrt{t_1}\, G^1 & \sqrt{t_2}\, G^2 & \cdots & \sqrt{t_h}\, G^h & \cdots & \sqrt{t_H}\, G^H \end{pmatrix} \qquad (30)$$

In this representation, the squared distance of the object $W_h$ to the origin of coordinates is approximately $\tau_{y_h \cdot x_h}$; that is, for objects farther from the origin, the explanatory power of x is greater.
The projection coordinates of the H objects on the axis α take the form:

$$
\gamma^{\alpha} = \begin{pmatrix} \sqrt{t_{\alpha}}\, g_{1\alpha} \\ \vdots \\ \sqrt{t_{\alpha}}\, g_{h\alpha} \\ \vdots \\ \sqrt{t_{\alpha}}\, g_{H\alpha} \end{pmatrix} \qquad (31)
$$
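The interstructure computation in (26) to (31) can be sketched numerically. As an illustration only, the example below uses the upper-left 3 × 3 block (objects W1, W2, W3) of the matrix S reported in Table 1 of the example in Section 9, expressed as proportions; restricting the analysis to three objects is an assumption made to keep the example short, and the variable names are hypothetical.

```python
import numpy as np

# Upper-left 3 x 3 block of Table 1 (objects W1, W2, W3), as proportions.
S = np.array([[1.0000, 0.8663, 0.6443],
              [0.8663, 0.9091, 0.6553],
              [0.6443, 0.6553, 0.7711]])

t, G = np.linalg.eigh(S)            # eigenvalues returned in ascending order
order = np.argsort(t)[::-1]         # reorder so that t_1 >= t_2 >= ...
t, G = t[order], G[:, order]
t = np.clip(t, 0.0, None)           # guard against tiny negative rounding errors

coords = G * np.sqrt(t)             # G T^{1/2}: row h = point A_h, column alpha = gamma^alpha
share = t / np.trace(S)             # portion captured by each direction, eq. (28)

sq_norms = np.sum(coords ** 2, axis=1)            # squared distances to the origin
d2_13 = np.sum((coords[0] - coords[2]) ** 2)      # squared distance between A_1 and A_3
print(share)
print(sq_norms)
print(d2_13)
```

When all H axes are retained, the squared norms reproduce the diagonal of S exactly and the distances between points reproduce $S_{hh} + S_{ll} - 2 S_{hl}$; the approximation mentioned in the text arises when only the first axes are kept for plotting.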

It is proposed to construct the compromise object as a linear combination of the objects, of the form

$$W_{comp} = \sum_{h=1}^{H} \alpha_h W_h = \sum_{h=1}^{H} \alpha_h \tilde{Y}_{ch} = \sum_{h=1}^{H} \alpha_h (P_{X_h} - P_m) Y_h \qquad (32)$$

chosen so that its covariance with the objects analyzed is maximized globally, in the sense of the inner product defined on the objects:

$$\langle W_h \mid W_l \rangle_{FNSCA2} = \mathrm{trace}(W_h^t D_h D_l W_l) = \frac{1}{\sqrt{VT(h)\, VT(l)}}\, \mathrm{trace}(\tilde{Y}_{ch}^t \tilde{Y}_{cl}) \qquad (33)$$

The objective function is then of the form:

$$\sum_{h=1}^{H} \langle W_{comp} \mid W_h \rangle^2_{FNSCA2} = \alpha^t S S^t \alpha \qquad (34)$$

with the restriction $\sum_{h=1}^{H} \alpha_h^2 = 1$. The problem is to determine the vector $\alpha = (\alpha_1, \cdots, \alpha_h, \cdots, \alpha_H)^t$ of coefficients of the compromise object, which turns out to be the normalized eigenvector of the symmetric matrix $S S^t$ associated with its largest eigenvalue $(t_1)^2$, in the form:

$$\alpha = \frac{G^1}{\| G^1 \|_{FNSCA2}} = \frac{G^1}{\left( (G^1)^t G^1 \right)^{1/2}} . \qquad (35)$$

6 PCA: Principal Component Analysis
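The determination of the compromise weights in (34) and (35) can be illustrated as follows. This is a minimal sketch that again assumes, for brevity, the upper-left 3 × 3 block of the matrix S of Table 1 (Section 9) as proportions; the variable names and the random comparison vectors are hypothetical.

```python
import numpy as np

# Upper-left 3 x 3 block of Table 1 (objects W1, W2, W3), as proportions.
S = np.array([[1.0000, 0.8663, 0.6443],
              [0.8663, 0.9091, 0.6553],
              [0.6443, 0.6553, 0.7711]])

M = S @ S.T                          # matrix of the objective function (34)
vals, vecs = np.linalg.eigh(M)       # ascending eigenvalues
alpha = vecs[:, -1]                  # eigenvector of the largest eigenvalue
alpha = alpha / np.linalg.norm(alpha)
if alpha.sum() < 0:                  # fix the arbitrary sign of the eigenvector
    alpha = -alpha

objective = alpha @ M @ alpha        # value of (34) at the optimum

# Compare with random unit vectors: none should exceed the optimum.
rng = np.random.default_rng(1)
trials = rng.normal(size=(200, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = max(v @ M @ v for v in trials)
print(alpha, objective, best_random)
```

Because S is symmetric, $S S^t = S^2$ has the same eigenvectors as S, so the weights are proportional to the first eigenvector $G^1$ of S and the optimum equals the square of its largest eigenvalue.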

For a better understanding, Figure 2 summarizes the different steps of STATIS-NSCA2.

Figure 2: The different steps of STATIS-NSCA2.

9. An example
This section illustrates the proposed technique, applying it on a set of real data.

9.1. Context
As published by the Financial Superintendence of Colombia in Chapter 2 of the Basic Accounting and Financial Circular (External Circular 100 of 1995)⁷, credit risk (CR) is the possibility that an entity incurs losses and the value of its assets decreases as a result of a debtor or counterparty breaching its obligations.
The Superintendence also indicates that supervised entities must permanently assess the risk incorporated in their credit assets, both at the time credits are granted and throughout their life, including cases of restructuring. For this purpose, the entities must design and adopt a Credit Risk Management System (SARC⁸).

7 https://www.superfinanciera.gov.co/publicacion
8 SARC: Sistema de Administración del Riesgo Crediticio, in Spanish.


The basic elements that the SARC should comprise are: CR administration policies, CR administration processes, internal or reference models for the estimation or quantification of expected losses, a system of provisions to cover the CR, and internal control processes.
Therefore, financial institutions must establish efficient schemes for managing and controlling the credit risk to which they are exposed in the course of their business, in line with their own risk profile and market segmentation, and according to the characteristics of the markets in which they operate and the products they offer. It is therefore necessary that each entity develop its own work scheme, one that ensures the quality of its assets and also makes it possible to identify, measure, control (mitigate) and monitor the materialization of the different risks to which it is exposed as a bank.

9.2. Data configuration

The analysis uses two qualitative variables x and y, measured on 786 individuals over the 8 quarters of the years 2016 and 2017. The variables considered are:

1. A variable x, which corresponds to the risk rating⁹ (at the client level) of each client, with 5 modalities: A, B, C, D and E.

2. A variable y, which corresponds to a risk state (at the client level) that the entity establishes with external information. This variable has 4 modalities or categories: State 1, State 2, State 3 and State 4.

It is assumed that the internal risk rating of each client (variable x) explains in some way the state obtained with external information (variable y).
In the following sections, the logical structure of the data blocks is explained and the blocks (xh, yh), h = 1, · · · , 8, are explored from the point of view of the STATIS-NSCA2 methodology.

9.3. Variables analyzed


1. Rating of credit risk. According to the regulation of the Colombian Superintendencia Financiera¹⁰, contracts must be classified in one of the following credit risk categories:

• Category A, or normal risk.

• Category B, or acceptable risk, higher than normal.

• Category C, or appreciable risk.

• Category D, or significant risk.

• Category E, or risk of uncollectibility.

9 Variable subject to Chapter 2 of the Superintendencia Financiera de Colombia.
10 https://www.superfinanciera.gov.co/publicacion

2. Risk rating at the client level. To obtain a risk rating at the client level, the rating of the credit with the highest exposure is used. That is, the analysis presented in this paper is carried out at the client level. For practical purposes, throughout this document the term risk rating at the client level refers to the rating of the credit with the highest exposure.

3. Risk state (client level). Segmentation performed by the financial institution to track all shared clients¹¹, based on the payment behavior that they present, both within the entity and in the financial sector (this variable incorporates data from the credit bureau¹² ¹³).
This state is built by the entity on a quarterly basis for all customers who have current credit products with the entity and who have debts with the sector. The states, according to the observed behavior, are:

• State 1: customers who, when queried in the credit bureaus in the quarter, meet their financial obligations both to the sector and to the entity.

• State 2: customers who, when queried in the credit bureaus in the quarter, fail to meet their financial obligations to the sector but meet those to the entity.

• State 3: customers who, when queried in the credit bureaus in the quarter, meet their financial obligations to the sector but do not meet those to the entity.

• State 4: customers who, when queried in the credit bureaus in the quarter, fail to meet their financial obligations both to the sector and to the entity.

9.4. Description of the sample analyzed


The sample analyzed in this article is part of a quarterly study carried out by the
nancial institution. Funciona con n = 786 clientes observados trimestralmente
durante 24 meses (H = 8). For practical purposes we will use the notation 2016Ti ,
i = 1, · · · , 4, indicates the i-th quarter of the year 2016, likewise the notation is
used 2017Ti , i = 1, · · · , 4 The n = 786 clients were evaluated in each quarterin
each of the variables mentioned (risk state and risk rating). The sample analyzed
in this chapter is part of a quarterly analysis carried out by a nancial institution,

11 Shared clients are those that maintain a current operation with the entity and that, additionally,
have at least one obligation with another entity in the financial sector.
12 The credit bureau is a private enterprise, independent of financial, commercial and government
institutions, which aims to collect and provide to its affiliates information concerning
the credit behavior of individuals.
13 Financial entities use the services of risk centers because this information provides an
innovative tool to support decision making in evaluation, credit risk prevention and client
management. These services, although not free, give simple access to the
most up-to-date and complete database of non-compliance information.


where information from the financial sector is evaluated, making it possible to obtain a global
measurement of the shared clients.

9.5. ICI strategy


For this dataset, the phases of the ICI strategy are as follows:

1. Interstructure: identify the information blocks (xh , yh ), h = 1, · · · , H ,
which are similar to each other.
Recall that S denotes the matrix containing the pairwise Frobenius scalar
products between objects. From the matrix S (see Table 1), the distances
between objects are reconstructed (see Figure 3(b)); it is easy to see that
the object W8 is furthest from the objects W1 , W2 and W3 .
The intensity with which the variable x explains y is high, with the lowest
values on occasions 3 and 5, where τyh ·xh is approximately 77 %.

Table 1: Matrix S. The main diagonal contains τyh ·xh for each occasion.

      W1         W2        W3        W4        W5        W6        W7        W8
W1    100.00 %   86.63 %   64.43 %   41.20 %   23.60 %   15.64 %   11.86 %    6.18 %
W2     86.63 %   90.91 %   65.53 %   41.42 %   23.31 %   15.39 %   11.46 %    5.82 %
W3     64.43 %   65.53 %   77.11 %   40.22 %   21.79 %   13.85 %    9.72 %    2.87 %
W4     41.20 %   41.42 %   40.22 %   89.70 %   41.60 %   26.06 %   17.73 %    5.81 %
W5     23.60 %   23.31 %   21.79 %   41.60 %   76.67 %   44.32 %   31.32 %   11.72 %
W6     15.64 %   15.39 %   13.85 %   26.06 %   44.32 %   90.58 %   62.10 %   32.45 %
W7     11.86 %   11.46 %    9.72 %   17.73 %   31.32 %   62.10 %   89.99 %   44.35 %
W8      6.18 %    5.82 %    2.87 %    5.81 %   11.72 %   32.45 %   44.35 %   79.05 %
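
The reconstruction of distances from S can be illustrated numerically. The following is a minimal sketch, assuming that the distance between two objects is the one induced by the scalar product, d²(Wh , Wk ) = Shh + Skk − 2 Shk ; the variable names are illustrative, and the values are those of Table 1 expressed as proportions.

```python
import numpy as np

# Matrix S from Table 1 (percentages converted to proportions).
S = np.array([
    [100.00, 86.63, 64.43, 41.20, 23.60, 15.64, 11.86,  6.18],
    [ 86.63, 90.91, 65.53, 41.42, 23.31, 15.39, 11.46,  5.82],
    [ 64.43, 65.53, 77.11, 40.22, 21.79, 13.85,  9.72,  2.87],
    [ 41.20, 41.42, 40.22, 89.70, 41.60, 26.06, 17.73,  5.81],
    [ 23.60, 23.31, 21.79, 41.60, 76.67, 44.32, 31.32, 11.72],
    [ 15.64, 15.39, 13.85, 26.06, 44.32, 90.58, 62.10, 32.45],
    [ 11.86, 11.46,  9.72, 17.73, 31.32, 62.10, 89.99, 44.35],
    [  6.18,  5.82,  2.87,  5.81, 11.72, 32.45, 44.35, 79.05],
]) / 100.0

# Squared distance induced by a scalar product (assumed form):
# d2[h, k] = S[h, h] + S[k, k] - 2 * S[h, k]
diag = np.diag(S)
d2 = diag[:, None] + diag[None, :] - 2.0 * S
dist = np.sqrt(np.clip(d2, 0.0, None))  # clip guards against rounding noise

labels = [f"W{h}" for h in range(1, 9)]
for label, row in zip(labels, dist):
    print(label, np.round(row, 3))

# Object that lies farthest from W8 under this distance.
farthest_from_w8 = labels[int(np.argmax(dist[7]))]
print("Farthest from W8:", farthest_from_w8)
```

Under this assumption, the squared distance between W8 and W1 is about 1.67, against about 0.80 between W8 and W7, which agrees with the remark that W8 lies far from the first occasions.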

2. Intrastructure: description of the differences or similarities among the
individuals, using the NSCA as the fundamental tool to analyze the behavior
of the individuals on each occasion and to identify the blocks that explain
the causes of the similarities and/or differences between the individuals.
Accordingly, the biplots in Figure 4 show the following:

• Occasion 1: risk rating A is mainly associated with State 4. The distance from rating A to the
origin is small because the row profile of rating A is very similar to the center of
gravity. The remaining ratings lie far from the origin because they differ significantly
from the center of gravity.
• Occasion 2: risk rating A is mainly associated with State 1, while rating B is associated with State 2, rating
C with State 3 and ratings D-E with State 4. The distance from rating A to the origin is small
because the row profile of rating A is very similar to the center of gravity. The remaining
ratings lie far from the origin because they differ significantly from the
center of gravity.
• Occasion 3: risk rating A is mainly associated with State 1, while rating B is associated with State 2,
rating C with State 3 and rating E with State 4. The distance from rating A to the origin is small
because the row profile of rating A is very similar to the center of gravity. The remaining
ratings lie far from the origin because they differ significantly from the
center of gravity.


Figure 3: Tools for analysis. (a) Factorial plane of S; (b) distance map between objects; (c) evolution of the Goodman-Kruskal τyh ·xh .

• Occasion 4: risk rating A is mainly associated with State 1, rating C with State 3 and
ratings D-E with State 4. The distance from rating A to the origin is small because the
row profile of rating A is very similar to the center of gravity. The remaining ratings lie far
from the origin because they differ significantly from the center of gravity. In
addition, although rating A is still close to the origin, compared with previous occasions it begins to
move away from the origin.
• Occasion 5: risk rating A is mainly associated with State 1, rating C with State 3 and
ratings D-E with State 4. The distance from rating A to the origin is increasing because
the row profile of rating A is moving away from the center of gravity. Meanwhile, the distance from
rating E to the origin is decreasing (compared with previous occasions) because
the row profile of rating E is very similar to the center of gravity.
• Occasion 6: risk rating A is mainly associated with State 1, rating C with State 3 and
ratings D-E with State 4. The distance from rating A to the origin has increased because
the row profile of rating A differs significantly from the center of gravity. Meanwhile, the distance
from rating E to the origin is decreasing (compared with previous occasions) because
the row profile of rating E is very similar to the center of gravity.
• Occasion 7: risk rating A is mainly associated with State 1, rating C with State 3 and
ratings D-E with State 4. The distance from rating A to the origin has increased because
the row profile of rating A differs significantly from the center of gravity. Meanwhile, the distance
from rating E to the origin is decreasing (compared with previous occasions) because
the row profile of rating E is very similar to the center of gravity.
• Occasion 8: risk rating A is mainly associated with State 1, rating C with State 3 and
ratings D-E with State 4. The distance from rating A to the origin has increased because
the row profile of rating A differs significantly from the center of gravity.
Meanwhile, the distance from rating E to the origin is decreasing (compared with previous
occasions) because the row profile of rating E is very similar to the center of gravity. On this
occasion, as on occasion 5 and in the base scenario, the population was concentrated mainly in the combination Category
E-State 4.
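
The reading of the biplots above rests on comparing each row profile with the center of gravity. As an illustration only, the sketch below computes the centered row profiles used in NSCA and the Goodman-Kruskal τ for a single contingency table. The table `N` is hypothetical (it is not the study data), rows are assumed to be the risk ratings (predictor x) and columns the risk states (response y), and the function name `nsca_profiles_and_tau` is introduced here for illustration.

```python
import numpy as np

def nsca_profiles_and_tau(N):
    """Centered row profiles and Goodman-Kruskal tau for a two-way table.

    Rows index the predictor categories, columns the response categories.
    Every row is assumed to have a positive total.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                    # joint proportions p_ij
    row_mass = P.sum(axis=1)           # p_i.
    col_mass = P.sum(axis=0)           # p_.j, the center of gravity
    profiles = P / row_mass[:, None]   # row profiles p_ij / p_i.
    centered = profiles - col_mass     # deviation from the center of gravity
    numerator = np.sum(row_mass[:, None] * centered**2)
    denominator = 1.0 - np.sum(col_mass**2)
    return centered, numerator / denominator

# Hypothetical counts: 3 ratings (rows) by 3 states (columns).
N = np.array([[80, 10, 10],
              [10, 60, 10],
              [ 5,  5, 40]])
centered, tau = nsca_profiles_and_tau(N)
print(np.round(centered, 3))
print(round(tau, 3))
```

A row whose centered profile is close to zero plots near the origin, which is the situation described for rating A on the first occasions; τ equals 1 when the rating determines the state exactly and 0 when the two variables are independent.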

3. Compromise: construction of a representation framework common to all
information blocks (xh , yh ), h = 1, · · · , H . The purpose of this step is
to analyze the compromise by building an additional space (the compromise
space) in which the individuals can be represented,


Figure 4: NSCA results. Biplots for occasions 1 to 8 (Axis 1 against Axis 2), showing the risk ratings A-E (+) and the risk states 1-4 (*). Percentages shown on the axes and in the panel titles:

Occasion   Axis 1     Axis 2     Plane
1          89.19 %     7.37 %    96.57 %
2          92.39 %     5.28 %    97.67 %
3          76.85 %    15.66 %    92.52 %
4          79.09 %    20.38 %    99.47 %
5          69.78 %    17.94 %    87.72 %
6          81.86 %    17.83 %    99.69 %
7          77.21 %    22.25 %    99.46 %
8          69.28 %    29.52 %    98.8 %

differentiating them according to their group.

The analysis of the low-dimensional space of the compromise matrix, based on
Figure 5, is as follows:

• The first quadrant of the compromise plane contains 30 % of the observations. These observations
are mainly clients that started in the combination Category A-State 1 (90 %), and
all of them ended in the combination Category E-State 4.
• The second quadrant contains 34 % of the observations; 100 % of these start on occasion 1 in the
combination Category A-State 1, and only 49 % end on occasion 8 in the combination
Category E-State 4.
• The third quadrant contains 24 % of the observations; 100 % of these start on occasion 1 in the
combination Category A-State 1, and only 34 % end on occasion 8 in the combination
Category E-State 4.
• Finally, the fourth quadrant contains 12 % of the observations; 89 % of these start on
occasion 1 in the combination Category A-State 1, and 89 % end on occasion 8
in the combination Category E-State 4.
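
The compromise plane described above is built from a single compromise object. In the classical STATIS method this object is a weighted sum of the objects Wh , with weights taken from the first eigenvector of S. The sketch below assumes that the adapted method keeps this convention, which the text here does not confirm; the function name `compromise` and the placeholder objects in the last lines are hypothetical, and S is taken from Table 1.

```python
import numpy as np

# Matrix S from Table 1 (percentages converted to proportions).
S = np.array([
    [100.00, 86.63, 64.43, 41.20, 23.60, 15.64, 11.86,  6.18],
    [ 86.63, 90.91, 65.53, 41.42, 23.31, 15.39, 11.46,  5.82],
    [ 64.43, 65.53, 77.11, 40.22, 21.79, 13.85,  9.72,  2.87],
    [ 41.20, 41.42, 40.22, 89.70, 41.60, 26.06, 17.73,  5.81],
    [ 23.60, 23.31, 21.79, 41.60, 76.67, 44.32, 31.32, 11.72],
    [ 15.64, 15.39, 13.85, 26.06, 44.32, 90.58, 62.10, 32.45],
    [ 11.86, 11.46,  9.72, 17.73, 31.32, 62.10, 89.99, 44.35],
    [  6.18,  5.82,  2.87,  5.81, 11.72, 32.45, 44.35, 79.05],
]) / 100.0

# eigh returns eigenvalues in ascending order for a symmetric matrix,
# so the last column is the eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
first = eigvecs[:, -1]
weights = first / first.sum()   # rescaled to sum to 1 (also fixes the sign)

def compromise(objects, weights):
    """Weighted sum of H objects stored as an array of shape (H, n, n)."""
    return np.tensordot(weights, objects, axes=1)

print(np.round(weights, 3))

# Placeholder objects (identity matrices), used only to show the call.
placeholder_objects = np.stack([np.eye(2)] * 8)
print(compromise(placeholder_objects, weights))
```

Because every entry of S is positive, the first eigenvector has entries of a single sign, so the rescaled weights are all positive and each occasion contributes to the compromise.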

Figure 5: Analysis of the low-dimensional space of the compromise matrix. Projection of the objects onto the compromise space (Wc); the plane accounts for 88.01 % (Axis 1: 71.24 %, Axis 2: 16.77 %). Points are labelled by individual (I1, · · · , I786).

10. Conclusions and recommendations


This research addressed the stated problem by adapting the STATIS methodology
to compare simultaneously the objects that identify the interdistance
structures corresponding to H occasions of qualitative variable blocks,
under a non-symmetrical perspective.
This was achieved by defining an appropriate scalar product between
objects, which makes it possible to compare the structures that describe the interdistances
between individuals characterized by the different sets of qualitative variables.
Additionally, the scalar product made it possible, on the one hand, to define a statistical distance
between objects and, on the other, to build a low-dimensional space in which
the different data sets with non-symmetrical relations, evaluated in terms of
the Goodman-Kruskal index τyh ·xh , can be compared.
In Section 9 an example was analyzed, which showed the following:

1. The representation of the objects in a low-dimensional space effectively
shows, graphically, the similarities and/or differences between the different
non-symmetrical blocks in relation to the Goodman-Kruskal τyh ·xh ,
h = 1, 2, · · · , 8.

2. The conclusion of the example is that the information from the credit
bureaus (measured through the risk state variable) is indeed related to the
risk rating of the client. Given that the quarterly consultation of the
credit bureaus represents a cost for the bank, it is
recommended to avoid this expense and use the convention: Category A
forecasts State 1, Category B forecasts State 2, Category C forecasts State
3 and, finally, Categories D-E forecast State 4.

11. Acknowledgements
I thank all the professors and researchers of the Universidad Central de Venezuela
who, despite all adversities, continue to defend "la casa que vence las sombras".


References
[1] Goodman, L., & Kruskal, W. (1972). Measures of association for cross classifications,
IV: Simplification of asymptotic variances. Journal of the American
Statistical Association, 415-421.

[2] Lauro, L. D. (1984). Non-symmetrical correspondence analysis. Data Analysis
and Informatics, III. Elsevier, North-Holland, Amsterdam, 433-446.

[3] Lavit, C., Escoufier, Y., & Traissac, P. (1994). The ACT (STATIS method).
Computational Statistics & Data Analysis (23), 97-119.
