A Rank-Invariant Method of Linear and Polynomial Regression Analysis


CHAPTER 20

A RANK-INVARIANT METHOD OF LINEAR AND POLYNOMIAL REGRESSION ANALYSIS*

HENRI THEIL
Economic Research Institute
University of Amsterdam
Amsterdam, The Netherlands

PART I

1. Introduction

Regression analysis is usually carried out under the hypothesis that one of the variables is normally distributed with constant variance, its mean being a function of the other variables. This assumption is not always satisfied, and in most cases it is difficult to ascertain.
In recent years attention has been paid to problems of estimating the parameters of regression equations under more general conditions (see the references at the end of this paper: A. Wald (1940), K.R. Nair and M.P. Shrivastava (1942), K.R. Nair and K.S. Banerjee (1942), G.W. Housner and J.F. Brennan (1948) and M.S. Bartlett (1949)). Confidence regions, however, were obtained under the assumption of normality only; to obtain these without this assumption will be the main object of this paper.
In section 2 confidence regions will be given for the parameters of linear regression equations in two variables. In the sequel of this paper we hope to deal with equations in more variables, polynomial equations, systems of equations and problems of prediction.

2. Confidence Regions for the Parameters of Linear Regression Equations in Two Variables

2.1 THE PROBABILITY SET

Throughout this section the probability set $\Gamma$ ("Wahrscheinlichkeitsfeld" in the sense of A. Kolmogoroff) underlying the probability statements will be the 3n-dimensional Cartesian space $R^{3n}$ with coordinates $u_1, \ldots, u_n, v_1, \ldots, v_n, w_1, \ldots, w_n$. Every random variable mentioned is supposed to be defined on this probability set.
In the first place we suppose 3n random variables $\mathbf{u}_i, \mathbf{v}_i, \mathbf{w}_i$ $(i = 1, \ldots, n)$ $^1$ to be defined on $\Gamma$, i.e. we suppose $\mathbf{u}_i, \mathbf{v}_i, \mathbf{w}_i$ to have a simultaneous probability distribution on $\Gamma$.

* This article first appeared in the Proceedings of the Royal Netherlands Academy of Sciences 53 (1950), Part I: 386-392, Part II: 521-525, Part III: 1397-1412. Reprinted here with the permission of the Royal Netherlands Academy of Arts and Sciences.

B. Raj et al. (eds.), Henri Theil’s Contributions to Economics and Econometrics. © Kluwer Academic Publishers and copyrightholders 1992.
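If we now put:

$$\theta_i = \alpha_0 + \alpha_1 \xi_i \qquad (1)$$
$$\boldsymbol{\eta}_i = \theta_i + \mathbf{w}_i \qquad (2)$$
$$\mathbf{x}_i = \xi_i + \mathbf{u}_i \qquad (3)$$
$$\mathbf{y}_i = \boldsymbol{\eta}_i + \mathbf{v}_i \qquad (4)$$

$(i = 1, \ldots, n)$

then, for any set of values of the (n+2) parameters $\xi_i$, $\alpha_0$ and $\alpha_1$, the variables $\mathbf{x}_i$ and $\mathbf{y}_i$ have a simultaneous distribution on $\Gamma$, and are therefore random variables.
We shall call $\xi_i$ the parameter values of the variable $\xi$. The equation (1) is the regression equation; this equation contains no stochastic variables. Furthermore we shall call $\mathbf{w}_i$ the "true deviations from linearity"; hence the variable $\eta$ is a linear function of $\xi$, but for the deviations $w$. Finally $\mathbf{u}_i$ and $\mathbf{v}_i$ are called the "errors of observation" of the true values $\xi_i$ and $\eta_i$ respectively.
The problem then is, under certain conditions for the probability distribution of $\mathbf{u}_i, \mathbf{v}_i, \mathbf{w}_i$, to determine confidence intervals for the parameters $\alpha_0$ and $\alpha_1$, given a sequence of observations $x_1, \ldots, x_n, y_1, \ldots, y_n$ of the random variables $\mathbf{x}_1, \ldots, \mathbf{x}_n, \mathbf{y}_1, \ldots, \mathbf{y}_n$.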
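To make the roles of the variables in equations (1) to (4) concrete, the following minimal sketch draws one artificial sample from the model. The function name, the uniform law for the errors u and the Gaussian laws for v and w are illustrative assumptions only; the text itself merely requires a simultaneous distribution on the probability set.

```python
import random


def simulate_two_variable_model(xi, alpha0, alpha1, g=0.1,
                                sigma_v=0.5, sigma_w=0.5, seed=0):
    """Draw one sample (x_i, y_i) from equations (1)-(4).

    Hypothetical helper: the distributions chosen here are assumptions
    made for illustration, not part of the paper.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for xi_i in xi:
        u = rng.uniform(-g, g)          # error of observation on xi_i
        v = rng.gauss(0.0, sigma_v)     # error of observation on eta_i
        w = rng.gauss(0.0, sigma_w)     # true deviation from linearity
        theta = alpha0 + alpha1 * xi_i  # equation (1)
        eta = theta + w                 # equation (2)
        xs.append(xi_i + u)             # equation (3)
        ys.append(eta + v)              # equation (4)
    return xs, ys
```

With all three spreads set to zero the sample lies exactly on the regression line, which is a convenient sanity check.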

2.2 INCOMPLETE METHOD: CONFIDENCE INTERVAL FOR $\alpha_1$ $^2$

We suppose that the following conditions are satisfied:

Condition I: The n triples $(\mathbf{u}_i, \mathbf{v}_i, \mathbf{w}_i)$ are stochastically independent.

Condition II: 1. Each of the errors $\mathbf{u}_i$ vanishes outside a finite interval $|u_i| < g_i$.
2. For each $i \neq j$ we have $|\xi_i - \xi_j| > g_i + g_j$.

From Condition II it follows that either

$$P[\mathbf{x}_i < \mathbf{x}_j] = 1 \quad \text{and} \quad \xi_i < \xi_j$$

or

$$P[\mathbf{x}_i > \mathbf{x}_j] = 1 \quad \text{and} \quad \xi_i > \xi_j.$$

This condition means that the errors $\mathbf{u}_i$ are sufficiently small that the arrangement of the observed values $x_i$ according to increasing magnitude is identical with the arrangement according to increasing values of $\xi_i$ (cf. also A. Wald (1940), p. 294 seq., where a similar (weaker) condition is imposed). The arrangement of the $x_i$ is therefore uniquely determined. We therefore suppose the $x_i$ as well as the $\xi_i$ to be arranged according to increasing order.

1 The distinction between a stochastic variable and the value it takes in a given observation (or system of observations) will be indicated by bold type for the former one.
2 The author is indebted to Mr J. Hemelrijk for his constructive criticism concerning some points of this section.
Put $n_1 = n - [\tfrac{1}{2}n]$; if n is odd, the observation with rank $\tfrac{1}{2}(n+1)$ is not used. We therefore omit this observation and write $n = 2n_1$.
We determine the following $n_1$ statistics:

$$\Delta(i, n_1+i) = \frac{y_{n_1+i} - y_i}{x_{n_1+i} - x_i} = \alpha_1 + \frac{z_{n_1+i} - z_i}{x_{n_1+i} - x_i}$$

in which $z_i = -\alpha_1 u_i + v_i + w_i$.


We now impose:
Condition III, which states:

$$P[\mathbf{z}_i < \mathbf{z}_{n_1+i}] = P[\mathbf{z}_i > \mathbf{z}_{n_1+i}] = \tfrac{1}{2}.$$

As all denominators $x_{n_1+i} - x_i$ are positive, it follows that

$$P[\Delta(i, n_1+i) < \alpha_1] = P[\Delta(i, n_1+i) > \alpha_1] = \tfrac{1}{2},$$

i.e. that $\Delta(i, n_1+i)$ has a median $\alpha_1$ and that its distribution function is continuous in the median.
The following conditions IIIa and IIIb are each sufficient in order that $P[\mathbf{z}_i < \mathbf{z}_{n_1+i}] = P[\mathbf{z}_i > \mathbf{z}_{n_1+i}] = \tfrac{1}{2}$:
Condition IIIa: the random variables $\mathbf{z}_i$ $(i = 1, \ldots, n)$ have the same continuous distribution function.
Condition IIIb: the random variables $\mathbf{z}_i$ have continuous distribution functions which are symmetrical with equal medians med(z).

Proof: In case IIIa the simultaneous distribution of $\mathbf{z}_i$ and $\mathbf{z}_{n_1+i}$ is symmetrical about the line $z_i = z_{n_1+i}$, which proves the statement.
In case IIIb it is symmetrical about the lines $z_i = \text{med}(z)$ and $z_{n_1+i} = \text{med}(z)$; hence the simultaneous distribution of $\mathbf{z}_i - \text{med}(z)$ and $\mathbf{z}_{n_1+i} - \text{med}(z)$ is symmetrical with respect to the origin, which proves the statement.


We now arrange the $n_1$ statistics $\Delta(i, n_1+i)$ in increasing order:

$$\Delta_1 < \Delta_2 < \ldots < \Delta_{n_1}$$

in which

$$\Delta_j = \Delta(i_j, n_1 + i_j).$$

The probability that exactly r among the $n_1$ values $\Delta(i, n_1+i)$ are $< \alpha_1$, i.e. that $\Delta_r < \alpha_1 < \Delta_{r+1}$, is $2^{-n_1}\binom{n_1}{r}$ because of the conditions I and III. Hence:

$$P[\Delta_{r_1} < \alpha_1 < \Delta_{n_1-r_1+1}] = 1 - 2 I_{1/2}(r_1, n_1 - r_1 + 1)$$

in which

$$I_{1/2}(r_1, n_1-r_1+1) = \frac{\int_0^{1/2} x^{r_1-1}(1-x)^{n_1-r_1}\,dx}{\int_0^{1} x^{r_1-1}(1-x)^{n_1-r_1}\,dx}$$

is the incomplete Beta-function for the argument $\tfrac{1}{2}$.

So we have proved:
Theorem 1: under conditions I, II and III a confidence interval for $\alpha_1$ is given by the largest but $(r_1 - 1)$ and the smallest but $(r_1 - 1)$ among the values $\Delta(i, n_1+i)$, the level of significance being $2 I_{1/2}(r_1, n_1 - r_1 + 1)$.
We shall call this method an "incomplete method" because a limited use is made of the $\binom{n}{2}$ statistics

$$\Delta(ij) = \frac{y_i - y_j}{x_i - x_j} \qquad (i < j).$$
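The incomplete method of this section can be sketched in a few lines. The function names below are hypothetical, and the level is computed directly as the two-sided binomial tail $2\,P[B \le r_1 - 1]$ with $B$ binomial $(n_1, \tfrac{1}{2})$, which follows from the probability $2^{-n_1}\binom{n_1}{r}$ stated above; this tail equals the incomplete Beta-function with the argument order indicated in footnote 11 of Part III. Distinct x values are assumed, as guaranteed by Condition II.

```python
from math import comb


def incomplete_level(n1, r1):
    """Level 2 * P[B <= r1 - 1] for B ~ Binomial(n1, 1/2)."""
    return 2 * sum(comb(n1, k) for k in range(r1)) / 2 ** n1


def incomplete_interval(x, y, r1=1):
    """Return (lower, upper, level) for alpha_1 from the slopes Delta(i, n1 + i)."""
    pts = sorted(zip(x, y))            # arrange by increasing x (assumed distinct)
    n1 = len(pts) // 2
    if len(pts) % 2 == 1:
        del pts[n1]                    # omit the middle observation when n is odd
    if not 1 <= r1 <= (n1 + 1) // 2:
        raise ValueError("r1 must satisfy 1 <= r1 <= (n1 + 1) / 2")
    slopes = sorted(
        (pts[n1 + i][1] - pts[i][1]) / (pts[n1 + i][0] - pts[i][0])
        for i in range(n1)
    )
    # r1-th smallest and r1-th largest slope bound the interval
    return slopes[r1 - 1], slopes[n1 - r1], incomplete_level(n1, r1)
```

For example, with 12 points and $r_1 = 1$ the interval runs from the smallest to the largest of the six slopes and the level is $2 \cdot 2^{-6} = 0.03125$.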

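2.3 INCOMPLETE METHOD: CONFIDENCE REGION FOR $\alpha_0$ AND $\alpha_1$.

If the median of $\mathbf{z}_i$ $(i = 1, \ldots, n)$ is numerically known, a confidence region for $\alpha_0$ and $\alpha_1$ can be found. We suppose that the following condition is satisfied:
Condition IV: the median of each $\mathbf{z}_i$ $(i = 1, \ldots, n)$ is zero:

$$P[\mathbf{z}_i < 0] = P[\mathbf{z}_i > 0] = \tfrac{1}{2}.$$

For any value of $\alpha_1$ we can arrange the n quantities $z_i(\alpha_1) = y_i - \alpha_1 x_i$ according to increasing magnitude:

$$Z_1(\alpha_1) < Z_2(\alpha_1) < \ldots < Z_n(\alpha_1).$$

Under the condition that $\alpha_1$ has the value used in this arrangement, we can state that

$$P[Z_{r_0}(\alpha_1) < \alpha_0 < Z_{n-r_0+1}(\alpha_1)] = 1 - \varepsilon_0.$$

On the other hand, if we write $I_1$ for the interval $(\Delta_{r_1}, \Delta_{n_1-r_1+1})$, we can state:

$$P[\alpha_1 \in I_1] = 1 - \varepsilon_1.$$

If we denote by $I_0$ the interval bounded by the lowest of the values $Z_{r_0}(\alpha_1)$ if $\alpha_1$ varies through $I_1$ and by the largest of the values $Z_{n-r_0+1}(\alpha_1)$ if $\alpha_1$ varies through $I_1$ we have

$$P[\alpha_0 \in I_0 \text{ and } \alpha_1 \in I_1] \geq (1 - \varepsilon_0)(1 - \varepsilon_1).$$

So we have proved:
Theorem 2: under conditions I, II, III and IV a rectangular confidence region in the $\alpha_0, \alpha_1$-plane is given by the intervals $\alpha_0 \in I_0$ and $\alpha_1 \in I_1$, the level of significance being $\leq \varepsilon_0 + \varepsilon_1 - \varepsilon_0\varepsilon_1$.
If all observed points $(x_i, y_i)$ obey the inequality $x_i \geq 0$, all quantities $y_i - \alpha_1 x_i$ are decreasing functions of $\alpha_1$. It follows that $I_0$ is bounded by $Z_{r_0}(\Delta_{n_1-r_1+1})$ and by $Z_{n-r_0+1}(\Delta_{r_1})$. The converse holds if every point satisfies the inequality $x_i \leq 0$.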
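The remark on points with non-negative abscissae gives a direct recipe for the interval for $\alpha_0$ once the interval for $\alpha_1$ is known. The sketch below is a minimal illustration under that assumption (all $x_i \geq 0$); the function names and the rank parameter `r0` are hypothetical, and the level of the order-statistic interval is computed as a two-sided binomial tail, which is an assumption consistent with the argument used for $\alpha_1$.

```python
from math import comb


def alpha0_interval(x, y, alpha1_lo, alpha1_hi, r0=1):
    """Bounds of the interval for alpha_0 when every x_i >= 0, plus its level."""
    if min(x) < 0:
        raise ValueError("this shortcut assumes x_i >= 0 for all i")
    n = len(x)
    # Each y_i - a * x_i decreases in a, so the lower bound is attained at the
    # upper end of the alpha_1 interval and the upper bound at its lower end.
    z_hi = sorted(yi - alpha1_hi * xi for xi, yi in zip(x, y))
    z_lo = sorted(yi - alpha1_lo * xi for xi, yi in zip(x, y))
    eps0 = 2 * sum(comb(n, k) for k in range(r0)) / 2 ** n
    return z_hi[r0 - 1], z_lo[n - r0], eps0


def joint_level_bound(eps0, eps1):
    """Upper bound eps_0 + eps_1 - eps_0 * eps_1 quoted in Theorem 2."""
    return eps0 + eps1 - eps0 * eps1
```

The rectangle is then the product of the two intervals, and `joint_level_bound` gives the bound on its level stated in Theorem 2.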

2.4 COMPLETE METHOD

We suppose that the conditions I, II and IIIa are satisfied and consider two arrangements of the points $(x_i, y_i)$: the arrangement according to increasing values of x and that according to $z = y - \alpha_0 - \alpha_1 x$.
The arrangement according to z is possible for any assumed value of $\alpha_1$. The hypothesis that this value is the true one is rejected if and only if there is a significant rank correlation between the arrangements.
Consider the statistics

$$\Delta(ij) = \frac{y_i - y_j}{x_i - x_j} = \alpha_1 + \frac{z_i - z_j}{x_i - x_j},$$

in which $i < j$, so that (if the ordering is according to x) $x_i < x_j$ and $\xi_i < \xi_j$. It follows that $\Delta(ij) > \alpha_1$ if and only if $z_i < z_j$.
Now, under the null hypothesis that the arrangements of the points according to x and according to z are independent, the distribution of Kendall's "rank correlation coefficient"

$$\tau = \frac{S}{\binom{n}{2}}$$

is known, in which S is the number of cases in which the ordering according to z is the same as the ordering according to x ($z_k < z_{k'}$ and $x_k < x_{k'}$) minus the number of cases in which the ordering according to z is the inverse as compared with the one according to x ($z_k > z_{k'}$ while $x_k < x_{k'}$).
For any value of $\alpha_1$ the number of cases $z_i > z_j$ can be found. Suppose this to be q; it will be clear that

$$S = \binom{n}{2} - 2q.$$

The distribution function of S for any value of n has been given by M.G. Kendall (see M.G. Kendall (1947), p. 403-407 and (1948), p. 55-62) by means of a recurrence formula. So the probability $P[q \mid n]$ that $q' \leq q$ cases $z_i > z_j$ are found can be determined. If this probability is below the level of significance chosen, we reject the hypothesis that $\alpha_1$ has the value used in the arrangement according to z.
Hence, if we arrange the statistics $\Delta(ij)$ in increasing order:

$$\Delta_1 < \Delta_2 < \ldots < \Delta_{\binom{n}{2}}$$

we find by symmetry

$$P[\Delta_q < \alpha_1 < \Delta_{\binom{n}{2}-q+1}] = 1 - 2P[q-1 \mid n],$$

so that we have proved:

Theorem 3: under conditions I, II and IIIa a confidence interval for $\alpha_1$ is given by the largest but $(q-1)$ and the smallest but $(q-1)$ among the values $\Delta(ij)$, the level of significance being $2P[q-1 \mid n]$.
The method of 2.3 can be applied here to find a simultaneous confidence region for $\alpha_0$ and $\alpha_1$, $I_1$ now being the interval $(\Delta_q, \Delta_{\binom{n}{2}-q+1})$.
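The complete method can be sketched as follows. The function names are hypothetical, and the recurrence used for the number of arrangements with a given number of inversions is the standard one for counting inversions in permutations; it is assumed here to play the role of the recurrence formula attributed to Kendall in the text. Distinct x values and the absence of ties among the slopes are assumed.

```python
from itertools import combinations
from math import factorial


def inversion_counts(n):
    """counts[k] = number of orderings of n items with exactly k inversions."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for j in range(m):         # the m-th item can add 0..m-1 inversions
                new[k + j] += c
        counts = new
    return counts


def p_at_most(q, n):
    """P[q | n]: probability of at most q inversions among n points."""
    if q < 0:
        return 0.0
    return sum(inversion_counts(n)[: q + 1]) / factorial(n)


def complete_interval(x, y, level=0.05):
    """Return (lower, upper, attained level) from all pairwise slopes Delta(ij)."""
    pts = sorted(zip(x, y))
    n = len(pts)
    slopes = sorted(
        (yj - yi) / (xj - xi) for (xi, yi), (xj, yj) in combinations(pts, 2)
    )
    total = len(slopes)
    q = 0
    # Increase q while the level 2 P[q | n] of rank q + 1 stays within bounds.
    while 2 * (q + 1) <= total and 2 * p_at_most(q, n) <= level:
        q += 1
    if q == 0:
        raise ValueError("too few points for the requested level")
    return slopes[q - 1], slopes[total - q], 2 * p_at_most(q - 1, n)
```

The search picks the largest q whose level $2P[q-1 \mid n]$ does not exceed the requested one, which gives the shortest interval available at that level.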

2.5 A COMPARISON

The second method may be called a "complete method," because all statistics $\Delta(ij)$ are used. It requires only 5 points in order to reach the level of significance 0.05 whereas the incomplete method needs 12 points. However, if the number of points is large, the computational labor of the complete method is considerably greater than that of the incomplete method. Moreover, the conditions under which the complete method is valid are more stringent; the fact that the set of conditions I, II and III is sufficient for the incomplete method is important in view of the general occurrence of "heteroscedastic" distributions, i.e. distributions in which the variance (if finite) of $\eta$ is larger for higher values of $\xi$ than for lower ones if $\alpha_1 > 0$ and conversely if $\alpha_1 < 0$.
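The figures of 5 and 12 points can be checked by a short calculation. With the widest possible interval ($r_1 = 1$ or $q = 1$) the incomplete method has level $2 \cdot 2^{-n_1}$ and the complete method has level $2/n!$, since only one of the $n!$ equally likely arrangements has no inversion. The helper names below are hypothetical.

```python
from math import factorial


def min_points_incomplete(level):
    """Smallest n = 2 * n1 with 2 * 2**(-n1) <= level (case r1 = 1)."""
    n1 = 1
    while 2 / 2 ** n1 > level:
        n1 += 1
    return 2 * n1


def min_points_complete(level):
    """Smallest n with 2 / n! <= level (case q = 1)."""
    n = 2
    while 2 / factorial(n) > level:
        n += 1
    return n


print(min_points_incomplete(0.05))  # 12
print(min_points_complete(0.05))    # 5
```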

2.6 TESTING LINEARITY

Suppose that the set of conditions I, II and IIIa is valid. Then the hypothesis that the regression curve for two variables is linear can be tested against the alternative composite hypothesis that it is either positive- or negative-convex $^3$, i.e. in the set of equations (1), (2), (3), (4) the equation $\theta_i = \alpha_0 + \alpha_1\xi_i$ is tested against any equation $\theta = \theta(\xi)$ with either

$$\frac{d^2\theta}{d\xi^2} > 0 \text{ for all } \xi$$

or

$$\frac{d^2\theta}{d\xi^2} < 0 \text{ for all } \xi,$$

the equations (2), (3), (4) remaining unchanged.

Consider the $n_1$ statistics

$$\Delta(1, n_1+1), \Delta(2, n_1+2), \ldots, \Delta(n_1, 2n_1)$$

in this arrangement. If this ordering has a significant rank correlation with the ordering of these statistics according to increasing magnitude, we reject the hypothesis that the regression curve is linear.
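A minimal sketch of this linearity test is given below. It counts the inversions among the statistics $\Delta(i, n_1+i)$ taken in the order of i and refers the count to the exact distribution of inversions among $n_1$ items. The function names, the two-sided p-value convention and the treatment of an odd number of points are assumptions of the sketch; ties among the statistics are not handled.

```python
from math import factorial


def inversion_counts(n):
    """counts[k] = number of orderings of n items with exactly k inversions."""
    counts = [1]
    for m in range(2, n + 1):
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for j in range(m):
                new[k + j] += c
        counts = new
    return counts


def linearity_test(x, y):
    """Return (number of inversions, two-sided exact p-value)."""
    pts = sorted(zip(x, y))
    n1 = len(pts) // 2
    if len(pts) % 2 == 1:
        del pts[n1]                    # omit the middle observation when n is odd
    deltas = [
        (pts[n1 + i][1] - pts[i][1]) / (pts[n1 + i][0] - pts[i][0])
        for i in range(n1)
    ]
    q = sum(
        1 for a in range(n1) for b in range(a + 1, n1) if deltas[a] > deltas[b]
    )
    counts = inversion_counts(n1)
    lower_tail = sum(counts[: q + 1])  # orderings with at most q inversions
    upper_tail = sum(counts[q:])       # orderings with at least q inversions
    p_value = min(1.0, 2 * min(lower_tail, upper_tail) / factorial(n1))
    return q, p_value
```

For a strictly convex curve observed without error the statistics increase with i, so no inversion occurs and the p-value takes its smallest attainable value $2/n_1!$.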

References

Bartlett, M.S.: 1949, "Fitting a Straight Line when Both Variables are Subject to
Error," Biometrics, 5, 207-212.

Dantzig, D. van: 1947, Capita Selecta der Waarschijnlijkheidsrekening, caput II, (stenciled).

Housner, G.W., and J.F. Brennan: 1948, "The Estimation of Linear Trends," Annals of
Mathematical Statistics, 19, 380-388.

Kendall, M.G.: 1947, The Advanced Theory of Statistics, London, 1, 3rd edition.

Kendall, M.G.: 1948, Rank Correlation Methods, London.

Nair, K.R., and K.S. Banerjee: 1942, "A Note on Fitting of Straight Lines if Both
Variables are Subject to Error," Sankhya, 6, 331.

3 A function $f(x)$ is positive-convex (cf. e.g. D. van Dantzig (1947), 93-94) in an interval if for every $x_1$ and $x_2$ of this interval and for every real positive number $a < 1$ the following inequality is satisfied: $a f(x_1) + (1-a) f(x_2) > f(a x_1 + (1-a) x_2)$.
Nair, K.R., and M.P. Shrivastava: 1942, "On a Simple Method of Curve Fitting,"
Sankhya, 6, 121-132.

Wald, A.: 1940, "The Fitting of Straight Lines if Both Variables are Subject to Error,"
Annals of Mathematical Statistics, 11, 284-300.
PART II

3. Confidence Regions for the Parameters of Linear Regression Equations in Three and More Variables

3.1 THE PROBABILITY SET

The probability set $\Gamma$ underlying the probability statements of this section is the $n(\nu+2)$-dimensional Cartesian space $R^{n(\nu+2)}$ with coordinates

$$u_{11}, \ldots, u_{\nu n}, v_1, \ldots, v_n, w_1, \ldots, w_n.$$

Every random variable will be supposed to be defined on this probability set.

In the first place we consider $n(\nu+2)$ random variables $u_{\lambda i}, v_i, w_i$ $(\lambda = 1, \ldots, \nu;\ i = 1, \ldots, n)$. Furthermore we consider $(n+1)\nu+1$ parameters $\alpha_0, \alpha_\lambda, \xi_{\lambda i}$ $(i = 1, \ldots, n;\ \lambda = 1, \ldots, \nu)$ and put:

$$\theta_i = \alpha_0 + \sum_{\lambda=1}^{\nu} \alpha_\lambda \xi_{\lambda i} \qquad (5)$$
$$\eta_i = \theta_i + w_i \qquad (6)$$
$$x_{\lambda i} = \xi_{\lambda i} + u_{\lambda i} \qquad (7)$$
$$y_i = \eta_i + v_i \qquad (8)$$

$(i = 1, \ldots, n;\ \lambda = 1, \ldots, \nu)$

So the variables $x_{\lambda i}$ and $y_i$ have a simultaneous distribution on $\Gamma$, and are therefore random variables.
We call $\xi_{\lambda i}$ the parameter values of the variable $\xi_\lambda$. The equation (5) is the multiple regression equation. The random variables $w_i$ are called "the true deviations from linearity," while the random variables $u_{\lambda i}$ and $v_i$ are called "the errors of observation" of the values $\xi_{\lambda i}$ and $\eta_i$ respectively.
Putting

$$z_i = -\sum_{\lambda=1}^{\nu} \alpha_\lambda u_{\lambda i} + v_i + w_i$$

we have

$$y_i = \alpha_0 + \sum_{\lambda=1}^{\nu} \alpha_\lambda x_{\lambda i} + z_i,$$

the random variables $z_i$ being called "the apparent deviations from linearity."

4 This paper is the second of a series of papers, the first of which appeared in the Proceedings of the Royal Netherlands Academy of Sciences, 53, 386-392 (1950).

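3.2 CONFIDENCE REGIONS FOR $\alpha_0, \alpha_1, \ldots, \alpha_\nu$.

In order to give confidence regions for the $(\nu+1)$ parameters $\alpha_0, \alpha_\lambda$ $(\lambda = 1, \ldots, \nu)$ we impose the following conditions:
Condition I: The n $(\nu+2)$-tuples $(u_{1i}, \ldots, u_{\nu i}, v_i, w_i)$ are stochastically independent.
Condition II: 1. Each of the errors $u_{\lambda i}$ vanishes outside a finite interval $|u_{\lambda i}| \leq g_{\lambda i}$.
2. For each $i \neq j$ we have $|\xi_{\lambda i} - \xi_{\lambda j}| > g_{\lambda i} + g_{\lambda j}$.
Furthermore we impose for the incomplete method to be mentioned:
Condition III:

$$P[z_i < z_j] = P[z_i > z_j] = \tfrac{1}{2} \text{ for } i \neq j$$

and for the complete method:

Condition IIIa: Each $z_i$ has the same continuous distribution function.
Secondly we define the following quantities:

$$G^{(\lambda')}(i) = y_i - \sum_{\substack{\lambda=1 \\ \lambda \neq \lambda'}}^{\nu} \alpha_\lambda x_{\lambda i} \qquad (\lambda' = 1, \ldots, \nu;\ i = 1, \ldots, n).$$

Furthermore, after arranging the n observed points $(y_i, x_{1i}, \ldots, x_{\nu i})$ according to increasing values of $x_{\lambda'}$ (which, by condition II, is identical with the arrangement according to increasing values of $\xi_{\lambda'}$):

$$x_{\lambda' 1} < x_{\lambda' 2} < \ldots < x_{\lambda' n}$$

we define the quantities

$$K^{(\lambda')}(ij) = \frac{y_i - y_j}{x_{\lambda' i} - x_{\lambda' j}} - \sum_{\substack{\lambda=1 \\ \lambda \neq \lambda'}}^{\nu} \alpha_\lambda \frac{x_{\lambda i} - x_{\lambda j}}{x_{\lambda' i} - x_{\lambda' j}} = \alpha_{\lambda'} + \frac{z_i - z_j}{x_{\lambda' i} - x_{\lambda' j}} \qquad (i = 1, \ldots, n-1;\ j = i+1, \ldots, n).$$

For any set of values $\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu$ we arrange the quantities $K^{(\lambda')}(ij)$ according to increasing magnitude; we define $K_i^{(\lambda')}$ as the quantity with rank i in this arrangement:

$$K_1^{(\lambda')} < K_2^{(\lambda')} < \ldots < K_{\binom{n}{2}}^{(\lambda')}.$$

Finally we define the intervals $I_{\lambda'}(\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu)$ as the intervals

$$\left(K_q^{(\lambda')},\ K_{\binom{n}{2}-q+1}^{(\lambda')}\right)$$

with $2q \leq \binom{n}{2}$; $A_{\lambda'}$ as the union of all intervals

$$I_{\lambda'}(\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu)$$

and A as the common part of all $A_{\lambda'}$ $(\lambda' = 1, \ldots, \nu)$.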

We have the following theorem concerning the complete method for three and more variables:
Theorem 4: Under conditions I, II and IIIa the region A is a confidence region for the parameters $\alpha_1, \ldots, \alpha_\nu$, the level of significance being $\leq 2\nu P[q-1 \mid n]$. $^5$
Proof: If the set of assumed parameter values $\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu$ is the "true set", it follows from the analysis in section 2.4 that $I_{\lambda'}(\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu)$ is a confidence interval for $\alpha_{\lambda'}$, to the level of significance $2P[q-1 \mid n]$. Hence it follows that if $(\alpha_1, \ldots, \alpha_\nu)$ represents the "true" point in the $\alpha_1, \ldots, \alpha_\nu$-space, we have

$$P[(\alpha_1, \ldots, \alpha_\nu) \in A_{\lambda'}] = 1 - 2P[q-1 \mid n], \qquad (\lambda' = 1, \ldots, \nu),$$

which proves the theorem.

5 For the definition of $P[q-1 \mid n]$ the reader is referred to section 2.4 (Part I of this paper).
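Since the region of Theorem 4 is defined implicitly, the most direct computational use is a membership test for a candidate parameter point. The sketch below follows the definitions of $G^{(\lambda')}(i)$ and $K^{(\lambda')}(ij)$ given above; the function names are hypothetical, the intervals are treated as closed for numerical convenience, and reading the final region as the part common to all $\nu$ sets $A_{\lambda'}$ is an assumption of this sketch, chosen because the bound $2\nu P[q-1 \mid n]$ corresponds to that reading.

```python
from itertools import combinations


def in_slab(rows, y, alpha, lam, q=1):
    """True if alpha[lam] lies in the interval built from the other coordinates.

    rows[i] holds (x_1i, ..., x_vi); lam is the 0-based index of lambda'.
    """
    g = [
        yi - sum(a * xv for k, (a, xv) in enumerate(zip(alpha, row)) if k != lam)
        for row, yi in zip(rows, y)
    ]
    pts = sorted(zip((row[lam] for row in rows), g))   # order by x_lambda'
    k_vals = sorted(
        (gj - gi) / (xj - xi) for (xi, gi), (xj, gj) in combinations(pts, 2)
    )
    return k_vals[q - 1] <= alpha[lam] <= k_vals[len(k_vals) - q]


def in_region(rows, y, alpha, q=1):
    """True if the candidate point passes the test for every lambda'."""
    return all(in_slab(rows, y, alpha, lam, q) for lam in range(len(alpha)))
```

Scanning a grid of candidate points with `in_region` gives a picture of the confidence region without solving for its boundary explicitly.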


If condition III (but not necessarily IIIa) is fulfilled, the method mentioned above can be replaced by the following one. We replace the quantities

$$K^{(\lambda')}(ij) \qquad (\lambda' = 1, \ldots, \nu;\ i = 1, \ldots, n-1;\ j = i+1, \ldots, n)$$

by

$$K^{(\lambda')}(i, n_1+i) \qquad (\lambda' = 1, \ldots, \nu;\ i = 1, \ldots, n_1).\ ^6$$

The intervals $I'_{\lambda'}(\alpha_1, \ldots, \alpha_{\lambda'-1}, \alpha_{\lambda'+1}, \ldots, \alpha_\nu)$ are now defined as the intervals bounded by the values of $K^{(\lambda')}(i, n_1+i)$ with rank $r_1$ and $(n_1 - r_1 + 1)$ respectively, if they are arranged in ascending order; whereas the definitions of $A'_{\lambda'}$ as the union of all $I'_{\lambda'}$ and of $A'$ as the common part of all $A'_{\lambda'}$ remain unchanged. The following theorem of the incomplete method for three and more variables will now be obvious from the analysis of section 2.2:
Theorem 5. Under conditions I, II and III the region $A'$ is a confidence region for the parameters $\alpha_1, \ldots, \alpha_\nu$, the level of significance being $\leq 2\nu I_{1/2}(r_1, n_1 - r_1 + 1)$.
A confidence region for the parameters $\alpha_0, \alpha_1, \ldots, \alpha_\nu$ can be constructed, if the median of $z_i$ is known, e.g. if the following condition is fulfilled:
Condition IV: The median of each $z_i$ is zero.
The method for the construction of this confidence region is analogous to the one given in section 2.3.

3.3 AN ILLUSTRATION FOR THE SPECIAL CASE $\nu = 2$.

The form of the region $A_\lambda$ or $A'_\lambda$ will now be indicated for the case of three variables:

$$y_i = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + z_i.$$

Using the incomplete method we find $n_1$ functions of $\alpha_2$:

$$K^{(1)}(i, n_1+i) = \frac{y_i - y_{n_1+i}}{x_{1i} - x_{1,n_1+i}} - \alpha_2 \frac{x_{2i} - x_{2,n_1+i}}{x_{1i} - x_{1,n_1+i}} \qquad (i = 1, \ldots, n_1)$$

which are estimates of $\alpha_1$, given $\alpha_2$. They are represented by straight lines in the $\alpha_1, \alpha_2$-plane. For any value of $\alpha_2$ we can arrange these quantities in ascending order. As long as (under continuous variation of $\alpha_2$) the numbers $i_1$ and $i_2$ for which the statistics $K^{(1)}(i_1, n_1+i_1)$ and $K^{(1)}(i_2, n_1+i_2)$ have the $r_1$-th and $(n_1-r_1+1)$-th rank according to increasing order (with $r_1$ as defined in section 2.2) remain constant, the extreme points of the confidence intervals vary along straight lines. If, when passing some value $\alpha_2^*$ of $\alpha_2$, either $i_1$ or $i_2$ changes, the corresponding straight line passes into another one, intersecting the first one in a point with $\alpha_2 = \alpha_2^*$.
So a diagram can be constructed, in which the $n_1$ straight lines are drawn in the $\alpha_1, \alpha_2$-plane. This gives the stochastic region $A'_1$ depending on the given observations and bounded to the left and to the right by broken lines.
According to the analysis leading to Theorem 5 it contains the true point $(\alpha_1, \alpha_2)$ with the probability

$$1 - 2 I_{1/2}(r_1, n_1 - r_1 + 1).$$

The region $A'_2$, bounded above and below, can be constructed in a similar way; then the observed points must be arranged in ascending order of $x_2$.

6 $n_1 = \tfrac{1}{2}n$. Cf. section 2.2.
PART III
4. Confidence Regions for the Parameters of Polynomial Regression Equations

4.1 THE PROBABILITY SET

The probability set $\Gamma$ underlying the probability statements of this section is the $n(\nu+2)$-dimensional Cartesian space $R^{n(\nu+2)}$ with coordinates

$$u_{11}, \ldots, u_{\nu n}, v_1, \ldots, v_n, w_1, \ldots, w_n.$$

Every random variable mentioned is supposed to be defined on this probability set.

We suppose $n(\nu+2)$ random variables $u_{\lambda i}, v_i, w_i$ $(\lambda = 1, \ldots, \nu;\ i = 1, \ldots, n)$ to have a simultaneous probability distribution on $\Gamma$. Furthermore we consider $n\nu$ parameters $\xi_{\lambda i}$ and N parameters $\alpha_{p_1 \ldots p_\nu}$ for all sets of non-negative integers $p_1, \ldots, p_\nu$ satisfying

$$0 \leq \sum_{\lambda=1}^{\nu} p_\lambda \leq h.$$

Now we put $^8$

$$\theta_i = \sum \alpha_{p_1 \ldots p_\nu}\, \xi_{1i}^{p_1} \cdots \xi_{\nu i}^{p_\nu} \qquad (10)$$
$$\eta_i = \theta_i + w_i \qquad (11)$$
$$x_{\lambda i} = \xi_{\lambda i} + u_{\lambda i} \qquad (12)$$
$$y_i = \eta_i + v_i \qquad (13)$$

$(i = 1, \ldots, n;\ \lambda = 1, \ldots, \nu)$

So, for any set of values of the $(N + n\nu)$ parameters $\alpha_{p_1 \ldots p_\nu}$, $\xi_{\lambda i}$, the variables $x_{\lambda i}$ and $y_i$ have a simultaneous distribution on $\Gamma$, and are therefore random variables.
The parameters $\xi_{\lambda i}$ $(i = 1, \ldots, n)$ are interpreted as values assumed by the variable $\xi_\lambda$. The equation (10) is the polynomial regression equation. The random variables $w_i$ are called "the true deviations" from the polynomial of degree h; the random variables $u_{\lambda i}$ and $v_i$ are called "the errors of observation" of the "true" values $\xi_{\lambda i}$ and $\eta_i$ respectively. $^9$

7 This paper is the third of a series of papers, the first of which appeared in the Proceedings of the Royal Netherlands Academy of Sciences, 53, 386-392 (1950); the second appeared in these Proceedings, 53, 521-525 (1950).
8 $\sum$ in equation (10) denotes summation over all sets $p_1, \ldots, p_\nu$.

4.2 CONDITIONS; APPROXIMATION

In order to give confidence regions for the parameters $\alpha_{p_1 \ldots p_\nu}$ we consider the following conditions:
Condition I: All n $(\nu+2)$-tuples $(u_{1i}, \ldots, u_{\nu i}, v_i, w_i)$ are stochastically independent.
Condition IIa: 1. Each of the errors $u_{\lambda i}$ vanishes outside a finite interval $|u_{\lambda i}| < g_{\lambda i}$.
2. For each $i \neq j$ we have $|\xi_{\lambda i} - \xi_{\lambda j}| > g_{\lambda i} + g_{\lambda j}$.
Condition IIb: 1. Each of the errors $u_{\lambda i}$ vanishes outside a finite interval $|u_{\lambda i}| < g_{\lambda i}$.
2. For each $i \neq j$, for each set $p_1, \ldots, p_\nu$ and for any real $h_{\lambda i}$ such that $|h_{\lambda i}| \leq g_{\lambda i}$ we have

Condition III: For all fixed values of the constants $\rho_{\lambda i}$ the n random variables $\sum_{\lambda=1}^{\nu} \rho_{\lambda i} u_{\lambda i} + v_i + w_i \equiv z_i$ have continuous distribution functions, which are symmetrical with the median med(z).
Finally we mention that the solution will be given subject to the following
Approximation: For any positive s the quantities

$$u_{\lambda i}^{s}\, u_{\lambda' i} \qquad (\lambda, \lambda' = 1, \ldots, \nu;\ i = 1, \ldots, n)$$

are neglected. $^{10}$

9 It is clear that the random variables $v_i$ and $w_i$ cannot be separated in one sample of observations; if, however, the experiment is repeated for the same "true" values $\xi_{\lambda i}$, $\eta_i$ (e.g. if, when the relation between income and consumption is investigated, for the same families and the same period the amounts of their incomes and outlays are repeatedly calculated), then the errors $v_i$ can be mitigated by averaging, whereas the deviations $w_i$ cannot.

4.3 CONFIDENCE REGIONS

We consider the case $\nu = 1$, so that equation (10) can be written as

$$\theta_i = \sum_{p=0}^{h} \alpha_p \xi_i^p.$$

Let us arrange the n observed points $(x_i, y_i)$ according to increasing values of x:

$$x_1 < x_2 < \ldots < x_n.$$

We leave 0, 1, ... or h points out of consideration until the remaining number $n'$ is such that $n'/(h+1)$ is an integer, and write $n_h = n'/(h+1)$. (It seems advisable with respect to the power of the method to omit the points with rank $n_h + 1$, $2n_h + 1$, ... and/or $hn_h + 1$.) From now on we write n for the remaining number $n'$, so that $(h+1)n_h = n$.
We define the following quantities:

$$\Delta^{(1)}(i, n_h+i) = \frac{y_i - y_{n_h+i}}{x_i - x_{n_h+i}},$$

$$\Delta^{(c)}(i, n_h+i, \ldots, cn_h+i) = \frac{\Delta^{(c-1)}(i, \ldots, (c-1)n_h+i) - \Delta^{(c-1)}(n_h+i, \ldots, cn_h+i)}{x_i - x_{cn_h+i}} \qquad (c = 2, \ldots, h;\ i = 1, \ldots, n_h).$$

We arrange the observed quantities $\Delta^{(h)}(i, \ldots, hn_h+i)$ according to increasing magnitude:

$$\Delta_1^{(h)} < \ldots < \Delta_{n_h}^{(h)},$$

in which

$$\Delta_j^{(h)} = \Delta^{(h)}(i_j, \ldots, hn_h + i_j).$$

Then we have the following theorem:

Theorem 6: Under conditions I, IIa, and III the interval $(\Delta^{(h)}_{r_h}, \Delta^{(h)}_{n_h - r_h + 1})$ is a confidence interval for $\alpha_h$ to the approximate level of significance $2 I_{1/2}(n_h - r_h + 1, r_h)$. $^{11}$

10 The approximation implies that the errors $u_{\lambda i}$ are sufficiently small. This restriction is not very serious, because, unless the number of points n is very large, large values of $u_{\lambda i}$ will cause the confidence region for the parameters of the polynomial to be so large as to render the method useless.
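The statistics $\Delta^{(h)}$ behave like divided differences of order h taken over one point from each of the $h+1$ groups, which is what the lemma below exploits. The following minimal sketch rests on that reading; the function names are hypothetical, the level is computed as a two-sided binomial tail (equal to $2 I_{1/2}(n_h - r_h + 1, r_h)$), and, as a simplification, surplus points are dropped from the upper end of the x range rather than at the ranks suggested in the text.

```python
from math import comb


def divided_difference(xs, ys):
    """Divided difference of order len(xs) - 1 through the given points."""
    vals = list(ys)
    for order in range(1, len(xs)):
        vals = [
            (vals[k] - vals[k + 1]) / (xs[k] - xs[k + order])
            for k in range(len(vals) - 1)
        ]
    return vals[0]


def leading_coefficient_interval(x, y, h, r=1):
    """Return (lower, upper, level) for the coefficient of x**h."""
    pts = sorted(zip(x, y))
    n_h = len(pts) // (h + 1)
    pts = pts[: n_h * (h + 1)]         # simplification: drop the largest x values
    stats = sorted(
        divided_difference(
            [pts[c * n_h + i][0] for c in range(h + 1)],
            [pts[c * n_h + i][1] for c in range(h + 1)],
        )
        for i in range(n_h)
    )
    level = 2 * sum(comb(n_h, k) for k in range(r)) / 2 ** n_h
    return stats[r - 1], stats[n_h - r], level
```

For $h = 1$ the statistic reduces to the slope $\Delta(i, n_1+i)$ of section 2.2, and for error-free data from a polynomial of degree h every statistic equals the leading coefficient exactly.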
In order to prove this theorem we shall use the following Lemma.
Define for all non-negative integers s and for all positive integers c and i

$$P^{s}_{i, n_h+i, \ldots, cn_h+i} = \sum x_i^{s_0}\, x_{n_h+i}^{s_1} \cdots x_{cn_h+i}^{s_c} \qquad \left(s_j \geq 0,\ \sum s_j = s\right).$$

Then we have

$$\frac{P^{s}_{i, \ldots, (c-1)n_h+i} - P^{s}_{n_h+i, \ldots, cn_h+i}}{x_i - x_{cn_h+i}} = P^{s-1}_{i, \ldots, cn_h+i}.$$

Proof of the lemma: We have

$$P^{s}_{i, \ldots, (c-1)n_h+i} - P^{s}_{n_h+i, \ldots, cn_h+i} = \sum x_{n_h+i}^{s_1} \cdots x_{(c-1)n_h+i}^{s_{c-1}} \left(x_i^{s_0} - x_{cn_h+i}^{s_0}\right),$$

in which $\sum s_j = s$. It follows that

$$\frac{P^{s}_{i, \ldots, (c-1)n_h+i} - P^{s}_{n_h+i, \ldots, cn_h+i}}{x_i - x_{cn_h+i}} = \sum x_{n_h+i}^{s_1} \cdots x_{(c-1)n_h+i}^{s_{c-1}} \left(x_i^{s_0-1} + x_i^{s_0-2} x_{cn_h+i} + \ldots + x_{cn_h+i}^{s_0-1}\right) = P^{s-1}_{i, \ldots, cn_h+i}.$$

11 In the first and second part of this paper the arguments of the incomplete Beta-function must be reversed.
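The identity of the lemma can be checked numerically with exact rational arithmetic. In the sketch below the sum of all monomials of total degree s is built by enumerating multisets of the variables; the function names are hypothetical and the check is an illustration of the lemma, not part of its proof.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import prod


def p_sum(values, s):
    """Sum of all monomials of total degree s in the given values (0 if s < 0)."""
    if s < 0:
        return Fraction(0)
    return sum(
        (
            prod(combo, start=Fraction(1))
            for combo in combinations_with_replacement(values, s)
        ),
        Fraction(0),
    )


def lemma_holds(values, s):
    """Check the lemma for the points in `values` (first and last must differ)."""
    values = [Fraction(v) for v in values]
    lhs = (p_sum(values[:-1], s) - p_sum(values[1:], s)) / (values[0] - values[-1])
    return lhs == p_sum(values, s - 1)
```

For instance `lemma_holds([1, 2, 4, 7], 3)` compares the difference quotient of the two sums over three points with the sum of degree 2 over all four points.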

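Proof of Theorem 6: The relation between $x_i$ and $y_i$ is given by

$$y_i = \sum_{p=0}^{h} \alpha_p (x_i - u_i)^p + v_i + w_i \approx \sum_{p=0}^{h} \alpha_p x_i^p - u_i\left(\alpha_1 + 2\alpha_2\xi_i + \ldots + h\alpha_h\xi_i^{h-1}\right) + v_i + w_i$$

in which we neglected (in accordance with the Approximation) $u_i^s$ for $s > 1$. Putting $z_i = \rho_i u_i + v_i + w_i$, in which

$$\rho_i = -\left(\alpha_1 + 2\alpha_2\xi_i + \ldots + h\alpha_h\xi_i^{h-1}\right),$$

we get

$$y_i \approx \sum_{p=0}^{h} \alpha_p x_i^p + z_i.$$

Now we have according to the lemma:

$$\Delta^{(1)}(i, n_h+i) \approx \sum_{p=1}^{h} \alpha_p P^{p-1}_{i, n_h+i} + \frac{z_i - z_{n_h+i}}{x_i - x_{n_h+i}},$$
$$\ldots$$
$$\Delta^{(h)}(i, n_h+i, \ldots, hn_h+i) \approx \alpha_h + Z_i,$$

in which $Z_i$ is a random variable depending on $z_i, z_{n_h+i}, \ldots, z_{hn_h+i}$.

$Z_i$ can be written as a fraction, the denominator being a product of terms $(x_{cn_h+i} - x_{c'n_h+i})$ $(c, c' = 0, \ldots, h;\ c \neq c')$; according to condition IIa this denominator has a definite sign. The numerator consists of a sum of terms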

But to our order of approximation this is equal to

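so that the numerator can be written as

$$\sum_{c=0}^{h} \tau_c z_{cn_h+i} \quad \text{with} \quad \sum_{c=0}^{h} \tau_c = 0.$$

It follows from condition III that this quantity has zero median. From this and from the above-mentioned property of the denominator it follows that

$$P[\Delta^{(h)}(i, n_h+i, \ldots, hn_h+i) < \alpha_h] = P[\Delta^{(h)}(i, n_h+i, \ldots, hn_h+i) > \alpha_h] = \tfrac{1}{2}.$$

From this and from condition I the theorem immediately follows.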

4.4 ADDITIONAL REMARKS

If $\alpha_h$ is known, a confidence interval for $\alpha_{h-1}$ can be found. Consider the equation

$$y_i - \alpha_h x_i^h \approx \sum_{p=0}^{h-1} \alpha_p x_i^p + z_i,$$

which shows that the problem is reduced to the case of a polynomial of degree $(h-1)$. So, if a confidence interval for $\alpha_h$ is given, a confidence region for $\alpha_h$ and $\alpha_{h-1}$ can be found. This can be generalized to an $(h+1)$-dimensional confidence region for the parameters $\alpha_0, \ldots, \alpha_h$ in a way analogous to the one described in 2.2 and 2.3.
If $\nu > 1$, an N-dimensional confidence region for the parameters $\alpha_{p_1 \ldots p_\nu}$ can be found in the following way:
1. Given the other parameters, a confidence region for the parameters $\alpha_{p_1 0 \ldots 0}$ $(p_1 = 0, \ldots, h)$ in the N-dimensional parameter space can be constructed, the level of significance being $\varepsilon_1$ (cf. 3.5.).
2. In the same way one can proceed with the parameters

$$\alpha_{0 p_2 0 \ldots 0}, \ldots, \alpha_{0 \ldots 0 p_\nu},$$

the levels of significance being $\varepsilon_2, \ldots, \varepsilon_\nu$.

3. Finally the parameters $\alpha_{p_1 \ldots p_\nu}$ which have at least two indices $p_\lambda \neq 0$. We suppose that (apart from the conditions I and III) condition IIb is valid, which is a more stringent condition than condition IIa. Consider the equation

in which $\sum'$ denotes the summation over all sets $p_1, \ldots, p_\nu$ except the set $p_1', \ldots, p_\nu'$. We can then state regarding the quantities

that

$$P[s_{ij} < \alpha_{p_1' \ldots p_\nu'} \mid \alpha_{p_1 \ldots p_\nu}] = P[s_{ij} > \alpha_{p_1' \ldots p_\nu'} \mid \alpha_{p_1 \ldots p_\nu}] = \tfrac{1}{2},$$

so that in a well-known way confidence regions for each of the parameters $\alpha_{p_1' \ldots p_\nu'}$ can be found with levels of significance

The common part of the $N-(h-1)\nu-1$ regions is a confidence region for the "true parameter point" in the N-dimensional parameter space, the level of significance being

$$\leq \sum_{q=1}^{N-(h-1)\nu-1} \varepsilon_q.$$
5. Confidence Regions for the Parameters of Systems of Regression Equations

In recent years considerable work has been done on the subject of systems of regression equations (see e.g. T. Haavelmo (1943, 1944), T. Koopmans (1945, 1950), R. Bentzel and H. Wold (1946), M.A. Girshick and T. Haavelmo (1947)). In this section we shall give a brief investigation into the application of the methods considered to this subject.

5.1 THE PROBABILITY SET

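Our probability set $\Gamma$ will be the $n(\nu+2\tau)$-dimensional Cartesian space $R^{n(\nu+2\tau)}$ with coordinates

$$u_{11}, \ldots, u_{\nu n},\ v_{11}, \ldots, v_{1n}, \ldots, v_{\tau 1}, \ldots, v_{\tau n},\ w_{11}, \ldots, w_{1n}, \ldots, w_{\tau 1}, \ldots, w_{\tau n}.$$

We suppose $n(\nu+2\tau)$ random variables $u_{\lambda i}, v_{\chi i}, w_{ti}$ $(i = 1, \ldots, n;\ \lambda = 1, \ldots, \nu;\ \chi, t = 1, \ldots, \tau)$ to have a simultaneous probability distribution on $\Gamma$. Furthermore we consider $n\nu + m\tau$ parameters $\xi_{\lambda i}, \alpha_{tj}$ $(i = 1, \ldots, n;\ j = 1, \ldots, m;\ \lambda = 1, \ldots, \nu;\ t = 1, \ldots, \tau)$.
Finally we consider the following equations

$$f_t(\xi_{1i}, \ldots, \xi_{\nu i}, \eta_{1i}, \ldots, \eta_{\tau i};\ \alpha_{t1}, \ldots, \alpha_{tm}) = w_{ti} \qquad (14)$$
$$x_{\lambda i} = \xi_{\lambda i} + u_{\lambda i} \qquad (15)$$
$$y_{\chi i} = \eta_{\chi i} + v_{\chi i} \qquad (16)$$

$(i = 1, \ldots, n;\ \lambda = 1, \ldots, \nu;\ \chi, t = 1, \ldots, \tau)$

The equations (14) are supposed to have a unique solution for $\eta_{\chi i}$ $(\chi = 1, \ldots, \tau;\ i = 1, \ldots, n)$ on every element of $\Gamma$, except possibly on a set of elements with zero probability.
The equations (14) are called the "stochastic regression equations." The parameters $\xi_{\lambda i}$ $(i = 1, \ldots, n)$ are interpreted as the values which the variable $\xi_\lambda$ assumes $(\lambda = 1, \ldots, \nu)$. The random variables $w_{ti}$ are called the "true deviations" in the stochastic regression equations. Finally, the random variables $u_{\lambda i}$ and $v_{\chi i}$ are called the "errors of observation" of the "true" values $\xi_{\lambda i}$ and $\eta_{\chi i}$ respectively.
The problem is again to determine confidence regions for the parameters $\alpha_{tj}$.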

5.2 CONFIDENCE REGIONS

We reduce the equations (14), (15) and (16) to the forms

all' ... , aim' ... , a~l' ... , a~m)' (x = 1, ... ,t)

Consider e.g. the case

$$f_t \equiv H_t(\xi_{1i}, \ldots, \xi_{\nu i}) + \sum_{\chi=1}^{\tau} \beta_{t\chi}\, \eta_{\chi i}, \qquad (t = 1, \ldots, \tau)$$

in which $\beta_{t\chi}$ are real numbers and $H_t$ are polynomials of degree h in the $\xi$'s. Suppose that the errors $u_{\lambda i}$ are sufficiently small in order that terms containing $u_{\lambda i}u_{\lambda' i}$ $(\lambda, \lambda' = 1, \ldots, \nu)$ can be neglected (cf. the Approximation of section 4.2.); then we have

$$H_t(x_{1i}, \ldots, x_{\nu i}) + \sum_{\chi=1}^{\tau} \beta_{t\chi}\, y_{\chi i} \approx z_{ti}, \qquad (t = 1, \ldots, \tau)$$

in which $z_{ti}$ are linear functions of $u_{\lambda i}, v_{\chi i}, w_{ti}$ $(\lambda = 1, \ldots, \nu;\ \chi, t = 1, \ldots, \tau)$. The random variables $(z_{1i}, \ldots, z_{\tau i})$ $(i = 1, \ldots, n)$ have a simultaneous probability distribution, while the n $\tau$-tuples $(z_{1i}, \ldots, z_{\tau i})$ are supposed to be stochastically independent. So we have

$$y_{\chi i} \approx \sum_{t=1}^{\tau} \frac{B_{t\chi}}{B} \left\{-H_t(x_{1i}, \ldots, x_{\nu i}) + z_{ti}\right\} \qquad (17)$$

in which

$$B = \det(\beta_{t\chi})$$

and $B_{t\chi}$ is the cofactor of the element $\beta_{t\chi}$.

Then the problem is reduced to the case considered in section 4. Call N the number of parameters of a polynomial of degree h. Then, in a way and under conditions which are analogous to those stated in section 4, a confidence region for $\tau N$ parameters of the equations (17) can be given. But the original equations contain $\tau(N+\tau-1)$ parameters. This means that, if $\tau(\tau-1)$ parameters of the original equations are given, a confidence region for the remaining $\tau N$ parameters can be constructed. If the levels of significance of the confidence regions for the parameters of the equations (17) are $\varepsilon_\chi$ $(\chi = 1, \ldots, \tau)$, this level of the confidence region for the $\tau N$ parameters of the original equations is $\leq \sum_{\chi=1}^{\tau} \varepsilon_\chi$.
We shall now elaborate a simple example, which is due to T. Haavelmo (1944), p. 99 seq. Suppose we have the following equations:
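$$\eta_{1i} - \beta\eta_{2i} = w_{1i}$$
$$\alpha\xi_{1i} + \eta_{1i} - \alpha\eta_{2i} = w_{2i}$$
$$x_{1i} = \xi_{1i}$$
$$y_{1i} = \eta_{1i} + v_{1i}$$
$$y_{2i} = \eta_{2i}$$

$(i = 1, \ldots, n)$

in which $\alpha$ and $-\beta$ are positive.

We obtain:

$$y_{1i} = \frac{\alpha\beta}{\alpha-\beta}\, x_{1i} + \frac{\alpha w_{1i} - \beta w_{2i}}{\alpha-\beta} + v_{1i}$$

$$y_{2i} = \frac{\alpha}{\alpha-\beta}\, x_{1i} + \frac{w_{1i} - w_{2i}}{\alpha-\beta}$$

Suppose that the complete or the incomplete method gives two confidence intervals, one for $\alpha\beta/(\alpha-\beta)$ and one for $\alpha/(\alpha-\beta)$, with levels of significance $\varepsilon_1$ and $\varepsilon_2$ respectively. Then we obtain two confidence regions in the $\alpha, \beta$-plane, bounded by hyperbolas and by straight lines respectively. The probability that the common part contains the "true" point $(\alpha, \beta)$ is $\geq 1 - \varepsilon_1 - \varepsilon_2$. (See fig. 2).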
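The passage from the two intervals to a region in the $\alpha, \beta$-plane can be illustrated by a membership test. The sketch below evaluates the two reduced-form coefficients $\alpha\beta/(\alpha-\beta)$ and $\alpha/(\alpha-\beta)$ for a candidate point and checks them against two given intervals; the function names and the representation of an interval as a (lower, upper) pair are assumptions of the sketch.

```python
def reduced_form_coefficients(alpha, beta):
    """Coefficients of x_1i in the reduced-form equations for y_1i and y_2i."""
    return alpha * beta / (alpha - beta), alpha / (alpha - beta)


def in_common_part(alpha, beta, interval_1, interval_2):
    """True if (alpha, beta) lies in both regions implied by the two intervals."""
    c1, c2 = reduced_form_coefficients(alpha, beta)
    return (
        interval_1[0] <= c1 <= interval_1[1]
        and interval_2[0] <= c2 <= interval_2[1]
    )
```

Holding the first coefficient at an interval bound traces a hyperbola in the plane, and holding the second traces a straight line through the origin, which is why the two regions have the shapes described above.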

5.3 ON MULTICOLLINEARITY

As a final application we consider the following case. The following equations are
given (cf. section 3.1.):
Rank-Invariant Method 371

Hence

with

Suppose that the observed values Xli' X 2i (i = 1, ... , n) are such that the following
condition is satisfied:
For each pair i, j (i, j = 1, ... , n) the quotient

Xli - Xlj
(i "# j)
X 2i - X2j

has the same sign.


This condition implies that, apart from the above-mentioned linear relation
between Xli' X 2i and Yi' we have an additional monotonic relation between the observed
values Xli and X 2i (if this relation also is - approximately - linear, we have a case of
"multicollinearity").
372 H. Theil

Figure 2

We now have the following


Theorem 7. Under the above-mentioned condition the regions Al and A z (cf. section
3.2.) are identical, and their common part A is unbounded.
Proof. If the condition is satisfied the arrangement of the observed points (Xl;'
X 2;, y) according to increasing values of Xl is the same as (or just the reverse of) the
arrangement according to increasing values of x 2 • Moreover (cf. section 3.2.) the
quantities J(fI!(i}) and j(i2!(i;) which are estimates of aI' given a 2 , and of a b given aI'
respectively are represented by the same set of straight lines in the fJ.I , az-plane:

$$(x_{1i} - x_{1j})\,\alpha_1 + (x_{2i} - x_{2j})\,K^{(2)}(ij) = y_i - y_j.$$


As the slopes of these straight lines, $-(x_{1i} - x_{1j})/(x_{2i} - x_{2j})$, have the same
sign, the regions $A_1$ and $A_2$ are identical, from which the theorem follows.
If the incomplete method instead of the complete method is used, the same
theorem holds with respect to the regions $A_1'$, $A_2'$, whereas the condition that all
quantities $(x_{1i} - x_{1j})/(x_{2i} - x_{2j})$ have the same sign is weakened to the condition
that all quantities $(x_{1i} - x_{1,n_1+i})/(x_{2i} - x_{2,n_1+i})$ have the same sign
$(i = 1, \dots, n_1)$.

6. Problems of Prediction

6.1 THE PROBABILITY SET

For the probability set and the random variables defined on it we refer to section 4.1.
We assume, however, that all errors $u_{\lambda i}$, $v_{\kappa i}$ are identically equal to zero
$(\lambda = 1, \dots, \nu;\ \kappa = 1, \dots, \tau;\ i = 1, \dots, n)$.
Conditions
We impose the following conditions:
Condition I: All $n$ $\tau$-tuples $(w_{1i}, \dots, w_{\tau i})$ are distributed independently of
each other.
Condition IIIa: All $n$ $\tau$-tuples $(w_{1i}, \dots, w_{\tau i})$ have the same continuous
simultaneous distribution function.
Apart from these conditions we shall use the additional conditions, which are
necessary for the determination of a confidence region for the parameters of the
regression equations.

6.2 THE PROBLEM

Suppose that the following n points are observed:

Suppose further that the following v parameters are given:

These parameters are interpreted as the $\xi$-coordinates of an $(n+1)$-th point, which is
not observed. The problem is to determine a confidence region for the $\eta$-coordinates
of this point, i.e. for

$$\eta_{1,n+1}, \dots, \eta_{\tau,n+1}.$$

6.3 CONFIDENCE REGIONS

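Consider again the case (cf. section 5.2.):

$$f_t \equiv H_t(\xi_{1i}, \dots, \xi_{\nu i}) + \sum_{\kappa=1}^{\tau} \beta_{t\kappa}\,\eta_{\kappa i} = w_{ti} \qquad (t = 1, \dots, \tau)$$

so that we have

$$\eta_{\kappa i} = \sum_{t=1}^{\tau} \frac{B_{t\kappa}}{B}\,\bigl\{-H_t(\xi_{1i}, \dots, \xi_{\nu i}) + w_{ti}\bigr\} \qquad (\kappa = 1, \dots, \tau;\ i = 1, \dots, n)$$

Putting $i = n+1$ we can write

$$\eta_{\kappa,n+1} = g_\kappa(\xi_{1,n+1}, \dots, \xi_{\nu,n+1}, \beta) + h_\kappa(w_{1,n+1}, \dots, w_{\tau,n+1}, \beta)$$

in which

$$g_\kappa(\xi_{1,n+1}, \dots, \xi_{\nu,n+1}, \beta) = -\sum_{t=1}^{\tau} \frac{B_{t\kappa}}{B}\,H_t(\xi_{1,n+1}, \dots, \xi_{\nu,n+1})$$

$$h_\kappa(w_{1,n+1}, \dots, w_{\tau,n+1}, \beta) = \sum_{t=1}^{\tau} \frac{B_{t\kappa}}{B}\,w_{t,n+1}$$

and in which $\beta$ is the "true parameter point"; $\beta$ may be considered as a vector, the
components of which are $\beta_{t\kappa}$ $(t, \kappa = 1, \dots, \tau)$ and all parameters determining the
polynomials $H_t$ $(t = 1, \dots, \tau)$.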
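Suppose $\beta$ is known. Then we can arrange the $n$ quantities $h_\kappa(w_{1i}, \dots, w_{\tau i}, \beta)$
according to increasing magnitude:

$$h_{\kappa 1} < h_{\kappa 2} < \dots < h_{\kappa n} \qquad (\kappa = 1, \dots, \tau)$$

in which

$$h_{\kappa j} = h_\kappa(w_{1 i_j}, \dots, w_{\tau i_j}, \beta).$$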

We have the following


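Theorem 8: Under conditions I and IIIa a confidence interval for $\eta_{\kappa,n+1}$ is given by

$$g_\kappa(\xi_{1,n+1}, \dots, \xi_{\nu,n+1}, \beta) + h_{\kappa s} < \eta_{\kappa,n+1} < g_\kappa(\xi_{1,n+1}, \dots, \xi_{\nu,n+1}, \beta) + h_{\kappa,n-s+1}$$

if $\beta$ is the known "true parameter point"; the level of significance is $2s(n+1)^{-1}$.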
In order to prove this theorem, we shall use the following lemma (see W.R.
Thompson (1936)):
Lemma: If a random sample of size n is drawn from a universe with continuous
distribution function; if the sample values are arranged in ascending order; if an (n+ 1)-
th draw from the same universe is to be effected; then the probability that the
stochastic interval bounded by the s-th and the (n-s+ 1)-th of these values will contain
the (n+1)-th is equal to 1-2s/(n+1).
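
The lemma can be checked exactly for small samples. Since the draws come from one continuous universe, every ordering of the $n+1$ values is equally likely, so it suffices to enumerate the orderings. The sketch below does this; the function name is hypothetical and the enumeration is practical only for small $n$.

```python
from fractions import Fraction
from itertools import permutations


def coverage_by_enumeration(n, s):
    """Exact probability that draw n+1 falls between the s-th and the
    (n-s+1)-th order statistics of the first n draws, assuming all
    orderings of the n+1 distinct values are equally likely."""
    hits = total = 0
    for perm in permutations(range(n + 1)):
        sample, new = sorted(perm[:n]), perm[n]
        lower, upper = sample[s - 1], sample[n - s]   # s-th and (n-s+1)-th
        hits += lower < new < upper
        total += 1
    return Fraction(hits, total)
```

For $n = 5$, $s = 2$ the enumeration gives $1/3$, in agreement with $1 - 2s/(n+1) = 1 - 4/6$.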
Proof of Theorem 8: As $g_\kappa(\xi_{1,n+1}, \dots, \xi_{\nu,n+1}, \beta)$ is ex hypothesi a known
quantity, the problem is to determine a confidence interval for
$h_\kappa(w_{1,n+1}, \dots, w_{\tau,n+1}, \beta)$. But $n$ sample values $h_\kappa(w_{1i}, \dots, w_{\tau i}, \beta)$ from
the same universe (cf. condition IIIa) are obtained; hence the lemma is sufficient in
order to show the validity of the theorem.
Generally, however, $\beta$ is unknown, and we can only calculate a confidence
region $R$ for $\beta$. Let now $\beta$ vary through $R$, and denote by $I_\kappa$ the interval bounded by
the lowest of all lower limits of the interval considered in Theorem 8 and by the
highest of all upper limits $(\kappa = 1, \dots, \tau)$. If the level of significance of $R$ is
$\varepsilon$, the following theorem immediately follows:
Theorem 9:

6.4 THE LINEAR CASE IN TWO VARIABLES

For the linear case of two variables

a simple graphical representation can be given. Suppose that $\alpha_1^*$ is the "true"
$\alpha_1$; after arranging the sample values

in order we find two straight lines:

and

Figure 3. n=14, s=2



The probability that the region bounded above by $S_1$ and below by $S_2$ will contain an
$(n+1)$-th sample point $D$ is, under the condition that $\alpha_1^* = \alpha_1$, equal to
$1 - 2s/(n+1)$. (See fig. 3.)
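
For a slope that is taken as known, the two parallel lines can be computed as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the linear equation, which is elided in this text, is assumed to have the form $\eta_i = \alpha_0 + \alpha_1 \xi_i + w_i$, so that the sorted quantities $\eta_i - \alpha_1^* \xi_i$ contain the constant term together with the disturbance.

```python
def prediction_band(xi, eta, slope, s, xi_new):
    """Interval for eta at xi_new when the slope is taken as known.

    The quantities eta_i - slope * xi_i are sorted; the s-th and the
    (n-s+1)-th of them fix the two parallel lines.
    """
    n = len(xi)
    if not 1 <= s <= n // 2:
        raise ValueError("need 1 <= s <= n/2")
    offsets = sorted(e - slope * x for x, e in zip(xi, eta))
    lower = offsets[s - 1] + slope * xi_new
    upper = offsets[n - s] + slope * xi_new
    level = 2 * s / (n + 1)          # level of significance from the lemma
    return lower, upper, level
```

With $n = 14$ and $s = 2$, as in fig. 3, the level of significance is $4/15$.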
Suppose that the confidence interval for $\alpha_1$ is $(\alpha_1', \alpha_1'')$. When $\alpha_1^*$ varies
through this interval the lines $S_1$ and $S_2$ revolve around the observed points
$(\xi_i, \eta_i)$ with ranks $s$ and $(n-s+1)$ respectively with respect to increasing values
of $w$. As long as the observed points having these properties remain the same, $S_1$
and $S_2$ revolve around one point; but as soon as variation of $\alpha_1^*$ causes another
point to have this property, the revolution takes place around this point. The figures
4 and 5 elucidate the fact that sometimes the region is bounded by the straight lines
$S_1$ and $S_2$ for $\alpha_1^* = \alpha_1'$ and $\alpha_1^* = \alpha_1''$ only, whereas it is sometimes necessary
to consider values between $\alpha_1'$ and $\alpha_1''$ as well.
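
The envelope described here can be computed without a fine grid of slopes. For a fixed abscissa each limit is an order statistic of the $n$ quantities $\eta_i + a(\xi_{\text{new}} - \xi_i)$, which are linear in the slope $a$; such an order statistic is piecewise linear in $a$, so its extremes over the interval occur at the end points or at slopes where two of these lines cross. The sketch below uses this observation; the function name and arguments are hypothetical.

```python
def band_envelope(xi, eta, slope_lo, slope_hi, s, xi_new):
    """Lowest lower limit and highest upper limit at xi_new when the
    slope ranges over [slope_lo, slope_hi].

    Candidate slopes are the two end points and every pairwise slope
    (eta_i - eta_j)/(xi_i - xi_j) inside the interval, where the ranking
    of the observed points can change.
    """
    n = len(xi)
    candidates = {slope_lo, slope_hi}
    for i in range(n):
        for j in range(i + 1, n):
            if xi[i] != xi[j]:
                a = (eta[i] - eta[j]) / (xi[i] - xi[j])
                if slope_lo < a < slope_hi:
                    candidates.add(a)
    lowers, uppers = [], []
    for a in candidates:
        values = sorted(e + a * (xi_new - x) for x, e in zip(xi, eta))
        lowers.append(values[s - 1])      # s-th smallest
        uppers.append(values[n - s])      # (n-s+1)-th smallest
    return min(lowers), max(uppers)
```

When only the end points matter the result coincides with the bounds obtained from the two extreme slopes; interior candidates matter when the point of rank $s$ or $n-s+1$ changes inside the interval.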

Figure 4. n=11, s=2



Figure 5. n=15, s=2

7. Concluding Remarks

The methods of determining confidence regions which may be derived from this kind
of analysis have not been exhaustively treated. In order to elucidate this statement we
shall give a confidence interval for $\alpha$ in the stochastic regression equation

$$\eta_i = A\,\xi_i^{\alpha} + w_i \qquad (i = 1, \dots, n)$$

in which all $\xi_i$ are positive, and in which

$$P[A\,\eta_i \leq 0] = 0$$

holds for $i = 1, \dots, n$. $\xi_1, \dots, \xi_n$ are known. $A$ and $\alpha$ are unknown parameters, and
$w_1, \dots, w_n$ are random variables, which are supposed (1) to be distributed
stochastically independent, (2) to have continuous symmetrical distribution functions
with zero median.
We arrange the observed points $(\xi_i, \eta_i)$ according to increasing magnitude of $\xi$
and define

$$d_i = \frac{\log|\eta_{n_1+i}| - \log|\eta_i|}{\log \xi_{n_1+i} - \log \xi_i} \qquad (i = 1, \dots, n_1)$$

in which $n_1 = \tfrac{1}{2}n$ (if $n$ is odd the point $(\xi_{\frac{1}{2}(n+1)}, \eta_{\frac{1}{2}(n+1)})$ is neglected).
After arranging the observed quantities $d_i$ according to increasing magnitude:

we have the following


Theorem 10. Under conditions (1) and (2) the interval $(d_{(r_1)}, d_{(n_1-r_1+1)})$ is a
confidence interval for $\alpha$ to the level of significance $2\,I_{1/2}(n_1-r_1+1,\ r_1)$.
Proof. We have

or:

It follows from condition (2), that

has a continuous distribution function with zero median. Hence:

$$P\left[\eta_{n_1+i} - \left(\frac{\xi_{n_1+i}}{\xi_i}\right)^{\alpha} \eta_i < 0\right] = P\bigl[\log \eta_{n_1+i} - \log \eta_i < \alpha\,(\log \xi_{n_1+i} - \log \xi_i)\bigr] =$$

$$= P\left[\frac{\log \eta_{n_1+i} - \log \eta_i}{\log \xi_{n_1+i} - \log \xi_i} < \alpha\right] = P\left[\frac{\log \eta_{n_1+i} - \log \eta_i}{\log \xi_{n_1+i} - \log \xi_i} > \alpha\right]$$

if $\eta_1, \dots, \eta_n$ are positive; if they are negative we have to replace $\eta_i$ and $\eta_{n_1+i}$ by
$-\eta_i$ and $-\eta_{n_1+i}$ respectively. From this and from condition (1) the theorem follows.
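
The construction of Theorem 10 can be sketched in a few lines. The function name is hypothetical, and the level of significance is evaluated on the assumption that it is twice the incomplete beta function ratio $I_{1/2}(n_1 - r_1 + 1, r_1)$, which for integer arguments equals a binomial tail sum with probability $\tfrac{1}{2}$.

```python
from math import comb, log


def exponent_interval(xi, eta, r1):
    """Confidence interval for the exponent from the paired quotients d_i.

    Assumes all xi are positive and distinct and all eta share one sign.
    Returns (lower, upper, level) with
    level = 2 * sum_{k < r1} C(n1, k) / 2**n1.
    """
    points = sorted(zip(xi, eta))             # increasing xi
    n = len(points)
    n1 = n // 2
    if n % 2:                                 # odd n: neglect the middle point
        del points[n1]
    d = sorted(
        (log(abs(points[n1 + i][1])) - log(abs(points[i][1])))
        / (log(points[n1 + i][0]) - log(points[i][0]))
        for i in range(n1)
    )
    level = 2 * sum(comb(n1, k) for k in range(r1)) / 2 ** n1
    return d[r1 - 1], d[n1 - r1], level
```

With eight observations and $r_1 = 1$ the interval runs from the smallest to the largest $d_i$, and the level of significance is $2 \cdot 2^{-4} = 0.125$.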
The theorem shows that this method of determining a confidence interval for a
is identical with the incomplete method for a in the linear equation

$$\log \eta_i = \log A + \alpha \log \xi_i + w_i'$$

(which can be written as $\eta_i = A\,\xi_i^{\alpha}\,e^{w_i'}$)

if $w_1', \dots, w_n'$ satisfy the same conditions (1) and (2).


Finally we mention that it is possible to find estimates instead of confidence
intervals. Consider e.g. the statistics $\Delta(ij)$; each of these $\binom{n}{2}$ statistics has the
property that its sampling median is equal to $\alpha_1$ (cf. section 1.3). Hence one can use
the sample median of the observed quantities $\Delta(ij)$ as an estimate of $\alpha_1$.
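
A minimal sketch of such an estimate is given below. It assumes that the statistics in question are the pairwise slopes $(y_i - y_j)/(x_i - x_j)$ of section 1.3, which is not reproduced here; the function name is hypothetical.

```python
from itertools import combinations
from statistics import median


def median_slope(x, y):
    """Sample median of the pairwise slopes (y_i - y_j) / (x_i - x_j).

    Pairs with equal x-values are skipped, since their slope is undefined.
    """
    slopes = [
        (y[i] - y[j]) / (x[i] - x[j])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    return median(slopes)
```

The estimate is insensitive to a single outlying observation: for `x = [1, 2, 3, 4, 5]` and `y = [1, 3, 5, 7, 100]` six of the ten pairwise slopes equal 2, and so does their median.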
It is a pleasure to acknowledge my indebtedness to Professor Dr D. van Dantzig
for his stimulating interest and to Mr J. Hemelrijk for his valuable and constructive
criticism.
References

Bentzel, R., and H. Wold: 1946, "On Statistical Demand Analysis from the Viewpoint
of Simultaneous Equations," Skand. Aktuarietidskr., 29, 95-114.

Girshick, M.A., and T. Haavelmo: 1947, "Statistical Analysis of the Demand for Food:
Examples of Simultaneous Estimation of Structural Equations," Econometrica,
15, 79-110.

Haavelmo, T.: 1943, "The Statistical Implications of a System of Simultaneous
Equations," Econometrica, 11, 1-12.

Haavelmo, T.: 1944, "The Probability Approach in Econometrics," Econometrica, 12,
suppl.

Koopmans, T.: 1945, "Statistical Estimation of Simultaneous Economic Relations,"
Journal of the Amer. Statist. Assoc., 40, 448-466.

Koopmans, T.: 1950, ed., Statistical Inference in Dynamic Economic Models, New
York.

Thompson, W.R.: 1936, "On Confidence Ranges for the Median and Other
Expectation Distributions for Populations of Unknown Distribution Form,"
Annals of Math. Statist., 7, 122-128.
