Pearson NewMethodDetermining 1909
Character B, of which Only the Percentage of Cases Wherein B Exceeds (or Falls Short
of) a Given Intensity is Recorded for Each Grade of A
Author(s): Karl Pearson
Source: Biometrika, Jul. - Oct., 1909, Vol. 7, No. 1/2 (Jul. - Oct., 1909), pp. 96-105
Published by: Oxford University Press on behalf of Biometrika Trust
(1) As an example of the class of cases to which the method of this paper
applies I instance that we might be given the ages of candidates for a given
examination, and the number of failures at each age, without the individual
marks; from these data we might desire to correlate capacity and age. Or, again,
we might be given the percentage of first convictions for each age group of the
community, and the problem might be to determine the relation between age and
the tendency to crime as judged by conviction. Or, age being put on one side, we
might desire to correlate any psychical character with anthropometric characters,
e.g. the cephalic index in children as a more or less marked racial character with
their conscientiousness or shyness, measured by the number of shy or conscientious
children at each value of the index.
* Pearson: "On the Theory of Skew Correlation." Drapers Research Memoirs (Dulau & Co).
(2) The basal idea is so extraordinarily simple that one is inclined to believe
that it must have been noticed before. I am not able, however, to refer to any
previous mention of it.
$$\frac{p}{\sigma_1} = r\,\frac{q}{\sigma_2}.$$
Hence
$$r = \frac{p/\sigma_1}{q/\sigma_2} \qquad \text{(i)}.$$
So far there is no a
value of the A-variate, for all the pairs with specially marked B-variate; thus in the
example, in the first illustration I have given above, the mean age of all candidates
who passed the examination. And $\sigma_1$ is the standard deviation of the measured
character and can therefore be found, e.g. in the same illustration it is the age
variability of all candidates. Thus the numerator in Equation (i) can always be
found. The next point is to consider how the denominator can be discovered.
Now the B-variate is not given quantitatively, but we are given the percentage
of B beyond the arbitrary division, i.e. in our illustration the number out of the
candidates who succeed in passing. We cannot therefore find $q/\sigma_2$ by the usual
processes of determining a mean. If, however, we assume the B-variate to follow
reasonably closely a Gaussian distribution, the percentage of the B-variate gives,
by means of the probability-integral tables, the ratio $y/\sigma_2$ for the distance from
the mean at which the B-variate is divided, and then
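$$\frac{q}{\sigma_2}
= \frac{\dfrac{N}{\sqrt{2\pi}\,\sigma_2}\displaystyle\int_{y}^{\infty} \frac{y'}{\sigma_2}\, e^{-\frac{1}{2} y'^2/\sigma_2^2}\,dy'}
       {\dfrac{N}{\sqrt{2\pi}\,\sigma_2}\displaystyle\int_{y}^{\infty} e^{-\frac{1}{2} y'^2/\sigma_2^2}\,dy'}
= \frac{\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y/\sigma_2)^2}}
       {\dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{y/\sigma_2}^{\infty} e^{-\frac{1}{2} t^2}\,dt}.$$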
Here both numerator and denominator are known as soon as $y/\sigma_2$ has been
found. They are for example the $z$ and $\tfrac{1}{2}(1-\alpha)$ of Sheppard's Tables (Biometrika,
Vol. II. p. 182).
Thus by the simple hypotheses that the regression is linear, and that a fairly
close value of the mean of the marked part of the B-variate can be found on the
assumption that it has a normal distribution, we can readily find numerator and
denominator of the value of r as given by Equation (i).
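The step from the normal assumption to the tabled quantities can be written out explicitly. The following is a sketch in modern notation rather than the paper's own working: the symbols $h = y/\sigma_2$ (the standardised point of division) and $\varphi$ (the standard normal density) are introduced here for convenience, and the marked part of the B-variate is assumed to be the part lying beyond the division.

```latex
\[
\frac{q}{\sigma_2}
= \frac{\int_h^\infty x\,\varphi(x)\,dx}{\int_h^\infty \varphi(x)\,dx}
= \frac{\varphi(h)}{\int_h^\infty \varphi(x)\,dx}
= \frac{z}{\tfrac{1}{2}(1-\alpha)},
\qquad
\varphi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2},
\]
% the middle step uses d\varphi/dx = -x\varphi(x), so that
% \int_h^\infty x\,\varphi(x)\,dx = \varphi(h).
\[
r = \frac{p/\sigma_1}{q/\sigma_2}
  = \frac{p}{\sigma_1}\cdot\frac{\tfrac{1}{2}(1-\alpha)}{z}.
\]
```

In this form the correlation needs only three quantities: the standardised difference of means $p/\sigma_1$, the proportion marked, and the normal ordinate at the corresponding point of division.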
Illustration I. Relation of Anaemia to Age. I take the data from the last
Report (1909) of the Education Committee of the London County Council. Report
of Medical Officer. Particulars as to age and "colour of blood" are given for 325
boys at Laxon Street and 364 boys at Sirdar Road schools; and for 367 girls at
the former school and 330 girls at the latter school: see p. 19. Sirdar Road, Notting
Dale, is described as "one of the poorest schools of London in the midst of a
migratory, criminal, and thoroughly degenerate and poverty stricken population."
Laxon Street is also in a very poor district; lack of employment, improvidence
and drink are exceedingly common in both districts. Either different standards
of anaemia were used in the two schools, or the Laxon Street school shows far
more anaemia than the Sirdar Road. The ages covered are 7 to 13. Putting
" pale " and " very pale " together, the latter "blood colour" only occurring in about
5 p.c. of the children dealt with, we have
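[Tables: numbers of children at each age, 7 to 13, for boys and for girls. Only the row of totals for the Sirdar Road boys is recoverable.]

Ages      7    8    9   10   11   12   13   Total
Totals   51   46   50   51   52   56   58     364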
I will illustrate the general process on the Sirdar Road boys:
$$p/\sigma_1 = (10.5444 - 10.6456)/2.019 = -.1012/2.019 = -.0512.$$
The negative sign indicates that anaemia in boys decreases with age, i.e. the
age of the anaemic boys is less than that of all boys. The girls at Sirdar Road
give a plus correlation $= +.0627$. Treated in the same manner the Laxon Street
boys give $-.1632$. In the case of the girls at the latter school the correlation is
$+.0373$, the normal girls having a lower age than all girls. Thus we find:
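[Table of the correlations of anaemia with age at each school; only the column headings Boys and Girls survive.]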
* It is necessary to take the normal and not the anaemic girls as the fo
minority.
(i) The relation between anaemia and age (7 to 13) is not very marked.
(ii) Anaemia decreases as the boys grow older and increases as the girls grow
older.
(iii) The association of anaemia with age in boys is about trebled in intensity
when we pass from Sirdar Road to Laxon Street, or from a school with upwards
of 30% to one with more than 50% of anaemic boys.
The above distributions are perhaps not quite the type we should select as
most suitable for the method; the totals for each age are too nearly alike. It
would probably, however, be difficult to bring out by any more exact method,
leading to a unique result, the correlation of age and anaemia. It is obvious that
a fourfold table method would present considerable diversity according to the age
partition selected and would require far more labour than the 15 minutes requisite
to determine a correlation-coefficient by the present method.
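The whole computation for a table of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's own working: the function name `biserial_from_groups` and the counts below are invented for illustration, and `statistics.NormalDist` from the standard library stands in for the probability-integral tables.

```python
from math import sqrt
from statistics import NormalDist


def biserial_from_groups(values, totals, marked):
    """Correlation of a measured character A with a marked/unmarked character B.

    values : value of A for each group (e.g. the central age of each age group)
    totals : number of individuals in each group
    marked : number in each group whose B-character lies beyond the division
    """
    n = sum(totals)
    m = sum(marked)

    # Numerator of Equation (i): p / sigma_1
    mean_all = sum(v * t for v, t in zip(values, totals)) / n
    mean_marked = sum(v * k for v, k in zip(values, marked)) / m
    var_all = sum(t * (v - mean_all) ** 2 for v, t in zip(values, totals)) / n
    p_over_sigma1 = (mean_marked - mean_all) / sqrt(var_all)

    # Denominator of Equation (i): q / sigma_2 = z / tail, assuming B is normal
    tail = m / n                          # plays the role of (1/2)(1 - alpha)
    std_normal = NormalDist()
    h = std_normal.inv_cdf(1.0 - tail)    # standardised point of division
    z = std_normal.pdf(h)                 # ordinate at the point of division
    q_over_sigma2 = z / tail

    return p_over_sigma1 / q_over_sigma2


# Hypothetical counts, invented for illustration only (not the L.C.C. data)
ages = [7, 8, 9, 10, 11, 12, 13]
totals = [50, 50, 50, 50, 50, 50, 50]
marked = [20, 19, 18, 17, 16, 15, 14]
unmarked = [t - k for t, k in zip(totals, marked)]

r = biserial_from_groups(ages, totals, marked)
r_complement = biserial_from_groups(ages, totals, unmarked)
```

With these invented counts the marked proportion falls with age, so `r` is negative. Marking the complementary group instead changes only the sign of the result, which is why either group may be taken as the marked one.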
The only difficulty here is the centering of the groups "22-30" and "over 30,"
the official statistics (as usual!) clubbing all these ages, which are really so signi-
ficant for the frequency distribution, together. After some consideration, I made
the mean ages of these groups 25 and 33.
The mean age of the passing candidates was 18.4280 and of the total candidates
18.7865, the S.D. of the latter being 3.2850. Thus $p = -.3585$, $p/\sigma_1 = -.1091$.
Further $\tfrac{1}{2}(1-\alpha) = .3917$, giving $\tfrac{1}{2}(1+\alpha) = .6083$ and $z = .3843$. Thus
$q/\sigma_2 = .9806$, and $r = -.1113$.
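The arithmetic of this illustration can be checked without tables. The sketch below is an assumption-laden check, not part of the paper: it takes the means, standard deviation and passing proportion as read from the text (18.4280, 18.7865, 3.2850 and .3917), and uses Python's `statistics.NormalDist` in place of Sheppard's Tables. The helper name `q_over_sigma2` is invented here.

```python
from statistics import NormalDist


def q_over_sigma2(tail):
    """Ratio z / tail for a normal variate divided so that `tail` lies beyond."""
    std_normal = NormalDist()
    h = std_normal.inv_cdf(1.0 - tail)   # standardised point of division
    return std_normal.pdf(h) / tail      # ordinate divided by tail area


# Values as read from the text of the illustration
mean_passed, mean_all, sd_all = 18.4280, 18.7865, 3.2850
tail = 0.3917                            # (1/2)(1 - alpha)

p_over_sigma1 = (mean_passed - mean_all) / sd_all
ratio = q_over_sigma2(tail)
r = p_over_sigma1 / ratio

print(round(p_over_sigma1, 4), round(ratio, 4), round(r, 4))
```

Run as written, the three printed quantities should agree with the values of $p/\sigma_1$, $q/\sigma_2$ and $r$ given in the text to about three decimal places; small differences in the last figure are to be expected from the rounding of tabled values.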
We conclude that there is a small but sensible correlation between youth and
the ability to pass this examination. There are some evidences that this holds for
other examinations. It would be interesting to consider this problem at greater
length. While the complexity of the brain may be greater at 28 than 18, the
mean weight appears to be greater at the latter age, and possibly in mere
examination activity quantity may reckon more than quality in brains as well as it
frequently does in marking. It might be of value to determine the partial
correlation when allowance was made for the number of times the candidate had
entered.
CONSCIENTIOUSNESS

Cephalic
Index      Keen   Dull   Totals
  66          1      -        1
  67          -      -        -
  68          1      1        2
  69          4      1        5
  70          5      1        6
  71          7      2        9
  72         11      6       17
  73         26     10       36
  74         35     18       53
  75         58     25       83
  76         81     40      121
  77        110     58      168
  78        145     64      209
  79        127     62      189
  80        122     58      180
  81         99     44      143
  82         86     24      110
  83         55     29       84
  84         28     21       49
  85         18     11       29
  86         16      7       23
  87          8      1        9
  88          ?      ?        3
  89          ?      ?        1
  90          1      1        2
  91          ?      ?        2

?  One of the Keen and Dull entries equals the total and the other is blank; which is which is not recoverable.
Here $\tfrac{1}{2}(1-\alpha) = .3162$.
Further, $\tfrac{1}{2}(1+\alpha) = .6838$, giving $z = .3558$.
Thus we find $p/\sigma_1 = +\dfrac{.0075}{3.2929} = +.0023$, $q/\sigma_2 = 1.1252$.
And finally $r = +.0020$.
            GLANDS           TONSILS
Weight    Good    Bad      Good    Bad     Totals
  14†     2  2  -  2
  16         3      5         5      3          8
  18        15     26        33      8         41
  20        20     40        36     24         60
  22        28     47        50     25         75
  24        34     30        47     17         64
  26        30     31        49     12         61
  28        29     20        41      8         49
  30        30     30        47     13         60
  32        21     14        24     11         35
  34        18     11        22      7         29
  36        18      5        15      8         23
  38         6      7        10      3         13
  40         5      2         4      3          7
  42         7      3         8      2         10
  44†     1  1  1
  46         -      -         -      -          -
  48†     3  2  1  3
  50†     1  1  1
  52†     -  1  -  1
 692†     1  -  1  1  2

†  The column positions of these entries (and the weight label of the last row) are not recoverable; the entries are given in the order in which they appear.
For glands:
i (1 -a)-=49
* K. Pearson, "R
pp. 105-146.
For tonsils:
$\tfrac{1}{2}(1-\alpha) = .2679$, $\tfrac{1}{2}(1+\alpha) = .7321$, $z = .3293$.
$$\text{Correlation of good glands and weight} = \frac{.385 \times .4972}{6.7502 \times .3988} = .070.$$
Thus the mothers are somewhat more likely to be employed if their sons are
younger. This probably only means that the more well-to-do parents allow their
children to stay longer at school.
[Table: Height of Sons against employment of MOTHERS (Employed, Not Employed, Totals); the entries are not recoverable.]
$$r_{he} = \frac{1.4353}{5.3043} \times \frac{.0567}{.1139} = .1347.$$
Miss Elderton finds by the product-moment method that the correlation of age
and height for these boys is
$$r_{ah} = .8452.$$
$$\rho_{he} = \frac{r_{he} - r_{ah}\,r_{ae}}{\sqrt{(1 - r_{ah}^2)(1 - r_{ae}^2)}} = .1958.$$
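The partial correlation used here is the ordinary first-order formula, which can be sketched as below. The function name is invented for illustration. The value of the age and employment correlation $r_{ae}$ does not survive in the text above, so the figure used for it in the example is a placeholder assumption and the result is not meant to reproduce the printed value.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


r_he = 0.1347   # height and employment, from the text
r_ah = 0.8452   # age and height, from the text
r_ae = 0.0      # age and employment: PLACEHOLDER, not a value from the paper

# Height and employment for constant age (h is x, e is y, a is z)
rho_he = partial_correlation(r_he, r_ah, r_ae)
```

Because age and height are so closely correlated, the denominator factor $\sqrt{1 - r_{ah}^2}$ is small, and the partial correlation is quite sensitive to the value adopted for $r_{ae}$.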
Conclusions. The illustrations will have indicated that the new process of
determining correlation can be applied to a great variety of problems. In none of
the cases dealt with was the correlation very high, but this is purely a result of
the material selected, which was chosen, not from any knowledge of the existence
of correlation, but to indicate the type of problem to which the new method can be
applied. Hitherto such problems could only be treated by the fourfold table
method. The examples given show how much more expeditious is the new process,
and further how it frees us from all doubt that exists in the old method as to the
suitable position for the division of the measured variate. That its probable error
will be less than that of the fourfold table method will, I have little doubt, be
demonstrated when its value has been worked out. The method is, however, so
convenient and so frequently of service that I have not delayed its publication
until the leisure came to determine the probable error.