Identifying Gender of Authors: An Application of Markov Chains To Textual Analysis
Curtis Miller
Spring 2015
Introduction
Authors frequently use pen names in place of their own for
their work. Reasons range from preserving anonymity to
marketing purposes. Female authors often used pen names in
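[...] $A$, which usually [...] $m_i$ strings in $C_i$, so $f_{i,j} \in C_i$.
Every string $f_{i,j}$ is thought to be generated by a Markov chain with an author-specific transition matrix. We do not know this matrix, but we can estimate it by $P^i$. Suppose $k, l \in A$ and let $Q^{i,j}_{kl}$ be the number of transitions $k \to l$ in $f_{i,j}$. In addition, let

$$Q^i_{kl} = \sum_{j=1}^{m_i} Q^{i,j}_{kl}$$

(this is the frequency for a letter transition for author $i$) and

$$Q^i_k = \sum_{l \in A} Q^i_{kl}.$$

The estimate $P^i$ (with entries written $P^i(k,l)$ or $P^i_{kl}$) will be

$$P^i(k,l) = \frac{Q^i_{kl}}{Q^i_k}.$$

This is the empirical transition matrix.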
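The estimate above can be sketched in a few lines of Python. This is only an illustration of the counting step, not the code used for this project: the function names are hypothetical, the alphabet is taken to be whatever characters occur in the input, and transitions are counted within each text separately so that no transition spans two texts.

```python
from collections import Counter


def transition_counts(texts):
    """Pool the counts of letter transitions k -> l over all texts of one author."""
    counts = Counter()
    for text in texts:
        # zip pairs each character with its successor within the same text
        counts.update(zip(text, text[1:]))
    return counts


def empirical_transition_matrix(texts):
    """Return {(k, l): count(k -> l) / count(k -> anything)} for one author."""
    counts = transition_counts(texts)
    row_totals = Counter()
    for (k, _l), n in counts.items():
        row_totals[k] += n
    # Only observed transitions are stored; missing pairs have estimate zero.
    return {(k, l): n / row_totals[k] for (k, l), n in counts.items()}


# Toy input (hypothetical data, not from the corpus used here)
P = empirical_transition_matrix(["abab", "abb"])
```

Storing the matrix as a dictionary keyed by letter pairs keeps the sketch short; a dense array indexed by letter would serve equally well for a fixed alphabet.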
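Suppose we knew that a string $x$ of letters from $A$ was generated by [...]

$$\prod_{k,l \in A} \left(P^i_{kl}\right)^{\nu_{kl}},$$

where $\nu_{kl}$ is the number of transitions $k \to l$ in $x$ [...] $P^i_{kl} = 0$; if [...] $P^i_{kl} = 0$ [...] and $\nu_{kl} = 0$, we have

$$\prod_{k,l \,:\, P^i(k,l) > 0} \left(P^i_{kl}\right)^{\nu_{kl}}$$

[...]

$$\sum_{k,l \,:\, P^i(k,l) > 0} \nu_{kl} \log\left(P^i_{kl}\right)$$

[...] $x$. [...] $\Lambda(x, i)$. A smaller lambda [...] $R_{ij}$ allows [...]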
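A minimal sketch of how such a score could be used for classification follows. Several points are assumptions rather than a record of the original computation: the score is taken to be the negative of the log-likelihood sum (so that a smaller value indicates a better fit), transitions with zero estimated probability are skipped, and the names `score` and `classify`, along with the toy matrices, are hypothetical.

```python
import math
from collections import Counter


def score(x, P):
    """Negative log-likelihood of string x under transition estimates P.

    P maps (k, l) to an estimated transition probability. Pairs with zero
    or missing probability are skipped (an assumption of this sketch).
    """
    nu = Counter(zip(x, x[1:]))  # number of transitions k -> l in x
    total = 0.0
    for pair, n in nu.items():
        p = P.get(pair, 0.0)
        if p > 0.0:
            total -= n * math.log(p)
    return total


def classify(x, models):
    """Return the label whose transition matrix gives x the smallest score."""
    return min(models, key=lambda label: score(x, models[label]))


# Toy matrices (hypothetical values, for illustration only)
models = {
    "female": {("a", "b"): 1.0, ("b", "a"): 0.5, ("b", "b"): 0.5},
    "male": {("a", "a"): 0.9, ("a", "b"): 0.1, ("b", "a"): 1.0},
}
predicted = classify("abab", models)
```

Skipping zero-probability transitions means a text is never ruled out by a single unseen letter pair, but it also means such pairs carry no penalty; smoothing the estimates would be one alternative.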
Results
I list the results below in a table containing the title of the
text, the author (with pen name in parentheses), the gender
MATH 5050 PROJECT
Author
Jane Austen
Louisa May Alcott
Mark Twain (Samuel Langhorne Clemens)
Voltaire (François-Marie Arouet)
Joseph Conrad (Józef Teodor Konrad Korzeniowski)
Ellis Bell (Emily Brontë)
Acton Bell (Anne Brontë)
George Orwell (Eric Blair)
George Eliot (Mary Ann Evans)
Currer Bell (Charlotte Brontë)
Predicted Gender
Female
Female
True Gender
Female
Female
Female
Male
Male
Male
Male
Male
Female
Female
Male
Female
Male
Male
Male
Female
Female
Female
Anonymous
Female
Unknown (believed to be male)
Anonymous
Female
Unknown
Anonymous
Anonymous
Male
Female
Unknown
Unknown
is useful not
more frequently than some other author (and this is true even
when texts are not segregated based on authorship). This leaves
open a large realm of possible reasons for the method to work.
For example, certain letter transitions may be more likely in one
genre of writing than another, and if women tend to write more in
one genre and men tend to write more in another, a man who
writes in a female-dominated genre may be predicted by the
method to be female. Then the method described above is not
actually predicting gender but rather whether an author writes in a
genre dominated by a particular sex. This may partly explain the
patterns we see in gender identification.
This potential problem holds for determining particular
authors as well. The Time Machine might be attributed to H.G.
Wells rather than Leo Tolstoy because the former is an authority
figure in the science fiction genre, unlike the latter. If we
unearthed some unknown work of science fiction by Tolstoy,
would the method attribute the text to Tolstoy or to H.G. Wells?
This is an important issue that one must bear in mind when
seeking to expand the Markov chain method.
References
[1] Khmelev, D.V. Disputed Authorship Resolution through Using
Relative Empirical Entropy for Markov Chains of Letters in
Human Language Texts. 2000. Journal of Quantitative
Linguistics, v. 7, no. 3, pp. 201-207.
[2] Khmelev, D.V. & Tweedie, F.J. Using Markov Chains for
Identification of Writers. 2001. Literary and Linguistic
Computing, v. 16, no. 3, pp. 299-307.
[3] Markov, A.A. On some applications of statistical method. 1916.
Izvestiya Akademii Nauk, v. 10, no. 4, p. 239.
[4] Wikipedia. J. K. Rowling. Retrieved April 18th, 2015 from
http://en.wikipedia.org/wiki/J._K._Rowling.
[5] Wikipedia. The Romance of Lust. Retrieved April 18th, 2015
from http://en.wikipedia.org/wiki/The_Romance_of_Lust.