Identifying Gender of Authors: An Application of Markov Chains To Textual Analysis

Curtis Miller
Spring 2015
MATH 5050 PROJECT
Introduction
Authors frequently use pen names in place of their own for
their work. Reasons range from preserving anonymity to
marketing purposes. Female authors often used pen names in
order to hide their gender. This was very common in the 18th
century but still occurs today. Joanne Rowling, for example, used
the names J. K. Rowling and Robert Galbraith to hide her gender
[4].
In this paper, I show that Markov chains can be used to
identify author gender. I explain the method for doing so and
provide an example using 120 texts from Project Gutenberg. I
then finish with a discussion and suggestions for further
applications.
Method and Mathematical Background

Markov originally conceived of Markov chains with textual
analysis in mind [3]. Even though human languages are obviously
not Markov chains, Dmitri Khmelev et al. showed that Markov
chains can be used to predict authorship with surprisingly good
accuracy [1;2]. The method in this paper is based on Khmelev's
method, described in [1] and [2].
Khmelev provided a formalization of his procedure in [1], and
I describe it below. Begin with an alphabet set $A$, which usually
contains lower-case letters and a single whitespace character, the
space character. $A^k$ is the set of all words of length $k > 0$, and
$A^* = \bigcup_{k>0} A^k$; in other words, $A^*$ is the set of strings
based on the alphabet $A$. $f \in A^*$ is one such string, and $|f|$
is its length. There are $n$ sets $C_i$, and $f_{i,j} \in A^*$, with
$1 \le j \le m_i$, is one of the $m_i$ strings in $C_i$, so
$f_{i,j} \in C_i$.
Every string $f_{i,j}$ is thought to be generated by a Markov
chain with transition matrix $\Pi_i$. We do not know $\Pi_i$, but we
can estimate it with a transition matrix $P^i$. Suppose $k, l \in A$
are letters in alphabet $A$. Denote by $Q^{i,j}_{kl}$ the number of
letter transitions $k \to l$ in $f_{i,j}$. In addition,
$$Q^i_{kl} = \sum_{j=1}^{m_i} Q^{i,j}_{kl}$$
(this is the frequency of a letter transition for author $i$) and
$$Q^i_k = \sum_{l \in A} Q^i_{kl}$$
(this is the frequency with which the letter $k$ is used). Then the
entry of $P^i$ corresponding to a transition $k \to l$ (which I may
refer to as $P^i(k,l)$ or $P^i_{kl}$) will be
$$P^i(k,l) = \frac{Q^i_{kl}}{Q^i_k}.$$
This is the empirical transition matrix.
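The estimate of the empirical transition matrix can be sketched in a few lines of Python. This is an illustration only, not the R script used for this project; the function names `transition_counts` and `empirical_transition_matrix` and the toy strings are assumptions made for the example.

```python
from collections import Counter

def transition_counts(texts):
    """Count letter transitions k -> l over all strings of one class (Q_kl)."""
    counts = Counter()
    for text in texts:
        # zip pairs each character with the character that follows it
        counts.update(zip(text, text[1:]))
    return counts

def empirical_transition_matrix(texts):
    """Return {(k, l): Q_kl / Q_k} for every transition observed in texts."""
    q_kl = transition_counts(texts)
    q_k = Counter()
    for (k, _l), count in q_kl.items():
        q_k[k] += count  # Q_k is the sum of Q_kl over l
    return {(k, l): count / q_k[k] for (k, l), count in q_kl.items()}

# Toy class made of two short strings (hypothetical data).
example = empirical_transition_matrix([" abab ", " ab "])
```

Only observed transitions are stored, so every stored probability is positive and unobserved transitions are simply absent from the dictionary.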
Suppose we knew that a string $x \in A^*$ was generated by
some $\Pi_\theta$, where $1 \le \theta \le n$ but $\theta$ is unknown.
Let $\nu_{kl}$ denote the number of transitions $k \to l$ in $x$. For
every $i$ with $1 \le i \le n$, we would use $P^i$ as the estimate
for $\Pi_i$. We would then represent the probability of seeing the
string $x$ if $\theta = i$ as
$$\prod_{k,l \in A} \left(P^i_{kl}\right)^{\nu_{kl}}.$$
Notice that problems would arise where $P^i_{kl} = 0$; if just
$P^i_{kl} = 0$, the whole probability is zero, and if both
$P^i_{kl} = 0$ and $\nu_{kl} = 0$, we have a number that is undefined
($0^0$). Rather than let these zero transitions spoil our
estimators, we will omit them instead and consider the number
$$\prod_{k,l \,:\, P^i(k,l) > 0} \left(P^i_{kl}\right)^{\nu_{kl}}.$$
If we let $\hat{\theta}$ be the $i$ that maximizes this number,
$\hat{\theta}$ would be a maximum likelihood estimator for $\theta$.
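Rather than use the probability directly, though, we will use
the natural logarithm of this number. Let
$$\Lambda(x, i) = -\sum_{k,l \,:\, P^i(k,l) > 0} \nu_{kl} \log\left(P^i_{kl}\right)$$
(log is the natural logarithm). Because of the negative sign,
maximizing the probability is the same as minimizing $\Lambda$, so
we could write the maximum likelihood estimator $\hat{\theta}$ as
$$\hat{\theta} = \arg\min_{i} \Lambda(x, i).$$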
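The score and the argmin rule can be illustrated with a small, self-contained Python sketch. The names `transitions`, `estimate`, `neg_log_likelihood`, and `classify`, as well as the two toy classes, are hypothetical and chosen only for this example; this is not the R script used for the project.

```python
import math
from collections import Counter

def transitions(text):
    """Count letter transitions k -> l in one string (nu_kl for a text x)."""
    return Counter(zip(text, text[1:]))

def estimate(texts):
    """Empirical transition probabilities Q_kl / Q_k for one class of texts."""
    q_kl = Counter()
    for text in texts:
        q_kl.update(transitions(text))
    q_k = Counter()
    for (k, _l), count in q_kl.items():
        q_k[k] += count
    return {(k, l): count / q_k[k] for (k, l), count in q_kl.items()}

def neg_log_likelihood(x, p):
    """Lambda(x, i): transitions with zero estimated probability are omitted."""
    # p holds only observed transitions, so "kl in p" means P(k, l) > 0.
    return -sum(count * math.log(p[kl])
                for kl, count in transitions(x).items()
                if kl in p)

def classify(x, models):
    """Return the class whose model gives the smallest Lambda (the argmin)."""
    return min(models, key=lambda name: neg_log_likelihood(x, models[name]))

# Two toy classes built from made-up strings.
models = {
    "ab": estimate(["ababab"]),
    "aab": estimate(["aabaab"]),
}
guess = classify("abab", models)
```

Because unseen transitions are skipped rather than penalized, a class that has observed few transitions can receive a deceptively small score, so a sketch like this is only meaningful when each class contains enough text to have seen nearly every transition.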
Khmelev applied this notion in [1] and [2] to guessing who
the author of a text is when that author is unknown, and
$\hat{\theta}$ would correspond to the "best guess" of the author of
a text $x$. His framework need not be restricted to that
application, though. The sets $C_i$ mentioned above could consist
of texts with any common theme, not just common authorship. In
this paper, I use gender as the distinguishing factor, and while
this technique is not as effective when applied to gender, it is
better than flipping a coin (at least when applied to my sample).
Sample and Application

I downloaded 120 texts from Project Gutenberg and divided
them into three groups. One group consists of
texts from female authors who did not use a pen name. Another
group consists of texts from male authors who did not use a pen
name. Finally, the test group consists of texts whose authors (who
are either male or female) either used a pen name or are
anonymous and still unknown. One can infer the texts and
authors used in this study from the provided R script.
Prior to analyzing the texts, some preprocessing takes place.
All punctuation is removed and all white space is replaced with
a single space character. Khmelev found in [1] that the method is
more accurate if words that contain a capital letter are removed
from the texts completely, so I do so. I also add single space
characters at the beginning and end of the string. The resulting
string is then used in the analysis.
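The preprocessing steps can be sketched as follows. This is a minimal Python illustration, not the R script used for the project, and it makes two assumptions that the text does not settle: runs of white space are collapsed into one space, and only the letters a to z survive punctuation removal (so digits and accented letters are dropped too). The function name `preprocess` is hypothetical.

```python
import re

def preprocess(raw):
    """Return the cleaned string used for counting letter transitions."""
    words = []
    for word in raw.split():  # split() breaks on any run of white space
        if any(c.isupper() for c in word):
            continue  # drop words that contain a capital letter
        word = re.sub(r"[^a-z]", "", word)  # assumption: keep only a-z
        if word:
            words.append(word)
    # Single space between words, plus one space at each end of the string.
    return " " + " ".join(words) + " "

cleaned = preprocess("It was  a dark,\nstormy night in London.")
```

Capitalized words are dropped before punctuation is stripped, so a token such as "London." is removed as a whole rather than reduced to "ondon".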
When applying the above method in practice to a particular
text $x$, the "male author" and the "female author" are assigned
a rank according to their respective $\Lambda(x, i)$. A smaller
$\Lambda$ corresponds to a lower rank, and the "author" with the
smallest rank is the one deemed to have written the text. I
created a matrix $R$ with rows corresponding to texts and columns
corresponding to "authors", with the entry $R_{ij}$ being the rank
of "author" $j$ regarding possible authorship of text $i$. In this
application, $R$ has only two columns, male and female. In
addition to making the results easy to find, the matrix $R$ allows
for other useful insights that expand the possible applications of
the Markov chain method.
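Building the rank matrix from the scores can be sketched as below. The Python function names and the score values are made up for illustration (they are not results from this study), and ties are broken by column order, which is an assumption of the sketch.

```python
def rank_row(scores):
    """Ranks for one text: rank 1 goes to the smallest Lambda."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks

def rank_matrix(score_rows):
    """R[i][j] is the rank of "author" j for text i."""
    return [rank_row(row) for row in score_rows]

columns = ["male", "female"]
# Hypothetical Lambda values: one row per text, one column per "author".
scores = [
    [10.2, 11.5],
    [8.7, 8.1],
]
R = rank_matrix(scores)
# The predicted "author" of each text is the column holding rank 1.
predicted = [columns[row.index(1)] for row in R]
```

With only two columns each row of R is either [1, 2] or [2, 1], but the same functions work unchanged for any number of "authors".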
Results
I list the results below in a table containing the title of the
text, the author (pen name first, with real name in parentheses),
the gender predicted by the Markov chain method, and the author's
true gender. For the authors of known gender, the method does a
decent job, correctly guessing gender about 70% of the time (7 of
the 10 texts). I do not believe this to be as effective as the
method when applied to authorship identification; Khmelev was able
to achieve about the same accuracy but with a wider array of
categories (think authors) in [3], which should allow for more
error. With that said, the method does a better job than basing a
guess on a coin flip.
| Text | Author | Predicted Gender | True Gender |
|---|---|---|---|
| Pride and Prejudice | Jane Austen | Female | Female |
| Little Women | Louisa May Alcott | Female | Female |
| The Adventures of Huckleberry Finn | Mark Twain (Samuel Langhorne Clemens) | Female | Male |
| Micromegas | Voltaire (François-Marie Arouet) | Male | Male |
| Heart of Darkness | Joseph Conrad (Józef Teodor Konrad Korzeniowski) | Male | Male |
| Wuthering Heights | Ellis Bell (Emily Brontë) | Female | Female |
| Agnes Grey | Acton Bell (Anne Brontë) | Male | Female |
| 1984 | George Orwell (Eric Blair) | Male | Male |
| Middlemarch | George Eliot (Mary Ann Evans) | Male | Female |
| Jane Eyre | Currer Bell (Charlotte Brontë) | Female | Female |
| The Romance of Lust, or Early Experiences | Anonymous | Female | Unknown (believed to be male) |
| Forbidden Fruit; Luscious and exciting story and More forbidden fruit or Master Percy's progress in and beyond the domestic circle | Anonymous | Female | Unknown |
| Laura Middleton | Anonymous | Male | Unknown |
| Beauty and the Beast | Anonymous | Female | Unknown |
As for the anonymous texts, the method guessed that all of the
authors were female, with the exception of the author of Laura
Middleton. No one knows the gender of these authors for certain.
However, The Romance of Lust is believed to have been written by
a male author (there are two individuals who are believed to be
the possible authors, William Simpson Potter and Edward Sellon
[5]).
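The accuracy figure for the authors of known gender can be checked directly from the ten corresponding rows of the results table. The short Python snippet below only restates those rows; the variable names are chosen for the example.

```python
# (title, predicted gender, true gender) for the ten texts of known gender.
known = [
    ("Pride and Prejudice", "Female", "Female"),
    ("Little Women", "Female", "Female"),
    ("The Adventures of Huckleberry Finn", "Female", "Male"),
    ("Micromegas", "Male", "Male"),
    ("Heart of Darkness", "Male", "Male"),
    ("Wuthering Heights", "Female", "Female"),
    ("Agnes Grey", "Male", "Female"),
    ("1984", "Male", "Male"),
    ("Middlemarch", "Male", "Female"),
    ("Jane Eyre", "Female", "Female"),
]

correct = sum(predicted == true for _title, predicted, true in known)
accuracy = correct / len(known)
missed = [title for title, predicted, true in known if predicted != true]
print(f"{correct} of {len(known)} correct ({accuracy:.0%})")  # 7 of 10 correct (70%)
```

The three misclassified texts are The Adventures of Huckleberry Finn, Agnes Grey, and Middlemarch.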
Discussion and Conclusion

Determination of author gender could be used in practical
applications. Web sites, for example, try to determine as much as
they can about an individual so they can tailor their services to
meet an individual's expected preferences better. Often a service
can predict an individual's gender best with their name, but if an
individual is anonymous, it may be possible to determine the
individual's gender from the text they generate. The Markov chain
method described here would be one way to do so.
The framework for this method is general enough to be used
in numerous applications. Attribution of authorship is only one.
Here I applied it to author gender attribution, but there are
possibilities beyond that.
Another obvious potential application is determining the
genre of a text. This method could be used to determine if a text
fits into a particular literary genre, such as science fiction,
fantasy, mystery, etc. It could also be used to determine if a text
is a work of fiction (like a novel), an essay, a report, and so on.
This would be helpful for services that process numerous
documents and would like to classify them if no classification has
been given otherwise.
Khmelev said in [1] that the matrix of ranks $R$ is useful not
only for determining authorship of a text but also for determining
which authors are similar. He noticed that authors correlate in
their rankings. We could deem authors that correlate in rank to be
"similar" in some way. This idea could be used by social media
websites, where users create lots of text, to suggest other
interests or tailor advertising to users based on what "similar"
users prefer.
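One way to make "correlate in rank" concrete is to correlate two columns of the rank matrix. The Python sketch below uses the Pearson correlation of the ranks (which is the Spearman rank correlation when there are no ties); this choice of measure is an assumption of the sketch, not necessarily the one Khmelev used, and the matrix shown is made up for illustration.

```python
import math

def column(R, j):
    """Ranks that "author" j received across all texts."""
    return [row[j] for row in R]

def rank_correlation(xs, ys):
    """Pearson correlation of two rank columns; NaN if a column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)

# Hypothetical rank matrix: four texts (rows), three "authors" (columns).
R = [
    [1, 2, 3],
    [2, 1, 3],
    [3, 2, 1],
    [1, 2, 3],
]
similarity_0_1 = rank_correlation(column(R, 0), column(R, 1))
```

With only two columns, as in the gender application, the two rank columns always sum to 3 in every row, so their correlation is -1 and carries no information; the idea only becomes useful with three or more categories.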
One should keep in mind, though, that while the Markov
method is very useful for determining authorship of a text or even
possibly for saying that authors are similar, it does not say why
that is beyond an author transitioning from one letter to another
more frequently than some other author (and this is true even
when texts are not segregated based on authorship). This leaves
open a large realm of possible reasons for the method to work.
For example, certain letter transitions may be more likely in one
genre of writing than another, and if women tend to write more in
one genre and men tend to write more in another, a man who
writes in a female-dominated genre may be predicted by the
method to be female. Then the method described above is not
actually predicting gender but rather whether an author writes in a
genre dominated by a particular sex. This may partly explain the
patterns we see in gender identification.
This potential problem holds for determining particular
authors as well. The Time Machine might be attributed to H.G.
Wells rather than Leo Tolstoy because the former is an authority
figure in the science fiction genre, unlike the latter. If we
unearthed some unknown work of science fiction by Tolstoy,
would the method attribute the text to Tolstoy or to H.G. Wells?
This is an important issue that one must bear in mind when
seeking to expand the Markov chain method.
With that said, the Markov chain method is a surprisingly
good method for determining traits of a text. I used
identification of author gender as one expansion of the method,
but this is hardly the only direction the method could take. There
could be numerous uses for the method beyond what Markov or
others have envisioned, with worthwhile real-world applications.
They are worth investigating.
References
[1] Khmelev, D.V. Disputed Authorship Resolution through Using
Relative Empirical Entropy for Markov Chains of Letters in
Human Language Texts. 2000. Journal of Quantitative
Linguistics, v. 7, no. 3, pp. 201-207.
[2] Khmelev, D.V. & Tweedie, F.J. Using Markov Chains for
Identification of Writers. 2001. Literary and Linguistic
Computing, v. 16, no. 3, pp. 299-307.
[3] Markov, A.A. On some applications of statistical method. 1916.
Izvestia Akademii Nauk, v. 10, no. 4, p. 239.
[4] Wikipedia. J. K. Rowling. Retrieved April 18th, 2015 from
http://en.wikipedia.org/wiki/J._K._Rowling.
[5] Wikipedia. The Romance of Lust. Retrieved April 18th, 2015
from http://en.wikipedia.org/wiki/The_Romance_of_Lust.