Identifying Gender of Authors: An Application of Markov Chains To Textual Analysis


Curtis Miller
Spring 2015

MATH 5050 PROJECT

Introduction
Authors frequently use pen names in place of their own for
their work. Reasons range from preserving anonymity to
marketing purposes. Female authors often used pen names in
order to hide their gender. This was very common in the 18th
century but still occurs today. Joanne Rowling, for example, used
the names J. K. Rowling and Robert Galbraith to hide her gender
[4].
In this paper, I show that Markov chains can be used to
identify author gender. I explain the method for doing so and
provide an example using 120 texts provided by the Gutenberg
project. I then finish with a discussion and suggestions of further
applications.

Method and Mathematical Background


Markov originally conceived of Markov chains with the
textual analysis application in mind [3]. Even though human
languages are obviously not Markov chains, Dmitri Khmelev et al.
showed they can be used to predict authorship with surprisingly
good accuracy [1, 2]. The method in this paper is based on
Khmelev's method, described in [1] and [2].
Khmelev provided a formalization of his procedure in [1], and
I describe it below. Begin with an alphabet set $A$, which usually
contains lower-case letters and a single whitespace character, the
space character. $A^k$ is the set of all words of length $k > 0$, and
$A^* = \bigcup_{k>0} A^k$; in other words, $A^*$ is the set of strings
based on an alphabet $A$. $f \in A^*$ is one such string, and $|f|$ is
its length. There are $n$ sets $C_i$, and $f_{i,j} \in A^*$, with
$1 \le j \le m_i$, is one of the $m_i$ strings in $C_i$, so
$f_{i,j} \in C_i$.
Every string $f_{i,j}$ is thought to be generated by a Markov
chain with transition matrix $\Pi^i$. We do not know $\Pi^i$, but we
can estimate it with a transition matrix $P^i$. Suppose $k, l \in A$
are letters in alphabet $A$. Denote by $Q^{i,j}_{kl}$ the number of
letter transitions $k \to l$ in $f_{i,j}$. In addition,

$$Q^i_{kl} = \sum_{j=1}^{m_i} Q^{i,j}_{kl}$$

(this is the frequency for a letter transition for author $i$) and

$$Q^i_k = \sum_{l \in A} Q^i_{kl}$$

(this is the frequency a letter $k$ is used). Then the entry of $P^i$
corresponding to a transition $k \to l$ (which I may refer to as
$P^i(k,l)$ or $P^i_{kl}$) will be

$$P^i(k,l) = \frac{Q^i_{kl}}{Q^i_k}$$

This is the empirical transition matrix.
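The estimate $P^i$ can be sketched in a few lines of Python. This is a minimal illustration, not the R script used for the project: the function names `transition_counts` and `empirical_matrix` are hypothetical, and storing the matrix sparsely as a dictionary keyed by letter pairs is an assumption made for brevity.

```python
from collections import Counter


def transition_counts(texts):
    """Q_kl: number of k -> l letter transitions, summed over all strings."""
    counts = Counter()
    for f in texts:
        # zip(f, f[1:]) yields every pair of adjacent characters in f.
        counts.update(zip(f, f[1:]))
    return counts


def empirical_matrix(texts):
    """P(k, l) = Q_kl / Q_k, stored sparsely (assumption: dict of pairs).

    Only pairs that actually occur (Q_kl > 0) appear as keys.
    """
    q_kl = transition_counts(texts)
    q_k = Counter()
    for (k, _l), n in q_kl.items():
        q_k[k] += n  # Q_k is the row sum of Q_kl over l
    return {(k, l): n / q_k[k] for (k, l), n in q_kl.items()}


# Toy input (not data from the study): "abab" and "abb" give
# Q_ab = 3, Q_ba = 1, Q_bb = 1, hence Q_a = 3 and Q_b = 2.
P = empirical_matrix(["abab", "abb"])
```

Each row of the result sums to one over the letters that were observed to follow $k$, which is what makes it usable as a transition matrix.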
Suppose we knew that a string $x \in A^*$ was generated by
some $\Pi^\theta$, where $1 \le \theta \le n$ but $\theta$ is unknown.
Let $\nu_{kl}$ denote the number of transitions in $x$ for the letters
$k \to l$. For every $i$ with $1 \le i \le n$, we would use $P^i$ as
the estimate for $\Pi^i$. We would then represent the probability of
seeing the string $x$ if $\theta = i$ as

$$\prod_{k,l \in A} \left( P^i_{kl} \right)^{\nu_{kl}}$$

Notice that problems would arise where $P^i_{kl} = 0$; if just
$P^i_{kl} = 0$, the whole probability is zero, and if both
$P^i_{kl} = 0$ and $\nu_{kl} = 0$, we have a number that is undefined
($0^0$). Rather than let these zero transitions spoil our estimators,
we will omit them instead and consider the number:

$$\prod_{k,l \,:\, P^i(k,l) > 0} \left( P^i_{kl} \right)^{\nu_{kl}}$$

If we let $\hat{\theta}$ be the $i$ that maximizes this number,
$\hat{\theta}$ would be a maximum likelihood estimator for $\theta$.
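Rather than use the probability directly, though, we will use
the negative of the natural logarithm of this number. Let:

$$\Lambda(x, i) = -\sum_{k,l \,:\, P^i(k,l) > 0} \nu_{kl} \log\left( P^i_{kl} \right)$$

(log is the natural logarithm). Because $\Lambda$ is a negative
log-likelihood, maximizing the probability is the same as minimizing
$\Lambda$, so we could write the maximum likelihood estimator
$\hat{\theta}$ as:

$$\hat{\theta} = \operatorname*{argmin}_{i} \Lambda(x, i)$$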
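The statistic and the resulting guess can be sketched as follows. This is an illustrative Python version under stated assumptions, not the project's R code: the names `neg_log_likelihood` and `best_guess` are hypothetical, each transition matrix is assumed to be a dictionary mapping letter pairs to probabilities, and the two toy matrices are made up.

```python
import math
from collections import Counter


def neg_log_likelihood(x, P):
    """Lambda(x, i): minus the sum of nu_kl * log P(k, l).

    Transitions with P(k, l) == 0 (or missing from P) are omitted,
    as described in the text.
    """
    nu = Counter(zip(x, x[1:]))  # nu_kl: transition counts in x
    total = 0.0
    for pair, count in nu.items():
        p = P.get(pair, 0.0)
        if p > 0:
            total -= count * math.log(p)
    return total


def best_guess(x, matrices):
    """Return the label i whose matrix minimizes Lambda(x, i)."""
    return min(matrices, key=lambda i: neg_log_likelihood(x, matrices[i]))


# Made-up matrices for two hypothetical classes.
matrices = {
    "class_1": {("a", "b"): 0.9, ("a", "a"): 0.1,
                ("b", "a"): 0.5, ("b", "b"): 0.5},
    "class_2": {("a", "b"): 0.2, ("a", "a"): 0.8,
                ("b", "a"): 0.5, ("b", "b"): 0.5},
}
guess = best_guess("abab", matrices)  # "abab" favors frequent a -> b
```

One consequence of omitting zero transitions is visible in the code: a transition a class has never produced contributes nothing to its score instead of a penalty.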


Khmelev applied this notion in [1] and [2] to guessing who
the author of a text is when that author is unknown, and
$\hat{\theta}$ would correspond to the "best guess" of the author of a
text $x$.
His framework need not be restricted to that application, though.
The sets $C_i$ mentioned above could consist of texts with any
common theme, not just common authorship. In this paper, I use
gender as the distinguishing factor, and while this technique is
not as effective when applied to gender, it is better than flipping a
coin (at least when applied to my sample).

Sample and Application


I downloaded 120 texts provided by the Gutenberg project
and divided the texts into three groups. One group consists of
texts from female authors who did not use a pen name. Another
group consists of texts from male authors who did not use a pen
name. Finally, the test group consists of texts whose authors (who
are either male or female) either used a pen name or are
anonymous and still unknown. One can infer the texts and
authors used in this study from the provided R script.
Prior to analyzing the texts, some preprocessing takes place.
All punctuation is removed and all white space is replaced with
a single space character. Khmelev found in [1] that the method is
more accurate if words that contain a capital letter are removed
from the texts completely, so I do so. I also add single space
characters at the beginning and end of the string. The resulting
string is then used in the analysis.
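The preprocessing steps above might look like the following in Python. This is a sketch, not the project's R script: the function name `preprocess` is hypothetical, and treating every character other than a lower-case ASCII letter as punctuation to be deleted is an assumption (the text does not say how digits or accented letters are handled).

```python
import re


def preprocess(text):
    """Normalize a raw text into the string used for transition counts."""
    # Split on any white space, then drop words containing a capital letter.
    tokens = [t for t in text.split() if not any(c.isupper() for c in t)]
    # Assumption: delete everything except lower-case ASCII letters.
    cleaned = [re.sub(r"[^a-z]", "", t) for t in tokens]
    words = [w for w in cleaned if w]  # discard tokens that became empty
    # Single spaces between words, plus one at each end of the string.
    return " " + " ".join(words) + " "


sample = preprocess("The quick, brown fox met Alice.\nshe said: hello!")
```

Capitalized words are removed before punctuation is stripped, so a word such as "Alice." is dropped as a whole rather than reduced to lower-case fragments.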
When applying the above method in practice to a particular
text $x$, the "male author" and the "female author" are assigned
a rank according to their respective $\Lambda(x, i)$. A smaller lambda
corresponds to a lower rank, and the "author" with the smallest
rank is the one deemed to have written the text. I created a
matrix $R$ with rows corresponding to texts and columns
corresponding to "authors" and an entry $R_{ij}$ being the rank of
"author" $j$ regarding possible authorship of text $i$. In this
application, $R$ has only two columns, male and female. In
addition to making the results easy to find, the matrix $R$ allows
for other useful insights that expand the possible applications of
the Markov chain method.
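Building $R$ from a table of $\Lambda$ values can be sketched as below. The function name `rank_matrix` is hypothetical, the input values are made up, and starting ranks at 1 is an assumption (the text only says that a smaller lambda receives a lower rank).

```python
def rank_matrix(lambdas):
    """lambdas[t][a] is Lambda for text t and "author" a.

    Returns R with R[t][a] = rank of "author" a for text t,
    where rank 1 (assumption) goes to the smallest Lambda.
    """
    ranks = []
    for row in lambdas:
        # Column indices ordered from smallest to largest Lambda.
        order = sorted(range(len(row)), key=lambda a: row[a])
        r = [0] * len(row)
        for rank, a in enumerate(order, start=1):
            r[a] = rank
        ranks.append(r)
    return ranks


# Made-up values: columns are ("male", "female"), rows are two texts.
R = rank_matrix([[3.2, 1.5],
                 [0.7, 2.0]])
predicted = ["male" if row[0] == 1 else "female" for row in R]
```

With two columns the matrix only restates which lambda is smaller, but the same function works unchanged when there are many "authors".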

Results
I list the results below in a table containing the title of the
text, the author (pen name first, with the real name in
parentheses), the gender predicted by the Markov chain method, and
the author's true gender. For the authors of known gender, the
method does a decent job, correctly guessing gender about 70% of
the time. I do not believe this to be as effective as the method
when applied to authorship identification; Khmelev was able to
achieve about the same accuracy but with a wider array of
categories (think authors) in [3], which should allow for more
error. With that said, the method does a better job than basing a
guess on a coin flip.
Text | Author | Predicted Gender | True Gender
-----|--------|------------------|------------
Pride and Prejudice | Jane Austen | Female | Female
Little Women | Louisa May Alcott | Female | Female
The Adventures of Huckleberry Finn | Mark Twain (Samuel Langhorne Clemens) | Female | Male
Micromegas | Voltaire (François-Marie Arouet) | Male | Male
Heart of Darkness | Joseph Conrad (Józef Teodor Konrad Korzeniowski) | Male | Male
Wuthering Heights | Ellis Bell (Emily Brontë) | Female | Female
Agnes Grey | Acton Bell (Anne Brontë) | Male | Female
1984 | George Orwell (Eric Blair) | Male | Male
Middlemarch | George Eliot (Mary Ann Evans) | Male | Female
Jane Eyre | Currer Bell (Charlotte Brontë) | Female | Female
The Romance of Lust, or Early Experiences | Anonymous | Female | Unknown (believed to be male)
Forbidden Fruit; Luscious and exciting story and More forbidden fruit or Master Percy's progress in and beyond the domestic circle | Anonymous | Female | Unknown
Laura Middleton | Anonymous | Male | Unknown
Beauty and the Beast | Anonymous | Female | Unknown

As for the anonymous texts, the method guessed that all
authors were female, with the exception of the author of Laura
Middleton. No one knows the gender of these authors for certain.
However, The Romance of Lust is believed to have been written by a
male author (there are two individuals who are believed to be the
possible authors, William Simpson Potter and Edward Sellon [5]).
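The accuracy figure quoted for authors of known gender can be checked against the ten table rows that have a known true gender. The snippet below is only a restatement of those rows in Python; the variable names are hypothetical.

```python
# (title, predicted gender, true gender) for the ten texts whose
# author's gender is known, copied from the results table.
known = [
    ("Pride and Prejudice", "Female", "Female"),
    ("Little Women", "Female", "Female"),
    ("The Adventures of Huckleberry Finn", "Female", "Male"),
    ("Micromegas", "Male", "Male"),
    ("Heart of Darkness", "Male", "Male"),
    ("Wuthering Heights", "Female", "Female"),
    ("Agnes Grey", "Male", "Female"),
    ("1984", "Male", "Male"),
    ("Middlemarch", "Male", "Female"),
    ("Jane Eyre", "Female", "Female"),
]

correct = sum(predicted == true for _title, predicted, true in known)
accuracy = correct / len(known)
misses = [title for title, predicted, true in known if predicted != true]
```

Seven of the ten predictions match, and two of the three misses are female authors predicted to be male.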

Discussion and Conclusion


Determination of author gender could be used in practical
applications. Web sites, for example, try to determine as much as
they can about an individual so they can tailor their services to
meet an individual's expected preferences better. Often a service
can predict an individual's gender best with their name, but if an
individual is anonymous, it may be possible to determine the
individual's gender from the text they generate. The Markov chain
method described here would be one way to do so.
The framework for this method is general enough to be used
in numerous applications. Attribution of authorship is only one.
Here I applied it to author gender attribution, but there are
possibilities beyond that.
Another obvious potential application is determining the
genre of a text. This method could be used to determine if a text
fits into a particular literary genre, such as science fiction,
fantasy, mystery, etc. It could also be used to determine if a text
is a work of fiction (like a novel), an essay, a report, and so on.
This would be helpful for services that process numerous
documents and would like to classify them if no classification has
been given otherwise.
Khmelev said in [1] that the matrix of ranks $R$ is useful not
only for determining authorship of a text but also determining
which authors are similar. He noticed that authors correlate in
their rankings. We could deem authors that correlate in rank to be
"similar" in some way. This idea could be used by social media
websites, where users create lots of text, to suggest other
interests or tailor advertising to users based on what "similar"
users prefer.
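One possible reading of "correlating in rank" is to compare the columns of the rank matrix pairwise. The sketch below uses a plain Pearson correlation between columns; this choice is an assumption and Khmelev's exact procedure is not reproduced here. The function names are hypothetical and the rank matrix is made up.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def column_correlations(R):
    """Correlation between every pair of "author" columns of R."""
    cols = list(zip(*R))  # transpose: one tuple of ranks per author
    return {
        (a, b): pearson(cols[a], cols[b])
        for a in range(len(cols))
        for b in range(a + 1, len(cols))
    }


# Made-up ranks: 4 texts (rows) by 4 hypothetical authors (columns).
R = [[1, 2, 3, 4],
     [2, 1, 4, 3],
     [3, 4, 1, 2],
     [4, 3, 2, 1]]
similarity = column_correlations(R)
```

In this toy matrix the first two authors tend to be ranked alike across texts, while the first and last are ranked in exactly opposite order. With only the two columns used in this paper the comparison is uninformative, since the two ranks always mirror each other.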
One should keep in mind, though, that while the Markov
method is very useful for determining authorship of a text or even
possibly for saying that authors are similar, it does not say why
that is beyond an author transitioning from one letter to another
more frequently than some other author (and this is true even
when texts are not segregated based on authorship). This leaves
open a large realm of possible reasons for the method to work.
For example, certain letter transitions may be more likely in one
genre of writing than another, and if women tend to write more in
one genre and men tend to write more in another, a man who
writes in a female-dominated genre may be predicted by the
method to be female. Then the method described above is not
actually predicting gender but rather whether an author writes in a
genre dominated by a particular sex. This may partly explain the
patterns we see in gender identification.
This potential problem holds for determining particular
authors as well. The Time Machine might be attributed to H.G.
Wells rather than Leo Tolstoy because the former is an authority
figure in the science fiction genre, unlike the latter. If we
unearthed some unknown work of science fiction by Tolstoy,
would the method attribute the text to Tolstoy or to H.G. Wells?
This is an important issue that one must bear in mind when
seeking to expand the Markov chain method.


With that said, the Markov chain method is a surprisingly
good method for determining traits about a text. I used
identification of author gender as one expansion of the method,
but this is hardly the only direction the method could take. There
could be numerous uses for the method beyond what Markov or
others have envisioned, with worthwhile real-world applications.
They are worth investigating.

References
[1] Khmelev, D.V. Disputed Authorship Resolution through Using
Relative Empirical Entropy for Markov Chains of Letters in
Human Language Texts. 2000. Journal of Quantitative
Linguistics, v. 7, no. 3, pp. 201-207.
[2] Khmelev, D.V. & Tweedie, F.J. Using Markov Chains for
Identification of Writers. 2001. Literary and Linguistic
Computing, v. 16, no. 3, pp. 299-307.
[3] Markov, A.A. On some applications of statistical method. 1916.
Izvestia Akademii Nauk, v. 10, no. 4, p. 239.
[4] Wikipedia. J. K. Rowling. Retrieved April 18th, 2015 from
http://en.wikipedia.org/wiki/J._K._Rowling.
[5] Wikipedia. The Romance of Lust. Retrieved April 18th, 2015
from http://en.wikipedia.org/wiki/The_Romance_of_Lust.
