
Data Mining Final Project

Nick Foti
Eric Kee

Topic: Author Identification


Author Identification
Given writing samples, can we determine who wrote them?
This is a well-studied field
See also: stylometry

Author identification has been applied to works such as
The Bible
Shakespeare
Modern texts

Corpus Design
A corpus is a body of text used for linguistic analysis

Used Project Gutenberg to create the corpus

The corpus was designed as follows
Four authors of varying similarity
Anne Brontë
Charlotte Brontë
Charles Dickens
Upton Sinclair

Multiple books per author

Corpus size: 90,000 lines of text

Dataset Design
Extracted features common in the literature
Word length
Frequency of glue words
See Appendix A and [1,2] for the list of glue words

Note: the corpus was processed using C#, Matlab, and Python

Data set parameters
Number of dimensions: 309
Word length and 308 glue-word frequencies
Number of observations: 3,000
Each observation: 30 lines of text from a book
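
A minimal R sketch of this feature extraction; the file name, the featurize() helper, and the four-word subset of the glue list are illustrative, not the project's actual code:

# Illustrative subset of the 308 glue words listed in Appendix A
glue_words <- c("the", "and", "of", "to")

# Build one observation (word length + glue-word frequencies) from 30 lines
featurize <- function(lines, glue_words) {
  words <- tolower(unlist(strsplit(paste(lines, collapse = " "), "[^A-Za-z']+")))
  words <- words[nchar(words) > 0]
  c(word_length = mean(nchar(words)),                  # average word length
    sapply(glue_words, function(w) mean(words == w)))  # relative frequencies
}

txt <- readLines("agnes_grey.txt")   # hypothetical Project Gutenberg file
obs <- featurize(txt[1:30], glue_words)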

Classifier Testing and Analysis

Tested classifiers on held-out data
Split the corpus into training and testing sets
70% for training, 30% for testing
Used cross-validation

Analyzed classifier performance
Used ROC plots
Used confusion matrices

Used a common plotting scheme (example at right)
Red dots indicate true-positive cases

[Example bar plot: Anne B. TP 78%, Anne B. FP 22%; Charlotte B. TP 55%, Charlotte B. FP 45%]
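
A sketch of the 70/30 split and the cross-validation folds, assuming a data frame dataset with one row per observation (both the dataset name and the fold count are illustrative):

set.seed(1)                                  # reproducible split
n         <- nrow(dataset)
train_idx <- sample(n, size = round(0.7 * n))
train     <- dataset[train_idx, ]            # 70% for training
test      <- dataset[-train_idx, ]           # 30% for testing

# Fold assignments for k-fold cross-validation (k = 10 here)
folds <- sample(rep(1:10, length.out = n))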

Binary Classification

Word Length Classification
Calculated the average word length for each observation
Computed a Gaussian kernel density from the word-length samples
Used an ROC curve to calculate the cutoff
Optimized sensitivity and specificity with equal importance (sketch below)
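
A sketch of the cutoff selection, assuming vectors len_a and len_b of per-observation average word lengths for the two authors, and assuming author A tends toward longer words; weighting sensitivity and specificity equally amounts to maximizing their sum along the ROC curve:

dens_a <- density(len_a, kernel = "gaussian")   # the plotted kernel densities
dens_b <- density(len_b, kernel = "gaussian")

cutoffs <- seq(min(len_a, len_b), max(len_a, len_b), length.out = 500)
sens <- sapply(cutoffs, function(t) mean(len_a > t))    # author A above the cutoff
spec <- sapply(cutoffs, function(t) mean(len_b <= t))   # author B below the cutoff
cutoff <- cutoffs[which.max(sens + spec)]               # equal importance to both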

Word Length: Anne B. vs. Upton S.

[Word-length density plot; bar plot: Anne B. TP 100%, Anne B. FP 0%; Upton Sinclair TP 100%, Upton Sinclair FP 0%]

Word Length: Brontë vs. Brontë

[Word-length densities for Anne Brontë and Charlotte Brontë; bar plot: Anne B. TP 100%, Anne B. FP 0%; Charlotte B. TP 78.1%, Charlotte B. FP 21.9%]

Principal Component Analysis
Used PCA to find a better axis
Notice: the PCA distribution is similar to the word-length distribution (Anne Brontë vs. Upton Sinclair)
Is word length the only useful dimension?

[Plots: word-length density vs. PCA density]

Principal Component Analysis (Without Word Length)
It appears that word length is the most useful axis (Anne Brontë vs. Upton Sinclair)
We'll come back to this

[Plot: PCA density, word length excluded]
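
A sketch of the two PCA runs, assuming a numeric matrix features whose first column is word length (the matrix name and column position are illustrative):

pca_all  <- prcomp(features, scale. = TRUE)        # all 309 dimensions
pca_glue <- prcomp(features[, -1], scale. = TRUE)  # glue words only, word length dropped

# Density of the first principal component, to compare against word length
plot(density(pca_all$x[, 1]), main = "PCA Density")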

K-Means
Used K-means to find dominant patterns
Unnormalized
Normalized

Trained K-means on the training set

To classify observations in the test set
Calculate the distance from the observation to each class mean
Assign the observation to the closest class

Performed cross-validation to estimate performance (sketch below)
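
A sketch of the nearest-mean classification step, assuming feature matrices train_x and test_x (both names illustrative); the normalized variant would apply scale() to the features first, and matching clusters to authors by majority training label is left out:

km <- kmeans(train_x, centers = 2)   # one center per author in the 2-class case

# Assign each test observation to the closest class mean
pred <- apply(test_x, 1, function(x) {
  which.min(colSums((t(km$centers) - x)^2))   # squared Euclidean distance
})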

Unnormalized K-means
Anne Brontë vs. Upton Sinclair

[Bar plot: Anne B. TP 98.1%, Anne B. FP 1.9%; Upton Sinclair TP 92.1%, Upton Sinclair FP 7.9%]

Unnormalized K-means
Anne Brontë vs. Charlotte Brontë

[Bar plot: Anne B. TP 95.7%, Anne B. FP 4.3%; Charlotte B. TP 74.7%, Charlotte B. FP 25.3%]

Normalized K-means
Anne Brontë vs. Upton Sinclair

[Bar plot: Anne B. TP 53.3%, Anne B. FP 46.7%; Upton Sinclair TP 50.6%, Upton Sinclair FP 49.4%]

Normalized K-means
Anne Brontë vs. Charlotte Brontë

[Bar plot: Anne B. TP 86.7%, Anne B. FP 13.3%; Charlotte B. TP 84.2%, Charlotte B. FP 15.8%]

Discriminant Analysis
Performed discriminant analysis
Computed with equal covariance matrices
Used the average covariance matrix (Ω) of each class pair

Computed with unequal covariance matrices
Quadratic discrimination fails because the covariance matrices have determinant 0 (see Equation 1 below)

Computed the theoretical misclassification probability

To perform quadratic discriminant analysis
Compute Equation 1 for each class
Choose the class with the minimum value

d_i(x) = \frac{1}{2}\ln\lvert\Sigma_i\rvert + \frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)    (1)
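
A sketch using the MASS package, assuming a training data frame with an author factor; lda() corresponds to the equal-covariance case, while qda() on these features stops with a rank-deficiency error because some class covariance matrix is singular (|Σ_i| = 0 in Equation 1):

library(MASS)

fit  <- lda(author ~ ., data = train)      # equal (pooled) covariance matrices
pred <- predict(fit, test)$class

## Unequal covariance matrices: fails on this data because |Sigma_i| = 0
# fit_q <- qda(author ~ ., data = train)   # stops with a rank-deficiency error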

Discriminant Analysis
Anne Brontë vs. Upton Sinclair
Empirical P(err) = 0.116
Theoretical P(err) = 0.149

[Bar plot: Anne B. TP 96.2%, Anne B. FP 3.8%; Upton Sinclair TP 92.2%, Upton Sinclair FP 7.8%]

Discriminant Analysis
Anne Brontë vs. Charlotte Brontë
Empirical P(err) = 0.152
Theoretical P(err) = 0.181

[Bar plot: Anne B. TP 92.7%, Anne B. FP 7.3%; Charlotte B. TP 89.2%, Charlotte B. FP 10.8%]

Logistic Regression
Fit a linear model to the training data on all dimensions
Threw out singular dimensions
Left with 298 coefficients + intercept

Projected the training data onto the synthetic variable
Found the threshold minimizing the misclassification error

Projected the testing data onto the synthetic variable
Used the threshold to classify points (sketch below)
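
A sketch of this procedure with glm(), assuming a 0/1 label is_anne (an illustrative name) in the training and test data frames; glm() marks singular dimensions with NA coefficients, and the linear predictor plays the role of the synthetic variable:

fit <- glm(is_anne ~ ., data = train, family = binomial)

# Project training data onto the synthetic variable (linear predictor)
eta_train <- predict(fit, newdata = train)

# Threshold that minimizes training misclassification error
ths <- sort(unique(eta_train))
err <- sapply(ths, function(t) mean((eta_train > t) != train$is_anne))
cut <- ths[which.min(err)]

# Project testing data and classify
pred <- as.integer(predict(fit, newdata = test) > cut)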

Logistic Regression
Anne Brontë vs. Charlotte Brontë

[Bar plot: Anne B. TP 92%, Anne B. FP 8%; Charlotte B. TP 89.5%, Charlotte B. FP 10.5%]

Logistic Regression
Anne Brontë vs. Upton Sinclair

[Bar plot: Anne B. TP 98%, Anne B. FP 2%; Upton S. TP 99%, Upton S. FP 2%]

4-Class Classification

4-Class K-means
Used K-means to find patterns among all classes
Unnormalized
Normalized

Trained using a training set
Tested performance as in the 2-class K-means
Performed cross-validation to estimate performance

Unnormalized K-Means
4-Class Confusion Matrix

[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 88%, 87%, 59%, 54%, 34%, 22%]

Normalized K-Means
4-Class Confusion Matrix

[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 70%, 67%, 67%, 67%, 27%, 26%, 20%]

Additional K-means testing

Also tested K-means without word length
Recall that we had perfect classification with 1-D word length (see the plot below)
Is K-means using only one dimension to classify?

[Word-length density plot: Anne Brontë vs. Upton Sinclair]
Note: perfect classification only occurs between Anne B. and Sinclair

Unnormalized K-Means (No Word Length)
K-means can classify without word length

4-Class Confusion Matrix
[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 72%, 44%, 43%, 35%, 35%, 33%, 29%]

Multinomial Regression
Multinomial distribution
Extension of the binomial distribution
The random variable is allowed to take on n values

Used multinom() to fit a log-linear model for training
Used 248 dimensions (the maximum limit on our computer)
Returns 3 coefficients per dimension and 3 intercepts

Found the probability that each observation belongs to each class

Multinomial Regression
The multinomial logit function is

Pr(y_i = j) = \frac{\exp(\beta_j^T x_i + c_j)}{\sum_k \exp(\beta_k^T x_i + c_k)}

where the \beta_j are the coefficients and the c_j are the intercepts

To classify
Compute the probabilities Pr(y_i = Dickens), Pr(y_i = Anne B.), ...
Choose the class with the maximum probability
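
A sketch of the fit with nnet::multinom(), assuming an author factor with four levels (data frame names are illustrative); with four classes, multinom() estimates three coefficient vectors and three intercepts, one per non-baseline class, as on the slide:

library(nnet)

fit   <- multinom(author ~ ., data = train, MaxNWts = 5000)  # raise the weight limit
probs <- predict(fit, newdata = test, type = "probs")        # Pr(y_i = class) matrix
pred  <- colnames(probs)[max.col(probs)]                     # maximum probability wins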

Multinomial Regression
4-Class Confusion Matrix

[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 86%, 78%, 83%, 93%]

Multinomial Regression (Without Word Length)
Multinomial regression does not require word length

4-Class Confusion Matrix
[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 79%, 76%, 79%, 91%]

Appendix A: Glue Words


I a aboard about above across after again against ago ahead all almost along alongside already
also although always am amid amidst among amongst an and another any anybody anyone
anything anywhere apart are aren't around as aside at away back backward backwards be because
been before beforehand behind being below between beyond both but by can can't cannot could
couldn't dare daren't despite did didn't directly do does doesn't doing don't done down during
each either else elsewhere enough even ever evermore every everybody everyone everything
everywhere except fairly farther few fewer for forever forward from further furthermore had
hadn't half hardly has hasn't have haven't having he hence her here hers herself him himself his
how however if in indeed inner inside instead into is isn't it its itself just keep kept later least less
lest like likewise little low lower many may mayn't me might mightn't mine minus more
moreover most much must mustn't my myself near need needn't neither never nevertheless next
no no-one nobody none nor not nothing notwithstanding now nowhere of off often on once one
ones only onto opposite or other others otherwise ought oughtn't our ours ourselves out outside
over own past per perhaps please plus provided quite rather really round same self selves several
shall shan't she should shouldn't since so some somebody someday someone something
sometimes somewhat still such than that the their theirs them themselves then there therefore
these they thing things this those though through throughout thus till to together too towards
under underneath undoing unless unlike until up upon upwards us versus very via was wasn't
way we well were weren't what whatever when whence whenever where whereas whereby
wherein wherever whether which whichever while whilst whither who whoever whom with
whose within why without will won't would wouldn't yet you your yours yourself yourselves

Conclusions
Authors can be identified by their word-usage frequencies
Word length may be used to distinguish between the Brontë sisters
Word length does not, however, extend to all authors (see Appendix C)

The glue words describe genuine differences between all four authors
K-means distinguishes the same patterns that multinomial regression classifies
This indicates that supervised training finds legitimate patterns, rather than artifacts

The Brontë sisters are the most similar authors
Upton Sinclair is the most different author

Appendix B: Code
See Attached .R files

Appendix C: Single-Dimension 4-Author Classification
Classification using multinomial regression

4-Class Confusion Matrix
[Bar-plot confusion matrix with one TP/FP panel per author (Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair); visible rates: 96%, 94%, 54%, 46%, 22%, 11%, 6%, 3%]

References
[1] Argamon, Saric, and Stein, "Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results," SIGKDD 2003.
[2] Mitton, "Spelling checkers, spelling correctors and the misspellings of poor spellers," Information Processing and Management, 1987.
