Data Mining Final Project: Nick Foti Eric Kee
Nick Foti
Eric Kee
Corpus Design
A corpus is:
A body of text used for linguistic analysis
Anne Brontë
Charlotte Brontë
Charles Dickens
Upton Sinclair
Dataset Design
Extracted features common in literature
Word Length
Frequency of glue words
See Appendix A and [1,2] for list of glue words
Used cross-validation
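The two feature families listed above can be sketched as follows. This is a minimal illustration, not the project's code (which is in the attached .R files): the glue-word list, the `MAX_WORD_LEN` cap, and the function name `extract_features` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical glue-word list for illustration only; the project's actual
# list is given in its Appendix A and is not reproduced here.
GLUE_WORDS = ("the", "and", "of", "to", "but", "that")
MAX_WORD_LEN = 10  # assumed cap; longer words share the last bin


def extract_features(text):
    """Return word-length proportions followed by glue-word frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return [0.0] * (MAX_WORD_LEN + len(GLUE_WORDS))

    # Word-length histogram: position i holds the share of words of length i + 1.
    length_counts = Counter(min(len(w), MAX_WORD_LEN) for w in words)
    length_features = [length_counts[n] / total for n in range(1, MAX_WORD_LEN + 1)]

    # Relative frequency of each glue word.
    word_counts = Counter(words)
    glue_features = [word_counts[g] / total for g in GLUE_WORDS]

    return length_features + glue_features


sample = "The cat and the dog sat by the door, but the bird flew to the roof."
features = extract_features(sample)
print(len(features), features[MAX_WORD_LEN])
```

Dividing by the total word count makes text samples of different lengths comparable, which matters when the samples are later split into cross-validation folds.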
[Figure: example confusion chart (marked EXAMPLE). Bars: Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP. Values as extracted: 78%, 55%, 45%, 22%.]
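The TP and FP percentages in the confusion charts can be read as per-class rates. The sketch below shows one common way to compute them. The definitions in the docstring are an assumption, since the slides do not state which denominator was used, and the labels are made-up data.

```python
from collections import Counter


def class_rates(true_labels, predicted_labels, target):
    """Return (tp_rate, fp_rate) for `target`.

    Assumed definitions (not confirmed by the slides):
      tp_rate = share of `target` documents predicted as `target`
      fp_rate = share of non-`target` documents predicted as `target`
    """
    counts = Counter(zip(true_labels, predicted_labels))
    n_target = sum(c for (t, _), c in counts.items() if t == target)
    n_other = sum(c for (t, _), c in counts.items() if t != target)
    tp = counts[(target, target)]
    fp = sum(c for (t, p), c in counts.items() if t != target and p == target)
    tp_rate = tp / n_target if n_target else 0.0
    fp_rate = fp / n_other if n_other else 0.0
    return tp_rate, fp_rate


# Made-up labels for illustration.
truth = ["anne"] * 4 + ["charlotte"] * 5
guess = ["anne", "anne", "anne", "charlotte",
         "anne", "anne", "charlotte", "charlotte", "charlotte"]

anne_rates = class_rates(truth, guess, "anne")
charlotte_rates = class_rates(truth, guess, "charlotte")
print(anne_rates, charlotte_rates)
```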
Binary Classification
[Figure: confusion chart. Legend: Anne Brontë, Charlotte Brontë. Bars: Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP. Values as extracted: 100%, 100%, 0%, 0%.]
[Figure: confusion chart. Legend: Anne Brontë, Charlotte Brontë. Bars: Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP. Values as extracted: 100%, 78.1%, 0%, 21.9%.]
PCA Density
K-Means
Used K-means to find dominant patterns
Unnormalized
Normalized
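A minimal sketch of the unnormalized versus normalized comparison, assuming "normalized" means scaling each feature column to zero mean and unit variance (the slides do not say which normalization was used). The data, the starting centers, and the helper names are made up for illustration.

```python
import math


def zscore(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def kmeans(rows, centers, iterations=20):
    """Plain Lloyd's algorithm; returns the cluster index of each row."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(len(centers)), key=lambda k: math.dist(r, centers[k]))
                  for r in rows]
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old center if a cluster empties
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up data: the first feature has a large scale, the second a small one.
rows = [(100, 0.10), (900, 0.20), (120, 0.15), (880, 0.12),
        (110, 0.90), (890, 0.80), (130, 0.85), (870, 0.95)]
start = [0, 4]  # indices of the rows used as initial centers

raw_labels = kmeans(rows, [rows[i] for i in start])
scaled = zscore(rows)
norm_labels = kmeans(scaled, [scaled[i] for i in start])
print(raw_labels)
print(norm_labels)
```

On the raw data the large-scale feature dominates the distance, so the clusters follow it. After scaling, both features carry equal weight and, from the same starting rows, the clusters follow the second feature instead. Normalization changes which features drive the clustering; it is not guaranteed to improve it.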
Unnormalized K-means
Anne Brontë vs. Upton Sinclair
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP. Values as extracted: 98.1%, 92.1%, 7.9%, 1.9%.]
Unnormalized K-means
Anne Brontë vs. Charlotte Brontë
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP. Values as extracted: 95.7%, 74.7%, 25.3%, 4.3%.]
Normalized K-means
Anne Brontë vs. Upton Sinclair
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP. Values as extracted: 53.3%, 50.6%, 49.4%, 46.7%.]
Normalized K-means
Anne Brontë vs. Charlotte Brontë
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP. Values as extracted: 86.7%, 84.2%, 15.8%, 13.3%.]
Discriminant Analysis
Performed discriminant analysis
Computed with equal covariance matrices
Used the average Omega (covariance matrix) of each class pair
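A two-class discriminant with a shared covariance matrix can be sketched as below. It assumes "average Omega" means the plain average of the two class covariance matrices and that the priors are equal. The example is restricted to two features so the matrix inverse can be written out by hand. The data and function names are made up.

```python
def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mu):
    """2x2 sample covariance matrix of two-feature rows."""
    n = len(rows)
    sxx = sum((r[0] - mu[0]) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - mu[1]) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mu[0]) * (r[1] - mu[1]) for r in rows) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def fit_lda(class_a, class_b):
    """Return (w, threshold); w.x > threshold predicts class A (equal priors)."""
    mu_a, mu_b = mean_vector(class_a), mean_vector(class_b)
    cov_a, cov_b = covariance(class_a, mu_a), covariance(class_b, mu_b)
    # Shared covariance: the plain average of the two class covariances (assumed).
    s = [[(cov_a[i][j] + cov_b[i][j]) / 2 for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # Discriminant direction: inverse covariance times the mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold


def predict(x, w, threshold):
    return "A" if w[0] * x[0] + w[1] * x[1] > threshold else "B"


# Made-up two-feature samples for illustration.
class_a = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.8), (2.2, 2.4)]
class_b = [(4.0, 5.0), (5.0, 4.2), (4.5, 5.5), (5.2, 4.8)]

w, threshold = fit_lda(class_a, class_b)
wrong = sum(predict(x, w, threshold) != "A" for x in class_a)
wrong += sum(predict(x, w, threshold) != "B" for x in class_b)
training_error = wrong / (len(class_a) + len(class_b))
print(predict((1.5, 1.5), w, threshold), predict((5.0, 5.0), w, threshold), training_error)
```

With equal covariances the decision boundary is linear, which is why a single weight vector and threshold are enough.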
Discriminant Analysis
Anne Brontë vs. Upton Sinclair
Empirical P(err) = 0.116
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Upton Sinclair FP, Upton Sinclair TP. Values as extracted: 96.2%, 92.2%, 3.8%, 7.8%.]
Discriminant Analysis
Anne Brontë vs. Charlotte Brontë
Empirical P(err) = 0.152
[Figure: confusion chart. Bars: Anne B. TP, Anne B. FP, Charlotte B. FP, Charlotte B. TP. Values as extracted: 92.7%, 89.2%, 10.8%, 7.3%.]
Logistic Regression
Fit linear model to training data on all dimensions
Threw out singular dimensions
Left with 298 coefficients + intercept
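The fitting steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's R code: "singular dimensions" is approximated here by dropping constant columns (the simplest case of a column that is collinear with the intercept), and the model is fit by plain gradient descent. The data and function names are made up.

```python
import math


def drop_constant_columns(rows):
    """Remove columns with a single distinct value; they duplicate the intercept."""
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the log-loss; returns [intercept, coef...]."""
    beta = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            p = sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
            grad[0] += p - y
            for j, v in enumerate(x):
                grad[j + 1] += (p - y) * v
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


# Made-up data: the middle column is constant and gets dropped.
rows = [[0.1, 1.0, 0.2], [0.2, 1.0, 0.1], [0.3, 1.0, 0.3],
        [0.8, 1.0, 0.7], [0.9, 1.0, 0.9], [0.7, 1.0, 0.8]]
labels = [0, 0, 0, 1, 1, 1]

reduced, keep = drop_constant_columns(rows)
beta = fit_logistic(reduced, labels)
predictions = [
    1 if sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x))) > 0.5 else 0
    for x in reduced
]
print(keep, len(beta), predictions)
```

The fitted vector has one coefficient per retained column plus the intercept, mirroring the "298 coefficients + intercept" count on the slide.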
Logistic Regression
Anne Brontë vs. Charlotte Brontë
[Figure: confusion chart. Legend: Anne Brontë, Charlotte Brontë. Bar labels as extracted: Anne B TP, Anne B TP, Charlotte B TP, Charlotte B TP. Values as extracted: 92%, 89.5%, 10.5%, 8%.]
Logistic Regression
Anne Brontë vs. Upton Sinclair
[Figure: confusion chart. Legend: Anne Brontë, Upton Sinclair. Bars: Anne B TP, Anne B FP, Upton S TP, Upton S FP. Values as extracted: 98%, 99%, 2%, 2%.]
4-Class Classification
4-Class K-means
Used K-means to find patterns among all classes
Unnormalized
Normalized
Unnormalized K-Means
4-Class Confusion Matrix
[Figure: 4-class confusion chart. Legend: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups, each with bars CD, AB, CB, US, labeled TP or FP. Values as extracted: 88%, 87%, 59%, 54%, 34%, 22%.]
Normalized K-Means
4-Class Confusion Matrix
[Figure: 4-class confusion chart. Legend: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups, each with bars CD, AB, CB, US, labeled TP or FP. Values as extracted: 67%, 70%, 67%, 67%, 27%, 26%, 20%.]
[Figure: 4-class confusion chart, untitled in the source. Legend: Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups, each with bars CD, AB, CB, US, labeled TP or FP. Values as extracted: 72%, 44%, 43%, 35%, 35%, 33%, 29%.]
Multinomial Regression
Multinomial distribution
Extension of binomial distribution
Random variable is allowed to take on n values
Multinomial Regression
Multinomial logit function is: [equation not recovered from the slide]
To classify:
Compute probabilities
Pr(y_i = Dickens), Pr(y_i = Anne B.), ...
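The classification step can be sketched with the standard softmax form of the multinomial logit, Pr(y_i = k) = exp(x_i . b_k) / sum_j exp(x_i . b_j). This standard form is an assumption, since the slide's own formula is not available. The rule of picking the most probable author, the coefficient values, the author order, and the function names are likewise hypothetical.

```python
import math

# Author order is illustrative; only the first two appear on the slide.
AUTHORS = ["Dickens", "Anne B.", "Charlotte B.", "Sinclair"]


def class_probabilities(x, coefs):
    """Softmax over linear scores; coefs[k] = [intercept, weights...] for class k."""
    scores = [c[0] + sum(w * v for w, v in zip(c[1:], x)) for c in coefs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x, coefs):
    """Return the most probable author and the full probability vector."""
    probs = class_probabilities(x, coefs)
    best = max(range(len(probs)), key=probs.__getitem__)
    return AUTHORS[best], probs


# Hypothetical coefficients; the first class is the reference (all zeros).
coefs = [
    [0.0, 0.0, 0.0],
    [0.5, 2.0, -1.0],
    [-0.2, 1.0, 1.0],
    [0.1, -1.5, 0.5],
]

author, probs = classify([1.0, 0.5], coefs)
print(author, [round(p, 3) for p in probs])
```

Fixing one class's coefficients at zero is the usual way to make the model identifiable; the probabilities still sum to one across all four authors.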
Multinomial Regression
4-Class Confusion Matrix
[Figure: 4-class confusion chart. Legend: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups with bars CD, AB, CB, US (the second group reads CD, AB, CB, CB in the source), labeled TP or FP. Values as extracted: 86%, 78%, 83%, 93%.]
Multinomial Regression
(Without Word Length)
[Figure: 4-class confusion chart. Legend: Anne Brontë, Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups with bars CD, AB, CB, US (the second group reads CD, AB, CB, CB in the source), labeled TP or FP. Values as extracted: 79%, 76%, 79%, 91%.]
Conclusions
Authors can be identified by their word-usage frequencies
Word length may be used to distinguish between the Brontë sisters
Word length does not, however, separate all authors (see Appendix C)
Appendix B: Code
See Attached .R files
[Figure: 4-class confusion chart, untitled in the source. Legend: Charles Dickens, Charlotte Brontë, Upton Sinclair. Four bar groups, each with bars CD, AB, CB, US, labeled TP or FP. Values as extracted: 96%, 94%, 54%, 46%, 22%, 11%, 6%, 3%.]
References
[1] Argamon, Saric, Stein, "Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results," SIGKDD 2003.
[2] Mitton, "Spelling checkers, spelling correctors and the misspellings of poor spellers," Information Processing and Management, 1987.