Lecture #5

GET1030
Computers and the humanities
Lecture 5
Textual analysis
Dr Miguel Escobar Varela

Objectives
At the end of this session you should be able to:

- Describe basic measurements and visualizations used for the quantitative study of text in the humanities
(Next week we will learn how to implement and visualize some of these measurements using Voyant Tools + Seaborn.)
Part 1: Basic concepts
NLP tasks in other domains
Natural Language Processing (NLP)
● Search and information retrieval

● Summarization
● Plagiarism and spam detection
● Machine translation
● Natural language interfaces
● Natural language generation: creation of texts
● Understand what people are saying about a given topic (example: social media)
What can computers help us do?
Textual data in the humanities
● Count words or metadata features of books, lyrics, subtitles (place, year, genre)
● Identify trends (in the usage of words, number of works published, etc.)
● Find patterns (which words tend to be more common in certain genres)
Types of questions computers can help with
● Authorship attribution (who wrote this?)
● Comparison between texts from different authors, genres, or places
● How things change within a single text (beginning to end) or across time (years, decades or centuries)
What can computers help us do?
● Simple: measurements
● Complex: classification through machine learning
Part 2: Research examples
Distant reading
The quest for broad patterns in literary history through empirical methods. An analysis of 7000 titles.
A concept proposed by Franco Moretti (Stanford Literary Lab).
Moretti (2009)
“A title with twenty words and one with two are not the same creature, one larger and one smaller; they are different animals altogether. Different styles.”
Franco Moretti (2009: 145)
Macroanalysis
“We might think about interpretive close readings as corresponding to microeconomics, whereas quantitative distant reading corresponds to macroeconomics”
Matthew Jockers (2013: 25)
Google Books Ngram
Ngram = a sequence of n consecutive words (2 words = bigram, 3 words = trigram, etc.)
https://books.google.com/ngrams
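A minimal Python sketch of how the n-grams of a sentence could be listed. The `ngrams` helper, the example sentence, and the whitespace tokenization are assumptions made for illustration; this is not how Google Books Ngram itself is implemented.

```python
def ngrams(text, n):
    """Return every sequence of n consecutive words in text."""
    words = text.lower().split()  # naive tokenization: split on whitespace
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the cat sat on the mat"  # made-up example sentence
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of six words yields five bigrams and four trigrams, because each n-gram starts one word later than the previous one.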
Michel et al. (2011)
Part 3: Simple measurements and visualizations
Lexical diversity
Corpus = a collection of texts (plural: corpora)
Lexicon = the vocabulary of a text
Type = a unique (distinct) word in a text
Lexical diversity = number of types / total number of words
*Also called type/word ratio (or type/token ratio)
What is the lexical diversity of these texts? (ignoring capitalization)
Text A: “I am a cat” → 4 types / 4 words = 1
Text B: “Cat cat cat cat” → 1 type / 4 words = 0.25
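The two calculations above can be sketched in Python. The function name `lexical_diversity` and the simple regular-expression tokenizer are assumptions for illustration; real tools may split words differently.

```python
import re

def lexical_diversity(text):
    """Number of types (unique words) divided by the total number of words."""
    words = re.findall(r"[a-z']+", text.lower())  # lowercase so "Cat" == "cat"
    if not words:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(words)) / len(words)

text_a = "I am a cat"
text_b = "Cat cat cat cat"
print(lexical_diversity(text_a))  # 4 types / 4 words = 1.0
print(lexical_diversity(text_b))  # 1 type / 4 words = 0.25
```

This ratio tends to drop as a text gets longer, so it is safest to compare texts of similar length.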
What is a word?
● Lemmatized words:
○ A lemma is the dictionary (base) form of a word
■ Different forms of the same word share one lemma: “book”, “books” → “book”
● Sense disambiguation:
○ Book: verb
○ Book: noun
● Kinds of words:
○ Functional/grammatical words: “a”, “the”, “in”
■ Often used for stylometry (analysis of style)
○ Lexical/content words: “cat”, “house”, “book”
■ Often kept for the analysis of themes
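As a rough sketch of the last distinction, the Python below counts functional and content words separately. The `FUNCTION_WORDS` set is a tiny hand-made list and the sentence is invented; both are assumptions for illustration, and real stop lists are much longer.

```python
import re
from collections import Counter

# Tiny hand-made list for illustration only; real lists are much longer.
FUNCTION_WORDS = {"a", "an", "the", "in", "on", "of", "and", "to", "is"}

def split_words(text):
    """Count functional words and content words separately."""
    words = re.findall(r"[a-z']+", text.lower())
    function = Counter(w for w in words if w in FUNCTION_WORDS)
    content = Counter(w for w in words if w not in FUNCTION_WORDS)
    return function, content

function, content = split_words(
    "The cat is in the house and the book is on the table"
)
print(function.most_common(3))
print(content.most_common(3))
```

The first counter is the kind of evidence stylometry often relies on; the second is closer to what a thematic analysis would keep.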
Analyzing word counts in Hip Hop
https://pudding.cool/2017/02/vocabulary/index.html
Letters of Vincent Van Gogh
Vincent van Gogh (1853-1890)
903 letters written during his lifetime
Digital collection of Van Gogh’s letters: http://vangoghletters.org/vg/
Letters of Vincent Van Gogh
https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7
Next week we will see how to import data into Voyant, perform different types of analysis, and then export the results for reuse in Python.
Type/word ratios
Word trends
https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7
Word trends (sparklines)
This is an example from Voyant Tools (voyant-tools.org), which we will learn to use next week.
Concordances
• A list of all the instances where a word is found in a text.
• Concordances began in the 12th century, mostly as an aid for reading the Bible (Rouse and Rouse, 1982).
http://www.opensourceshakespeare.org/concordance/findform.php
KWIC
• Keyword in context (KWIC)

• Context: the words immediately before and immediately after another word
https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7
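A keyword-in-context listing can be sketched in a few lines of Python. The `kwic` function, its `width` parameter, and the sample text are assumptions made for illustration; this is not how Voyant Tools builds its own view.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        # Ignore case and surrounding punctuation when matching the keyword.
        if word.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits

# Invented sample text, not taken from the letters.
sample = "I paint the sea. The sea is never the same colour twice."
hits = kwic(sample, "sea", width=2)
for left, word, right in hits:
    print(f"{left:>15} | {word} | {right}")
```

The `width` parameter sets how many words of context are shown on each side of the keyword.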
Collocates
Words that tend to occur within the same “window” as another word. In this case, the window is a sentence.
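A minimal sketch of sentence-window collocates in Python. The function name, the sentence splitting on `.`, `!` and `?`, and the sample text are assumptions for illustration. The sketch only counts raw co-occurrences; deciding which words "tend" to co-occur would also require comparing these counts with how common each word is overall.

```python
import re
from collections import Counter

def sentence_collocates(text, keyword):
    """Count the words that share a sentence with keyword."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):  # naive sentence splitting
        words = re.findall(r"[a-z']+", sentence.lower())
        if keyword in words:
            counts.update(w for w in words if w != keyword)
    return counts

# Invented sample text for illustration.
sample = "The sea is blue. I paint the sea at night. The field is yellow."
collocates = sentence_collocates(sample, "sea")
print(collocates.most_common(3))
```

Words from the third sentence are not counted, because "sea" does not occur in that window.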
Contiguous Collocates in Wordtree
Collocates can also be contiguous, where we are interested in words that occur next to
each other.
Hermeneuti.ca
http://hermeneuti.ca/name-games
The evolution of Game Studies.
http://gamestudies.org/1903
Part 4: A machine learning example
Machine Learning
Supervised ML
● Classification
○ Labelled examples used to train a model
■ Training set
■ Test set
Unsupervised ML
● Clustering
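The roles of the training set and the test set can be sketched with a toy classifier in plain Python. Everything here is an assumption for illustration: the labelled examples are invented, and the word-overlap classifier is far simpler than the models used in research.

```python
import random
from collections import Counter

# Invented labelled examples: (text, label).
examples = [
    ("goal match team score", "sport"),
    ("team win league match", "sport"),
    ("score goal player win", "sport"),
    ("player league team goal", "sport"),
    ("vote election party law", "politics"),
    ("party law minister vote", "politics"),
    ("election minister policy vote", "politics"),
    ("policy law party election", "politics"),
]

random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(examples)
training_set, test_set = examples[:6], examples[6:]

# "Training": count how often each word appears under each label.
word_counts = {}
for text, label in training_set:
    word_counts.setdefault(label, Counter()).update(text.split())

def classify(text):
    """Pick the label whose training words overlap most with text."""
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

# Evaluation: the test set was never seen during training.
correct = sum(classify(text) == label for text, label in test_set)
print(f"accuracy on the test set: {correct}/{len(test_set)}")
```

Keeping the test set apart matters because a model is judged on examples it has not already seen.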
Supervised ML example
Representation of character across 104,000 books
They used a pipeline called BookNLP, which identifies character names in a work of fiction and clusters those names, so that “Elizabeth” and “Elizabeth Bennet” are linked as a single person.
Underwood et al. (2018)
Supervised ML example
● For each decade, they randomly selected 1,600 characters each time (800 men / 800 women)
● Characterization omitted names and obvious gender markers (he/she)
● Characters were classified using the most common words in that group of characters
● The model was run 15 times for each decade
Underwood et al. (2018)
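The balanced sampling step can be sketched as follows. This is not the authors' code: the `balanced_sample` function, the record layout with a `gender` field, and the stand-in data are all hypothetical, chosen only to show what drawing 800 characters per group, 15 times, might look like.

```python
import random

def balanced_sample(characters, per_group=800, seed=None):
    """Draw the same number of characters from each gender group."""
    rng = random.Random(seed)
    sample = []
    for gender in ("m", "f"):
        group = [c for c in characters if c["gender"] == gender]
        sample.extend(rng.sample(group, per_group))  # sampling without replacement
    return sample

# Hypothetical stand-in data: 1,000 characters per gender for one decade.
characters = [{"id": i, "gender": "m" if i % 2 else "f"} for i in range(2000)]

# Repeat the draw 15 times, with a different random sample each time.
runs = [balanced_sample(characters, seed=run) for run in range(15)]
print(len(runs), len(runs[0]))
```

Repeating the draw gives a sense of how much the result depends on which characters happened to be sampled.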
Corpus Linguistics
Underwood et al. (2018)
References
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82. https://doi.org/10.1126/science.1199644.
Underwood, William E., David Bamman, and Sabrina Lee. 2018. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics 1 (1): 11035. https://doi.org/10.22148/16.019.
Jockers, Matthew Lee. 2013. Macroanalysis. Topics in the Digital Humanities. Urbana: University of Illinois Press.
Moretti, Franco. 2009. “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry 36 (1): 134–58. https://doi.org/10.1086/606125.
Rockwell, Geoffrey, and Stéfan Sinclair. 2016. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge: MIT Press. http://www.jstor.org/stable/j.ctt1c0gm6h.