
GET1030

Computers and the humanities

Lecture 5
Textual analysis

Dr Miguel Escobar Varela



Objectives

At the end of this session you should be able to:


- Describe basic measurements and visualizations used for the quantitative study of text in the humanities

(Next week we will learn how to implement and visualize some of these measurements using Voyant Tools + Seaborn.)


Part 1: Basic concepts


NLP tasks in other domains

Natural Language Processing (NLP)

● Search and information retrieval
● Summarization
● Plagiarism and spam detection
● Machine translation
● Natural language interfaces
● Natural language generation: creation of texts
● Understanding what people are saying about a given topic (example: social media)


What can computers help us do?

Textual data in the humanities

● Count words or metadata features of books, lyrics, subtitles (place, year, genre)
● Identify trends (in the usage of words, number of works published, etc.)
● Find patterns (which words tend to be more common in certain genres)


Types of questions computers can help with

● Authorship attribution (who wrote this?)
● Comparison between texts from different authors, genres, places
● How things change within a single text (beginning to end), or across time (years, decades or centuries)


What can computers help us do?

● Simple measurements
● Complex classification through machine learning


Part 2: Research examples


Distant reading

The quest for broad patterns in literary history through empirical methods.
A concept proposed by Franco Moretti (Stanford Literary Lab).
Example: an analysis of 7,000 titles of British novels, 1740–1850.

Moretti (2009)


A title with twenty words and one with two are not the same creature, one
larger and one smaller; they are different animals altogether. Different
styles.

Franco Moretti (2009: 145)


Macroanalysis

“We might think about interpretive close readings as corresponding to microeconomics, whereas quantitative distant reading corresponds to macroeconomics”

Matthew Jockers (2013, 25).


Google Books Ngram

N-gram = a sequence of n consecutive words (bigrams for n = 2, trigrams for n = 3, etc.)


https://books.google.com/ngrams
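
As a rough illustration of what an n-gram is, the sketch below pulls every sequence of n consecutive words out of a short sentence and counts the bigrams. The helper name `ngrams` and the sample sentence are invented for this example; splitting on whitespace is a simplifying assumption, and this is not a description of how Google Books Ngram is built.

```python
from collections import Counter


def ngrams(text, n):
    """Return every sequence of n consecutive words in text, in order."""
    words = text.lower().split()  # naive tokenization: split on whitespace
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Invented sample sentence, used only for illustration.
sample = "the cat sat on the mat and the cat slept"

bigrams = ngrams(sample, 2)       # n = 2
trigrams = ngrams(sample, 3)      # n = 3
bigram_counts = Counter(bigrams)  # how often each bigram occurs

print(bigram_counts.most_common(3))
```

Counting the same n-gram across books published in different years, and dividing by the total for each year, would give the kind of trend line the Ngram viewer displays.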

Michel et al. (2011)

Part 3: Simple measurements and visualizations


Lexical diversity

Corpus = a collection of texts (plural: corpora)
Lexicon = the vocabulary in a text
Type = a unique word in a text
Lexical diversity = number of types / number of words
*Also called type/word ratio (or type-token ratio)

What is the lexical diversity of these texts?

Text A: “I am a cat”: 4 types / 4 words = 1
Text B: “Cat cat cat cat”: 1 type / 4 words = 0.25 (ignoring capitalization)
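
The calculation above can be sketched in a few lines of Python. The function names `tokenize` and `lexical_diversity` are invented for this example, and the tokenizer is a deliberately simple assumption: it lowercases the text and keeps runs of letters and apostrophes, which suits short English examples but not every language.

```python
import re


def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_diversity(text):
    """Number of types (unique words) divided by the total number of words."""
    words = tokenize(text)
    if not words:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(words)) / len(words)


print(lexical_diversity("I am a cat"))       # 1.0
print(lexical_diversity("Cat cat cat cat"))  # 0.25
```

Longer texts tend to repeat words more, so this ratio is easiest to compare between texts of similar length.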


What is a word?

● Lemmatized words:
○ A lemma is the dictionary (base) form of a word
■ Different forms of a word share one lemma: book, books
● Sense disambiguation:
○ Book: verb
○ Book: noun
● Kinds of words:
○ Functional/grammatical words: “a”, “the”, “in”
■ Often used for stylometry (analysis of style)
○ Lexical/content words: “cat”, “house”, “book”
■ Often kept for the analysis of themes
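
To make these choices concrete, here is a minimal sketch that maps word forms to a lemma and then counts function words and content words separately. The tiny `LEMMAS` table and `FUNCTION_WORDS` set are hand-made assumptions for this example only; real projects usually rely on a lemmatizer and a much longer stopword list, and sense disambiguation (book as verb versus noun) is not attempted here.

```python
import re
from collections import Counter

# Hypothetical lookup tables, far smaller than anything used in practice.
LEMMAS = {"books": "book", "cats": "cat", "houses": "house"}
FUNCTION_WORDS = {"a", "an", "the", "in", "on", "of", "and", "is", "are", "to"}


def split_word_kinds(text):
    """Return (function_word_counts, content_word_counts) for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    lemmas = [LEMMAS.get(word, word) for word in words]  # books -> book
    function = Counter(w for w in lemmas if w in FUNCTION_WORDS)
    content = Counter(w for w in lemmas if w not in FUNCTION_WORDS)
    return function, content


function, content = split_word_kinds("The books and the book are in a house")
print(function)  # counts of grammatical words, e.g. for stylometry
print(content)   # counts of content words, e.g. for themes
```

Because "books" and "book" share a lemma, they are counted together as one content word.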


Analyzing word counts in Hip Hop

https://pudding.cool/2017/02/vocabulary/index.html

Letters of Vincent Van Gogh

903 letters written during his lifetime (1853–1890)

Digital collection of Van Gogh’s letters: http://vangoghletters.org/vg/

https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7

Next week we will see how to import data into Voyant, perform different types of analysis and then export the results for reuse in Python.

Type/word ratios


Word trends

https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7

Word trends (sparklines)

This is an example of Voyant Tools (voyant-tools.org), which we will learn to use next week.
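
A word trend can be thought of as the relative frequency of one word in each document of a corpus, read in order. The sketch below assumes that definition; it is not a description of how Voyant computes its charts. The three "letters" are invented sentences, not text from the Van Gogh collection, and the function names are made up for this example.

```python
import re


def relative_frequency(word, text):
    """Occurrences of word divided by the total number of words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return words.count(word.lower()) / len(words)


def word_trend(word, documents):
    """Relative frequency of word in each document, in document order."""
    return [relative_frequency(word, doc) for doc in documents]


# Invented stand-ins for a chronologically ordered corpus.
letters = [
    "I paint the field today",
    "the rain kept me inside",
    "I paint and paint again",
]

trend = word_trend("paint", letters)
print(trend)  # one value per letter, ready to plot as a line or sparkline
```

Using relative rather than raw counts keeps long and short documents comparable.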

Concordances

• A concordance is a list of the instances where a word is found in a text.
• Concordances began in the 12th century, mostly to read the Bible (Rouse and Rouse, 1982).

http://www.opensourceshakespeare.org/concordance/findform.php


KWIC

• Keyword in context (KWIC)


• Context: the words immediately before and immediately after another word

https://voyant-tools.org/?corpus=2a9aa299a95d7eca47cf68d25f0382e7
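
A keyword-in-context listing can be sketched as follows: for each occurrence of the keyword, keep a few words to its left and right. The function name `kwic`, the `width` parameter and the sample text are assumptions made for this illustration; Voyant's own implementation may differ.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation and without regard to case.
        if word.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines


# Invented sample text.
text = "The cat sat on the mat. A dog saw the cat and ran."

for left, key, right in kwic(text, "cat", width=2):
    print(f"{left:>10} | {key} | {right}")
```

Aligning the keyword in a central column, as the print statement does, is what makes patterns in the surrounding words easy to scan.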

Collocates

Words that tend to occur within the same “window” as another. In this case, the
window is a sentence.
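
Taking the sentence as the window, collocates can be counted roughly like this. The function name and sample text are invented, and splitting sentences on ".", "!" and "?" is a crude assumption that will misfire on abbreviations.

```python
import re
from collections import Counter


def sentence_collocates(text, target):
    """Count the words that share a sentence with the target word."""
    target = target.lower()
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):  # naive sentence splitting
        words = re.findall(r"[a-z']+", sentence.lower())
        if target in words:
            counts.update(w for w in words if w != target)
    return counts


# Invented sample text.
text = "The cat sat on the mat. The dog barked. The cat chased the dog."

counts = sentence_collocates(text, "cat")
print(counts.most_common())
```

Function words such as "the" dominate raw counts like these, which is one reason they are often filtered out before looking at collocates.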


Contiguous Collocates in Wordtree

Collocates can also be contiguous, where we are interested in words that occur next to
each other.
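
For contiguous collocates, one simple approach is to count which words come immediately after a target word, which is roughly the information a word tree branches on. This sketch uses an invented function name and sample text, and it ignores sentence boundaries, so the last word of one sentence is treated as adjacent to the first word of the next.

```python
import re
from collections import Counter


def next_words(text, target):
    """Count the words that appear immediately after the target word."""
    words = re.findall(r"[a-z']+", text.lower())
    target = target.lower()
    # Pair each word with its right-hand neighbour and keep matching pairs.
    return Counter(
        following
        for current, following in zip(words, words[1:])
        if current == target
    )


# Invented sample text.
text = "I love painting. I love the sea. I paint the sea."

print(next_words(text, "I"))
print(next_words(text, "the"))
```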


Hermeneuti.ca

http://hermeneuti.ca/name-games

The evolution of Game Studies.

http://gamestudies.org/1903


Part 4: A machine learning example


Machine Learning

Supervised ML
● Classification
○ Labelled examples used to train a model
■ Training set
■ Test set

Unsupervised ML
● Clustering
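
The supervised workflow (labelled examples, a training set, a test set) can be sketched with a toy classifier in plain Python. Everything here is an assumption made for illustration: the labelled examples are invented, the function names are hypothetical, and the "model" simply counts words per label, which is far simpler than the methods used in published research.

```python
import random
from collections import Counter


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the labelled examples and split them into training and test sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(training_set):
    """Count how often each word appears under each label."""
    model = {}
    for text, label in training_set:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    return max(sorted(model), key=lambda label: sum(model[label][w] for w in words))


def accuracy(model, test_set):
    """Share of test examples whose predicted label matches the true label."""
    return sum(predict(model, t) == label for t, label in test_set) / len(test_set)


# Invented labelled examples: (text, genre label).
examples = [
    ("dark castle ghost night", "gothic"),
    ("ghost haunted castle tower", "gothic"),
    ("night storm haunted abbey", "gothic"),
    ("ghost abbey dark secret", "gothic"),
    ("ship sea captain voyage", "adventure"),
    ("island treasure ship map", "adventure"),
    ("voyage storm sea island", "adventure"),
    ("captain treasure map sea", "adventure"),
]

train_set, test_set = train_test_split(examples)
model = train(train_set)
print("accuracy on the held-out test set:", accuracy(model, test_set))
```

The test set is kept apart from training so that the accuracy reflects how the model handles examples it has not seen.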


Supervised ML example

Representation of character across 104,000 books

They used a pipeline called BookNLP, which identifies character names in a work of fiction and clusters those names, so that “Elizabeth” and “Elizabeth Bennet” are linked as a single person.

Underwood et al. (2018)

Supervised ML example

● For each decade, they randomly selected 1,600 characters each time (800 men / 800 women)
● Characterizations omitted names and obvious gender markers (he/she)
● Characters were classified using the most common words in that group of characters
● The model was run 15 times for each decade

Underwood et al. (2018)
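
The repeated balanced sampling described above might look something like the sketch below. This is not the authors' code: the character records, the `gender` field, its "f" and "m" values and the function name are all hypothetical, and only the sampling step is shown, not the classification itself.

```python
import random


def balanced_sample(characters, per_group=800, seed=None):
    """Randomly draw the same number of characters from each gender group."""
    rng = random.Random(seed)
    sample = []
    for gender in ("f", "m"):
        group = [c for c in characters if c["gender"] == gender]
        # rng.sample raises ValueError if the group has fewer than per_group members.
        sample.extend(rng.sample(group, per_group))
    return sample


# Invented stand-in for the characters found in one decade's fiction.
characters = [{"id": i, "gender": "f" if i % 2 else "m"} for i in range(4000)]

# Fifteen independent draws of 1,600 characters (800 + 800), one per model run.
runs = [balanced_sample(characters, per_group=800, seed=run) for run in range(15)]
print(len(runs), "runs of", len(runs[0]), "characters each")
```

Repeating the draw with different random samples gives a sense of how much the results vary from one sample to the next.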

Corpus Linguistics

Underwood et al. (2018)

References

Jockers, Matthew Lee. 2013. Macroanalysis. Topics in the Digital Humanities. Urbana: University of Illinois Press.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014): 176–82. https://doi.org/10.1126/science.1199644.

Moretti, Franco. 2009. “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry 36 (1): 134–58. https://doi.org/10.1086/606125.

Rockwell, Geoffrey, and Stéfan Sinclair. 2016. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge: MIT Press. http://www.jstor.org/stable/j.ctt1c0gm6h.

Underwood, William E., David Bamman, and Sabrina Lee. 2018. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics 1 (1): 11035. https://doi.org/10.22148/16.019.
