Applying Language
Technology in
Humanities Research
Design, Application, and
the Underlying Logic
Barbara McGillivray
Gábor Mihály Tóth
Barbara McGillivray
Faculty of Modern and Medieval Languages
University of Cambridge
Cambridge, UK
The Alan Turing Institute
London, UK
Gábor Mihály Tóth
Viterbi School of Engineering, Signal Analysis Lab (SAIL)
University of Southern California
Los Angeles, CA, USA
ISBN 978-3-030-46492-9 ISBN 978-3-030-46493-6 (eBook)

https://doi.org/10.1007/978-3-030-46493-6
© The Editor(s) (if applicable) and The Author(s) 2020

This work is subject to copyright. All rights are solely and exclusively licensed by the
Publisher, whether the whole or part of the material is concerned, specifically the rights
of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied,
with respect to the material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
Cover illustration: © Melisa Hasan
This Palgrave Macmillan imprint is published by the registered company Springer Nature
Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The idea of this book goes back to the HiCor Research Network, founded
and led by us together with Gard Jenset and Kerry Russell. HiCor was
a research group of historians and corpus linguists at the University of
Oxford active between 2012 and 2014. It was generously supported by
TORCH (The Oxford Research Centre in the Humanities). In addition
to organizing lectures and a workshop, HiCor also aimed to disseminate
language technology among historians and, more generally, humanists.
For instance, we organized several courses on Language Technology and
Humanities at the Oxford DH Summer School, which inspired this book.
We are grateful to Gard Jenset who helped to shape the initial ideas
underlying this book. We also thank our employers and funders for
providing us with time and funding to accomplish the project.1
We have contributed equally to the design of the book. We have joint
responsibility for Chapter 1. Barbara McGillivray has primary
responsibility for Chapters 2 and 5. Gábor Tóth has primary responsibility for
Chapters 3, 4, 6, and 7.
Barbara McGillivray (Cambridge, UK; London, UK)
Gábor Mihály Tóth (Los Angeles, USA)
1 Gábor Tóth thanks the USC Shoah Foundation and the USC Viterbi School of
Engineering. Barbara McGillivray was supported by The Alan Turing Institute under
EPSRC grant EP/N510129/1.
Contents
1 Introducing Language Technology and Humanities
1.1 Why Language Technology for the Humanities?
1.2 Structure of the Book
References
2 Design of Text Resources and Tools
2.1 Text Resources in the Humanities
2.1.1 Text Resources and Corpora
2.1.2 Data and Metadata
2.2 Corpus Design and Creation
2.2.1 Designing a Text Resource
2.2.2 Humanities Corpora
2.3 Use Case: The Diorisis Ancient Greek Corpus
2.4 Corpus and Natural Language Processing Tools
2.4.1 Text-Processing Pipeline
2.4.2 Pre-processing and Tokenization
2.4.3 Stemming, Lemmatization, and Morphological Annotation
2.4.4 Part-of-Speech Tagging
2.4.5 Chunking and Syntactic Parsing
2.4.6 Named Entities
2.4.7 Other Annotation
2.5 Conclusion
References
3 Frequency
3.1 Concept of Frequency
3.2 Application: The “Characteristic Vocabulary” of the Moonstone by Wilkie Collins
3.3 Application: Terms with ‘Turbulent History’ in the Early English Books Online
3.4 Conclusion
References
4 Collocation
4.1 The Concept of Collocation
4.2 Probability of a Bigram
4.3 Observed and Expected Probability of a Bigram
4.4 Strength of Association: Pointwise Mutual Information (PMI)
4.5 Strength of Association: Log Likelihood Ratio
4.6 Application: What Residents of Modern London Complained About
4.7 Conclusion
References
5 Word Meaning in Texts
5.1 The Study of Word Meaning
5.2 Distributional Approaches to Word Meaning
5.3 Word Space Models
5.3.1 Words in Space
5.3.2 Word Embeddings
5.4 Use Case: Exploring Smell in Historical Health Reports
5.4.1 Visualizing Words in the Semantic Space
5.4.2 Measuring Distances in the Semantic Space
5.5 Use Case: Finding Semantic Change in a Web Archive
5.6 Conclusion
References
6 Mining Textual Collections
6.1 Textual Similarity, an Old Problem
6.2 How to Construct a Feature Space
6.2.1 Feature Selection
6.2.2 Feature Scoring
6.2.3 Representation as a Geometric Space
6.2.4 The Document–Term Matrix
6.2.5 Representation as a Vector Space
6.2.6 Summary
6.3 Application: Discovery of Similarity in the Anglo-Saxon Chronicle
6.3.1 Transformation of the Anglo-Saxon Chronicle into a Document Collection
6.3.2 Feature Extraction and Feature Selection
6.3.3 Construction of the Document–Term Matrix
6.3.4 Feature Scoring
6.3.5 Rendering a Feature Space Through Projection to a Lower-Dimensional Space
6.3.6 Measuring the Cosine Similarity Between Annals
6.3.7 Clustering
6.3.8 Topic Modelling
6.3.9 Topic as a Hidden Layer
6.3.10 Hierarchical Topic Modelling
6.3.11 Summary of Topic Modelling
6.4 Conclusion
References
7 The Innovative Potential of Language Technology for the Humanities
7.1 Bridging Concepts Between Humanities and Language Technology
7.2 A Critical View of Language Technology
Index
List of Figures
Fig. 3.1 Relative document frequency of lemma forsake in the EEBO subcorpus
Fig. 4.1 Changes of log likelihood ratio (window: 5 words; direction:
left) between complain/complaint and dust, mouse, noise, rat,
smell, smoke in the London Health Reports dataset
Fig. 4.2 Changes of log likelihood ratio (window: 5 words; direction:
right) between complain/complaint and dust, mouse, noise, rat,
smell, smoke in the London Health Reports dataset
Fig. 5.1 Bi-dimensional representation of the words film, movie,
and quote using the coordinates from Table 5.2
Fig. 5.2 Bi-dimensional representation of the semantic space
from the London MOH reports. We have displayed the
points corresponding to the top 40,000 most frequent words,
and the labels of the words smell, stink, odour, perfume, table,
and house
Fig. 5.3 Simplified visualization of the semantic change of the noun
tweet in three semantic spaces
Fig. 6.1 Simplified representation of some car models in terms
of common features
Fig. 6.2 Some novels by Wilkie Collins and their representation using
library catalogue subject headings as common features
(Source: The on-line catalogue of the Bodleian Library,
Oxford, http://solo.bodleian.ox.ac.uk, accessed 1 January 2020)
Fig. 6.3 A simplified document collection of three English proverbs
and their representation through bag of words
Fig. 6.4 Representation of three English proverbs in a feature space
rendered as geometric space
Fig. 6.5 Representation of three English proverbs as document vectors
Fig. 6.6 Representation of the annals of the Anglo-Saxon Chronicle
in a projected space
Fig. 6.7 Similarity matrix of annals highlighted (Group 2) in Fig. 6.6
Fig. 6.8 Representation of the annals of the Anglo-Saxon Chronicle
in a projected space with some clusters highlighted
Fig. 6.9 Hierarchical topic modelling in the Anglo-Saxon Chronicle
List of Tables
Table 2.1 Top frequency word types in Shakespeare’s Hamlet
Table 2.2 Top frequency word types in Shakespeare’s Hamlet after
removing stop words
Table 5.1 Example of co-occurrence frequencies in a toy example
consisting of four sentences containing the nouns dog and cat
Table 5.2 Example of co-occurrence frequencies for the lemmas film,
movie, and quote from the British National Corpus 2014 Spoken
Table 5.3 Example of co-occurrence frequencies for the lemmas film,
movie, and quote from the British National Corpus 2014 Spoken
Table 5.4 Cosine similarity measures between the word embeddings
for blackberry and phone, and blackberry and raspberry, 2000
and 2013. The embeddings are from
https://zenodo.org/record/3383660#.XfylShf7Sbc
Table 6.1 Summary of annals highlighted in Fig. 6.6
Table 6.2 Key topics extracted from the Anglo-Saxon Chronicle
CHAPTER 1
Introducing Language Technology and Humanities
Abstract This chapter outlines the relevance of language technology
for the exploration and study of big textual data sets in the humanities.
We also discuss the importance of understanding the logic underlying
the use of language technology to resolve research problems in the
humanities. Finally, we outline the three pillars of the approach we follow
throughout the book: focus on application through both simplified and
more complex use-case examples; discussion of both the potential and
the limitations of language technology; and explanation of how to translate
humanities research questions into research problems using language
technology.
Keywords Big data · Distant reading · Textual resource ·
Language technology · Humanities research
1.1 Why Language Technology for the Humanities?

In the last two decades, the humanities have seen an unprecedented
change that opens up new directions for inquiry into human cultures and
their histories: the availability, not yet fully explored, of digitized
humanistic texts. Thanks to the mass digitization of analogue resources
preserved in libraries and archives, large textual collections, such as Google
Books, Early English Books Online, and Project Gutenberg, have
become available on the World Wide Web. The rise of digital humanities
© The Author(s) 2020 1

B. McGillivray and G. M. Tóth, Applying Language Technology
in Humanities Research, https://doi.org/10.1007/978-3-030-46493-6_1
as a new academic field has contributed to the proliferation of research
infrastructures and centres dedicated to the study and distribution of
textual resources in the humanities. The mission of digital humanities
projects such as CLARIN European Research Infrastructure, DARIAH and
the ESRC Centre for Corpus Approaches to Social Science is to make
textual resources not only available but also investigable for scholars.
Digital humanists have proposed the method of distant reading or
macroanalysis for learning from large textual resources (Jockers 2013; Moretti
2015). Alongside a growing interest in large textual resources, there is
an increasing demand from (digital) humanities researchers for
quantitative and computational skills. The current offering in this space is rich,
with a range of training options (including dedicated summer schools
like the digital humanities training events at Oxford,1 DHSI at Victoria,2
or the European Summer School in Digital Humanities in Leipzig3) and
publications (examples include Bird et al. 2009; Gries 2009; Hockey
2000; Jockers 2014; Piotrowski 2012). Nonetheless, textual resources in
the humanities and beyond raise a key challenge: they are too big to be
read by humans interested in analysing them. The potential lying in the
exploration of large textual collections has not yet been fully realized;
realizing it remains a key task for the current and the next generations of
humanities scholars.
To explore tens of thousands of books or millions of historical
documents, humanities scholars inevitably need the power of computing
technologies. Among these technologies, there is one that has had and will
definitely continue to have a pivotal role in the exploration of big textual
resources. Language technology, which can help unlock and investigate
large amounts of textual data, is a truly interdisciplinary enterprise. It is
not an academic field per se; it is rather a collection of methods that deal
with textual data. Language technology sits at the crossroads between
corpus and computational linguistics, natural language processing and
text mining, data science and data visualization. As we will demonstrate
throughout this book, language technology can be used to address a
great variety of research problems involved in the investigation of textual
data in the humanities and beyond.
1 https://www.dhoxss.net/.
2 https://dhsi.org.
3 http://esu.culintec.de/.
1.2 Structure of the Book

This book examines research problems that are relevant for the
humanities and can be addressed with the help of language technology.
Chapter 2 demonstrates how language technology can help structure
raw textual data and represent them as a resource meaningful for both
humans and computers. For instance, the lyrics of thousands of popular
songs are now available in plain text on the World Wide Web. But lyrics
in plain text format do not distinguish the title and the refrain of a song.
This is an example of unstructured data because the various components of
a song are not marked in a way that lets computers extract them
automatically. Language technology can help detect structural components
within a text such as the refrain of a song; it can also help represent a
song in digital form so that different structural components are
distinguished and readily available for further computational investigations.
Language technology also supports word-level investigations of textuality.
The lyrics of a song consist of not only structural units, but also
different types of words such as nouns, verbs, and names of people. In plain
text format, word-level information about lyrics is not readily usable
by computing tools; for instance, it is not possible to extract all proper
names from a collection of lyrics in plain text. As Chapter 2 explains,
language technology helps attach different types of information to each
word of a text; it also offers ways to record this information in
well-established data formats.
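To make the contrast between plain and structured text concrete, here is a minimal Python sketch of a song stored with its structural parts and word-level labels kept apart. Everything in it is an assumption for illustration: the class names Song and Token, the label "PROPN" for proper names, and the invented toy song are not taken from the book or its repository, and the labels were assigned by hand rather than by a tagger.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    pos: str  # word-level label, e.g. "PROPN" for a proper name (hypothetical tag set)


@dataclass
class Song:
    title: str
    verses: list = field(default_factory=list)   # each verse is a list of Token
    refrain: list = field(default_factory=list)  # the refrain is a list of Token


def proper_names(song):
    """Return every token labelled as a proper name, verses first, then refrain."""
    tokens = [tok for verse in song.verses for tok in verse] + song.refrain
    return [tok.text for tok in tokens if tok.pos == "PROPN"]


# Invented toy song; the labels were assigned by hand for illustration.
song = Song(
    title="Ballad of Anna",
    verses=[[Token("Anna", "PROPN"), Token("walks", "VERB"), Token("home", "NOUN")]],
    refrain=[Token("Sing", "VERB"), Token("for", "ADP"), Token("London", "PROPN")],
)

print(song.title)          # the title is a separate, addressable field
print(proper_names(song))  # names can be extracted because each word is labelled
```

Once the title, the refrain, and the word labels are explicit fields, extracting all proper names becomes a simple filter, which is not possible with the same lyrics in plain text.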
Language technology also facilitates the bottom-up exploration
of textual resources and textuality. For instance, finding terms that are
significant elements of a text is an important component of bottom-up
explorations. We will discuss how the investigation of word frequency
can support this in Chapter 3. Language technology methods can map
terms closely related to a given concept in thousands of texts. This form
of bottom-up exploration is discussed in Chapter 4. Language
technology methods can also help in bottom-up studies of word meaning.
For instance, the meaning of a concept can be investigated by drawing
on a dictionary definition, but it can also be inferred from the way
authors used that concept in their works. Chapter 5 examines how
language technology enables this type of exploration of meaning. Finally,
language technology has tools to detect patterns recurring over
thousands of texts. As the proverb says, there is nothing new under the sun.
Similar themes and ideas recur over texts from different historical times.
However, detecting them in large textual resources is a tedious (or
sometimes impossible) task for human readers. As Chapter 6 illustrates,
language technology supports humans in their efforts to detect recurrence
and similarity in texts.
To realize the rich potential that language technology offers,
humanists need to bridge two interrelated gaps. The first is the conceptual gap
between humanities research problems and language technology
methods. As a simple example, language technology can detect how many
times a given term is used in a given set of historical sources. In more
technical terms, with language technology we can study word frequency.
But rarely do historians ask how many times a term occurs in their source
texts. Rather, they inquire about the prevailing social concepts in a given
historical time. There is a conceptual gap between word frequency and
the prevailing social concepts. This simple example also sheds light
upon the second gap, which lies between qualitative and quantitative
approaches. The insights that language technology can deliver are very
often quantitative and difficult to interpret with a qualitative framework.
Bridging these gaps is a daunting task for scholars, and this publication
seeks to assist them in this task. We believe that the potential of
language technology can be realized if there is a clear understanding of the
logic underlying it. The overall goal of this book is therefore to apply
the logic of language technology to the resolution of humanistic research
problems. We will attempt to convey this logic by following a didactic
approach with three pillars.
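The word-frequency example mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name term_frequency, the crude letters-only tokenizer, and the two invented "source" sentences are ours, not the book's implementation.

```python
import re
from collections import Counter


def term_frequency(texts, term):
    """Count how often `term` occurs across `texts`, ignoring case."""
    counts = Counter()
    for text in texts:
        # A crude tokenizer: lower-case the text and keep runs of letters.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts[term.lower()]


# Invented sentences standing in for historical sources.
sources = [
    "The king granted liberty to the town.",
    "Liberty was denied; the town petitioned the king.",
]
print(term_frequency(sources, "liberty"))
print(term_frequency(sources, "king"))
```

The count itself is trivial to obtain; the conceptual gap lies in deciding whether, and how, the frequency of a word such as liberty says anything about the social concepts prevailing in the period.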
First, we guide you through various research procedures involved in
the application of language technology. Chapter 2 looks at the
design of language resources, the first step in the application of language
technology. The following chapters study specific humanities-related
research problems and show how to design quantitative research
procedures to address them. We believe that an understanding of how
to design a research process in language technology is one of the key
steps to understanding its overall logic. We do not, however, explain
the technical implementation of the research procedures discussed
throughout the book.4 Thanks to the development of computing
tools in popular programming languages, such as Python and R, many
4 The Python implementation can be found in the following github repository: https://
github.com/toth12/language-technology-humanities.
of the technological procedures presented here have been (at least
partially) automated, and their implementation can be learnt by following
excellent on-line tutorials and manuals. But what is difficult to learn
from on-line resources is how language technology is related to existing
research goals and practices in the humanities. We draw on both complex
and simple examples to illustrate this. Sometimes these illustrative
examples will be simplistic; we call these ‘toy examples’. Despite their
simplicity, we believe that readers can grasp otherwise highly complex research
problems and procedures through them.
Second, we highlight both the potential and the limitations of
language technology. We believe that the logic of a technology can be
understood if one is aware of what that technology can and cannot
resolve. A thorough critical understanding is crucial to use technology in
an innovative way.
Third, we will return again and again to the two gaps described
above. Resolving the conceptual gap involves a process that is similar
to translation. In order to make use of language technology,
humanities researchers have to express, or more technically speaking
operationalize, their research questions and problems in ways that are meaningful
from the language technology perspective. This translation process will
be demonstrated through various applied examples. Similarly,
addressing the gap between quantitative and qualitative views needs a
translation process. Highly complex mathematical procedures need humanistic
analogies so that their results can inform qualitative research
problems. Throughout the book we will attempt to establish such analogies.
Although these might sound simplistic to readers trained in mathematics,
we believe that our simple and accessible explanations will enable readers
to build a more solid understanding of language technology.
With language technology playing a pivotal role in the discovery
and analysis of textual data, this book offers an accessible overview of
the main topics that can be considered under the umbrella term of
language technology: corpus linguistics, computational linguistics,
natural language processing, and text mining. Our aim is to focus on those
aspects that are relevant to a readership of humanists. To keep this
volume agile and easy to handle, some topics have been left out of its
scope. For example, sentiment analysis is only briefly touched on, and we
have not been able to cover many other important areas, including
stylometrics, geospatial analysis, and authorship attribution. Space constraints
also mean that many details concerning the topics covered were
omitted. However, we have aimed to provide the basic information readers
need to explore further the themes that are of particular interest to them.
References
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. Sebastopol, CA: O’Reilly.
Gries, S. T. (2009). Quantitative Corpus Linguistics with R. New York, NY and
Abingdon: Routledge.
Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice.
Oxford: Oxford University Press.
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History.
Champaign, IL: University of Illinois Press.
Jockers, M. L. (2014). Text Analysis with R for Students of Literature. New York,
NY: Springer.
Moretti, F. (2015). Distant Reading. London: Verso.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts.
San Rafael, CA: Morgan and Claypool.
CHAPTER 2
Design of Text Resources and Tools
Abstract This chapter guides the reader through the key stages of
creating language resources. After explaining the difference between
linguistic corpora and other text collections, the authors briefly introduce
the typology of corpora created by corpus linguists and the concept of
corpus annotation. Basic terminology from natural language processing
(NLP) and corpus linguistics is introduced, alongside an explanation of the
main components of an NLP pipeline and tools, including pre-processing,
part-of-speech tagging, lemmatization, and entity extraction.
Keywords Corpus · Text collection · Metadata · Annotation ·

Natural Language Processing (NLP) · Pipeline · Tool · Part-of-speech
tagging · Lemmatization
2.1 Text Resources in the Humanities

This chapter will guide you through the key stages of creating and using
text resources and tools. We use the term text resource to refer to
collections of texts of various kinds, as well as other types of text-based
resources encountered in the humanities. The key difference here is that
collections typically contain running text, often organized into sections,
books, and so on, while other text-based resources display text content
which is not necessarily a connected piece of work. For example, the first
© The Author(s) 2020 7

B. McGillivray and G. M. Tóth, Applying Language Technology
in Humanities Research, https://doi.org/10.1007/978-3-030-46493-6_2
edition of the Encyclopaedia Britannica1 was digitized by the National
Library of Scotland and is organized into volumes and pages, hence
we call it a text collection. On the other hand, resources like historical
maps can contain textual parts which are not in the form of running text.
Some of the techniques that we cover here, such as lemmatization or
semantic analysis, will also be suitable for exploring such resources.
Many text collections encountered in the humanities are not in digital
form, and a large proportion of digital humanities research focuses on the
process of digitizing such texts and providing them to the scholarly
community. The focus of this book is on texts in digital form, as this is a
precondition for the type of computer processing that we are concerned with.
A special category of text collections is that of linguistic corpora,
which by definition are designed with the specific purpose of studying
human language. But text analysis can reveal important patterns about
society and culture, well beyond questions of linguistic interest. For
example, finding occurrences of place names, personal names, or
mentions of events in texts, and counting those instances both independently
and in relation to other entities can help us access aspects of the content
at a scale that is not reachable by close reading. Purely linguistic units
like patterns of use of modal verbs (like must, should, or ought to in
English) can be used to study abstract concepts such as obligation in a
historical period. Therefore, the experience of corpus research, including
approaches to corpus design and creation, is helpful as a way to draw rich
information from texts.
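As a rough illustration of how a purely linguistic unit can be counted at scale, the following Python sketch computes how often the modal verbs named above occur per 1,000 word tokens. The function name modal_rate, the simple regular-expression tokenizer, and the two invented period samples are assumptions made for this example; a real study would work on a lemmatized and part-of-speech-tagged corpus.

```python
import re

MODALS = ("must", "should", "ought to")  # the English modals named above


def modal_rate(text):
    """Return modal occurrences per 1,000 word tokens in `text`."""
    lowered = text.lower()
    n_tokens = len(re.findall(r"[a-z']+", lowered))
    n_modals = sum(
        len(re.findall(r"\b" + re.escape(m) + r"\b", lowered)) for m in MODALS
    )
    return 1000 * n_modals / n_tokens if n_tokens else 0.0


# Invented samples standing in for texts from two periods.
docs_by_period = {
    "1650s": "You must obey. Every subject ought to serve and must pay.",
    "1750s": "One should consider the matter freely and decide.",
}
for period, text in docs_by_period.items():
    print(period, round(modal_rate(text), 1))
```

Normalizing by text length matters here: raw counts would mostly reflect how much text survives from each period rather than how strongly obligation is expressed.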
Over the past decades, linguists have dedicated major efforts to
creating corpora and developing ways of enriching them with annotation.
These efforts have the potential to benefit other humanities disciplines.
For example, if positive and negative adjectives are annotated in a novel,
then we can analyse the sentiment of the text, and relate it to different
characters to answer questions such as “Is character X associated with
negative sentiment and how does this change throughout the novel?” In
this chapter we describe a typology of linguistic corpora and some of their
most useful features. We focus on how corpus design questions such as
balance and representativeness may impact research outcomes. We also
explain the workings of annotation, stressing the benefits it can bring to
1 https://digital.nls.uk/encyclopaedia-britannica/archive/144133900.
humanities scholarship, and discuss the challenges that text resources of
interest to humanists pose for the design and creation phases.
Regarding tools to manipulate and enrich texts, we will borrow some
basic terminology from the fields of natural language processing (NLP)
and corpus linguistics, and explain the main concepts behind processes
such as tokenization, part-of-speech tagging, lemmatization, and so on.
The main focus is on how these tools are needed in humanities research
and can guide the exploration and close reading of texts.
2.1.1 Text Resources and Corpora

Here we use the term text resource broadly to refer to any resource
containing some text. Corpora (plural of corpus) are a special type of text
resource because they are collections typically created by linguists to
answer specifically linguistic questions. In fact, a branch of linguistics
called corpus linguistics emerged in the second half of the twentieth
century to define the characteristics of corpora, and to create and use them
in linguistic studies. For an overview of the history of corpus linguistics,
see McEnery and Hardie (2013).
There are several definitions of what a corpus actually is. One of the
most famous definitions is by Sinclair (2005): a corpus is “a collection of
pieces of language text in electronic form, selected according to external
criteria to represent, as far as possible, a language or language variety as
a source of data for linguistic research”. Along similar lines, Xiao et al.
(2006) say that “a corpus is a collection of machine readable authentic
texts (including transcripts of spoken data) which is sampled to be
representative of a particular language or language variety”. Both these
definitions name the main features of corpora: electronic format, the fact
that they contain naturally occurring language, and the fact that they
are meant to represent a language or a part of it (technically called
“variety”). For example, the corpus of the frequency lexicon of spoken Italian
(Bellini and Schneider 2003–18) contains 469 texts that are transcriptions
of lectures, TV and radio programmes, and spoken interactions between
people. This corpus was used to create the first frequency dictionary of
spoken Italian, which gives information about how frequently words are
used in this language. The aim was to support scholars studying the
language variety of spoken Italian. To realize this goal the corpus
compilers decided to focus on four cities (Milan, Florence, Rome, and Naples)
deemed representative of the broad range of features of spoken Italian.
One important difference between humanities text collections and
linguistic corpora is that the former are typically not created for the purpose
of linguistic studies and therefore do not aim at being representative of a
language variety. For example, the Darwin Correspondence Project2 has
published the full texts of more than 9000 of Charles Darwin’s letters.
In this project the letters have been collected into a resource available to
the community of historians of science and other scholars interested in
analysing Charles Darwin’s views and entourage, and more broadly the
creation of his scientific network, his impact on the scientific discourse of
his time, and his legacy. However, the Darwin Letters are not meant to
be representative of the English language or the scientific language of the
time. Many of the text technology tools and terminology developed by
corpus linguists are very useful when analysing and processing the texts
of a non-linguistic project like the Darwin Correspondence Project, and
we will describe such tools and terminology in the next sections.
2.1.2 Data and Metadata

We have seen that text resources can contain text in various forms. Let
us take the example of the Hartlib Papers,3 which contain the full-text
transcription (as well as facsimile images) of the manuscripts of the corre-
spondence of Samuel Hartlib (c.1600–62), a seventeenth-century ‘intel-
ligencer’ and man of science. The texts of these letters reveal interesting
insights into the topics talked about in Hartlib’s circle. However, for
many research questions it is of critical importance to scholars to know
a range of other attributes in addition to the texts of the letters, such as
the library subject header, the year in which the letter was written, who
wrote it, and its addressee, gender, location, and so on. This allows us to
investigate, for instance, how many women corresponded with Samuel
Hartlib, and whether this number changed over time. Together, all
these features about the context of the text are referred to as metadata.4
Combining the text data with the metadata makes it possible to answer
even more questions, such as: how did the topics of the letters change
over time? Did Hartlib use a different style when addressing certain
personalities? Does the length of the letters tend to change over time?
2 https://www.darwinproject.ac.uk.
3 https://www.dhi.ac.uk/hartlib/context.
4 We will follow the Oxford Dictionaries in using metadata as a mass noun and data as a
plural noun.
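As a minimal sketch of how text data and metadata can be queried together, the Python snippet below counts letters written by women per decade and computes the mean letter length per decade. The records, field names (year, author_gender, text), and values are invented for illustration and are not taken from the Hartlib Papers.

```python
from collections import Counter

# Hypothetical letter records: field names and values are illustrative only.
letters = [
    {"year": 1639, "author_gender": "M", "text": "Sir, I have received your papers."},
    {"year": 1641, "author_gender": "F", "text": "I thank you for the book."},
    {"year": 1648, "author_gender": "F", "text": "The garden prospers and the bees thrive this year."},
    {"year": 1652, "author_gender": "M", "text": "News from Amsterdam."},
    {"year": 1655, "author_gender": "F", "text": "Pray send the recipe."},
]


def decade(year):
    """Map a year to the first year of its decade, e.g. 1648 -> 1640."""
    return year // 10 * 10


# Metadata only: how many letters by women per decade?
women_per_decade = Counter(
    decade(rec["year"]) for rec in letters if rec["author_gender"] == "F"
)


def mean_length_per_decade(records):
    """Combine text and metadata: mean number of whitespace-separated words per decade."""
    totals = {}
    for rec in records:
        d = decade(rec["year"])
        count, words = totals.get(d, (0, 0))
        totals[d] = (count + 1, words + len(rec["text"].split()))
    return {d: words / count for d, (count, words) in totals.items()}


mean_lengths = mean_length_per_decade(letters)
```

The first query uses metadata alone, while the second needs both the metadata (year) and the text itself, which is the combination described above.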
Metadata can be of different types, depending on the kind of infor-
mation it provides. We follow the categorization in Burnard (2005)
and distinguish between descriptive, administrative, editorial, and ana-
lytic metadata. The scope of the first two categories is the collection as
a whole, while the latter ones apply to smaller text units. Descriptive
metadata accesses external information about the context of the text,
such as its source, date of publication, and the sociodemographics of the
authors. Administrative metadata contains information about the collec-
tion itself, for example its title, its version, encoding, and so on. Editorial
metadata, on the other hand, provides information about the editorial
choices that the creators of the digital collection made with respect to
the original text, for example regarding additions, omissions, or correc-
tions. Finally, analytic metadata focuses on the structure of the text, for
example by marking the beginning and end of sections or paragraphs.
Metadata can be encoded into text resources in various ways, either
in external documentation or as part of the collections themselves. The
Text Encoding Initiative (TEI) has developed detailed guidelines for the
encoding of texts in digital format and it has become a widely accepted
standard in the digital humanities. The TEI guidelines specify, among
other things, how the metadata of a text should be displayed in what is
known as the TEI header (for details see TEI Consortium 2019).
As we have said earlier, metadata combined with text data offers the
widest scope for insightful ways to explore texts. Moreover, the texts
themselves can be enriched via annotation to optimize the implicit lin-
guistic information they contain and make it usable for large-scale anal-
yses. Let us imagine that we have access to a large collection of digitized
newspapers and we are interested in analysing the level of international
relations exemplified in this collection. Knowing the geographical ori-
gin of each newspaper is of primary importance, but it is not sufficient
because a newspaper article may talk about a location which is differ-
ent from its place of publication. Hence, we would want to conduct
an in-depth search of the texts to find, for example, instances of place
names. This can be a very time-consuming (or sometimes impossible)
process if we need to read all the articles. Without good disambigua-
tion, we may have to ignore many instances of potentially irrelevant hits
while at the same time missing a high number of relevant hits. For exam-
ple, Paris is the name of the French capital but is also the name of a
city in Texas, and being able to distinguish the two means that we can
know whether a particular mention refers to international relationships
with France or the United States. Moreover, Paris can also be a per-
son’s name, and at the same time the city can be referred to in different
ways (e.g., ‘the City of Lights’), so again being able to disambiguate the
usages of this name in context is very useful.
As noted by McEnery and Wilson (2001, p. 32), annotation makes
the linguistic information in a text computationally retrievable, thus ena-
bling a wide range of searches that can be performed in a manual, auto-
matic, semi-automatic, or crowd-sourced way, depending on whether
humans, computers, a combination of humans and computers, or groups
of humans are responsible for it. For a detailed overview of linguistic
annotation, see Jenset and McGillivray (2017, pp. 99 ff.). In Sect. 2.4
we will see different types of linguistic annotations and how they can be
relevant to humanities research.
2.2 Corpus Design and Creation

This section will guide you through the decisions involved in designing
and creating a corpus for humanities research. We will borrow most of
the terminology from corpus linguistics, but will also discuss what should
be adapted to the specific needs of humanities research. We will cover
issues of availability of the data, representativeness, and their impact on
the research outcomes. Finally, we will walk you through a use case, an
Ancient Greek corpus built for the purpose of studying how Ancient
Greek words change their meaning over time.
2.2.1 Designing a Text Resource

In humanities research it is very common to start from a question and
then search for the best evidence to answer it. In other cases, we may
have already identified an existing resource that we are interested in,
for example an archive, and we want to use it for research. To take an
example, the National Library of Scotland has recently made available
the first ten editions of the Encyclopaedia Britannica in digital form.
This is an impressive resource which allows us to explore a range of ques-
tions regarding the composition of the text, the differences between
editions, the various topics covered in it, the relationship between text
and images, and many more. But if we wanted to investigate themes that
relate to the wider historical context of this work or comparisons with
other encyclopaedias from different eras or geographical regions, for
example, we would need to expand our evidence base to other resources.
So we would find ourselves in the position of assembling a suitable cor-
pus. In this section we will explore the steps involved in designing a text
resource in a way that lets us answer the questions we care about.
Let us imagine, for example, that we are interested in analysing the
spread of diseases in France in the nineteenth century. What are the best
sources to address this topic? Within the limits of our time and resources,
we should choose the available texts that allow us to best tackle the task
in question. How to realize this in practice? A useful starting point is to
consider the features of the texts we would need to collect. Some helpful
criteria for designing the resource can be borrowed from corpus linguis-
tics, where corpora can be categorized in the following ways:
• By medium: does the corpus contain only text, speech, video mate-
rial, or is it mixed?
• By size: does the corpus contain a static snapshot of a language vari-
ety (static corpus) or is it continually updated to monitor the evolu-
tion of language (monitor corpus)?
• By language: is the corpus monolingual or multilingual? If it is mul-
tilingual, have its parts been aligned (parallel corpus)?
• By time: does the corpus cover a language variety in a specific period
without considering its time evolution (synchronic corpus) or does
it focus on the change of a language variety over time (diachronic
corpus)?
• By purpose: was the corpus built to describe the general language
(like contemporary spoken English) or a special aspect of it (like the
language of medical emergency reports)?
Of course, some of these criteria may co-exist, so that, for instance, we
may create a collection of health reports (text) in French (monolingual)
covering the nineteenth century (diachronic). But how can we make sure
that our resource is good enough to investigate our question, i.e., the
spread of diseases in nineteenth-century France?
In the course of its history, the field of corpus linguistics has witnessed
a hot debate around the topic of representativeness, and corpus linguists
have developed methodologies for building balanced corpora that aim at
being representative of the language under study. These methodologies
typically involve drawing a prioritized inventory of the relevant features,
for example register, region, time period, and so on; then estimating a
target size for each feature, and then assembling the corpus according
to these proportions. In the example about the spread of diseases in
nineteenth-century France, we should make sure that the texts are sam-
pled from different regions in such a way as to reflect the diversity of
France. For instance, given the prominent role of Paris, we would expect
a high proportion of reports to be from this city, but at the same time we
would want to ensure that other regions are suitably represented as well.
In addition to geographical provenance, we should account for other fea-
tures such as text type, and make sure that the texts represent a range
of different health-related texts by different roles (medical professionals,
general public, scientists, etc.).
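The step of estimating a target size for each feature can be sketched as a simple proportional allocation. In the Python example below, the function name, the region labels, and the proportions are all hypothetical; in a real project the shares would come from the prioritized inventory of features, not from this sketch.

```python
def target_sizes(total_texts, proportions):
    """Allocate a total number of texts across strata according to target shares."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    # Rounding can make the allocated total differ slightly from total_texts.
    return {stratum: round(total_texts * share) for stratum, share in proportions.items()}


# Hypothetical shares for a corpus of nineteenth-century French health reports.
region_shares = {"Paris": 0.4, "North": 0.2, "South": 0.2, "West": 0.1, "East": 0.1}
targets = target_sizes(500, region_shares)
```

The same function could be applied to a second feature, such as text type, to check that no combination of region and text type is left empty.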
As McGillivray (2014, pp. 11–13) discusses, in spite of the efforts
to build balanced corpora, representativeness remains an ideal limit and
more recently a different approach aimed at gathering the most inclusive
set of texts has gained popularity. This has led to the creation of very
large corpora containing as many texts as it is feasible to collect. This
topic has been discussed at length outside linguistics (see, for example,
Underwood 2019; Bode 2020). While not taking part in this debate
here, we believe that engaging with the issue of representativity in a crit-
ical way is useful when designing a text resource for humanities research.
In the next section we will dive deeper into the features of the text
resources typically considered in humanities research, and stress some key
differences from linguistic corpora.
2.2.2 Humanities Corpora

In this chapter we have made a distinction between humanities text
resources in general on the one hand and linguistic corpora on the
other. In Sect. 2.1.1 we saw that today’s corpora tend to be assem-
bled with the aim of including as many relevant texts as possible, even
if this means compromising on the balance between different features.
This is true especially in the case of contemporary languages, for which
the Internet can provide a huge source of born-digital texts, leading
to very large corpora whose size can be measured in billions of words.
One example is the JSI Timestamped English corpus,5 an English corpus
built from news articles gathered from RSS feeds; it is updated daily
and contains 37 billion words. Such an unrestricted approach to corpus
building, however, is not always applicable to the text resources
employed in humanities scholarship, where a potentially complex
interaction of research questions and availability of texts affects the size and
shape of the resources we can create. For example, sometimes only a few
texts or fragments have survived historical accidents and have found
their way into the collection, meaning that creating a balanced corpus is
simply not a viable option.
5 https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/.
Three important considerations to keep in mind when building a
corpus in humanities research are access, digitization, and encoding.
Gaining access to a group of texts can often be anything but straight-
forward, requiring potentially complex issues to be negotiated such as
legal questions with third parties (who might have been responsible for
the digitization, for example), and privacy or human data protection
concerns. Even when we gain access to the texts, these may need to be
digitized, as any subsequent computational processing of the type we talk
about in this volume requires them to be in digital form. Once the texts
have been digitized, or even better during the digitization step itself, the
texts should be presented in such a way as to enable their effective use in
research. In Sect. 2.1.2 we touched on the TEI guidelines, which pro-
vide a great basis for ensuring that digital texts are equipped with all
the metadata needed to place them in their historical context. Although
these topics are not the focus of this volume and therefore will not be
covered in depth, we acknowledge that access, digitization, and encoding
can have a significant impact on the decisions that follow in the research
process. In particular, the quality of the digitization can radically affect
the outcomes of quantitative analyses carried out on the texts, as shown,
for example, by Hill and Hengchen (2019).
Another challenge concerns historical texts, which are often the object
of study in the humanities and which require especially careful consid-
eration. One primary reason for this is that tools and methods devel-
oped in language technology research are still mainly concerned with
modern and well-established languages like English, but require special
adaptation when applied to historical languages (cf. Piotrowski 2012;
McGillivray 2014). Philological and interpretative issues are often of
major importance and need to be accurately incorporated in the corpus
design phase (cf. Meyer 2015). Furthermore, the lack of native speak-
ers of extinct languages or old varieties of living languages means that
we cannot rely on native speaker intuition for the annotation, and extra
layers of checks and explicit guidelines are needed to achieve good qual-
ity results. The next section will describe a concrete use case involving a
historical language, Ancient Greek.
2.3 Use Case: The Diorisis Ancient Greek Corpus

In Sect. 2.2.2 we stressed some of the features of humanities corpora,
including the specific challenges posed by historical texts. This section is
dedicated to a case study on the Diorisis Ancient Greek corpus, which is
described in detail in Vatri and McGillivray (2018) and which will give us
the opportunity to explore the process of corpus design starting from the
original research questions through a concrete example.
The Diorisis corpus was built in the context of the “Computational
models of meaning change in natural language texts” project funded by
the Alan Turing Institute (McGillivray et al. 2019; Perrone et al. 2019).
The interdisciplinary project team included scholars in classics, NLP, sta-
tistics, and digital humanities, who worked together for six months to
begin to explore the following question: how can we identify the change
in meaning of Ancient Greek words over the history of this language?
The question of meaning change (or semantic change, more precisely)
is relevant to a range of humanistic disciplines. In fact, a large part of
humanities research involves interpreting meaning in textual sources, for
example to find instances of entities or concepts based on which we can
analyse historical, cultural, and social trends, or explore the connection
between language and stylistic and geographical factors.
Words can have many meanings, and these change over time and
across registers, geography, style, etc. Let us take the example of the
Ancient Greek word mus, which can mean ‘mussel’, ‘muscle’, ‘whale’,
or ‘mouse’. If we are interested in medical terminology, how
can we find only those texts that display the medical meaning of mus?
Knowing the genre of a text will obviously help, as the medical mean-
ing is more likely to be found in medical texts, but occurrences of the
medical meaning can also be found in other texts. Historical dictionaries
usually give some examples of usage of each word meaning, but do not
attempt to give a full account of the literature, and using close reading
methods, we would need to read and record the meaning of all words
in every single text ever written, which clearly does not scale up to very
large text collections. So, having access to an annotated corpus can make
all the difference (McGillivray et al. 2019).
The project aimed to map the change in the meaning of words in the
history of Ancient Greek from the seventh century BCE to the fifth cen-
tury CE, an extremely ambitious goal. For this purpose, we had to build
the largest corpus possible. In Sect. 2.1.1 we stressed the aspiration to
representativeness. One of the important factors to keep in mind is the
role of genre in Ancient Greek semantics, so in the corpus design phase
we aimed at finding the best possible representation of Ancient Greek
genres. While scoping the genre distribution of the texts, we devised a
categorization into genre classes (such as Poetry, Narrative, or Technical)
and subclasses (such as Bucolic, Biography, or Geography).
The categorization aimed at the best possible representation of
Ancient Greek genres. The emphasis on “possible” is critical in this con-
text, as we were constrained by three main factors. First, the texts that
have survived historical accidents and have reached us are all we can hope
to obtain for Ancient Greek. Second, as new digitization was not within
the scope of the project, the number of available digital resources consti-
tuted the upper limit of what we were able to include. Third, even when
digitized editions exist, they may not be free to use and distribute, so we
sourced the texts from three openly available digital libraries (for details
see Vatri and McGillivray 2018). The corpus consists of 820 texts and it
counts 10,206,421 word tokens, making it the largest corpus of its kind
available today.
As is often the case in digital humanities projects, the texts came in
different formats, ranging from TEI XML, to non-TEI XML, HTML,
and Microsoft Word files.6 Therefore we had to allow for an initial phase
of cleaning and standardization of these formats into TEI-compliant
XML to allow further processing and analysis. Another important con-
sideration was character encoding. Greek characters can pose additional
challenges when it comes to encoding, and we found a range of options
in the sources, from Beta Code7 to UTF-8 Unicode, to HTML hexadec-
imal references. Taking the example from Vatri and McGillivray (2018),
for the Greek character ᾆ, the Beta Code is A)=|, the Unicode UTF-8
encoding is ᾆ, and the hexadecimal reference is &#x1F86;. We converted
all Greek characters to Beta Code for standardization purposes, choosing
this encoding because it makes automatic processing and retrieval easier.
6 See http://teibyexample.org/modules/TBED00v00.htm?target=markuplanguages for
an explanation of these terms.

7 https://www.tlg.uci.edu/encoding.
For example, if we want to easily find occurrences of the same word
starting with or without a capital letter and match them to a digital
dictionary, we can easily do that with Beta Code. This is because Beta Code
encodes capitalization by adding an asterisk (*) to the letter character,
so we can easily look up the capitalized and non-capitalized forms of the
same word by adding or removing the asterisk.
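The encoding options and the asterisk convention can be illustrated with a short Python sketch. The helper names below are hypothetical, and the capitalization helper is a deliberate simplification: it only removes the asterisk and ignores any differences in where diacritics are placed on capitalized forms.

```python
import html

GREEK_CHAR = "\u1f86"  # the character ᾆ discussed above

# Three views of the same character.
code_point = f"U+{ord(GREEK_CHAR):04X}"      # Unicode code point
utf8_bytes = GREEK_CHAR.encode("utf-8")      # UTF-8 byte sequence
from_reference = html.unescape("&#x1F86;")   # HTML hexadecimal reference decoded


def strip_capital(beta_form):
    """Return a Beta Code form without the asterisk that marks capitalization."""
    return beta_form.replace("*", "")


def same_word_ignoring_case(form_a, form_b):
    """Compare two Beta Code forms, ignoring capitalization (simplified)."""
    return strip_capital(form_a) == strip_capital(form_b)


# A capitalized form and its non-capitalized counterpart compare as equal.
match = same_word_ignoring_case(r"*sidw\n", r"sidw\n")
```

Because Beta Code uses only plain ASCII characters, this kind of lookup needs nothing beyond ordinary string operations.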
The format of the corpus was determined by further processing,
aimed at identifying semantic change in a computational way. This means
that, instead of one single file of running text, the corpus is organized in
several text files to enable faster programmatic access to it. Moreover, the
text is split into sentences, as a sentence is the unit of input for the com-
putational model. We marked sentence boundaries as analytic metadata
in the text.
Below is an excerpt from the text file for the work Leucippe and
Clitophon by Achilles Tatius, where we have removed the linguistic anno-
tation on lemma and morphological analysis (see Sect. 2.4 for more
details) and only included the first four words of the first sentence (hence
the ellipsis).8
<sentence id="1" location="1.1.1">
<word form="*sidw\n" id="1"></word>
<word form="e)pi\" id="2"></word>
<word form="qala/tth|" id="3"></word>
<word form="po/lis" id="4"></word>
…
</sentence>
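A sentence element of this shape can be read with Python's standard library. The snippet below is only a sketch: it embeds the four words of the excerpt as a string (leaving out the ellipsis) and makes no claim about how the project team processed the files.

```python
import xml.etree.ElementTree as ET

# The excerpt above, without the ellipsis, as a raw string so that the
# backslashes used by Beta Code are kept literally.
excerpt = r'''<sentence id="1" location="1.1.1">
<word form="*sidw\n" id="1"></word>
<word form="e)pi\" id="2"></word>
<word form="qala/tth|" id="3"></word>
<word form="po/lis" id="4"></word>
</sentence>'''

sentence = ET.fromstring(excerpt)

# Sentence-level attributes (analytic metadata).
location = sentence.get("location")

# Word forms in document order.
forms = [word.get("form") for word in sentence.iter("word")]
```

Since each sentence is a self-contained element, a program can pass sentences one at a time to a downstream model without loading the text as a single running string.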
8 In the example we can see that the XML tag <sentence> shows the beginning of the
sentence, and has the attributes id (which assigns a unique identifier to the sentence) and
location (which gives information about the passage to which the sentence belongs). Nested
inside the <sentence> tag we find a series of <word> tags, each corresponding to a word in the
sentence.
We retained analytic metadata information regarding the line, book,
chapter, or section of each sentence and whether a text chunk was a
quotation or not. We also encoded modern additions to fragmentary texts as
editorial metadata, so as to allow for their easy retrieval in case they were
relevant for subsequent analysis, but we excluded elements that were not
needed for the analysis, such as footnotes and critical apparatuses. Finally,
we encoded the text-level metadata in the TEI header. In our case, the
historical nature of the research question meant that dating information
was essential, so we added the date of composition of each text. In
addition, genre information was required to build the computational models
of semantic change. In order to make it possible to retrieve the
information about each source text, we also included a reference to the url of
the source files, and the authors and work title (descriptive metadata).
Next, we recorded administrative metadata information in the form of
names and roles of the corpus compilers. Below is an excerpt from the
TEI header of the text file for Leucippe and Clitophon. In the example we
have highlighted in bold some key tags, such as title, author, language,
genre, and subgenre which we described above, and for a full
description of each tag we refer to the TEI’s P5: Guidelines for Electronic Text
Encoding and Interchange (Version 3.6.0. Last updated on 16 July
2019, revision daa3cc0b9).9
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Leucippe and Clitophon</title>
        <author>Achilles Tatius</author>
        …
      </titleStmt>
      …
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="grc">Greek</language>
      </langUsage>
      <creation>
        <date>120</date>
      </creation>
    </profileDesc>
    <xenoData>
      <genre>Narrative</genre>
      <subgenre>Novel</subgenre>
    </xenoData>
  </teiHeader>
9 https://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html.
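Header metadata of this kind can be extracted with a few lines of Python. The sketch below assumes a well-formed version of the excerpt: the elided parts are left out and a closing </TEI.2> tag is added so that the string parses. The dictionary keys are our own choice for the illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the header excerpt (ellipses dropped, root element closed).
header_xml = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Leucippe and Clitophon</title>
        <author>Achilles Tatius</author>
      </titleStmt>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="grc">Greek</language>
      </langUsage>
      <creation>
        <date>120</date>
      </creation>
    </profileDesc>
    <xenoData>
      <genre>Narrative</genre>
      <subgenre>Novel</subgenre>
    </xenoData>
  </teiHeader>
</TEI.2>"""

root = ET.fromstring(header_xml)


def text_of(path):
    """Return the text of the first element matching path, or None if absent."""
    node = root.find(path)
    return node.text if node is not None else None


metadata = {
    "title": text_of(".//title"),
    "author": text_of(".//author"),
    "language": root.find(".//language").get("ident"),
    "date": int(text_of(".//creation/date")),
    "genre": text_of(".//genre"),
    "subgenre": text_of(".//subgenre"),
}
```

Collecting such dictionaries for every file gives a table of text-level metadata that can be filtered by date or genre before any analysis of the text itself.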
We have seen how we designed the Diorisis corpus based on our original
research question on semantic change. According to the corpus terminology
introduced in Sect. 2.2.1, it is a diachronic monolingual static (i.e.,
not monitor) general text corpus of Ancient Greek. Being the largest
of its kind, the Diorisis corpus is aimed at classicists and historical lin-
guists in general, and can be used to investigate a variety of aspects of the
Ancient Greek language, history, and culture. In fact, we would like to
stress that, in this case and in many others, even though a corpus is often
designed with a specific research question in mind, and its format and
content tend to be affected by this aim, it is also important to consider
its general usability beyond the original research study, and adopt stand-
ard formats as much as possible. For the Diorisis corpus, we chose the
TEI XML encoding and recorded a range of informative metadata which
make it of wider relevance.
2.4 Corpus and Natural Language Processing Tools
2.4.1 Text-Processing Pipeline

In Sect. 2.1 we introduced the concept of annotation as a way to add lin-
guistic information to the text resources we build. This operation helps
to retrieve such linguistic information at scale with the help of com-
puter programs. For example, we could be interested in identifying and
counting the names of people in our text. Because a computer can only
recognize sequences (or strings) of characters, we need to make explicit
the information that is normally implicit in the text. So, we should mark
people’s names to distinguish them from other names. This could be
done, for example, by adding a label like PN next to all instances of per-
son names (separated by a character like _): Florence_PN Nightingale
was an English statistician. This way, when we search for Florence, we
only get hits relative to people named Florence, and not references to the
Italian city.
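The effect of such a label on searching can be sketched in Python. The token format (form, separator, label) follows the example above; the function names and the second sentence of the sample are our own additions for illustration.

```python
def parse_token(token, sep="_"):
    """Split an annotated token into (form, label); label is None if absent."""
    form, _, label = token.partition(sep)
    return form, (label or None)


def find_labelled(tokens, form, label):
    """Return the positions of tokens with the given form and label."""
    return [i for i, tok in enumerate(tokens) if parse_token(tok) == (form, label)]


annotated = (
    "Florence_PN Nightingale was an English statistician . "
    "She was born in Florence ."
).split()

# Searching on the form alone finds both the person and the city.
all_hits = [i for i, tok in enumerate(annotated) if parse_token(tok)[0] == "Florence"]

# Searching on form plus label finds only the person name.
person_hits = find_labelled(annotated, "Florence", "PN")
```

Without the label the two occurrences are indistinguishable to the program, since both are the same string of characters.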
Because there are various levels of linguistic analysis, and they tend to
build on each other, it is useful to think in terms of a pipeline of steps.
Text-processing pipelines usually start with the most basic level (tokeni-
zation) and then build up to the more sophisticated levels of analysis
(like the semantic or the pragmatic). This section is dedicated to the
main steps in a text-processing pipeline for humanities text resources and
aims to give a general-purpose overview that can be adapted in individual
projects depending on the research questions and resource availability.

We refer the interested reader to Manning et al. (2008) for more details
on the single pipeline steps. Here we are not covering programming
aspects, but several computer packages are available for text-processing
pipelines. Among the most popular are the Natural Language ToolKit
(NLTK)10 for Python and StanfordNLP11 for modern languages, and
Classical Languages ToolKit (CLTK) for the classical languages.
2.4.2 Pre-processing and Tokenization

To make a text ready for further analysis we can apply a series of
pre-processing steps, either using existing tools, or via custom-made
scripts. These steps include:
1. Language detection: to identify the language(s) of the text.
2. Sentence segmentation: to split the text into sentences.
3. Word segmentation: to identify words in languages that do not use
white space to separate words, such as Chinese.
4. Compound splitting: to split compounds written with no separa-
tion between the components; this is necessary in languages like
German.
5. Tokenization: to identify the tokens.
6. Punctuation removal: to remove punctuation marks.
7. Lower-casing: to lower-case the tokens.
8. Stop-word removal: to remove stop words.
9. Spelling standardization: to standardize the spelling of words.
Tokenization is the process that splits the text into units called tokens.
What counts as a token can vary depending on the specific criteria we
choose to suit our research, but it typically corresponds to what in the
common language is referred to as a word. For example, the following
passage from Shakespeare’s Hamlet contains 14 tokens including punc-
tuation marks:
I think I hear them. Stand, ho! Who’s there?
10 https://www.nltk.org.
11 https://stanfordnlp.github.io/stanfordnlp/pipeline.html.
Corpus linguistics terminology distinguishes between tokens and types:
both are sequences of characters, but tokens are the individual word
occurrences in a particular text, while a word type groups together all
tokens that have the same sequence of characters. In the previous example,
because I appears twice, both occurrences count as tokens of the
same type (I), and so the passage contains 13 types. It may seem obvious
to identify tokens simply based on white spaces, at least for those lan-
guages that use them, like English. However, this does not always work.
In the example, we have Who’s, which includes two tokens, Who and ’s,
which are not separated by a white space. On the other hand, we can
find sequences which contain a white space but which would be better
considered as one token, for example Dr Smith or New York. Different
languages adopt different systems; therefore tokenization rules will vary
depending on the language of the text.
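The token and type counts for the Hamlet passage can be reproduced with a small regular-expression tokenizer. This is a sketch for English only, not the tokenizer of any particular package: it splits off clitics such as ’s and treats each punctuation mark as a token, and it does not handle multi-word units like New York.

```python
import re

# A token is: a run of letters, OR an apostrophe followed by letters (a clitic
# such as 's), OR any single character that is neither a letter nor white space.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|['’][A-Za-z]+|[^\sA-Za-z]")


def tokenize(text):
    """Split English text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


passage = "I think I hear them. Stand, ho! Who's there?"
tokens = tokenize(passage)
types = set(tokens)

n_tokens = len(tokens)  # 14: the word I is counted twice
n_types = len(types)    # 13: both occurrences of I belong to one type
```

A tokenizer that only split on white space would instead return nine strings, such as them. and Who's, with punctuation and clitics still attached.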
During pre-processing we may also consider excluding punctuation
marks, taking care to distinguish the cases in which they are used as part
of a token (e.g., in Dott. Rossi in Italian, where the title Dott. and the
name Rossi are part of the same token).
Another useful processing step is lower-casing, which consists of
converting capital letters in tokens to lower case. For example, all
occurrences of the article the at the beginning of a sentence in English
would be capitalized as The and in some headings they would appear as
THE. Without lower-casing, we would obtain three types (the, THE,
and The), while with lower-casing we would only have one type (the).
This step is very helpful, but can also lead to some unwanted results, for
example acronyms such as CAT being confused with a common noun
like cat.
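The effect of lower-casing on the number of types, including the unwanted merging of an acronym with a common noun, can be seen in this short Python sketch with a made-up token list.

```python
tokens = ["The", "the", "THE", "CAT", "cat"]

# Without lower-casing every distinct spelling is its own type.
types_before = set(tokens)

# With lower-casing the three spellings of the article collapse into one type,
# but the acronym CAT is also merged with the common noun cat.
types_after = {token.lower() for token in tokens}
```

Whether the loss of the CAT/cat distinction matters depends on the research question, so lower-casing is best treated as an optional step.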
For some purposes, we may also want to remove certain non-content
words, usually called stop words. These include determiners (in English
the, a, etc.), prepositions (to, in, etc.), pronouns (I, you, etc.), and so
on. For example, if we are interested in analysing the language in
Shakespeare’s Hamlet, we may want to count the number of times differ-
ent types are found. Without stop word removal, the ten most frequent
types are those shown in Table 2.1.12
12 We have generated these lists from the lower-cased text in http://shakespeare.mit.
edu/hamlet/full.html.
Table 2.1 Top frequency word types in Shakespeare’s Hamlet
Type      Frequency
the       1143
and       964
to        738
of        669
i         568
you       549
a         531
my        514
hamlet    463
in        436
The first eight words in this list are all stop words, and the first con-
tent word is hamlet, perhaps not surprisingly. If we remove stop words,
we obtain the list in Table 2.2, which is more readily usable in analyses of
the content of the text.
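Frequency lists of this kind can be produced with a counter over lower-cased tokens. The Python sketch below uses an invented sentence, not the text of Hamlet, and a tiny illustrative stop-word list made up of the stop words that appear in Table 2.1; real stop-word lists are much longer.

```python
import re
from collections import Counter

# Illustrative stop-word list only (the stop words listed in Table 2.1).
STOP_WORDS = frozenset({"the", "and", "to", "of", "i", "you", "a", "my", "in"})


def type_frequencies(text, stop_words=frozenset()):
    """Count lower-cased word types, optionally skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


sample = "The king and the queen and Hamlet spoke to the king"

with_stop_words = type_frequencies(sample)
without_stop_words = type_frequencies(sample, STOP_WORDS)

top_all = with_stop_words.most_common(1)         # the article dominates
top_content = without_stop_words.most_common(1)  # a content word comes first
```

As in the tables, the ranking before stop-word removal is led by function words, and content words only surface once these are filtered out.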
Another step that is sometimes useful, particularly in the case of his-
torical texts, is spelling standardization, which involves standardizing the
many spellings that the same word can have (for example adviser and
advisor in English). Extensive research has been done on this (as well
as on OCR correction) and we refer the interested reader to Piotrowski
(2012) for an overview.
2.4.3 Stemming, Lemmatization, and Morphological Annotation

Words can appear in different morphological forms. For example, the
English verb run can take the forms run, runs, running, and ran, and
these are called inflectional forms. Sometimes we are interested in know-
ing how many times and in which contexts each of these forms appears,
but at other times we just want to reduce this variability and simply find
out all the times that any form of run appears. To do this, we need to
reduce every different form (inflection) of a word into its base form (run
in the example).
Table 2.2 Top frequency word types in Shakespeare’s Hamlet after removing stop words
Type      Frequency
hamlet    463
lord      310
king      194
horatio   158
claudius  120
polonius  119
queen     118
good      109
come      106
laertes   105
The first idea could be to simply remove the ending (technically
speaking, the inflectional suffix), which is predictable for regular forms, for
example -ed for the past simple in English. This approach can also work
to remove derivational suffixes like -ness from representativeness or
derivational prefixes such as mis- in misunderstand. We can think of various
patterns that cover a large number of inflectional and derivational affixes,
and then write a computer program that matches these patterns via rules
to find the base form of any given word form. Programs like this are called
stemmers. Examples are the Porter stemmer13 or the Lancaster stemmer
(Paice 1990) for English and the Schinke Latin stemming algorithm14 for
Latin, to mention just a few, but more exist for many other languages.15
Stemming is fast, which makes it desirable when speed is important, but
it is not always as accurate as we would like: stemming a form like
supported could result in support, but having could become hav and
stopped could become stopp.
Stemmers are widely used, for example, in search engines, so
a search for cook will return pages containing the words cooks, cooking,
cooked, and so on. On the other hand, if we want a higher level of accu-
racy because we want to study how the verb to be is used in Shakespeare’s
Hamlet, for instance, we might want to consider a more sophisticated
approach like lemmatization.
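
The rule-based idea behind stemming can be sketched in a few lines of Python. The suffix list below is a deliberately tiny, hypothetical rule set chosen for illustration; it is not the Porter or Lancaster algorithm, both of which apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer: an illustrative sketch only,
# NOT the Porter or Lancaster algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for form in ["supported", "having", "stopped", "cooks", "cooking", "cooked", "ran"]:
    print(form, "->", toy_stem(form))
```

With these rules supported becomes support, while having and stopped are cut to hav and stopp, and the irregular form ran is left untouched: blind pattern matching is quick, but it knows nothing about the words it processes.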
Lemmatization aims to reduce inflected forms to their dictionary form
(lemma). For example, lemmatization would map the irregular form ran
to the lemma run. In order for a lemmatizer to find the correct lemma
of a form in a given context, it relies on knowing its part of speech. For
example, Latin amor in amor vincit omnia (‘love conquers everything’)
should be lemmatized as the noun amor, but in other cases amor can be
lemmatized as the passive of the verb amo, ‘to love’. Lemmatization can
be introduced as part of a manual annotation of a corpus, but when
possible using off-the-shelf lemmatizers is much faster, at least for those
languages for which these exist. For example, the NLTK package contains
the WordNet lemmatizer16 for English and CLTK contains lemmatizers
for Latin and Ancient Greek.17
13 https://tartarus.org/martin/PorterStemmer/.
14 http://snowball.tartarus.org/otherapps/schinke/intro.html.
15 See for example, the list in http://members.unine.ch/jacques.savoy/clef/.
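
To make the role of the part of speech concrete, here is a minimal sketch of a lemmatizer as a lookup keyed on the pair (form, PoS). The lexicon entries and the tag names NOUN and VERB are assumptions made for this illustration; a real lemmatizer draws on a full dictionary or a trained model.

```python
# Toy lemmatizer: a lookup keyed on (form, part of speech).
# The entries are a hypothetical mini-lexicon for illustration only.
LEXICON = {
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("running", "VERB"): "run",
    ("amor", "NOUN"): "amor",  # Latin noun, 'love'
    ("amor", "VERB"): "amo",   # Latin passive form of amo, 'to love'
}


def lemmatize(form: str, pos: str) -> str:
    """Return the lemma of a form given its PoS; fall back to the form itself."""
    key = (form.lower(), pos)
    return LEXICON.get(key, form.lower())


print(lemmatize("ran", "VERB"))
print(lemmatize("amor", "NOUN"))
print(lemmatize("amor", "VERB"))
```

The same form amor yields two different lemmas depending on the PoS supplied, which is why lemmatizers are usually run after, or together with, PoS tagging. NLTK's WordNet lemmatizer follows the same pattern: its lemmatize method accepts a pos argument alongside the word.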
One step further from lemmatization consists in providing the full
morphological analysis of a form in its context. This is useful when we
want to know characteristics like the number of a noun (is it plural or
singular?) or the tense of a verb (is it past, present, or future?). The
example below is taken from the Diorisis Ancient Greek corpus intro-
duced in Sect. 2.3:
<word form="*sidw\n" id="1">
  <lemma id="94083" entry="Σιδών" POS="proper" TreeTagger="false" disambiguated="n/a">
    <analysis morph="fem nom/voc sg"/>
  </lemma>
</word>
The XML tag <word> contains the word form, and nested in it
the <lemma> tag shows the lemma (attribute “entry”) in addition to
other attributes18; inside <lemma> we find the <analysis> tag, whose
attribute “morph” contains the morphological analyses of the form, in
this case feminine nominative or vocative singular.
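
Annotation stored in this way can be read back with any XML library. The sketch below uses Python's standard xml.etree.ElementTree module on the snippet quoted above; the variable names are our own, and a real Diorisis file contains many such word elements inside a larger document.

```python
import xml.etree.ElementTree as ET

# The snippet quoted above; a raw string keeps the backslash in the form literal.
SNIPPET = r'''
<word form="*sidw\n" id="1">
  <lemma id="94083" entry="Σιδών" POS="proper" TreeTagger="false" disambiguated="n/a">
    <analysis morph="fem nom/voc sg"/>
  </lemma>
</word>
'''

word = ET.fromstring(SNIPPET.strip())
lemma = word.find("lemma")
# A lemma may carry more than one analysis, so collect them in a list.
analyses = [analysis.get("morph") for analysis in lemma.findall("analysis")]

print(word.get("form"), lemma.get("entry"), lemma.get("POS"), analyses)
```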
2.4.4 Part-of-Speech Tagging

It is often useful to know the part of speech (PoS) of a word, for
example whether it is a verb or a noun or an adjective. Traditional
grammar distinguishes eight PoS in English:
• Noun (cat, Jane)
• Verb (sit, be)
• Adjective (incredible, your)
• Adverb (very, rapidly, tomorrow)
• Preposition (in, between, on)
• Conjunction (and, because, but)
• Pronoun (you, us)
• Interjection (hi!)
The first four classes are called open because new elements can be added
to them, while the last four are called closed. For example, English
welcomes new nouns all the time (vacay, fabulosity), but very rarely new
conjunctions or prepositions.
16 http://www.nltk.org/_modules/nltk/stem/wordnet.html.
17 See https://wiki.digitalclassicist.org/Morphological_parsing_or_lemmatising_Greek_and_Latin
for an overview of tools for Ancient Greek and Latin.
18 The attribute POS contains the part of speech, TreeTagger indicates whether the lemma
was disambiguated using the part-of-speech tagger, and disambiguated gives the confidence
in the disambiguation. See Vatri and McGillivray (2018) for details.
Knowing the PoS of a word is important in many contexts. Imagine
that we want to do sentiment analysis of a series of tweets to find out
whether they express positive or negative opinions. One approach would
be to look for adjectives in those tweets, and see if they belong to a pos-
itive category (e.g., good, fantastic, awesome) or a negative one (terrible,
bleak, bad, and so on). How do we find all adjectives in a text? We can
do this by adding PoS annotation to it, and then searching for words
annotated with the PoS of interest.
The example below is taken from the first sentence of the anno-
tated version of the Brown Corpus,19 a corpus containing English texts
amounting to one million words.
<s n = "1">
<w type = "at" > The </w>
<w type = "np-tl" > Fulton </w>
<w type = "nn-tl" > County </w>
<w type = "jj-tl" > Grand </w>
<w type = "nn-tl" > Jury </w>
<w type = "vbd" > said </w>
<w type = "nr"> Friday </w>
</s>
19 http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM#bc5.
The example contains an XML tag for the sentence <s> with attribute n
for its number. Nested inside the sentence tag we find a series of word
tags <w>, each with an attribute type that holds the tag assigned to the
word form: at stands for article, np for proper noun, nn for common
noun, jj for adjective, vbd for verb in the past tense (so PoS and
morphological information on verb tense), and nr for singular adverbial
noun (so, again, PoS and morphological details combined, in this case
number information). The suffix -tl marks a word that occurs in a title,
here Fulton County Grand Jury.
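
Once a text carries PoS annotation of this kind, finding all words of a given class is a simple filter. The sketch below parses the Brown Corpus sentence quoted above with Python's standard library; the helper words_with_tag is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

# The annotated sentence quoted above.
SENTENCE = '''
<s n = "1">
<w type = "at" > The </w>
<w type = "np-tl" > Fulton </w>
<w type = "nn-tl" > County </w>
<w type = "jj-tl" > Grand </w>
<w type = "nn-tl" > Jury </w>
<w type = "vbd" > said </w>
<w type = "nr"> Friday </w>
</s>
'''

sentence = ET.fromstring(SENTENCE.strip())
# Pair every word form with its tag, dropping the padding spaces.
tagged = [(w.text.strip(), w.get("type")) for w in sentence.findall("w")]


def words_with_tag(tagged_words, prefix):
    """Return the words whose tag starts with `prefix`, e.g. 'jj' for adjectives."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


print(words_with_tag(tagged, "jj"))  # adjectives
print(words_with_tag(tagged, "nn"))  # common nouns
```

Matching on the start of the tag means that jj also finds jj-tl, so the title marker does not hide the adjective Grand.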
For some languages it is possible to do PoS annotation automatically,
and a very popular PoS tagger for which implementations are available
for several languages is TreeTagger.20 PoS taggers are programs that
assign the PoS to every token in a text, and are usually able to use the
word’s context to perform disambiguation. For example, book is a verb in
We are going to book a flight tonight, but a noun in We gave him a heavy
book. Such taggers are usually trained on a set of annotated texts, from
which they learn patterns of co-occurrence of different PoS, for example
that in English an adjective is usually followed by a noun; they can then
use these patterns to analyse new sentences.
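
The training idea can be illustrated with a toy tagger that counts, in a handful of hand-tagged sentences, which tag each word receives after each preceding tag. Everything here is an assumption made for the sketch: the training sentences, the tag names, and the fallback to NOUN for unknown words. TreeTagger and other real taggers use far richer statistical models.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training sentences.
TRAINING = [
    [("we", "PRON"), ("want", "VERB"), ("to", "PART"), ("book", "VERB"),
     ("a", "DET"), ("flight", "NOUN")],
    [("they", "PRON"), ("need", "VERB"), ("to", "PART"), ("book", "VERB"),
     ("a", "DET"), ("room", "NOUN")],
    [("we", "PRON"), ("read", "VERB"), ("a", "DET"), ("heavy", "ADJ"),
     ("book", "NOUN")],
    [("she", "PRON"), ("gave", "VERB"), ("him", "PRON"), ("a", "DET"),
     ("good", "ADJ"), ("book", "NOUN")],
]

word_tags = defaultdict(Counter)     # word -> counts of its tags
context_tags = defaultdict(Counter)  # (previous tag, word) -> counts of its tags
for training_sentence in TRAINING:
    previous = "<s>"  # marker for the start of a sentence
    for word, tag in training_sentence:
        word_tags[word][tag] += 1
        context_tags[(previous, word)][tag] += 1
        previous = tag


def tag_sentence(words):
    """Tag left to right, preferring counts that match the previous tag."""
    tagged, previous = [], "<s>"
    for word in words:
        counts = context_tags.get((previous, word)) or word_tags.get(word)
        # Crude default for words never seen in training.
        tag = counts.most_common(1)[0][0] if counts else "NOUN"
        tagged.append((word, tag))
        previous = tag
    return tagged


print(tag_sentence("we want to book a good flight".split()))
print(tag_sentence("she gave him a heavy book".split()))
```

In the first sentence book follows to and is tagged as a verb; in the second it follows an adjective and is tagged as a noun, purely on the basis of patterns seen in the training data.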
2.4.5 Chunking and Syntactic Parsing

One step further in the direction of deeper levels of linguistic analysis is
the syntactic one. Imagine that we are interested in finding passages in
which machines are seen as agents. We may start from the word machine
(we will see how to find synonyms of a word in Chapter 5). First, it
makes sense to focus on the lemma machine to capture both its singular
and its plural forms. How do we find the cases in which machine appears
to refer to an agent? In a preliminary analysis we could find when this
lemma is used as a subject of an active verb form (if we have morpholog-
ical information on the verbs), and then we could examine the types of
verbs that are found in these instances via PoS tagging. To find subjects
of verbs, we need some way to access the syntactic level of analysis.
There are two main ways to perform syntactic analysis (or syntactic
parsing). One approach is via dependency parsing, which identifies the
relations of dependency between the different elements of a sentence.
For example, in the sentence Annotation adds linguistic information to a
corpus, dependency parsing will say that adds is the predicate, and
annotation is its subject (so annotation “depends” on adds as subject),
while information depends on adds as object, and linguistic depends on
information as an attribute. Another approach to syntactic parsing,
called constituent parsing, consists in grouping the tokens into syntactic
phrases, such as noun phrase, verb phrase, prepositional phrase, and so
on. So we can think of the sentence in the example in terms of its
constituents: a noun phrase (annotation), then a verb phrase (adds
linguistic information to a corpus), which in turn contains a noun phrase
(linguistic information) and a prepositional phrase (to a corpus), which
in turn contains a noun phrase (a corpus).
20 https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
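
The two analyses of the example sentence can be written down as simple data structures, which shows how a query such as "find the subject of adds" becomes possible. This is a hand-made sketch: the relation names follow the wording of the text, the label S for the whole sentence is our assumption, and only the three dependency relations spelled out above are encoded.

```python
from typing import NamedTuple


class Dependency(NamedTuple):
    dependent: str
    relation: str
    head: str


# Dependency analysis: only the three relations described in the text.
DEPENDENCIES = [
    Dependency("Annotation", "subject", "adds"),
    Dependency("information", "object", "adds"),
    Dependency("linguistic", "attribute", "information"),
]


def dependents(head, relation):
    """Return the words attached to `head` by `relation`."""
    return [d.dependent for d in DEPENDENCIES
            if d.head == head and d.relation == relation]


# Constituent analysis as nested (label, child, ...) tuples.
CONSTITUENTS = (
    "S",  # assumed label for the whole sentence
    ("NP", "Annotation"),
    ("VP", "adds",
     ("NP", "linguistic", "information"),
     ("PP", "to", ("NP", "a", "corpus"))),
)


def leaves(tree):
    """Flatten a (label, child, ...) tuple into its list of words."""
    words = []
    for child in tree[1:]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def phrases(tree, label):
    """Collect the text of every phrase carrying the given label."""
    found = [" ".join(leaves(tree))] if tree[0] == label else []
    for child in tree[1:]:
        if isinstance(child, tuple):
            found.extend(phrases(child, label))
    return found


print(dependents("adds", "subject"))
print(phrases(CONSTITUENTS, "NP"))
```

A search for passages in which machine acts as an agent would run the same kind of subject query over every sentence of a parsed corpus.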
Syntactically annotated corpora, also known as treebanks, are very
valuable resources, and treebanks exist for a number of both living and
dead languages. For an overview of treebanks see Nivre (2008). Building
treebanks can be an extremely time-consuming process, hence programs
that perform syntactic parsing of a text (parsers) have been created for
many languages. These programs are often ‘trained’ on treebanks, in the
sense that they learn relevant patterns of association present in the tree-
banks and then are able to syntactically analyse new texts based on these
learnt patterns.
2.4.6 Named Entities

In humanities research, recognizing named entities is a very common
step in the analysis of a text. Imagine that we have a very large collec-
tion of historical newspapers and that we are interested in investigat-
ing which places these newspapers talk about. To this end, we will need
to find the place names mentioned in the texts, for example York or
Newbury. Place names, or toponyms, are one example of named entities,
alongside names of people and names of organizations. There are sev-
eral related tasks involving named entities: recognition, classification,
and disambiguation or linking. Named-entity recognition aims at find-
ing which elements in the text refer to a named entity; for example, in
the sentence Florence is the cradle of the Renaissance, Florence is a named
entity. Named-entity classification consists in assigning the correct
category to the named entity. For example, in the sentence above, Florence
is a toponym. Disambiguation aims at connecting the named entity with
a unique object (typically via its Wikipedia page); for example, there are
several places in the world named Florence, but the one mentioned in the
sentence above is likely to be the city in central Italy.
Several tools for performing named-entity tasks are available, includ-
ing Stanford NER (Finkel et al. 2005) and the Edinburgh Geoparser
(Grover et al. 2010), and existing tools can be improved to bet-
ter suit the needs of humanities research (cf. Erdmann et al. 2016).
Named-entity-related tasks can be especially challenging in the case of
historical texts, where the quality of the texts can vary (for example due
to OCR errors) and access to resources can be limited. For example,
an important step in toponym resolution is the creation of a gazetteer,
which is a collection of records corresponding to geographical places. A
gazetteer may include the name of a region, its location in terms of its
geo-coordinates, and other relevant information. While modern gazet-
teer options such as the NGA GEOnet Names Server21 are abundant,
historical gazetteers such as the Gazetteer of British Place Names22 are
harder to find, but critically important for historical analyses.
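
The three named-entity tasks and the role of a gazetteer can be sketched with a toy lookup. The gazetteer below is hypothetical: its identifiers, cue words, and approximate coordinates were invented for this illustration and do not come from the NGA GEOnet Names Server or any other real resource. Disambiguation here simply counts cue words in the context, which is much cruder than what tools such as the Edinburgh Geoparser do.

```python
import re

# Hypothetical mini-gazetteer: identifiers, cue words and approximate
# coordinates are illustrative, not taken from a real gazetteer.
GAZETTEER = {
    "Florence": [
        {"id": "Florence (Italy)", "lat": 43.8, "lon": 11.3,
         "cues": {"renaissance", "italy", "tuscany"}},
        {"id": "Florence (Alabama, USA)", "lat": 34.8, "lon": -87.7,
         "cues": {"alabama", "tennessee"}},
    ],
    "York": [
        {"id": "York (England)", "lat": 54.0, "lon": -1.1,
         "cues": {"england", "yorkshire"}},
    ],
}


def resolve_toponyms(text):
    """Recognise gazetteer names, classify them as toponyms, and pick the
    candidate whose cue words overlap most with the surrounding text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    context = {token.lower() for token in tokens}
    resolved = []
    for token in tokens:
        candidates = GAZETTEER.get(token, [])
        if candidates:
            best = max(candidates, key=lambda c: len(c["cues"] & context))
            resolved.append((token, "toponym", best["id"]))
    return resolved


print(resolve_toponyms("Florence is the cradle of the Renaissance"))
```

For historical texts the hard part is precisely what the sketch takes for granted: a gazetteer whose names, spellings, and places match the period under study.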
2.4.7 Other Annotation
A large part of humanities research relies on text interpretation. Access
to the annotation at the level of the meaning of texts (i.e., the semantic
level of analysis) is therefore particularly important. Semantic
annotation is often done at the level of individual word tokens, and can take
different shapes. Word-sense tagging assigns every token to a code from
a thesaurus-style dictionary, so that we can identify, for example, all the
instances of a given class like objects or emotions. The UCREL Semantic
Analysis System (Rayson et al. 2004) is one of the tools for such tagging.
Let us take the opening sentence of this section. Below is the output of
the UCREL system:
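PoS tag   Token            Semantic tags
AT        A                Z5
JJ        large            N3.2+ N5+ A11.1+
NN        part             N O M I S K
IO        of               Z
NN        Humanities       P
NN        research         X
VVZ       relies           S
II        on               Z
NN        text             Q Q
NN        interpretation   X K Q
(Only the letters of each tag are legible in this copy; the numeric parts are
shown only for A and large, whose full tags are given in the discussion of
this output.)
21 http://geonames.nga.mil/gns/html/.
22 http://gazetteer.org.uk/.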

From the full list of tags23 and their subcategories24 we can see that
the determiner A is assigned the category Z5, which is reserved for
grammatical items, while the adjective large is tagged with the codes
N3.2+ (size), N5+ (quantities), and A11.1+ (importance).
When semantically annotating a text, it is a good idea to rely on
existing resources that organize the lexicon into semantic categories. One
very widely used resource of this kind is WordNet (Miller 1995),25 which has
become a standard in computational linguistics for contemporary lan-
guages. The availability of similar lexicons for historical languages is
much more restricted (WordNet has a limited Latin version, for exam-
ple), and it can be helpful to consider automatic approaches to semantic
annotation. This will be the focus of Chapter 5.
Another important level of annotation is that of sentiment, which is
relevant to many research questions in the humanities. Imagine that we
want to find out whether a text has a positive, negative, or neutral sen-
timent, and how this changes throughout the text in relation to differ-
ent characters, regions, and so on, or we want to measure the sentiment
expressed on the Internet with respect to specific topics or views and in
relation to certain historical or political events. This task is commonly
referred to as sentiment analysis. As we know, sentiment is not always
binary, and texts can display multiple layers of meaning, and use
rhetorical devices like sarcasm and irony. All this heavily depends on the
context, which makes developing automatic methods for sentiment analysis a
very active area of research in NLP (cf. Castellucci et al. 2015, among
others).
23 http://ucrel.lancs.ac.uk/usas/semtags.txt.
24 http://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt.
25 https://wordnet.princeton.edu/.
A full overview of such methods is outside the scope of this chapter;
here we briefly review two main approaches to sentiment analysis:
semantic and machine-learning approaches. Semantic approaches
make use of sentiment lexicons which organize words (typically adjec-
tives) into positive, negative and neutral classes. Once the text has been
pre-processed (for example with tokenization and stop-word removal,
stemming or lemmatization), we check for the presence or absence of
each term of the lexicon. Then we add the polarity values of the terms to
reach a global polarity value for the text, taking into account that mod-
ifier terms (such as very, too, little) can increase or decrease the polarity
of accompanying terms, and inversion terms or negations (such as no,
never) can reverse the polarity of the terms they relate to. On the other
hand, machine-learning methods rely on collections of texts that have
been annotated for sentiment. These are then used to “train” so-called
classifiers, computer programs which assign a text to a class (for
example positive, negative, or neutral) based on its features, such as
its words and their semantics (see Chapter 5 for an introduction to this
concept).
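
The semantic (lexicon-based) approach described above can be sketched as follows. The polarity values, modifier weights, and negation list are hypothetical numbers chosen for the example, not values from a published sentiment lexicon, and the handling of modifiers and negations is intentionally crude: each one applies to the next lexicon term that follows it.

```python
# Hypothetical mini-lexicon and weights, for illustration only.
POLARITY = {"good": 1.0, "fantastic": 2.0, "awesome": 2.0,
            "terrible": -2.0, "bleak": -1.0, "bad": -1.0}
MODIFIERS = {"very": 1.5, "too": 1.5, "little": 0.5}
NEGATIONS = {"no", "not", "never"}


def polarity(text: str) -> float:
    """Sum the polarity of lexicon terms, scaled by preceding modifiers
    and reversed by preceding negations."""
    score, weight, negate = 0.0, 1.0, False
    for token in text.lower().split():
        token = token.strip(".,;:!?")
        if token in NEGATIONS:
            negate = True
        elif token in MODIFIERS:
            weight *= MODIFIERS[token]
        elif token in POLARITY:
            value = POLARITY[token] * weight
            score += -value if negate else value
            weight, negate = 1.0, False  # reset after each lexicon term
    return score


for example in ["The film was very good",
                "The film was not good",
                "A terrible, bleak ending"]:
    print(example, "->", polarity(example))
```

A positive total suggests a positive text and a negative total a negative one. The sketch cannot cope with sarcasm or irony, which is one reason why machine-learning classifiers trained on annotated texts are often preferred.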
2.5 Conclusion
In this chapter we have discussed the differences between linguis-
tic corpora, text collections, and text resources in the humanities, and
illustrated the basic steps involved in processing a text for the purpose
of further analysis. The topics covered in this chapter are the object of
very active research in the field, and the brief account we have given here
is far from complete. Moreover, not all the topics will be relevant to all
research scenarios, but knowing the terminology and being aware of
the possibilities offered by existing tools helps inform the phases of the
research process, from design to analysis. For example, it is very
important to collect the texts in such a way as to address our research
questions as accurately as possible, while also taking into account
existing standards, access and availability concerns, and the size and
format of the resources.
This way we make it possible for other researchers to reuse and possibly
further enrich our resources in the future.
Language can be analysed at different levels, from spelling variation
to its morphology, syntax, meaning, or pragmatics. Language processing
tools make these levels of analysis explicit by connecting the forms
in a text to more general labels (for example lemmas or PoS or abstract
concepts such as those in a semantic ontology). These tools do not just
help towards a linguistic analysis of the texts, but they are a way into
their content as well. For example, detecting named entities can help
us explore the characters or the places in a text, and detecting adjec-
tives can help us understand the sentiment of a text, which opens up a
whole range of possibilities for further investigation. It is worth noting,
however, that no automatic analysis system gives 100% correct results
all the time, and accuracy is one of the factors to be considered in the
research design and interpretation. In fact, evaluating text processing
tools requires careful thinking because the standard approaches used in
NLP are not always suitable for humanities research (cf. discussion in
Erdmann et al. 2019 and McGillivray et al. 2020). In the next chapter
we will use the concepts introduced in this chapter to delve deeper into
the analysis of texts.
References
Bellini, D., & Schneider, S. (Eds.). (2003–18). Banca dati dell’italiano parlato
(BADIP). Graz: Karl-Franzens-Universität Graz. http://badip.uni-graz.at.
Bode, K. (2020). Why You Can’t Model Away Bias. Modern Language
Quarterly, 81, 1.
Burnard, L. (2005). Metadata for Corpus Work. In M. Wynne (Ed.), Developing
Linguistic Corpora: A Guide to Good Practice (pp. 30–46). Oxford: Oxbow
Books. Available online from http://ota.ox.ac.uk/documents/creating/dlc/.
Accessed 16 Sept 2019.
Castellucci, G., Croce, D., & Basili, R. (2015). Acquiring a Large Scale Polarity
Lexicon Through Unsupervised Distributional Methods. In C. Biemann et al.
(Eds.), Natural Language Processing and Information Systems 2015. Lecture
Notes in Computer Science (Vol. 9103, pp. 73–86). Switzerland: Springer
International Publishing. https://doi.org/10.1007/978-3-319-19581-0_6.
Erdmann, A. et al. (2016). Challenges and Solutions for Latin Named Entity
Recognition. COLING, Association for Computational Linguistics, 85–93.
Erdmann, A., Wrisley, D. J., Allen, B., Brown, C., Cohen-Bodénès, S., Elsner,
M., et al. (2019). Practical, Efficient, and Customizable Active Learning for
Named Entity Recognition in the Digital Humanities (pp. 2223–2234).
https://doi.org/10.18653/v1/n19-1231.
Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating Non-local
Information into Information Extraction Systems by Gibbs Sampling. In
Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics (ACL 2005) (pp. 363–370).
Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S., & Ball, J.
(2010). Use of the Edinburgh Geoparser for Georeferencing Digitised
Historical Collections. Philosophical Transactions of the Royal Society A,
368(1925): 3875–3889.
Hill, M. J., & Hengchen, S. (2019). Quantifying the Impact of Dirty OCR
on Historical Text Analysis: Eighteenth Century Collections Online as
a Case Study. Digital Scholarship in the Humanities, 34, 4.
https://doi.org/10.1093/llc/fqz024.
Jenset, G. B., & McGillivray, B. (2017). Quantitative Historical Linguistics. A
Corpus Framework. Oxford: Oxford University Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
Information Retrieval. Cambridge: Cambridge University Press.
McEnery, T., & Hardie, A. (2013). The History of Corpus Linguistics. In
K. Allan (Ed.), The Oxford Handbook of the History of Linguistics. Oxford:
Oxford University Press.
McEnery, T., & Wilson, A. (2001). Corpus Linguistics: An Introduction.
Edinburgh: Edinburgh University Press.
McGillivray, B. (2014). Methods in Latin Computational Linguistics. Leiden:
Brill.
McGillivray, B., Hengchen, S., Lähteenoja, V., Palma, M., & Vatri, A. (2019).
A Computational Approach to Lexical Polysemy in Ancient Greek. Digital
Scholarship in the Humanities. https://doi.org/10.1093/llc/fqz036.
McGillivray, B., Poibeau, T., & Ruiz Fabo, P. (2020). Digital Humanities and
Natural Language Processing: Je t’aime. Moi non plus. Digital Humanities
Quarterly.
Meyer, C. F. (2015). Textual Analysis: From Philology to Corpus Linguistics.
In English Corpus Linguistics: Crossing Paths. Brill.
https://doi.org/10.1163/9789401207935_004.
Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications
of the ACM, 38(11), 39–41.
Nivre, J. (2008). Treebanks. In M. Kytö & A. Lüdeling (Eds.), Corpus
Linguistics: An International Handbook (pp. 225–241). Berlin: Mouton de
Gruyter.
Paice, Chris D. (1990). Another Stemmer. ACM SIGIR Forum, 24(3), 56–61.
Perrone, V., Palma, M., Hengchen, S., Vatri, A., Smith, J., & McGillivray, B.
(2019, August 2). GASC: Genre-Aware Semantic Change for Ancient Greek.
In Proceedings of the 1st International Workshop on Computational Approaches
to Historical Language Change 2019, ACL 2019. Florence, Italy.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts.
Synthesis Lectures on Human Language Technologies. San Rafael, CA:
Morgan and Claypool Publishers.
Rayson, P., Archer, D., Piao, S. L., & McEnery, T. (2004, May 24). The
UCREL Semantic Analysis System. In Proceedings of the Workshop on Beyond
Named Entity Recognition Semantic Labelling for NLP Tasks in Association
with 4th International Conference on Language Resources and Evaluation
(LREC 2004) (pp. 7–12). Lisbon, Portugal.
Sinclair, J. (2005). Corpus and Text—Basic Principles. In M. Wynne (Ed.),
Developing Linguistic Corpora: A Guide to Good Practice (pp. 1–16). Oxford:
Oxbow Books.
TEI Consortium (Eds.). (2019). TEI P5: Guidelines for Electronic Text Encoding
and Interchange. Version 3.6.0. Last updated on 16/07/2019.
http://www.tei-c.org/Guidelines/P5/. Accessed 17 Sept 2019.
Underwood, T. (2019). Distant Horizons: Digital Evidence and Literary Change.
Chicago: University of Chicago Press.
Vatri, A., & McGillivray, B. (2018). The Diorisis Ancient Greek Corpus.
Research Data Journal for the Humanities and Social Sciences, 3(1), 55–65.
https://doi.org/10.1163/24523666-01000013.
Xiao, R., McEnery, T., & Tono, Y. (2006). Corpus-Based Language Studies.
London: Routledge.
The discomposure of the solicitor and the nervous tension of the
advocate were intruded upon at last by the constable, who had taken
rather more than three-quarters of an hour to perform his mission.
“Will you come this way, gentlemen?” he said.
They were conducted along more dark and apparently interminable
passages, up one flight of stone steps and down two others, until at
last they found themselves in a room similar to the one they had left,
except that it was larger and gloomier, smelt rather more poisonous,
and looked somewhat more funereal.
Northcote’s heart was again beating violently as he stepped over its
threshold, and his excitement was not in the least allayed when he
discovered that there was no one in it.
“If you will kindly take a seat, gentlemen,” said their guide, “Harrison
will be here in a few minutes.”
“In other words, twenty,” said Mr. Whitcomb, beginning a tour of
inspection of this dismal apartment. “These small mementoes may
have some slight interest for you, my friend,” he said to Northcote.
He drew the young man’s attention to a row of shelves placed at
right angles to the window. They were raised tier upon tier to the
height of the ceiling, and were crammed with crude staring objects. A
close inspection revealed them to be busts made of plaster of Paris.
“Why, what are these horrible things supposed to represent?” said
Northcote, with a thrill in his voice.
“These,” said Mr. Whitcomb cheerfully, “are the casts taken after
death of a number of ladies and gentlemen who have had the
distinction of being hanged within the precincts of this jail during the
past hundred years. If you will examine them closely, you will be able
to observe the indentation of the hangman’s rope, which has been
duly imprinted on the throat of each individual. Also, you may discern
the mark of the knot under the left ear. Interesting, are they not? The
official mind is generally able to exhibit itself in quite an amiable light
when it stoops to the æsthetic.”
“I call it perfectly devilish,” said Northcote, shuddering with horror.
“They must have quite a peculiar scientific interest,” said Mr.
Whitcomb, “for each lady or gentleman who may chance to enter this
apartment to consult his or her legal adviser. Are you able to
recognize any of these persons of distinction? If I am not mistaken,
the elderly gentleman on the third row on the right towards the door
is no less an individual than Cuttell, who poisoned a whole family at
Wandsworth. High-minded and courteous person as he undoubtedly
was, I must say Cuttell certainly looks less outré now he is dead, and
more in harmony with his surroundings, than when he entered this
room, and asked me in a mincing tone, with all the aitches
misplaced, whether in my opinion any obstacle would be raised
against his getting his evening clothes out of pawn, as he desired to
wear them in the dock during his trial.”
“For the love of pity, spare me!” cried Northcote, pressing his fingers
into his ears, “or I shall run away.”
“The gentleman with the protruding lip on the second shelf towards
the window is, unless my eyes deceive me, one Bateman, who
slaughtered his maiden aunt with a chopper and buried her in a drain
—”
Northcote spared himself further details in the history of Mr. Bateman
by laying violent hands upon his counterfeit presentment, and hurling
it with terrific force against the iron window bar, whence it fell to the
floor in a thousand pieces.
“Upon my soul, I have a great mind to go through the lot,” he said,
livid with fury.
“Pray do so, by all means, dear boy,” said Mr. Whitcomb, with that
unction which never forsook him, “and you will find your art-loving
countrymen will avenge this outrage upon the private and peculiar
form of their culture by one day insisting that your own effigy is
placed on these historic shelves.”
XIX
THE ACCUSED
Renewed assaults upon these interesting objets d’art were averted
by sounds outside in the corridor. Northcote imposed a superhuman
control on all his faculties that his agitation might be restrained, when
the door opened and two shadowy figures, barely visible at first,
crept silently into the darkness of the room.
The two figures were those of women. By the time Northcote had
evoked a sufficient force of will to meet their outline, the one that first
encountered his glance was so brutalized and repulsive that his eyes
were detained with a fascinated sense of horror. It belonged to a
creature that was degraded, squat, coarse, insensitive. He felt
almost the same reluctance in approaching it as he would a cobra.
She, however, was not the one whom Mr. Whitcomb, with all the
polished readiness of the thoroughgoing man of the world, had
advanced to meet, and to whom he had held out his hand. The
young man heard with stupefaction, while his own gaze remained
riveted to the features of the sibyl, the bland and courtier-like tones
of the solicitor caressing and paying homage to a figure in the
background, a figure which was still and silent, which he could not
see.
This person, however, had no interest for Northcote; she was so
obviously the female warder who had accompanied the murderess.
One so characterless, so formless, could not be said to exist in the
presence of this detaining horror, whose personality thickened, as
with pestilence, the noisome air of the room. And it was this obscene
life that he had pledged himself to save!
Strangely, this blunt fact did not dominate his consciousness in the
manner it must have done one of a less alert perception. For with a
perversity that transcended the will, at this moment his thoughts
were overspread by the comedy that was being enacted by the
suave lips of the solicitor. The harmonious stream of mellow
commonplaces that Mr. Whitcomb was pouring into the ear of the
shrinking official nonentity who kept in the background accosted his
sense of the comic with a kind of lugubrious irony. With a critical
detachment which even startled himself, he seemed to awake to the
fact that he was standing outside his milieu, that he was witnessing a
drama within a drama; and he found himself in possession of the
singular reflection that here was a robust yet delicate adumbration of
the farcical which would make the fortune of a writer for the stage.
For there was something indescribably ludicrous in the rich voice of
the solicitor enunciating his own private opinions upon the weather,
the state of trade, the inconvenience of winter and its bearing upon
the perennial problem of the unemployed, when the grotesque horror
which dominated the room was at his elbow, emitting the glances of
a venomous snake.
Suddenly Northcote heard Mr. Whitcomb call his name.
“Come here, Mr. Northcote; I want to introduce you.”
In a hazy, stupefied manner the young man obeyed.
“Mrs. Harrison,” said the solicitor, “allow me to present my friend Mr.
Northcote. I feel sure you will find a friend in him too.”
The advocate grew aware that a weak, nerveless hand was resting
in his, but his eyes were still riveted on the face of the ghoul.
“Say something, you fool, and play up a bit,” said the solicitor’s calm
voice in his ear.
“Er—a nice day, Mrs. Harrison,” said the young man, without
knowing a word he was uttering.
“Yes,” said a hesitating voice, which by no possibility could have
proceeded from the tightly closed lips of the creature whom his gaze
was devouring.
The words broke the illusion at a blow. The brutalized countenance
under whose dominion he had fallen was that of the female warder.
The person with whom the solicitor had been conversing with such
cheerful volubility, to whom he was now himself speaking, was the
poisoner, the cold-blooded denizen of the curb and the gutter. He
drew his hand away quickly, with an involuntary emotion, from those
hot, flabby, and damp fingers that he still detained.
“I know, I know,” the woman seemed to breathe, as though she were
interpreting an unspoken thought.
“I may tell you, Mrs. Harrison,” said the solicitor, with his well-fed
chuckle, “that if your knowledge can compare with that of this
gentleman, you are one of the wisest persons in the world. He will
tell you so himself.”
So crude a gibe had the happy effect of restoring to Northcote his
self-possession.
“My name is not known, Mrs. Harrison,” he said, with his fibres
stiffening, and his voice growing deeper and falling under control,
“but you can trust me to eke out my inexperience with a
determination to serve you to the utmost of my power.”
Northcote saw that two luminous orbs were being defined slowly in
the centre of the gloom. For an instant no reply was made to his
words, and then he was conscious that a faint voice was whispering,
“If your friend would go right away with the warder—right away to the
end of the room, then perhaps we could speak with one another here
where it is so dark.”
“Whitcomb,” said Northcote, in a low tone, “please take the warder
right up to the window at the other end, where you can see to read,
and read the Law Journal to her.”
“How d’ye do, ma’am,” said the solicitor, turning to the ghoul in his
promptest, blandest, and most musical manner. “I think it has been
my privilege to meet you before, although you may not remember
me. Is that boy of yours prospering in the police force?”
“I haven’t got a boy in the police force,” said the sibyl, in a loud,
strident tone.
“Then which of your blood relations is it, may I ask, who is connected
with the police force? I am sure you have some one.”
“I have an uncle.”
“Ah, to be sure, an uncle! But it is so easy to make a mistake on a
point of official nepotism. Come along this way, ma’am, and tell me
all about your uncle.”
XX
THE INTERVIEW
Prisoner and advocate were left together amid recesses of
impenetrable gloom in the darkest corner of the large apartment. It
seemed to enfold them, and to render the pallor of their faces almost
invisible. The eyes alone encountered those of each other, and even
these could embody no phase of meaning. A strange continence, as
sharp and clean as that of a hero of fable, had begun to cleanse the
veins of the advocate. In the presence of this stealthy thing his
nature had never seemed so fine, so valiant, so full of subtle
penetration; nor had it ever felt so girt with mastery, so completely
enamored of its own security.
“I shall know what words to speak to-morrow,” he said, in a hoarse
undertone.
“Will they not be spoken for yourself?” whispered the dismal low
voice.
“How? In what manner?”
“You will speak to make a name.”
“Also for the salvation of yours.”
“Mine does not matter; it is not my own.”
“You trust me, do you not?”
“I trust you; yet you drew your hand away so quickly when you knew
it was not the warder who was the murderess. Give it to me again.”
There was something so curious in the prisoner’s fragility, something
so strange in her cowed air, that it seemed to pervade the advocate
with the stealth of a drug. But the emotion of disgust with which he
had withdrawn his hand when first he grew conscious that he
touched her was no longer present when he offered it again. The
second time she clasped her fingers round it so that their pressure
seemed to sear his skin. It had the heat of a live coal.
In releasing his hand she let her fingers yield it so imperceptibly that
he did not know the precise point at which it had ceased to be held;
and he was afraid to make a motion of withdrawal, lest it should be
interpreted as a repetition of that which had dealt her a wound. He
tried to see her face, but in the darkness there was no lineament to
decipher.
“This is my deliverer,” he heard her breathe.
“How have you come to know it?” The advocate was devoured by an
intolerable curiosity.
“Your hands—your hands, they are so powerful; are you not so
strong?”
There was nothing in these words that the advocate had expected;
the voice, the manner of their utterance, their apparent irrelevance,
made a strange effect in his ears.
“They will not do me to death,” she said, in a tone he could hardly
hear. “I never tasted life until I was brought into prison. And you
cannot think how sweet it is to me. Everything has become so
beautiful: the birds, the trees, and the sky, and the crowds of people
and the mud of the great city.”
She clutched the hand of the young advocate with a convulsive
shudder.
“Your quietness tells me that you understand.” Her voice was
touched with ecstasy. “You do not answer or seek to console me.
You are the one I have dreamed of in prison. Where is your hand?”
Again Northcote yielded to her entreaty, this time without a sense of
repulsion.
“Yes, this is the hand that has been around me in the darkness,
when I have shuddered in my dreams.”
“It is wonderful,” said Northcote, “that you should know that you will
be able to lean upon me.”
“I know what your voice is like also, although it is so vague and
distant to me now. I know the words it will speak to-morrow, when it
asks them to be merciful. I know that all I have seen in my dreams
will take place.”
“It must be a grievous thing to go to sleep in a prison,” said
Northcote, uttering a half-formed thought without consideration of his
words. “Or perhaps it is more dreadful to awaken in one.”
“The going to sleep and the awakening are not so terrible as the
dreams that come. That in which I saw you first, in which I first heard
your voice, in which I first touched the hand that will deliver me, was
most dreadful in its nature. My weak mind fell down under it. I think I
could not live through such a vision again.”
“How strange are these visitations!” said Northcote. “How awful, how
mysterious! When did this dream come to you?”
“Last night about the hour of ten; the first time I had closed my eyes
for three days.”
Northcote recoiled with a shudder. The precision of the voice and the
power of the coincidence were overmastering.
“There is no accounting for these things,” he said, in a voice
throbbing with excitement. “At the same hour I also had a strange,
an almost terrible sort of vision.”
“Yes, my deliverer, you have been called into my life to save it—to
save that life which never had a perfect thought until it was brought
into prison. It did not know what the trees and the sky were, nor the
air and the birds; never had it heard a deep voice nor touched a
strong hand. You are he that leaped out of the vast multitude that
mocked me in my dream, he who stood up before it, and, with a
great voice that sounded like the waves of the sea, caused them all
to break and run. They grew afraid of your words and your looks,
and they fled in terror. Yes, my life has become so full of beauty and
meaning, so full of a spacious mystery, that I cannot believe it is to
be taken away.”
These words, breathed rather than spoken, sounded in the ear of
Northcote as those of a transcendent sanity. Remote as they were,
they yet appeared divinely appropriate to the time and place. But
they left only one course for him to follow. He must detach himself
from the unhappy speaker of them; he must flee her presence. Their
edge was too keen. There would be no advocacy on the morrow if
he yielded to the subtle enervation of this atmosphere. The voice
pierced him like a passion, yet his veins had grown sluggish and
heavy, as if under the influence of a drug.
XXI
THE TALISMAN WHICH TRANSCENDS EXPERIENCE
Calling the name of the solicitor, Northcote broke away abruptly
from the prisoner and left the room. It had seemed to be charged
with a pestilence. Mr. Whitcomb was soon at his side, and hastily
they wended their way up and down various flights of stone steps,
along the noisome corridors of the huge building, until daylight came
in sight once more through the doorway at the end of the passage at
which their cab was standing. Their relief was very real at being able
to breathe again the living air, fog-laden as it was.
“I don’t know how many times,” said Mr. Whitcomb, as they drove
from the portals of the jail, “on one errand and another, I have
descended into this inferno, but it never loses its power to give me
the blues.”
“I am regretting,” said Northcote, “that I did not take your advice. I
wish I had not come near it. I cannot shake off the impression it has
made. Ugh! it gets into one’s blood. I don’t know anything quite so
overpowering as the nausea of locality.”
“You are too impressionable, my son,” said the solicitor, with a furtive
smile. “You will never be able to get through life at this rate. It wants
one of some hardihood, one who is robust in each one of his five
senses, to practise law.”
“I would say,” Northcote rejoined, with a shudder, “that to be armed
for this calling each particular nerve he has got in his body must be
shod with iron.”
The solicitor laughed at so palpable a discomposure.
“What did you make of the prisoner?” he asked, suddenly. “You
appeared to find a great deal to say to one another.”
“Personally I hardly spoke a word to her,” said the young man,
seeking to gather his recollection of that strange interview.
“She appeared to find a good deal to say to you,” said the solicitor.
“In that respect you have been more fortunate than myself. I have
spoken with her three times, and I don’t think I have been able to
extract three words from her. Do you mind telling me what she said?”
“To the best of my remembrance she said nothing that could have
the least interest for anybody.”
“Tell me, what impression of her have you brought away?”
“I hardly know whether she allowed me to form one. Our
communication seemed so indirect. She kept her face in the shadow
all the time; I could not discern a feature.”
“Surely you were able to gather some sort of general idea?”
“That is the strange thing—I seem to have formed no opinion about
her. One would not have thought it conceivable that one should have
conversed with a person, dealt at least in an actual exchange of
words at close quarters, and that they should remain so null. I think I
should have been better acquainted with her had I not seen her at
all.”
“Come, my dear fellow, you can surely recall a word or two of what
she said? She is an enigma; and she is said not to have spoken six
words since she was first remanded in custody.”
“That certainly makes the volubility in which she indulged this
afternoon the more astonishing.”
“Indeed it does. Would you say that she expects an acquittal?”
“Well, now you come to mention that, I would say she does.”
“It is an extraordinary thing that they are all so sanguine. It hardly
ever seems to occur to any of them that by any possibility they can
meet with their deserts. Indeed, one might say the bigger the
criminal, the greater their confidence that they will escape.”
“I am going to ask you what opinion you have formed of her,” said
Northcote.
“It follows the lines of your own. When I have come into personal
contact with her, I have been able to make rather less than nothing
of her. At first I thought she seemed sullen, and quite reconciled to
her position, indeed, that she was too callous to care about anything;
but upon seeing her to-day, I was rather struck by the fact that her
attitude had undergone a change.”
“How long has she been in prison?”
“Nearly three months. She is an odd sort of creature—her former
associates are agreed upon that—and doubtless some sort of
change has taken place in her. I am more than ever convinced that
insanity is your line; and by this time it should not be too much to
hope that you are.”
“She will expect her liberty.”
“She will expect! My dear boy, it is when you permit yourself to talk in
this fashion that you fill one with so much distrust. Her position
entitles her to expect nothing.”
“No sort of doubt overtakes you then in regard to her guilt?”
“None. I have suggested that to you over and over again. My dear
fellow, it is as I feared; you have not permitted yourself a due
appreciation of the overwhelming nature of the evidence. I do not
see how she can hope to escape; and this is pretty plain speaking on
the part of her attorney. Just look at the array of facts—her course of
life, her purchase of the poison, the result of the post-mortem, the
presence of motive. Again and again I have felt it to be my duty to
suggest to you that Tobin would not have attempted to shake the
evidence.”
“Well, you must permit me to say that, reflect upon the question as I
will, it does not seem easy to reconcile the woman in that room with
the cold-blooded monster who will be presented to the jury.”
“That phenomenon is by no means rare. It has been my fortune to
undertake the defence of more than one finished example of moral
obliquity who has presented not the least indication of such a
condition. Besides, do you not admit that the impression that this
woman made upon you was one of absolute nullity? Were you not
unable to divine anything in regard to her?”
“Yes, that was my first feeling; but I am now confessing that after all,
in some mysterious way, she has contrived to shake these
preconceived ideas about her, now that from this distance I can view
the room and what transpired in it. I dare not say by what means she
has contrived to produce this effect; indeed, it is so subtle that I can
hardly say what it amounts to, because if I begin to recall her words
she seems almost to have admitted her guilt. Yet of one thing I am
convinced—she presented no evidence of her depravity.”
“One can easily concede the probability of that.”
“Yes, but had it been as complete as you insist, I must have seen it.”
“Pardon me, but I am afraid it does not follow. What is easier than to
hide its traces from the eyes of inexperience?”
“Have I not the talisman in my pocket which transcends experience?”
“Talisman be damned,” said Mr. Whitcomb, with a jovial brutality.
Before his companion could frame an answer to a scorn so
unconciliatory, the hansom stopped before the offices of Messrs.
Whitcomb and Whitcomb. They alighted together.
XXII
LIFE OR DEATH
The final consultation of Northcote and his client took place in the
open street in the heavily raining December afternoon, with their
backs against Mr. Whitcomb’s brass plate. The spot selected for their
last utterances on this momentous affair was incongruous indeed,
but each had grown so impatient of the other, that if their last words
were spoken here, the clash of their mental states was the less likely
to invite disaster than in a more formal council-chamber of four walls.
The robust common sense of the solicitor had never shown itself to
be more incisive than now as he stood with his back to his own door,
under a dripping umbrella, his hat pushed to the back of his head,
and his trousers turned up beyond his ankles. His twenty years of
immensely successful practice, his exact knowledge of human
nature, his ruthless worldliness, his reverence for the hard fact, stood
forth here in the oddest contrast with the somewhat “special” and
rarefied quality of this youthful advocate whom he had seen fit to
entrust with so important a case.
“It’s a pity, it’s a pity,” he brought himself to say at last, his veneer
falling off a little under the stress of his chagrin, and revealing a
glimpse of the baffled human animal beneath. “It is a serious mistake
to have made; but we have got to stand to it. You are not the man for
this class of work, to speak bluntly. You are either too deep or you
are not deep enough. But as I say, we have got to stand to it now.
My last words will be to urge you to put as good a face upon it as
you can.”
“In other words,” said Northcote, stiffening, “you will look to me to do
my best.”
“I don’t put it in that form exactly,” said the solicitor, midway between
exasperation and a desire to be courteous. “I want you fully to
appreciate that you are handling an extremely tough job, and I
merely want you to make the best of it, that’s all.”
“I will tell you, Mr. Whitcomb,” said Northcote, striving in vain to avert
the explosion that had been gathering for so long, “that if it were not
now the eleventh hour, if I had not pledged myself to this thing more
deeply than you know, if it were not a matter of life and death to me
as well as to your client, I would throw your brief back at you rather
than submit to this. It will be time enough for you to get upon your
platform when I have made a hash of everything.”
“Yes, I think you are entitled to say that,” said the solicitor impartially,
having made a successful effort to recapture his own serenity. “I
have no right to talk as I am doing; I have never done so to any one
else. I suspect you have got on my nerves a bit.”
“Yes, the whole matter throws back to the clash of our
temperaments,” said Northcote, unable to cloak his own irritation
now that it had walked abroad. “It is a pity that we ever attempted to
work together. Yet for one who envelops himself in the serene air of
reason, you are somewhat illogical, are you not? You enter the
highways and hedges in search of a particular talent; you have the
fortune to light upon it; and then you turn and rend its unhappy
possessor for possessing it.”
“As I say, my dear boy, this particular talent of yours—or is it your
temperament?—you see I am not up in these technical names—has
got on my nerves a little.”
“And your temperament, my friend, to indulge a tu quoque, is
covered with a hard gritty outer coating, for which I believe the
technical name is ‘practicality,’ which positively sets one’s teeth on
edge.”
“So be it; we part with mutual recriminations. But this is my last word.
Firmly as I believe I have committed an error of judgment, if
to-morrow you can prove that I have deceived myself, you will not find
me ungrateful. I can speak no fairer; and this you must take for my
apology. It is not too much to say that since I have come to know you
I have ceased to recognize myself.”
“I accept your amende,” said Northcote, without hesitation. “I see I
have worried you, but if I might presume to address advice to the
fount of all experience, never, my dear Mr. Whitcomb, attempt to
formulate a judgment upon that which you cannot possibly
understand.”
“After to-morrow there is a remote chance that I may come to heed
your advice. In the meantime we will shake hands just to show that
malice is not borne. Don’t forget that you will be the first called
to-morrow, at half-past ten. It is quite likely to last all day.”
The solicitor turned into his offices and Northcote sauntered along
Chancery Lane. The twilight which had enveloped the city all day
was now yielding to the authentic hues of evening. The dismal
street-lamps were already lit, the gusts of rain, sleet, and snow of the
previous night had been turned into a heavy downpour which had
continued without intermission since the morning. The pavements
were bleached by the action of water, but a miasma arose from the
overburdened sewers, whose contents flowed among the traffic and
were churned by its wheels into a paste of black mud. Northcote was
splashed freely with this thick slushy mixture, even as high as his
face, by the countless omnibuses; and in crossing from one
pavement to another he had a narrow escape from being knocked
down by a covered van.
It was in no mood of courage that the young man pushed his way to
his lodgings through the traffic and the elbowing crowds who
thronged the narrow streets. Even the mental picture that was
thrown before his eyes of this garret which had already devoured his
youth had the power to make him feel colder than actually he was.
Never had he felt such a depression in all the long term of his
privation as now in wending his way towards it laboriously, heavily,
with slow-beating pulses.
He was sore, disappointed, angry; his pride was wounded by the
attitude of his client. His self-centred habit caused him to take
himself so much for granted, that at first he could discern no reason
for this volte-face. In his view it was inconsiderate to withhold the
moral support of which at this moment he stood so much in need.
Truly the lot of obscurity was hard; its penalties were of a kind to
bring many a shudder to a proud and sensitive nature. The
patronizing insolence of one whom he despised was beginning to fill
him with a bootless rage, yet in his present state how impotent he
was before it. He must suffer such things, and suffer them gladly,
until that hour dawned in which his powers announced themselves.
That time was to-morrow—terrible, all-piercing, yet entrancing
thought! The measure of his talent would then be proclaimed. Yet all
in an instant, like a lightning-flash shooting through darkness, for the
first time the true nature of his task was revealed to him. Doubt took
shape, sprang into being. Its outline seemed to loom through the
dismal shadows cast by the lamps in the street. Who and what was
he, after all, in comparison with a task of such immensity? With
startling and overwhelming force the solicitor’s meaning was
suddenly unfolded to him.
He took himself for granted no more. He must be mad to have gone
so far without having paused to subject himself to the self-criticism
that is so salutary. How could he blame the solicitor whose eminently
practical mind had resented this inaccessibility to the ordinary rules
of prudence? Was he not the veriest novice in his profession, without
credentials of any kind? And yet he arrogated to himself the right to
embark upon a line of conduct that was in direct opposition to the
promptings of a mature judgment.
How could he have been so sure of this supreme talent? It had never
been brought to test. The only measure of it was his scorn of others,
the scorn of the unsuccessful for those who have succeeded. The
passion with which it had endowed him was nothing more, most
probably, than a monomania of egotism. How consummate was the
folly which could mistake the will for the deed, the vaulting ambition
for the thing itself!
On the few occasions, some seven or eight in all, in which he had
turned an honest guinea, mostly at the police-court, he had betrayed
no surprising aptitude for his profession. There had been times, even
in affairs so trivial, when his highly strung nervous organization had
overpowered the will. He had not been exempt from the commission
of errors; he recalled with horror that once or twice it had fallen to his
lot to be put out of countenance by his adversary; while once at least
he had drawn down upon himself the animadversions of the
presiding deity. Surely there was nothing in this rather pitiful career
to provide a motive for this overweening arrogance.
He grew the more amazed at his own hardihood as he walked along.
To what fatal blindness did he owe it that from the beginning his true
position had not been revealed to him? Where were the credentials
that fitted him to undertake a task so stupendous? What
achievement had he to his name that he should venture to launch his
criticisms against those who had been through the fray and had
emerged victorious? How could he have failed to appreciate that
abstract theory was never able to withstand the impact of
experience! It was well enough in the privacy of his garret to
conceive ideas and to sustain his faculties with dreams of a future
that could never be, but once in the arena, when the open-mouthed
lion of the actual lay in his path, he would require arms more
puissant than these.
To overcome those twin dragons Tradition and Precedent, behind
which common and vulgar minds entrenched themselves so
fearlessly, the sword of the sophist would not avail. It would snap in
his fingers at the first contact with these impenetrable hides. His
blade must be forged of thrice-welded steel if he were to have a
chance on the morrow. He had decided to promulgate like a second
Napoleon the doctrine of force, and for his only weapon he had
chosen a dagger of lath. Well might Mr. Whitcomb smile with
contempt. Where would he find himself if he dared to preach the
most perilous of gospels, if he could not support it with an enormous
moral and physical power?
For years he had dwelt in a castle which he had built out of air,
secure in the belief that he was endowed in ample measure with
attributes whose operations were so diverse yet so comprehensive,
that in those rare instances in which they were united they became
superhuman in their reach. An Isaiah or a Cromwell did not visit the
world once in an era. How dare such a one as he fold his nakedness
in the sacred mantle of the gods! It was the act of one whose folly
was too rank even to allow him to pose as a charlatan. If he ventured
to deliver one-half of these astonishing words he had prepared for
the delectation of an honest British jury, these flatulent pretensions
would be unveiled, he would be mocked openly, his ruin would be
complete and irretrievable.
Never had irresolution assailed him so powerfully. This review at the
eleventh hour of the unwarrantable estimate he had formed of
himself rendered it imperative that he should change his plans. The
opinion of others, acknowledged masters of the profession in which
he was so humble a tyro, was incontrovertible. Evidence in support
of a perfectly rational plea was provided for him, would be ready in
court. His client had demanded that it should be used. To disregard
that demand would be to rebuff his only friend, one of great influence
who had been sent to his aid in his direst hour. And it was for nothing
better than a whim that he was prepared to yield his all. No principle
was at stake, no sacrifice of dignity was involved. That which his
patron had asked of him was so natural, so admirably humane, that
the mere act of refusal would be rendered unpardonable unless it
were vindicated by complete success. No other justification was
possible, not only in the eyes of himself and in those of his client, but
no less was exacted of him by the hapless creature whose life was in
his keeping.
Stating it baldly, let him fail in the superhuman feat which had been
imposed upon him by a disease which he called ambition, and this
wretched woman would expiate his failure upon the gallows. Had
any human being a right to incur such a penalty, a right to pay such a
price in the pursuit of his own personal and private aims? The middle
course was provided for him. It would deliver the accused and
himself from this intolerable peril; it opened up a path of safety for
them both.
Already he could observe with a scarifying clearness, that here and
now, at the eleventh hour, he must defer to the irresistible impact of
the circumstances. The risk was too grave; he was thrusting too
cruel a responsibility upon his flesh and blood. He must hasten to
make terms with that grossly material world of the hard fact which he
scorned so much. He must submit to one of those pitiful

ISBN 978-3-030-46492-9 ISBN 978-3-030-46493-6 (eBook)

© The Editor(s) (if applicable) and The Author(s) 2020

Cover illustration: © Melisa Hasan

Cambridge, UK Barbara McGillivray

1 Introducing Language Technology and Humanities

2 Design of Text Resources and Tools

5 Word Meaning in Texts

6 Mining Textual Collections

6.2.3 Representation as a Geometric Space

7 The Innovative Potential of Language Technology

Fig. 3.1 Relative document frequency of lemma forsake in the EEBO

Fig. 6.4 Representation of three English proverbs in a feature space

Table 2.1 Top frequency word types in Shakespeare’s Hamlet 23

Introducing Language Technology

Abstract This chapter outlines the relevance of language technology

Keywords Big data · Distant reading · Textual resource ·

1.1 Why Language Technology for the Humanities?

© The Author(s) 2020 1

as a new academic field has contributed to the proliferation of research

1.2 Structure of the Book

However, detecting them in large textual resources is a tedious (or some-

of the technological procedures presented here have been (at least

Design of Text Resources and Tools

Keywords Corpus · Text collection · Metadata · Annotation ·

2.1 Text Resources in the Humanities

edition of the Encyclopaedia Britannica1 was digitized by the National

humanities scholarship, and discuss the challenges that text resources of

2.1.1 Text Resources and Corpora

One important difference between humanities text collections and lin-

2.1.2 Data and Metadata

2.2 Corpus Design and Creation

2.2.1 Designing a Text Resource

relate to the wider historical context of this work or comparisons with

Of course, some of these criteria may co-exist, so that, for instance, we

typically involve drawing a prioritized inventory of the relevant features,

2.2.2 Humanities Corpora

2.3 Use Case: The Diorisis Ancient Greek Corpus

6 See http://teibyexample.org/modules/TBED00v00.htm?target=markuplanguages for

an explanation of these terms.

For example, if we want to easily find occurrences of the same word

<sentence id="1" location="1.1.1">

We retained analytic metadata information regarding the line, book,

historical nature of the research question meant that dating information

2.4 Corpus and Natural Language Processing Tools

2.4.1 Text-Processing Pipeline

projects depending on the research questions and resource availability.

2.4.2 Pre-processing and Tokenization

1. Language detection: to identify the language(s) of the text.