Applying Language
Technology in
Humanities Research
Design, Application, and
the Underlying Logic
Barbara McGillivray
Gábor Mihály Tóth
Barbara McGillivray
Faculty of Modern and Medieval Languages
University of Cambridge
Cambridge, UK
The Alan Turing Institute
London, UK

Gábor Mihály Tóth
Viterbi School of Engineering, Signal Analysis Lab (SAIL)
University of Southern California
Los Angeles, CA, USA
This Palgrave Macmillan imprint is published by the registered company Springer Nature
Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The idea of this book goes back to the HiCor Research Network, founded
and led by us together with Gard Jenset and Kerry Russell. HiCor was
a research group of historians and corpus linguists at the University of
Oxford active between 2012 and 2014. It was generously supported by
TORCH (The Oxford Research Center for the Humanities). In addition
to organizing lectures and a workshop, HiCor also aimed to disseminate
language technology among historians and, more generally, humanists.
For instance, we organized several courses on Language Technology and
Humanities at the Oxford DH Summer School, which inspired this book.
We are grateful to Gard Jenset who helped to shape the initial ideas
underlying this book. We also thank our employers and funders for pro-
viding us with time and funding to accomplish the project.1
We have contributed equally to the design of the book. We have joint
responsibility for Chapter 1. Barbara McGillivray has primary responsi-
bility for Chapters 2 and 5. Gábor Tóth has primary responsibility for
Chapters 3, 4, 6, and 7.
Engineering. Barbara McGillivray was supported by The Alan Turing Institute under
EPSRC grant EP/N510129/1.
Contents
3 Frequency
3.1 Concept of Frequency
3.2 Application: The “Characteristic Vocabulary” of the Moonstone by Wilkie Collins
3.3 Application: Terms with ‘Turbulent History’ in the Early English Books Online
3.4 Conclusion
References
4 Collocation
4.1 The Concept of Collocation
4.2 Probability of a Bigram
4.3 Observed and Expected Probability of a Bigram
4.4 Strength of Association: Pointwise Mutual Information (PMI)
4.5 Strength of Association: Log Likelihood Ratio
4.6 Application: What Residents of Modern London Complained About
4.7 Conclusion
References
Index
CHAPTER 1
1 https://www.dhoxss.net/.
2 https://dhsi.org.
3 http://esu.culintec.de/.
1 INTRODUCING LANGUAGE TECHNOLOGY AND HUMANITIES 3
4 The Python implementation can be found in the following GitHub repository: https://github.com/toth12/language-technology-humanities.
also mean that many details concerning the topics covered were omit-
ted. However, we aimed to provide basic information to further explore
themes that are of particular interest to readers.
References
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with
Python. Sebastopol, CA: O’Reilly.
Gries, S. T. (2009). Quantitative Corpus Linguistics with R. New York, NY and
Abingdon: Routledge.
Hockey, S. (2000). Electronic Texts in the Humanities: Principles and Practice.
Oxford: Oxford University Press.
Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History.
Champaign, IL: University of Illinois Press.
Jockers, M. L. (2014). Text Analysis with R for Students of Literature. New York,
NY: Springer.
Moretti, F. (2015). Distant Reading. London: Verso.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts.
San Rafael, CA: Morgan and Claypool.
CHAPTER 2
Abstract This chapter guides the reader through the key stages of
creating language resources. After explaining the difference between lin-
guistic corpora and other text collections, the authors briefly introduce
the typology of corpora created by corpus linguists and the concept of
corpus annotation. Basic terminology from natural language processing
(NLP) and corpus linguistics is introduced, alongside an explanation of the
main components of an NLP pipeline and tools, including pre-processing,
part-of-speech tagging, lemmatization, and entity extraction.
1 https://digital.nls.uk/encyclopaedia-britannica/archive/144133900.
2 DESIGN OF TEXT RESOURCES AND TOOLS 9
2 https://www.darwinproject.ac.uk.
3 https://www.dhi.ac.uk/hartlib/context.
4 We will follow the Oxford Dictionaries in using metadata as a mass noun and data as a
plural noun.
over time? Did Hartlib use a different style when addressing certain
personalities? Does the length of the letters tend to change over time?
Metadata can be of different types, depending on the kind of infor-
mation it provides. We follow the categorization in Burnard (2005)
and distinguish between descriptive, administrative, editorial, and ana-
lytic metadata. The scope of the first two categories is the collection as
a whole, while the latter ones apply to smaller text units. Descriptive
metadata records external information about the context of the text,
such as its source, date of publication, and the sociodemographics of the
authors. Administrative metadata contains information about the collec-
tion itself, for example its title, its version, encoding, and so on. Editorial
metadata, on the other hand, provides information about the editorial
choices that the creators of the digital collection made with respect to
the original text, for example regarding additions, omissions, or correc-
tions. Finally, analytic metadata focuses on the structure of the text, for
example by marking the beginning and end of sections or paragraphs.
Metadata can be encoded into text resources in various ways, either
in external documentation or as part of the collections themselves. The
Text Encoding Initiative (TEI) has developed detailed guidelines for the
encoding of texts in digital format and it has become a widely accepted
standard in the digital humanities. The TEI guidelines specify, among
other things, how the metadata of a text should be displayed in what is
known as the TEI header (for details see TEI Consortium 2019).
As we have said earlier, metadata combined with text data offers the
widest scope for insightful ways to explore texts. Moreover, the texts
themselves can be enriched via annotation to optimize the implicit lin-
guistic information they contain and make it usable for large-scale anal-
yses. Let us imagine that we have access to a large collection of digitized
newspapers and we are interested in analysing the level of international
relations exemplified in this collection. Knowing the geographical ori-
gin of each newspaper is of primary importance, but it is not sufficient
because a newspaper article may talk about a location which is differ-
ent from its place of publication. Hence, we would want to conduct
an in-depth search of the texts to find, for example, instances of place
names. This can be a very time-consuming (or sometimes impossible)
process if we need to read all the articles. Without good disambigua-
tion, we may have to ignore many instances of potentially irrelevant hits
while at the same time missing a high number of relevant hits. For exam-
ple, Paris is the name of the French capital but is also the name of a
12 B. McGILLIVRAY AND G. M. TÓTH
city in Texas, and being able to distinguish the two means that we can
know whether a particular mention refers to international relationships
with France or the United States. Moreover, Paris can also be a per-
son’s name, and at the same time the city can be referred to in different
ways (e.g., ‘the City of Lights’), so again being able to disambiguate the
usages of this name in context is very useful.
As noted by McEnery and Wilson (2001, p. 32), annotation makes
the linguistic information in a text computationally retrievable, thus
enabling a wide range of searches. Annotation can be performed in a manual,
automatic, semi-automatic, or crowd-sourced way, depending on whether
humans, computers, a combination of humans and computers, or groups
of humans are responsible for it. For a detailed overview of linguistic
annotation, see Jenset and McGillivray (2017, pp. 99 ff.). In Sect. 2.4
we will see different types of linguistic annotations and how they can be
relevant to humanities research.
• By medium: does the corpus contain only text, speech, video mate-
rial, or is it mixed?
• By size: does the corpus contain a static snapshot of a language vari-
ety (static corpus) or is it continually updated to monitor the evolu-
tion of language (monitor corpus)?
• By language: is the corpus monolingual or multilingual? If it is mul-
tilingual, have its parts been aligned (parallel corpus)?
• By time: does the corpus cover a language variety in a specific period
without considering its time evolution (synchronic corpus) or does
it focus on the change of a language variety over time (diachronic
corpus)?
• By purpose: was the corpus built to describe the general language
(like contemporary spoken English) or a special aspect of it (like the
language of medical emergency reports)?
5 https://www.sketchengine.eu/jozef-stefan-institute-newsfeed-corpus/.
built from news articles gained from their RSS feeds; it is updated daily
and contains 37 billion words. Such an unrestricted approach to cor-
pus building, however, is not always applicable to the text resources
employed in humanities scholarship, where a potentially complex inter-
action of research questions and availability of texts affects the size and
shape of the resources we can create. For example, sometimes only a few
texts or fragments have survived historical accidents and have found
their way into the collection, meaning that creating a balanced corpus is
simply not a viable option.
Three important considerations to keep in mind when building a
corpus in humanities research are access, digitization, and encoding.
Gaining access to a group of texts can often be anything but straight-
forward, requiring potentially complex issues to be negotiated such as
legal questions with third parties (who might have been responsible for
the digitization, for example), and privacy or human data protection
concerns. Even when we gain access to the texts, these may need to be
digitized, as any subsequent computational processing of the type we talk
about in this volume requires them to be in digital form. Once the texts
have been digitized, or even better during the digitization step itself, the
texts should be presented in such a way as to enable their effective use in
research. In Sect. 2.1.2 we touched on the TEI guidelines, which pro-
vide a great basis for ensuring that digital texts are equipped with all
the metadata needed to place them in their historical context. Although
these topics are not the focus of this volume and therefore will not be
covered in depth, we acknowledge that access, digitization, and encoding
can have a significant impact on the decisions that follow in the research
process. In particular, the quality of the digitization can radically affect
the outcomes of quantitative analyses carried out on the texts, as shown,
for example, by Hill and Hengchen (2019).
Another challenge concerns historical texts, which are often the object
of study in the humanities and which require especially careful consid-
eration. One primary reason for this is that tools and methods devel-
oped in language technology research are still mainly concerned with
modern and well-established languages like English, but require special
adaptation when applied to historical languages (cf. Piotrowski 2012;
McGillivray 2014). Philological and interpretative issues are often of
major importance and need to be accurately incorporated in the corpus
design phase (cf. Meyer 2015). Furthermore, the lack of native speak-
ers of extinct languages or old varieties of living languages means that
we cannot rely on native speaker intuition for the annotation, and extra
layers of checks and explicit guidelines are needed to achieve good qual-
ity results. The next section will describe a concrete use case involving a
historical language, Ancient Greek.
The project aimed to map the change in the meaning of words in the
history of Ancient Greek from the seventh century BCE to the fifth cen-
tury CE, an extremely ambitious goal. For this purpose, we had to build
the largest corpus possible. In Sect. 2.1.1 we stressed the aspiration to
representativeness. One of the important factors to keep in mind is the
role of genre in Ancient Greek semantics, so in the corpus design phase
we aimed at finding the best possible representation of Ancient Greek
genres. While scoping the genre distribution of the texts, we devised a
categorization into genre classes (such as Poetry, Narrative, or Technical)
and subclasses (such as Bucolic, Biography, or Geography).
The categorization aimed at the best possible representation of
Ancient Greek genres. The emphasis on “possible” is critical in this con-
text, as we were constrained by three main factors. First, the texts that
have survived historical accidents and have reached us are all we can hope
to obtain for Ancient Greek. Second, as new digitization was not within
the scope of the project, the number of available digital resources consti-
tuted the upper limit of what we were able to include. Third, even when
digitized editions exist, they may not be free to use and distribute, so we
sourced the texts from three openly available digital libraries (for details
see Vatri and McGillivray 2018). The corpus consists of 820 texts and it
counts 10,206,421 word tokens, making it the largest corpus of its kind
available today.
As is often the case in digital humanities projects, the texts came in
different formats, ranging from TEI XML, to non-TEI XML, HTML,
and Microsoft Word files.6 Therefore we had to allow for an initial phase
of cleaning and standardization of these formats into TEI-compliant
XML to allow further processing and analysis. Another important con-
sideration was character encoding. Greek characters can pose additional
challenges when it comes to encoding, and we found a range of options
in the sources, from Beta Code7 to UTF-8 Unicode, to HTML hexadec-
imal references. Taking the example from Vatri and McGillivray (2018),
for the Greek character ᾆ, the Beta Code is A)=|, the Unicode UTF-8
encoding is ᾆ, and the hexadecimal reference is &#x1F86;. We converted
all Greek characters to Beta Code for standardization purposes, choosing
this encoding because it makes automatic processing and retrieval easier.
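The relationship between these representations can be checked with Python's standard library. The following is a minimal sketch, not the conversion code used for the corpus: the character ᾆ is the Unicode code point U+1F86, and the `BETA` table and the `to_beta` function are hypothetical names that cover only the four pieces of this one character. A real Beta Code converter needs a complete mapping.

```python
import html
import unicodedata

# The character from the example, written as a Python escape (U+1F86).
char = "\u1f86"

# An HTML hexadecimal character reference resolves to the same character.
from_reference = html.unescape("&#x1F86;")

# UTF-8 stores this code point as a sequence of bytes.
utf8_bytes = char.encode("utf-8")

# Hypothetical, deliberately tiny Beta Code table: it covers only the
# pieces of this one character, not the full Beta Code scheme.
BETA = {
    "\u03b1": "A",  # alpha
    "\u0313": ")",  # smooth breathing
    "\u0342": "=",  # circumflex
    "\u0345": "|",  # iota subscript
}


def to_beta(text):
    """Decompose precomposed characters (NFD) and map each piece."""
    decomposed = unicodedata.normalize("NFD", text)
    # Raises KeyError for any piece outside the toy table above.
    return "".join(BETA[piece] for piece in decomposed)


print(unicodedata.name(char))
print(utf8_bytes)
print(to_beta(char))
```

Decomposing the character first keeps the table small: the base letter and each diacritic are mapped separately instead of listing every precomposed combination.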
8 In the example we can see that the XML tag <sentence> shows the beginning of the sen-
tence, and has the attributes id (which assigns a unique identifier to the sentence) and loca-
tion (which gives information about the passage to which the sentence belongs). Nested
inside the <sentence> tag we find a series of <word> tags, each corresponding to a word in the
sentence.
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Leucippe and Clitophon</title>
<author>Achilles Tatius</author>
…
</titleStmt>
…
</fileDesc>
<profileDesc>
<langUsage>
<language ident="grc">Greek</language>
</langUsage>
<creation>
<date>120</date>
</creation>
</profileDesc>
<xenoData>
<genre>Narrative</genre>
<subgenre>Novel</subgenre>
</xenoData>
</teiHeader>
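Metadata stored in a header like this one can be read programmatically. The sketch below is an illustration under stated assumptions, not the project's own code: `TEI_SAMPLE` is a well-formed copy of the fragment above with the elided parts left out and the root element closed, and `read_header` is a hypothetical helper built on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the header fragment: elided parts are omitted and
# the root element is closed so that the parser accepts it.
TEI_SAMPLE = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Leucippe and Clitophon</title>
        <author>Achilles Tatius</author>
      </titleStmt>
    </fileDesc>
    <profileDesc>
      <langUsage>
        <language ident="grc">Greek</language>
      </langUsage>
      <creation>
        <date>120</date>
      </creation>
    </profileDesc>
    <xenoData>
      <genre>Narrative</genre>
      <subgenre>Novel</subgenre>
    </xenoData>
  </teiHeader>
</TEI.2>"""


def read_header(xml_text):
    """Return selected metadata fields from the header as a dictionary."""
    header = ET.fromstring(xml_text).find("teiHeader")
    return {
        "title": header.findtext("fileDesc/titleStmt/title"),
        "author": header.findtext("fileDesc/titleStmt/author"),
        "language": header.find("profileDesc/langUsage/language").get("ident"),
        "date": header.findtext("profileDesc/creation/date"),
        "genre": header.findtext("xenoData/genre"),
        "subgenre": header.findtext("xenoData/subgenre"),
    }


metadata = read_header(TEI_SAMPLE)
print(metadata)
```

Collecting such dictionaries for every text in a corpus makes it possible to filter or group the texts by date, genre, or author before any linguistic analysis.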
9 https://www.tei-c.org/release/doc/tei-p5-doc/en/html/HD.html.
We have seen how we designed the Diorisis corpus based on our original
research question on semantic change. According to the corpus termi-
nology introduced in Sect. 2.1, it is a diachronic monolingual static (i.e.,
not monitor) general text corpus of Ancient Greek. Being the largest
of its kind, the Diorisis corpus is aimed at classicists and historical lin-
guists in general, and can be used to investigate a variety of aspects of the
Ancient Greek language, history, and culture. In fact, we would like to
stress that, in this case and in many others, even though a corpus is often
designed with a specific research question in mind, and its format and
content tend to be affected by this aim, it is also important to consider
its general usability beyond the original research study, and adopt stand-
ard formats as much as possible. For the Diorisis corpus, we chose the
TEI XML encoding and recorded a range of informative metadata which
make it of wider relevance.
Tokenization is the process that splits the text into units called tokens.
What counts as a token can vary depending on the specific criteria we
choose to suit our research, but it typically corresponds to what in the
common language is referred to as a word. For example, the following
passage from Shakespeare’s Hamlet contains 14 tokens including punc-
tuation marks:
10 https://www.nltk.org.
11 https://stanfordnlp.github.io/stanfordnlp/pipeline.html.
edu/hamlet/full.html.
The first eight words in this list are all stop words, and the first con-
tent word is hamlet, perhaps not surprisingly. If we remove stop words,
we obtain the list in Table 2.2, which is more readily usable in analyses of
the content of the text.
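The effect of stop-word removal on a frequency list can be sketched in a few lines of Python. This is a simplified illustration: the sample sentence is invented, `STOP_WORDS` is a hypothetical mini-list (real lists are much longer), and the regular-expression tokenizer is a stand-in for the more careful tokenizers provided by NLP libraries.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "to", "of", "a", "i", "you", "my", "is",
              "in", "it", "that", "not", "or", "be"}


def tokenize(text):
    """Lower-case the text and keep runs of letters or apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def content_word_counts(text):
    """Count tokens after discarding those found in the stop-word list."""
    return Counter(t for t in tokenize(text) if t not in STOP_WORDS)


# Invented sample sentence, not a quotation from the play.
sample = "The king and the queen spoke to Hamlet, and Hamlet spoke of the ghost."
all_counts = Counter(tokenize(sample))
counts = content_word_counts(sample)
print(all_counts.most_common(3))
print(counts.most_common(3))
```

Before filtering, the most frequent token of the sample is a stop word; after filtering, only content words remain in the list.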
Another step that is sometimes useful, particularly in the case of his-
torical texts, is spelling standardization, which involves standardizing the
many spellings that the same word can have (for example adviser and
advisor in English). Extensive research has been done on this (as well
as on OCR correction) and we refer the interested reader to Piotrowski
(2012) for an overview.
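In its simplest form, spelling standardization is a lookup from variant spellings to a chosen standard form. The sketch below assumes a hand-made table, `VARIANTS`, whose entries are illustrative; the approaches surveyed in the literature are considerably more sophisticated.

```python
# Hypothetical variant table: each key is a variant spelling and each
# value is the form chosen as the standard for the corpus.
VARIANTS = {
    "advisor": "adviser",
    "shew": "show",
    "publick": "public",
}


def standardize(tokens):
    """Lower-case each token and replace known variants."""
    return [VARIANTS.get(t.lower(), t.lower()) for t in tokens]


print(standardize(["The", "publick", "advisor", "did", "shew", "it"]))
```

Tokens that are not in the table pass through unchanged apart from lower-casing, so the table can grow gradually as new variants are found.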
13 https://tartarus.org/martin/PorterStemmer/.
14 http://snowball.tartarus.org/otherapps/schinke/intro.html.
should be lemmatized as the noun amor, but in other cases amor can be
lemmatized as the passive of the verb amo, ‘to love’. Lemmatization can
be introduced as part of a manual annotation of a corpus, but when pos-
sible using off-the-shelf lemmatizers is much faster, at least for those lan-
guages for which these exist. For example, the NLTK package contains
the WordNet lemmatizer16 for English and CLTK contains lemmatizers
for Latin and Ancient Greek.17
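The reason lemmatization depends on context can be shown with a toy lookup keyed by both the word form and its part of speech. This is a sketch only: `LEXICON` and `lemmatize` are hypothetical, and off-the-shelf lemmatizers rely on much larger lexicons and on automatic disambiguation.

```python
# Hypothetical toy lexicon keyed by (form, part of speech).
LEXICON = {
    ("amor", "NOUN"): "amor",    # the noun 'love'
    ("amor", "VERB"): "amo",     # passive form of the verb amo, 'to love'
    ("amoris", "NOUN"): "amor",  # genitive singular of the noun
}


def lemmatize(form, pos):
    """Return the lemma for a form and PoS, or the form if unknown."""
    key = (form.lower(), pos)
    return LEXICON.get(key, form.lower())


print(lemmatize("amor", "NOUN"))
print(lemmatize("amor", "VERB"))
```

The same form receives two different lemmas depending on the part of speech supplied, which is why lemmatization and PoS tagging are often performed together.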
One step further from lemmatization consists in providing the full
morphological analysis of a form in its context. This is useful when we
want to know characteristics like the number of a noun (is it plural or
singular?) or the tense of a verb (is it past, present, or future?). The
example below is taken from the Diorisis Ancient Greek corpus intro-
duced in Sect. 2.3:
The XML tag <word> contains the word form, and nested in it
the <lemma> tag shows the lemma (attribute “entry”) in addition to
other attributes18; inside <lemma> we find the <analysis> tag, whose
attribute “morph” contains the morphological analyses of the form, in
this case feminine nominative or vocative singular.
16 http://www.nltk.org/_modules/nltk/stem/wordnet.html.
17 See https://wiki.digitalclassicist.org/Morphological_parsing_or_lemmatising_Greek_
was disambiguated using the part-of-speech tagger, and disambiguated gives the confidence
in the disambiguation. See Vatri and McGillivray (2018) for details.
The first four classes are called open because new elements can be added
to them, while the last four are called closed. For example, English wel-
comes new nouns all the time (vacay, fabulosity), but very rarely new
conjunctions or prepositions.
Knowing the PoS of a word is important in many contexts. Imagine
that we want to do sentiment analysis of a series of tweets to find out
whether they express positive or negative opinions. One approach would
be to look for adjectives in those tweets, and see if they belong to a pos-
itive category (e.g., good, fantastic, awesome) or a negative one (terrible,
bleak, bad, and so on). How do we find all adjectives in a text? We can
do this by adding PoS annotation to it, and then searching for words
annotated with the PoS of interest.
The example below is taken from the first sentence of the anno-
tated version of the Brown Corpus,19 a corpus containing English texts
amounting to one million words.
<s n="1">
<w type="at">The</w>
<w type="np-tl">Fulton</w>
<w type="nn-tl">County</w>
<w type="jj-tl">Grand</w>
<w type="nn-tl">Jury</w>
<w type="vbd">said</w>
<w type="nr">Friday</w>
</s>
19 http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM#bc5.
The example contains an XML tag for the sentence <s> with attribute n
for its number. Nested inside the sentence tag we find a series of word
tags <w>, each with an attribute type for the morphological analysis of
the word form: at stands for article, np for proper noun, nn for common
noun, jj for adjective, vbd for verb in the past tense (so PoS and morpho-
logical information on verb tense), and nr for singular adverbial noun
(so, again, PoS and morphological details combined, in this case number
information).
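Once a text carries annotation of this kind, searching for all words with a given PoS becomes a simple filtering task. The sketch below parses the example sentence with Python's standard `xml.etree.ElementTree` module; the function name `words_with_tag` is a hypothetical choice, and matching on the beginning of the tag is a simplification that groups, for instance, nn and nn-tl together.

```python
import xml.etree.ElementTree as ET

# The annotated example sentence, stored as a string.
SENTENCE = """<s n="1">
<w type="at">The</w>
<w type="np-tl">Fulton</w>
<w type="nn-tl">County</w>
<w type="jj-tl">Grand</w>
<w type="nn-tl">Jury</w>
<w type="vbd">said</w>
<w type="nr">Friday</w>
</s>"""


def words_with_tag(xml_text, prefix):
    """Return the words whose type attribute starts with the prefix."""
    root = ET.fromstring(xml_text)
    return [
        w.text.strip()
        for w in root.iter("w")
        if w.get("type").startswith(prefix)
    ]


print(words_with_tag(SENTENCE, "nn"))
print(words_with_tag(SENTENCE, "jj"))
```

The same function applied with the prefix jj to a whole corpus would return every adjective, which is the kind of search needed for the sentiment example mentioned earlier.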
For some languages it is possible to do PoS annotation automatically,
and a very popular PoS tagger for which implementations are available
for several languages is TreeTagger.20 PoS taggers are programs that
assign the PoS to every token in a text, and are usually able to use the
word’s context to perform disambiguation. For example, book is a verb in
We are going to book a flight tonight, but a noun in We gave him a heavy
book. The way such taggers work is usually by being trained on a set of
annotated texts where they are able to learn patterns of co-occurrences
of different PoS, for example that in English an adjective is usually fol-
lowed by a noun, and can then use this to analyse new sentences.
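The idea of learning from annotated text and using context to disambiguate can be illustrated with a deliberately small sketch. It is not how TreeTagger works internally: the training sentences and tag names are invented, and the greedy rule used here (prefer the tag that most often followed the previous tag, then the tag most often seen with the word) is a simplification of the statistical models used by real taggers.

```python
from collections import Counter

# Hypothetical hand-tagged training sentences (tags are illustrative).
TRAINING = [
    [("we", "PRON"), ("want", "VERB"), ("to", "PART"), ("book", "VERB"),
     ("a", "DET"), ("flight", "NOUN")],
    [("we", "PRON"), ("gave", "VERB"), ("him", "PRON"), ("a", "DET"),
     ("heavy", "ADJ"), ("book", "NOUN")],
    [("they", "PRON"), ("read", "VERB"), ("a", "DET"), ("book", "NOUN")],
]


def train(sentences):
    """Count word/tag pairs and tag-to-tag transitions."""
    emissions = Counter()    # how often a word carries a tag
    transitions = Counter()  # how often a tag follows another tag
    for sentence in sentences:
        previous = "START"
        for word, tag in sentence:
            emissions[(word, tag)] += 1
            transitions[(previous, tag)] += 1
            previous = tag
    return emissions, transitions


def tag(words, emissions, transitions):
    """Greedily choose, for each word, the tag that best fits the context."""
    tagged, previous = [], "START"
    for word in words:
        # Tags seen with this word in training; unknown words default to NOUN.
        candidates = [t for (w, t) in emissions if w == word] or ["NOUN"]
        best = max(
            candidates,
            key=lambda t: (transitions[(previous, t)], emissions[(word, t)]),
        )
        tagged.append((word, best))
        previous = best
    return tagged


emissions, transitions = train(TRAINING)
print(tag(["to", "book", "a", "flight"], emissions, transitions))
print(tag(["a", "heavy", "book"], emissions, transitions))
```

In the first test sentence book follows to and is tagged as a verb; in the second it follows an adjective and is tagged as a noun, even though the noun reading is the more frequent one in the training data.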
20 https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
21 http://geonames.nga.mil/gns/html/.
22 http://gazetteer.org.uk/.
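[Semantically tagged example sentence. Each line gives the PoS tag, the word, and its semantic tags; only the letters of most tags are recoverable.]
AT  A  Z5
JJ  large  N3.2+ N5+ A11.1+
NN  part  N O M I S K
IO  of  Z
NN  Humanities  P
NN  research  X
VVZ  relies  S
II  on  Z
NN  text  Q Q
NN  interpretation  X K Q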
From the full list of tags23 and their subcategories24 we can see that
the determiner A is assigned the category Z5, which is reserved for
grammatical items, while the adjective large is tagged with the codes
N3.2+ (size), N5+ (quantities), and A11.1+ (important).
When semantically annotating a text, it is a good idea to rely on
existing resources that organize the lexicon into semantic categories. One
very widely used resource of this kind is WordNet (Miller 1995),25 which has
become a standard in computational linguistics for contemporary lan-
guages. The availability of similar lexicons for historical languages is
much more restricted (WordNet has a limited Latin version, for exam-
ple), and it can be helpful to consider automatic approaches to semantic
annotation. This will be the focus of Chapter 5.
Another important level of annotation is that of sentiment, which is
relevant to many research questions in the humanities. Imagine that we
want to find out whether a text has a positive, negative, or neutral sen-
timent, and how this changes throughout the text in relation to differ-
ent characters, regions, and so on, or we want to measure the sentiment
expressed on the Internet with respect to specific topics or views and in
relation to certain historical or political events. This task is commonly
referred to as sentiment analysis. As we know, sentiment is not always
23 http://ucrel.lancs.ac.uk/usas/semtags.txt.
24 http://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt.
25 https://wordnet.princeton.edu/.
binary, and texts can display multiple layers of meaning, and use rhetori-
cal devices like sarcasm and irony. All this heavily depends on the context,
which makes developing automatic methods for sentiment analysis a very
active area of research in NLP (cf. Castellucci et al. 2015, among others).
An overview of such methods is outside the scope of this chapter,
and here we will briefly review two main approaches to sentiment anal-
ysis: semantic and machine-learning approaches. Semantic approaches
make use of sentiment lexicons which organize words (typically adjec-
tives) into positive, negative and neutral classes. Once the text has been
pre-processed (for example with tokenization and stop-word removal,
stemming or lemmatization), we check for the presence or absence of
each term of the lexicon. Then we add the polarity values of the terms to
reach a global polarity value for the text, taking into account that mod-
ifier terms (such as very, too, little) can increase or decrease the polarity
of accompanying terms, and inversion terms or negations (such as no,
never) can reverse the polarity of the terms they relate to. On the other
hand, machine-learning methods rely on collections of texts that have
been annotated by sentiment. These are then used to “train” so-called
classifiers, computer programs which assign a text to a class (for example
the positive, negative, or neutral category) based on its features, such as
its words and their semantics (see Chapter 5 for an introduction to this
concept).
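The lexicon-based approach described above can be sketched as follows. All values are assumptions made for illustration: the `POLARITY`, `MODIFIERS`, and `NEGATIONS` tables are hypothetical mini-lexicons, and the rule that a modifier or negation applies to the next polarity term is one simple design choice among many.

```python
# Hypothetical mini-lexicons; the numeric values are illustrative only.
POLARITY = {"good": 1.0, "fantastic": 2.0, "awesome": 2.0,
            "terrible": -2.0, "bleak": -1.0, "bad": -1.0}
MODIFIERS = {"very": 1.5, "too": 1.5, "little": 0.5}
NEGATIONS = {"no", "not", "never"}


def polarity(tokens):
    """Sum term polarities, applying pending modifiers and negations."""
    total, weight, negate = 0.0, 1.0, False
    for token in tokens:
        if token in NEGATIONS:
            negate = True                      # reverse the next polarity term
        elif token in MODIFIERS:
            weight *= MODIFIERS[token]         # scale the next polarity term
        elif token in POLARITY:
            value = POLARITY[token] * weight
            total += -value if negate else value
            weight, negate = 1.0, False        # reset after each scored term
    return total


print(polarity("the film was very good".split()))
print(polarity("never bad".split()))
```

The input is assumed to be already tokenized and lower-cased, as in the pre-processing steps described earlier in the chapter. A positive total suggests positive sentiment and a negative total negative sentiment; texts without any lexicon term receive a score of zero.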
2.5 Conclusion
In this chapter we have discussed the differences between linguis-
tic corpora, text collections, and text resources in the humanities, and
illustrated the basic steps involved in processing a text for the purpose
of further analysis. The topics covered in this chapter are the object of
very active research in the field, and the brief account we have given here
is far from complete. Moreover, not all the topics will be relevant to all
research scenarios, but knowing the terminology and being aware of
the possibilities offered by existing tools helps inform the phases of the
research process, from design to analysis. For example, it is very important
to collect the texts in such a way as to address our research questions
as accurately as possible, while also taking into account existing standards,
access and availability concerns, and the size and format of the resources.
This way we make it possible for other researchers to reuse and possibly
further enrich our resources in the future.
References
Bellini, D., & Schneider, S. (Eds.). (2003–18). Banca dati dell’italiano parlato
(BADIP). Graz: Karl-Franzens-Universität Graz. http://badip.uni-graz.at.
Bode, K. (2020). Why You Can’t Model Away Bias. Modern Language
Quarterly, 81, 1.
Burnard, L. (2005). Metadata for Corpus Work. In M. Wynne (Ed.), Developing
Linguistic Corpora: A Guide to Good Practice (pp. 30–46). Oxford: Oxbow
Books. Available online from http://ota.ox.ac.uk/documents/creating/dlc/.
Accessed 16 Sept 2019.
Castellucci, G., Croce, D., & Basili, R. (2015). Acquiring a Large Scale Polarity
Lexicon Through Unsupervised Distributional Methods. In C. Biemann et al.
(Eds.), Natural Language Processing and Information Systems 2015. Lecture
Notes in Computer Science (Vol. 9103, pp. 73–86). Switzerland: Springer
International Publishing. https://doi.org/10.1007/978-3-319-19581-0_6.
Erdmann, A. et al. (2016). Challenges and Solutions for Latin Named Entity
Recognition. COLING, Association for Computational Linguistics, 85–93.
Erdmann, A., Wrisley, D. J., Allen, B., Brown, C., Cohen-Bodénès, S., Elsner,
M., et al. (2019). Practical, Efficient, and Customizable Active Learning for
Named Entity Recognition in the Digital Humanities (pp. 2223–2234).
https://doi.org/10.18653/v1/n19-1231.
The final consultation of Northcote and his client took place in the
open street in the heavily raining December afternoon, with their
backs against Mr. Whitcomb’s brass plate. The spot selected for their
last utterances on this momentous affair was incongruous indeed,
but each had grown so impatient of the other, that if their last words
were spoken here, the clash of their mental states was the less likely
to invite disaster than in a more formal council-chamber of four walls.
The robust common sense of the solicitor had never shown itself to
be more incisive than now as he stood with his back to his own door,
under a dripping umbrella, his hat pushed to the back of his head,
and his trousers turned up beyond his ankles. His twenty years of
immensely successful practice, his exact knowledge of human
nature, his ruthless worldliness, his reverence for the hard fact, stood
forth here in the oddest contrast with the somewhat “special” and
rarefied quality of this youthful advocate whom he had seen fit to
entrust with so important a case.
“It’s a pity, it’s a pity,” he brought himself to say at last, his veneer
falling off a little under the stress of his chagrin, and revealing a
glimpse of the baffled human animal beneath. “It is a serious mistake
to have made; but we have got to stand to it. You are not the man for
this class of work, to speak bluntly. You are either too deep or you
are not deep enough. But as I say, we have got to stand to it now.
My last words will be to urge you to put as good a face upon it as
you can.”
“In other words,” said Northcote, stiffening, “you will look to me to do
my best.”
“I don’t put it in that form exactly,” said the solicitor, midway between
exasperation and a desire to be courteous. “I want you fully to
appreciate that you are handling an extremely tough job, and I
merely want you to make the best of it, that’s all.”
“I will tell you, Mr. Whitcomb,” said Northcote, striving in vain to avert
the explosion that had been gathering for so long, “that if it were not
now the eleventh hour, if I had not pledged myself to this thing more
deeply than you know, if it were not a matter of life and death to me
as well as to your client, I would throw your brief back at you rather
than submit to this. It will be time enough for you to get upon your
platform when I have made a hash of everything.”
“Yes, I think you are entitled to say that,” said the solicitor impartially,
having made a successful effort to recapture his own serenity. “I
have no right to talk as I am doing; I have never done so to any one
else. I suspect you have got on my nerves a bit.”
“Yes, the whole matter throws back to the clash of our
temperaments,” said Northcote, unable to cloak his own irritation
now that it had walked abroad. “It is a pity that we ever attempted to
work together. Yet for one who envelops himself in the serene air of
reason, you are somewhat illogical, are you not? You enter the
highways and hedges in search of a particular talent; you have the
fortune to light upon it; and then you turn and rend its unhappy
possessor for possessing it.”
“As I say, my dear boy, this particular talent of yours—or is it your
temperament?—you see I am not up in these technical names—has
got on my nerves a little.”
“And your temperament, my friend, to indulge a tu quoque, is
covered with a hard gritty outer coating, for which I believe the
technical name is ‘practicality,’ which positively sets one’s teeth on
edge.”
“So be it; we part with mutual recriminations. But this is my last word.
Firmly as I believe I have committed an error of judgment, if to-
morrow you can prove that I have deceived myself, you will not find
me ungrateful. I can speak no fairer; and this you must take for my
apology. It is not too much to say that since I have come to know you
I have ceased to recognize myself.”
“I accept your amende,” said Northcote, without hesitation. “I see I
have worried you, but if I might presume to address advice to the
fount of all experience, never, my dear Mr. Whitcomb, attempt to
formulate a judgment upon that which you cannot possibly
understand.”
“After to-morrow there is a remote chance that I may come to heed
your advice. In the meantime we will shake hands just to show that
malice is not borne. Don’t forget that you will be the first called to-
morrow, at half-past ten. It is quite likely to last all day.”
The solicitor turned into his offices and Northcote sauntered along
Chancery Lane. The twilight which had enveloped the city all day
was now yielding to the authentic hues of evening. The dismal
street-lamps were already lit; the gusts of rain, sleet, and snow of the
previous night had been turned into a heavy downpour which had
continued without intermission since the morning. The pavements
were bleached by the action of water, but a miasma arose from the
overburdened sewers, whose contents flowed among the traffic and
were churned by its wheels into a paste of black mud. Northcote was
splashed freely with this thick slushy mixture, even as high as his
face, by the countless omnibuses; and in crossing from one
pavement to another he had a narrow escape from being knocked
down by a covered van.
It was in no mood of courage that the young man pushed his way to
his lodgings through the traffic and the elbowing crowds who
thronged the narrow streets. Even the mental picture that was
thrown before his eyes of this garret which had already devoured his
youth had the power to make him feel colder than actually he was.
Never had he felt such a depression in all the long term of his
privation as now in wending his way towards it laboriously, heavily,
with slow-beating pulses.
He was sore, disappointed, angry; his pride was wounded by the
attitude of his client. His self-centred habit caused him to take
himself so much for granted, that at first he could discern no reason
for this volte-face. In his view it was inconsiderate to withhold the
moral support of which at this moment he stood so much in need.
Truly the lot of obscurity was hard; its penalties were of a kind to
bring many a shudder to a proud and sensitive nature. The
patronizing insolence of one whom he despised was beginning to fill
him with a bootless rage, yet in his present state how impotent he
was before it. He must suffer such things, and suffer them gladly,
until that hour dawned in which his powers announced themselves.
That time was to-morrow—terrible, all-piercing, yet entrancing
thought! The measure of his talent would then be proclaimed. Yet all
in an instant, like a lightning-flash shooting through darkness, for the
first time the true nature of his task was revealed to him. Doubt took
shape, sprang into being. Its outline seemed to loom through the
dismal shadows cast by the lamps in the street. Who and what was
he, after all, in comparison with a task of such immensity? With
startling and overwhelming force the solicitor’s meaning was
suddenly unfolded to him.
He took himself for granted no more. He must be mad to have gone
so far without having paused to subject himself to the self-criticism
that is so salutary. How could he blame the solicitor whose eminently
practical mind had resented this inaccessibility to the ordinary rules
of prudence? Was he not the veriest novice in his profession, without
credentials of any kind? And yet he arrogated to himself the right to
embark upon a line of conduct that was in direct opposition to the
promptings of a mature judgment.
How could he have been so sure of this supreme talent? It had never
been brought to test. The only measure of it was his scorn of others,
the scorn of the unsuccessful for those who have succeeded. The
passion with which it had endowed him was nothing more, most
probably, than a monomania of egotism. How consummate was the
folly which could mistake the will for the deed, the vaulting ambition
for the thing itself!
On the few occasions, some seven or eight in all, in which he had
turned an honest guinea, mostly at the police-court, he had betrayed
no surprising aptitude for his profession. There had been times, even
in affairs so trivial, when his highly strung nervous organization had
overpowered the will. He had not been exempt from the commission
of errors; he recalled with horror that once or twice it had fallen to his
lot to be put out of countenance by his adversary; while once at least
he had drawn down upon himself the animadversions of the
presiding deity. Surely there was nothing in this rather pitiful career
to provide a motive for this overweening arrogance.
He grew the more amazed at his own hardihood as he walked along.
To what fatal blindness did he owe it that from the beginning his true
position had not been revealed to him? Where were the credentials
that fitted him to undertake a task so stupendous? What
achievement had he to his name that he should venture to launch his
criticisms against those who had been through the fray and had
emerged victorious? How could he have failed to appreciate that
abstract theory was never able to withstand the impact of
experience! It was well enough in the privacy of his garret to
conceive ideas and to sustain his faculties with dreams of a future
that could never be, but once in the arena, when the open-mouthed
lion of the actual lay in his path, he would require arms more
puissant than these.
To overcome those twin dragons Tradition and Precedent, behind
which common and vulgar minds entrenched themselves so
fearlessly, the sword of the sophist would not avail. It would snap in
his fingers at the first contact with these impenetrable hides. His
blade must be forged of thrice-welded steel if he were to have a
chance on the morrow. He had decided to promulgate like a second
Napoleon the doctrine of force, and for his only weapon he had
chosen a dagger of lath. Well might Mr. Whitcomb smile with
contempt. Where would he find himself if he dared to preach the
most perilous of gospels, if he could not support it with an enormous
moral and physical power?
For years he had dwelt in a castle which he had built out of air,
secure in the belief that he was endowed in ample measure with
attributes whose operations were so diverse yet so comprehensive,
that in those rare instances in which they were united they became
superhuman in their reach. An Isaiah or a Cromwell did not visit the
world once in an era. How dare such a one as he fold his nakedness
in the sacred mantle of the gods! It was the act of one whose folly
was too rank even to allow him to pose as a charlatan. If he ventured
to deliver one-half of these astonishing words he had prepared for
the delectation of an honest British jury, these flatulent pretensions
would be unveiled, he would be mocked openly, his ruin would be
complete and irretrievable.
Never had irresolution assailed him so powerfully. This review at the
eleventh hour of the unwarrantable estimate he had formed of
himself rendered it imperative that he should change his plans. The
opinion of others, acknowledged masters of the profession in which
he was so humble a tyro, was incontrovertible. Evidence in support
of a perfectly rational plea was provided for him, would be ready in
court. His client had demanded that it should be used. To disregard
that demand would be to rebuff his only friend, one of great influence
who had been sent to his aid in his direst hour. And it was for nothing
better than a whim that he was prepared to yield his all. No principle
was at stake, no sacrifice of dignity was involved. That which his
patron had asked of him was so natural, so admirably humane, that
the mere act of refusal would be rendered unpardonable unless it
were vindicated by complete success. No other justification was
possible, not only in the eyes of himself and in those of his client, but
no less was exacted of him by the hapless creature whose life was in
his keeping.
Stating it baldly, let him fail in the superhuman feat which had been
imposed upon him by a disease which he called ambition, and this
wretched woman would expiate his failure upon the gallows. Had
any human being a right to incur such a penalty, a right to pay such a
price in the pursuit of his own personal and private aims? The middle
course was provided for him. It would deliver the accused and
himself from this intolerable peril; it opened up a path of safety for
them both.
Already he could observe with a scarifying clearness that here and
now, at the eleventh hour, he must defer to the irresistible impact of
the circumstances. The risk was too grave; he was thrusting too
cruel a responsibility upon his flesh and blood. He must hasten to
make terms with that grossly material world of the hard fact which he
scorned so much. He must submit to one of those pitiful