
A sample of corpus-based terminology management and tools

Introduction

Corpus-based terminology in this example relies on software specially designed for the purpose
according to the general guidelines of the Communicative Theory of Terminology, or TCT (Cabré, 1999).
The TCT adopts a linguistic approach to terminology that also takes cognitive and discursive aspects
into account; its object of study is therefore terms conceived in the context of specialized discourse.
For the TCT, a linguistic approach to terminology is a corpus-based approach. The TCT is interested
not only in prescriptive terminology, i.e. the terminology established by standards or found in official
databases, but also (and particularly) in the terms actually used by field experts in Language for
Special Purposes (LSP) corpora. In other words, the TCT does not limit itself to an in vitro approach
but is also interested in living terminology, that is, the units effectively used by experts in specialized
communication.

The terminological unit is seen as a three-fold polyhedron having a cognitive component (the concept),
a linguistic component (the term) and a communicative one (the situation).

Linguists are interested in living terminology, and by carrying out empirical analyses they can
discover that terminology is much more complex than has traditionally been assumed. Specialized
texts also contain synonymy, ambiguity, vagueness, periphrasis, redundancy and systems of term
variation operating at different levels, so the notion of univocity becomes (at least partly) untenable.
Consequently, a theory of terminology must assume that variation is an essential property of
communication between experts, and both terms and concepts must be studied in their dynamic
interplay.

Terminus 2.0 is a web-based application for corpus analysis and for term extraction and
management. This worksheet presents all the steps to be followed in a typical terminological project,
the most important of which are explained in this section; the final product is a glossary.

1. The first step of the process is the definition of the terminological project.
2. The next step is the selection of the domain of interest for the glossary, the main language of
   the entries, the language or languages of the equivalents and, most importantly, the audience
   targeted by our terminological product (the communicative situation); these decisions are
   sketched as a simple data structure below.
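
Purely as an illustration, the following Python sketch expresses those project-definition decisions as
a plain data structure. The field names and sample values are hypothetical assumptions made for
this worksheet, not the internal representation used by Terminus 2.0.

from dataclasses import dataclass

@dataclass
class TerminologicalProject:
    name: str
    domain: str                    # domain field of interest
    entry_language: str            # main language of the entries
    equivalence_languages: list    # language(s) of the equivalents
    target_audience: str           # the communicative situation (audience)

project = TerminologicalProject(
    name="Wind energy glossary",
    domain="wind power engineering",
    entry_language="en",
    equivalence_languages=["es", "ca"],
    target_audience="technical translators",
)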

3. Once these decisions have been made, the following step is to compile an appropriate corpus
to work with, that is, a sample of specialized documents of sufficient size and quality to be
considered representative of the domain in question.
The purpose of gathering a corpus of documents of the domain in question is threefold:
- as terminologists we need first-hand experience with the data in order to become
familiar with the language of the domain, an indispensable experience that
complements the information obtained by interviewing experts;
- the corpus is needed to conduct different statistical analyses of the vocabulary as
well as terminology extraction;
- the texts will also be used to obtain complementary information about the terms,
such as semantic, syntactic or collocational clues, among others.

Ideally, the corpus to be analyzed should be large enough to be considered representative of the
domain. Unfortunately, there is no precise mathematical formula to determine the size a corpus must
reach to guarantee the sample's representativeness. The corpus should therefore contain as many
documents as possible: the bigger it is, the more terminological units it will contain and the more
reliable our conclusions will be.

[Figure: Zipf's Law]
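
Zipf's law states that when the word types of a corpus are ranked by decreasing frequency, frequency
falls roughly in inverse proportion to rank, which is one reason why ever larger corpora keep
contributing new, rarer vocabulary. The following minimal Python sketch computes the rank-frequency
profile of a corpus; the file name corpus.txt is a hypothetical stand-in for the compiled corpus.

from collections import Counter

def rank_frequency(text):
    # Rank the word types by decreasing frequency.
    counts = Counter(text.lower().split())
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Zipf's law predicts freq(rank) ~ C / rank, so freq * rank stays roughly constant.
text = open("corpus.txt", encoding="utf-8").read()
for rank, word, freq in rank_frequency(text)[:10]:
    print(rank, word, freq, freq * rank)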

Another important aspect is the qualitative one. The first step when we approach a new area of
knowledge is to identify the publications of reference. This is probably the best way to compile an
LSP corpus, but not always the most practical one: sometimes, for instance, the documents are not
available in electronic format, and the cost of OCR scanning and subsequent manual correction
becomes prohibitive. In such cases, one should consider searching the web for documents, using
some terms as a query expression. The Terminus 2.0 tool lets users upload files they have previously
compiled, or download documents using web search engines, an option that is useful for gathering
large amounts of data from the web.
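
As a rough illustration of the web-download idea (this is not the actual Terminus module), the
following Python sketch saves a list of documents into a local corpus folder. The URLs are
hypothetical placeholders for results obtained from a search engine.

import pathlib
import urllib.request

urls = [
    "https://example.org/paper1.txt",  # hypothetical document URLs
    "https://example.org/paper2.txt",
]

corpus_dir = pathlib.Path("corpus")
corpus_dir.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    # Save each document as a numbered plain-text file in the corpus folder.
    with urllib.request.urlopen(url) as response:
        (corpus_dir / f"doc_{i:03d}.txt").write_bytes(response.read())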

4. At this point it might be useful to develop the concept structure of the particular domain under
study. Terminus contains a concept structure module for the design of the domain's concept
tree. This information, entered by the user in graphic form, is later encoded by the program in
logical form (in XML syntax) so that it can eventually be processed by other systems; a sketch
of such an encoding is shown below. This functionality is not accessible in the demo version.
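
To give an idea of what such an XML encoding of a concept tree might look like, here is a minimal
Python sketch. The tag and attribute names are assumptions made for illustration, not the actual
schema produced by Terminus.

import xml.etree.ElementTree as ET

# A small fragment of a hypothetical domain's concept tree.
root = ET.Element("concept", name="renewable energy")
wind = ET.SubElement(root, "concept", name="wind power")
ET.SubElement(wind, "concept", name="wind turbine")
ET.SubElement(wind, "concept", name="wind farm")
ET.SubElement(root, "concept", name="solar power")

# Serialize the tree so that, eventually, other systems could process it.
print(ET.tostring(root, encoding="unicode"))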

5. Once the user has defined and compiled the corpus, Terminus offers different possibilities for
the analysis of the vocabulary: the extraction of concordances (Key Word in Context); the
sorting of the vocabulary of the corpus (words or n-grams) by frequency or by statistical
measures of association; and, finally, the automatic extraction of terminology.

a. The Key Word in Context (KWIC) search, also called concordance extraction in the field of
corpus linguistics, consists of extracting the contexts of occurrence of a given query expression
(i.e., the term), the context being a sentence or an arbitrary number of words to the left and
right. Concordances can be extremely helpful for quickly grasping the meaning of a term by
observing how experts use the expression in real texts. They can also be used to analyze the
collocations of the term or to see with which other terms it is conceptually associated.
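
A minimal KWIC sketch in Python is shown below. It handles only single-word queries with a fixed
window of words on each side, and corpus.txt is again a hypothetical input file.

def kwic(text, query, window=5):
    # Yield every occurrence of the query word with its left and right context.
    words = text.split()
    q = query.lower()
    for i, word in enumerate(words):
        if word.lower().strip(".,;:()") == q:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            yield f"{left:>40}  [{word}]  {right}"

text = open("corpus.txt", encoding="utf-8").read()
for line in kwic(text, "turbine"):
    print(line)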

b. Sorting the vocabulary of the corpus. The program offers the possibility of sorting words or
sequences of words (n-grams) alphabetically, in decreasing frequency order or by statistical
measures of association, which highlight those sequences of words that have a significant
tendency to appear together in the corpus. Sorting the vocabulary in this way is a fairly simple
procedure for discovering multi-word terminology, collocations and phraseological units of
various types.
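
The following Python sketch illustrates both sorting strategies on bigrams: raw frequency, and
pointwise mutual information (PMI), one common measure of association. The source does not state
which measures Terminus implements, so PMI here is an assumption chosen for illustration.

import math
from collections import Counter

words = open("corpus.txt", encoding="utf-8").read().lower().split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
n_uni = len(words)
n_bi = max(len(words) - 1, 1)

def pmi(bigram):
    # PMI = log2( P(w1 w2) / (P(w1) * P(w2)) )
    w1, w2 = bigram
    return math.log2((bigrams[bigram] / n_bi)
                     / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

print(bigrams.most_common(10))                       # by decreasing frequency
frequent = [b for b in bigrams if bigrams[b] >= 5]   # PMI overrates rare events
print(sorted(frequent, key=pmi, reverse=True)[:10])  # by association strength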
c. Term extraction consists of designing computational algorithms to extract terminological
units from texts. This is a highly technical area of expertise in its own right, and in spite of
decades of effort there is still no consensus on the best strategy for extracting terms; the
problem remains an open question. As with most term extractors, the results need human
validation, because not all of the candidates yielded will be real terms. This program learns
from examples provided by the users: in a phase prior to the analysis, the user is expected to
train the program by uploading lists of terms of the domain of interest in a given language.
From this list, the program builds a mathematical model of the terms of the domain. Once
trained, the program is ready to extract any number of terms from that domain.
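
The following is a deliberately simplified sketch of the train-then-extract idea. The actual
mathematical model built by the program is not described in the source; this toy version merely
learns which words are typical of the example terms and ranks candidate bigrams accordingly.

from collections import Counter

def train(term_list):
    # Count how often each word appears inside the example terms.
    profile = Counter()
    for term in term_list:
        profile.update(term.lower().split())
    return profile

def extract(words, profile, top_n=10):
    # Score every bigram in the corpus by its overlap with the term profile,
    # weighted by how often the bigram itself occurs.
    bigrams = Counter(zip(words, words[1:]))
    def score(bigram):
        return sum(profile[w] for w in bigram) * bigrams[bigram]
    return sorted(bigrams, key=score, reverse=True)[:top_n]

# Hypothetical training list of known domain terms.
profile = train(["wind turbine", "rotor blade", "wind farm", "pitch control"])
corpus_words = open("corpus.txt", encoding="utf-8").read().lower().split()
print(extract(corpus_words, profile))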

d. Glossary creation. Terminus has a built-in glossary model that includes the most typical fields
of a glossary, such as the grammatical category of the term, its source, contexts of occurrence,
equivalents and collocations, among many others. In addition, users can configure their own
glossary, customizing, eliminating or creating fields.
e. Term management. Once the fields of the glossary have been defined and the program
configured accordingly, the term management phase consists of creating the terminological
records and completing them with term-related information. Once the compilation of the
glossary has finished, the final step is to export the glossary in one of the available file formats.
For human readers, PDF and HTML are the most convenient formats. By contrast, users who
want to export the data in order to import it later into another database application may prefer
formats such as XML or CSV.
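
As an illustration of the machine-readable export path, the following sketch writes a couple of
records to CSV with Python's standard library. The record fields are the illustrative ones from point
d above, not the exact Terminus glossary model.

import csv

records = [
    {"term": "wind turbine", "category": "noun", "source": "corpus doc_001",
     "context": "the wind turbine converts kinetic energy into electricity",
     "equivalent_es": "aerogenerador"},
    {"term": "wind farm", "category": "noun", "source": "corpus doc_002",
     "context": "an offshore wind farm of 80 units",
     "equivalent_es": "parque eolico"},
]

with open("glossary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()   # column names become the CSV header row
    writer.writerows(records)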

[Figure: Retrieving database records, an example of a terminological record in HTML]

Terminus is an integral system covering the whole workflow, from the compilation of a corpus
to the editing of a glossary, including the analysis and exploitation of textual and terminological
data as well as the elaboration of the conceptual structure that helps to select and organize the
terms included in the glossary.
(Based on Corpus-based Terminology Processing, by M. Teresa Cabré, M. Amor Montané and
Rogelio Nazar.)
