
HUE UNIVERSITY

UNIVERSITY OF FOREIGN LANGUAGES


---------------

NGUYEN THI MY NGOC

BUILDING AN ENGLISH-VIETNAMESE PARALLEL CORPUS OF CONTEMPORARY ART TERMS

RESEARCH PROPOSAL
ENGLISH LANGUAGE

THUA THIEN HUE, 2022

CODE: 8220201

SUPERVISOR: PhD. PHAN THI THANH THAO

I. INTRODUCTION

1.1. Rationale of the study

Globalization has led to the worldwide popularity of English. With more than 1.3 billion
speakers worldwide in 2021 (Statista, 2021), English is the most widely spoken language in the
world and has long been present in almost every aspect of life, especially in science,
medicine, technology, and education, where it plays a fundamental role in connecting
researchers, scientists, teachers, students, and trainees from all over the world.
Therefore, translating the technical jargon of a scientific field from English into another
language and vice versa is inevitable, and it has never been an easy task.
Technical translation in general, and academic translation in particular, are significant
because they serve as a bridge between scholars from various cultures, allowing them to share
their knowledge and assisting in the detection of academic plagiarism (Muhiesen & Al-Ajrami,
2019). Muhiesen and Al-Ajrami (2019) investigated the challenges that researchers face when
translating technical texts such as academic texts in archaeology, focusing on the translation
from Arabic into English of a research paper titled “The Acheulian culture in Jordan
1500000-90000” by Dr. Mujahed al-Muhiesen, published in Abhath Al-Yarmouk.
According to their findings, the researcher ran into several lexical and syntactic problems
during the translation and tried to overcome them using certain strategies. Muhiesen and
Al-Ajrami (2019) suggest that, to solve these issues, the translator should have a broad
knowledge of archaeology as well as strong comprehension and linguistic competence in both
languages. However, in reality, not all experts in a given field such as health, education, or
technology are good at English, and not all professional translators have a thorough knowledge
of every science. Consequently, when faced with technical texts, they need the help of
reference sources such as dictionaries or glossaries of specialized terms or, more
conveniently, a parallel corpus that contains the relevant terminology and its translated
equivalents, to speed up the translation process.

In the field of art, the concept of “contemporary art” came into bursts of special usage in the
1920s and 1930s, and again in the 1960s, though it remained subordinate to terms like
“modern art”, “modernism”, and “postmodernism”, which emphasize art’s close but controversial
ties to social and cultural modernity (Smith, 2014). Contemporary art is nowadays the
established name for today’s art in professional settings, and it has extensive resonance in
public media and popular speech (Smith, 2014). Over the last few years in Vietnam, fine arts
students have had opportunities to study and gain access to contemporary art through theory
and composition classes. Many studies by Vietnamese professors and students in the field of
contemporary art have been exhibited and collected, and some have even received national and
international awards, resulting in a high demand for the translation of research papers and
articles containing contemporary art terms from Vietnamese into English and vice versa.
However, there is a serious shortage of reliable and accurate Vietnamese reference documents
covering contemporary fine art terms, which negatively affects teaching and learning for
Vietnamese lecturers and students in Fine Arts schools.

In Vietnam, several works provide detailed information about Vietnamese Fine Arts terms, such
as Từ điển Mỹ thuật (Dictionary of Fine Arts) by Le Thanh Loc (1998), Từ điển Mỹ thuật hội
họa thế giới (Dictionary of World Fine Arts) by Ve Hai and Tiep Nhan (2004), Từ điển thuật ngữ
Mỹ thuật phổ thông (Dictionary of Common Fine Art Terms) by Dang Bich Ngan (2002), Le Thi My
Hanh’s (2019) doctoral thesis Đặc điểm cấu tạo và ngữ nghĩa của thuật ngữ mĩ thuật tiếng Việt
(Structural and Semantic Features of Vietnamese Art Terms), and the Từ điển bách khoa Việt Nam
(literally, Encyclopedic Dictionary of Vietnam) (Vietnam National Council, 2005). In addition,
a small number of scientific studies and articles on fine art terms have appeared in various
journals and websites, but their content only revolves around basic terms of fine arts, and
none of them focuses on contemporary fine arts terminology. Moreover, there has been no study
on electronic versions of contemporary fine arts materials such as dictionaries, books, and
articles. Therefore, it is necessary to build an English-Vietnamese corpus of contemporary art
terms to meet the current learning and research needs in Vietnam. The product of this study is
expected to serve as a useful and practical reference for teaching and learning at Fine Arts
schools, so that Vietnamese lecturers, students and artists can easily make use of it when
they want to translate Vietnamese contemporary fine arts terminology into English and vice
versa. At the same time, the corpus is also expected to be a reference source for translators
who encounter art-related texts, helping them to expand their knowledge of this field. Last
but not least, the glossary provides examples of specialized art vocabulary for teachers to
use when teaching culture-specific words.

1.2. Research aims and research questions


The study illustrates a method of using cloud-based CAT tools to build an English-Vietnamese
corpus of contemporary art terms and uses a corpus-based approach to build a bilingual
glossary of English-Vietnamese contemporary art terms. The study therefore aims to investigate
how cloud-based CAT tools can be used to build an English-Vietnamese parallel corpus of
contemporary art terms. It also explores how Sketch Engine, a corpus analysis tool, can help
in effectively analyzing the parallel corpus of contemporary art terms.
The study seeks answers to the following research questions:
1. What are the procedures for building an English-Vietnamese parallel corpus of
contemporary art terms?
2. What steps are involved in creating an English-Vietnamese glossary of contemporary art
terms?
3. How is the English-Vietnamese parallel corpus of contemporary art terms applied in some
specific domains?

1.3. Outline of thesis


In addition to the references, appendices, lists of figures and tables, and list of
abbreviations, the thesis consists of five chapters:
Chapter 1: Introduction
Chapter 2: Literature review and Theoretical framework
Chapter 3: Methodology
Chapter 4: Findings and Discussion
Chapter 5: Conclusion and Implications

II. LITERATURE REVIEW
2.1. The term ‘corpus’
The term ‘corpus’ (the singular form of ‘corpora’) refers to a machine-readable collection of
authentic texts or speech produced by language speakers. Jones and Waller (2015)
maintain that a corpus is “simply an electronically stored, searchable collection of texts”
(p. 5). These texts may be written or spoken, and their length may vary, but they will
typically be longer than a single speaking turn or written sentence (Jones & Waller,
2015).

Other scholars have made efforts to clarify the term ‘corpus’. According to McEnery,
Xiao, and Tono (2006), a corpus should be a principled collection of texts, as opposed to
a random collection of texts. In this view, a corpus is a collection of (1)
machine-readable (2) authentic texts (including transcripts of spoken data) that is (3)
sampled to be (4) representative of a given language or language variety.
2.2. Kinds of corpora
Corpora come in a variety of shapes and sizes. In corpus linguistics, corpora that
represent a language as a whole are referred to as ‘general corpora’ or ‘reference
corpora’, whereas those that reflect a specific type of language use are referred to as
‘specialized corpora’ (Cheng, 2012). In terms of size, general corpora are often perceived
to be much bigger than specialized corpora. Typical general corpora include the
100-million-word British National Corpus (BNC), which consists of a wide range of texts
and is compiled to be representative of British English in general; the Bank of English,
which includes more than 600 million words; and the Corpus of Contemporary American
English (COCA), which contains around 400 million words. However, size is not the
key factor that distinguishes these two sorts of corpora. According to Cheng (2012), the
purposes of those compiling a corpus determine whether it is a general corpus or a
specialized corpus. On the one hand, general corpora attempt to provide a snapshot of a
language as a whole at a specific point in time (Zufferey, 2020). Although it is obviously
impossible to collect a representative sample of the entire language, in the same way that
a general language dictionary strives to explain a language’s common vocabulary, the
general corpus aims to offer a global image, including the main textual genres found in a
language (Zufferey, 2020). These corpora are therefore useful for studying a language as
a whole. Specialized corpora, on the other hand, are compiled
to describe language use in a specific variety, register or genre (Cheng, 2012), as well as
to offer precise answers on linguistic phenomena present in specific means of
communication, such as mobile texting, social media, or medical reports (Zufferey, 2020). To
maintain the representativeness and balance of a specialized corpus, the corpus linguist
frequently consults with specialists in the field (Cheng, 2012).

Written or spoken language samples can be found in both general and specialized
language corpora. Although written language corpora were the standard for a long time,
spoken language analysis advanced significantly during the 2000s (Zufferey, 2020).
However, because spoken language corpora need manual transcription, they are often
smaller than written language corpora. This is an issue for corpora that try to capture
general language use, owing to the practicalities and costs of collecting and transcribing
naturally occurring spoken data, whereas the sheer simplicity and convenience of collecting
electronic written texts has resulted in the compilation of countless written corpora
(Cheng, 2012).

Another criterion for differentiating existing corpora is the type of processing performed
on the corpus’s linguistic data. Raw corpora consist simply of linguistic samples; the
majority of French corpora are typical examples (Zufferey, 2020). Annotated corpora, by
contrast, add interpretative linguistic information to the corpus. The best-known type of
annotation is the addition of tags, or labels, identifying the word class to which each
word in a text belongs (Leech, 2004). This is known as part-of-speech tagging (or POS
tagging), and it allows users to distinguish between words that have the same spelling but
distinct meanings or pronunciations.
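
To make the difference between raw and annotated text concrete, the following minimal Python
sketch parses a sentence annotated in a word/TAG layout and counts each (word, tag) pair. The
sample sentence, the tag labels (written in the style of the Penn Treebank tagset), and the
helper parse_tagged are illustrative assumptions; they are not taken from any corpus cited
above.

```python
from collections import Counter

# Hypothetical annotated sample: each token is written as word/TAG.
# Tags follow the Penn Treebank style (NN = noun, VBP = present-tense verb).
TAGGED = "They/PRP record/VBP a/DT new/JJ record/NN every/DT year/NN"


def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")  # split on the last slash only
        pairs.append((word.lower(), tag))
    return pairs


pairs = parse_tagged(TAGGED)
counts = Counter(pairs)

# In a raw corpus both uses of "record" would be counted together;
# the annotation keeps the verb and the noun apart.
print(counts[("record", "VBP")], counts[("record", "NN")])
```

Counting by (word, tag) rather than by word alone is what lets a query separate the verb
"record" from the noun "record", which a raw corpus cannot do without further processing.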

A further distinction divides corpora into sample corpora and monitor corpora. Monitor
corpora expand continually, as each contains a variety of materials that grows in size
over time. By contrast, sample corpora or balanced corpora, also known as closed
corpora in the specialized literature (Zufferey, 2020), attempt to reflect a specific
language type during a specific period of time. They hold datasets that have been acquired
once and for all and will not be updated thereafter. In doing so, these corpora attempt to
be balanced and representative within a sampling frame that determines the type of
language, or population, to be characterized (McEnery & Hardie, 2012).

Existing corpora can also be classified into two groups: monolingual corpora
and multilingual corpora. There are two types of multilingual corpora: comparable
corpora, which comprise equivalent samples produced by native speakers in two or more
languages, and parallel corpora, which contain texts produced in one language and their
translations into one or more other languages. Comparable corpora
enable users to compare how people communicate in similar situations in different
languages and cultures (Zufferey, 2020).

Last but not least, many corpora are based on current written or spoken material.
However, there are also corpora that allow researchers to study the history of a language,
dating back to Old French, for instance (Zufferey, 2020). Contemporary corpora
are used to study language in a synchronic fashion, that is, at a specific point in its
evolution, while historical corpora allow researchers to study language from a
diachronic perspective, that is, across its evolution over time (Zufferey, 2020).

2.3. A brief history of corpus linguistics

2.4. The uses of corpora

2.4.1. Corpora in Lexicography

2.4.2. Corpora in Grammar/Syntactic studies

2.4.3. Corpora in Semantics

2.4.4. Corpora in Pragmatics

2.4.5. Corpora in other linguistic areas

2.5. Criteria for determining the structure of a corpus


Any selection must be based on explicit criteria, and the first significant stage in corpus
construction is establishing the criteria that decide which texts should be included in
the corpus.

According to Sinclair (2004), common criteria are:

a. the mode of the text: whether the corpus should contain written texts or transcripts
of speech;
b. the type of text: for example, if written, whether a book, a journal, a notice or a
letter should be chosen;
c. the domain of the text: academic or popular;
d. the language or language varieties of the corpus;
e. the place where the texts are produced: for example, (the English of) the UK or Australia;
f. the date of the texts.

These criteria, as Sinclair (2004) noted, should be small in number, clearly distinct from
one another, and effective as a group in defining a corpus that is typical of the language or
variety under investigation.
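
One possible way to apply such criteria in practice is to record them as metadata for each
candidate text and then filter on that metadata. The sketch below is only an illustration:
the TextRecord fields mirror criteria (a) to (f), while the field names, the select helper,
and the sample records are assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Metadata for one candidate text; field names are illustrative."""
    title: str
    mode: str        # (a) "written" or "spoken"
    text_type: str   # (b) e.g. "book", "journal", "notice", "letter"
    domain: str      # (c) "academic" or "popular"
    language: str    # (d) language or language variety
    place: str       # (e) where the text was produced
    year: int        # (f) date of the text


def select(records, **criteria):
    """Keep records whose metadata matches every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


# Hypothetical candidates, invented only to exercise the filter.
candidates = [
    TextRecord("Sample A", "written", "journal", "academic", "English", "UK", 2015),
    TextRecord("Sample B", "spoken", "transcript", "popular", "English", "Australia", 2010),
    TextRecord("Sample C", "written", "book", "academic", "English", "UK", 1995),
]

chosen = select(candidates, mode="written", domain="academic", place="UK")
print([r.title for r in chosen])
```

Keeping the criteria few and independent, as Sinclair recommends, is what makes a simple
equality filter of this kind sufficient.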

When creating a corpus, some aspects of language science must be taken into account,
such as the quantity of text to be sampled, the range of language diversity
(synchronic), and the time period of the texts (diachronic) (Sam, 2016).

Sam (2016) also suggests five stages in building a corpus:
- Planning and designing the corpus
- Selecting the data sources
- Obtaining permission from the owner(s) of the data
- Collecting and encoding the data
- Handling the corpus
2.6. Procedures for building corpora
2.7. Corpus analysis tools

2.8. Cloud-based CAT tools
2.9. Translation strategies, translation techniques, and translation procedures
2.10. Translating culture-specific terms: challenges and related studies
III. METHODOLOGY
Regarding the research methods, this study combines the following research approaches:
- Quantitative approach: statistics are compiled from the linguistic database of the
English-Vietnamese parallel corpus, which consists of more than 1 million words.
- Qualitative approach: with the help of cloud-based CAT tools, a qualitative approach
is employed to carry out the proper translation of documents from English into
Vietnamese.
3.1 Data collection procedures
3.1.1. Target Materials
The researcher will collect English-Vietnamese contemporary art terms from documents from
reliable sources, including:
- Documents published by official international organizations
- Articles published by reputable electronic newspaper sites
- Books published by prestigious international publishers
Once the materials have been selected, they will be converted into word processing
documents (Toriida, 2016). The next step is the word elimination process, that is, removing
from the corpus the words that are not needed. These are words that are not considered
content words, given the corpus’s goals and the needs of its users (Toriida, 2016). Parts
that can be deleted include reference sections and citations, repetitive textbook headings,
and figure and table headings (Toriida, 2016).
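
As a rough illustration of the elimination step, the Python sketch below strips
parenthetical citations that contain a year, drops figure and table caption lines, and cuts
everything from a "References" heading onward. The regular expressions, the clean helper,
and the sample text are assumptions made for this example; real source documents would need
patterns tuned by hand, and the citation pattern also removes bare years such as "(2014)".

```python
import re

# Hypothetical patterns; real documents will need patterns tuned by hand.
CITATION = re.compile(r"\s*\([^()]*\b(?:19|20)\d{2}[a-z]?\b[^()]*\)")
CAPTION = re.compile(r"^(?:Figure|Table)\s+\d+[.:].*$", re.IGNORECASE)
REFERENCES_HEADING = re.compile(r"^references$", re.IGNORECASE)


def clean(text):
    """Remove citations, caption lines, and the reference section."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if REFERENCES_HEADING.match(stripped):
            break  # drop the reference section and everything after it
        if CAPTION.match(stripped):
            continue  # drop figure and table headings
        kept.append(CITATION.sub("", line))
    return "\n".join(kept)


# Invented sample text, for illustration only.
SAMPLE = """Installation art transforms a space (Smith, 2014).
Figure 1: Gallery floor plan
Performance art is staged live.
References
Smith, T. (2014). Contemporary Art."""

print(clean(SAMPLE))
```

Whatever patterns are used, the cleaned output should be spot-checked manually, since an
over-eager pattern can delete content words along with the noise.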

3.1.2. Corpus size
The parallel corpus is intended to include more than 100,000 words, with the features shown
in the table below:

Feature          Description
Corpus type      English-Vietnamese parallel corpus
Domain           Contemporary Art
Directionality   Unidirectional
Languages        English-Vietnamese
Time             Contemporary
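
Progress toward the intended size can be checked with a simple token count on each side of
the corpus. The sketch below is a minimal illustration under stated assumptions: tokens are
split on whitespace, the helper names and the sample sentence pair are invented, and the
Vietnamese figure is only a rough proxy because whitespace in Vietnamese separates syllables
rather than words.

```python
TARGET_WORDS = 100_000  # intended corpus size stated in this section


def count_tokens(text):
    """Count whitespace-separated tokens (a rough proxy for words)."""
    return len(text.split())


def corpus_size(pairs):
    """Return (English tokens, Vietnamese tokens) for aligned sentence pairs."""
    en = sum(count_tokens(english) for english, _ in pairs)
    vi = sum(count_tokens(vietnamese) for _, vietnamese in pairs)
    return en, vi


# Hypothetical aligned pair, for illustration only.
pairs = [("Installation art transforms a space.",
          "Nghệ thuật sắp đặt biến đổi một không gian.")]

en_total, vi_total = corpus_size(pairs)
print(en_total, vi_total, en_total >= TARGET_WORDS)
```

A word-level count for Vietnamese would require a word segmenter, which is outside the scope
of this sketch.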

3.1.3. Pre-processing of corpus data


Following the data collection from the above-mentioned reliable sources, the data will be
stored as plain text (.txt files with UTF-8 encoding) using Notepad or Notepad++. Plain
text needs less storage space than rich text, which is why the researcher chose to store
data in this file type. Furthermore, because most corpus analysis software accepts .txt
files, these plain texts are ready for use with a wide range of corpus analysis software.
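
The same storage step can also be scripted instead of being done by hand in an editor. The
following Python sketch, offered only as an illustration, normalizes text to Unicode NFC
form (so that Vietnamese diacritics are encoded consistently), tidies whitespace, and writes
a UTF-8 .txt file. The function names, the sample string, and the file name are assumptions
for this example.

```python
import tempfile
import unicodedata
from pathlib import Path


def normalize(text):
    """Return NFC-normalized text with tidy whitespace and no empty lines."""
    text = unicodedata.normalize("NFC", text)
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def save_plain_text(text, path):
    """Write normalized text to a UTF-8 encoded .txt file."""
    Path(path).write_text(normalize(text), encoding="utf-8")


# Hypothetical sample with irregular spacing and a trailing blank line.
raw = "Nghệ  thuật   đương đại\n\n"

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sample_vi.txt"  # hypothetical file name
    save_plain_text(raw, out)
    saved = out.read_text(encoding="utf-8")

print(saved)
```

Passing the encoding explicitly matters because the default encoding for text files differs
between operating systems.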
3.1.4. Translating texts from English into Vietnamese and aligning English-Vietnamese
texts
For texts that are not available in Vietnamese, Wordfast, an effective CAT tool, will be
employed to support the translation process.
Next, both the translated texts and the texts that are originally available in both English
and Vietnamese will be aligned. In this step, alignment tools (e.g., YouAlign or LF Aligner)
will be used to create translation memory (TM) files containing aligned English-Vietnamese
sentences:
- The researcher will first upload the source-language (SL) file (English version) and the
target-language (TL) file (Vietnamese version) into the software, select the language of
each file, and then click "Align Now".
- The researcher will then select and download the output file in .tmx format. Many
CAT applications, notably Smartcat, use this file type as the default format for
translation memory database files.
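
To show what an aligned .tmx file contains, the sketch below builds and reads back a minimal
TMX document with Python's standard xml.etree.ElementTree module. It is an illustration
rather than a description of what YouAlign, LF Aligner, or Smartcat produce: the header
values (including the tool name), the language codes "en" and "vi", and the sample sentence
pair are assumptions, and files exported by real tools may use other codes such as "en-US".

```python
import xml.etree.ElementTree as ET

# ElementTree represents the xml:lang attribute with its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src="en", tgt="vi"):
    """Build a minimal TMX 1.4 document from (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src, source), (tgt, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


def read_tmx(xml_text, src="en", tgt="vi"):
    """Return (source, target) pairs from a TMX string."""
    pairs = []
    for tu in ET.fromstring(xml_text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        pairs.append((segs.get(src), segs.get(tgt)))
    return pairs


# Hypothetical aligned pair, for illustration only.
sample = [("Performance art is staged live.",
           "Nghệ thuật trình diễn được thực hiện trực tiếp.")]

tmx_text = build_tmx(sample)
print(read_tmx(tmx_text) == sample)
```

Each tu element holds one aligned sentence pair, with one tuv element per language, which is
why a TMX file can be loaded directly as a translation memory.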

3.1.5. Creating the English-Vietnamese parallel corpus and glossary of contemporary art terms
- To use Smartcat, users first have to create an account. After signing in, they will see
the initial interface of Smartcat.
- In the next step, the researcher will upload all the aligned text files (described in
3.1.4) into the Translation Memory.
- After that, the researcher will click ‘Next’ to set the name of the project, the source and
target languages, the deadline (optional) and a comment (optional).
- The glossary of contemporary art terms can be created by clicking ‘GLOSSARIES’.
- The last step is clicking ‘Finish’ to complete the process of building the corpus.
- For corpus analysis, an online tool such as Sketch Engine will be considered because it is
easy to use, with a simple, clear interface that does not require users to carry out
complicated procedures.
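
Sketch Engine and Smartcat are online services and are not scripted here. As a rough offline
illustration of the kind of query a parallel corpus and glossary support, the Python sketch
below finds aligned sentence pairs that contain an English term and counts how often the
glossary's Vietnamese equivalent appears on the other side. The function names, the sample
pairs, and the two glossary entries are assumptions invented for this example.

```python
import re


def parallel_concordance(pairs, term):
    """Return aligned pairs whose English side contains the term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(en, vi) for en, vi in pairs if pattern.search(en)]


def glossary_coverage(pairs, glossary):
    """Count, per entry, the pairs where both the English term and its
    Vietnamese equivalent occur."""
    counts = {}
    for en_term, vi_term in glossary.items():
        hits = parallel_concordance(pairs, en_term)
        counts[en_term] = sum(vi_term.lower() in vi.lower() for _, vi in hits)
    return counts


# Hypothetical aligned sentences and glossary entries, for illustration only.
pairs = [
    ("Installation art transforms a space.",
     "Nghệ thuật sắp đặt biến đổi một không gian."),
    ("Performance art is staged live.",
     "Nghệ thuật trình diễn được thực hiện trực tiếp."),
    ("The installation art of the 1990s was site-specific.",
     "Nghệ thuật sắp đặt những năm 1990 gắn với địa điểm cụ thể."),
]
glossary = {
    "installation art": "nghệ thuật sắp đặt",
    "performance art": "nghệ thuật trình diễn",
}

print(glossary_coverage(pairs, glossary))
```

A low count for an entry would suggest either that the term is rare in the corpus or that
translators rendered it differently, both of which are worth checking before the glossary is
finalized.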
IV. TENTATIVE TIMETABLE

Content                                                   Time allocation
Chapter 1: Introduction                                   1 month
Chapter 2: Literature review and Theoretical framework    3 months
Chapter 3: Methodology                                    3 months
Chapter 4: Findings and Discussion                        3 months
Chapter 5: Conclusion and Implications                    1 month
Review all chapters                                       1 month

V. REFERENCES
1. Cheng, W. (2012). Exploring Corpus Linguistics. New York, NY: Routledge.
2. Đặng, B. N. (2002). Từ điển thuật ngữ Mỹ thuật phổ thông: NXB Giáo Dục.
3. Jones, C., & Waller, D. (2015). Corpus Linguistics for Grammar. New York, NY:
Routledge.
4. Lê, T. L. (1998). Từ điển Mỹ thuật: Công ty sách Thời Đại & NXB VHTT.
5. Leech, G. (2004). Adding Linguistic Annotation. Developing Linguistic Corpora: a
Guide to Good Practice. Retrieved from
https://users.ox.ac.uk/~martinw/dlc/chapter2.htm
6. McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice.
New York, US: Cambridge University Press.
7. McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An
advanced resource book. Oxford, UK: Routledge.
8. Muhiesen, E. a. M., & Al-Ajrami, M. S. (2019). Challenges in Translating Technical
Texts. Human and Social Sciences, 46(1).
9. Sam, A. (2016). What is Corpus Linguistics and Stages in Building Corpus. Retrieved
from https://notesread.com/what-is-corpus-linguistics-and-stages-in-building-corpus/
10. Sinclair, J. (2004). Corpus and Text - Basic Principles. Developing Linguistic
Corpora: a Guide to Good Practice. Retrieved from
https://bond-lab.github.io/Corpus-Linguistics/dlc/chapter1.htm
11. Smith, T. (2014). Contemporary Art. OBO in Art History. Retrieved from
https://www.oxfordbibliographies.com/view/document/obo-9780199920105/obo-9780199920105-0007.xml
12. Statista. (2021). The most spoken languages worldwide in 2021. Retrieved from
https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
13. Tiệp Nhân, & Vệ Hải. (2004). Từ điển Mỹ thuật hội họa thế giới: NXB Mỹ Thuật.
14. Vietnam National Council. (2005). Từ điển bách khoa Việt Nam (literally,
Encyclopedic Dictionary of Vietnam). Hà Nội: Vietnam's Encyclopedia Publishing
House.
15. Zufferey, S. (2020). Introduction to Corpus Linguistics. Great Britain: ISTE Ltd.
