Structuring of Translation Memory Structuring of Translation Memory Work-Integrated Learning Programmes Division
All content following this page was uploaded by Ashutosh Kumar on 17 September 2018.
by
Ashutosh Kumar
2015HT13439
April, 2018
Structuring of Translation Memory
Birla Institute of Technology & Science, Pilani
ABSTRACT
ACKNOWLEDGEMENTS
I would also like to thank Dr. Mukul Kumar Sinha for his insightful comments
and encouragement, and for asking the hard questions which helped me
widen my research from various perspectives.
I would also like to thank my colleague, Ms. Himanshi Thapliyal for editing and
proof-reading the dissertation.
Last but not least, I would like to express my love and gratitude to my
beloved family, for their understanding and motivation throughout the duration
of this project.
Table of Contents
CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
Chapter 1 : Introduction
1.1 Translation Memory and Uses
1.2 Basic Approach of Retrieval from TM
1.3 Problems faced
1.4 Our approach to solve this problem
1.5 Objective
Chapter 2 : Translation Memory - Literature review
2.1 Translation Memory
2.2 Matching Techniques for TM
2.2.1 Simple Matching
2.2.2 Using “linguistic knowledge”
2.3 Types of Matches
2.3.1 Exact Match (100%)
2.3.2 Fuzzy Match (1% - 99%)
2.3.3 No Match
2.4 Ways of creating TM
2.4.1 During Translation
2.4.2 Importing another TM database
2.4.3 Aligning existing translations and their original texts
2.5 Data exchange standards for TM systems
2.6 TM vs MT
2.7 TM vs term-base?
Chapter 3 : Experiments and Results
3.1 Clause Extraction
3.2 Experimental Data
3.3 Relevant Match
3.4 Experiments
3.4.1 Configuration
3.4.1.1 TM configuration 1
3.4.1.2 TM configuration 2
3.4.1.3 TM configuration 3
3.4.2 Process to Run Experiment
3.5 Experimental Results
3.5.1 Result: Data Set-A
3.5.2 Result: Data Set-B
3.5.3 Result: Data Set-C
3.5.4 Conclusion of Results
3.5.4.1 Conclusion 1
3.5.4.2 Conclusion 2
3.5.4.3 Conclusion 3
Annex
SUMMARY
Direction for future work
REFERENCES
Checklist of items for the Final Dissertation Report
List of Figures
Fig 1: constituency parse tree
Fig 2: analysis graph
List of Tables
Table 1: TM without clause structuring
Table 2: TM with clause splitting
Table 3: Result of Set-A
Table 4: Result of Set-B
Table 5: Result of Set-C
Table 6: TM configuration 1
Table 7: TM configuration 2
Table 8: TM configuration 1
Table 9: TM configuration 3
Chapter 1 : Introduction
1.2 Basic Approach of Retrieval from TM
It has been observed that, while a TM is being built, a relevant clause
sometimes exists in the TM data but is not retrieved. This is because the
TM tool I am working with is structured in sentences only.
With sentence level data, the matching score for a clause level match falls
below the acceptable threshold, so the relevant result from the TM is dropped.
Thus a valid and relevant TU is not available to the translator, which affects
his or her efficiency. For example:
Query data | TM data
1. Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep |
Table 1: TM without clause structuring
from TM. The relevant match for sentence 2 in Query data was not found
because this match (36%) is below the defined threshold (75%) for TM.
Therefore recommending a low threshold value for partial match is not a very
good solution.
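The percentages quoted above can be reproduced with a word level edit distance, one of the measures this dissertation later names as typical for fuzzy matching. The following Python sketch is illustrative only: the scoring formula actually used by the TM tool is not stated here, so the normalisation by the longer segment, the lower-casing, and the names word_edit_distance and fuzzy_match are assumptions.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a word
                           cur[j - 1] + 1,              # insert a word
                           prev[j - 1] + (wa != wb)))   # substitute a word
        prev = cur
    return prev[-1]


def fuzzy_match(query, segment):
    """Similarity in [0, 1]; normalising by the longer segment is an assumption."""
    q, s = query.lower().split(), segment.lower().split()
    if not q and not s:
        return 1.0
    return 1 - word_edit_distance(q, s) / max(len(q), len(s))


TM_SENTENCE = ("Earphone is the best option available for them "
               "as it doesn't disturb others' sleep")
QUERY = "It doesn't disturb others' sleep"
THRESHOLD = 0.75  # fuzzy threshold used in this dissertation

score = fuzzy_match(QUERY, TM_SENTENCE)
print(f"match = {score:.0%}, retrieved = {score >= THRESHOLD}")
```

Under these assumptions the query scores about 36% against the full TM sentence, well under the 75% threshold, which agrees with the figure reported in the text.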
In our approach we use clause splitting to define a clause level structure for
the TM. Each sentence is split into clauses, and the clauses are stored along
with the sentence they come from. A clause level TM therefore contains the
sentences together with the clauses of these sentences.
Query data | TM data
1 Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2 It doesn't disturb others' sleep | 1.1 Earphone is the best option available for them
Table 2: TM with clause splitting
Table 1 shows TM source data without clause structuring, and Table 2 shows TM
source data with clause structuring. Sentence 1 of the TM source data in Table 1
is divided into two clauses, i.e. clause 1.1 and clause 1.2, in Table 2. Now if
we try to find a match in the TM data for sentence 2 of the Query data, we get
the desired match because it crosses the defined threshold (75%) for the TM.
1.5 Objective
Chapter 2 : Translation Memory - Literature review
Translation memory systems (TMs) have been around for a number of years,
and their use is now widespread not only among translators but also among
language service providers (LSPs) and other translation services (public service
or other corporate settings) that have seen the numerous advantages that
TMs offer. Today, TMs are the most widely used translation aid on the market.
In past years, a lot of research has been done on comparing two strings
syntactically. A string similarity measure states the extent to which two given
strings are similar or dissimilar. These algorithms can be used effectively in
fuzzy string searches, which constitute the main approach for finding the best
match in a TM when an exact match is not found [1].
The matching algorithms have been altered and modified over time, but they
still rely on simple character- or token-based matching procedures without
taking into account linguistic aspects like morphosyntactic, syntactic or
semantic features that may determine the “similarity” of translation units [1].
2.2.2 Using “linguistic knowledge”
Current approaches in commercial and research systems [1]
Linguistics-driven efforts on enhancing retrieval in TM systems are basically
motivated by two different goals:
Exact matches are found if the word chain of a new Input Segment is identical
to that of a stored source-text segment in TM.
Example:
• What are earphones and why do people use them. (Input Segment)
• What are earphones and why do people use them. (TM segment)
Example:
• What are earphones and why do people use them. (Input Segment)
• What are wireless earphones and why do people use them. (TM segment)
2.3.3 No Match
A TMX document is an XML document whose root element is <tmx>. The
<tmx> element contains two children: <header> and <body>. General
information about the TMX document is described in the attributes of the
<header> element. Additional information is provided in the <note>, <ude>,
and <prop> elements. The main content of the TMX document is stored inside
the <body> element. It holds a collection of translations contained in
translation unit elements (<tu>). Each translation unit contains text in one
or more languages in translation unit variant elements (<tuv>) [7].
<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>Bonjour tout le monde!</seg>
      </tuv>
    </tu>
  </body>
</tmx>
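To illustrate how such a file can be consumed, the sketch below reads a TMX sample with Python's standard xml.etree.ElementTree module and collects, for each <tu>, the segment text keyed by language. It is only a minimal reading of the structure described above, not a full TMX parser; the function name read_tmx is hypothetical, and the target language code "fr" is an assumption made for the sample.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace, so ElementTree exposes it
# under this qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en" o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour tout le monde!</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            variants[tuv.get(XML_LANG)] = tuv.findtext("seg")
        units.append(variants)
    return units


for unit in read_tmx(TMX_SAMPLE):
    print(unit)
```

Each dictionary corresponds to one translation unit, so a sentence level TM and a clause level TM differ only in what is stored in the <seg> elements.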
2.5 Benefit of TM for translator/post-editor
2.6 TM vs MT
2.7 TM vs term-base?
Chapter 3 : Experiments and Results
OpenNLP [6], an open-source toolkit, supports the most common Natural Language
Processing (NLP) tasks, such as tokenization, sentence segmentation,
part-of-speech tagging, named entity extraction, chunking, parsing, language
detection and coreference resolution.
The Penn Treebank defines clause level tags that identify the clauses in a
parse tree, such as SBAR, SQ and SINV. We use these tags to extract the
clauses from a given English sentence.
“Earphone is the best option available for them as it doesn't disturb others' sleep”
[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP [VBD
doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]
The pictorial representation of this parse tree is given in “Fig 1”. We can see
that the SBAR tag node represents one clause and the remaining S tag node
represents the other clause.
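The extraction step can be sketched directly on the bracketed parse shown above. The Python code below is an assumption-laden illustration, not the OpenNLP API and not necessarily the procedure used in this work: it parses the bracket notation with a small hand-written reader, treats the outermost SBAR node as the subordinate clause, and takes the remaining words of the sentence as the main clause. The helper names parse_brackets, leaves and split_clauses are hypothetical.

```python
import re

PARSE = ("[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] "
         "[NN option]] [ADJP [ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] "
         "[SBAR [IN as] [S [NP [PRP it]] [VP [VBD doesn't] [NP [ADJP [NN disturb] "
         "[JJ others']] [NN sleep]]]]]]]] [. .] ]]")


def parse_brackets(text):
    """Turn '[LABEL child ...]' notation into nested (label, children) tuples."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "["
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:                     # a leaf word
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip "]"
        return (label, children)

    return node()


def leaves(tree):
    """All words under a node, left to right."""
    words = []
    for child in tree[1]:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


def split_clauses(tree, clause_tags=("SBAR",)):
    """Return (main clause words, list of subordinate clause word lists)."""
    subordinate = []

    def walk(t):
        label, children = t
        if label in clause_tags:      # outermost clause node: cut it out whole
            subordinate.append(leaves(t))
            return []
        if label == ".":              # drop sentence-final punctuation
            return []
        words = []
        for child in children:
            words.extend(walk(child) if isinstance(child, tuple) else [child])
        return words

    return walk(tree), subordinate


main, subs = split_clauses(parse_brackets(PARSE))
print(" ".join(main))
print([" ".join(s) for s in subs])
```

For the example sentence this yields "Earphone is the best option available for them" as the main clause and "as it doesn't disturb others' sleep" as the SBAR clause. Nested clause nodes inside an SBAR are not split further in this sketch.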
Fig 1: constituency parse tree
For our experiments we created three different test data sets for the English
language. Set-A contains 100 input sentences, Set-B contains 200 input
sentences and Set-C contains 300 input sentences. For the TM we have a data set
of 3,500 sentences, containing both simple and complex sentences. A complex
sentence contains more than one clause.
A match for a sentence of Query data in TM data will be a relevant match if the
percentage of matching for the query sentence with the available data in TM is
equal to or greater than the fuzzy threshold set for the TM. For our experiment
this fuzzy threshold is set to 75%.
In addition to this, there are several query inputs which partially match the
TM data. Because their matching percentage is less than the defined threshold,
they are left out of the result. In fact, if we examine them closely, we see
that some of these matches are also relevant for the translator.
For example, the given sentence in TM is
“Earphone is the best option available for them as it doesn't disturb others' sleep”
As the percentage match (36%) of the Query sentence against it is below the set
threshold (75%) of the TM, it is not reflected in the Query results.
The translator did not get any match but the given sentence in TM is a relevant
match because it contains Query sentence as a clause which will significantly
help the translator.
In our experiments, we have ignored the number of results retrieved for a
single query. We are only interested in whether at least one match is found
when the query exists as a clause in the TM.
3.4 Experiments
3.4.1 Configuration
In this dissertation we have performed experiments with three kinds of TM
configurations.
3.4.1.1 TM configuration 1
In TM configuration 1, Query data contains full sentences. TM data also
contains sentences. In this configuration, there is no clause level splitting
either in Query data or in TM data.
3.4.1.2 TM configuration 2
3.4.1.3 TM configuration 3
For a given sentence or clause, we prepare a Query data set (which may contain
one or more queries). We look up the TM database to see whether there is a
match for each query in the TM data. Whenever even a single relevant match is
found for a given query, we record the value of "Match Found" as ‘True’;
otherwise "Match Found" is set to ‘False’.
Based on this we calculate how many queries were matched with TM data and how
many were not.
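The procedure described above can be sketched as a short loop. This is a hedged Python illustration rather than the tool used for the experiments: the similarity function (word level edit distance normalised by the longer segment), the name run_experiment, and the text of the second clause (taken as the words under the SBAR node of the parse tree in the clause extraction section) are assumptions.

```python
def similarity(a, b):
    """Word-level edit-distance similarity in [0, 1] (assumed scoring)."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (wx != wy)))
        prev = cur
    return 1 - prev[-1] / max(len(x), len(y), 1)


def run_experiment(queries, tm_segments, threshold=0.75):
    """Record "Match Found" per query and return the overall % Match."""
    rows = []
    for query in queries:
        match_found = any(similarity(query, seg) >= threshold
                          for seg in tm_segments)
        rows.append((query, match_found))
    matched = sum(1 for _, found in rows if found)
    return rows, 100.0 * matched / len(rows)


SENTENCE = ("Earphone is the best option available for them "
            "as it doesn't disturb others' sleep")
QUERIES = [SENTENCE, "It doesn't disturb others' sleep"]

# Sentence-level TM versus the same TM extended with the sentence's clauses.
TM_SENTENCES = [SENTENCE]
TM_WITH_CLAUSES = [SENTENCE,
                   "Earphone is the best option available for them",
                   "as it doesn't disturb others' sleep"]  # assumed clause text

for name, tm in [("sentences only", TM_SENTENCES),
                 ("sentences and clauses", TM_WITH_CLAUSES)]:
    rows, percent = run_experiment(QUERIES, tm)
    print(name, [found for _, found in rows], f"{percent:.0f}% Match")
```

On this two-query toy set the sentence-only TM matches one query out of two, while the clause-structured TM matches both; the figures in the result tables come from the real data sets, not from this sketch.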
3.5 Experimental Results
‘S’ denotes the sentences, ‘C’ denotes the clauses of these sentences, and
‘S, C’ denotes the sentences together with their clauses.
3.5.3 Result: Data Set-C
Table 3, Table 4 and Table 5 show the results of the experiments for the
different data sets (Set-A, Set-B, Set-C) and the different TM configurations.
We have created a line graph (shown in “Fig 2”) from these tables. The Y-axis
of the line graph shows the percentage of matches found for Query data in TM
data. The X-axis shows the different configurations defined in the
configuration section.
3.5.4.1 Conclusion 1
Table 3, Table 4 and Table 5 show that whenever there is clause splitting,
either in the Query data or in the TM data, there is an increase in % Match
from TM.
3.5.4.2 Conclusion 2
In Table 3, Table 4 and Table 5, the Query data set is different in each case.
Despite the different data sets (Set-A, Set-B, Set-C), we observe an increasing
trend in % Match from TM for each of them. Hence we can conclude that clause
splitting will always increase the % Match over sentence level data.
3.5.4.3 Conclusion 3
In Table 3, Table 4 and Table 5, the increase in % Match is not uniform across
cases. This is because the % Match for a given data set depends on two factors,
i.e. the TM data and the Query data.
Annex
Example of TM configurations
Query data | TM data
1. Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep |
Table 6: TM configuration 1
To explain how the TM works, Table 6 shows two instances of Query data and
one instance of TM data. When we search for the first query in the TM database
we get a 100% match, as shown in the TM data.
For the second query we do not get any result from the TM database, as the
matching percentage is 36% and the threshold of acceptable match for the
TM is 75%. But we can see that the second query has a relevant match
(the underlined portion of sentence 1 of the TM data).
When we use “TM Configuration 2” for the data set shown in Table 6, we
get the data shown in Table 7. We can see in Table 7 that if we search for
sentence 2 of the Query data, we get an acceptable % match. It matches TM
data 1.2 (which is actually a clause of sentence 1 in the TM data).
Query data | TM data
1 Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2 It doesn't disturb others' sleep | 1.1 Earphone is the best option available for them
Table 7: TM configuration 2
Query data | TM data
1. You should wear some protective materials while using the things that contain chemicals and you should keep the chemical from the reach of kids | 1. Drugs help us when we are not feeling well but you should keep the drugs from the reach of kids because it may be dangerous for the kids
Table 8: TM configuration 1
Query data | TM data
1. You should wear some protective materials while using the things that contain chemicals and you should also keep the chemical from the reach of kids | 1. Drugs help us when we are not feeling well but you should keep the drugs from the reach of child because it may be dangerous for the child
1.1 You should wear some protective materials while using the things that contain chemicals | 1.1 Drugs help us when we are not feeling well
1.2. and you should keep the chemical from the reach of kids | 1.2 but you should keep the drugs from the reach of kids
 | 1.3 because it may be dangerous for the kids
Table 9: TM configuration 3
Now we use “TM configuration 3” for the data set in Table 8, and the results
are shown in Table 9. Sentence 1 of the Query data is divided into two clauses,
i.e. clause 1.1 and clause 1.2. Sentence 1 of the TM data is divided into three
clauses, i.e. clause 1.1, clause 1.2 and clause 1.3. As shown in the table, a
relevant match exists between clause 1.2 of the Query data and clause 1.2 of
the TM data. As they are in the form of clauses, the percentage match (82%) in
this case is higher than the set TM threshold (75%). Hence they qualify as an
acceptable match and are reflected in the result set. They were earlier dropped
from the result set, as shown in Table 8.
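The clause pair from Table 9 can be scored with the same kind of word level measure. The Python sketch below is again illustrative: the scoring formula of the TM tool is not given in this text, so the edit distance normalised by the longer segment and the function name clause_score are assumptions. Every Query clause is compared with every TM clause and the best pair is reported.

```python
def clause_score(a, b):
    """Word-level edit-distance similarity in [0, 1] (assumed scoring)."""
    x, y = a.lower().split(), b.lower().split()
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (wx != wy)))
        prev = cur
    return 1 - prev[-1] / max(len(x), len(y), 1)


QUERY_SENTENCE = ("You should wear some protective materials while using the "
                  "things that contain chemicals and you should keep the "
                  "chemical from the reach of kids")
QUERY_CLAUSES = [
    "You should wear some protective materials while using the things "
    "that contain chemicals",
    "and you should keep the chemical from the reach of kids",
]
TM_SENTENCE = ("Drugs help us when we are not feeling well but you should keep "
               "the drugs from the reach of kids because it may be dangerous "
               "for the kids")
TM_CLAUSES = [
    "Drugs help us when we are not feeling well",
    "but you should keep the drugs from the reach of kids",
    "because it may be dangerous for the kids",
]

THRESHOLD = 0.75
sentence_level = clause_score(QUERY_SENTENCE, TM_SENTENCE)
best = max((clause_score(q, t), q, t) for q in QUERY_CLAUSES for t in TM_CLAUSES)

print(f"sentence level retrieved: {sentence_level >= THRESHOLD}")
print(f"best clause pair: {best[0]:.0%} -> {best[1]!r} / {best[2]!r}")
```

With this measure the two 1.2 clauses differ in two of eleven words, giving roughly 82%, which agrees with the percentage reported above, while the full sentences stay below the 75% threshold.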
SUMMARY
We have seen that when we use clause level structuring in TM, relevant
matches for Query data that were earlier dropped due to low percentage match
in sentence level, are also retrieved in the resulting set.
So we get more relevant matches for Query data from the TM database.
Direction for future work
We have seen that when we use clause level structuring in TM, relevant
matches for Query data that were earlier dropped due to low percentage match
in sentence level, are also retrieved in the resulting set.
Clause level structuring increases the size of the TM database because it
contains the sentences as well as their clauses. This increase in size may
affect the performance of search in the TM database. Most search algorithms
for fuzzy matching are based on word level edit distance or a bag-of-words
model.
The performance can be improved if we keep POS tag annotations for each
word in the TM and in the Query data. Our hypothesis is that keeping the POS
tag information of words may improve performance as well as accuracy, because
we can then search for a TM match based on tag information instead of words,
and so find out which text in the TM is structurally the same as the Query
data. For this experiment a large POS tagged corpus will be required for
building the TM database. This work is in progress.
REFERENCES :
[1] Reinke, U. (2013). State of the Art in Translation Memory Technology. Proceedings of
the Workshop on Natural Language Processing for Translation Memories (NLP4TM),
pages 17–23.
[2] Grönroos, Mickel, and Becks, Ari. (2005). Bringing Intelligence to Translation Memory
Technology. Translating and the Computer 27, November 2005. London: Aslib.
[3] Timonera, Katerina, and Mitkov, Ruslan. (2015). Improving Translation Memory
Matching through Clause Splitting. Proceedings of the Workshop on Natural Language
Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria,
September 2015.
[4] Sharma, Sanjeev Kumar. (2016). Clause Boundary Identification for Different
Languages: A Survey. International Journal of Computer Applications & Information
Technology, Vol. 8, Issue II, 2016 (ISSN: 2278-7720).
[5] https://stanfordnlp.github.io/CoreNLP/
[6] https://opennlp.apache.org/
[7] https://www.ibm.com/developerworks/library/x-localis3/
[8] LeBlanc, Matthieu (Université de Moncton). Translators on translation memory (TM).
Results of an ethnographic study in three translation services and agencies.
[9] Christensen, Tina Paulsen, and Schjoldager, Anne. (2011). The Impact of
Translation-Memory (TM) Technology on Cognitive Processes. NLPSC 2011.
[10] Zerfass, A. (2002). Evaluating Translation Memory Systems. Proceedings of the LREC
2002 Workshop, Las Palmas, Canary Islands, Spain.
[11] Baldwin, Timothy, and Tanaka, Hozumi. (2001). Balancing up Efficiency and Accuracy
in Translation Retrieval. Journal of Natural Language Processing, vol. 8.
[12] Soroosh, Mortezapoor. (2013). Translation Memory and Bitext Alignment. Seminar in
Artificial Intelligence.