Structuring of Translation Memory

Thesis · April 2018

DOI: 10.13140/RG.2.2.22931.94240

Structuring of Translation Memory
BITS ZG628T: Dissertation
by
Ashutosh Kumar
2015HT13439
Dissertation work carried out at
Expert Software Consultants Pvt. Ltd., Noida
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE

PILANI (RAJASTHAN)
April, 2018
Structuring of Translation Memory
by
Ashutosh Kumar
2015HT13439
Dissertation work carried out at

Expert Software Consultants Pvt. Ltd., Noida
Submitted in partial fulfillment of M.Tech. Software Systems
Under the Supervision of

Dr. Pawan Kumar, Sr. Software Engineer
Expert Software Consultants Pvt. Ltd., Noida
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE

PILANI (RAJASTHAN)
April, 2018
Birla Institute of Technology & Science, Pilani
Work-Integrated Learning Programmes Division
Second Semester 2017-2018
ABSTRACT
BITS ID No. : 2015HT13439
NAME OF THE STUDENT : Ashutosh Kumar
EMAIL ADDRESS : 2015HT13439@wilp.bits-pilani.ac.in
STUDENT’S EMPLOYING ORGANIZATION & LOCATION : Expert Software Consultants Pvt. Ltd., Noida
SUPERVISOR’S NAME : Dr. Pawan Kumar
SUPERVISOR’S EMPLOYING ORGANIZATION & LOCATION : Expert Software Consultants Pvt. Ltd., Noida
SUPERVISOR’S EMAIL ADDRESS : hawahawai@gmail.com
DISSERTATION TITLE : Structuring of Translation Memory
ABSTRACT
Translation Memory (TM) is an archive of previously translated segments. It
stores each source-language segment together with its translation into the
target language. Here, a segment is a sentence or a paragraph.

When a translator uses a TM tool to translate a new segment, the tool
identifies similarities between the segment in the Query data and the segments
stored in the TM database. The translator may then insert the previous
translation of that segment as it is, or make slight changes to it.

Most TM tools store sentence-level segments in the TM database. Hence the
benefits of TM are realized only for identical or similar sentences, which may
occur rarely because sentences are usually complex, while sentence fragments
(clauses) may recur much more often.

In a TM comprising sentence-level segments, it may happen that the input
sentence is a sub-segment (clause) of a stored sentence whose translation is
available in the TM, yet the search-and-retrieval function returns no result.
This happens because the matching percentage between the input and the TM
data is below the threshold defined for the TM: since the Query sentence is
only a small part (a clause) of the longer TM sentence, it does not satisfy the
filtering criterion for a partial match. Thus a relevant segment that was
available in the TM, not identical but similar enough for its translation to be
worth recalling for the translator, is dropped.

This dissertation introduces clause-level structuring of the TM in order to
retrieve the relevant matches for Query data that are dropped when a
sentence-level TM is used.
ACKNOWLEDGEMENTS
Firstly, I would like to express my sincere gratitude to my Supervisor, Dr.
Pawan Kumar, for his continuous support of my M.Tech. study and related
research, and for his patience, motivation, and immense knowledge. His guidance
helped me throughout the research and the writing of this dissertation.

I would also like to thank Dr. Mukul Kumar Sinha, not only for his insightful
comments and encouragement, but also for asking the hard questions that helped
me widen my research from various perspectives.

I would also like to thank my colleague, Ms. Himanshi Thapliyal, for editing and
proof-reading the dissertation.

Last but not least, I would like to express my love and gratitude to my
beloved family for their understanding and motivation throughout the duration
of this project.
Table of Contents
CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
Chapter 1 : Introduction
  1.1 Translation Memory and Uses
  1.2 Basic Approach of Retrieval from TM
  1.3 Problems faced
  1.4 Our approach to solve this problem
  1.5 Objective
Chapter 2 : Translation Memory - Literature review
  2.1 Translation Memory
  2.2 Matching Techniques for TM
    2.2.1 Simple Matching
    2.2.2 Using “linguistic knowledge”
  2.3 Types of Matches
    2.3.1 Exact Match (100%)
    2.3.2 Fuzzy Match (1% - 99%)
    2.3.3 No Match
  2.4 Ways of creating TM
    2.4.1 During Translation
    2.4.2 Importing another TM database
    2.4.3 Aligning existing translations and their original texts
  2.5 Data exchange standards for TM systems
  2.6 TM vs MT
  2.7 TM vs term-base?
Chapter 3 : Experiments and Results
  3.1 Clause Extraction
  3.2 Experimental Data
  3.3 Relevant Match
  3.4 Experiments
    3.4.1 Configuration
      3.4.1.1 TM configuration 1
    3.4.2 Process to Run Experiment
  3.5 Experimental Results
    3.5.1 Result: Data Set-A
    3.5.2 Result: Data Set-B
    3.5.3 Result: Data Set-C
    3.5.4 Conclusion of Results
      3.5.4.1 Conclusion 1
      3.5.4.2 Conclusion 2
      3.5.4.3 Conclusion 3
Annex
SUMMARY
Direction for future work
REFERENCES
Checklist of items for the Final Dissertation Report

List of Figures
Fig 1: constituency parse tree
Fig 2: analysis graph

List of Tables
Table 1: TM Without clause structuring
Table 2: TM with clause splitting
Table 3: Result of Set-A
Table 4: Result of Set-B
Table 5: Result of Set-C
Table 6: TM configuration 1
Chapter 1 : Introduction
In recent years, with the intensive application of Artificial Intelligence
techniques, the accuracy of machine translation systems has improved
significantly. Yet a lot of human editing is still required to make the
translated content correct and fluent (publishable quality). To improve the
quality of translated content, human translators use several computer-aided
tools that assist them in their translation and editing tasks. These tools
include dictionaries, glossaries, term-bases, translation memories, spell
checkers, concordancers, etc. Among them, Translation Memory (TM) is one of
the most important tools for enhancing the productivity of a translator.
1.1 Translation Memory and Uses
TM is an archive of previously translated texts that stores source-language
text together with its translation into the target language. It is generally
created during the process of translation by human translators.

Each record in a TM is called a Translation Unit (TU). A TU contains a pair
of segments, one in the source language and one in the target language. A
segment may be a paragraph, a sentence or a phrase [1].

A human translator can use the TM to translate a new text segment. The
translator searches for the new segment in the TM and, if a matching segment
is found, it is presented to the translator. The translator can use this
segment as it is, revise it, or reject it.

TM helps human translators increase their productivity. It helps ensure that
the same terminology is used consistently across translations and that the
style of translation is uniform across a large document.
1.2 Basic Approach of Retrieval from TM
The search technique used to retrieve a TU from the TM is generally based on
word-level string matching between the TM source-language data and the Query
data.

A source-language segment in the TM can be an exact match (100%) of the Query
data; a partial match, if its match percentage falls in the acceptable range of
approximately 75%-99%; or no match, if its match percentage is below the 75%
threshold. The choice of threshold depends on the accuracy of the matching
algorithm used.
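
The text does not name the similarity measure used by the TM tool. As a minimal sketch, assuming a word-level edit distance normalised by the length of the longer segment, the three match classes could be computed as follows. The function names, the whitespace tokenisation and the case folding are illustrative assumptions, not the tool's actual algorithm.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word lists (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def match_percent(query, tm_source):
    """Assumed measure: 100 * (1 - distance / length of the longer segment)."""
    q, t = query.lower().split(), tm_source.lower().split()
    longest = max(len(q), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - word_edit_distance(q, t) / longest)


def classify(query, tm_source, threshold=75.0):
    """Return 'exact', 'fuzzy' or 'none' for one TM source segment."""
    score = match_percent(query, tm_source)
    if score == 100.0:
        return "exact"
    return "fuzzy" if score >= threshold else "none"


SENTENCE = ("Earphone is the best option available for them "
            "as it doesn't disturb others' sleep")
print(classify(SENTENCE, SENTENCE))
print(classify("It doesn't disturb others' sleep", SENTENCE))
```

Under this assumed measure, a five-word clause compared with the fourteen-word sentence that contains it needs nine word insertions, which gives a score of about 36% and therefore no match at a 75% threshold.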
1.3 Problems faced
It has been observed that, when the TM is queried, a relevant clause sometimes
exists in the TM data but is not retrieved. This is because the TM tool I work
with stores sentence-level segments only.
With sentence-level data, the match percentage for a clause falls below the
acceptable threshold, so the relevant result from the TM is dropped. Thus a
valid and relevant TU is not available to the translator, which affects his or
her efficiency. For example -
Query data | TM data
1. Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep |
Table 1: TM Without clause structuring
Table 1 shows two sentences in the Query data. An exact match for sentence 1
is available in the TM data. Sentence 2 is one of the clauses of sentence 1 in
the TM source data, so that TM sentence is a relevant match for it. However,
no match is returned for sentence 2, because its match percentage (36%) is
below the threshold defined for the TM (75%).

To circumvent this problem, we could lower the threshold so that the relevant
sentences left out by the higher threshold are retrieved. But if the
translation memory is large, a lower threshold may also retrieve too many
irrelevant sentences. Therefore, recommending a low threshold value for
partial matches is not a good solution.
1.4 Our approach to solve this problem
In our approach we use clause splitting to define a clause-level structure for
the TM. We split each sentence into clauses and store the clauses alongside
their parent sentence, so that the TM contains the sentences along with the
clauses of these sentences.

Retrieving clauses is desirable because a match is more likely to be found for
a clause than for a complex sentence (one that contains more than one clause).
A clause expresses a complete thought because it comprises a subject and a
verb. Hence, even if a translator does not find a match for the entire
sentence, he or she may still get a match for its clauses and will benefit
from it.
Query data | TM data
1. Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep | 1.1 Earphone is the best option available for them
 | 1.2 as it doesn't disturb others' sleep
Table 2: TM with clause splitting
Table 1 shows the TM source data without clause structuring, and Table 2 shows
it with clause structuring. Sentence 1 of the TM source data in Table 1 is
divided into two clauses in Table 2: clause 1.1 and clause 1.2. If we now look
up sentence 2 of the Query data in the TM, we get the desired match with
clause 1.2, because the match percentage crosses the threshold defined for the
TM (75%).
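
The clause-level structure described above can be sketched as a flat list of TM entries in which every clause is stored next to its parent sentence. This is a minimal illustration and not the dissertation's implementation: the `Entry` class, the ID scheme mirroring Table 2, and the use of Python's `difflib.SequenceMatcher` on word lists as a stand-in similarity measure are all assumptions, so the percentages it produces differ from those quoted in the text. Clause boundaries are supplied by hand here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Entry:
    entry_id: str  # "1" for a sentence, "1.2" for its second clause (hypothetical scheme)
    source: str    # source-language text; the target-language side is omitted here


def similarity(a, b):
    """Stand-in measure: difflib ratio over lower-cased word lists, as a percentage."""
    return 100.0 * SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def build_tm(sentences, with_clauses):
    """sentences: list of (sentence, [clauses]); clauses are stored only if requested."""
    tm = []
    for n, (sentence, clauses) in enumerate(sentences, start=1):
        tm.append(Entry(str(n), sentence))
        if with_clauses:
            tm.extend(Entry(f"{n}.{k}", c) for k, c in enumerate(clauses, start=1))
    return tm


def lookup(query, tm, threshold=75.0):
    """Return the IDs of entries scoring at or above the threshold, best first."""
    scored = [(similarity(query, e.source), e.entry_id) for e in tm]
    return [eid for score, eid in sorted(scored, reverse=True) if score >= threshold]


DATA = [("Earphone is the best option available for them as it doesn't disturb others' sleep",
         ["Earphone is the best option available for them",
          "as it doesn't disturb others' sleep"])]
QUERY = "It doesn't disturb others' sleep"

print(lookup(QUERY, build_tm(DATA, with_clauses=False)))  # sentence-level TM
print(lookup(QUERY, build_tm(DATA, with_clauses=True)))   # clause-level TM
```

Keeping the parent sentence's number as a prefix of each clause ID lets a tool show the translator the full sentence from which a matched clause was taken.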
1.5 Objective
The goal of this dissertation is to retrieve the relevant matches for Query
data that are dropped when a sentence-level TM is used. To achieve this, we
define a clause-level structuring of the TM and use it to match the Query data
against the TM data.
Chapter 2 : Translation Memory - Literature review
Translation memory systems (TMs) have been around for a number of years,
and their use is now widespread not only among translators but also among
language service providers (LSPs) and other translation services (public
service or other corporate settings) that have seen the numerous advantages
TMs offer. Today, TMs are the most widely used translation aid on the market.
2.1 Translation Memory
TM technology [10][12] is used as a translator’s aid for producing
high-precision translations. Basically, a translation memory is a database
consisting of Translation Units (TUs), each of which pairs a source-language
segment with its translation. A TM system matches the source-language input
(the text to be translated) against the source-language text in the TM and
retrieves the matched result in both the source and the target language. The
human translator can then accept the translation, replace it with a fresh
translation, or modify it to fit the new context and update it in the TM.
2.2 Matching Techniques for TM
In past years, a lot of research has been done on comparing two strings
syntactically. A string similarity measure states the extent to which two
given strings are similar or dissimilar. Such measures can be used effectively
in fuzzy string searches, which constitute the main approach to finding the
best match in a TM when an exact match is not found [1].
2.2.1 Simple Matching
The matching algorithms have been altered and modified over time, but they
still rely on simple character- or token-based matching procedures without
taking into account linguistic aspects like morphosyntactic, syntactic or
semantic features that may determine the “similarity” of translation units [1].
2.2.2 Using “linguistic knowledge”
In current commercial and research systems, linguistics-driven efforts to
enhance retrieval in TM systems are basically motivated by two different
goals [1]:
(1) Improving recall and precision of (monolingual) retrieval, i.e. enhancing
the quantity, quality and ranking of matches, at segment level and at
subsegment level [1].
(2) Automated adjustment of fuzzy matches to enhance reusability and reduce
post-editing effort by integrating SMT technology into TM systems [1].
2.3 Types of Matches
2.3.1 Exact Match (100%)
Exact matches are found if the word chain of a new Input Segment is identical
to that of a stored source-text segment in TM.
Example:
• What are earphones and why do people use them. (Input Segment)
• What are earphones and why do people use them. (TM segment)
2.3.2 Fuzzy Match (1% - 99%)
A fuzzy match is defined as a 1%-99% correspondence. In a TM, however, fuzzy
matching is applied together with a threshold that a match has to reach.
Example:
• What are earphones and why do people use them. (Input Segment)
• What are wireless earphones and why do people use them. (TM segment)
2.3.3 No Match
No matches are found if the correspondence is below the defined fuzzy
threshold.
2.4 Ways of creating TM
There are three ways of creating a TM -
2.4.1 During Translation
The translated and reviewed segments go into the TM.
2.4.2 Importing another TM database
This can be either a TM created with the same TM system or a TM available in
the Translation Memory eXchange format (TMX), which is widely supported by
commercial systems.
2.4.3 Aligning existing translations and their original texts
With the help of an alignment tool it is possible to create TM databases from
the source and target text files of previous translation projects.
2.5 Data exchange standards for TM systems
The formal definition of TMX on the Localization Industry Standards
Association (LISA) Web site states that TMX is the vendor-neutral open XML
standard for the exchange of TM data created by CAT and localization tools.
The purpose of TMX is to allow easier exchange of translation memory data
between tools and/or translation vendors with little or no loss of critical
data during the process [7].

A TMX document is an XML document whose root element is <tmx>. The
<tmx> element contains two children: <header> and <body>. General
information about the TMX document is described in the attributes of the
<header> element. Additional information is provided in the <note>, <ude>,
and <prop> elements. The main content of the TMX document is stored inside
the <body> element. It holds a collection of translations contained in
translation unit elements (<tu>). Each translation unit contains text in one
or more languages in translation unit variant elements (<tuv>) [7].
Sample of TMX document
<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>Bonjour tout le monde!</seg>
      </tuv>
    </tu>
  </body>
</tmx>
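
A TMX file of this shape can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to collect the segments of each translation unit, keyed by language. The `read_tmx` helper and the dictionary layout are illustrative choices, and the second variant is assumed to carry the language code `fr`.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX = """<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en" o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour tout le monde!</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return (source language, list of {language: segment text} per <tu>)."""
    root = ET.fromstring(text)
    src_lang = root.find("header").get("srclang")
    units = []
    for tu in root.iter("tu"):
        # One dictionary per translation unit, one key per <tuv> variant.
        units.append({tuv.get(XML_LANG): tuv.findtext("seg")
                      for tuv in tu.findall("tuv")})
    return src_lang, units


src_lang, units = read_tmx(TMX)
for unit in units:
    print(src_lang, "->", unit)
```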
2.5 Benefit of TM for translator/post-editor
• TM helps to increase the productivity of a translator. By far the most
common and recurring comment on the benefits of TM is the increase in
productivity it allows where substantial reuse of TM sentences is
possible.
• TM helps to improve consistency (terminology, phraseology). According
to a majority of users, the second main benefit of TM is that it helps to
improve consistency [8].
• TM eliminates uninteresting and repetitive work (e.g. updates, manuals).
Translators are often called upon to make major or minor changes to
texts or to update them (e.g. instruction manuals), and TM lets them
reuse the parts that have not changed.
• TM is also used as a searchable database (parallel corpus). Many
translators consider TM a ‘one-stop shop’ because of the search
options it offers.
• TM can have a pedagogical function (e.g. sharing of solutions, subject
knowledge repository).
2.6 TM vs MT
Machine translation refers to automated translation by a computer, without
human input.
Translation memory software requires human input, as it reuses content that
has been previously translated to complete new translation work. The original
translation is performed by a professional translator.
2.7 TM vs term-base?
A translation memory stores segments of text as translation units. A segment
can consist of a sentence or a paragraph. The TM holds both the original and
the translated version of each segment for reuse.
A term-base, on the other hand, is a searchable database that contains a list
of multilingual terms and rules regarding their usage.
A translation memory is typically used in conjunction with a term-base.
Chapter 3 : Experiments and Results
3.1 Clause Extraction
OpenNLP [6] is an open-source toolkit that supports the most common Natural
Language Processing (NLP) tasks, such as tokenization, sentence segmentation,
part-of-speech tagging, named entity extraction, chunking, parsing, language
detection and coreference resolution.

For clause extraction we used the parsing module of OpenNLP. OpenNLP
provides machine learning models trained on a corpus annotated with the
“Penn Treebank” tag set. It creates a “constituency parse tree”, also known
as a “phrase parse tree”. The parser takes the whitespace-separated tokens of
a sentence as a string and returns the phrase tree. We traverse this tree to
extract the clauses of the sentence.

The Penn Treebank defines clause-level tags that identify the clauses in a
parse tree, such as SBAR, SQ and SINV. We use these tags to extract the
clauses from a given English sentence.
Example : For a sentence
“Earphone is the best option available for them as it doesn't disturb others' sleep”
generated bracket notation tree is given below -
[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP [VBD
doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]
A pictorial representation of this parse tree is given in “Fig 1”. The node
tagged SBAR represents one clause, and the rest of the top-level S node
represents the other clause.
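
The traversal can be sketched on the bracket notation shown above. This is a minimal Python illustration and not the OpenNLP-based implementation used in the dissertation: the helper names are hypothetical, the parser only handles this square-bracket format, and clauses nested inside another clause are not split further.

```python
import re

CLAUSE_TAGS = {"SBAR", "SQ", "SINV"}  # clause-level tags named in the text

TREE = ("[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] "
        "[NN option]] [ADJP [ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] "
        "[SBAR [IN as] [S [NP [PRP it]] [VP [VBD doesn't] [NP [ADJP [NN disturb] "
        "[JJ others']] [NN sleep]]]]]]]] [. .] ]]")


def parse_brackets(text):
    """Parse '[LABEL child ...]' into (label, children); word leaves are strings."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                  # consume '['
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                  # consume ']'
        return (label, children)

    return node()


def words(tree, skip=frozenset()):
    """Collect the words under a node, ignoring subtrees whose label is in skip."""
    out = []
    for child in tree[1]:
        if isinstance(child, str):
            out.append(child)
        elif child[0] not in skip:
            out.extend(words(child, skip))
    return out


def split_clauses(tree):
    """Return the main clause first, then each embedded clause, as strings."""
    embedded = []

    def visit(node):
        for child in node[1]:
            if isinstance(child, str):
                continue
            if child[0] in CLAUSE_TAGS:
                embedded.append(" ".join(words(child)))
            else:
                visit(child)

    visit(tree)
    # The main clause is what remains once clause subtrees and the final
    # punctuation node are left out.
    main = " ".join(words(tree, skip=CLAUSE_TAGS | {"."}))
    return [main] + embedded


for clause in split_clauses(parse_brackets(TREE)):
    print(clause)
```

For the example tree, the two strings returned correspond to entries 1.1 and 1.2 of Table 2.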
Fig 1: constituency parse tree
3.2 Experimental Data
For our experiments we created three test data sets for English. Set-A
contains 100 input sentences, Set-B contains 200 and Set-C contains 300. For
the TM we used a data set of 3,500 sentences, comprising both simple and
complex sentences. A complex sentence contains more than one clause.
3.3 Relevant Match
A match in the TM data for a sentence of the Query data is a relevant match
if the match percentage between the query sentence and the TM segment is
equal to or greater than the fuzzy threshold set for the TM. For our
experiments this fuzzy threshold is set to 75%.

In addition, several query inputs partially match the TM data with a match
percentage below the defined threshold, so they are left out of the result.
On closer examination, some of these matches are also relevant for the
translator.
For example, consider the query
Query: “it doesn't disturb others' sleep”
The sentence in the TM that the input partially matches (36%) is
“Earphone is the best option available for them as it doesn't disturb others' sleep”
Because the match percentage (36%) is below the threshold set for the TM
(75%), this sentence does not appear in the Query results.
The translator therefore gets no match, although the TM sentence is a
relevant match: it contains the Query sentence as a clause, which would
significantly help the translator.
Our experiments aim to show that with sentence-level data we may miss a
relevant match for a query that exists as partial data (a clause) in the TM,
and that with clause structuring of the TM data we can retrieve this partial
data as a relevant match, which was previously not retrieved because the
match was below the TM threshold.

In our experiments we ignore the number of results retrieved for a single
query. We are only interested in whether at least one match is found when the
query exists as a clause in the TM.
3.4 Experiments
3.4.1 Configuration
In this dissertation we have performed experiments with three kinds of TM
configurations.
3.4.1.1 TM configuration 1
In TM configuration 1, the Query data contains full sentences and the TM data
also contains sentences. There is no clause-level splitting in either the
Query data or the TM data.

In TM configuration 2, the Query data contains sentences. The TM data
contains sentences as well as the clauses of those sentences. The clause
splitter is used to structure the TM data in this configuration.

In TM configuration 3, the Query data contains each original sentence and,
for complex sentences, its clauses, so a lookup in the TM database issues not
one query but a set of queries. Similarly, the TM data contains the original
sentences as well as their clauses. The clause splitter is used to structure
both the Query data and the TM data in this configuration.
3.4.2 Process to Run Experiment
For a given sentence we prepare the Query data set, which may contain one
query (the sentence) or several (the sentence and its clauses). We look up
the TM database to see whether there is a match for each query in the TM
data. Whenever at least one relevant match is found for a query, we record
"Match Found" as ‘True’; otherwise "Match Found" is set to ‘False’.
From this we calculate how many queries matched the TM data and how many did
not.
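
The bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the experimental code: `score` uses `difflib.SequenceMatcher` over word lists as a stand-in for the TM tool's matcher, so its percentages are not those of the dissertation, and the toy data is the earphone example rather than Set-A, Set-B or Set-C.

```python
from difflib import SequenceMatcher


def score(query, segment):
    """Stand-in similarity (difflib ratio on word lists), as a percentage."""
    return 100.0 * SequenceMatcher(None, query.lower().split(),
                                   segment.lower().split()).ratio()


def run_experiment(queries, tm_segments, threshold=75.0):
    """Return (number of queries with a match, % Match) for one configuration."""
    found = 0
    for query in queries:
        # "Match Found" is True as soon as one TM segment reaches the threshold.
        match_found = any(score(query, seg) >= threshold for seg in tm_segments)
        found += match_found
    percent = 100.0 * found / len(queries) if queries else 0.0
    return found, percent


SENTENCE = ("Earphone is the best option available for them "
            "as it doesn't disturb others' sleep")
CLAUSES = ["Earphone is the best option available for them",
           "as it doesn't disturb others' sleep"]
QUERIES = [SENTENCE, "It doesn't disturb others' sleep"]

config1 = run_experiment(QUERIES, [SENTENCE])            # sentence-level TM
config2 = run_experiment(QUERIES, [SENTENCE] + CLAUSES)  # TM with clauses added
print(config1, config2)
```

For configuration 3, the `queries` list would additionally contain the clauses of every complex query sentence.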
3.5 Experimental Results
3.5.1 Result: Data Set-A
‘S’ denotes sentences, ‘C’ denotes the clauses of these sentences, and ‘S,C’
denotes the sentences together with their clauses.
Configuration | Query data | TM data | Match Found | % Match
TM configuration 1 | 100 S | 3500 S | 87 | 87%
TM configuration 2 | 100 S | 5500 S,C | 90 | 90%
TM configuration 3 | 138 S,C | 5500 S,C | 127 | 92%
Table 3: Result of Set-A
3.5.2 Result: Data Set-B
Configuration | Query data | TM data | Match Found | % Match
TM configuration 1 | 200 S | 3500 S | 47 | 22%
Table 4: Result of Set-B
3.5.3 Result: Data Set-C
Configuration | Query data | TM data | Match Found | % Match
TM configuration 1 | 300 S | 3500 S | 54 | 18%
Table 5: Result of Set-C
3.5.4 Conclusion of Results
Table 3, Table 4 and Table 5 show the results of the experiments for the
different data sets (Set-A, Set-B, Set-C) and TM configurations. We plotted a
line graph (shown in “Fig 2”) from these tables. The Y-axis shows the
percentage of Query data for which a match was found in the TM data. The
X-axis shows the configurations defined in the Configuration section.
Fig 2: analysis graph
3.5.4.1 Conclusion 1
Table 3, Table 4 and Table 5 show that whenever there is clause splitting,
either in the Query data or in the TM data, % Match from the TM increases.
3.5.4.2 Conclusion 2
In Table 3, Table 4 and Table 5 the Query data set is different in each case.
In spite of the different data sets (Set-A, Set-B, Set-C), we observe an
increasing trend in % Match from the TM for each of them. Hence we conclude
that, on our data, clause splitting increases % Match over sentence-level
data.
3.5.4.3 Conclusion 3
In Table 3, Table 4 and Table 5 the increase in % Match is not uniform across
cases. This is because % Match for a given data set depends on two factors,
i.e. the TM data and the Query data.
A detailed example of the TM configurations is given in the Annex.
Annex
Example of TM configurations
Defined threshold for TM is 75%
Query data | TM data
1. Earphone is the best option available for them as it doesn't disturb others' sleep | 1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep |
Table 6: TM configuration 1
To explain how the TM works, Table 6 shows two instances of Query data and
one instance of TM data. When we search for the first query in the TM
database, we get a 100% match, as shown in the TM data.
For the second query we get no result from the TM database, as the match
percentage is 36% and the threshold of acceptable match for the TM is 75%.
But the second query does have a relevant match: the clause “it doesn't
disturb others' sleep” within sentence 1 of the TM data.
When we use “TM configuration 2” for the data shown in Table 6, we get the
data shown in “Table 7”. If we now search for sentence 2 of the Query data,
we get an acceptable % match: it matches TM entry 1.2, which is a clause of
sentence 1 in the TM data.
Query data                                    TM data
1. Earphone is the best option available for  1. Earphone is the best option available for
   them as it doesn't disturb others' sleep      them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep           1.1 Earphone is the best option available for
                                                  them
                                              1.2 as it doesn't disturb others' sleep

Table 7: TM configuration 2
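
The lookups for TM configurations 1 and 2 can be sketched in a few lines of
Python. This is a minimal illustration, not the matcher used in the
experiments: it assumes the match percentage is one minus the word-level edit
distance divided by the length of the longer segment, and the names
match_percent and lookup are hypothetical. Under that assumption the second
query scores about 36% against the full sentence, as reported above, and about
83% against clause 1.2.

```python
def word_edit_distance(a, b):
    """Levenshtein distance computed over word tokens instead of characters."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_percent(query, segment):
    """Assumed metric: 100 * (1 - word edits / length of the longer segment)."""
    q, s = query.lower().split(), segment.lower().split()
    longest = max(len(q), len(s))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - word_edit_distance(q, s) / longest)


def lookup(query, tm_segments, threshold=75.0):
    """Return (score, segment) pairs at or above the threshold, best first."""
    scored = [(match_percent(query, seg), seg) for seg in tm_segments]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


sentence = ("Earphone is the best option available for them "
            "as it doesn't disturb others' sleep")
tm_config_1 = [sentence]                      # sentence-level TM (Table 6)
tm_config_2 = [sentence,                      # sentence plus its clauses (Table 7)
               "Earphone is the best option available for them",
               "as it doesn't disturb others' sleep"]
query = "It doesn't disturb others' sleep"

print(lookup(query, tm_config_1))   # nothing reaches the 75% threshold
print(lookup(query, tm_config_2))   # only clause 1.2 is returned
```

Only the source side of each translation unit is shown; a real TM would return
the stored translation together with the matched segment.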
Query data                                    TM data
1. You should wear some protective materials  1. Drugs help us when we are not feeling well
   while using the things that contain           but you should keep the drugs from the reach
   chemicals and you should keep the             of kids because it may be dangerous for the
   chemical from the reach of kids               kids

Table 8: Sentence-level Query data and TM data
As we can see in Table 8, the clause “and you should keep the chemical from
the reach of kids” in sentence 1 of the Query data has a relevant match in the
clause “but you should keep the drugs from the reach of kids” in sentence 1 of
the TM data. However, we get no match, because the matching percentage (36%)
between the full sentences of the Query data and the TM data is below the
defined threshold (75%) of the TM.
Query data                                    TM data
1. You should wear some protective materials  1. Drugs help us when we are not feeling well
   while using the things that contain           but you should keep the drugs from the reach
   chemicals and you should keep the             of kids because it may be dangerous for the
   chemical from the reach of kids               kids
1.1 You should wear some protective           1.1 Drugs help us when we are not feeling
    materials while using the things that         well
    contain chemicals
1.2 and you should keep the chemical from     1.2 but you should keep the drugs from the
    the reach of kids                             reach of kids
                                              1.3 because it may be dangerous for the kids

Table 9: TM configuration 3
Now we use TM configuration 3 for the data set in Table 8; the results are
shown in Table 9. Sentence 1 of the Query data is divided into two clauses
(1.1 and 1.2), and sentence 1 of the TM data is divided into three clauses
(1.1, 1.2 and 1.3). As the table shows, a relevant match exists between
clause 1.2 of the Query data and clause 1.2 of the TM data. Because both are
now stored as clauses, the percentage match (82%) is higher than the TM
threshold (75%). Hence they qualify as an acceptable match and appear in the
result set, whereas at sentence level (Table 8) they were dropped.
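
The 82% figure is consistent with a word-level edit-distance score. As a
worked check, assume (the formula is not stated in this Annex) that the match
percentage is one minus the number of word edits divided by the length of the
longer segment. Clause 1.2 of the Query data and clause 1.2 of the TM data
each contain 11 words and differ in two of them (“and”/“but” and
“chemical”/“drugs”), so:

```latex
\[
\text{match}
  = \left(1 - \frac{d_{\text{word}}}{\max(|q|,\,|s|)}\right) \times 100
  = \left(1 - \frac{2}{11}\right) \times 100
  \approx 82\%
\]
```

Here $d_{\text{word}}$ is the word-level edit distance and $|q|$, $|s|$ are
the word counts of the query clause and the TM clause; these symbols are
introduced only for this illustration.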
SUMMARY
This dissertation presents a clause-structure-based Translation Memory (TM).
We use a clause splitter to extract the clauses from a complex sentence, and
then store both the clauses and their parent sentence in the TM. We call this
clause-based TM structuring or, in general terms, structuring of TM.
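
The structuring step can be sketched as follows. This is only an illustration
under stated assumptions: split_clauses is a naive stand-in that cuts before a
small, hypothetical list of marker words, whereas a real clause splitter would
rely on syntactic analysis, and structure_tm is a hypothetical helper that
stores only the source side of each translation unit. The segment IDs follow
the numbering used in the Annex tables (1, 1.1, 1.2, ...).

```python
import re

# Hypothetical marker words, chosen to fit the Annex examples; not a real clause splitter.
CLAUSE_MARKERS = ("as", "and", "but", "because")
_SPLIT_PATTERN = re.compile(r"\s+(?=(?:%s)\b)" % "|".join(CLAUSE_MARKERS))


def split_clauses(sentence):
    """Naively split a sentence before each marker word."""
    return [part.strip() for part in _SPLIT_PATTERN.split(sentence) if part.strip()]


def structure_tm(sentences):
    """Store every sentence and, when it has more than one clause, its clauses."""
    tm = {}
    for n, sentence in enumerate(sentences, start=1):
        tm[str(n)] = sentence
        clauses = split_clauses(sentence)
        if len(clauses) > 1:
            for k, clause in enumerate(clauses, start=1):
                tm[f"{n}.{k}"] = clause
    return tm


tm = structure_tm([
    "Earphone is the best option available for them "
    "as it doesn't disturb others' sleep",
])
for segment_id, text in tm.items():
    print(segment_id, text)
```

Keeping the full sentence alongside its clauses means that sentence-level
matches are never lost; the clauses only add candidates.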
We have seen that when clause-level structuring is used in the TM, relevant
matches for Query data that were earlier dropped because of a low percentage
match at sentence level are also retrieved in the result set.
We therefore get more relevant matches for Query data from the TM database.
This study uses different TM configurations (TM configuration 1, TM
configuration 2 and TM configuration 3) to support the above claim on
different test data sets. A translator might not get a match for a complete
sentence but will still get a match for a clause, which helps in performing
the translation task and thereby increases productivity (translated words per
hour).
Direction for future work
We have seen that when clause-level structuring is used in the TM, relevant
matches for Query data that were earlier dropped because of a low percentage
match at sentence level are also retrieved in the result set.
Clause-level structuring increases the size of the TM database, because the
database contains sentences as well as their clauses. This increase in size
may affect the performance of search in the TM database. Most search
algorithms for fuzzy matching are based on word-level edit distance or a
bag-of-words model.
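
Of the two approaches, the bag-of-words model is the cheaper one per
comparison because it ignores word order. A minimal sketch, assuming a Dice
coefficient over word counts (the formula and the name bag_of_words_match are
illustrative assumptions, not taken from a specific TM tool):

```python
from collections import Counter


def bag_of_words_match(query, segment):
    """Dice coefficient over word multisets, as a percentage; word order is ignored."""
    q = Counter(query.lower().split())
    s = Counter(segment.lower().split())
    total = sum(q.values()) + sum(s.values())
    if total == 0:
        return 100.0
    overlap = sum((q & s).values())   # multiset intersection counts shared words
    return 200.0 * overlap / total


# Same words in a different order still count as a full match.
print(bag_of_words_match("keep the drugs", "the drugs keep"))
print(bag_of_words_match("It doesn't disturb others' sleep",
                         "as it doesn't disturb others' sleep"))
```

Because reordered segments score as identical, such a measure is usually
better suited to narrowing down candidates than to final ranking.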
Performance may be improved if we keep POS-tag-annotated data for each word in
the TM and in the Query data. Our hypothesis is that keeping the POS tag
information of words may improve performance as well as accuracy, because we
can then search for a TM match based on tag information instead of words. In
this way we can identify which text in the TM is structurally the same as the
Query data. This experiment will require a large POS-tagged corpus for
building the TM database. This work is in progress.
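
One way to picture this hypothesis is an index keyed by POS tag sequence. The
sketch below is speculative and is not the planned implementation: the tags
are assigned by hand using coarse Universal POS labels, whereas a real system
would obtain them from a POS tagger, and tag_signature, build_tag_index and
structural_matches are hypothetical names.

```python
from collections import defaultdict

# Hand-tagged TM clauses from Table 9 (illustrative tags, not tagger output).
TAGGED_TM = {
    "1.2": [("but", "CCONJ"), ("you", "PRON"), ("should", "AUX"), ("keep", "VERB"),
            ("the", "DET"), ("drugs", "NOUN"), ("from", "ADP"), ("the", "DET"),
            ("reach", "NOUN"), ("of", "ADP"), ("kids", "NOUN")],
    "1.3": [("because", "SCONJ"), ("it", "PRON"), ("may", "AUX"), ("be", "AUX"),
            ("dangerous", "ADJ"), ("for", "ADP"), ("the", "DET"), ("kids", "NOUN")],
}


def tag_signature(tagged_segment):
    """Reduce a tagged segment to its sequence of POS tags."""
    return tuple(tag for _, tag in tagged_segment)


def build_tag_index(tagged_tm):
    """Map each tag sequence to the IDs of the TM segments that share it."""
    index = defaultdict(list)
    for segment_id, tagged in tagged_tm.items():
        index[tag_signature(tagged)].append(segment_id)
    return index


def structural_matches(tagged_query, index):
    """Return IDs of TM segments whose tag sequence equals the query's."""
    return index.get(tag_signature(tagged_query), [])


tagged_query = [("and", "CCONJ"), ("you", "PRON"), ("should", "AUX"), ("keep", "VERB"),
                ("the", "DET"), ("chemical", "NOUN"), ("from", "ADP"), ("the", "DET"),
                ("reach", "NOUN"), ("of", "ADP"), ("kids", "NOUN")]

index = build_tag_index(TAGGED_TM)
print(structural_matches(tagged_query, index))   # the structurally identical clause
```

Exact equality of tag sequences is the simplest possible test; a fuzzy
variant could apply edit distance to the tag sequences instead.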
REFERENCES
[1] Reinke, U. (2013). State of the Art in Translation Memory Technology. Proceedings of
the Workshop on Natural Language Processing for Translation Memories (NLP4TM),
pages 17–23.
[2] Grönroos, Mickel and Becks, Ari (2005). Bringing Intelligence to Translation Memory
Technology. Translating and the Computer 27, November 2005. London: Aslib.
[3] Timonera, Katerina and Mitkov, Ruslan (2015). Improving Translation Memory
Matching through Clause Splitting. Proceedings of the Workshop on Natural Language
Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria,
September 2015.
[4] Sharma, Sanjeev Kumar (2016). Clause Boundary Identification for Different
Languages: A Survey. International Journal of Computer Applications & Information
Technology, Vol. 8, Issue II (ISSN: 2278-7720).
[5] https://stanfordnlp.github.io/CoreNLP/
[6] https://opennlp.apache.org/
[7] https://www.ibm.com/developerworks/library/x-localis3/
[8] LeBlanc, Matthieu. Translators on translation memory (TM). Results of an
ethnographic study in three translation services and agencies. Université de Moncton.
[9] Christensen, Tina Paulsen and Schjoldager, Anne (2011). The Impact of
Translation-Memory (TM) Technology on Cognitive Processes. NLPSC 2011.
[10] Zerfass, A. (2002). Evaluating Translation Memory Systems. Proceedings of the LREC
2002 Workshop, Las Palmas, Canary Islands, Spain.
[11] Baldwin, Timothy and Tanaka, Hozumi (2001). Balancing up Efficiency and Accuracy
in Translation Retrieval. Journal of Natural Language Processing, vol. 8.
[12] Soroosh, Mortezapoor (2013). Translation Memory and Bitext Alignment. Seminar in
Artificial Intelligence.