
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/327700572

Structuring of Translation Memory

Thesis · April 2018
DOI: 10.13140/RG.2.2.22931.94240

Author: Ashutosh Kumar, Expert Software

All content following this page was uploaded by Ashutosh Kumar on 17 September 2018.


Structuring of Translation Memory

BITS ZG628T: Dissertation

by

Ashutosh Kumar

2015HT13439

Dissertation work carried out at

Expert Software Consultants Pvt. Ltd., Noida

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE


PILANI (RAJASTHAN)

April, 2018
Structuring of Translation Memory

BITS ZG628T: Dissertation

by

Ashutosh Kumar

2015HT13439

Dissertation work carried out at


Expert Software Consultants Pvt. Ltd., Noida

Submitted in partial fulfillment of M.Tech. Software Systems

Under the Supervision of


Dr. Pawan Kumar, Sr. Software Engineer
Expert Software Consultants Pvt. Ltd., Noida

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE


PILANI (RAJASTHAN)

April, 2018
Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division

Second Semester 2017-2018

BITS ZG628T: Dissertation

ABSTRACT

BITS ID No. : 2015HT13439

NAME OF THE STUDENT : Ashutosh Kumar

EMAIL ADDRESS : 2015HT13439@wilp.bits-pilani.ac.in

STUDENT’S EMPLOYING ORGANIZATION & LOCATION : Expert Software Consultants Pvt. Ltd., Noida

SUPERVISOR’S NAME : Dr. Pawan Kumar

SUPERVISOR’S EMPLOYING ORGANIZATION & LOCATION : Expert Software Consultants Pvt. Ltd., Noida

SUPERVISOR’S EMAIL ADDRESS: hawahawai@gmail.com

DISSERTATION TITLE : Structuring of Translation Memory

ABSTRACT

Translation Memory (TM) is an archive of previously translated segments that stores a source-language segment and its corresponding translation in the target language. Here, a segment refers to a sentence or a paragraph.

When a translator uses a TM tool to translate a new segment, the tool identifies similarities between the segment in the Query data and the segments stored in the TM database. The translator may then insert the previous translation of that segment as it is, or make slight changes to it.

Most TM tools store sentence-level segments in the TM database. Hence the benefits of TM are only realized for identical or similar sentences, which may occur rarely because sentences are usually complex, while sentence fragments (clauses) may recur much more often.

In a TM comprising sentence-level segments, it may sometimes occur that an input sentence is a sub-segment (clause) of a TM sentence whose translation is available, yet the search-and-retrieval function shows no result. This is because the matching percentage between the input and the TM data is below the threshold defined for the TM. Since the sentence in the Query data is a small part (clause) of a longer sentence in the TM data, it does not qualify under the filtering criteria for a partial match. Thus a relevant segment that was available in the TM (not identical, but similar enough for its translation to be worth recalling for the translator) is dropped.

This dissertation introduces clause-level structuring for TM in order to retrieve the relevant matches for Query data that are dropped when sentence-level TM is used.

ACKNOWLEDGEMENTS

Firstly, I would like to express my sincere gratitude to my supervisor, Dr. Pawan Kumar, for his continuous support of my M.Tech. study and related research, and for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and writing of this dissertation.

I would also like to thank Dr. Mukul Kumar Sinha for his insightful comments and encouragement, and for asking hard questions that helped me widen my research from various perspectives.

I would also like to thank my colleague, Ms. Himanshi Thapliyal for editing and
proof-reading the dissertation.

Last but not least, I would like to express my love and gratitude to my beloved family for their understanding and motivation throughout this project.

Table of Contents
CERTIFICATE
ABSTRACT
ACKNOWLEDGEMENTS
Chapter 1 : Introduction
1.1 Translation Memory and Uses
1.2 Basic Approach of Retrieval from TM
1.3 Problems faced
1.4 Our approach to solve this problem
1.5 Objective
Chapter 2 : Translation Memory - Literature review
2.1 Translation Memory
2.2 Matching Techniques for TM
2.2.1 Simple Matching
2.2.2 Using “linguistic knowledge”
2.3 Types of Matches
2.3.1 Exact Match (100%)
2.3.2 Fuzzy Match (1% - 99%)
2.3.3 No Match
2.4 Ways of creating TM
2.4.1 During Translation
2.4.2 Importing another TM database
2.4.3 Aligning existing translations and their original texts
2.5 Data exchange standards for TM systems
2.6 TM vs MT
2.7 TM vs term-base?
Chapter 3 : Experiments and Results
3.1 Clause Extraction
3.2 Experimental Data
3.3 Relevant Match
3.4 Experiments
3.4.1 Configuration
3.4.1.1 TM configuration 1
3.4.1.2 TM configuration 2
3.4.1.3 TM configuration 3
3.4.2 Process to Run Experiment
3.5 Experimental Results
3.5.1 Result: Data Set-A
3.5.2 Result: Data Set-B
3.5.3 Result: Data Set-C
3.5.4 Conclusion of Results
3.5.4.1 Conclusion 1
3.5.4.2 Conclusion 2
3.5.4.3 Conclusion 3
Annex
SUMMARY
Direction for future work
REFERENCES
Checklist of items for the Final Dissertation Report

List of Figures
Fig 1: constituency parse tree
Fig 2: analysis graph

List of Tables
Table 1: TM without clause structuring
Table 2: TM with clause splitting
Table 3: Result of Set-A
Table 4: Result of Set-B
Table 5: Result of Set-C
Table 6: TM configuration 1
Table 7: TM configuration 2
Table 8: TM configuration 1
Table 9: TM configuration 3

Chapter 1 : Introduction

In recent years, with the intense application of Artificial Intelligence techniques, the accuracy of machine translation systems has improved significantly. Yet a lot of human editing is still required to make the translated content correct and fluent (publishable quality). To improve the quality of translated content, human translators use several computer-aided tools that assist them in their translation and editing tasks. These tools include dictionaries, glossaries, term-bases, translation memories, spell checkers, concordancers, etc. Among them, Translation Memory (TM) is one of the most important tools for enhancing the productivity of a translator.

1.1 Translation Memory and Uses

TM is an archive of previously translated texts that stores source-language text and its corresponding translation in the target language. It is generally created during the process of translation by human translators.

Each single record in a TM is called a Translation Unit (TU). A TU contains a pair of source-language and target-language segments. A segment may be a paragraph, a sentence, or a phrase [1].

A human translator can use the TM to translate a new text segment. The translator searches for the new segment in the TM, and if a matching segment is found, it is presented to the translator, who can use the segment as it is, revise it, or reject it.

TM helps human translators increase their productivity. It helps ensure that the same terminology is used consistently across a translation, and it helps maintain a uniform style of translation across a large document.

1.2 Basic Approach of Retrieval from TM

The searching technique used for retrieval of a TU from the TM is generally based on word-level string matching between the TM source-language data and the Query data.

A source-language segment in the TM can be an exact match (100%) of the Query data; a partial match of the Query data within an acceptable threshold of approximately 75%-99%; or no match, if the similarity is below the 75% threshold. This threshold depends upon the accuracy of the algorithm used for matching.
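As an illustration, the word-level matching described above can be sketched in code. This is a minimal sketch under an assumption: the dissertation does not specify the TM tool's actual matching algorithm, so `match_percent` below uses a simple word-overlap measure normalized by the longer segment. With this measure, the clause-only query happens to score about 36%, consistent with the example figure quoted in Section 1.3.

```python
from collections import Counter

def match_percent(query: str, tm_segment: str) -> float:
    """Word-level overlap between two segments, normalized by the longer
    one. A simplified stand-in for the TM tool's matching algorithm."""
    q = query.lower().split()
    t = tm_segment.lower().split()
    overlap = sum((Counter(q) & Counter(t)).values())  # shared word count
    return 100.0 * overlap / max(len(q), len(t))

tm_sentence = ("Earphone is the best option available for them "
               "as it doesn't disturb others' sleep")

# An identical segment scores 100%; the clause-only query falls well
# below the 75% threshold, so it would be reported as "no match".
print(match_percent(tm_sentence, tm_sentence))                  # 100.0
print(round(match_percent("It doesn't disturb others' sleep",
                          tm_sentence)))                        # 36
```

Any measure with this shape (overlap over the longer segment) penalizes a short query against a long TM sentence, which is exactly the problem described in the next section.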

1.3 Problems faced

It has been observed that, during the process of building a TM, there sometimes exists a relevant clause in the TM data that is not retrieved. This is because the TM tool on which I am working structures its data into sentences only.

Sentence-level data keeps the matching result for a clause match below the acceptable threshold, so the relevant result from the TM is dropped. Thus a valid and relevant TU is not available to the translator, thereby affecting his or her efficiency. For example:

Query data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep

TM data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep

Table 1: TM without clause structuring

Table 1 shows two sentences in the Query data. An exact match of sentence 1 is available in the TM data. Sentence 2, which is one of the clauses of sentence 1 in the TM source data, is a relevant match. But we did not get any relevant match from the TM: the relevant match for sentence 2 of the Query data was not found because this match (36%) is below the threshold (75%) defined for the TM.

To circumvent this problem, we could lower the threshold in order to recover the relevant sentences from the TM data that are left out by the higher threshold. But this may also retrieve too many irrelevant sentences, particularly if the translation memory is large. Therefore, recommending a low threshold value for partial matches is not a very good solution.

1.4 Our approach to solve this problem

In our approach we use clause splitting to define a clause-level structure for the TM. We split each sentence into clauses and store the clauses alongside their sentence, so the clause-level TM contains the sentences together with the clauses of those sentences.

Retrieving clauses is desirable because a match is more likely to be found for a clause than for a complex sentence (one containing more than one clause). A clause expresses a complete thought because it comprises a subject and a verb. Hence, even if a translator does not find a match for the entire sentence, he or she might still get a match for its clauses, and the translator benefits.

Query data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep

TM data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
1.1 Earphone is the best option available for them
1.2 as it doesn't disturb others' sleep

Table 2: TM with clause splitting

Table 1 shows TM source data without clause structuring; Table 2 shows the same TM source data with clause structuring. Sentence 1 of the TM source data in Table 1 is divided into two clauses, clause 1.1 and clause 1.2, in Table 2. Now if we try to find a match in the TM data for sentence 2 of the Query data, we get the desired match, because it crosses the threshold (75%) defined for the TM.
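The effect shown in Table 2 can be sketched in code. The `match_percent` helper below is an assumed word-overlap measure (the tool's real matching algorithm is not specified); the point is only that the clause entry 1.2 crosses the 75% threshold while the full sentence does not.

```python
from collections import Counter

THRESHOLD = 75.0  # fuzzy threshold defined for the TM

def match_percent(query: str, segment: str) -> float:
    """Assumed word-overlap measure, normalized by the longer segment."""
    q, s = query.lower().split(), segment.lower().split()
    overlap = sum((Counter(q) & Counter(s)).values())
    return 100.0 * overlap / max(len(q), len(s))

def lookup(query: str, tm: list[str]) -> list[str]:
    """Return every TM entry whose match percentage crosses the threshold."""
    return [seg for seg in tm if match_percent(query, seg) >= THRESHOLD]

sentence = ("Earphone is the best option available for them "
            "as it doesn't disturb others' sleep")
clauses = ["Earphone is the best option available for them",
           "as it doesn't disturb others' sleep"]

query = "It doesn't disturb others' sleep"
print(lookup(query, [sentence]))            # sentence-level TM: no match
print(lookup(query, [sentence] + clauses))  # clause-level TM: clause 1.2 found
```

Against the sentence-level TM the query scores about 36% and is dropped; against clause 1.2 it scores about 83% and is retrieved.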

1.5 Objective

The goal of this dissertation is to retrieve the relevant matches for Query data that are dropped when sentence-level TM is used. To achieve this, we define a clause-level structuring of the TM so that the dropped sentences of the Query data can be matched against the TM data.

Chapter 2 : Translation Memory - Literature review

Translation memory systems (TMs) have been around for a number of years, and their use is now widespread not only among translators but also among language service providers (LSPs) and other translation services (public-service or corporate settings) that have seen the numerous advantages TMs offer. Today, TMs are the most widely used translation aid on the market.

2.1 Translation Memory

TM technology [10][12] is used as a translator's aid for providing high-precision translation. Basically, a translation memory is a database of Translation Units (TUs), each constituting a language pair of a source segment and its translation. The TM system matches the source-language input (the text that needs to be translated) against the source-language text in the TM and retrieves the matched result in both the source and target languages. The human translator can then accept the translation, replace it with a fresh translation, or modify it to fit the new context and update it in the TM.

2.2 Matching Techniques for TM

In past years, a lot of research has been done on syntactically comparing two strings. A string similarity measure states the extent to which two given strings are similar or dissimilar. These algorithms can be used effectively in fuzzy string searches, and they constitute the main approach to finding the best match in a TM when an exact match is not found [1].

2.2.1 Simple Matching

The matching algorithms have been altered and modified over time, but they
still rely on simple character- or token-based matching procedures without
taking into account linguistic aspects like morphosyntactic, syntactic or
semantic features that may determine the “similarity” of translation units [1].

2.2.2 Using “linguistic knowledge”

Linguistics-driven efforts to enhance retrieval in current commercial and research TM systems are motivated by two different goals [1]:

(1) improving the recall and precision of (monolingual) retrieval, i.e. enhancing the quantity, quality and ranking of matches, at segment level and at sub-segment level [1];

(2) automated adjustment of fuzzy matches to enhance reusability and reduce post-editing effort by integrating SMT technology into TM systems [1].

2.3 Types of Matches

2.3.1 Exact Match (100%)

Exact matches are found if the word chain of a new Input Segment is identical
to that of a stored source-text segment in TM.

Example:

• What are earphones and why do people use them. (Input Segment)
• What are earphones and why do people use them. (TM segment)

2.3.2 Fuzzy Match (1% - 99%)

A fuzzy match is defined as a 1%-99% correspondence, but in a TM, fuzzy matching is used with a defined threshold.

Example:

• What are earphones and why do people use them. (Input Segment)
• What are wireless earphones and why do people use them. (TM
segment)

2.3.3 No Match

No matches are found if the correspondence is below the defined fuzzy threshold.
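The three match types above can be summarized in a small classifier. This is a sketch, assuming the 75% fuzzy threshold used elsewhere in this dissertation.

```python
def classify_match(percent: float, threshold: float = 75.0) -> str:
    """Map a matching percentage to the match types described above."""
    if percent >= 100.0:
        return "exact match"
    if percent >= threshold:
        return "fuzzy match"
    return "no match"

print(classify_match(100.0))  # exact match
print(classify_match(83.0))   # fuzzy match
print(classify_match(36.0))   # no match
```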

2.4 Ways of creating TM

There are three ways of creating a TM:

2.4.1 During Translation

The translated and reviewed segments go into the TM.

2.4.2 Importing another TM database

This can either be a TM created with the same TM system, or a TM available in the Translation Memory eXchange (TMX) format, which is supported by all commercial systems.

2.4.3 Aligning existing translations and their original texts

With the help of an alignment tool it is possible to create TM databases from


the source and target text files of previous translation projects.
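A naive version of such an alignment can be sketched as pairing segments one-to-one. This is an illustrative assumption: real alignment tools also handle sentence splits, merges, and reorderings.

```python
def align(source_segments: list[str],
          target_segments: list[str]) -> list[tuple[str, str]]:
    """Naively pair the i-th source segment with the i-th target segment
    to form translation units (TUs). Assumes a strict 1:1 correspondence."""
    if len(source_segments) != len(target_segments):
        raise ValueError("1:1 alignment requires equal segment counts")
    return list(zip(source_segments, target_segments))

# Build a tiny TM from one previously translated sentence pair.
tm = align(["Hello world!"], ["Bonjour tout le monde!"])
print(tm)  # [('Hello world!', 'Bonjour tout le monde!')]
```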

2.5 Data exchange standards for TM systems

The formal definition of TMX shown on the Localization Industry Standards Association (LISA) Web site states: TMX is the vendor-neutral open XML standard for the exchange of TM data created by CAT and localization tools.

The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process [7].

A TMX document is an XML document whose root element is <tmx>. The <tmx> element contains two children: <header> and <body>. General information about the TMX document is described in the attributes of the <header> element. Additional information is provided in the <note>, <ude>, and <prop> elements. The main content of the TMX document is stored inside the <body> element. It holds a collection of translations contained in translation unit elements (<tu>) [7]. Each translation unit contains text in one or more languages in translation unit variant elements (<tuv>).

Sample of TMX document

<tmx version="1.4">
  <header
    creationtool="XYZTool" creationtoolversion="1.01-023"
    datatype="PlainText" segtype="sentence"
    adminlang="en-us" srclang="en"
    o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en">
        <seg>Hello world!</seg>
      </tuv>
      <tuv xml:lang="fr">
        <seg>Bonjour tout le monde!</seg>
      </tuv>
    </tu>
  </body>
</tmx>
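The sample above can be read with any XML library. Below is a minimal sketch using Python's standard library; the `xml:lang` value for the French variant is assumed to be "fr" (the snippet in this document left it blank).

```python
import xml.etree.ElementTree as ET

# The xml: prefix expands to this fixed namespace in any XML parser.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX = """<tmx version="1.4">
  <header creationtool="XYZTool" creationtoolversion="1.01-023"
          datatype="PlainText" segtype="sentence"
          adminlang="en-us" srclang="en" o-tmf="ABCTransMem"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world!</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour tout le monde!</seg></tuv>
    </tu>
  </body>
</tmx>"""

root = ET.fromstring(TMX)
for tu in root.find("body").findall("tu"):
    # Collect each translation unit as a {language: segment} mapping.
    unit = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    print(unit)  # {'en': 'Hello world!', 'fr': 'Bonjour tout le monde!'}
```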

2.5 Benefit of TM for translator/post-editor

• TM helps increase the productivity of a translator. By far the most common and recurring comment on the benefits of TM is the increase in productivity it allows in cases where substantial reuse of TM sentences is possible.

• TM helps improve consistency (terminology, phraseology). According to a majority of users, the second main benefit of TM is that it helps improve consistency [8].

• TM eliminates uninteresting and repetitive work, since translators are often called upon to make major or minor changes to texts or to update them (e.g. instruction manuals).

• TMs are also used as a searchable database (parallel corpus). Many translators consider a TM a ‘one-stop shop’ because of the search options it offers.

• TMs can have a pedagogical function (e.g. sharing of solutions, serving as a subject-knowledge repository).

2.6 TM vs MT

Machine translation refers to automated translation by a computer, without


human input.

Translation memory software requires human input as it reuses content that


has been previously translated to complete new translation work. The original
translation is performed by a professional translator.

2.7 TM vs term-base?

A translation memory stores segments of text as translation units. A segment


can consist of a sentence or paragraph. The TM holds both the original and
translated version of each segment for reuse.

A term-base, on the other hand, is a searchable database that contains a list of


multilingual terms and rules regarding their usage.

A translation memory is typically used in conjunction with a termbase.

Chapter 3 : Experiments and Results

3.1 Clause Extraction

OpenNLP [6] is an open-source toolkit that supports the most common Natural Language Processing (NLP) tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named-entity extraction, chunking, parsing, language detection and coreference resolution.

For the clause-extraction task we used the parsing module of OpenNLP. OpenNLP provides machine-learning models trained on the Penn Treebank POS-tagged corpus. OpenNLP [6] creates a constituency parse tree, also known as a phrase parse tree. The parser takes the whitespace-separated tokenized string of a sentence and returns the phrase tree. We walk this tree to extract the clauses of a sentence.

The Penn Treebank defines clause-level tags that identify the clauses in a parse tree. These tags are S, SBAR, SQ, SINV, etc. We use these tags to extract the clauses from a given English sentence.

Example: for the sentence

“Earphone is the best option available for them as it doesn't disturb others' sleep”

the generated bracket-notation tree is given below:

[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] [NN option]] [ADJP
[ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] [SBAR [IN as] [S [NP [PRP it]] [VP [VBD
doesn't] [NP [ADJP [NN disturb] [JJ others']] [NN sleep]]]]]]]] [. .] ]]

The pictorial representation of this parse tree is given in Fig 1. We can see that the SBAR node represents one clause and the remaining S node represents the other clause.
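The extraction step can be sketched by parsing the bracket notation directly. This is an illustrative reimplementation, not OpenNLP's API (a real pipeline would walk the Parse objects produced by OpenNLP's parser module); the clause-level tag set {S, SBAR, SINV, SQ} follows the Penn Treebank conventions, and the punctuation node is omitted from the tree for brevity.

```python
import re

CLAUSE_TAGS = {"S", "SBAR", "SINV", "SQ"}  # Penn Treebank clause-level tags

def tokenize(tree_str):
    """Split bracket notation into '[', ']' and word/tag tokens."""
    return re.findall(r"\[|\]|[^\[\]\s]+", tree_str)

def parse(tokens, i=0):
    """Build a (tag, children) tuple from tokens, starting at '[' at index i."""
    tag, children, i = tokens[i + 1], [], i + 2
    while tokens[i] != "]":
        if tokens[i] == "[":
            child, i = parse(tokens, i)
            children.append(child)
        else:
            children.append(tokens[i])  # a leaf word
            i += 1
    return (tag, children), i + 1

def leaves(node):
    """Surface words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def clauses(node, found=None):
    """Collect the surface string of every clause-level subtree."""
    found = [] if found is None else found
    if isinstance(node, tuple):
        tag, children = node
        if tag in CLAUSE_TAGS:
            found.append(" ".join(leaves(node)))
        for child in children:
            clauses(child, found)
    return found

tree = ("[TOP [S [NP [NN Earphone]] [VP [VBZ is] [NP [NP [DT the] [JJS best] "
        "[NN option]] [ADJP [ADJP [JJ available] [PP [IN for] [NP [PRP them]]]] "
        "[SBAR [IN as] [S [NP [PRP it]] [VP [VBD doesn't] [NP [ADJP [NN disturb] "
        "[JJ others']] [NN sleep]]]]]]]]]]")
root, _ = parse(tokenize(tree))
for c in clauses(root):
    print(c)
```

For the example sentence this yields the full sentence (the outer S) plus the two clause strings used in Table 2: "as it doesn't disturb others' sleep" and "it doesn't disturb others' sleep".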

Fig 1: constituency parse tree

3.2 Experimental Data

For our experiments we created three different test data sets for English. Set-A contains 100 input sentences, Set-B contains 200 input sentences, and Set-C contains 300 input sentences. For the TM we have a data set of 3,500 sentences, containing both simple and complex sentences (a complex sentence contains more than one clause).

3.3 Relevant Match

A match in the TM data for a sentence of the Query data is a relevant match if the matching percentage between the query sentence and the available TM data is equal to or greater than the fuzzy threshold set for the TM. For our experiments this fuzzy threshold is set to 75%.

In addition, there are several query inputs that partially match the TM data but whose matching percentage is less than the defined threshold; they are therefore left out of the result. In fact, if we examine them closely, we see that some of these matches are also relevant for the translator.

For example, the given sentence is

Query: “it doesn't disturb others' sleep”

the sentence in TM with which the input partially matches (36%) is

“Earphone is the best option available for them as it doesn't disturb others' sleep”

As the percentage match (36%) is below the set threshold (75%) of the TM, this sentence is not reflected in the query results.

The translator got no match, yet the sentence in the TM is a relevant match: it contains the query sentence as a clause, which would significantly help the translator.

In our experiments, we are trying to show that when we use sentence-level data, we may miss a relevant match for a query that exists as partial data (a clause) in the TM. If we use clause structuring in the TM data, we can retrieve that partial data as a relevant match for the query, whereas previously it was not retrieved because the match was below the set TM threshold.

In our experiments, we have ignored the number of results retrieved for a single query. We are only interested in whether at least one match is found when the query exists as a clause in the TM.

3.4 Experiments

3.4.1 Configuration
In this dissertation we have performed experiments with three kinds of TM configurations.

3.4.1.1 TM configuration 1
In TM configuration 1, Query data contains full sentences. TM data also
contains sentences. In this configuration, there is no clause level splitting
either in Query data or in TM data.

3.4.1.2 TM configuration 2

In TM configuration 2, the Query data contains sentences; the TM data contains sentences as well as the clauses of those sentences. In this configuration we used the clause splitter to structure the TM data.

3.4.1.3 TM configuration 3

In TM configuration 3, the Query data contains the original sentence and, in the case of complex sentences, its clauses; so there is not just one query but a set of queries to look up in the TM database. Similarly, the TM data contains the original sentences as well as the clauses of those sentences. In this configuration we used the clause splitter to structure the Query data as well.

3.4.2 Process to Run Experiment

For a given sentence or clause, we prepare a Query data set (which may contain one or more queries). We look up the TM database to see whether there is a match for each query in the TM data. Whenever even a single relevant match is found for a given query, we record "Match Found" as True; otherwise "Match Found" is set to False. From this we calculate how many queries matched the TM data and how many did not.
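The counting procedure above can be sketched as follows. The `match_percent` helper is the same assumed word-overlap stand-in used earlier (the tool's real matcher is not specified), and the data here is a toy example, not the experimental sets.

```python
from collections import Counter

def match_percent(query, segment):
    """Assumed word-overlap measure, normalized by the longer segment."""
    q, s = query.lower().split(), segment.lower().split()
    overlap = sum((Counter(q) & Counter(s)).values())
    return 100.0 * overlap / max(len(q), len(s))

def run_experiment(query_sets, tm_data, threshold=75.0):
    """Count query sets for which at least one relevant match exists.

    Each element of query_sets is a list of queries: a sentence alone, or a
    sentence plus its clauses, depending on the TM configuration."""
    found = 0
    for queries in query_sets:
        match_found = any(match_percent(q, seg) >= threshold
                          for q in queries for seg in tm_data)
        found += match_found  # "Match Found" True counts as 1
    return found, 100.0 * found / len(query_sets)

# Toy data: the first query set matches a clause entry, the second matches nothing.
tm = ["Earphone is the best option available for them",
      "as it doesn't disturb others' sleep"]
query_sets = [["It doesn't disturb others' sleep"],
              ["Completely unrelated text here"]]
print(run_experiment(query_sets, tm))  # (1, 50.0)
```

The "% Match" columns in the result tables correspond to the second value returned here.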

3.5 Experimental Results

3.5.1 Result: Data Set-A

‘S’ denotes sentences, ‘C’ denotes the clauses of those sentences, and ‘S, C’ denotes sentences together with their clauses.

Configuration        Query data   TM data     Match Found   % Match
TM configuration 1   100 S        3500 S      87            87%
TM configuration 2   100 S        5500 S,C    90            90%
TM configuration 3   138 S,C      5500 S,C    127           92%

Table 3: Result of Set-A

3.5.2 Result: Data Set-B

Configuration        Query data   TM data     Match Found   % Match
TM configuration 1   200 S        3500 S      47            22%
TM configuration 2   200 S        5500 S,C    48            23%
TM configuration 3   314 S,C      5500 S,C    86            27%

Table 4: Result of Set-B

3.5.3 Result: Data Set-C

Configuration        Query data   TM data     Match Found   % Match
TM configuration 1   300 S        3500 S      54            18%
TM configuration 2   300 S        5500 S,C    58            20%
TM configuration 3   384 S,C      5500 S,C    67            23%

Table 5: Result of Set-C

3.5.4 Conclusion of Results

Table 3, Table 4 and Table 5 show the results of the experiments for the different data sets (Set-A, Set-B, Set-C) and the different TM configurations. We have created a line graph (shown in Fig 2) from these tables. The Y-axis of the line graph shows the percentage of matches found for Query data in TM data; the X-axis shows the different configurations defined above in the configuration section.

Fig 2: analysis graph

3.5.4.1 Conclusion 1

Table 3, Table 4 and Table 5 show that whenever there is clause splitting, either in the Query data or in the TM data, there is an increase in % Match from the TM.

3.5.4.2 Conclusion 2

In Table 3, Table 4 and Table 5, the Query data set is different in each case. In spite of the different data sets (Set-A, Set-B, Set-C), we observe an increasing trend in % Match from the TM for each set. Hence we can conclude that clause splitting will always increase the % Match over sentence-level data.

3.5.4.3 Conclusion 3

In Table 3, Table 4 and Table 5, the increase in % Match is not uniform across cases. This is because the % Match for a given data set depends upon two factors, i.e. the TM data and the Query data.

Detailed examples of the TM configurations are given in the Annex.

Annex

Example of TM configurations

The defined threshold for the TM is 75%.

Query data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep

TM data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep

Table 6: TM configuration 1

To explain how the TM works, Table 6 shows two instances of Query data and one
instance of TM data. When we search for the first query in the TM database we
get a 100% match, as shown in the TM data.

For the second query we get no result from the TM database, because the
matching percentage (36%) is below the acceptable threshold of 75%. However,
we can see that the second query does have a relevant match (the underlined
portion of sentence 1 of the TM data).
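The thesis does not spell out the exact similarity formula; one plausible formulation that reproduces the figures above is a word-level edit distance normalized by the length of the longer sentence. The sketch below (the function name and whitespace tokenization are our own assumptions) yields 100% for the first query and 36% for the second:

```python
def word_similarity(a, b):
    """Fuzzy match score: 1 - (word-level edit distance / longer length)."""
    x, y = a.lower().split(), b.lower().split()
    # classic Levenshtein DP, over words instead of characters
    d = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        prev, d[0] = d[0], i
        for j, wy in enumerate(y, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (wx != wy))
    return 1 - d[len(y)] / max(len(x), len(y))

tm = ("Earphone is the best option available for them "
      "as it doesn't disturb others' sleep")
print(round(word_similarity(tm, tm) * 100))                                  # -> 100
print(round(word_similarity("It doesn't disturb others' sleep", tm) * 100))  # -> 36
```

The second score falls below the 75% threshold, so the query is dropped at sentence level even though a relevant clause is present.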

When we use “TM configuration 2” for the data set shown in Table 6, we get the
data shown in Table 7. In Table 7, if we search for sentence 2 of the Query
data, we get an acceptable % match: it matches TM data 1.2, which is a clause
of sentence 1 of the TM data.

Query data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
2. It doesn't disturb others' sleep

TM data:
1. Earphone is the best option available for them as it doesn't disturb others' sleep
1.1 Earphone is the best option available for them
1.2 as it doesn't disturb others' sleep

Table 7: TM configuration 2

Query data:
1. You should wear some protective materials while using the things that contain chemicals and you should keep the chemical from the reach of kids

TM data:
1. Drugs help us when we are not feeling well but you should keep the drugs from the reach of kids because it may be dangerous for the kids

Table 8: TM configuration 1

As we can see in Table 8, there is a relevant match between sentence 1 of the
Query data and sentence 1 of the TM data (the underlined portions). However,
we get no match, because the matching percentage (36%) between the sentences
of the Query data and the TM data is below the TM's defined threshold (75%).

Query data:
1. You should wear some protective materials while using the things that contain chemicals and you should keep the chemical from the reach of kids
1.1 You should wear some protective materials while using the things that contain chemicals
1.2 and you should keep the chemical from the reach of kids

TM data:
1. Drugs help us when we are not feeling well but you should keep the drugs from the reach of kids because it may be dangerous for the kids
1.1 Drugs help us when we are not feeling well
1.2 but you should keep the drugs from the reach of kids
1.3 because it may be dangerous for the kids

Table 9: TM configuration 3

Now we use “TM configuration 3” for the data set in Table 8; the results are
shown in Table 9. Sentence 1 of the Query data is split into two clauses,
1.1 and 1.2. Sentence 1 of the TM data is split into three clauses, 1.1, 1.2
and 1.3. As shown in the table, a relevant match exists between clause 1.2 of
the Query data and clause 1.2 of the TM data. Because they are compared as
clauses, the matching percentage (82%) now exceeds the TM threshold (75%).
Hence the pair qualifies as an acceptable match and appears in the result set,
whereas it was dropped from the result set in Table 8.
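Assuming a normalized word-level edit distance as the matcher (the thesis does not fix one; `word_similarity` and `best_match` are illustrative names of our own), the clause-level lookup can be sketched as follows, and the 82% score for the clause pair falls out of this formulation:

```python
def word_similarity(a, b):
    """Fuzzy match score: 1 - (word-level edit distance / longer length)."""
    x, y = a.lower().split(), b.lower().split()
    d = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        prev, d[0] = d[0], i
        for j, wy in enumerate(y, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (wx != wy))
    return 1 - d[len(y)] / max(len(x), len(y))

def best_match(query, tm_units, threshold=0.75):
    """Return (score, unit) for the best TM unit above the threshold, else None."""
    score, unit = max((word_similarity(query, u), u) for u in tm_units)
    return (score, unit) if score >= threshold else None

# TM configuration 3: the sentence plus its clauses are all searchable units
tm_units = [
    "Drugs help us when we are not feeling well but you should keep "
    "the drugs from the reach of kids because it may be dangerous for the kids",
    "Drugs help us when we are not feeling well",            # clause 1.1
    "but you should keep the drugs from the reach of kids",  # clause 1.2
    "because it may be dangerous for the kids",              # clause 1.3
]
hit = best_match("and you should keep the chemical from the reach of kids", tm_units)
print(round(hit[0] * 100))  # -> 82
```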

SUMMARY

This dissertation presents a clause-structure-based Translation Memory (TM).
We use a clause splitter to extract the clauses from a complex sentence and
then store the clauses together with their sentence in the TM. This is what we
call clause-based TM structuring or, in general terms, structuring of the TM.

We have seen that when we use clause-level structuring in the TM, relevant
matches for the Query data that were earlier dropped because of a low
sentence-level matching percentage are also retrieved in the result set. We
therefore get more relevant matches for the Query data from the TM database.

This study uses different TM configurations (TM configuration 1, TM
configuration 2, TM configuration 3) to support the above claim on different
test data sets. A translator might not get a match for a complete sentence but
will still get a match for a clause, which helps in performing the translation
task and thereby increases productivity (translated words per hour).

Directions for future work

As shown above, clause-level structuring retrieves relevant matches for the
Query data that sentence-level matching drops because of a low matching
percentage.

Clause-level structuring increases the size of the TM database, because the
database then contains the sentences as well as their clauses. This increase
in size may affect search performance in the TM database. Most search
algorithms for fuzzy matching are based on word-level edit distance or a
bag-of-words model.
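As a minimal illustration of the bag-of-words alternative (this is our own sketch, not the system's code), word order is ignored entirely and only the overlap of the two word sets counts:

```python
def bow_similarity(a, b):
    """Bag-of-words fuzzy match: Jaccard overlap of the two word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(bow_similarity("keep the drugs from the reach of kids",
                     "keep the chemical from the reach of kids"))  # -> 0.75
```

Because it discards order, the bag-of-words score can be computed from an inverted index without a per-pair dynamic program, which is why it scales better than edit distance as the TM grows.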

Performance can be improved if we keep POS-tag-annotated data for each word in
the TM and in the Query data. Our hypothesis is that keeping the POS tag
information of words may improve both performance and accuracy, because we can
then search for TM matches based on tag information instead of words, and thus
identify which TM text is structurally the same as the Query data. For this
experiment a large POS-tagged corpus will be required to build the TM
database. This work is in progress.
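The hypothesis can be illustrated with hand-tagged toy data (the tags and the helper below are our own illustration; in practice a POS tagger and a large tagged corpus would supply the annotations): matching on tag sequences finds structurally identical text even when the words differ.

```python
def tag_match(a, b):
    """True if two (word, POS) sequences share the same tag sequence."""
    return [t for _, t in a] == [t for _, t in b]

# hand-tagged toy examples (illustrative tags, not real tagger output)
q  = [("keep", "VB"), ("the", "DT"), ("chemicals", "NNS"), ("away", "RB")]
tm = [("keep", "VB"), ("the", "DT"), ("drugs", "NNS"), ("away", "RB")]
print(tag_match(q, tm))  # -> True
```

A word-level matcher would penalize the chemicals/drugs substitution, while the tag sequences VB DT NNS RB are identical, so the structural match survives.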


