Adv. CAT MT Proposal


Group: LTP. Co.

Name: Hailun Chen, Cheng Peng, Cecilia Hiu See Tang, Kimberly Kwan
Objective: To train a suitable SMT engine for translating avian influenza
reports
Topic:
We have chosen to use Hong Kong avian influenza reports, En <> Zh (HK), to
train our statistical machine translation (SMT) engine. Hong Kong’s Department of
Health has released a weekly report consistently since 2013, for a total of 52
reports on avian influenza each year. This pool of material is sufficient for
training.
Source:
English: https://www.chp.gov.hk/en/resources/29/332.html
Sample: https://www.chp.gov.hk/files/pdf/2018_avian_influenza_report_vol14_wk09.pdf
Traditional Chinese (HK): https://www.chp.gov.hk/tc/resources/29/332.html
Sample: https://www.chp.gov.hk/files/pdf/2018_avian_influenza_report_vol14_wk09.pdf

Quality goal:
One of the challenges will be handling English terms that should not be
translated, e.g. H5N1 and city names that were left in English.
Per 2,000-word document:

Error                  Number of errors tolerated
Medical terminology    0/10
Accuracy               7/10
Format and style       10/10
Country name           0/10

Our project: human post-editing needed

Medical terms, sentence structures, and formatting are highly repetitive across
the reports. We estimate about 10% exact matches and a further 25-30% or more
fuzzy matches. In addition, the existing reports on the government website are in
PDF format, so we need software or online tools to convert the PDFs to TXT and
then to TMX.
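
The exact and fuzzy match estimates above could be checked against the converted text before any training is done. The following is a minimal sketch, assuming the segments are available as plain strings. The function name `match_rates`, the 0.75 threshold, and the character-level similarity from Python's `difflib` are illustrative assumptions; CAT tools compute fuzzy matches with their own algorithms, so the percentages will not be identical. The sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def match_rates(new_segments, tm_segments, fuzzy_threshold=0.75):
    """Return (exact_rate, fuzzy_rate) of new_segments against tm_segments.

    A segment counts as exact if it appears verbatim in the TM, and as fuzzy
    if its best difflib similarity ratio reaches fuzzy_threshold (assumed value).
    """
    tm_set = set(tm_segments)
    exact = 0
    fuzzy = 0
    for seg in new_segments:
        if seg in tm_set:
            exact += 1
            continue
        best = max(
            (SequenceMatcher(None, seg, t).ratio() for t in tm_segments),
            default=0.0,
        )
        if best >= fuzzy_threshold:
            fuzzy += 1
    total = len(new_segments) or 1  # avoid division by zero on empty input
    return exact / total, fuzzy / total


# Hypothetical sample data, not taken from the actual reports.
tm = [
    "No new human cases of avian influenza A(H7N9) were reported this week.",
    "Members of the public should avoid contact with poultry.",
]
new = [
    "No new human cases of avian influenza A(H7N9) were reported this week.",
    "No new human cases of avian influenza A(H5N6) were reported this week.",
    "Appendix",
]
exact_rate, fuzzy_rate = match_rates(new, tm)
print(f"exact: {exact_rate:.0%}, fuzzy: {fuzzy_rate:.0%}")
```

Comparing every new segment against every TM segment is quadratic, which is acceptable for a few thousand segments but would need indexing for a larger corpus.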

Process
From a selection of 130 bilingual documents in PDF format, the first round of
training will use 12,000 segments for training, 500 segments for tuning, and 500
segments for testing. Based on the BLEU score from the first round, we will
adjust our dataset to attain a higher score by adding more training and tuning
documents, using training and tuning documents that are more closely related to
the topic, cleaning up line breaks, punctuation, etc., and adding glossary data
and term bases.
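
The 12,000 / 500 / 500 split described above could be produced along the following lines. This is a sketch under stated assumptions: aligned segments are held as (source, target) tuples, duplicates are removed before splitting so the same pair cannot land in more than one set, and the function name `split_corpus` and the fixed random seed are hypothetical choices, not part of the original workflow.

```python
import random


def split_corpus(pairs, n_tune=500, n_test=500, seed=0):
    """Deduplicate aligned (source, target) pairs, shuffle, and split them.

    Returns (train, tune, test). Everything left after the tuning and
    testing sets are carved out goes to training.
    """
    unique = list(dict.fromkeys(pairs))  # drops repeats, keeps first occurrence
    if len(unique) < n_tune + n_test:
        raise ValueError("not enough unique segments for tuning and testing")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique)
    test = unique[:n_test]
    tune = unique[n_test:n_test + n_tune]
    train = unique[n_test + n_tune:]
    return train, tune, test


# Hypothetical corpus: 13,000 distinct pairs plus 100 repeated ones.
pairs = [(f"source {i}", f"target {i}") for i in range(13000)]
pairs += pairs[:100]
train, tune, test = split_corpus(pairs)
print(len(train), len(tune), len(test))
```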

To ensure that the data fed into the engine is of the highest quality, our team
of terminologists will first clean the data of any awkward or inappropriate
strings using the Okapi Framework. We will then align the documents to create
bilingual translation memories, because only with translation memories does the
machine recognize 100% of the segments. Bilingual translation memories will
likewise be used for the tuning and testing stages.

We manually converted the files from PDF to TXT so that our alignment program,
YouAlign, could read them more reliably. From YouAlign, we downloaded the TMX
file for upload to the SMT engine. Before uploading it, we refined the TMX file
in Trados by realigning the source and target text. Since the language code that
YouAlign assigns to Chinese differs from the code that Microsoft Translator Hub
recognizes, we used Notepad++ to change the language code in the TMX file
manually, allowing Translator Hub to recognize the newly uploaded TMX.
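
The manual language-code change described above could also be scripted. The sketch below assumes the code is stored in the `xml:lang` attribute of each `<tuv>` element, as in standard TMX. The codes `zh-OLD` and `zh-NEW` are placeholders, since the actual codes used by YouAlign and Translator Hub are not stated here, and `rewrite_lang` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def rewrite_lang(tmx_text, old_code, new_code):
    """Replace old_code with new_code on every <tuv>; return (xml, count)."""
    root = ET.fromstring(tmx_text)
    changed = 0
    for tuv in root.iter("tuv"):
        if tuv.get(XML_LANG, "").lower() == old_code.lower():
            tuv.set(XML_LANG, new_code)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical, heavily simplified TMX content with placeholder codes.
sample = (
    '<tmx version="1.4"><header srclang="en-US"/><body>'
    '<tu><tuv xml:lang="en-US"><seg>Avian influenza</seg></tuv>'
    '<tuv xml:lang="zh-OLD"><seg>禽流感</seg></tuv></tu>'
    "</body></tmx>"
)
fixed, count = rewrite_lang(sample, "zh-OLD", "zh-NEW")
print(count)
```

A real TMX file usually starts with an XML declaration and a DOCTYPE, which this sketch does not write back, so the output would need to be checked before upload.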
Our quality measure for this project is the time and cost saved by machine
translation with human post-editing compared with employing a human translator.
If the time needed for post-editing (assuming the machine output does not contain
any critical or major errors) is significantly lower than the time needed by a
human translator, we may decide to pursue this machine translation engine.

To chart this, we will have our translator translate a test document from
English to Traditional Chinese and record the time devoted to this task. The
machine translation output will then be post-edited by two reviewers, each of
whom will have 0.5 hours to edit it. Our current goal for this engine is a 40%
time saving compared with a human translator: if post-editing the
machine-translated piece takes 40% less time than translating from scratch, we
will consider the engine a success.

In order to evaluate the time savings achieved through PEMT, the comparison will
be made on the basis of the standard 300 words/hour for human translation (HT).
Cost savings will also be evaluated based on a standard $0.23/word rate for HT
and the total cost of designing the engine. We aim for 80% cost savings. The two
translations will then be exchanged and their quality assessed using a version of
the LISA QA Model.
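
The time and cost targets above reduce to one percentage calculation against the HT baseline. The sketch below assumes a 2,000-word document, uses the planned post-editing allotment (two reviewers at 0.5 hours each) in place of a measured time, and uses a made-up PEMT cost of $92 purely to illustrate the 80% threshold. The function names are hypothetical.

```python
def ht_hours(words, words_per_hour=300):
    """Baseline human translation time at the standard 300 words/hour."""
    return words / words_per_hour


def saving(baseline, actual):
    """Fraction saved relative to the baseline (negative means a loss)."""
    return (baseline - actual) / baseline


words = 2000
baseline_hours = ht_hours(words)      # roughly 6.67 hours
post_edit_hours = 2 * 0.5             # planned allotment, not a measurement
time_saving = saving(baseline_hours, post_edit_hours)

baseline_cost = 460.0                 # 2,000 words at $0.23/word
example_pemt_cost = 92.0              # hypothetical figure for illustration
cost_saving = saving(baseline_cost, example_pemt_cost)

print(f"time saving: {time_saving:.0%} (target 40%)")
print(f"cost saving: {cost_saving:.0%} (target 80%)")
```

A negative result from `saving` indicates that PEMT took longer or cost more than the HT baseline.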

Timeline
Following the kickoff meeting on Monday, March 12, the project will be carried
out over a period of 4 weeks. Starting on March 13, one to two rounds of training
will be completed per weekday, with the final rounds being completed on April
13. Post-editing and QA will be completed by Monday, April 16. Data collected
from the engine training, post-editing and QA will be used to calculate time and
cost savings and quality estimates, which will be presented in the proposal to be
delivered by 10:00 a.m. on Thursday, April 18.
Estimated time breakdown and costs for this project are detailed in the following
table.
Task                                      Est. hr  Rounds  Hourly Rate  Subtotal
PEMT: general datapool, tuning, testing   2        10      $35.00       $700.00
Document alignment: data cleaning         5        1       $25.00       $125.00
Terminology management/
  defining ratio of single vocab/segments  2        1       $30.00       $60.00
Human post-editing                        0.5      2       $20.00       $20.00
QA                                        0.5      2       $30.00       $30.00
Total                                                                   $935
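
The subtotals in the table above are consistent with estimated hours × rounds × hourly rate. The short sketch below recomputes them from the table's own figures; the variable and function names are illustrative only.

```python
# (task, estimated hours per round, rounds, hourly rate in USD)
tasks = [
    ("PEMT: general datapool, tuning, testing", 2, 10, 35.00),
    ("Document alignment: data cleaning", 5, 1, 25.00),
    ("Terminology management", 2, 1, 30.00),
    ("Human post-editing", 0.5, 2, 20.00),
    ("QA", 0.5, 2, 30.00),
]


def subtotal(hours, rounds, rate):
    """Cost of one task line: hours per round times rounds times rate."""
    return hours * rounds * rate


subtotals = {name: subtotal(h, r, rate) for name, h, r, rate in tasks}
total = sum(subtotals.values())

for name, amount in subtotals.items():
    print(f"{name}: ${amount:.2f}")
print(f"Total: ${total:.2f}")
```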

Cost saving: The following table shows the cost breakdown of PEMT versus HT
based on a sample of 2,000 words and using rates established in the project
proposal.

English > Chinese (Traditional)  PEMT ready  Rate/Word  Subtotal  Editing Rate  Editing Subtotal  Total
HT (per document)                -           $0.23      $460.00   $0.13         $260.00           $720.00
PEMT                             $935        -          -         $0.03         $60.00            $935.00
PEMT Savings:                                                                                     -$215

Deliverables:
Since our target documents are health reports published by the Hong Kong
government, an abundance of materials is already available for the training and
tuning process. Our timeline matches closely with the process of training the
engine, taking a total of 20 hours to complete all 10 rounds of training.

While the cost of training an SMT engine exceeds the cost of hiring a human
translator, the human translation cost shown above covers only one document,
whereas the $935 for the SMT engine can be written off after the first document.
Each subsequent document will cost only about $60 in editing, assuming documents
of similar length.

As the final stage before human post-editing, the engine output will be imported
into a CAT tool and reviewed using that tool’s automatic QA checks. This stage is
especially necessary for verifying that the engine has properly translated
numerical values and that dates are properly localized.
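
As a rough illustration of the kind of numeric check a CAT tool's QA module performs, the sketch below compares the numbers and influenza subtype codes (such as H5N1) found in a source segment with those in its translation. It is an assumption-laden sketch rather than a description of any particular tool: the regular expression, the function names, and the sample sentences are hypothetical, and numbers that are legitimately written as Chinese numerals or reformatted dates would be flagged for a human to review.

```python
import re
from collections import Counter

# Subtype codes such as H5N1, or runs of digits with optional . or , groups.
TOKEN_RE = re.compile(r"H\d+N\d+|\d+(?:[.,]\d+)*")


def protected_tokens(text):
    """Count the numbers and subtype codes that should survive translation."""
    return Counter(TOKEN_RE.findall(text))


def missing_tokens(source, target):
    """Tokens that occur in the source more often than in the target."""
    return protected_tokens(source) - protected_tokens(target)


# Hypothetical segment pair for illustration.
source = "3 new human cases of H7N9 were reported in 2018."
good_target = "2018年報告了3宗新增H7N9人類感染個案。"
bad_target = "2018年報告了3宗新增禽流感人類感染個案。"

print(missing_tokens(source, good_target))
print(missing_tokens(source, bad_target))
```

An empty result means every protected token in the source was found in the target at least as many times.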

Possible Future Opportunity:


Assuming that machine translation with post-editing reaches acceptable quality
in less time, we may also run additional tests to determine whether pursuing this
engine is still worthwhile when the translator is provided with an acceptable
translation memory. We would then compare the time taken by a human translator
working with a translation memory against the time taken by the engine plus human
post-editing.
___________________________________ ________________________
Client Date

Findings:
Based on the results of our project, we would not recommend training an SMT
engine for this type of document. Our results indicate that using a human
translator would cost less than using an SMT engine. Additionally, if we provide
the translator with a pre-approved translation memory, the number of segments the
translator needs to translate would be reduced by more than half, cutting cost by
at least 50%. Because these documents are published very frequently, the amount
of new information in each document is minimal. Furthermore, the engine ran into
a multitude of problems because the segments in our documents weren’t ‘unique’
enough for the machine to recognize.

To work around this issue, we had the engine automatically select the text for
the testing stage instead of selecting it manually ourselves, which may have
introduced errors. In the latter part of testing, the same issue caused multiple
rounds to fail because the testing data wasn’t unique enough.

In addition, based on the actual total we calculated through the process, the
cost of building an SMT engine rose from the estimated $935 to $2,395.

Hypothetically, if the SMT engine works, roughly ten documents would need to be
translated before the cumulative cost of the engine and that of a human
translator even out.

# of docs  HT      PEMT
1          $720    $2,395
2          $960    $2,475
3          $1,680  $2,555
4          $1,920  $2,635
5          $2,160  $2,715
6          $2,400  $2,795
7          $2,640  $2,875
8          $2,880  $2,955
9          $3,120  $3,035
10         $3,360  $3,115
In rounds that succeeded, the output of the SMT engine was composed of segments
from multiple different sources (e.g. from 2013, 2014, and 2018), leaving our
translator unable to fully edit the document without searching for the
individual source texts as reference material, which took more than 2 hours.

Another point worth noting is that the documents are repetitive and relatively
easy, despite the many medical and technical terms. As a result, once sufficient
TM has been built for this type of document, translators can complete human
translation in a shorter time.

Actual time and cost breakdown:


Task                                      Est. hr  Rounds  No. of people  Hourly Rate  Subtotal
Group meeting                             2        6       4              $30.00       $1440
PEMT: general datapool, tuning, testing   2        10      1              $35.00       $700.00
Document alignment: data cleaning         5        1       1              $25.00       $125.00
Terminology management/
  defining ratio of single vocab/segments  2        1       1              $30.00       $60.00
Human post-editing                        0.5      2       2              $20.00       $40.00
QA                                        0.5      2       1              $30.00       $30.00
Total                                                                                  $2395

INITIAL ROUND                                Rate/Word  Cost
Human Translation (per 2,000-word document)  $0.23      $460
Editing                                      $0.13      $260
Total                                                   $720

SECOND ROUND AND ONWARD                      Rate/Word  Cost
HT                                           $0.08      $160
Editing                                      $0.04      $80

PEMT savings:
Human Translation $720
PEMT $2395
Savings -$1675
