Adv. CAT MT Proposal
Names: Hailun Chen, Cheng Peng, Cecilia Hiu See Tang, Kimberly Kwan
Objective: To train a suitable SMT engine for translating avian influenza
reports
Topic:
We have chosen to use the Avian Influenza Reports published in Hong Kong (English <> Traditional Chinese (HK)) to train our Statistical Machine Translation (SMT) engine. Hong Kong's Department of Health has released these weekly reports consistently since 2013, for a total of 52 Avian Influenza reports per year, so our pool of data is sufficient.
Source:
English: https://www.chp.gov.hk/en/resources/29/332.html
Sample: https://www.chp.gov.hk/files/pdf/2018_avian_influenza_report_vol14_wk09.pdf
Traditional Chinese (HK): https://www.chp.gov.hk/tc/resources/29/332.html
Sample: https://www.chp.gov.hk/files/pdf/2018_avian_influenza_report_vol14_wk09.pdf
Quality goal:
One of the challenges will be handling English terms that should not be
translated, e.g. "H5N1" and city names that are conventionally left untranslated.
Error tolerance per document of 2,000 words:

Error type          Errors tolerated
Accuracy            7/10
Format and style    10/10
Country names       0/10

Our project: human post-editing needed.
There is heavy repetition in the medical terminology, sentence structure, and
format of these reports. We assume roughly 10% exact matches and 25-30% fuzzy
matches. In addition, the existing reports on the government website are in PDF
format, so we need access to software or online tools to convert the PDFs to TXT
and then to TMX format.
Process
With a selection of 130 bilingual documents in PDF format in total, the first round
of training will be based on 12,000 segments for training, 500 segments for
tuning, and 500 segments for testing. Based on the initial BLEU score produced
during the first round of training, we will adjust our dataset to attain a higher
BLEU score by adding more training and tuning documents, using training and
tuning documents more closely related to the topic, cleaning up line breaks and
punctuation, and adding glossary data and term bases.
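The 12,000/500/500 split above can be sketched as a simple shuffle-and-slice over the aligned segment pairs. The function name and seed are ours, for illustration; the demonstration uses a proportionally smaller toy corpus:

```python
import random

def split_corpus(segments, n_train=12000, n_tune=500, n_test=500, seed=42):
    """Shuffle aligned segment pairs and split them into the training,
    tuning, and testing sets used for engine training."""
    if len(segments) < n_train + n_tune + n_test:
        raise ValueError("not enough segments for the requested split")
    rng = random.Random(seed)
    shuffled = segments[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:n_train + n_tune + n_test]
    return train, tune, test

# Toy corpus of 130 bilingual "documents", split at smaller sizes.
pairs = [(f"en-{i}", f"zh-{i}") for i in range(130)]
train, tune, test = split_corpus(pairs, n_train=100, n_tune=15, n_test=15)
print(len(train), len(tune), len(test))  # 100 15 15
```

Holding the seed fixed keeps the split reproducible across training rounds, so any BLEU change reflects the data adjustments rather than a reshuffled test set.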
To guarantee that the data fed to the engine is of the highest quality, our team
of terminologists will first clean the data of any awkward or inappropriate
strings using the Okapi Framework. We will then align the documents to create a
bilingual translation memory, because only with translation memories does the
engine recognize 100% of the segments. Bilingual translation memories will
likewise be used for the tuning and testing stages.
We will manually convert the files from PDF to TXT so that our alignment
program, YouAlign, can read them more reliably. From YouAlign we will download
the TMX file to be uploaded to the SMT engine. Before uploading, we will refine
the TMX file by realigning the source and target text more precisely in Trados.
Since the language code YouAlign emits for Chinese differs from the code that
Microsoft Translator Hub recognizes, we will use Notepad++ to manually change
the language code in the TMX file, allowing Translator Hub to recognize the
newly uploaded TMX.
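The language-code fix described above could also be scripted rather than edited by hand in Notepad++. A minimal sketch, assuming the aligner emits a code like "ZH-HK" in xml:lang attributes and the hub expects something like "zh-Hant" (the exact code pair is an assumption; check both tools' exports):

```python
import re

def fix_tmx_lang_codes(tmx_text, mapping):
    """Rewrite xml:lang attribute values in a TMX file so the codes
    emitted by the aligner match what the MT hub expects."""
    def repl(match):
        code = match.group(1)
        return 'xml:lang="%s"' % mapping.get(code, code)
    return re.sub(r'xml:lang="([^"]+)"', repl, tmx_text)

sample = '<tuv xml:lang="ZH-HK"><seg>禽流感</seg></tuv>'
print(fix_tmx_lang_codes(sample, {"ZH-HK": "zh-Hant"}))
# <tuv xml:lang="zh-Hant"><seg>禽流感</seg></tuv>
```

Because only attribute values are rewritten, the TMX stays well-formed and no other markup is touched.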
Our quality measure for this project will be the time and cost savings of
machine translation with human post-editing compared with employing a human
translator alone. If the time needed for post-editing (assuming the machine
output contains no critical or major errors) is significantly lower than the
time needed for human translation, we may decide to pursue this machine
translation engine.
To chart this, we will have our translator translate a test document from
English to Traditional Chinese and record the time devoted to the task. The
machine translation output will then be post-edited by two reviewers, each of
whom will have 0.5 hours to edit it. We currently set the goal for this engine
at 40% time savings compared to a human translator: if post-editing the
machine-translated text takes 40% less time than translating from scratch, we
consider the engine a success.
In order to evaluate the time savings achieved through PEMT, the comparison will
be made on the basis of the standard 300 words/hour for human translation (HT).
Cost savings will also be evaluated based on a standard $0.23/word rate for HT
and the total cost of designing the engine. We aim for 80% cost savings. The two
translations will then be exchanged and their quality assessed using a version of
the LISA QA Model.
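The comparison above reduces to a few lines of arithmetic. This sketch uses the figures stated in the proposal (300 words/hour, $0.23/word, two reviewers at 0.5 hours each); the function names are ours:

```python
def ht_hours(words, speed=300):
    """Hours a human translator needs at the standard 300 words/hour."""
    return words / speed

def ht_cost(words, rate=0.23):
    """Human-translation cost at the standard $0.23/word rate."""
    return words * rate

def time_savings(ht_time, pe_time):
    """Fraction of HT time saved by post-editing machine output."""
    return 1 - pe_time / ht_time

words = 2000                      # one weekly report
baseline_hours = ht_hours(words)  # ≈ 6.67 h
baseline_cost = ht_cost(words)    # $460.00
pe_hours = 2 * 0.5                # two reviewers, 0.5 h each
print(baseline_hours, baseline_cost, time_savings(baseline_hours, pe_hours))
```

On these figures, one hour of post-editing against a roughly 6.7-hour HT baseline would clear the 40% time-savings goal comfortably; whether the 80% cost-savings goal is met depends on the engine's one-time cost.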
Timeline
Following the kickoff meeting on Monday, March 12, the project will be carried
out over a period of 4 weeks. Starting on March 13, one to two rounds of training
will be completed per weekday, with the final rounds completed on April 13.
Post-editing and QA will be completed by Monday, April 16. Data collected
from the engine training, post-editing and QA will be used to calculate time and
cost savings and quality estimates, which will be presented in the proposal to be
delivered by 10:00 a.m. on Thursday, April 18.
Estimated time breakdown and costs for this project are detailed in the following
table.
Task                                   Est. hr   Rounds   Hourly rate   Subtotal
PEMT (general data pool, tuning,
  and testing)                         2         10       $35.00        $700.00
Document alignment and data cleaning   5         1        $25.00        $125.00
Terminology management / defining
  ratio of single vocab to segments    2         1        $30.00        $60.00
Human post-editing                     0.5       2        $20.00        $20.00
QA                                     0.5       2        $30.00        $30.00
Total                                                                   $935.00
Cost saving: The following table shows the cost breakdown of PEMT versus HT
based on a sample of 2,000 words and using rates established in the project
proposal.
While the cost of training an SMT engine does exceed the cost of hiring a human
translator, keep in mind that the human-translator cost shown above covers only
one document, while the $935 for the SMT engine is a one-time cost written off
after the first document. Each subsequent document will cost only about $60 in
editing, assuming documents of similar length.
As the final stage before human post-editing, the engine's translated output
will be imported into a CAT tool and reviewed with that tool's automatic QA
checks. This stage is especially necessary for verifying that the engine has
translated numerical values correctly and that dates are properly localized.
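A minimal version of the numeric check could compare the digits found on each side of a segment pair. The regex below only catches Arabic numerals, so localized forms such as 十二 would still need a reviewer or a mapping table; the function is our illustration, not a specific CAT tool's check:

```python
import re

def numbers_match(source, target):
    """Compare the multisets of numerals in a source/target segment pair;
    a mismatch flags the segment for reviewer attention."""
    pattern = r"\d+(?:\.\d+)?"
    return sorted(re.findall(pattern, source)) == sorted(re.findall(pattern, target))

print(numbers_match("12 cases of H5N1 were reported in week 9.",
                    "第9週共呈報12宗H5N1個案。"))   # True
print(numbers_match("12 cases were reported.",
                    "共呈報十二宗個案。"))           # False (Chinese numeral)
```

Sorting rather than comparing in order tolerates the reordering that Chinese word order often requires.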
Findings:
Based on the results of our project, we would not recommend training an SMT
engine for this type of document. Our results clearly indicate that using a
human translator costs less than using an SMT engine. Additionally, if we
provide the translator with a pre-approved translation memory, the number of
segments the translator needs to translate drops by more than half, cutting
cost by at least 50%. Since these documents are published very frequently, the
amount of new information in each document is minimal. Furthermore, the engine
ran into a multitude of problems because the segments of our documents were not
'unique' enough for the machine to recognize.
To bypass this issue, we had the engine automatically select the text it wanted
to use for the testing stage instead of selecting it manually ourselves, so
possible errors could arise from this discrepancy. In the latter part of
testing, this issue caused multiple rounds to fail because the testing data was
not unique enough.
In addition, based on the actual totals we calculated during the process, the
cost of building the SMT engine exceeded $900, rising to over $2,000.
Hypothetically, even if the SMT engine worked, a total of ten rounds would need
to be conducted before the cost of the engine and the cost of a human
translator evened out.
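The break-even reasoning can be written as one formula: the one-time engine cost divided by the per-document saving of PEMT over HT. The dollar figures in the example are illustrative placeholders, not the project's audited costs:

```python
import math

def break_even_docs(engine_cost, ht_per_doc, pemt_per_doc):
    """Documents needed before cumulative PEMT cost drops below
    cumulative human-translation cost."""
    per_doc_saving = ht_per_doc - pemt_per_doc
    if per_doc_saving <= 0:
        return math.inf  # PEMT never catches up
    return math.ceil(engine_cost / per_doc_saving)

# Hypothetical figures: a $2,400 one-time engine cost and a $240
# per-document saving ($720 HT vs. $480 PEMT) yield ten documents.
print(break_even_docs(2400, 720, 480))  # 10
```

The formula also makes the failure mode explicit: if per-document PEMT cost meets or exceeds the HT rate, no volume of documents recovers the engine investment.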
PEMT savings (per 2,000-word document, projected over documents 1-10):

Human Translation   $720
PEMT                $2,395
Savings             -$1,675