
A quantitative method for evaluation of CAT tools based on user preferences

Anna Zaretskaya
University of Malaga, Avda. Cervantes, 2, 29071 Málaga, Spain

Abstract

Evaluation of translation software is a task that depends heavily on its purpose: comparing and
ranking existing tools, assessing progress in the development of a single tool, determining the
usefulness of a tool for a specific working scenario, and so on. No evaluation methodology fits every
purpose. In this article we evaluate four popular translation tools from the point of view of user
preferences. The evaluation is based on a user survey in which respondents ranked features of
translation tools by their usefulness. The evaluation scheme we propose takes into account three
software quality characteristics: Functionality, Adaptability and Interoperability. We suggest that the
scheme is suitable for evaluating how well currently existing tools satisfy the requirements of the
majority of users regarding these characteristics.

Keywords: Computer-assisted translation; user requirements; user survey; software evaluation.

1. Introduction

CAT (computer-assisted translation) tools are the most popular type of computer program
designed for professional translators. Their purpose is to reduce the amount of repetitive work and to
make the translation process faster and easier, thus allowing a larger translation throughput.
Nowadays CAT tools assist translators in a great number of translation-related tasks by offering ever
more advanced features. However, this also complicates translators' work: firstly, they have to adapt
to new systems, and secondly, they have to choose among a growing variety of available tools.
Even though translation systems, and CAT tools in particular, are frequently evaluated for
various purposes, finding a reliable and objective evaluation strategy remains a challenge, first of all
because each evaluation method is created for a specific purpose. This paper presents a quantitative
user-oriented method for the evaluation and comparison of CAT tools. It is based on the feature
checklist methodology, which is often used for CAT tool evaluation and consists of a list of features,
each of which is marked as present or absent in a given tool. In our case, the feature list was
developed on the basis of a user survey in which participants evaluated the usefulness of different
features and functionalities of CAT tools (see Section 3.1). We add a user perspective to the
evaluation method by assigning weights to the features based on their popularity among the survey
respondents. Thus, the purpose of this evaluation scheme is to assess how well existing tools satisfy
the requirements of a large population of users.
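To make the checklist idea concrete, the following minimal sketch (in Python; the data structure and feature names are illustrative, not taken from any particular evaluation) records the presence or absence of each feature in a tool:

# Minimal feature checklist: each feature is marked as present (True)
# or absent (False) in the tool under evaluation.
checklist = {
    "TM matching": True,
    "Spelling checker": True,
    "Concordance search": False,
}

present = sum(checklist.values())
print(f"{present} of {len(checklist)} features present")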

2. Previous research

In developing the evaluation scheme we considered previous attempts to evaluate CAT
tools. In particular, we paid attention to the EAGLES framework for the evaluation of NLP (Natural
Language Processing) systems. The EAGLES (1996) report includes formalisms of evaluation
procedures for various types of systems, including translators' aids, for which it proposed the use of
a feature checklist. This idea was later adopted in multiple works on CAT tool evaluation
(Starlander & Morado Vázquez, 2013; Rico, 2001; Höge, 2002). The EAGLES methodology took as
its starting point the ISO 9126 standard for software quality (ISO/IEC, 1991), which includes
definitions of general quality characteristics, some of which we adopted for our scheme, namely the
following:

• Functionality. A set of attributes that bear on the existence of a set of functions and their
specified properties. The functions are those that satisfy stated or implied needs. In CAT
tools, they are functions related to translation and word processing, e.g. TM (translation
memory) matching, spelling checker, concordance search, etc.
• Interoperability. Attributes of software that bear on its ability to interact with specified
systems. In our case these include, for instance, interaction with other CAT tools via
support of their file formats.
• Adaptability. Attributes of software that bear on the opportunity for its adaptation to
different specified environments without applying other actions or means than those
provided for this purpose for the software considered. These are various adjustable settings,
for instance keyboard shortcuts in CAT tools.

3. Evaluation method

3.1. User survey results

The survey ‘Computer tools for translators: user needs’ (Zaretskaya, Corpas Pastor & Seghiri,
2015) was conducted in order to obtain information from professional translators about their needs
and preferences regarding these tools. It included both closed and open-ended questions on the use
of machine translation (MT), translation memories, terminology management systems and textual
corpora, among others.
For this research we used the data from one of the survey questions, in which participants had to
evaluate different functionalities of CAT software. They were given a list of features included in
state-of-the-art tools and had to label each of them as ‘essential’, ‘useful’, ‘not so useful’, ‘not
important’ or ‘inconvenient’. A number of features were considered essential by many users, such
as an integrated terminology management system, the possibility to store a TM on one's own
computer, high working speed, a simple interface and a smooth learning curve, support for a large
number of document formats, and support for formats originating from other TM software. In
general, all of the mentioned features were regarded as useful, while some of them, such as data
storage in the cloud, a web-based version of the tool, or versions for different operating systems,
seemed less important to respondents. Additionally, many users mentioned auto-propagation and
concordance search in their comments.
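For the calculations described in Section 3.3, these labels correspond to a numeric scale from -1 (inconvenient) to 3 (essential). A sketch of this mapping follows; the two endpoint values come from the paper, while the intermediate ones are our reading of the five-point label scale:

# Usefulness labels mapped to numeric scores. Only the endpoints (-1
# and 3) are stated in the paper; the intermediate values are assumed.
USEFULNESS_SCORE = {
    "essential": 3,
    "useful": 2,
    "not so useful": 1,
    "not important": 0,
    "inconvenient": -1,
}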

3.2. CAT tools

For this case study we chose four popular CAT tools available on the market: SDL Trados
Studio 2014, MemoQ 2013, Memsource Web Editor (version of February 2015) and Matecat
(version of June 2015). The first two are installable tools, while Memsource is a cloud-based
application with a license subscription and Matecat is a free open-source web-based tool.
SDL Trados Studio (http://www.translationzone.com/products/sdl-trados-studio/, further referred
to as Studio) is the market leader and is often required by clients and translation agencies in
translation projects. MemoQ (https://www.memoq.com/) by Kilgray is one of the most popular
alternatives to Studio. Memsource Web Editor (https://www.memsource.com/en/) has both a
desktop version and a web browser version. Its main advantage is a relatively simple interface with
all the main features included. This is also true of Matecat (https://www.matecat.com/), which is
free of charge and in addition uses cutting-edge web-based technologies, such as incorporation of a
publicly available online TM and MT.

3.3. Evaluation scheme

The list of features used for evaluation includes:

• Features that were evaluated by the respondents of the questionnaire. They were selected
based on the previous surveys on the subject and on the analysis of common CAT tools.
• Features that the respondents mentioned as their favourite in the corresponding open-ended
question of the survey.

The usefulness score given by each respondent takes values from -1 (Inconvenient) to 3
(Essential). For each feature we calculated its average usefulness score:


u_{av} = \frac{u_1 + \dots + u_n}{n},

where n is the number of respondents who evaluated the feature.
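A minimal sketch of this calculation, assuming a feature's ratings are collected in a list (respondents who did not rate the feature are simply absent from it):

def average_usefulness(scores):
    # u_av = (u_1 + ... + u_n) / n, where n is the number of
    # respondents who rated the feature on the -1..3 scale.
    return sum(scores) / len(scores)

# Hypothetical ratings from five respondents:
print(average_usefulness([3, 3, 2, 3, 1]))  # 2.4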
Then we grouped the features into three classes based on the score: the ones with the lowest
scores were put in Class 1 and those with the highest scores in Class 3. The features retrieved
from the comments were assigned a class based on their relative popularity compared to other
features mentioned in the comments. In total, Class 1 consisted of 18 items, Class 2 of six items, and
Class 3 of four items. Correspondingly, during the evaluation the score given to each feature was
weighted according to the class it belonged to; thus, more popular features carry more weight in the
total quality score. For instance, concordance search, terminology management and support for
many file formats were among the most essential features, so they were assigned a weight of 3.
Aligning sentences, storing a TM in the cloud, and accessing online terminology resources were
among the least important features, so they received a weight of 1.
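The paper does not state the exact score boundaries between the classes, so the thresholds in the following sketch are illustrative assumptions; only the mapping from class to weight follows the scheme described above:

def feature_weight(u_av, low=1.0, high=2.0):
    # Features with the highest average usefulness fall into Class 3
    # (weight 3), the lowest into Class 1 (weight 1). The 'low' and
    # 'high' boundaries are assumed, not taken from the paper.
    if u_av >= high:
        return 3
    if u_av >= low:
        return 2
    return 1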
The table in Appendix A presents the detailed evaluation scheme. The scores that can be
assigned to the tools for each feature vary from 0 to 2. As specified in the Metric column, in most
cases 0 is given when the feature is absent, 1 indicates partial integration, and 2 is given when the
feature fully accomplishes its function. For example, the ‘Real-time target preview’ feature has a
scale from 0 to 2: the maximum is given when the translator can see the target text in its final
format in real time while translating. Some tools do not provide this possibility, but most of them
allow generating and downloading a preview at any time during translation; in this case the tool
receives a score of 1. As another example, consider the feature ‘Support of a big number of
document formats’, i.e. the formats of the source document one can use with the tool: tools that
support fewer than 40 document formats are given zero points, those that support between 41 and
50 formats one point, and those that support more than 50 formats two points.
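As a concrete illustration, this metric could be computed as follows (the function name is ours; the thresholds are the paper's, whose bands do not cover the case of exactly 40 formats, which we place in the lowest band):

def document_format_score(n_formats):
    # 2 points for more than 50 supported document formats,
    # 1 point for 41-50, 0 otherwise (exactly 40 is not covered by
    # the paper's bands and falls into the lowest one here).
    if n_formats > 50:
        return 2
    if n_formats >= 41:
        return 1
    return 0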
The score given to each tool for a specific feature is multiplied by the feature's weight. Finally,
each feature represents a measurable component of one of the broader quality characteristics
described in Section 2: Functionality, Adaptability or Interoperability. This allows us to draw
conclusions about more abstract characteristics of the software, which is an advantage over a
simple feature checklist.
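Putting the pieces together, the following sketch shows the aggregation: each feature contributes its score multiplied by its weight to the total of its quality characteristic, and switching the weights off reproduces the non-weighted variant used for comparison in Section 3.4. The feature list is a small illustrative excerpt, not the full Appendix A table:

from collections import defaultdict

# (feature, characteristic, weight, score for one tool): excerpt only.
FEATURES = [
    ("Concordance", "Functionality", 3, 2),
    ("Terminology management", "Functionality", 3, 2),
    ("Adjustable keyboard shortcuts", "Adaptability", 2, 2),
    ("Number of TM formats", "Interoperability", 3, 1),
]

def quality_scores(features, weighted=True):
    # Sum score * weight per characteristic; with weighted=False every
    # feature counts equally, as in the non-weighted comparison.
    totals = defaultdict(int)
    for _name, characteristic, weight, score in features:
        totals[characteristic] += score * (weight if weighted else 1)
    return dict(totals), sum(totals.values())

print(quality_scores(FEATURES))                  # weighted scheme
print(quality_scores(FEATURES, weighted=False))  # non-weighted scheme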

3.4. Evaluation results

SDL Trados Studio received the highest total score, closely followed by MemoQ. The two web-
based applications, Matecat and Memsource, showed significantly lower total scores, with only a
small difference between them. The difference in scores is mostly caused by the Functionality
scores: MemoQ and Studio outscore the other two tools in Functionality by more than ten points.
An interesting observation is that even though Studio received the highest total score, for the
Functionality characteristic alone it loses to MemoQ by two points; instead, it received significantly
higher scores for Adaptability and Interoperability. MemoQ, on the contrary, was the best in
Functionality and the worst in Interoperability. Appendix A provides the complete evaluation
results.
In order to see how the weights influenced the final quality scores, we repeated the same
calculations without assigning the weights (see Table 1, where weighted scores are given first, with
non-weighted scores in parentheses). The total ranking of the tools remained the same. However,
some differences in specific characteristics were revealed: Memsource and Matecat received the
same non-weighted Functionality score, whereas in the weighted scheme Memsource scored better.
In addition, Memsource and Studio received the same non-weighted Adaptability score, while in the
weighted scheme Studio had a higher score.

Table 1. Difference between weighted and non-weighted evaluation (weighted scores first, non-weighted in parentheses).

                   MemoQ      Memsource   Matecat    Studio
Functionality      42 (27)    28 (18)     27 (18)    40 (26)
Adaptability        8 (4)      7 (5)       4 (3)      9 (5)
Interoperability    8 (4)     11 (5)      11 (5)     14 (6)
Total              58 (35)    46 (28)     42 (26)    63 (37)

4. Conclusions

In this paper we presented a quantitative evaluation method designed to compare certain quality
characteristics of CAT tools, namely Functionality, Adaptability and Interoperability. The
evaluation scheme takes into account preferences of professional translators regarding different
features of CAT tools, identified by means of a user survey. Thus, our evaluation method takes into
account not only different quality characteristic of translation software, but also introduces a user
perspective. The evaluation was performed on four popular tools. We suggest that the proposed
evaluation method can be used by developers to verify if a tool satisfies the needs of the majority of
the users. In addition, CAT software users can adjust the scheme by assigning different weights
according to their own individual preferences. It has to be mentioned, however, that this method, as
well as other methods based on feature checklists, does not produce an absolute quality score. That
is first of all because it does not cover some important quality characteristics, such as Usability. In
other words, even software that provides all of the considered functionalities can be difficult in use
or have a steep learning curve due to the way these features are implemented.

Acknowledgements

Anna Zaretskaya is supported by the People Programme (Marie Curie Actions) of the European
Union’s Framework Programme (FP7/2007-2013) under REA grant agreement No 317471. The
research reported in this article has been partially carried out in the framework of the research group
Lexytrad.

References

EAGLES (1996). Evaluation of natural language processing systems. Final report. EAGLES Evaluation Working Group.
Höge, M. (2002). Towards a Framework for the Evaluation of Translators’ Aids Systems. Ph.D. thesis, Department of
Translation Studies, Faculty of Arts, University of Helsinki.
ISO/IEC (1991). Software engineering – Product quality. International standard, ISO/IEC 9126.
Rico, C. (2001). Reproducible models for CAT tools evaluation: A user-oriented perspective. In Proceedings of the
Twenty-third International Conference on Translating and the Computer. London: Aslib.
Starlander, M. & Morado Vázquez, L. (2013). Training translation students to evaluate CAT tools using EAGLES: a case
study. In Proceedings of the Thirty-fifth Translating and the Computer Conference. London: Aslib.
Zaretskaya, A., Corpas Pastor, G., & Seghiri, M. (2015). Translators’ requirements for translation technologies: a user
survey. In Corpas-Pastor, G., Seghiri-Domínguez, M., Gutiérrez-Florido, R., & Urbano, M. (Eds), New Horizons in
Translation and Interpreting Studies (pp. 247-254). Malaga: AIETI, Tradulex.

Appendix A. Evaluation table.

Feature                        Metric                                  Weight  MemoQ  Memsource  Matecat  Studio

Functionality
Concordance                    0-no, 2-yes                             3       2*3    2*3        2*3      2*3
Auto propagation               0-no, 2-yes                             2       2*2    0          2*2      2*2
Aligner                        0-no, 2-yes                             1       2*1    2*1        0        2*1
Storing TM in the cloud        0-no, 2-yes                             1       2*1    2*1        2*1      2*1
Real-time QA                   0-no, 1-not real-time, 2-yes            2       2*2    1*2        2*2      2*2
Access to online TM            0-no, 1-with plugin, 2-yes              1       1      0          2*1      1
Access to online term. res.    0-no, 1-with plugin, 2-yes              2       1*2    0          0        1*2
Sub-segment suggestions        0-no, 1-with plugin, 2-yes              1       2*1    1          0        1
Real-time target preview       0-no, 1-generate preview, 2-real-time   2       2*2    1*2        1*2      1*2
Good grammar checker           0-no, 2-yes                             1       1      1          1        1
Merge TMs                      0-no, 2-yes                             1       0      0          0        2*1
Easily add terms               0-no, 1-not easy, 2-yes                 1       1      1          0        1
Segment assembly               0-no, 2-yes                             1       2*1    0          0        0
Dictation                      0-no, 1-works w/DNS, 2-integration      1       1      1          1        1
Simple handling of tags        0-no, 1-some tags, 2-all tags           1       1      1          1        1
Terminology management         0-no, 2-yes                             3       2*3    2*3        0        2*3
Machine translation            0-no, 1-with plugin, 2-yes              1       1      1          2*1      2*1
Work with > 1 TM               0-no, 2-yes                             1       2*1    2*1        2*1      2*1
Total Functionality                                                            42     28         27       40

Adaptability
Different OS                   0-no, 2-yes                             1       0      1          1        0
Adjustable keyboard shortcuts  0-no, 2-yes                             2       2*2    0          0        2*2
Adjustable segmentation        0-no, 1-merge/split segments, 2-yes     2       2*2    2*2        1*2      2*2
Adaptable/modular interface    0-no, 1-some modules, 2-yes             1       0      0          0        0
Web-based version              0-only desktop, 1-only web, 2-both      1       0      2*1        1        1
Total Adaptability                                                             8      7          4        9

Interoperability
Share TM                       0-no, 2-yes                             1       2*1    2*1        2*1      2*1
Number of TM formats           0-0, 1-1 to 2, 2->3                     3       1*3    1*3        1*3      2*3
Number of document formats     0-<40, 1-41 to 50, 2->50                3       1*3    2*3        2*3      2*3
Total Interoperability                                                         8      11         11       14

Total score                                                                    58     46         42       63
