
D:\SimilisTeam\Extracting_terminology_with_Similis4.doc

1/6

Release 13 02 2011 12:36

Extracting terminology from Translation Memory with Similis, step by step

Similis is a Translation Memory (TM) program of French origin, supporting English, German, French, Italian, Spanish, Portuguese and Dutch. It includes a linguistic analysis engine that breaks segments down into chunks and generates corresponding Term Bases (TB) or glossaries. As Similis failed to succeed commercially, it is now offered free of charge (for how long?), which is a good opportunity to experiment with its terminology extraction potential.

Being a layman in linguistics, I would define a chunk as a unit of words belonging together, like article + noun + adjective (nominal group) or subject + verb + object (verbal group): fragments of a sentence, if you prefer. Cutting long sentences into smaller groups should increase the probability of getting matches between your source text and the glossary.

Installation

Download the SimilisFreelance-2.16.04-Setup.exe file (160 MB) from http://similis.org/linguaetmachina.www/index.php?afficher=7&info=Downloads__Purchase

From experience, the following hints seem useful:
- Under Vista or Windows 7, install as administrator.
- Temporarily disable antivirus software during the installation.
- Bypass the suggested folder "C:\Program Files\Lingua et Machina" and set something like C:\Sim2 (my choice).
- Setup lets you choose where to store the created data; this is not relevant for our test.

Start Similis. You will be prompted to retrieve a free license key. This is an automated process and you will get the license key by e-mail (for how long?).

The program takes 1.6 GB of disk space. I reduced it to 0.9 GB by deleting the unused language combinations in C:\Sim2\xelda\lingdata\DictionaryLookup and keeping only English, German and French.

TM for test

Using a "foreign" TM is important to identify the newness and relevance of the extracted TB entries. I received a German/French TM of Wordfast origin, Hippo.tmx, with some 2970 entries.
Checking the TM

I first opened the TM in Olifant and cleaned it of bullets and some duplicates; 2927 entries remained. Similis is not affected by unformatted tags like {1}. AFAIK, it also ignores other formatting strings for bold, italics and the like.

As Similis is not entirely Unicode compliant, one should check with Olifant whether the TM contains any specifically German quotes and suppress them:
- Lower opening double quote („), Alt+0132
- Upper closing double quote (“), Alt+0147
- Upper opening double quote (”), Alt+0148

Otherwise you may get messy chunks containing &ldquo, &rsquo and the like.
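For those who prefer a script to a manual check, the quote clean-up described above could be sketched in Python as follows. This is only an illustration and not something Olifant or Similis provides: the function names, the output file name and the UTF-8 encoding are my assumptions (a TMX file may well be stored as UTF-16, so check the file first). The three characters are the ones reached with Alt+0132, Alt+0147 and Alt+0148.

```python
from pathlib import Path

# Typographic double quotes named in the text (Windows Alt codes 0132, 0147, 0148).
GERMAN_QUOTES = {
    "\u201e": "Alt+0132, lower opening double quote",
    "\u201c": "Alt+0147, upper closing double quote",
    "\u201d": "Alt+0148, upper opening double quote",
}


def count_quotes(text: str) -> dict:
    """Return how often each typographic quote occurs in the text."""
    return {ch: text.count(ch) for ch in GERMAN_QUOTES}


def suppress_quotes(text: str, replacement: str = "") -> str:
    """Remove the typographic quotes, or swap them for `replacement`."""
    return text.translate({ord(ch): replacement for ch in GERMAN_QUOTES})


def clean_tmx(src: str, dst: str, encoding: str = "utf-8") -> dict:
    """Write a cleaned copy of `src` to `dst`; return the counts that were found.

    The encoding is an assumption; adjust it to match the actual file.
    """
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(suppress_quotes(text), encoding=encoding)
    return count_quotes(text)


# Hypothetical usage (the output name Hippo_clean.tmx is made up):
# counts = clean_tmx("Hippo.tmx", "Hippo_clean.tmx")
```

The sketch removes the quotes by default, in line with "suppress them" above. Passing a straight quote as `replacement` keeps a visible quote mark instead.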


Start Similis. In the dialog box root@Local Host, click OK. Similis Manager opens.

File menu, New, Project

I create a new project, Hippo_Import, and tick "Extract terminology from alignment tasks".

Next (dialog box mixes English and French because of my French Windows 7 version).


Leave "Create a memory for this project" ticked. Finish. An auto-generated Hippo_Import_014 memory appears in the list. 014 comes from a project counter.

I don't create a glossary. Right click on Hippo_Import memory, Import.


Import settings:
- Import file format: Trados TMX translation memory
- Insert the path of the tmx file
- Tick "Terminology extraction"
- Merge mode: Merge new matches with old
- Do not tick "Each entry is a complete term"

Finish.

The memory is read; progress is very slow (a percentage is indicated). When finished:


A new auto-generated glossary Hippo appears in the list of glossaries.

Right-click Hippo to export it from Similis to the desktop (default, easy to locate) as TSV (tab separated values).


Inspecting the extracted chunks

I opened and saved Hippo.tsv in Excel. The file is identified as being of Windows/ANSI origin, with 2579 entries and 7 columns: source language, source chunk, NOUN (many) or VERB (a few), target language, target chunk, NOUN or VERB. I keep only the source chunk and the target chunk.

Chunks are generally correctly extracted. I would have liked more single terms, so I tried to transform Hippo.tsv into a tmx TM and process it a second time, to no avail.

Statistics

I filtered 200 pairs out of the 2579 (every 4th entry in groups of 13 entries).
- 5 pairs (2.5%) are wrong or incomplete to an extent making them useless or misleading.
- 33 pairs (16.5%) are somewhat incomplete: elements are missing or foreign elements are lumped into the target chunk.

See the examples in the table below (missing words added in parentheses, redundant words crossed out):

Source chunk (German)                                  Target chunk (French)
Abrasionsschäden                                       dégâts (par abrasion)
aktuellen Entwicklungen der Elektro- und Leittechnik   autre part des derniers développements (en matière d'électrotechnique et de commande)
automatisches Luftventil                               (soupape à) air automatique
damit verbundenen hohen Personalaufwand                frais élevés de personnel (qui en découlent)
daraus resultierende Änderungen                        éventuelles modifications (qui en résulteraient)
druckbeaufschlagter Betriebsdichtung                   joint de service (sous pression)
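The column clean-up and the sampling described in this section were done by hand in Excel. The same steps could be sketched in Python as below. Several things here are assumptions of mine, not facts about the Similis export: the column positions (source chunk in the 2nd column, target chunk in the 5th), the Windows-1252 ("ANSI") encoding, the absence of a header row, and my reading of the sampling rule as "take the 4th entry of each block of 13". All function names are made up for the illustration.

```python
import csv


def read_pairs(path, src_col=1, tgt_col=4, encoding="cp1252"):
    """Read the exported TSV and keep only (source chunk, target chunk).

    Column indices are zero-based and are assumptions; check them
    against the actual export before relying on the result.
    """
    needed = max(src_col, tgt_col)
    with open(path, newline="", encoding=encoding) as fh:
        rows = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Skip rows that are too short to hold both columns.
        return [(r[src_col], r[tgt_col]) for r in rows if len(r) > needed]


def sample_every_group(pairs, group=13, position=4):
    """Pick the position-th entry (1-based) of every block of `group` entries."""
    return [p for i, p in enumerate(pairs) if i % group == position - 1]


def share(count, total):
    """Percentage of `count` in `total`, rounded to one decimal."""
    return round(100 * count / total, 1)


# Hypothetical usage:
# pairs = read_pairs("Hippo.tsv")
# sample = sample_every_group(pairs)
# share(5, 200)   # 2.5, pairs that are useless or misleading
# share(33, 200)  # 16.5, pairs that are somewhat incomplete
```

Judging each sampled pair as correct, incomplete or useless remains manual work; the script only prepares the sample and recomputes the percentages.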

Comparison with X

A fully functional test copy was kindly supplied by the publisher. I imported the same TM, Hippo.tmx. The program asks for many parameters (in stark contrast with Similis, which asks for nothing!). I vary only the minimum and maximum number of source terms and experiment with 3/8, 2/8, 1/8 and 1/3. I don't claim to have mastered the program! The possibility of limiting the number of source terms sounds promising for extracting "root" terms instead of long strings (which are presumably less likely to be encountered).

X extracts fewer strings than Similis. The combination of 1 minimum and 3 maximum source terms extracts 973 strings, of which I filter 196 pairs.
- 58 pairs (29.6%) are wrong or incomplete to an extent making them useless or misleading.
- 41 pairs (20.9%) are somewhat incomplete.

Usability of extracted chunks

I have a TB made of chunks gathered from another TM. Activating it in memoQ (the only TEnT I am presently using) has taught me that incomplete pairs are more trouble than help and that they HAVE TO be deleted. Be prepared to invest time and effort in this, or decide to give up.

I wish you happy experimenting! Feedback and comments are welcome.

Jean-Marc Tapernoux
translate(at)techni-tra.com / www.techni-tra.com
