
Abbreviation detection in Swedish Medical Records

The Development of SCAN, a Swedish Clinical Abbreviation Normalizer

Niklas Isenius

Department of Computer and Systems Sciences
One year master degree project, 15 HE credits
Computer and Systems Sciences
Degree project at the master level
Spring term 2012
Supervisor: Sumithra Velupillai
Co-Supervisor: Maria Kvist
Reviewer: Hercules Dalianis
Swedish title: Förkortningsdetektion i svensk klinisk text
Abstract
Swedish clinical records are filled with numerous abbreviations that are difficult to understand. This problem spans from patients reading their own records to medical personnel reading texts from a clinical domain other than the one they are used to. Computer algorithms processing clinical texts also suffer from this problem, since abbreviations can hide information valuable to the algorithm's task.
In this report, a first step is taken towards solving this problem by developing an algorithm that detects abbreviations in Swedish clinical texts and suggests their possible full-length translations. A combination of design science and empirical research was used to approach the problem.
The development process started with a review of existing algorithms in similar domains, from which key features were taken and used as the foundation for a rule-based algorithm. Some additions were also made to mitigate problems experienced by previous researchers, as well as some new problems predicted to arise from the specific domain of Swedish clinical texts. The developed algorithm was tested against assessment sections from 300 medical records from an emergency department of the Karolinska University Hospital in Stockholm, Sweden. In these tests, the best performing version of the algorithm achieved an F-measure of 79%, with 76% recall and 81% precision. These results were compared against a baseline as well as the results of existing similar algorithms. The algorithm was concluded to be successful in its task, performing better than the baseline tests and close to the similar algorithms, though questions were raised regarding the validity of these comparisons due to differing conditions. Problems and flaws of the algorithm were analyzed and discussed, and it was concluded that many of the problems were due to the use of word lists.
Keywords
Clinical Text, Medical Records, Abbreviation detection, Text Normalization

Svensk sammanfattning
Svenska kliniska texter är fyllda med en rad olika förkortningar, vilka kan vara svåra att förstå. Detta
problem kan sträcka sig från patienter som vill läsa sina egna journaler till personal från ett annat
medicinskt område. Även datoralgoritmer lider av problemet med förkortningar då dessa kan dölja
värdefull information som algoritmen behöver tillgång till.
I denna rapport tas ett första steg mot att lösa detta problem genom att utveckla en algoritm som
identifierar förkortningar i svensk klinisk text och tillhandahåller förslag till deras möjliga
fullängdsversioner. En kombination av designvetenskap och empirisk forskning användes för att
närma sig problemet.
Utvecklingsprocessen inleddes med att granska tidigare forskning inom liknande områden. Från dessa
valdes några nyckelegenskaper till att användas i en regelbaserad algoritm. Vissa tillägg gjordes även
till algoritmen för att adressera problem som nämnts i de tidigare studierna, samt även för problem
som förutspåddes att tillkomma vid användandet av svenska kliniska texter. Den utvecklade
algoritmen testades sedan mot bedömningsfält från 300 olika patientjournaler från akutmottagningen
på Karolinska Universitetssjukhuset i Stockholm, Sverige. I dessa tester uppnådde algoritmen ett F-
värde på 79%, med 76% täckning och 81% precision. Dessa resultat jämfördes sedan mot
standardvärden samt resultat från redan existerande liknande algoritmer. Algoritmen visade sig vara
framgångsrik i sin uppgift, med en prestanda högre än standardvärdena och nära de redan existerande
liknande algoritmerna. Frågor ställdes dock angående validiteten av dessa jämförelser på grund av
olika förutsättningar. Algoritmens brister och problem analyserades och diskuterades, varav många
fastställdes bero på användandet av ordlistor.

Keywords
Klinisk Text, Patientjournaler, Förkortningsdetektion, Textnormalisering
Acknowledgements
The author wishes to send his deepest thanks to his supervisors, Sumithra Velupillai and Mia Kvist, for their amazing support and guidance during this project. Many thanks also go out to my girlfriend and family members for bouncing ideas, not to mention helping with the proofreading.
Table of Contents
1. Introduction
1.1 Background
1.1.1 Health Care Analytics and Modeling
1.2 Research Problem
1.3 Research Question
1.3.1 Delimitations
1.4 Expected Results
1.5 Research Approach
1.6 Outline
2. Terminology
2.1 Abbreviations and Acronyms
2.2 Clinical and Biomedical texts
2.3 Precision and Recall
2.4 Pattern Matching and Machine Learning
3. Related Research
3.1 Taghva and Gilbreth (1999)
3.1.1 Method
3.1.2 Results
3.2 Yeates (1999)
3.2.1 Method
3.2.2 Results
3.3 Park and Byrd (2001)
3.3.1 Method
3.3.2 Results
3.4 Dannélls (2003)
3.4.1 Method
3.4.2 Results
3.5 Larkey et al. (2004)
3.5.1 Method
3.5.2 Results
3.6 Xu et al. (2007)
3.6.1 Method
3.6.2 Results
4. Choice of method
4.1 Choice of algorithm
4.1.1 Constraints
4.1.2 Taghva and Gilbreth (1999)
4.1.3 Yeates (1999)
4.1.4 Park and Byrd (2001)
4.1.5 Dannélls (2003)
4.1.6 Larkey et al. (2004)
4.1.7 Xu et al. (2007)
4.1.8 Alternative methods
4.1.9 Selected method
4.1.10 Summary
4.1.11 Potential Changes
4.2 Evaluation
4.3 Ethical considerations
5. Application of Method
5.1 Algorithm Development Process
5.1.1 Algorithm Architecture
5.2 Algorithm Evaluation Process
5.2.1 External components
5.2.2 Data set
5.2.3 Evaluation
6. Results
6.1 Baseline Results
6.2 Original Algorithm
6.3 Improved Algorithm
6.4 Summarized results
7. Analysis
7.1 Algorithm versions
7.2 The maximum length variable
7.3 Further Analysis
7.4 Algorithm Performance
8. Discussion
8.1 Algorithm Performance
8.2 Algorithm Problems
8.3 Conclusion
8.3.1 Research Question, Expected Results and Reliability
8.3.2 Originality and Research Contribution
8.3.3 Ethical and Societal Consequences
8.4 Future Work
8.4.1 Improvements
8.4.2 Extensions
References
9. Appendix A
9.1 Correct Tokens - Top 15
9.2 Incorrect Tokens - Top 15
9.3 Missed Tokens - Top 15
9.4 Annotated Tokens - Top 15

List of Figures
Figure 1. The intended workflow of the algorithm
Figure 2. The SCAN-Application architecture
Figure 3. Diagram for the handler class
Figure 4. Diagram for the Token class
Figure 5. The algorithm process of the SCAN-application
Figure 6. SCAN output and the XML format of a detected abbreviation

List of Tables
Table 1. Baseline test results. MaxLength denotes the allowed maximum token length
Table 2. The results of the original algorithm with the regular word list and a maximum token length of 3 to 8
Table 3. The results of the original algorithm with the improved word list and a maximum token length of 3 to 8
Table 4. The results of the improved algorithm with the regular word list and a maximum token length of 3 to 8
Table 5. The results of the improved algorithm with the improved word list and a maximum token length of 3 to 8
Table 6. The results of the top 6 versions of the algorithm that performed the best
Table 7. The evaluation results of the versions of the algorithm using a maximum word length of six
1. Introduction
1.1 Background
Patient medical records have always played a key role in the modern health care system. As the technologies of the modern age advance, it is only natural that we should try to find better and smarter ways to utilize these records. One way of doing this is to improve the availability of the information stored within them. This could mean any number of things, from a patient being able to read their own record to relevant information being identified by computer algorithms and stored in statistical databases. The problem is that the structure of medical records makes them hard to process, for humans and computers alike: most of the information is stored in a free-text format, divided into a few simple categories. Beyond the unstructured nature of medical records, an even bigger problem makes them hard to process, namely the unique and complicated way in which they are written.
When the content of a medical record is written, it is often done under a certain time constraint, which tends to shape the content accordingly. One example is that clinical texts have been shown to contain spelling errors at a frequency of around 10%, which is significantly more than in ordinary texts (Ruch et al., 2003). Another consequence of the time pressure is that clinical texts tend to contain a large number of abbreviations, sometimes to such an extent that the text starts to look more like telegraphic shorthand. A recent study by Skeppstedt et al. (2012) shows that in Swedish clinical texts, 14% of all disorders mentioned are named in an abbreviated form. The use of abbreviations in clinical texts has been shown to be one of the reasons why patients have difficulties understanding their own medical records (Pyper et al., 2004). To complicate things even further, the abbreviations used tend to be ad hoc and very localized to a specific domain, e.g. specialist clinics. As a result, even a person with clinical training might be unable to interpret some abbreviations, because they are only used in a small subset of the medical domain. As an illustration of the problem, a Swedish clinical text containing abbreviations could read like the following:
"ant STEMI. direkt till angio. PCI mot ockluderad proximal LAD med 2 stent. trombektomy. bra resultat. komplikationsfritt. får reoproinf 12 h"
Understandably, abbreviations do not only complicate things for humans but for computers as well, since abbreviations hide information that might be interesting to certain algorithms. In Natural Language Processing (NLP), successfully interpreting abbreviations is a problem that has been researched extensively. A first step towards solving it is building dictionaries which list abbreviations and their definitions. Building such dictionaries manually, however, would be an extremely time-consuming task, especially since it is an ever-expanding area. This is why much effort has been put into automating this process (Taghva and Gilbreth, 1999; Yeates, 1999; Larkey et al., 2000; Park and Byrd, 2001; Adar, 2002; Nadeau and Turney, 2005). Many researchers have also focused on automatic dictionary generation specific to the biomedical domain (Schwartz and Hearst, 2003; Dannélls, 2006).

Generating abbreviation dictionaries only solves half of the problem of abbreviation normalization, though. A problem that still remains is that abbreviations can be ambiguous, i.e. one abbreviation can translate into multiple definitions. In clinical text this is especially problematic: it has been shown that 33% of the abbreviations used have a chance of being highly ambiguous (Liu et al., 2001). This disambiguation problem of abbreviations in clinical and biomedical texts has been approached by several researchers with promising results (Pakhomov, 2002; Gaudan et al., 2005).

1.1.1 Health Care Analytics and Modeling

Health Care Analytics and Modeling is a research group at Stockholm University's Department of Computer and Systems Sciences1. One of the goals of the group is to automatically extract information from patient records, such as symptoms, diagnoses, treatments, age, gender and social situation, in the hope of finding previously unknown connections between these. However, this is not an easy endeavor. Many steps have to be taken before such information extraction can be successful, one of them being to pre-process a record and convert its text into a normalized form.

1.2 Research Problem

Swedish clinical records are filled with abbreviations that are difficult to understand (Pyper et al., 2004). Computer algorithms processing clinical texts also suffer from this problem, since abbreviations can hide information valuable to the algorithm's task.
A further dimension to the problem is that many abbreviations can translate into multiple definitions (Liu et al., 2001), so to successfully translate an abbreviation into its full-length counterpart, the translation has to be contextually aware in order to pick the correct definition.
There are existing algorithms that, to various extents, try to solve these problems (Pakhomov, 2002; Gaudan et al., 2005). However, none are known to have been specifically developed and tested on Swedish clinical texts.

1.3 Research Question


The research question to be answered in this project will be: Can an algorithm be developed that
detects abbreviations in Swedish clinical texts, using features from existing similar algorithms?
Further, will the developed algorithm perform at the same level as these similar algorithms and what
changes can be made to the algorithm in order for it to perform better?

1.3.1 Delimitations
In this project, the algorithm will be limited to detecting abbreviations in a clinical text and suggesting possible full-length translations, i.e. the algorithm will neither replace the abbreviations with their full-length counterparts nor handle the disambiguation problem of abbreviation normalization.
A further limitation of the project is to only test the algorithm on patient records specific to emergency clinics. The reason for this is to get a more accurate result for a specific domain rather than a less accurate result for a general domain.

1
http://dsv.su.se/forskning/health/ [2012-04-03]

1.4 Expected Results
If features are gathered from algorithms that have proved successful on English clinical texts, the results on Swedish clinical texts should be somewhat successful as well. There are, however, some key differences between English and Swedish that might affect the results. One example is the frequent use of compounding in Swedish, which could make the task of abbreviation detection more difficult. If this turns out to be a problem, additions to the algorithm can hopefully be made to mitigate it. Other than this, there should not be any problems keeping the algorithm from reaching the same performance levels as existing similar algorithms.

1.5 Research Approach

Two viable approaches for research in information systems are Behavioral Science and Design Science (Hevner et al., 2004). Hevner et al. (2004) give the following definition of Design Science: "Design science addresses research through the building and evaluation of artifacts designed to meet the identified business need". Behavioral Science, on the other hand, is described by Hevner et al. (2004) as follows: "Behavioral science addresses research through the development and justification of theories that explain or predict phenomena related to the identified business need."
As the fundamental purpose of this project is to develop and evaluate an algorithm (an artifact) that solves an identified need in the health care sector (to find and translate abbreviations), a design science approach is the best method for this project.
Johannesson and Perjons (2012) have defined five main activities in Design Science: Explicate Problem, Outline Artifact and Define Requirements, Design and Develop Artifact, Demonstrate Artifact, and Evaluate Artifact. The first activity, to explicate the problem, has already been covered in this project with the background and problem statement, which leaves the remaining four to be carried out.
The activities "Outline Artifact and Define Requirements" and "Design and Develop Artifact" are covered in chapter 4. The demonstration of the algorithm is described in chapter 5, while the evaluation of the artifact is described in chapter 6.

1.6 Outline
The following outline gives a brief overview of this report and what each chapter contains:
Chapter 2: The more essential terms of this report are listed and briefly explained.
Chapter 3: Gives an in-depth look into the related research for this project. The method and results of each research piece are summarized along with a short introduction.
Chapter 4: The different method choices available for this project are analyzed and compared before one is finally chosen.
Chapter 5: The application of the chosen method is described, along with how the evaluation of the final algorithm was carried out.
Chapter 6: The algorithm is evaluated and the results are presented and compared. Improvements to the algorithm are motivated and implemented. The improved version is also evaluated and then compared to the original version.

Chapter 7: The results of the algorithm are analyzed and compared to the baseline results and similar existing algorithms.
Chapter 8: The results from the analysis are discussed, and conclusions about the project and its results are given, along with some suggestions for future work and improvements.

2. Terminology
In this section some of the more specific terms used in this report will be presented and defined in
order to avoid misunderstandings.

2.1 Abbreviations and Acronyms

The Oxford English Dictionary defines the word abbreviation as: "A shortened form of a word or phrase."2, e.g. Prof. (Professor) or Dr. (Doctor). Abbreviations can be divided into different subsets, one of them being acronyms. Acronyms are defined by The Oxford English Dictionary as: "An abbreviation formed from the initial letters of other words and pronounced as a word."3, e.g. AIDS (Acquired Immune Deficiency Syndrome) or Scuba (Self-Contained Underwater Breathing Apparatus). There are, however, disagreements about what actually counts as an acronym. Abbreviations such as ICU (Intensive Care Unit), where each letter is pronounced separately, are often considered acronyms even though they should be considered members of the abbreviation subset initialisms. In this report, the term abbreviation will be used to mean any shortened form of a word or a phrase, i.e. acronyms and initialisms will not be differentiated. The motivation behind this is that no distinction between the two terms is made in any of the referenced research articles, so making one in this report would only complicate matters.

2.2 Clinical and Biomedical texts

Clinical and biomedical texts are two terms that are easily mixed up, since neither is an established term with a clear definition. In this report, the two terms will be used with the definitions given by Meystre et al. (2008). They define biomedical texts as: "...the kind of text that appears in books, articles, literature abstracts, posters, and so forth.", i.e. ordinary text that one can read in any literature in the biomedical domain. In terms of abbreviations and their usage in biomedical text, every abbreviation used is almost always accompanied by its definition at least once.
Clinical texts are a bit different. Meystre et al. (2008) define them as: "...texts written by clinicians in the clinical setting. These texts describe patients, their pathologies, their personal, social, and medical histories, findings made during interviews or during procedures, and so forth.". By this definition, it is clear that clinical texts are a lot less formal than biomedical texts and written under harder time constraints. This of course affects the language they are written in, and especially the use of abbreviations. In addition to being used very frequently, many abbreviations are also ad hoc, making them hard to understand for anyone but the author, since no formal definitions accompany them in the text.

2
http://oxforddictionaries.com/definition/abbreviation?q=abbreviation [2012-03-09]
3
http://oxforddictionaries.com/definition/acronym?q=acronym [2012-03-09]

2.3 Precision and Recall
Precision and recall are two terms used to measure the performance of algorithms in pattern recognition and information retrieval. One definition of the two terms is given by van Rijsbergen (1979), who defines recall as: "...the proportion of relevant material actually retrieved in answer to a search request." and precision as: "...the proportion of retrieved material that is actually relevant.".
Precision and recall are often presented as percentages. Recall is calculated by dividing the number of correctly found elements by the total number of sought elements in the data set. Precision is calculated by dividing the number of correctly found elements by the total number of found elements.
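As a sketch of these definitions (the function and variable names are illustrative, not taken from this report), precision, recall and the balanced F-measure used later in the evaluation can be computed as:

```python
def precision_recall_f1(found_correct, found_total, sought_total):
    """Compute precision, recall and the balanced F-measure.

    found_correct -- number of found elements that are actually relevant
    found_total   -- total number of elements the algorithm found
    sought_total  -- number of relevant elements in the data set
    """
    precision = found_correct / found_total
    recall = found_correct / sought_total
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with precision 0.81 and recall 0.76, as reported for the best performing version of the algorithm, the harmonic mean works out to roughly 0.78, close to the rounded F-measure of 79% given in the results.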

2.4 Pattern Matching and Machine Learning

Pattern matching and machine learning are two methods for finding information in unstructured media, and both are often used in the Natural Language Processing field. Since both methods are viable approaches for answering the research question, a brief description of the two will be given along with their strengths and weaknesses. The descriptions are based on how the methods are used in this report and thus might not be fully coherent with the more formal definitions that exist.
In pattern matching, the process of localizing the desired information is based upon a set of predefined rules that match the structure of the sought information (Meystre et al., 2008). Because of these rules, pattern matching methods are also known as rule-based methods. The rule set can be either static or dynamic, depending on whether new rules are allowed to be added. The rules are often generated by manual revision of the domain in which the information resides. This is one of the weaknesses of pattern matching: the rule sets are generated for a specific domain and thus are less flexible when applied to a more general domain. On the other hand, the strength of pattern matching methods is that they are relatively easy to implement while still being effective.
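As an illustration of the rule-based idea (these are toy rules, not the actual SCAN rule set described later in this report), a minimal pattern-matching detector might flag short tokens that end with a period, or tokens consisting only of consonants:

```python
import re

# Two illustrative rules, loosely inspired by heuristics in the reviewed
# literature. Note that y is treated as a vowel, as in Swedish.
DOTTED = re.compile(r"^[A-Za-zÅÄÖåäö]{1,4}\.$")  # e.g. "pat.", "beh."
CONSONANTS = re.compile(r"^[bcdfghjklmnpqrstvwxz]{2,4}$", re.IGNORECASE)  # e.g. "mkt"


def find_abbreviation_candidates(text):
    """Return whitespace-separated tokens matching either rule, in order."""
    return [t for t in text.split()
            if DOTTED.match(t) or CONSONANTS.match(t)]
```

For example, `find_abbreviation_candidates("pat. mår mkt bättre")` returns `["pat.", "mkt"]`. The fixed regular expressions are what makes this a static rule set: adapting it to another domain means rewriting the rules by hand.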
Machine learning methods differ from pattern matching methods in that they are able to alter their decision making depending on the input they receive. In order to function correctly from the start, machine learning algorithms need a large set of pre-annotated training data to process (Meystre et al., 2008). After that, in contrast to the strict rule sets of pattern matching methods, machine learning methods can alter their decision making as they gain more experience of the specific domain they are processing. For example, if a machine learning algorithm processes text in search of specific terms (much like the intended algorithm in this report), it might start out with a very general set of rules, but as more data is processed, these rules become more and more specific. Machine learning approaches tend to perform better than pattern-based ones, since they do not depend on human intuition to define their decision making, while also being more adaptable (the only thing needed to apply them to a new domain is a set of domain-specific training data) (Meystre et al., 2008). This can also be considered a disadvantage, since a sufficient amount of pre-annotated training data often takes a great amount of time to generate. In some cases, such training data might not be available at all, rendering a machine learning algorithm useless.
3. Related Research
In this section the related research for this project will be summarized. Each described research piece
will begin with a short introduction followed by a deeper description of its method and results. Since
this project focuses on abbreviation detection, the research described will only be reviewed in terms of
what is relevant for this project. For example, if a research piece is about abbreviation definition
extraction, the focus will be on the part of the method that identifies and extracts the acronyms; the
methods for finding the abbreviation definitions will be dismissed since they are irrelevant to this project.
It should be stated that there are several research articles that cover abbreviation identification that are
not mentioned here. Examples of such are the ones by Adar (2002) and Schwartz and Hearst (2003).
Both of them are based on the assumption that abbreviations are positioned inside or adjacent to
parentheses. This might be valid in biomedical texts, but in clinical texts, which are the domain of this
project, abbreviations are almost never used in such a format.

3.1 Taghva and Gilbreth (1999)


The first piece of research to be reviewed is Recognizing Acronyms and their Definitions by
Taghva and Gilbreth in 1999. The motivation behind the research was that they at the time were
developing a post processing system for output from optical character recognition. In order to improve
this post processing they had to come up with a way to recognize and document new acronyms and
their definitions.

3.1.1 Method
The method for finding acronym candidates described in the article is a simple one. Any word that
consists of capital letters and is between three and ten letters long is considered an acronym, except if
it is in a list of rejected words supplied to the algorithm, e.g. TABLE or FIGURE. The reason for
choosing the given word length limitation was that the authors considered it the best compromise
between recall and precision. Accepting words with only two letters might increase the recall, but it
would also worsen the precision since a lot of incorrect acronym candidates would be accepted. The
chosen upper limit of ten characters was motivated by the fact that there are very few acronyms that
are longer than ten letters.
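Under the assumption that "word" means an alphabetic token, Taghva and Gilbreth's candidate test can be sketched as follows; the default reject list only contains the two examples quoted above:

```python
def is_acronym_candidate(word, reject_words=("TABLE", "FIGURE")):
    """Sketch of Taghva and Gilbreth's candidate test as described
    above: the word consists solely of capital letters, is three to
    ten letters long, and is not in the supplied list of rejected
    words. The reject list here holds only the examples from the text."""
    return (word.isalpha() and word.isupper()
            and 3 <= len(word) <= 10
            and word not in reject_words)

print(is_acronym_candidate("OCR"), is_acronym_candidate("TABLE"))
```

Note how "US" would be discarded by the lower length limit, which is exactly the recall/precision trade-off the authors describe.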
Taghva and Gilbreth's method is described as only identifying acronyms, but there are no conditions in
the part of the method described here that would stop it from identifying other types of abbreviations
as well. In the full version of the method, i.e. the one that includes the identification of definitions,
there are, however, conditions that filter out types of abbreviations other than acronyms. Since that
part has not been included here, one can consider Taghva and Gilbreth's method an abbreviation
finding algorithm.

3.1.2 Results
Presenting the results of the method used by Taghva and Gilbreth in terms of finding acronyms is
difficult since the results given in their report are based upon finding both acronyms and their
definitions. With their measurements they reach a recall of 86% and a precision of 98% as their
algorithm was tested on a set consisting of government studies. A negative feature of the method,
though, as stated earlier, is that it does not recognize acronyms of two letters or fewer. The authors
re-evaluated the algorithm on test data where two letter acronyms had been excluded, which increased
the recall to 93%.

3.2 Yeates (1999)


Another research report on automated acronym definition extraction that also was released in 1999
was Automatic Extraction of Acronyms from Text by Stuart Yeates. The report describes the
development of an algorithm that generates acronym-definition pairs from digital libraries. The
motivation behind the development of the algorithm was to enable existing digital library tools, such as
key word recognition software, to operate more smoothly with the aid of acronym dictionaries.

3.2.1 Method
The method used by Yeates to find acronym candidates is closely linked with the process of finding
their corresponding definitions. Due to this, there is no part of the method that is used exclusively for
identifying acronyms in a text. Instead, the method used for acronym definition extraction will have to
be examined as a whole in order to determine whether there are any useful parts that can be applied to
the domain of our project.
The method described by Yeates is divided into two steps. The first step is to divide the text to be
processed into chunks, where the chunk boundaries are determined by left and right parentheses and
punctuation. Every word is then compared with the chunks before and after itself. If the word turns
out to match one of the chunks, i.e. if the letters of the word match the leading letters of the words in
the chunk, then the acronym-definition pair is sent on to the second step of the algorithm.
The second step consists of a set of heuristics that are loosely based upon Yeates' definition of an
acronym. If the acronym-definition pair given from the first step fails to match these heuristics, then
the pair is discarded. The properties of acronyms that the heuristics are based upon are the following:
 Acronyms are shorter than their definitions
 Acronyms contain initials of most of the words in their definitions
 Acronyms are given in upper case
 Shorter acronyms tend to have longer words in their definition
 Longer acronyms tend to have more stop words

As Yeates explains it, the first step of the algorithm is quite forgiving in what counts as an acronym-
definition pair, thus placing a large responsibility on the heuristics to sort out the false pairs.
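A simplified sketch of the first step is given below. Note the assumption made here that every letter of the candidate must match a word initial; Yeates' own definition is more permissive (initials of "most" of the words), leaving the final decision to the heuristics:

```python
def matches_chunk(candidate, chunk):
    """Test whether the candidate's letters match the leading letters
    of the words in an adjacent chunk. This is a simplification of
    Yeates' first step: here every letter must match a word initial,
    whereas the original tolerates partial matches and defers the
    final decision to the heuristics."""
    words = chunk.split()
    if len(candidate) != len(words):
        return False
    return all(w[0].lower() == c.lower() for c, w in zip(candidate, words))

print(matches_chunk("ICU", "intensive care unit"))
```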

3.2.2 Results
As in the research by Taghva and Gilbreth (1999), Yeates' results are presented in terms of how many
acronym-definition pairs were found by the algorithm. The results, which were generated
from a sample of ten computer science technical reports, show a recall rate of 91% and a precision rate
of 68%. Some suggestions of improvements to the algorithm are given by the author. These are
however not uniquely bound to the identification of acronyms and are thus not relevant to this project.
3.3 Park and Byrd (2001)
The motivation behind the report Hybrid Text Mining for Finding Abbreviations and their definitions
was that the authors Park and Byrd were having problems with abbreviations hiding important
keywords from their information extraction algorithms. To alleviate this problem Park and Byrd
started to work on an algorithm that could automatically extract abbreviations and their definitions into
a dictionary, which could then be used to translate abbreviations into their full length format. Unlike
the previous work of Taghva and Gilbreth (1999) and Yeates (1999), Park and Byrd decided on a more
refined method of hybrid text mining as a foundation for their algorithm. Another difference is that
Park and Byrd tried to find abbreviation definitions and not just acronym definitions, which is a far
more difficult task.

3.3.1 Method
Park and Byrd's method for finding abbreviation candidates is divided into two steps.
First a candidate has to satisfy the following three conditions:
 Its first character is alphabetic or numeric
 Its length is between 2 and 10 characters long
 It contains at least one capital letter

If all these conditions are satisfied, the candidate must also meet the following three restrictions in
order to be recognized as an abbreviation.
 It is not a known (dictionary) word containing an initial capital letter and appearing as the first word
in a sentence.
 It is not a member of a predefined set of person and location names.
 It is not a member of user-defined list of stop words.

The restrictions are all in place in order to strengthen the precision of the algorithm by sorting out false
positives, since the conditions are quite forgiving in what words they accept. To clarify, the user-defined
list of stop words contains all the words that the user wishes to have filtered out and that are not
covered by the first two restrictions.

3.3.2 Results
Park and Byrd present their results in terms of finding abbreviations and their definitions, i.e. not just
abbreviation detection. The algorithm was tested on three different sets, one was a book on automotive
engineering, one was a technical book from a pharmaceutical company, and the last set consisted of a
collection of NASA press releases. The results show a minimum performance of 93.9% recall and
97% precision. The authors discuss the reasons for the missed abbreviations, which are all connected to
matching the abbreviations to their definitions, i.e. none of the missed abbreviations are due to the
algorithm failing to identify a word as an abbreviation. This leads us to believe that the algorithm
performs even better than the test results show in terms of finding abbreviations in a text.
Consideration must, however, be given to the fact that we have no information about the different types of
abbreviations processed by the algorithm. Since the structure of clinical abbreviations might differ
from the ones processed in these tests, the outcome might be very different if this method were
implemented for clinical texts.
3.4 Dannélls (2003)
In Acronym Recognition (2003), Dana Dannélls describes the work of developing an algorithm for
detecting acronyms and their definitions in Swedish text, primarily in the biomedical domain, but she
also states that the algorithm should work in more general domains as well.
The reason why Dannélls chooses to focus on the Swedish language is the inherent difficulty of
matching acronyms with their definitions, a difficulty that you do not find in a language like English.
The problem lies in that when words are compounded in Swedish, like for example "intensiv"
(intensive), "vård" (care) and "avdelning" (unit), they are often compounded into a single word, in this
case "intensivvårdsavdelning", which has the acronym "IVA". It is obvious
why matching this acronym with its definition is a lot harder than matching its English counterpart,
intensive care unit (ICU).

3.4.1 Method
The method used by Dannélls to identify acronym candidates is a little bit different from previous
examples in the sense that she uses the aid of a part-of-speech (POS) tagger to pre-process the text.
With the additional parameters this brings, she lists the following conditions that have to be met by an
acronym candidate:
 The POS tag for the token is either N, Y or X (i.e. noun, abbreviation or foreign word). In case X, it
must consist of at least 2 upper-case letters.
 The token is not in the list of noise words, nor names. The list of noise words contains words such
as “by”, “cm” and “ml”. The list of names includes person names as “The-Hung Bui”, “Hans”.
 The token does not contain characters such as ’(’, ’)’, ’[’, ’]’, ’=’.
 The token must be between 2 and 14 characters long

As with previous examples that also were limited to acronyms, it is clear that these conditions also
allow abbreviations to be accepted as candidates. It is only in the method's original form, where
definitions are also taken into consideration, that general abbreviations are discarded.
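Dannélls' conditions operate on POS-tagged tokens, so a sketch has to assume the tag is supplied by the tagger. The noise word and name lists below contain only the examples quoted in her article:

```python
NOISE_WORDS = {"by", "cm", "ml"}      # only the examples given in the article
NAMES = {"The-Hung Bui", "Hans"}      # only the examples given in the article

def is_acronym_candidate(token, pos_tag):
    """A sketch of Dannélls' conditions on a POS-tagged token. The
    tags N, Y and X follow her description; the lists above are tiny
    stand-ins for the full resources she used."""
    if pos_tag not in ("N", "Y", "X"):
        return False
    if pos_tag == "X" and sum(c.isupper() for c in token) < 2:
        return False
    if token in NOISE_WORDS or token in NAMES:
        return False
    if any(c in "()[]=" for c in token):
        return False
    return 2 <= len(token) <= 14

print(is_acronym_candidate("IVA", "Y"))
```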

3.4.2 Results
In this report, the author has presented separate results for how well the algorithm performs in finding
acronyms. The algorithm was tested upon a set of Swedish biomedical texts, where a 98% recall and
94% precision is achieved. The algorithm successfully extracted 845 correct acronyms out of 898
possible. The author gives the following three explanations for why the remaining acronyms were
missed:
 The acronym consisted of more than one token, e.g. ’PUU N’ or ’P S A’.
 The acronym was removed due to its definition string, which might have included symbols and
letters such as ’@’, ’www’.
 Wrong interpretation by the POS tagger.

3.5 Larkey et al. (2004)


The next piece of related research reviewed in this section is Acrophile: An Automated Acronym
Extractor and Server by Larkey et al. (2004). In their article, the authors describe the development of a
web server for acronym and abbreviation lookup, where the underlying database is automatically
generated by browsing a large number of web pages. The web server is developed to handle general
acronyms, i.e. not domain specific acronyms.

3.5.1 Method
Larkey et al. propose three different approaches for finding acronyms in texts, one contextual, one
context/canonical and one simple canonical. The different approaches have some similarities, but also
some key differences in what they accept as a valid acronym. Their definitions are as follows:
Contextual
 All letters must be uppercase; they can be lowercase if they are at the end of the word and preceded
by three uppercase letters (e.g. COGSNet) or if the lowercase letters are in the middle of the word
with at least two preceding uppercase letters and at least one following uppercase letter (e.g.
AChemS)
 The acronym is allowed to contain punctuations and spaces if they follow every letter (e.g. U.S.A.)
 Can contain any number of digits anywhere

Context/Canonical
 A letter is allowed to be lowercase if it is preceded and followed with at least one uppercase letter
(e.g. DoD)
 The acronym is allowed to end with a lowercase letter if it is an 's' (e.g. USA's)
 The acronym is allowed to contain punctuations and spaces if they follow every letter (e.g. U.S.A.)
 Slashes and hyphens are allowed in the acronym
 Only one digit is allowed
 Must be between 2 and 9 characters

Simple Canonical
 A letter is allowed to be lowercase if it is preceded and followed with at least one uppercase letter
(e.g. DoD)
 Slashes and hyphens are allowed in the acronym
 The acronym may not contain digits, periods or spaces.
 Must be between 2 and 10 characters

These three methods for finding acronyms are later matched with different patterns to find acronym
definitions, but as that is of no interest for this project, it will not be described here.
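As an illustration, the simple canonical variant is the easiest of the three to express in code. The sketch below follows the bullet rules listed above; the helper name and examples are our own:

```python
def simple_canonical(token):
    """A sketch of the 'simple canonical' rules listed above: 2-10
    characters; no digits, periods or spaces; slashes and hyphens
    allowed; a lower-case letter only when it sits between two
    upper-case letters (e.g. DoD)."""
    if not 2 <= len(token) <= 10:
        return False
    for i, c in enumerate(token):
        if c.isupper() or c in "/-":
            continue
        if c.islower():
            inside = 0 < i < len(token) - 1
            if inside and token[i - 1].isupper() and token[i + 1].isupper():
                continue
        return False
    return True

print(simple_canonical("DoD"), simple_canonical("U.S.A."))
```

The contextual and context/canonical variants relax these rules (allowing periods, digits and a trailing 's), which explains their different recall/precision trade-offs.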

3.5.2 Results
As with many of the other studies, the results of each acronym finding method in Larkey et al. are
given together with their definition finding counterpart, thus making it hard to get a clear picture of
their effectiveness. However, a comparison of the results of the three methods should give a reasonably
good measurement of which of them is most effective. According to Larkey et al., the context/canonical
method performed best with a precision of 92% and 84% recall, with the runner up being the
contextual method with 96% precision and 60% recall. The methods are stated to have been developed
for a general domain. The tests, however, were done on military and governmental web pages only,
which might have made the results less general than intended.

3.6 Xu et al. (2007)


In A Study of Abbreviations in Clinical Notes, Xu et al. discuss clinical notes and the problems that
arise from the extensive use of abbreviations. In the report they also describe their attempts to develop
a model for building a clinical abbreviation database. The model is divided into two parts: one part
that identifies abbreviations in clinical notes and a second part that builds a sense inventory for these
abbreviations. As stated earlier, this project is limited to not handling the disambiguation of
abbreviations; therefore the second part of Xu et al.'s method will be ignored.

3.6.1 Method
Similar to Larkey et al. (2004), Xu et al. describe and test four different methods for abbreviation
detection. The first method is a simple one where each token in a text is tested against two
dictionaries. The first dictionary is an English word list (consisting of 110 573 words) and the second
one is a list of medical terms (consisting of 9721 words). If a token is not in any of these dictionaries it
is considered an unknown word and must therefore be an abbreviation.
The second method used by the authors is based on a set of rules that was devised by looking at
several admission notes. The following criteria were formed, and if a token meets one of them it is
considered an abbreviation.
 The word contains a special character such as "-" and "."
 The word contains less than 6 characters and contains one of the following: a) A mixture of
numeric and alphabetical characters. b) Capital letter(s), but not when only the first letter is
uppercase following a period. c) Lower case letters where the word is not in the English or medical
list.
The second method could be regarded as an extension of the first method, as it also uses the two
mentioned dictionaries, but with some added heuristic rules to make it a little more fine-grained.
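Under the assumption that the two word lists are available, the criteria of the second method can be sketched as follows; the two sets here are tiny stand-ins for the dictionaries Xu et al. used:

```python
ENGLISH_WORDS = {"patient", "history", "normal"}   # stand-in for the 110 573-word list
MEDICAL_TERMS = {"stenosis"}                       # stand-in for the 9 721-term list

def is_abbreviation(token, sentence_initial=False):
    """A sketch of Xu et al.'s rule-based criteria as described above.
    sentence_initial marks a token that follows a period; the word
    lists are illustrative stand-ins."""
    if any(c in "-." for c in token):
        return True
    if len(token) < 6:
        # a) mixture of numeric and alphabetic characters
        if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
            return True
        # b) capital letters, except a sentence-initial capitalized word
        if any(c.isupper() for c in token):
            only_initial_upper = token[0].isupper() and token[1:].islower()
            if not (only_initial_upper and sentence_initial):
                return True
        # c) lower-case word not found in either dictionary
        if token.islower() and token not in ENGLISH_WORDS | MEDICAL_TERMS:
            return True
    return False

print(is_abbreviation("B12"), is_abbreviation("hx"))
```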
The third and fourth methods are both based upon a decision tree classifier which needs to be trained
with pre-annotated data. The differences between the two methods are the features used by the
classifier. The third method uses the following features:
 Special characters such as “-“, and “.”
 Alphabetic/numeric characters and their combination
 Information about upper case and positions in the word
 Length of the word
 The document frequency of a word

The fourth method uses the same features as method number three, with the added feature if the word
is in the previously mentioned dictionaries, i.e. if the word is a known English word or a medical term.

3.6.2 Results
The four methods for abbreviation detection were tested against a set of admission notes from the New
York Presbyterian Hospital. The method with the highest scores in terms of precision and recall was
the fourth one, with 91.4% precision and 80.3% recall. In second place came the second method, with
85.4% precision and 83.9% recall. The error analysis performed by the authors showed that their
methods had the biggest problem with abbreviations that were divided into multiple tokens. An
example was "ex tol" (exercise tolerance), which was interpreted as two tokens when it should have
been interpreted as one.
4. Choice of method
The research question for the project will be approached by choosing an existing algorithm designed
to detect abbreviations in clinical texts. The development and evaluation of the algorithm poses
several problems that need to be discussed before any further decisions are made. In this section of the
report, the methods and algorithms available for abbreviation detection will be discussed and
compared before one is eventually selected for the project. Appropriate methods for evaluating the
algorithm will also be examined as well as the ethical considerations for the project.

4.1 Choice of algorithm


The key aspect of the algorithm is the identification of all possible abbreviations in a clinical text. To
make a well motivated choice of method that the algorithm will follow, the research described in the
previous chapter will be analyzed and compared in order to select a method with good performance
and traits that are suitable for this project. All the reviewed methods have presented their results in
terms of precision and recall. This will give us somewhat of a standardized metric, which should
simplify the process of comparing the different methods.
The methods described in chapter 3 consist of a mix of pattern matching methods and machine
learning methods. As these are reviewed in order to find a suitable method for this project, there are
some factors that have to be taken into consideration. First of all, the complexity of machine learning
algorithms makes their implementation a lot harder; hence the gain from using them has to substantially
outweigh the cost of implementing them. Second, machine learning algorithms require a certain
amount of domain specific data that has been manually annotated in order for the algorithm to function
properly. In this project, such specialized data will not be available, hence special conditions will have
to apply in order for a machine learning algorithm to be considered for this project.

4.1.1 Constraints
Before the different methods can be examined, the constraints under which they will be forced to
operate have to be taken into consideration. One such constraint is the one mentioned in the
previous section with pre annotated training data not being available. Another constraint that can be
gathered by a quick examination of a general Swedish clinical text, is that abbreviations seldom are
spelled with capital letters. Therefore, a potential method cannot rely on the assumption that an
abbreviation is spelled with only capital letters.

4.1.2 Taghva and Gilbreth (1999)


The algorithm described by Taghva and Gilbreth has its advantages in that it is very simplistic and
easy to implement. The downside is that it only recognizes acronyms with three letters or more, i.e.
abbreviations and/or acronyms with only two letters will be discarded. This might be acceptable in a
more general domain where one can sacrifice some recall in order to get higher precision, but in the
clinical domain, where both recall and precision are equally important, such a limitation on the
algorithm is not possible. Another factor that makes this algorithm an even less viable candidate is that
even though the algorithm might catch abbreviations that are not acronyms, it makes far too many
assumptions for it to work on all types of abbreviations. For example, one constraint for the methods
was that they cannot rely on abbreviations being spelled exclusively with capital letters, which makes
this method a lot less viable since it makes this specific assumption.

4.1.3 Yeates (1999)


In Yeates' algorithm, the method used for identifying abbreviations is closely intertwined with finding
their respective definition. The method he describes could have been useful if we were to process texts
where the definitions were likely to be included. But since one cannot trust clinical abbreviations to
have their definitions listed in the text, this method can instantly be discarded for our project.

4.1.4 Park and Byrd (2001)


Reviewing Park and Byrd's (2001) results, their method seems to perform very well, judging by
its high numbers in both precision and recall. If one also takes into consideration that all the missed
abbreviations were due to definition mismatches, the result is even more impressive. When it
comes to how easy the algorithm would be to implement, there should not be that much of a problem
since it is based upon simple conditions and restrictions. The biggest challenge would most likely be
in supplying the algorithm with a good enough dictionary and sufficient sets of names and stop words
that are adapted to the context of clinical texts. This, however, is something that could be improved as
the algorithm is tested. The important part is that it has been scientifically established that it performs
well with the correct input. Another strength of this algorithm is that it has no restrictions in terms of
accepting different abbreviations, unlike the two previous algorithms. The disadvantage of Park and
Byrd's algorithm, though, is that it has been designed and tested in a more general area than clinical
texts, thus avoiding the specific complications that these present. It is therefore difficult to measure
how well the algorithm would perform in the domain of this project.

4.1.5 Dannélls (2003)


The next method to evaluate is Dannélls'. One of the strengths of this method is that it has been
specifically developed and tested for the Swedish language, albeit not for the specific domain of
clinical texts, but at least for biomedical texts. Further, Dannélls accounts for the results of the
acronym finding part of her method separately from the other results. This gives us more reliable
figures of how well the method actually performs. The stated test results for the algorithm were 98%
recall and 94% precision, which is the best so far, since the results from Park and Byrd could not be
properly validated. A special requirement for Dannélls’ method is that the input text has to be
preprocessed by a POS-tagger in order for it to function properly. Normally a POS-tagger has to be
trained in the domain on which it operates, which would make Dannélls’ method unavailable to this
project due to the lack of training data. However, there are POS-taggers available that come trained
for certain domains. If such a POS-tagger can be found, Dannélls’ method can still be used for this
project.

4.1.6 Larkey et al. (2004)


Larkey et al. (2004) described three different methods for detecting acronyms. If we focus on the
context/canonical method, which proved to be the most successful one, it has a quite simplistic rule
based system for selecting valid acronyms. The limitation, though, is that it shows weaker results in
terms of recall and precision compared to Park and Byrd (2001) and Dannélls (2003). What it also lacks
compared to the two mentioned algorithms is some kind of filtering of words, such as names,
dictionary words etc. This could be added as an extension, though, if the decision is made to use this
method.

4.1.7 Xu et al. (2007)


Finally, we have the methods used by Xu et al. (2007). The method that performed best according to
the authors' test results was the one based upon a decision tree classifier that has to be trained upon
pre-annotated data, which is the reason why we cannot include it as a viable option in the choice of
method. Their second best method, on the other hand, would work and shows reasonably good
performance (85.4% precision and 83.9% recall), although a little bit lacking in comparison to the
more prominent methods of Park and Byrd (2001) and Dannélls (2003). What should be taken into
consideration though is that the method of Xu et al. was the only one tested on the actual domain that
our project is focused on, i.e. clinical texts. As we have pointed out, clinical texts are much more
complicated in terms of abbreviation detection, which is why the test results of this method should be
considered stronger than the numbers indicate. The test scores of the method are also specific to the
actual detection of abbreviations, compared to Park and Byrd's (2001) test scores that are merged with
the detection of the abbreviation definitions and thus cannot be accurately judged in terms of
abbreviation detection. Therefore the method of Xu et al. should be one of the frontrunners in method
selection since it is the only method that has been properly tested in the domain of this project.

4.1.8 Alternative methods


All the methods that have been reviewed so far have had some distinct differences. They are however
all based on the principle of pattern matching. To get a wider perspective of approaches for this
project, the field of machine learning should also be considered as an alternative method. An example
of previous research that based its method on machine learning is Xu et al. (2007). With their
method they reach an impressive result of 94.5% recall and 97% precision on a data set consisting of
clinical discharge summaries. This strengthens the argument that machine learning algorithms perform
better than rule based ones. However, machine learning algorithms are a lot more demanding in terms
of complexity, in addition to requiring training data. Due to the limited resources for
this project, such as the lack of training data, machine learning algorithms will have to be discarded,
despite the possibility of them generating a better result.

4.1.9 Selected method


In the review of the available methods of this project, Park and Byrd's (2001) method was the one that
showed the highest test results. However, those results were based on test data which was not from the
clinical domain and thus did not have to face the hardships that comes with it. This also applies to the
method in Dannélls (2003), which also displayed good results but was developed and tested against
the much simpler biomedical domain. For this reason, the chosen method will be the rule based one by
Xu et al. (2007), since it is the only one of the reviewed methods that has been tailor-made and tested
against the domain of clinical texts.
The weakness of the chosen method is that the presented results are a little bit lacking in precision and
recall. Some additions to the existing conditions must therefore be made in order to improve the
performance of the algorithm. For this project, the addition will be to take the POS-tag condition from
Dannélls (2003) and insert it into the conditions of Xu et al. (2007). The intention is to increase the
algorithm's precision, since it sorts out tokens with an incorrect POS-tag.
The method of this project will consequently consist of the following conditions gathered from Xu et
al. (2007) and Dannélls (2003):
 The POS tag for the token is either N, Y or X (noun, abbreviation or foreign word). In case X, it
must consist of at least 2 upper-case letters
 The word contains a special character such as "-" and "."
 The word contains less than 6 characters and contains one of the following: a) A mixture of
numeric and alphabetical characters. b) Capital letter(s), but not when only the first letter is
uppercase following a period. c) Lower case letters where the word is not in the Swedish word list
or medical list.

Xu et al. state in their results that a large proportion of the missed abbreviations were due to the
overly simplistic tokenizer that they used. For example, the abbreviation "ex tol" (exercise tolerance)
was mistakenly divided into two tokens instead of one, which resulted in a loss of both precision and
recall. The authors do not specify how their tokenizer works, but the tokenizer developed for this
project will have to somehow mitigate the mentioned problem.
From observations of Swedish clinical texts, it is possible to identify problems similar to the one
mentioned by Xu et al. Authors of clinical texts tend to put blank spaces between the characters of an
abbreviation, e.g. "p.g.a." (på grund av) becomes "p g a". If the tokenizer were to operate under the
assumption that a blank space is a definitive separator between two tokens, the abbreviation "p g a"
would be missed and interpreted as three incorrect tokens instead.
Having taken into consideration the two problems just mentioned, the tokenization process proposed
for this project will be divided into the following steps:
1. Separate the text into tokens with the single condition that a space marks the beginning of a new
token.
2. When the whole text has been processed, the algorithm will test all of the tokens consisting of a
single character (punctuations will not be counted as a character). If a single character token has
one or more following tokens consisting of only one letter, they will be combined into a single
token.
3. Apply the selected conditions for abbreviation detection.
4. Test all the tokens that are considered abbreviations in step 3. If a token is not in a dictionary of
known medical abbreviations, the tokenizer tests whether there are other adjacent tokens that fit the
same profile. If there are, the tokenizer combines these and tests whether the new combined token is
in the dictionary. If yes, the new token is accepted; if not, the token is split back into the original
tokens.

Steps one and three in this process are simple and should not require further explanation. Step two
tries to alleviate the mentioned problem of blank spaces being inserted into acronyms in Swedish
clinical texts. Step four is an attempt to avoid the limitations that Xu et al. mentioned in their
tokenizer. The dictionary of known medical abbreviations is used so that correct abbreviations are not
accidentally combined into faulty tokens. The algorithm and its usage become somewhat more complicated
by the use of this abbreviation dictionary, but we deem this a necessary evil in order to deal with the
tokenization problem.
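Step two above can be sketched as follows. This is a minimal illustration in Java (the language the thesis states SCAN was written in); the class and method names are our own and not taken from the thesis, and the merge simply keeps the blank spaces inside the combined token.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 2 of the proposed tokenization: runs of single-letter
// tokens (e.g. "p g a") are merged into one token. Illustrative only.
public class SingleLetterMerger {

    // True if the token is one letter, optionally followed by a period.
    static boolean isSingleLetter(String t) {
        String stripped = t.endsWith(".") ? t.substring(0, t.length() - 1) : t;
        return stripped.length() == 1 && Character.isLetter(stripped.charAt(0));
    }

    public static List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (isSingleLetter(tokens.get(i))) {
                // Collect the whole run of consecutive single-letter tokens.
                StringBuilder sb = new StringBuilder(tokens.get(i));
                int j = i + 1;
                while (j < tokens.size() && isSingleLetter(tokens.get(j))) {
                    sb.append(' ').append(tokens.get(j));
                    j++;
                }
                if (j - i > 1) {          // at least two in a row: merge them
                    out.add(sb.toString());
                    i = j;
                    continue;
                }
            }
            out.add(tokens.get(i));
            i++;
        }
        return out;
    }
}
```

With this sketch, the token list ["pat", "p", "g", "a", "feber"] becomes ["pat", "p g a", "feber"], so the spaced-out acronym survives as one token.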

4.1.10 Summary
The selected method, along with the presented additions, can now be summarized. The algorithm to be
developed will contain the following steps:
1. Tokenize the text to be processed. This is a two-step process which first divides the text into
tokens by searching for blank spaces, and then iterates through the tokens to check whether some
single-character tokens should be combined.
2. Identify which of the given tokens could be considered abbreviations by using the conditions from
4.1.7.
3. Take the tokens considered abbreviations in step 2 and run them against the abbreviation
dictionary. Test whether unknown tokens can be combined to form an abbreviation that exists in the
dictionary of known medical abbreviations.
4. Supply each token with the available translations from the abbreviation dictionary. Return the final
result.
The intended workflow of the algorithm is also illustrated in figure 1.

4.1.11 Potential Changes


The development of the algorithm will be an iterative process, which means that the algorithm
described in 4.1.10 might not be the final version. Depending on the evaluation, changes might be
applied to the algorithm and its external components in order to improve its performance. The final
version of the algorithm should therefore be similar to the original one, but possibly with slightly
altered conditions. Experiments with alternate parameters for the algorithm will also be carried out,
e.g. changing the maximum length of an abbreviation to both lower and higher values.

4.2 Evaluation
For the evaluation part of the project there are multiple methods to choose from. The reviewed
previous research has exclusively used one method in its evaluations: precision and recall, as devised
by van Rijsbergen (1979). Other measures, such as accuracy and false negatives, could be considered
for this project, but since the project should be comparable to previous research in the domain,
precision and recall were selected.
As evaluation data, authentic medical records will be used in order to test the method in as realistic
an environment as possible. These records will be processed manually to identify all abbreviations
present in the text. The same records will then be processed by the algorithm, whose results will be
compared to the results of the manual identification process. From this comparison, the recall and
precision of the algorithm can be calculated, producing values that measure the performance of the
algorithm.
The amount of evaluation data will be based on the number of abbreviations it contains. Xu et
al. (2007) used a data set containing 411 abbreviations and Dannélls (2003) used one containing 898
abbreviations. The data set for this project should therefore contain a minimum of 400 abbreviations
in order for fair comparisons to be made against previous studies.

4.3 Ethical considerations⁴
Since the evaluation part of this project will be done with the aid of authentic medical records whose
content is highly sensitive, ethical aspects have to be taken into consideration. The medical records
used will be stripped of personal information in advance, in order to protect the patients' privacy. In
addition to this, the greatest precautions will be taken to ensure that none of the information stored
in the medical records is leaked outside the project. Because of this, all of the sensitive information
will be kept in an encrypted format whenever it is not in use. As a further precaution, no external
network connections will be allowed to the device where the medical records are stored, except when
the records are in an encrypted state. If data is needed to aid in the development of the algorithm, a
set of made-up data will be created that matches the structure of actual medical records but does not
contain sensitive information.
Before any authentic medical records can be used in this project, the necessary authorization has to be
acquired in order to prove that the author of this project has the approval to use the medical records for
research related purposes.
Lastly, potential examples from the medical records that have to be presented in this report in order to
strengthen an argument will have to be replicated with slightly different content, since the sensitive
parts of the medical records cannot be published in this report.

⁴ This section only covers the basic aspects of the ethical considerations for this project. More
details of how the ethical aspects were followed are given in chapter 5.

Figure 1. The intended workflow of the algorithm

5. Application of Method
In this section of the report, the application of the method is presented. The presentation is divided
into two parts: the first describes the development of the actual algorithm and the second describes
how the algorithm was evaluated.

5.1 Algorithm Development Process


The algorithm for the project was developed as a software application in the Java programming
language. The application was named SCAN, or Swedish Clinical Abbreviation Normalizer. SCAN
consists of the algorithm functionality and a simple graphical user interface (GUI) that keeps
interaction with the algorithm as simple as possible. The GUI part of the application was kept strictly
separated from the algorithm functionality, so that the algorithm could be implemented on its own in a
different application without having to be disentangled from the GUI. The thought behind this was also
that the algorithm part of the application could be used as a command line tool. A visual
representation of the application architecture can be seen in figure 2. Since the GUI was developed
only to aid in the evaluation process and is not otherwise relevant to the project, it will not be
described further in this report.

Figure 2. The SCAN-Application architecture

5.1.1 Algorithm Architecture


The algorithm component of the application, which is the main focus of this report, was developed
with a single handler object through which all input and output is filtered. Several subcomponents
then make up the actual functionality of the algorithm, and their collaboration is synchronized by the
handler object. To describe the algorithm and how it was developed, we will go through each step
that the algorithm takes during its runtime.
First, before the handler object can be used, it has to be initialized. This is done by manually
calling the three initialization methods of the handler class, supplying the corresponding dictionary
or word list as arguments. After the word lists and dictionary have been initialized, the handler is
ready for use. To start the process, the findAbb method is called with the text as argument. A diagram
of the handler class can be seen in figure 3.
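The initialize-then-call pattern just described can be sketched as below. Only the method name findAbb comes from the thesis; the initialization method names, signatures, and the grossly simplified detection logic (every token that is neither a known Swedish word nor a medical term is treated as a candidate) are our assumptions for illustration.

```java
import java.util.*;

// Minimal, self-contained sketch of the handler pattern: the caller
// initializes the three external resources, then calls findAbb on a text.
public class AbbreviationHandler {
    private final Set<String> swedishWords = new HashSet<>();
    private final Set<String> medicalWords = new HashSet<>();
    private final Map<String, String> abbDictionary = new HashMap<>();

    public void initSwedishWordList(Collection<String> words) { swedishWords.addAll(words); }
    public void initMedicalWordList(Collection<String> words) { medicalWords.addAll(words); }
    public void initAbbDictionary(Map<String, String> dict)   { abbDictionary.putAll(dict); }

    // Grossly simplified stand-in for the real pipeline: any token that is
    // neither an ordinary Swedish word nor a medical term is returned as an
    // abbreviation candidate.
    public List<String> findAbb(String text) {
        List<String> found = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            String t = token.replaceAll("[,!?]+$", "");   // strip trailing clause punctuation
            if (!swedishWords.contains(t) && !medicalWords.contains(t)) {
                found.add(t);
            }
        }
        return found;
    }
}
```

A caller would thus invoke the three init methods with its word lists and dictionary, and only then call findAbb; calling findAbb on an uninitialized handler would simply flag every token.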

Figure 3. Diagram for the handler class

When the process starts, the first component to be called by the handler is the tokenization process.
This corresponds to step 1 of the algorithm described in 4.1.8. The process first iterates over the
text once to separate it into tokens, and then iterates through the newly created set of tokens to
evaluate which tokens can be combined into one. The tokenization process keeps all punctuation in the
tokens, because the second step of the algorithm uses periods as potential identifiers for
abbreviations. Each token is also supplied with its position in the original text and its order in the
text, so that components later in the process can access that information. The structure of a token
can be seen in figure 4.

Figure 4. Diagram for the Token class
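As a rough sketch, a token carrying its string, character span, order, and expansion list might look as below. The field and accessor names are our assumptions based on the description above, not a reproduction of figure 4.

```java
// Sketch of a token record: the token string plus its character span and
// order in the original text, so later components can map results back.
public class Token {
    private final String text;      // the token string, punctuation kept
    private final int spanStart;    // start offset in the original text
    private final int spanEnd;      // end offset in the original text
    private final int order;        // position in the token sequence
    private final java.util.List<String> expansions = new java.util.ArrayList<>();

    public Token(String text, int spanStart, int spanEnd, int order) {
        this.text = text;
        this.spanStart = spanStart;
        this.spanEnd = spanEnd;
        this.order = order;
    }

    public String getText() { return text; }
    public int getSpanStart() { return spanStart; }
    public int getSpanEnd() { return spanEnd; }
    public int getOrder() { return order; }
    public java.util.List<String> getExpansions() { return expansions; }
}
```

Keeping the span on the token is what later allows the evaluation step to compare SCAN's output against span-annotated XML.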

In the second step of the algorithm, the tokens are sent to the abbreviation identifying component,
which applies the conditions from 4.1.6 to test which tokens can be considered abbreviations. A set
of logical statements is used to test each condition. The component was built so that new conditions
can easily be added or removed if an alteration to the algorithm is deemed necessary. This can also be
used to quickly test how a condition alters the output of the algorithm. As the final part of the
component, each token is tested against the external medical and Swedish word lists to check whether
it is an ordinary word or a medical term and thus should be removed. This component corresponds to
step 2 of the algorithm described in 4.1.8.
In the next step of the algorithm, the tokens are sent to the expansion finding component, which
simply tests whether each token has a known expansion in the external abbreviation dictionary. If so,
the token is marked as known and the expansion is added to the expansion list in the token object. All
tokens that could not be matched against an expansion are sent back to the tokenizer component. There,
adjacent tokens are combined to form new tokens that are tested against the abbreviation dictionary.
If a combined token matches an expansion in the abbreviation dictionary, it is kept; if not, the
combined tokens are restored to their former state and simply considered unknown abbreviations. After
this process, which corresponds to step 3 described in 4.1.8, the whole set of tokens is returned to
the algorithm handler, which passes it to the process that called the findAbb method. A graphical
overview of this entire process is presented in figure 5.
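The combine-and-retry behaviour just described can be sketched as follows, assuming unknown tokens are handled pairwise; class and method names are ours, and the dictionary is reduced to a plain set of known abbreviation strings.

```java
import java.util.*;

// Sketch of the combine-and-retry step: two adjacent tokens that are
// unknown on their own are joined and looked up again in the abbreviation
// dictionary; the join is kept only on a hit, otherwise the originals stay.
public class CombineRetry {
    public static List<String> retry(List<String> unknown, Set<String> abbDictionary) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < unknown.size(); i++) {
            if (i + 1 < unknown.size()) {
                String combined = unknown.get(i) + " " + unknown.get(i + 1);
                if (abbDictionary.contains(combined)) {
                    out.add(combined);   // keep the combined token
                    i++;                 // skip the consumed neighbour
                    continue;
                }
            }
            out.add(unknown.get(i));     // no hit: the original token is kept
        }
        return out;
    }
}
```

With a dictionary containing "ex tol", the unknown tokens ["ex", "tol", "hb"] come back as ["ex tol", "hb"], which is exactly the failure mode of Xu et al.'s tokenizer that step 4 is meant to repair.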

5.2 Algorithm Evaluation Process
The algorithm evaluation process was divided into three parts. The first was acquiring a suitable
abbreviation dictionary, Swedish word list, medical word list and POS tagger for the algorithm, i.e.
its external components. The second was supplying authentic medical records that had been manually
processed by marking all existing abbreviations, i.e. the data set that the algorithm would be tested
against. The last part was generating the results and, from them, performing the actual evaluation of
the algorithm.

5.2.1 External components


Figure 5. The algorithm process of the SCAN-application

Perhaps the most important part of the evaluation process is supplying the algorithm with good
enough word lists and dictionary to give the algorithm a fair chance of performing well. In our
evaluation, we used "Lars Aronssons svenska ordlista" as the Swedish word list⁵. It is a free digital
word list consisting of 221,599 Swedish words, which was considered large enough for our needs.
One problem with the word list, however, was that in addition to Swedish words it also contained
Swedish abbreviations. This would result in the algorithm recognizing those abbreviations as ordinary
words and thus rejecting them as abbreviation candidates. To mitigate this problem, a list of common
Swedish abbreviations taken from Svenska Akademins Ordbok⁶ (the Swedish Academy's dictionary) was
subtracted from the word list.
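The subtraction described here is a plain set difference, which can be sketched as below; the class name and the example words are illustrative only.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the word-list cleanup: every entry that also appears in the
// list of common Swedish abbreviations is removed from the word list, so
// that such abbreviations are no longer treated as known words.
public class WordListCleanup {
    public static Set<String> subtract(Set<String> wordList, Set<String> abbreviations) {
        Set<String> cleaned = new HashSet<>(wordList);
        cleaned.removeAll(abbreviations);   // drop every word that is also an abbreviation
        return cleaned;
    }
}
```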
The medical word list had to be generated specially for this project, since no existing Swedish
medical word lists were available to us. The list was generated from the medical dictionary supplied
by FASS⁷, a dictionary of all pharmaceutical drugs distributed in Sweden, provided by
Läkemedelsindustriföreningens Service AB (LIF).

⁵ http://runeberg.org/words/ [2012-04-11]
⁶ http://g3.spraakdata.gu.se/saob/ [2012-04-11]
⁷ http://www.fass.se [2012-04-03]

As the abbreviation dictionary, we used a digital version of the medical abbreviation dictionary
presented in the book "Medicinska förkortningar och akronymer" by Staffan Cederblom (2005). The
dictionary contains all the abbreviations, and their definitions, that Cederblom had encountered up to
that point. It should be stated that this is not a complete dictionary, but it was the most complete
one available to the project at the time of the evaluation.
An equally important task for the evaluation process is properly POS tagging the evaluation data
before it is processed by the algorithm. As stated earlier, the project lacked the resources for
training a POS tagger for the specific domain of Swedish clinical text. Instead a pre-trained POS
tagger had to be selected, and Granska Tagger⁸ was deemed the most suitable. Granska Tagger has been
pre-trained for the Swedish language and had the additional benefit of having been previously tested
on Swedish clinical records, where it obtained an accuracy of 92.4% (Hassel et al., 2011).

5.2.2 Data set


The data set used in the evaluation of the algorithm was a subset of the Stockholm EPR Corpus,
which consists of over 1 million de-identified⁹ Swedish medical records from over 900 clinical units
(Dalianis et al., 2009). The selected subset consisted of the assessment sections from 300 medical
records originating from the emergency department of the Karolinska University Hospital in
Stockholm, Sweden¹⁰. The whole subset was then manually annotated with its abbreviations by a senior
physician with previous experience in annotating clinical texts. In total, the data set contained
19408 tokens, of which 2050 were abbreviation tokens (335 unique), well over the 400 deemed as a
minimum.

5.2.3 Evaluation
The actual evaluation of the algorithm was performed by a separate Java application. The application
ran the supplied data set through the SCAN application and automatically compared the results against
the manually annotated results, which were supplied via XML files. An example of the output of the
SCAN algorithm and of the XML format representing the manually generated result can be seen in
figure 6.

SCAN-Output:
Span TokenOrder TokenString
199-201 28 sr

XML-Format:
<annotation>
  <mention id="abbreviation_annotation_Instance_15" />
  <span start="199" end="201" />
  <spannedText>SR</spannedText>
</annotation>
Figure 6. SCAN-Output and the XML format of a detected abbreviation.

⁸ http://www.csc.kth.se/tcs/humanlang/tools.html [2012-04-03]
⁹ De-identified implying that names and social security numbers had been removed.
¹⁰ The selected data set was used after approval from the Regional Ethical Review Board in Stockholm,
permission number 2009/1742-31/5.

The matching process, i.e. the test of whether the SCAN output conformed to the results in the XML
files, was deemed successful if the following logical statement was satisfied:
XML.Span.Start >= SCAN.Span.Start && XML.Span.End <= SCAN.Span.End

In other words, an abbreviation token from the SCAN output was considered correct if the span of an
XML element could be matched as a subset of the token's own span. The reason the XML span only has to
be a subset of the SCAN token span, and not a perfect match, is that in the manual annotation process
only the abbreviation part of a word was marked as the actual abbreviation. If, for example, the
compound word "huddr" (made up of the word hud and the abbreviation dr) was found, only the second
part of the word (dr) was marked by the annotator, since the first part (hud) is not an abbreviation.
The SCAN algorithm, on the other hand, would mark the entire word, so while still a correct result,
the span comparison would produce a mismatch if a perfect match were required.
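The quoted matching rule is a one-line predicate, sketched below; the class and parameter names are ours.

```java
// Sketch of the matching rule quoted above: a SCAN-detected abbreviation
// counts as correct when the manually annotated XML span lies inside (or
// exactly equals) the span reported by SCAN.
public class SpanMatcher {
    public static boolean matches(int xmlStart, int xmlEnd, int scanStart, int scanEnd) {
        return xmlStart >= scanStart && xmlEnd <= scanEnd;
    }
}
```

For the "huddr" example, the annotator's span for "dr" falls inside SCAN's span for the whole word, so the subset test accepts it, while an exact-match test would not.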
When the entire data set had been evaluated, a log file was created with the results for each
individual record, as well as the result for the entire data set. In addition to this, the following
four types of lists were also generated:
• Frequency list of the correct words found by SCAN
• Frequency list of the incorrect words found by SCAN
• Frequency list of the words missed by SCAN
• Frequency list of all the words found in the manual annotation
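A frequency list of this kind can be produced by counting occurrences per word and sorting by descending count, as in the illustrative sketch below (names and data are ours, not the evaluation application's).

```java
import java.util.*;

// Sketch of how a frequency list could be built: count occurrences of each
// word, then sort the entries with the most frequent word first.
public class FrequencyList {
    public static List<Map.Entry<String, Integer>> build(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue());   // most frequent first
        return entries;
    }
}
```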

6. Results
Already in the initial phase of the evaluation, it became apparent that the selected POS tagger could
not be used in the project. The reason was that the tagger would filter out some tokens (such as
numbers), thus making them invisible to the algorithm. This would also cause the indexation of the
tokens to become incorrect, so that they could not be correctly matched against the manually generated
results in the XML files. Since no POS information was available to the algorithm after this, the
following condition in the rule set had to be discarded:

• The POS tag for the token is either N, Y or X (noun, abbreviation or foreign word). In the case of
X, it must consist of at least 2 upper-case letters.

The results of the algorithm are presented by listing the number of found tokens, the number of
correct tokens and the number of tokens found in the manual revision. Recall and precision are also
presented, as well as the F-score of the algorithm. The F-score is a combined value of precision and
recall and is calculated with the following formula (van Rijsbergen, 1979):
F-Score = 2 × (Precision × Recall) / (Precision + Recall)
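The three measures can be computed as below; this is a generic sketch (names are ours), shown with the counts from one of the table rows reported later (2031 found, 1576 correct, 2050 annotated).

```java
// Sketch of the evaluation measures: precision, recall and the F-score
// combining them (van Rijsbergen, 1979). Assumes non-zero denominators.
public class Metrics {
    public static double precision(int correct, int found)  { return (double) correct / found; }
    public static double recall(int correct, int annotated) { return (double) correct / annotated; }
    public static double fScore(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }
}
```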

6.1 Baseline Results


Before the algorithm was evaluated, a set of baseline results was generated in order to have
something to compare the evaluation results of the algorithm against. Two baseline algorithms were
created. The first was based on the single condition of token length, i.e. if the length of a token
was equal to or shorter than the set maximum value, it was considered an abbreviation. The second
baseline algorithm used the word lists to decide whether or not a token was an abbreviation: if the
token was found in the word lists it was ignored, and if it was not found it was considered an
abbreviation. The baseline results can be seen in table 1.

Version                Found  Correct  Total  Recall  Precision  F-Measure
Baseline-maxLength4     9696     1529   2050    0.74       0.15       0.26
Baseline-maxLength5    11597     1576   2050    0.76       0.13       0.23
Baseline-maxLength6    13276     1616   2050    0.78       0.12       0.21
Baseline-maxLength7    14526     1681   2050    0.82       0.11       0.20
Baseline-maxLength8    15793     1755   2050    0.85       0.11       0.19
Baseline-wordLists      4889     1793   2050    0.87       0.36       0.51
Table 1. Baseline test results. MaxLength denotes the allowed maximum token length.

6.2 Original Algorithm


The results from the first evaluation of the algorithm can be seen in table 2. The algorithm was run six
times, each time with a different value for the maximum length of an abbreviation.

Version                      Found  Correct  Annotated  Recall  Precision  F-Measure
Original-RegList-maxLength3   1688     1299       2050    0.63       0.76       0.69
Original-RegList-maxLength4   1892     1444       2050    0.70       0.76       0.73
Original-RegList-maxLength5   2065     1485       2050    0.72       0.71       0.72
Original-RegList-maxLength6   2749     1511       2050    0.73       0.68       0.70
Original-RegList-maxLength7   2946     1550       2050    0.75       0.64       0.69
Original-RegList-maxLength8   2649     1576       2050    0.76       0.59       0.67
Table 2. The results of the original algorithm with the regular word list and a maximum token length of 3 to 8.

When reviewing the frequency lists generated from the first evaluation, it became apparent that the
algorithm missed a lot of ordinary abbreviations that should have been recognized by the rule set.
After a closer inspection of the word lists, they were found to contain numerous abbreviations, which
were thus ignored by the algorithm. In order to give the algorithm a fair evaluation (since the word
lists are considered external components), a list of common Swedish abbreviations supplied by the
Swedish Academy Wordbook¹¹ was subtracted from the word lists. A second evaluation was then performed
with these improved word lists. The results from this evaluation can be seen in table 3.

Version                      Found  Correct  Annotated  Recall  Precision  F-Measure
Original-ImpList-maxLength3   1808     1413       2050    0.68       0.78       0.73
Original-ImpList-maxLength4   2031     1576       2050    0.76       0.77       0.77
Original-ImpList-maxLength5   2204     1617       2050    0.78       0.73       0.76
Original-ImpList-maxLength6   2352     1643       2050    0.80       0.69       0.74
Original-ImpList-maxLength7   2546     1682       2050    0.82       0.66       0.73
Original-ImpList-maxLength8   2788     1708       2050    0.83       0.61       0.70
Table 3. The results of the original algorithm with the improved word list and a maximum token length of 3 to 8.

6.3 Improved Algorithm


Even though replacing the original word lists improved the results, the algorithm was still lacking in
precision. After another review of the frequency lists, it became apparent that the algorithm was a
little too coarse in which words it accepted as abbreviations. The biggest problem was the condition
that tested for special characters such as '-' and '.'. Since the mere existence of these characters
would make the algorithm see a token as an abbreviation, it generated a lot of incorrect tokens.
Examples of tokens misinterpreted for this reason are "30-50" and other combinations consisting
exclusively of numbers and special characters. Another case that needed to be addressed is that in
Swedish, the '-' character is often used to tie together numerals and words, as in "50-årig" (50
years of age). The following alteration was therefore made in the rule set:
• The word contains a special character such as "-" and "."
Was replaced with:

¹¹ http://g3.spraakdata.gu.se/saob/foerkortn.shtml [2012-06-09]

• The word contains a special character such as "-" and ".", unless the character is '-' and the
words on both sides are either numerals or dictionary words.
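The refined condition can be sketched as a predicate like the one below; the class and method names, and the exact tokenization of the two sides, are our assumptions for illustration.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Sketch of the refined special-character condition: a token containing
// '-' or '.' is an abbreviation candidate, unless it is two parts joined
// by '-' where both parts are numerals or known words ("30-50", "50-årig").
public class SpecialCharRule {
    private static final Pattern NUMERAL = Pattern.compile("\\d+");

    static boolean isNumeralOrKnown(String part, Set<String> wordList) {
        return NUMERAL.matcher(part).matches() || wordList.contains(part.toLowerCase());
    }

    public static boolean isCandidate(String token, Set<String> wordList) {
        if (!token.contains("-") && !token.contains(".")) return false;
        int dash = token.indexOf('-');
        if (dash > 0 && dash < token.length() - 1 && !token.contains(".")) {
            String left = token.substring(0, dash);
            String right = token.substring(dash + 1);
            if (isNumeralOrKnown(left, wordList) && isNumeralOrKnown(right, wordList)) {
                return false;   // e.g. "30-50" or "50-årig": not an abbreviation
            }
        }
        return true;
    }
}
```

Under this sketch, "p.g.a." is still flagged as a candidate, while "30-50" and "50-årig" are filtered out, which is exactly the precision gain the alteration aims for.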
With this improved condition, another evaluation of the algorithm was performed. The results using
the regular word lists can be seen in table 4 and the results using the improved word lists can be
seen in table 5.

Version                     Found  Correct  Annotated  Recall  Precision  F-Measure
Improve-RegList-maxLength3   1585     1295       2050    0.63       0.81       0.71
Improve-RegList-maxLength4   1789     1440       2050    0.70       0.80       0.75
Improve-RegList-maxLength5   1962     1481       2050    0.72       0.75       0.73
Improve-RegList-maxLength6   2111     1508       2050    0.73       0.71       0.72
Improve-RegList-maxLength7   2351     1547       2050    0.75       0.65       0.70
Improve-RegList-maxLength8   2596     1573       2050    0.76       0.60       0.67
Table 4. The results of the improved algorithm with the regular word list and a maximum token length of 3 to 8.

Version                     Found  Correct  Annotated  Recall  Precision  F-Measure
Improve-ImpList-maxLength3   1706     1410       2050    0.68       0.82       0.75
Improve-ImpList-maxLength4   1929     1573       2050    0.76       0.81       0.79
Improve-ImpList-maxLength5   2102     1614       2050    0.78       0.76       0.77
Improve-ImpList-maxLength6   2250     1640       2050    0.80       0.72       0.76
Improve-ImpList-maxLength7   2490     1679       2050    0.81       0.67       0.73
Improve-ImpList-maxLength8   2735     1705       2050    0.83       0.62       0.71
Table 5. The results of the improved algorithm with the improved word list and a maximum token length of 3 to 8.

6.4 Summarized results


The summarized evaluation results of the six best performing versions of the algorithm can be seen in
table 6, sorted by highest F-score. The results show that the best performing version was the improved
algorithm with the improved word lists and a maximum abbreviation length of four. Comparing the
versions that all used the original maximum abbreviation length of six, it is also the improved
algorithm together with the improved word lists that shows the best results.

Version                      Found  Correct  Annotated  Recall  Precision  F-Measure
Improve-ImpList-maxLength4    1929     1573       2050    0.76       0.81       0.79
Improve-ImpList-maxLength5    2102     1614       2050    0.78       0.76       0.77
Original-ImpList-maxLength4   2031     1576       2050    0.76       0.77       0.77
Improve-ImpList-maxLength6    2250     1640       2050    0.80       0.72       0.76
Original-ImpList-maxLength5   2204     1617       2050    0.78       0.73       0.76
Improve-ImpList-maxLength3    1706     1410       2050    0.68       0.82       0.75
Table 6. The results of the six versions of the algorithm that performed the best.

7. Analysis
In this section, the results from chapter 6 are analyzed. The analysis covers how well the different
versions of the algorithm performed, pointing out their strengths and weaknesses. Comparisons are also
made with other algorithms similar to the one developed in this project, in order to get a more
nuanced picture of the algorithm's performance.

7.1 Algorithm versions


The developed algorithm was tested in a variety of versions during the evaluation phase of this
project, which in turn generated a lot of different results. To start the analysis, the results of the
versions that all used the default maximum word length of six were compared. The value six was chosen
because it was the default value of the algorithm, i.e. the value defined in the method section of
this project. These results are summarized in table 7.

Version                      Found  Correct  Annotated  Recall  Precision  F-Measure
Original-RegList-maxLength6   2213     1511       2050    0.73       0.68       0.70
Original-ImpList-maxLength6   2352     1643       2050    0.80       0.69       0.74
Improve-RegList-maxLength6    2111     1508       2050    0.73       0.71       0.72
Improve-ImpList-maxLength6    2250     1640       2050    0.80       0.72       0.76
Table 7. The evaluation results of the versions of the algorithm using a maximum word length of six.

The first step in the analysis was to examine how the different word lists affected the performance of
the algorithm. The two versions of the word lists used in the evaluation were the regular word lists,
where no alterations had been made, and the improved word lists, where some of the most common
abbreviations had been removed. Not surprisingly, the algorithm performed better in both recall and
precision when the improved word lists were used. The greatest effect could be seen on the recall
value, which increased by seven percentage points. The reason was that a lot of abbreviations were
incorrectly marked as known words by the flawed regular word lists. With the improved word lists these
abbreviations were instead correctly identified, which in turn led to a higher recall. The small
increase in precision was simply a side effect of more correct abbreviations being found by the
algorithm.
The second step in the analysis was to compare the results of the original version of the algorithm
with those of the version with the improvements listed in section 6.3. As table 7 shows, there was no
difference in recall between the two versions, but there was a three percentage point increase in
precision. The reason is that the original version of the algorithm saw every word containing the "-"
character as an abbreviation. The improved version used a more fine-grained rule, which tested whether
both sides of the "-" character were either numerals or words present in the word lists. This allowed
more incorrect tokens to be filtered out by the improved algorithm, resulting in better precision.

7.2 The maximum length variable
One key variable that was extensively tested during the evaluation was the one deciding how many
letters a token could contain and still be viable as an abbreviation. The default value at the start
of the evaluation was six characters, but the whole range between three and eight characters was also
tested. Surprisingly, it was not the default value of six characters that resulted in the best
performance, but the value of four characters, as can be seen in tables 2 to 5. Despite a small drop
in recall, the value of four characters substantially increased the precision of the algorithm
compared to the default value of six. The reason for this increase in performance can be explained by
examining the frequency lists generated during the evaluations. These lists show that the majority of
correct abbreviations found by the algorithm consisted of four characters or fewer, while the majority
of the tokens incorrectly taken as abbreviations consisted of more than four characters. Given this
pattern, it is clear that decreasing the allowed maximum length of an abbreviation filters out many of
the incorrect tokens while still finding close to the same number of correct abbreviations. The
pattern is also reflected in the results for the versions of the algorithm that used a maximum word
length greater than six characters: as the value increases, the recall of the algorithm gets a small
boost, but precision drops severely.

7.3 Further Analysis


In order to analyze the algorithm further, one version of the algorithm had to be selected for closer
inspection of the evaluation results, since inspecting these results for every version would have been
both tedious and redundant. The version selected was the improved algorithm with the improved word
lists, using the limit of four characters as the maximum word length. This version was chosen simply
because it displayed the best test results.
The chosen version of the algorithm achieved an F-measure of 79%, with 76% recall and 81% precision.
Though a good result, a deeper analysis had to be made in order to find out what kinds of tokens the
algorithm missed or misinterpreted, and whether some kind of pattern could be found within these. The
analysis depended heavily on the inspection of the frequency lists generated during the evaluation.
Excerpts from the lists generated for the chosen version of the algorithm can be seen in appendix A.
The first subject to be investigated was the abbreviations missed by the algorithm. Among these, two
distinct types could be identified. The first type was words that, even though they are abbreviations,
can also be seen as "known words". An example of this is the word "hö", short for "höger" (eng.
"right"). The shortened word is not only the abbreviation for "höger" but also a word in itself,
meaning "hay" in Swedish. Since "hö" is considered a known word, it also exists in the word lists,
which is why it would be marked as a known word and not as an abbreviation by the algorithm. Other
examples of similar words that the algorithm would miss are "kol", "mott", "eko" and "alt", which are
all known Swedish words while at the same time being abbreviations.
The second type of abbreviations missed by the algorithm were abbreviations that had been compounded
with other words to form a single token. These tokens would in most cases end up longer than four
characters and thus be ignored by the algorithm. An example of such a token is "lungrtg", a shortened
form of "lungröntgen" (eng. "lung X-ray"), where only the second part of the word has been
abbreviated. In Swedish text in general, as well as in Swedish clinical text, compounding is a much
more common phenomenon than in English. This makes the mentioned problem much more of a nuisance for
algorithms operating on Swedish clinical texts than for those operating on English clinical texts.
The next step in the deeper analysis was to investigate what kinds of tokens were incorrectly marked
as abbreviations by the algorithm. First, there were some abbreviations that actually were correct,
but were listed as incorrect tokens because they had been missed in the manual annotation process. An
example of this is the abbreviation "pat", which was sometimes missed simply because it is used to
such an extent that the annotator, by pure habit, saw it as an ordinary word and not an abbreviation.
These misses were however relatively few and did not affect the results in any notable way.
Another example of incorrect abbreviations that should not be considered the algorithm's fault is
where the letter "x" was used as a multiplication sign (e.g. "1x2 pills daily"). In the manual
annotation process these were not marked as abbreviations, but whether that is correct could be
debated. Again, these "errors" were relatively few and had little effect on the end
results.
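One way to sidestep this ambiguity would be to treat tokens matching a dosage pattern as a category of their own, excluded from both annotation and detection. The regular expression below is an assumption based on the examples in the text and in Appendix A, not a pattern used by the algorithm:

```python
import re

# Dosage expressions like "1x2" or "x1", where "x" acts as a multiplication
# sign rather than an abbreviation. The pattern is an illustrative assumption.
DOSAGE = re.compile(r"^\d*x\d+$", re.IGNORECASE)

for token in ["1x2", "x1", "2x1", "pat"]:
    print(token, "dosage" if DOSAGE.match(token) else "other")
```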
Ignoring these "invalid" incorrect tokens, the majority of the tokens incorrectly marked as
abbreviations by the algorithm were words that should have been filtered out by the word lists. There
were two reasons why they were not filtered out: either the word was misspelled (e.g. "symtom"
instead of the correctly spelled "symptom"), or it simply did not exist in the word lists. In the latter
case, most of the words were either proper names or words unique to the medical domain, both of
which are difficult to fully cover in a word list.

7.4 Algorithm Performance


Measuring the performance of the algorithm was a big part of assessing the quality of the
developed algorithm. However, the data gathered during the evaluation was not enough on its own
to decide whether the algorithm performed at an adequate level. Since the data
only presents an isolated picture, other external data was needed to compare the results against in order
to get a better perspective. The version of the algorithm used as the representative for this project was
the same as in section 7.3, i.e. the improved algorithm with the improved word lists and a maximum
word length of six characters.
As a basic comparison, the baseline test results were used as a first benchmark. This was done in order to
see whether the algorithm brought any improvements at all compared to the minimal complexity of the
baseline algorithms. The result compared against was that of the baseline algorithm that achieved the highest
score (the one using only the word lists), which reached an F-Measure score of 51% with an
87% recall and 36% precision. As the best performing version of the algorithm developed in this
project reached an F-Measure score of 79%, with a 76% recall and 81% precision, it is clear that it
performed significantly better than the baseline algorithms.
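As a sanity check, the reported F-Measure scores can be recomputed from their precision and recall values using the standard balanced F-Measure (cf. van Rijsbergen, 1979); the small discrepancy for the developed algorithm (78.4 vs. the reported 79) presumably comes from rounding of the underlying precision and recall values:

```python
# Recompute the reported F-Measure scores from precision and recall.
# F = 2 * P * R / (P + R), the balanced (beta = 1) F-Measure.

def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Baseline using only the word lists: 36% precision, 87% recall
baseline = f_measure(0.36, 0.87) * 100    # ~50.9, reported as 51%

# Best version of the developed algorithm: 81% precision, 76% recall
developed = f_measure(0.81, 0.76) * 100   # ~78.4, reported as 79%

print(round(baseline, 1), round(developed, 1))
```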
As a further measurement of the algorithm's performance, it was compared against similar algorithms
from previous research that had been tested on a similar domain. Such algorithms were the
ones developed by Xu et al. (2007), who had also tested them against authentic English medical
records. In the evaluations by Xu et al. (2007), their rule-based algorithm reached an F-Measure score
of 84.6% (85.4% recall and 83.9% precision), while their two machine learning based algorithms
reached F-Measure scores of 76% (87.5% recall and 71.5% precision) and 85% (91.4% recall and
80.3% precision). Compared to these results, the algorithm developed for this project is a little bit
behind in performance with its 79% F-Measure score (76% recall and 81% precision), but still better
than the lowest score presented by Xu et al. (2007). It can therefore be stated that the
performance of the developed algorithm is close to, but not equal to, that of existing similar
algorithms.

8. Discussion
In this section there will be a discussion concerning the algorithm produced in this project, the results
generated, and whether or not it answers the research question stated in the beginning of this project.
The version of the algorithm discussed in this chapter will be the same as in the analysis, i.e. the
improved algorithm with the improved word lists and the maximum word length of six, simply
because it was the version of the algorithm that performed best. Any references in the discussion
to "the algorithm" will imply this version, unless something else is explicitly stated.

8.1 Algorithm Performance


The algorithm reached an F-Measure score of 79%, which in the analysis was shown to be significantly
better than the baseline test results. Compared to the algorithms of Xu et al. (2007), the algorithm
performed better than their least well performing algorithm, and was a little bit behind
their best performing algorithm.
The results presented in the analysis raise some interesting topics of discussion. First of all, it is
debatable whether the algorithm and the algorithms of Xu et al. (2007) can be seen as similar, i.e.
whether a comparison between them is relevant. The facts that speak for this assertion are that they are
all algorithms with the lone task of detecting abbreviations, and that they are all specially developed
for clinical texts. The algorithms are however designed for two different languages. The algorithms of
Xu et al. (2007) were developed for English, while the algorithm developed in this project was
designed for Swedish, which makes the conditions for the algorithms fundamentally
different.
Another fact that makes the validity of the comparison debatable is that the algorithms are
heavily dependent on external components in order to perform correctly, namely the word lists. As the
evaluation of this project's algorithm and the evaluations of the algorithms of Xu et al. (2007) used
different word lists, they were not tested under the same conditions. Because of this, it is hard to tell
whether the difference in performance was due to the actual algorithms or to the fact that one
algorithm had the aid of superior word lists. There are also further examples of how the comparison
between the algorithms might not be completely valid.
The first version of the algorithm developed for this project was almost identical to the rule based one
described in Xu et al. (2007) (since the part-of-speech information became unusable). Despite this, the
two algorithms showed a big difference in performance in their test results. The original algorithm of
this project, with the improved word lists and the default token length of six, reached an F-Measure
score of 74% (80% recall and 69% precision), while the rule based one from Xu et al. (2007) reached an
F-Measure score of 84.6% (85.4% recall and 83.9% precision). So in spite of being almost identical
algorithms, they had a performance difference of about 10 percentage points, which leads to the assumption
that this divergence may have come from either the difference in clinical language or the use of different
word lists, or a combination of the two. This suggests that much of the algorithm's performance
depends on the quality of the external components, i.e. the word lists.
A final remark can also be made concerning the specific clinical domains that the algorithms were
tested on. This project's algorithm was tested on medical records from an emergency department, while
Xu et al. (2007) tested their algorithms on admission notes from an internal medicine department. This
might mean that different varieties of abbreviations were present in the two data sets, and
that the frequency of abbreviations might have differed as well. It is not clear how big a difference
these domains make, since the variation in data set languages has a greater impact and thus
makes that comparison difficult.

8.2 Algorithm Problems


In the analysis of the algorithm, several areas were brought up where the algorithm in some way
performed suboptimally and thus reached a lower performance. A subject to discuss is whether
something can be done to alleviate these problem areas.
One problem area is the use of the word lists to filter out so called "known words". As seen in the
analysis, a lot of misinterpretations by the algorithm arose from erroneous or ambiguous words in the
word lists. The cases where abbreviations also matched ordinary words in the word lists, i.e. where they
were ambiguous, could have been solved if part-of-speech (POS) information had been
available to the algorithm, as was originally intended. An example of how this would have helped
the algorithm is the previously mentioned word "hö". The known word "hö" would have had the POS
tag "noun", while the abbreviation "hö" (short for the adjective "höger") would have had the POS tag
"adjective". So if the algorithm found a token consisting of the string "hö" with the POS tag
"adjective", it would have been able to tell that it was an abbreviation and not a known word. There
are however no guarantees that this would have solved the problem completely, but it could at least
have helped, since more abbreviations would have been found by the algorithm. Implementing such a
feature would however put further demands on the already crucial word lists, since they would then
also have to contain POS information.
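The proposed POS-aware lookup could be sketched as follows; the tag names and the miniature lexicon are hypothetical, and a real implementation would additionally require a POS tagger and POS-annotated word lists:

```python
# Sketch of a POS-aware word list lookup. A token only counts as "known"
# if both its surface form and its tagged part of speech match an entry.
KNOWN = {("hö", "noun")}    # "hö" as the noun "hay" (hypothetical entry)

def is_abbreviation_candidate(token: str, pos_tag: str) -> bool:
    return (token.lower(), pos_tag) not in KNOWN

print(is_abbreviation_candidate("hö", "noun"))       # False: known word "hay"
print(is_abbreviation_candidate("hö", "adjective"))  # True: not known with this
                                                     # tag, so likely an abbreviation
```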
Implementing POS information would however not solve the other problem that was identified in the
analysis as connected with the word lists. That problem was due to the fact that the word lists were
far from complete, resulting in words that should have been considered "known words" instead being
treated as unknown words and thus ending up marked as abbreviations. Attempts
could be made to solve this problem by complementing the word lists with new words, but
such a task would be very difficult, since a full inventory of all "known words" in a domain is nearly
impossible to obtain. If done within a certain limit, however, the performance gained by the algorithm
might be worth the extra work.
A problem that was identified in the analysis as especially problematic in Swedish texts is
compounding. The problem it brings is that abbreviations can be compounded
with ordinary words, resulting in tokens surpassing the upper word length limit and thus being filtered
out by the algorithm. Correctly identifying these kinds of compounded abbreviations is no easy task.
Increasing the maximum word length could solve it, but that would only bring about new
problems, such as compounds of ordinary words not being recognized by the word lists and thus being
labeled as abbreviations. The much more complicated solution to the problem, but perhaps the only
viable one, would be to use an algorithm that separates compounded words (preferably one that already
exists, such as Compound Splitter12). How such an algorithm would perform on compounded
abbreviations is however another problem and will not be covered here.
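A naive version of such a compound-handling step could look like the sketch below: try to split an over-long token into a known-word prefix and a remainder, and treat the remainder as an abbreviation candidate. The word lists here are tiny illustrative stand-ins; a dedicated compound splitter would be far more sophisticated:

```python
# Naive compound splitting: find a known-word prefix followed by a known
# abbreviation. Both sets are hypothetical stand-ins for real resources.
WORD_LIST = {"lung", "hjärt"}
ABBREVIATIONS = {"rtg", "ekg"}

def split_compound(token: str):
    token = token.lower()
    # Try the longest possible prefix first
    for i in range(len(token) - 1, 1, -1):
        head, tail = token[:i], token[i:]
        if head in WORD_LIST and tail in ABBREVIATIONS:
            return head, tail
    return None

print(split_compound("lungrtg"))   # ('lung', 'rtg')
print(split_compound("lungor"))    # None: no abbreviation part found
```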
The problems that have been identified and discussed for the algorithm are somewhat limited but
problematic. They would eventually have to be solved in order to increase the performance, which
would be desirable if the algorithm were someday to be used in a live environment. It is however
unclear whether such solutions exist for the algorithm in its current form. A completely new approach
might have to be attempted instead, e.g. using machine learning based methods. Another approach
would be one that is not as dependent on word lists as the algorithm developed in this project. Though
the word lists have their benefits in reducing the complexity of the algorithm, they also have their
share of disadvantages; disadvantages which in the end are hard to get around.

12
http://www.csc.kth.se/tcs/humanlang/tools.html [2012-05-13]

8.3 Conclusion
The results of this project, i.e. the developed algorithm and its evaluation, have been successful in the
sense that the algorithm performs its intended task at a performance level above the baseline algorithms.
Further comparisons were also made with algorithms similar to the one developed in this project. In these
comparisons, the algorithm showed close but not equal performance. It was however also concluded that
the validity of these comparisons was hard to establish, due to the different evaluation conditions
between the compared algorithms.

8.3.1 Research Question, Expected Results and Reliability


The following research question was stated in the beginning of this project:
"Can an algorithm be developed that detects abbreviations in Swedish clinical texts, using features
from existing similar algorithms? Further, will the developed algorithm perform at the same level as
these similar algorithms and what changes can be made to the algorithm in order for it to perform
better?"
It has been concluded that an algorithm capable of detecting abbreviations has been developed in this
project, with features taken from existing algorithms in similar domains. The performance of the
developed algorithm was established to be close to that of these similar algorithms, though questions
were raised regarding the validity of these results. The analysis and discussion have also brought
suggestions that might improve the performance of the algorithm.
Regarding the expected results, they were correct in the sense that an algorithm was developed that
succeeded in its intended task. The problem with compounded words was also predicted, though the
difficulty of solving it was greatly underestimated. Another fact that was missed was how
dependent the algorithm would come to be on external resources such as the word lists. This would
however have been a difficult prediction to make, since the exact structure of the algorithm had not
been decided at that stage.
The reliability of the results presented in this report must be considered high. The algorithm was
evaluated with a number of abbreviations surpassing those in similar evaluations, such as Xu et
al. (2007). The abbreviations were extracted from authentic Swedish medical records, which should
also strengthen the reliability of the presented results.

8.3.3 Originality and Research Contribution


The originality of this report can be debated, since most of the features of the developed algorithm
were taken from previous research. It is however among the first algorithms of its kind to be
developed and tested for the Swedish clinical domain, which brings a certain degree of originality. The
same can be said about the contributions this project has made to the research field of abbreviation
detection. Even though no revolutionary new methods have been developed for abbreviation detection,
the project has found and highlighted some of the problems in attempting to apply rule based
abbreviation detecting algorithms to Swedish clinical texts.
Lastly, it ought to be discussed whether or not the developed algorithm could be used in a clinical
setting, or as part of a larger system implemented on real clinical texts. As it is now, it could work to
some degree, but it would benefit a lot from having some of the suggested improvements applied first. In
addition, it could stand as a good foundation for further development in the same field,
since much experience has been generated and documented in this report.

8.3.4 Ethical and Societal Consequences


The ethical and societal consequences of this thesis can be discussed from two different aspects. The
direct consequences of the results presented in this thesis can be considered very limited, since
the algorithm itself does not touch upon any ethical or societal aspects. The intended usage of the
algorithm could however have a significant impact. If the algorithm were in the future to be used for
translating abbreviations in medical records to their full length counterparts, it would most likely raise
both societal and ethical questions. The reason for this is that potential flaws in the algorithm could
have severe consequences, with the reader of the translated text for example perceiving the wrong
symptoms and diagnoses. In conclusion, if the algorithm were to be used in
some form of medical setting, more extensive tests, as well as a deeper discussion regarding ethical and
societal aspects, would have to be carried out. But with the algorithm in its current state and form, such
tests and discussions are out of the scope of this report.

8.4 Future Work


The future work that can be conducted from this report can be divided into two directions: either
implementing improvements to the existing algorithm in order to increase its performance, or adding
further functionality to it.

8.4.1 Improvements
There are many improvements that could be made to the algorithm. Since some of them are extensive
problems, each problem alone could be a potential research topic. For example, attempts could be
made to implement part-of-speech information in the abbreviation identification process and see what
impact it has on the algorithm's performance. Solving the problem with compounding could also be
attempted, seeing as it was identified as one of the major hurdles in Swedish clinical texts.
Another suggestion would be to do a deeper analysis of how the word lists affect the performance of
the algorithm. Different word lists than the ones used in this project could be applied in order to see
how they shape the output of the algorithm.
A final suggestion is implementing the algorithm on other types of clinical texts, i.e. not from
emergency departments, to see what (if any) differences can be seen in the performance of the
algorithm.

8.4.2 Extensions
The perhaps most apparent extension to the algorithm would be to replace the found abbreviations
with their full length counterparts. This is however a huge research area, mostly because of the
disambiguation problem of abbreviations, where one abbreviation can translate into multiple full
length versions. Such an extension would therefore have to be contextually aware, which would
probably mean that some machine learning features would have to be implemented.
Another extension to the algorithm would be to add a spell checking algorithm that runs before the
abbreviation detection process is started, since it was found in the analysis of this project that a lot of
misinterpreted tokens were due to misspellings. Preferably, an already existing algorithm would be
used for this task, though it is unclear how such an algorithm would perform on clinical texts.
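As a rough illustration of what such a preprocessing step could do, the sketch below maps near-misses to word-list entries using Python's standard difflib; the word list is an illustrative assumption and difflib is only a stand-in for a proper clinical spell checker:

```python
import difflib

# Tiny illustrative word list; a real one would cover the clinical domain.
WORD_LIST = ["symptom", "diagnos", "patient"]

def correct(token: str) -> str:
    # Map a token to its closest word-list entry, if one is similar enough.
    matches = difflib.get_close_matches(token.lower(), WORD_LIST, n=1, cutoff=0.8)
    return matches[0] if matches else token

print(correct("symtom"))   # "symptom": no longer an unknown token that could
                           # be misread as an abbreviation
print(correct("pat"))      # "pat": unchanged, left for abbreviation detection
```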

References
Adar, E., 2002. S-RAD A Simple and Robust Abbreviation Dictionary. Bioinformatics, 20(4), pp.527-533

Cederblom, S., 2005. Medicinska förkortningar och akronymer. Lund: Studentlitteratur AB.

Dalianis, H., Hassel, M., Velupillai, S., 2009. The Stockholm EPR Corpus - Characteristics and Some Initial
Findings. Proceedings of the 14th International Symposium on Health Information Management Research, pp.
243-249.

Dannélls, D., 2003. Acronym Recognition. Master Thesis. Department of Linguistics, Göteborg University,
Sweden

Gaudan, S., Kirsch, H., Rebholz-Schuhmann, D., 2005. Resolving abbreviations to their senses in Medline.
Bioinformatics, 21, pp. 3658-3664.

Hassel, M., Henriksson, A., Velupillai, S., 2011. Something Old, Something New – Applying a Pre-trained
Parsing Model to Clinical Swedish. Proceedings of NODALIDA`11 - 18th Nordic Conference on
Computational Linguistics, pp. 287-290.

Hevner, A.R., March, S.T., Park, J., Ram, S., 2004. Design Science in Information Systems Research. MIS
Quarterly, 28(1), pp. 75-105.

Johannesson, P., Perjons, E., 2012. A Design Science Primer (draft). [online]. Available at:
<http://dl.dropbox.com/u/1666015/Design%20Science%20Primer-24jan12.pdf> [Accessed 9 June 2012].

Larkey, L.S., Ogilvie, P., Price, M.A., Tamilio, B., 2000. Acrophile: An Automated Acronym Extractor and
Server. Proceedings of the Fifth ACM Conference on Digital Libraries, pp.205-214.

Liu, H., Lussier, Y.A, Friedman, C., 2001. A Study of Abbreviations in UMLS. Proceedings of the AMIA
Symposium 2001, pp.393–397.

Meystre, S.M., Savova, G.K., Kipper-Schuler, K.C., Hurdle, J.F., 2008. Extracting Information from Textual
Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical
Informatics 2008, pp.128-144.

Nadeau, D., Turney, P.D., 2005. A Supervised Learning Approach To Acronym Identification. 8th Canadian
Conference on Artificial Intelligence. pp 319-329.

Pakhomov, S., 2002. Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation
Normalization in Medical Texts. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), pp.160-167.

Park, Y., Byrd, R.J., 2001. Hybrid text mining for finding abbreviations and their definitions. Proceedings of
Empirical Methods in Natural Language Processing 2001, pp.126-133.

Pyper, C., Amery, J., Watson, M., Crook, C., 2004. Patients’ experiences when accessing their on-line electronic
patient records in primary care. British Journal of General Practice, January 2004, pp.38-43.

Ruch, P., Baud, R., Geissbühler, A., 2003. Using lexical disambiguation and named-entity recognition to
improve spelling correction in the electronic patient record. Artificial Intelligence in Medicine, 29, pp.169-
184.

Schwartz, A.S, Hearst, M.A., 2003. A simple algorithm for identifying abbreviation definitions in biomedical
text. Proceedings of the Pacific Symposium on Biocomputing, 8, pp. 451-462.

Skeppstedt, M., Kvist, M., Dalianis, H., 2012. Rule-based Entity Recognition and Coverage of SNOMED CT in
Swedish Clinical Text. Proceedings of the Language Resources Evaluation Conference 2012.

Taghva, K., Gilbreth, J., 1999. Recognizing Acronyms and their Definitions. International Journal on Document
Analysis and Recognition, 1, pp.191-198.

van Rijsbergen, C.J., 1979. Information Retrieval. London: Butterworth-Heinemann.

Wu, Y., Rosenbloom, S.T., Denny, J.C., Miller, R.A., Mani, S., Giuse, D.A., Xu, H., 2011. Detecting
Abbreviations in Discharge Summaries using Machine Learning Methods. AMIA Annual Symposium
Proceedings 2011, pp.1541–1549.

Yeates, S., 1999. Automatic Extraction of Acronyms from Text. Proceedings of the Third New Zealand
Computer Science Research Students' Conference, pp 117-124.

Xu, H., Stetson, P.D., Friedman, C., 2007. A Study of Abbreviations in Clinical Notes. AMIA Annual
Symposium Proceedings 2007, pp.821–825.

9. Appendix A
In this appendix, excerpts from the frequency lists generated by the version of the algorithm using the
improved algorithm, the improved word lists and a maximum word length of 6 are listed. The top
15 of each list, sorted by the number of occurrences, is presented and structured in the following way:
[Number of occurrences] - [Token]

9.1 Correct Tokens - Top 15


329 pat
102 mg
57 dr
53 pga
50 ekg
49 crp
36 ua
35 ev
31 tabl
28 hia-jour
28 pats
27 iv
25 dvt
25 kl
20 t

Total amount found: 1573 (280 unique)

9.2 Incorrect Tokens - Top 15


114 pat
15 t
11 x1
9 1x1
8 var
6 x2
5 dvt
5 enl
5 pats
4 ev
4 rulu
4 vena
3 1x2
3 2x1
3 anna

Total amount incorrect: 1573 (142 unique)

9.3 Missed Tokens - Top 15
24 jour
18 hö
15 el
13 kol
13 mott
12 rtg
12 st
11 troponin t
10 eko
10 elstatus
10 neg
8 alt
7 lab
6 gyn
6 rem

Total amount missed: 121 (154 unique)

9.4 Annotated Tokens - Top 15


325 pat
120 mg
75 jour
73 ekg
56 dr
52 crp
45 hia
39 pga
34 ct
33 ev
32 tabl
30 dvt
28 pats
25 t
24 kl

Total amount annotated: 2050 (335 unique)

Department of Computer and Systems Sciences
Stockholm University
Forum 100
SE-164 40 Kista
Phone: 08 – 16 20 00
www.su.se
