Niklas Isenius
Abstract
Swedish clinical records are filled with numerous abbreviations that are difficult to understand. The
problem affects everyone from patients reading their own records to medical personnel reading texts
from a clinical domain other than their own. Computer algorithms processing clinical texts also suffer
from this problem, since abbreviations can hide information that is valuable to the algorithm's task.
In this report, a first step is taken towards solving this problem by developing an algorithm that detects
abbreviations in Swedish clinical texts and supplies suggestions for their full-length translations. A
combination of design science and empirical research was used to approach the problem.
The development process started with a review of existing algorithms in similar domains, from which
key features were taken and used as the foundation of a rule-based algorithm. Additions were also
made in order to mitigate problems experienced by previous researchers, as well as new problems
predicted to arise from the specific domain of Swedish clinical texts. The developed algorithm was
tested against assessment sections from 300 medical records from an emergency department of the
Karolinska University Hospital in Stockholm, Sweden. In these tests, the best performing version of
the algorithm achieved an F-measure of 79%, with 76% recall and 81% precision. These results were
compared against a baseline as well as the results of existing similar algorithms. The algorithm was
concluded to be successful in its task, performing better than the baseline tests and close to the similar
algorithms, though questions were raised regarding the validity of these comparisons due to differing
conditions. Problems and flaws in the algorithm were analyzed and discussed, and it was concluded
that many of them were due to the use of word lists.
Keywords
Clinical Text, Medical Records, Abbreviation detection, Text Normalization
Swedish Summary
Swedish clinical texts are filled with a wide range of abbreviations, which can be difficult to
understand. The problem affects everyone from patients who want to read their own records to
personnel from another medical field. Computer algorithms also suffer from the problem of
abbreviations, since they can hide valuable information that the algorithm needs access to.
In this report, a first step is taken towards solving this problem by developing an algorithm that
identifies abbreviations in Swedish clinical text and supplies suggestions for their possible full-length
versions. A combination of design science and empirical research was used to approach the problem.
The development process began with a review of previous research in similar areas, from which some
key features were selected for use in a rule-based algorithm. Additions were also made to the
algorithm to address problems mentioned in the earlier studies, as well as problems predicted to arise
from the use of Swedish clinical texts. The developed algorithm was then tested against assessment
sections from 300 patient records from the emergency department of the Karolinska University
Hospital in Stockholm, Sweden. In these tests, the algorithm achieved an F-measure of 79%, with
76% recall and 81% precision. These results were then compared against baseline values as well as
results from existing similar algorithms. The algorithm proved successful in its task, with a
performance above the baseline values and close to the existing similar algorithms. Questions were
raised, however, regarding the validity of these comparisons due to differing conditions. The
algorithm's flaws and problems were analyzed and discussed, and many of them were found to stem
from the use of word lists.
Keywords
Clinical Text, Patient Records, Abbreviation Detection, Text Normalization
Acknowledgements
The author wishes to extend his deepest thanks to his supervisors, Sumithra Velupillai and Mia
Kvist, for their amazing support and guidance during this project. Many thanks also go out to my
girlfriend and family members for bouncing ideas around, not to mention helping with the proofreading.
Table of Contents
1. Introduction ........................................................................................ 1
1.1 Background .................................................................................................. 1
1.1.1 Health Care Analytics and Modeling ........................................................... 2
1.2 Research Problem ......................................................................................... 2
1.3 Research Question ........................................................................................ 2
1.3.1 Delimitations .......................................................................................... 2
1.4 Expected Results .......................................................................................... 3
1.5 Research Approach ....................................................................................... 3
1.6 Outline ........................................................................................................ 3
2. Terminology ........................................................................................ 5
2.1 Abbreviations and Acronyms .......................................................................... 5
2.2 Clinical and Biomedical texts .......................................................................... 5
2.3 Precision and Recall....................................................................................... 6
2.4 Pattern Matching and Machine Learning ........................................................... 6
3. Related Research................................................................................. 7
3.1 Taghva and Gilbreth (1999) ........................................................................... 7
3.1.1 Method .................................................................................................. 7
3.1.2 Results .................................................................................................. 7
3.2 Yeates (1999) .............................................................................................. 8
3.2.1 Method .................................................................................................. 8
3.2.2 Results .................................................................................................. 8
3.3 Park and Byrd (2001) .................................................................................... 9
3.3.1 Method .................................................................................................. 9
3.3.2 Results .................................................................................................. 9
3.4 Dannélls (2003) ...........................................................................................10
3.4.1 Method .................................................................................................10
3.4.2 Results .................................................................................................10
3.5 Larkey et al. (2004) .....................................................................................10
3.5.1 Method .................................................................................................11
3.5.2 Results .................................................................................................11
3.6 Xu et al. (2007) ...........................................................................................12
3.6.1 Method .................................................................................................12
3.6.2 Results .................................................................................................12
4. Choice of method............................................................................... 14
4.1 Choice of algorithm ......................................................................................14
4.1.1 Constraints ............................................................................................14
4.1.2 Taghva and Gilbreth (1999).....................................................................14
4.1.3 Yeates (1999) ........................................................................................15
4.1.4 Park and Byrd's (2001) ...........................................................................15
4.1.5 Dannélls (2003) .....................................................................................15
4.1.6 Larkey et al. (2004) ...............................................................................15
4.1.7 Xu et al. (2007). ....................................................................................16
4.1.8 Alternative methods ...............................................................................16
4.1.9 Selected method ....................................................................................16
4.1.10 Summary ............................................................................................18
4.1.11 Potential Changes .................................................................................18
4.2 Evaluation ...................................................................................................18
4.3 Ethical considerations ...................................................................................19
5. Application of Method ........................................................................ 21
5.1 Algorithm Development Process .....................................................................21
5.1.1 Algorithm Architecture ............................................................................21
5.2 Algorithm Evaluation Process .........................................................................23
5.2.1 External components ..............................................................................23
5.2.2 Data set ................................................................................................24
5.2.3 Evaluation .............................................................................................24
6. Results .............................................................................................. 26
6.1 Baseline Results...........................................................................................26
6.2 Original Algorithm ........................................................................................26
6.3 Improved Algorithm .....................................................................................27
6.4 Summarized results .....................................................................................28
7. Analysis ............................................................................................. 29
7.1 Algorithm versions .......................................................................................29
7.2 The maximum length variable .......................................................................30
7.3 Further Analysis ...........................................................................................30
7.4 Algorithm Performance .................................................................................31
8. Discussion ......................................................................................... 33
8.1 Algorithm Performance .................................................................................33
8.2 Algorithm Problems ......................................................................................34
8.3 Conclusion ..................................................................................................35
8.3.1 Research Question and Expected Results and Reliability ..............................35
8.3.3 Originality and Research Contribution .......................................................35
8.3.4 Ethical and Societal Consequences ...........................................................36
8.4 Future Work ................................................................................................36
8.4.1 Improvements .......................................................................................36
8.4.2 Extensions ............................................................................36
References ............................................................................................ 38
9. Appendix A ........................................................................................ 40
9.1 Correct Tokens - Top 15 ...............................................................................40
9.2 Incorrect Tokens - Top 15 .............................................................................40
9.3 Missed Tokens - Top 15 ................................................................................41
9.4 Annotated Tokens - Top 15 ...........................................................................41
List of Figures
Figure 1. The intended workflow of the algorithm ............................................................................... 20
Figure 2. The SCAN-Application architecture ...................................................................................... 21
Figure 3. Diagram for the handler class ................................................................................................ 22
Figure 4. Diagram for the Token class .................................................................................................. 22
Figure 5. The algorithm process of the SCAN-application ................................................................... 23
Figure 6. SCAN-Output and the XML format of a detected abbreviation. ........................................... 24
List of Tables
Table 1. Baseline test results. MaxLength implies the allowed maximum token length ...................... 26
Table 2. The results of the original algorithm with the regular word list and a maximum token length
of 3 to 8. ................................................................................................................................................ 27
Table 3. The results of the original algorithm with the improved word list and a maximum token
length of 3 to 8....................................................................................................................................... 27
Table 4. The results of the improved algorithm with the regular word list and a maximum token length
of 3 to 8. ................................................................................................................................................ 28
Table 5. The results of the improved algorithm with the improved word list and a maximum token
length of 3 to 8....................................................................................................................................... 28
Table 6. The results of the top 6 version of the algorithm that performed the best. .............................. 28
Table 7. The evaluation results of the versions of the algorithm using the maximum word length of six.
1. Introduction
1.1 Background
Patient medical records have always played a key role in the modern health care system. As modern
technology advances, it is only natural that we should try to find better and smarter ways to utilize
these records. One way of doing this is to improve the availability of the information stored within
them. This could mean any number of things, from a patient being able to read their own record to
relevant information being identified by computer algorithms and stored in statistical databases. The
problem is that the structure of medical records makes them hard to process, for humans and
computers alike: most of the information is stored in a free-text format, divided into a few simple
categories. Beyond this unstructured nature, there is an even bigger problem that makes medical
records hard to process, namely the unique and complicated way in which they are written.
The content of a medical record is often written under time pressure, which tends to shape the content
accordingly. For example, clinical texts have been shown to contain spelling errors at a frequency of
around 10%, significantly more than in ordinary texts (Ruch et al., 2003). Another consequence of the
time pressure is that clinical texts tend to contain a large number of abbreviations, sometimes to such
an extent that the text starts to look more like telegraphic shorthand. A recent study by Skeppstedt et
al. (2012) shows that in Swedish clinical texts, 14% of all disorders mentioned are named in an
abbreviated form. The use of abbreviations in clinical texts has been shown to be one of the reasons
why patients have difficulties understanding their own medical records (Pyper et al., 2004). To
complicate things even further, the abbreviations used tend to be ad hoc and very localized to a
specific domain, e.g. specialist clinics. As a result, even a person with clinical training might be
unable to interpret some abbreviations, because they are only used in a small subset of the medical
domain. As an illustration of the problem, a Swedish clinical text containing abbreviations could read
like the following:
"ant STEMI. direkt till angio. PCI mot ockluderad proximal LAD med 2 stent. trombektomy. bra
resultat. komplikationsfritt. får reoproinf 12 h"
Understandably, abbreviations do not only complicate things for humans but for computers as well,
since abbreviations hide information that might be interesting for certain algorithms. In Natural
Language Processing (NLP), successfully interpreting abbreviations is a problem that has been
researched extensively. A first step towards solving it is building dictionaries that list abbreviations
and their definitions. Building such dictionaries manually would, however, be an extremely
time-consuming task, especially since it is an ever-expanding area. This is why much effort has been
put into automating the process (Taghva and Gilbreth, 1999; Yeates, 1999; Larkey et al., 2000; Park
and Byrd, 2001; Adar, 2002; Nadeau and Turney, 2005). Many researchers have also focused on
automatic dictionary generation specific to the biomedical domain (Schwartz and Hearst, 2003;
Dannélls, 2006).
Generating abbreviation dictionaries only solves half of the problem of abbreviation normalization,
though. A remaining problem is that abbreviations can be ambiguous, i.e. one abbreviation can
translate into multiple definitions. This is especially a problem in clinical text, where 33% of the
abbreviations used have been shown to be potentially highly ambiguous (Liu et al., 2001). The
disambiguation of abbreviations in clinical and biomedical texts has been approached by several
researchers with promising results (Pakhomov, 2002; Guadan et al., 2005).
1.3.1 Delimitations
In this project, the algorithm will be limited to detecting abbreviations in a clinical text and suggesting
possible full-length translations; i.e., the algorithm will neither replace the abbreviations with their
full-length counterparts nor handle the disambiguation problem of abbreviation normalization.
A further limitation is that the algorithm will only be tested on patient records from emergency
clinics. The reason for this is to get a more accurate result for a specific domain rather than a less
accurate result for a general domain.
1
http://dsv.su.se/forskning/health/ [2012-04-03]
1.4 Expected Results
If features are gathered from algorithms that have proved to be successful on English clinical texts, the
results on Swedish clinical texts should be somewhat successful as well. There are, however, some key
differences between English and Swedish that might affect the results. One example is the frequent
use of compounding in Swedish, which could make the task of abbreviation detection more difficult.
If this turns out to be a problem, additions to the algorithm can hopefully be made to mitigate it. Other
than this, there should not be anything keeping the algorithm from reaching the same performance
levels as existing similar algorithms.
1.6 Outline
The following outline gives a brief overview of this report and what each chapter of it contains:
Chapter 2: The more essential terms of this report are listed and briefly explained.
Chapter 3: Gives an in-depth look into the related research for this project. The methods and results of
every research piece are summarized along with a short introduction.
Chapter 4: The different method choices available for this project are analyzed and compared before
one is finally chosen.
Chapter 5: The application of the method chosen in the previous chapter is described, along with how
the evaluation of the final algorithm was carried out.
Chapter 6: The algorithm is evaluated and the results are presented and compared. Improvements to
the algorithm are motivated and implemented, and the improved version is evaluated and compared to
the original version.
Chapter 7: The results of the algorithm are analyzed and compared to the baseline results and similar
existing algorithms.
Chapter 8: The results from the analysis are discussed, and conclusions about the project and its results
are given, along with some suggestions for future work and improvements.
2. Terminology
In this section some of the more specific terms used in this report will be presented and defined in
order to avoid misunderstandings.
2
http://oxforddictionaries.com/definition/abbreviation?q=abbreviation [2012-03-09]
3
http://oxforddictionaries.com/definition/acronym?q=acronym [2012-03-09]
2.3 Precision and Recall
Precision and recall are two terms used to measure the performance of algorithms in pattern
recognition and information retrieval. One definition of the two terms is given by van Rijsbergen
(1979), who defines recall as "... the proportion of relevant material actually retrieved in answer to a
search request" and precision as "... the proportion of retrieved material that is actually relevant".
Precision and recall are often presented as percentages. Recall is calculated by dividing the number of
correct elements found by the total number of sought elements in the data set. Precision is calculated
by dividing the number of correct elements found by the total number of elements found.
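As an illustration (not part of the original report), the two measures, together with the F-measure that combines them, can be computed as follows; the function name and inputs are hypothetical:

```python
def precision_recall_f1(found, relevant):
    """Compute precision, recall and F-measure for a set of retrieved
    items against the set of truly relevant (sought) items."""
    found, relevant = set(found), set(relevant)
    correct = len(found & relevant)          # found elements that are correct
    precision = correct / len(found) if found else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)    # harmonic mean of the two
    return precision, recall, f1
```

With the report's figures of 81% precision and 76% recall, the harmonic mean works out to roughly 78.4%, consistent with the F-measure of about 79% quoted in the abstract.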
3. Related Research
In this section the related research for this project is summarized. Each research piece begins with a
short introduction followed by a deeper description of its method and results. Since this project
focuses on abbreviation detection, the research is reviewed only in terms of what is relevant here. For
example, if a research piece is about abbreviation definition extraction, focus will be on the part of the
method that identifies and extracts the acronyms; the methods for finding abbreviation definitions are
dismissed since they are irrelevant to this project.
It should be stated that there are several research articles covering abbreviation identification that are
not mentioned here, for example those by Adar (2002) and Schwartz and Hearst (2003). Both are
based on the assumption that abbreviations are positioned inside or adjacent to parentheses. This
might be valid in biomedical texts, but in clinical texts, which are the domain of this project,
abbreviations are almost never used in such a format.
3.1.1 Method
The method for finding acronym candidates described in the article is a simple one: any word that
consists of capital letters and is between three and ten letters long is considered an acronym, unless it
appears in a list of rejected words supplied to the algorithm, e.g. TABLE or FIGURE. The given
word-length limitation was chosen because the authors considered it the best compromise between
recall and precision. Accepting words with only two letters might increase recall, but it would also
worsen precision, since many incorrect acronym candidates would be accepted. The upper limit of ten
characters was motivated by the fact that very few acronyms are longer than ten letters.
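The candidate rule can be sketched in a few lines. This is an illustrative reconstruction of the rule as described above, not Taghva and Gilbreth's actual code; the contents of the reject list are assumed from their examples:

```python
REJECT = {"TABLE", "FIGURE"}  # example rejected words from the article

def is_acronym_candidate(word, min_len=3, max_len=10, reject=REJECT):
    """Taghva and Gilbreth's candidate rule: a word is an acronym
    candidate if it is all upper-case letters, between three and ten
    characters long, and not in the supplied reject list."""
    return (word.isalpha() and word.isupper()
            and min_len <= len(word) <= max_len
            and word not in reject)
```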
Taghva and Gilbreth's method is described as identifying only acronyms, but nothing in the part of the
method described here stops it from identifying other types of abbreviations as well. The full version
of the method, i.e. the one that includes the identification of definitions, does contain conditions that
filter out abbreviations other than acronyms. Since that part has not been included here, however, one
can consider Taghva and Gilbreth's method an abbreviation-finding algorithm.
3.1.2 Results
Presenting the results of Taghva and Gilbreth's method in terms of finding acronyms is difficult, since
the results given in their report are based on finding both acronyms and their definitions. With their
measurements they reach a recall of 86% and a precision of 98% when the algorithm was tested on a
set of government studies. A drawback of the method, as stated earlier, is that it does not recognize
acronyms of two letters or shorter. The authors re-evaluated the algorithm on test data where
two-letter acronyms had been excluded, which increased the recall to 93%.
3.2 Yeates (1999)
3.2.1 Method
The method used by Yeates to find acronym candidates is closely linked with the process of finding
their corresponding definitions. Because of this, there is no part of the method that is used exclusively
for identifying acronyms in a text. Instead, we have to examine the method for acronym definition
extraction as a whole, to see if there are any useful pieces that can be applied to the domain of our
project.
The method described by Yeates is divided into two steps. The first step is to divide the text into
chunks, where the chunk boundaries are determined by left and right parentheses and punctuation.
Every word is then compared with the chunks before and after itself. If the word matches one of the
chunks, i.e. if the letters of the word match the leading letters of the words in the chunk, the
acronym-definition pair is sent on to the second step of the algorithm.
The second step consists of a set of heuristics that are loosely based upon Yeates' definition of an
acronym. If the acronym-definition pair given from the first step fails to match these heuristics, then
the pair is discarded. The acronym definitions that the heuristics are based upon are the following:
Acronyms are shorter than their definitions
Acronyms contain initials of most of the words in their definitions
Acronyms are given in upper case
Shorter acronyms tend to have longer words in their definition
Longer acronyms tend to have more stop words
As Yeates explains it, the first step of the algorithm is quite forgiving in what counts as an
acronym-definition pair, thus putting a big responsibility on the heuristics to sort out the false pairs.
3.2.2 Results
As in the research by Taghva and Gilbreth (1999), Yeates' results are presented in terms of how many
acronym-definition pairs were found by the algorithm. The results, generated from a sample of ten
computer science technical reports, show a recall of 91% and a precision of 68%. The author gives
some suggestions for improving the algorithm; these are, however, not uniquely tied to the
identification of acronyms and are thus not relevant to this project.
3.3 Park and Byrd (2001)
The motivation behind the report Hybrid Text Mining for Finding Abbreviations and their Definitions
was that the authors, Park and Byrd, were having problems with abbreviations hiding important
keywords from their information extraction algorithms. To alleviate this, Park and Byrd started work
on an algorithm that could automatically extract abbreviations and their definitions into a dictionary,
which could then be used to translate abbreviations into their full-length forms. Unlike the previous
work of Taghva and Gilbreth (1999) and Yeates (1999), Park and Byrd decided on a more refined
method of hybrid text mining as the foundation for their algorithm. Another difference is that Park
and Byrd tried to find definitions of abbreviations in general, not just acronyms, which is a far more
difficult task.
3.3.1 Method
Park and Byrd divide the method for finding abbreviation candidates into two steps. First, a candidate
has to satisfy the following three conditions:
Its first character is alphabetic or numeric
Its length is between 2 and 10 characters
It contains at least one capital letter
If all these conditions are satisfied, the candidate must also meet the following three restrictions in
order to be recognized as an abbreviation.
It is not a known (dictionary) word containing an initial capital letter and appearing as the first word
in a sentence.
It is not a member of a predefined set of person and location names.
It is not a member of user-defined list of stop words.
The restrictions are all in place to strengthen the precision of the algorithm by sorting out false
positives, since the conditions are quite forgiving in what words they accept. To clarify, the
user-defined list of stop words contains all the words that the user wishes to filter out and that are not
covered by the first two restrictions.
3.3.2 Results
Park and Byrd present their results in terms of finding abbreviations and their definitions, i.e. not just
abbreviation detection. The algorithm was tested on three different sets: a book on automotive
engineering, a technical book from a pharmaceutical company, and a collection of NASA press
releases. The results show a minimum performance of 93.9% recall and 97% precision. The authors
discuss the reasons for missed abbreviations, all of which are connected to matching the abbreviations
to their definitions; i.e., none of the missed abbreviations are due to the algorithm failing to identify a
word as an abbreviation. This leads us to believe that the algorithm performs even better than the test
results show in terms of finding abbreviations in a text. It has to be taken into consideration, however,
that we have no information about the different types of abbreviations processed by the algorithm.
Since the structure of clinical abbreviations might differ from the ones processed in these results, the
outcome might be quite different if we were to apply this method to clinical texts.
3.4 Dannélls (2003)
In Acronym Recognition (2003), Dana Dannélls describes the work of developing an algorithm for
detecting acronyms and their definitions in Swedish text, primarily in the biomedical domain, though
she also states that the algorithm should work in more general domains as well.
The reason why Dannélls chooses to focus on Swedish is that the language has inherent difficulties
when it comes to matching acronyms with their definitions that are not found in a language like
English. The problem is that when words are compounded in Swedish, like for example "intensiv"
(intensive), "vård" (care) and "avdelning" (unit), they are often compounded into a single word, in this
case "intensivvårdsavdelning", which has the acronym "IVA". It is obvious why matching this
acronym with its definition is a lot harder than matching its English counterpart, intensive care unit
(ICU).
3.4.1 Method
The method used by Dannélls to identify acronym candidates is a little bit different from previous
examples in the sense that she uses the aid of a part-of-speech (POS) tagger to pre-process the text.
With the additional parameters this brings, she lists the following conditions that have to be met by an
acronym candidate:
The POS tag for the token is either N, Y or X (i.e. noun, abbreviation or foreign word). In case X, it
must consist of at least 2 upper-case letters.
The token is not in the list of noise words, nor names. The list of noise words contains words such
as “by”, “cm” and “ml”. The list of names includes person names as “The-Hung Bui”, “Hans”.
The token does not contain characters such as ’(’, ’)’, ’[’, ’]’, ’=’.
The token must be between 2 and 14 characters long.
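As an illustration, the four conditions above could be sketched in Java as follows. The class and method names, and the contents of the example noise and name lists, are our own assumptions; Dannélls's actual implementation is not available to us.

```java
import java.util.Set;

// Sketch of Dannélls-style acronym-candidate conditions.
// The example noise and name lists are illustrative only.
public class AcronymCandidate {
    static final Set<String> NOISE = Set.of("by", "cm", "ml");
    static final Set<String> NAMES = Set.of("The-Hung Bui", "Hans");

    static long upperCount(String t) {
        return t.chars().filter(Character::isUpperCase).count();
    }

    // posTag: 'N' = noun, 'Y' = abbreviation, 'X' = foreign word
    public static boolean isCandidate(String token, char posTag) {
        // Condition 1: POS tag is N, Y or X; X needs at least 2 upper-case letters
        if (posTag != 'N' && posTag != 'Y' && posTag != 'X') return false;
        if (posTag == 'X' && upperCount(token) < 2) return false;
        // Condition 2: not a noise word and not a name
        if (NOISE.contains(token.toLowerCase()) || NAMES.contains(token)) return false;
        // Condition 3: no bracketing or '=' characters
        if (token.matches(".*[()\\[\\]=].*")) return false;
        // Condition 4: length between 2 and 14 characters
        return token.length() >= 2 && token.length() <= 14;
    }
}
```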
As with previous examples that were likewise limited to acronyms, it is clear that these conditions also
allow abbreviations to be accepted as candidates. It is only in the method's original form, where
definitions are also taken into consideration, that general abbreviations are discarded.
3.4.2 Results
In this report, the author presents separate results for how well the algorithm performs in finding
acronyms. The algorithm was tested on a set of Swedish biomedical texts, where 98% recall and
94% precision were achieved. The algorithm successfully extracted 845 correct acronyms out of 898
possible. The author identifies three reasons why the remaining acronyms were missed:
The acronym consisted of more than one token, e.g. ’PUU N’ or ’P S A’.
The acronym was removed due to its definition string, which might have included symbols and
letters such as ’@’ or ’www’.
Wrong interpretation by the POS tagger.
3.5 Larkey et al. (2004)
Larkey et al. (2004) describe the development of a web server for acronym and abbreviation lookup, where the underlying
database is automatically generated by browsing a large number of web pages. The web server is
developed to handle general acronyms, i.e. not domain-specific acronyms.
3.5.1 Method
Larkey et al. propose three different approaches for finding acronyms in texts, one contextual, one
context/canonical and one simple canonical. The different approaches have some similarities, but also
some key differences in what they accept as a valid acronym. Their definitions are as follows:
Contextual
All letters must be uppercase, they can be lowercase if they are at the end of the word and preceded
by three uppercase letters (e.g. COGSNet) or if the lowercase letters are in the middle of the word
with at least two preceding uppercase letters and at least one following uppercase letter (e.g.
AChemS)
The acronym is allowed to contain punctuations and spaces if they follow every letter (e.g. U.S.A.)
Can contain any number of digits anywhere
Context/Canonical
A letter is allowed to be lowercase if it is preceded and followed with at least one uppercase letter
(e.g. DoD)
The acronym is allowed to end with a lowercase letter if it is an 's' (e.g. USA's)
The acronym is allowed to contain punctuations and spaces if they follow every letter (e.g. U.S.A.)
Slashes and hyphens are allowed in the acronym
Only one digit is allowed
Must be between 2 and 9 characters
Simple Canonical
A letter is allowed to be lowercase if it is preceded and followed with at least one uppercase letter
(e.g. DoD)
Slashes and hyphens are allowed in the acronym
The acronym may not contain digits, periods or spaces.
Must be between 2 and 10 characters
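As a rough illustration, the simple canonical conditions could be approximated with a single regular expression in Java. The pattern below is our own approximation, not Larkey et al.'s implementation, and it tolerates a few edge cases (such as a trailing hyphen) that a stricter rule set would reject.

```java
import java.util.regex.Pattern;

// Approximate regex for the "simple canonical" form: upper-case letters,
// an embedded lower-case letter only between upper-case letters, slashes
// and hyphens allowed, no digits/periods/spaces, 2-10 characters.
public class SimpleCanonical {
    private static final Pattern P =
        Pattern.compile("(?=.{2,10}$)[A-Z](?:[a-z]?[A-Z/-])*");

    public static boolean matches(String token) {
        return P.matcher(token).matches();
    }
}
```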
These three methods for finding acronyms are later matched with different patterns to find acronym
definitions, but as that is of no interest for this project, it will not be described here.
3.5.2 Results
As with many of the other studies, the results of each acronym finding method in Larkey et al. are
given together with their definition finding counterpart, thus making it hard to get a clear picture of
their effectiveness. However, a comparison of the three methods' results should give a reasonably good
measurement of which of them is most effective. According to Larkey et al., the context/canonical
method performed best with a precision of 92% and 84% recall, with the runner up being the
contextual method with 96% precision and 60% recall. The methods are stated to have been developed
for a general domain. The tests, however, were performed on military and governmental web pages only,
which may make the results less general than intended.
3.6 Xu et al. (2007)
3.6.1 Method
Similar to Larkey et al. (2004), Xu et al. describe and test four different methods for abbreviation
detection. The first method is a simple one where each token in a text is tested against two
dictionaries. The first dictionary is an English word list (consisting of 110 573 words) and the second
one is a list of medical terms (consisting of 9721 words). If a token is not in any of these dictionaries it
is considered an unknown word and is therefore treated as an abbreviation.
The second method used by the authors is based on a set of rules devised by examining
several admission notes. The following criteria were formed, and if a token meets one of them it is
considered an abbreviation:
The word contains a special character such as "-" and "."
The word contains less than 6 characters and contains one of the following: a) A mixture of
numeric and alphabetical characters. b) Capital letter(s), but not when only the first letter is
uppercase following a period. c) Lower case letters where the word is not in the English or medical
list.
The second method can be seen as an extension of the first, as it also uses the two
mentioned dictionaries, but adds heuristic rules to make the detection more fine-grained.
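A sketch of the rule-based criteria in Java could look as follows. The class design, the sentenceInitial flag and the example dictionaries are our own assumptions; Xu et al. do not publish their implementation.

```java
import java.util.Set;

// Sketch of Xu et al.'s rule-based abbreviation criteria.
// Dictionary contents and the sentenceInitial flag are illustrative.
public class RuleBasedDetector {
    private final Set<String> english;
    private final Set<String> medical;

    public RuleBasedDetector(Set<String> english, Set<String> medical) {
        this.english = english;
        this.medical = medical;
    }

    // sentenceInitial: true if the token directly follows a period
    public boolean isAbbreviation(String w, boolean sentenceInitial) {
        // Rule 1: special characters such as '-' and '.'
        if (w.contains("-") || w.contains(".")) return true;
        if (w.length() >= 6) return false;
        boolean hasDigit = w.chars().anyMatch(Character::isDigit);
        boolean hasAlpha = w.chars().anyMatch(Character::isLetter);
        // Rule 2a: a mixture of numeric and alphabetical characters
        if (hasDigit && hasAlpha) return true;
        boolean hasUpper = w.chars().anyMatch(Character::isUpperCase);
        boolean onlyInitialUpper = Character.isUpperCase(w.charAt(0))
                && w.substring(1).equals(w.substring(1).toLowerCase());
        // Rule 2b: capital letters, except ordinary sentence-initial capitalization
        if (hasUpper && !(onlyInitialUpper && sentenceInitial)) return true;
        // Rule 2c: lower-case word absent from both dictionaries
        String lower = w.toLowerCase();
        return !english.contains(lower) && !medical.contains(lower);
    }
}
```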
The third and fourth methods are both based upon a decision tree classifier which needs to be trained
with pre-annotated data. The difference between the two methods lies in the features used by the
classifier. The third method uses the following features:
Special characters such as “-“, and “.”
Alphabetic/numeric characters and their combination
Information about upper case and positions in the word
Length of the word
The document frequency of a word
The fourth method uses the same features as method number three, with one added feature: whether the word
is in the previously mentioned dictionaries, i.e. whether it is a known English word or a medical term.
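The feature set could, for example, be encoded as a numeric vector along the following lines. The exact encodings are our own assumptions; Xu et al. do not specify them.

```java
// Sketch of a feature vector for the classifier-based methods three and
// four; the encoding choices are illustrative assumptions.
public class AbbrevFeatures {
    public static double[] extract(String w, double docFrequency, boolean inDictionary) {
        long uppers = w.chars().filter(Character::isUpperCase).count();
        return new double[] {
            w.contains("-") || w.contains(".") ? 1 : 0,          // special characters
            w.chars().anyMatch(Character::isDigit)
                && w.chars().anyMatch(Character::isLetter) ? 1 : 0, // mixed alpha/numeric
            uppers,                                              // number of upper-case letters
            !w.isEmpty() && Character.isUpperCase(w.charAt(0)) ? 1 : 0, // initial capital
            w.length(),                                          // word length
            docFrequency,                                        // document frequency
            inDictionary ? 1 : 0                                 // method four only
        };
    }
}
```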
3.6.2 Results
The four methods for abbreviation detection were tested against a set of admission notes from the New
York Presbyterian Hospital. The method with the highest scores in terms of precision and recall was
the fourth one with 91.4% precision and 80.3% recall. In second place came the second method with
85.4% precision and 83.9% recall. The error analysis performed by the authors showed that their
methods had the greatest problems with abbreviations that were divided into multiple tokens. An example
of this is "ex tol" (exercise tolerance), which was interpreted as two tokens when it should have been
interpreted as one.
4. Choice of method
The research question for the project will be approached by choosing an existing algorithm designed
to detect abbreviations in clinical texts. The development and evaluation of the algorithm poses
several problems that need to be discussed before any further decisions are made. In this section of the
report, the methods and algorithms available for abbreviation detection will be discussed and
compared before one is eventually selected for the project. Appropriate methods for evaluating the
algorithm will also be examined as well as the ethical considerations for the project.
4.1.1 Constraints
Before the different methods can be examined, the constraints under which they will be forced to
operate have to be taken into consideration. One such constraint is the one mentioned in the
previous section: pre-annotated training data is not available. Another constraint, which can be
gathered from a quick examination of a general Swedish clinical text, is that abbreviations are seldom
spelled with capital letters. Therefore, a potential method cannot rely on the assumption that an
abbreviation is spelled with only capital letters.
was that they cannot rely on abbreviations being spelled exclusively with capital letters, which makes
this method a lot less viable since it makes this specific assumption.
dictionary words etc. This could be added as an extension though if the decision is made to use this
method.
The method of this project will consequently consist of the following conditions gathered from Xu et
al. (2007) and Dannélls (2003):
The POS tag for the token is either N, Y or X (noun, abbreviation or foreign word). In case X, it
must consist of at least 2 upper-case letters
The word contains a special character such as "-" and "."
The word contains less than 6 characters and contains one of the following: a) A mixture of
numeric and alphabetical characters. b) Capital letter(s), but not when only the first letter is
uppercase following a period. c) Lower case letters where the word is not in the Swedish word list
or medical list.
Xu et al. state in their results that a large share of the missed abbreviations were due to the overly
simplistic tokenizer that they used. For example, the abbreviation "ex tol" (exercise tolerance) was
mistakenly divided into two tokens instead of one, which resulted in a loss of both precision and
recall. The authors do not specify how their tokenizer works, but the tokenizer developed for this
project will have to mitigate the mentioned problem somehow.
From observations of Swedish clinical texts, it is possible to identify problems similar to the one
mentioned by Xu et al. Authors of clinical texts tend to put blank spaces between the characters of an
abbreviation, e.g. "p.g.a." (På Grund Av) becomes "p g a". If the tokenizer were to assume
that a blank space is a definitive separator between two tokens, the abbreviation "p g a"
would be missed and interpreted as three incorrect tokens instead.
Having taken into consideration the two problems just mentioned, the tokenization process proposed
for this project will be divided into the following steps:
1. Separate the text into tokens with the single condition that a space marks the beginning of a new
token.
2. When the whole text has been processed, the algorithm will test all of the tokens consisting of a
single character (punctuations will not be counted as a character). If a single character token has
one or more following tokens consisting of only one letter, they will be combined into a single
token.
3. Apply the selected conditions for abbreviation detection.
4. Test all the tokens that are considered abbreviations in step 3. If a token is not in a dictionary of
known medical abbreviations, the tokenizer tests whether there are adjacent tokens that fit the same
profile. If there are, the tokenizer combines these and tests whether the new combined token is in the
dictionary. If it is, the new token is accepted; if not, the token is split back into the original
tokens.
Steps one and three in this process are simple and should not require further explanation. Step two
tries to alleviate the mentioned problem of Swedish clinical texts having blank spaces
inserted into abbreviations. Step four is an attempt to avoid the limitations that Xu et al. mentioned in their
tokenizer. The dictionary of known medical abbreviations is used so that correct abbreviations are not
accidentally combined into faulty tokens. The algorithm and its usage will be somewhat more complicated by
the usage of this new abbreviation dictionary. However, we deem this a necessary evil in order to
deal with the tokenization problem.
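Step two of the proposed tokenization could be sketched as follows. This is an illustration of our reading of the step, not the project's actual source code; consecutive tokens consisting of a single letter (punctuation not counted) are merged, so that "p g a" survives as one abbreviation candidate.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 2 of the proposed tokenizer: runs of single-letter
// tokens are merged into one token.
public class TokenMerger {
    private static boolean singleLetter(String t) {
        String stripped = t.replaceAll("\\p{Punct}", "");
        return stripped.length() == 1 && Character.isLetter(stripped.charAt(0));
    }

    public static List<String> mergeSingleLetters(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (singleLetter(tokens.get(i)) && i + 1 < tokens.size()
                    && singleLetter(tokens.get(i + 1))) {
                StringBuilder sb = new StringBuilder(tokens.get(i));
                while (i + 1 < tokens.size() && singleLetter(tokens.get(i + 1))) {
                    sb.append(' ').append(tokens.get(++i));
                }
                out.add(sb.toString());
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }
}
```

A token such as "p." also counts as a single letter here, since punctuation is stripped before the length test.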
4.1.10 Summary
The selected method, along with the presented additions can now be summarized. The algorithm to be
developed will hence contain the following steps:
1. Tokenize the text to be processed. This is a two-step process which first divides the text into tokens by
searching for blank spaces. The second step iterates through the tokens and checks whether some single-
character tokens should be combined.
2. Identify which of the given tokens could be considered abbreviations by using the conditions from
4.1.7.
3. Take the tokens considered abbreviations in step 2 and run them against the abbreviation
dictionary. Test if unknown tokens can be combined to form an abbreviation that exists in the
dictionary of known medical abbreviations.
4. Supply each token with the available translations from the abbreviation dictionary. Return the final
result.
The intended workflow of the algorithm is also illustrated in figure 1.
4.2 Evaluation
For the evaluation part of the project there are multiple methods to choose from. In the reviewed
previous research, one method has been used exclusively, namely precision and recall as devised
by van Rijsbergen (1979). Other measures, such as accuracy and the false negative rate, could be
considered for this project, but since the project should be comparable to previous
research in the domain, precision and recall will be the measures selected.
As evaluation data, authentic medical records will be used in order to test our method in as realistic an
environment as possible. These records will be processed manually to identify all abbreviations that
are present in the text. The same records will then be processed by the algorithm whose results will be
compared to the results from the manual identification process. By this comparison, the recall and
precision of the algorithm can then be calculated, generating a value that can be used to measure the
level of performance of the algorithm.
The amount of evaluation data will be based upon the number of abbreviations contained in it. Xu et
al. (2007) used a data set that contained 411 abbreviations and Dannélls (2003) used a data set
containing 898 abbreviations. The data set for this project should therefore preferably contain a minimum of
400 abbreviations in order for fair comparisons to be made against previous studies.
4.3 Ethical considerations4
Since the evaluation part of this project will be done with aid of authentic medical records whose
content is highly sensitive, ethical aspects have to be taken into consideration. The medical records
used will be stripped of personal information in advance, in order to protect the patients’ privacy. In
addition to this, the greatest precautions will have to be taken in order to make sure that none of the
information stored in the medical records is leaked outside the project. Because of this, all of the
sensitive information will be kept in an encrypted format when not in use. As a
further precaution, no external network connections will be allowed to the device where the medical
records are stored, except when the records are in an encrypted state. If data is needed to aid in the
development of the algorithm, a set of made-up data will be created that matches the structure of actual
medical records but does not contain sensitive information.
Before any authentic medical records can be used in this project, the necessary authorization has to be
acquired in order to prove that the author of this project has the approval to use the medical records for
research related purposes.
Lastly, potential examples from the medical records that have to be presented in this report in order to
strengthen an argument will have to be replicated with slightly different content, since the sensitive
parts of the medical records cannot be published in this report.
4. This section only covers the basic aspects of the ethical considerations for this project. More details on
how the ethical aspects were followed are given in chapter 5.
Figure 1. The intended workflow of the algorithm
5. Application of Method
In this section of the report, the application of the method will be presented. The presentation will be
divided into two parts. The first part describes the development of the actual algorithm and the second
part describes how the algorithm was evaluated.
Figure 3. Diagram for the handler class
When the process starts, the first component to be called by the handler is the tokenization process.
This corresponds to step 1 of the algorithm described in 4.1.8. The process first iterates over the text
once to separate it into tokens. Then it iterates through the newly created set of tokens in order to
evaluate which tokens can be combined into one. The tokenization process keeps all the punctuation in
the tokens. The reason for this is that the second step of the algorithm uses the periods as potential
identifiers for abbreviations. Each token is also supplied with its position in the original text, and
its order in the text, so that components later in the process can access that information. The
structure of a token can be seen in figure 4.
In the second step of the algorithm, the tokens are sent to the abbreviation identifying component, which
applies the conditions from 4.1.6 in order to test which tokens can be considered abbreviations. A set
of logical statements is used to test each condition. The component was built so that new
conditions can easily be added or removed if an alteration to the algorithm is deemed necessary. This
could also be used to quickly test how a condition alters the output of the algorithm. As the final
part of the component, each token is tested against the external medical and Swedish word lists to
check whether it is an ordinary word or a medical term and thus should be removed. The component just
described corresponds to step 2 of the algorithm described in 4.1.8.
In the next step of the algorithm, the tokens are sent to the expansion finding component. This component
simply tests whether each token has a known expansion in the external abbreviation dictionary. If so,
the token is marked as known and the expansion is added to the expansion list in the token object. All the
tokens that were not matched against an expansion are sent back to the tokenizer component. There,
adjacent tokens are combined to form new tokens that are tested against the abbreviation
dictionary. If a new combined token is matched against an expansion in the abbreviation dictionary,
the new token is kept; if not, the combined tokens are restored to their former state and are simply
considered unknown abbreviations. After this process, which corresponds to step 3 described in
4.1.8, the whole set of tokens is returned to the algorithm handler, which sends it to the process that
made the call to the findAbb-method. A graphical overview of this entire process is presented in figure
5.
5.2 Algorithm Evaluation Process
The algorithm evaluation process was divided into three parts. One was the acquisition of a suitable
abbreviation dictionary, Swedish word list, medical word list and POS-tagger for the algorithm, i.e.
the external components of the algorithm. The second part was supplying authentic medical records
that had been manually processed by marking all existing abbreviations, i.e. the data set that the
algorithm would be tested against. The last part of the evaluation process was generating the results
and from them performing the actual evaluation of the algorithm.
evaluation, we used "Lars Aronssons svenska ordlista" as the Swedish word list5. It is a free digital
word list consisting of 221,599 Swedish words, which was considered large enough to fit our needs.
One problem with the word list, though, was that in addition to Swedish words it also contained
Swedish abbreviations. This would result in the algorithm recognizing the abbreviations as ordinary
words and thus rejecting them as abbreviation candidates. To mitigate this problem, a list of common
Swedish abbreviations taken from Svenska Akademins Ordbok6 (The Swedish Academy's Book of Words)
was subtracted from the word list.
The medical word list had to be generated specially for this project since no existing Swedish
medical word lists were available. It was generated from the medical dictionary supplied by FASS7,
which is a dictionary of all pharmaceutical drugs distributed in Sweden, provided by
Läkemedelsindustriföreningens Service AB (LIF).
5. http://runeberg.org/words/ [2012-04-11]
6. http://g3.spraakdata.gu.se/saob/ [2012-04-11]
7. http://www.fass.se [2012-04-03]
As the abbreviation dictionary, we used a digital version of the medical abbreviation dictionary
presented in the book "Medicinska förkortningar och akronymer" by Staffan Cederblom (2005). The
dictionary contains all the abbreviations and their definitions that Cederblom had come across at the
time of writing. It should be stated that this is not a complete dictionary, but it was the most complete
one available for the project at the time of the evaluation.
An equally important task for the evaluation process is properly POS-tagging the evaluation data
before it is processed by the algorithm. As stated earlier, the project lacked the resources for training a
POS-tagger for the specific domain of Swedish clinical text. Instead, a pre-trained POS-tagger had to
be selected, and Granska Tagger8, which has been pre-trained for the Swedish language, was deemed
the most suitable. The Granska POS-tagger also had the benefit of having been previously tested on
Swedish clinical records, where it obtained an accuracy of 92.4% (Hassel et al., 2011).
5.2.3 Evaluation
The actual evaluation of the algorithm was performed by a separate Java application. The application
ran the supplied data set through the SCAN-application and automatically compared the results against
manually annotated results, which were supplied via XML-files. An example of both the output of
the SCAN-algorithm and the format of an XML-file representing the manually generated result can be
seen in figure 6.
SCAN-Output:
Span TokenOrder TokenString
199-201 28 sr
XML-Format:
<annotation>
<mention id="abbreviation_annotation_Instance_15" />
<span start="199" end="201" />
<spannedText>SR</spannedText>
</annotation>
Figure 6. SCAN-Output and the XML-format of a detected abbreviation.
8. http://www.csc.kth.se/tcs/humanlang/tools.html [2012-04-03]
9. De-identified implying that names and social security numbers had been removed.
10. The selected data set was used after approval from the Regional Ethical Review Board in Stockholm,
permission number 2009/1742-31/5.
The matching process, i.e. the test to see if the SCAN-output conformed to the results in the XML-
files, was deemed successful if the following logical statement was satisfied:
XML.Span.Start >= SCAN.Span.Start && XML.Span.End <= SCAN.Span.End
In other words, an abbreviation token from the SCAN-output was considered correct if the span from an
XML-element could be matched as a subset of its own span. The reason the XML-span only has to be
a subset of the SCAN-token span, and not a perfect match, is that in the manual annotation process
only the abbreviation part of a word was marked as the actual abbreviation. If, for example, the
compound word “huddr” (made up of the word hud and the abbreviation dr) was found, only the
second part of the word (dr) was marked by the annotator, since the first part (hud) is not an
abbreviation. The SCAN-algorithm, on the other hand, would mark the entire word, so while still a
correct result, the span comparison would result in a mismatch if a perfect match were required.
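The matching test itself is trivial and could be expressed as follows; the class and parameter names are illustrative, not the evaluation application's actual code.

```java
// Sketch of the span-matching test: an annotated XML span counts as a
// hit if it is contained within the span reported by SCAN.
public class SpanMatcher {
    public static boolean matches(int xmlStart, int xmlEnd,
                                  int scanStart, int scanEnd) {
        return xmlStart >= scanStart && xmlEnd <= scanEnd;
    }
}
```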
When the entire data set had been evaluated, a log file was created with the results for each individual
record, as well as the result for the entire data set. In addition to this, the following four types of lists
were also generated:
Frequency list of the correct words found by SCAN
Frequency list of the incorrect words found by SCAN
Frequency list of the missed words by SCAN
Frequency list of all the words found in the manual annotation
6. Results
Already in the initial phase of the evaluation, it became apparent that the selected POS-tagger could
not be used in the project. The reason for this was that the tagger would filter out some tokens (such as
numbers), thus making them invisible to the algorithm. This would also cause the indexing of the
tokens to become incorrect, so that they would not be correctly matched against the manually generated
results in the XML-files. Since no POS information was available to the algorithm after this, the following
condition in the rule set had to be discarded:
The POS tag for the token is either N, Y or X (noun, abbreviation or foreign word). In case X, it
must consist of at least 2 upper-case letters
The results of the algorithm are presented by listing the number of found tokens, the number of correct
tokens and the number of tokens found in the manual revision. Recall and precision are also presented, as
well as the F-Score of the algorithm. The F-Score is a combined value of precision and recall and is
calculated with the following formula (van Rijsbergen, 1979):
F-Score = 2 x (Precision x Recall) / (Precision + Recall)
Version                       Found  Correct  Annotated  Recall  Precision  F-Measure
Original-RegList-maxLength3    1688     1299       2050    0.63       0.76       0.69
Original-RegList-maxLength4    1892     1444       2050    0.70       0.76       0.73
Original-RegList-maxLength5    2065     1485       2050    0.72       0.71       0.72
Original-RegList-maxLength6    2749     1511       2050    0.73       0.68       0.70
Original-RegList-maxLength7    2946     1550       2050    0.75       0.64       0.69
Original-RegList-maxLength8    2649     1576       2050    0.76       0.59       0.67
Table 2. The results of the original algorithm with the regular word list and a maximum token length of 3 to 8.
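As a check, the measures can be recomputed from the raw counts. The sketch below uses the maxLength4 row of table 2 (1892 found, 1444 correct, 2050 annotated) as an example; the class and method names are our own.

```java
// Recomputing recall, precision and F-Score from raw counts:
// recall = correct / annotated, precision = correct / found,
// F = 2PR / (P + R).
public class Measures {
    public static double recall(int correct, int annotated) {
        return (double) correct / annotated;
    }
    public static double precision(int correct, int found) {
        return (double) correct / found;
    }
    public static double fScore(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }
}
```

For the maxLength4 row this yields approximately 0.70 recall, 0.76 precision and 0.73 F-Measure, matching the table.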
When reviewing the frequency lists generated from the first evaluation, it became apparent that the
algorithm missed a lot of ordinary abbreviations that should have been recognized by the rule set.
After a closer inspection of the word lists, they were found to contain numerous abbreviations, which
were consequently ignored by the algorithm. In order for the algorithm to be given a fair evaluation
(since the word lists are considered external components), a list of common Swedish abbreviations
supplied by the Swedish Academy Wordbook11 was subtracted from the word lists. A second evaluation
was then performed with these improved word lists. The result from this evaluation can be seen in
table 3.
11. http://g3.spraakdata.gu.se/saob/foerkortn.shtml [2012-06-09]
The word contains a special character such as "-" and ".", unless the character is ‘-‘ and the
words on both sides are either a numeral or a dictionary word.
With this improved condition, another evaluation of the algorithm was performed. The results using
the regular word lists can be seen in table 4 and the results from using the improved word lists can be
seen in table 5.
7. Analysis
In this section, the results from chapter 6 will be analyzed. The analysis will cover how well the
different versions of the algorithm performed, as well as point out their strengths and
weaknesses. Comparisons will also be made with other algorithms similar to the one developed in this
project in order to get a more nuanced picture of the algorithm's performance.
The first step in the analysis was to examine how the different word lists affected the performance of
the algorithm. The two versions of the word lists used in the evaluation were the regular word lists,
where no alterations had been made, and the improved word lists, where some of the most common
abbreviations had been removed. Not surprisingly, the algorithm performed better in both recall and
precision when the improved word lists were used. The greatest effect could be seen on the recall
value, where an increase of seven percentage points was achieved. The reason for this was that a lot of
abbreviations were incorrectly marked as known words by the flawed regular word lists. With the
improved word lists, these abbreviations were instead correctly identified, which in turn led
to a higher recall percentage. The small increase in precision was simply a side effect of more correct
abbreviations being found by the algorithm.
The second step in the analysis was to compare the results between the original version of the
algorithm and the one with the improvements listed in section 6.3. As table 7 shows, there
was no difference in recall between the two versions, but there was a three percentage point increase in
precision. The reason for this was that the original version of the algorithm treated every word that
contained the "-" character as an abbreviation. The improved version used a more fine-grained rule set,
which tested whether both sides of the "-" character were either numerals or words present in the word
lists. This meant that more incorrect tokens could be filtered out by the improved algorithm,
thus resulting in better precision.
7.2 The maximum length variable
One key variable that was extensively tested during the evaluation was the one deciding how
many letters a token could contain and still be viable as an abbreviation. The default value at the start
of the evaluation was six characters, but the whole range between three and eight characters was also
tested. Surprisingly, it was not the default value of six characters that resulted in the best performance,
but the value of four characters, as can be seen in tables 2 to 5. Despite a small drop in recall, the
value of four characters substantially increased the precision of the algorithm compared to the default
value of six characters. The reason for this increase in performance can be explained by examining
the frequency lists generated during the evaluations. In these lists, one could see that the majority of
correct abbreviations found by the algorithm consisted of four characters or less. If one also
examined the frequency list of the tokens incorrectly taken as abbreviations by the algorithm, the
majority of the list consisted of tokens with more than four characters. Given this pattern, it is clear that
decreasing the allowed maximum length of an abbreviation would filter out many of the incorrect
tokens while still finding close to the same number of correct abbreviations. The pattern is also reflected
in the results for the versions of the algorithm that used a maximum word length greater
than six characters. As the value increases, the recall of the algorithm gets a small boost, but the
precision drops severely.
of the word has been abbreviated. In Swedish texts in general, as well as in Swedish clinical text,
compounding is a much more common phenomenon compared to English texts. This makes the
mentioned problem much more of a nuisance for algorithms operating on Swedish clinical texts than
for algorithms operating on English clinical texts.
The next step in the deeper analysis was to investigate what kinds of tokens were incorrectly
marked as abbreviations by the algorithm. First, there were some abbreviations that actually were
correct, but were listed as incorrect tokens because they were missed in the manual
annotation process. An example of this is the abbreviation "pat", which was sometimes missed simply
because it is used to such an extent that the annotator, by pure habit, saw it as an
ordinary word and not an abbreviation. These misses were, however, relatively few in number and did
not affect the results in any notable way.
Another example of incorrect abbreviations that should not be considered the algorithm's fault is
where the letter "x" was used as a multiplication character (e.g. "1x2 pills daily"). These were not
marked as abbreviations in the manual annotation process, but whether that is correct could be
debated. Again, these "errors" were also relatively few and had little effect on the end results.
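One simple way to keep such "multiplication x" tokens out of the candidate set would be a pattern check before classification. The regular expression below is an assumption for illustration, not part of the original algorithm.

```python
# Sketch of excluding dosage-style tokens such as "1x2" so they are
# never proposed as abbreviations. The pattern is a hypothetical example.
import re

DOSAGE_PATTERN = re.compile(r"^\d+x\d+$", re.IGNORECASE)

def is_dosage_token(token):
    """True for tokens like '1x2' used in dosage instructions."""
    return bool(DOSAGE_PATTERN.match(token))

print(is_dosage_token("1x2"))   # True
print(is_dosage_token("pat"))   # False
```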
Ignoring these "invalid" incorrect tokens, the majority of the tokens that were incorrectly marked as
abbreviations by the algorithm were words that should have been filtered out by the word lists. There
were two reasons why they were not filtered out: either the word was misspelled (e.g. "symtom",
correctly spelled "symptom"), or it simply did not exist in the word lists. In the latter case, most of the
words were either proper names or words unique to the medical domain, both of which are difficult to
fully cover in a word list.
reached F-Measure scores of 76% (87.5% recall and 71.5% precision) and 85% (91.4% recall and
80.3% precision). Compared to these results, the algorithm developed for this project is a little behind
in performance with its 79% F-Measure score (76% recall and 81% precision), but still better than the
lowest score presented by Xu et al. (2007). It can therefore be stated that the performance of the
developed algorithm is close to, but not equal to, that of existing similar algorithms.
8. Discussion
This section discusses the algorithm produced in this project, the results generated, and whether or not
they answer the research question stated at the beginning of the project. The version of the algorithm
discussed in this chapter is the same as in the analysis, i.e. the improved algorithm with the improved
word lists and a maximum word length of four, simply because it was the version of the algorithm that
performed best. Any references in the discussion to "the algorithm" refer to this version, unless
otherwise explicitly stated.
may have resulted in different varieties of abbreviations being present in the two data sets, and the
frequency of abbreviations might have differed as well. It is not clear how big a difference these
domains make, since the variation in data set languages has a greater impact and thus makes the
comparison difficult.
would be desirable if the algorithm were someday to be used in a live environment. It is, however,
unclear whether such solutions exist for the algorithm in its current form. A completely new approach
might have to be attempted instead, e.g. one using machine learning based methods, or one that is not
as dependent on word lists as the algorithm developed in this project. Though the word lists have their
benefits in reducing the complexity of the algorithm, they also have their share of disadvantages,
which in the end are hard to get around.
8.3 Conclusion
The results of this project, i.e. the developed algorithm and its evaluation, have been successful in the
sense that the algorithm performs its intended task at a performance level above the baseline
algorithms. Further comparisons were also made with algorithms similar to the one developed in this
project; in these comparisons, the algorithm showed close but not equal performance. It was also
concluded, though, that the validity of these comparisons was hard to establish due to the different
evaluation conditions between the compared algorithms.
the project has found and highlighted some of the problems in attempting to apply rule based
abbreviation detection algorithms to Swedish clinical texts.
Lastly, it ought to be discussed whether or not the developed algorithm can be used in any clinical
setting, or as part of a larger system implemented on real clinical texts. As it is now, it could work to
some degree, but it would benefit greatly from having some of the suggested improvements applied first. In
addition to this, it could also stand as a good foundation for further development in the same field,
since much experience has been generated and documented in this report.
8.4.1 Improvements
There are many improvements that could be made to the algorithm. Since some of them are extensive
problems, each one alone could be a potential research topic. For example, attempts could be made to
incorporate part-of-speech information in the abbreviation identification process and see what impact
it has on the algorithm's performance. The compounding problem could also be attacked, seeing as it
was identified as one of the major hurdles in Swedish clinical texts.
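One possible starting point for the compounding problem is to check whether an unknown token decomposes into known words before treating it as an abbreviation candidate. The sketch below is a simplified illustration with an invented word list; real Swedish compound splitting would also need linking morphemes (e.g. the "s" in "fogs-s") and a far larger lexicon.

```python
# Minimal sketch of compound splitting: a token is not an abbreviation
# candidate if it can be split into two words from the full-word list.
# Function name and word list are hypothetical examples.

def splits_into_known_words(token, known_words, min_part=2):
    """True if token can be split into two known words."""
    token = token.lower()
    for i in range(min_part, len(token) - min_part + 1):
        if token[:i] in known_words and token[i:] in known_words:
            return True
    return False

known = {"el", "status", "hjärt", "infarkt"}
print(splits_into_known_words("elstatus", known))  # True ("el" + "status")
print(splits_into_known_words("troponin", known))  # False
```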
Another suggestion would be to do a deeper analysis of how the word lists affect the performance of
the algorithm. Different word lists than the ones used in this project could be applied in order to see
how they shape the output of the algorithm.
A final suggestion is implementing the algorithm on other types of clinical texts, i.e. not from
emergency departments, to see what (if any) differences can be seen in the performance of the
algorithm.
8.4.2 Extensions
The perhaps most apparent extension to the algorithm would be to replace the found abbreviations
with their full length counterparts. This is, however, a huge research area, mostly because of the
disambiguation problem, where one abbreviation can translate into multiple full length versions. Such
an extension would therefore have to be contextually aware, which would probably mean that some
machine learning features would have to be implemented.
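A naive, dictionary-based version of this extension makes the disambiguation problem concrete: as soon as one abbreviation maps to several expansions, the lookup alone cannot choose among them. The dictionary entries below are invented examples, not taken from the project's data.

```python
# Hypothetical sketch of the expansion step: look up all known full-length
# candidates for an abbreviation. Where more than one candidate exists,
# context would be needed to pick the right one.

EXPANSIONS = {
    "pat": ["patient", "patologisk"],   # ambiguous: needs context
    "rtg": ["röntgen"],
    "el": ["eller", "elektrolyt"],      # ambiguous: needs context
}

def suggest_expansions(abbreviation):
    """Return all known full-length candidates for an abbreviation."""
    return EXPANSIONS.get(abbreviation.lower().rstrip("."), [])

print(suggest_expansions("pat"))  # ['patient', 'patologisk']
print(suggest_expansions("rtg"))  # ['röntgen']
```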
Another extension to the algorithm would be to add a spell checking algorithm that runs before the
abbreviation detection process starts, since the analysis in this project found that many of the
misinterpreted tokens were due to misspellings. Preferably an already existing algorithm would be
used for this task, though it is unclear how well such an algorithm would work on clinical texts.
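Such a pre-processing step could be sketched with Python's standard-library `difflib` as a stand-in for a real spell checker; the word list and cutoff value here are assumptions for illustration only.

```python
# Rough sketch of spell correction before abbreviation detection:
# map a token to its closest known word, or leave it unchanged so that
# genuine abbreviations still reach the detection step.
import difflib

def correct_spelling(token, known_words, cutoff=0.8):
    """Return the closest known word, or the token unchanged."""
    matches = difflib.get_close_matches(token.lower(), known_words,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else token

known = ["symptom", "patient", "diagnos"]
print(correct_spelling("symtom", known))  # 'symptom'
print(correct_spelling("rtg", known))     # 'rtg' (left for detection)
```

A real deployment would need a clinical lexicon and a more careful cutoff, since an aggressive corrector could "fix" abbreviations into full words before the detector ever sees them.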
References
Adar, E., 2002. S-RAD: A Simple and Robust Abbreviation Dictionary. Bioinformatics, 20(4), pp.527-533.
Cederblom, S., 2005. Medicinska förkortningar och akronymer. Lund: Studentlitteratur AB.
Dalianis, H., Hassel, M., Velupillai, S., 2009. The Stockholm EPR Corpus - Characteristics and Some Initial
Findings. Proceedings of the 14th International Symposium on Health Information Management Research, pp.
243-249.
Dannélls, D., 2003. Acronym Recognition. Master Thesis. Department of Linguistics, Göteborg University,
Sweden.
Gaudan, S., Kirsch, H., Rebholz-Schuhmann, D., 2005. Resolving abbreviations to their senses in Medline.
Bioinformatics, 21, pp.3658-3664.
Hassel, M., Henriksson, A., Velupillai, S., 2011. Something Old, Something New – Applying a Pre-trained
Parsing Model to Clinical Swedish. Proceedings of NODALIDA`11 - 18th Nordic Conference on
Computational Linguistics, pp. 287-290.
Hevner, A.R., March, S.T., Park, J., 2004. Design Science in Information Systems Research. MIS Quarterly,
28(1), pp.75-105.
Johannesson, P., Perjons, E., 2012. A Design Science Primer (draft). [online]. Available at:
<http://dl.dropbox.com/u/1666015/Design%20Science%20Primer-24jan12.pdf> [Accessed 9 June 2012].
Larkey, L.S., Ogilvie, P., Price, M.A., Tamilio, B., 2000. Acrophile: An Automated Acronym Extractor and
Server. Proceedings of the Fifth ACM Conference on Digital Libraries, pp.205-214.
Liu, H., Lussier, Y.A, Friedman, C., 2001. A Study of Abbreviations in UMLS. Proceedings of the AMIA
Symposium 2001, pp.393–397.
Meystre, S.M., Savova, G.K., Kipper-Schuler, K.C., Hurdle, J.F., 2008. Extracting Information from Textual
Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical
Informatics 2008, pp.128-144.
Nadeau, D., Turney, P.D., 2005. A Supervised Learning Approach To Acronym Identification. 8th Canadian
Conference on Artificial Intelligence, pp.319-329.
Pakhomov, S., 2002. Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation
Normalization in Medical Texts. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), pp.160-167.
Park, Y., Byrd, R.J., 2001. Hybrid text mining for finding abbreviations and their definitions. Proceedings of
Empirical Methods in Natural Language Processing 2001, pp.126-133.
Pyper, C., Amery, J., Watson, M., Crook, C., 2004. Patients’ experiences when accessing their on-line electronic
patient records in primary care. British Journal of General Practice, January 2004, pp.38-43.
Ruch, P., Baud, R., Geissbühler, A., 2003. Using lexical disambiguation and named-entity recognition to
improve spelling correction in the electronic patient record. Artificial Intelligence in Medicine, 29, pp.169-
184.
Schwartz, A.S, Hearst, M.A., 2003. A simple algorithm for identifying abbreviation definitions in biomedical
text. Proceedings of the Pacific Symposium on Biocomputing, 8, pp. 451-462.
Skeppstedt, M., Kvist, M., Dalianis, H., 2012. Rule-based Entity Recognition and Coverage of SNOMED CT in
Swedish Clinical Text. Proceedings of the Language Resources Evaluation Conference 2012.
Taghva, K., Gilbreth, J., 1999. Recognizing Acronyms and their Definitions. International Journal on Document
Analysis and Recognition, 1, pp.191-198.
Wu, Y., Rosenbloom, S.T., Denny, J.C., Miller, R.A., Mani, S., Giuse, D.A., Xu, H., 2011. Detecting
Abbreviations in Discharge Summaries using Machine Learning Methods. AMIA Annual Symposium
Proceedings 2011, pp.1541–1549.
Xu, H., Stetson, P.D., Friedman, C., 2007. A Study of Abbreviations in Clinical Notes. AMIA Annual
Symposium Proceedings 2007, pp.821–825.
Yeates, S., 1999. Automatic Extraction of Acronyms from Text. Proceedings of the Third New Zealand
Computer Science Research Students' Conference, pp.117-124.
9. Appendix A
In this appendix, excerpts from the frequency lists generated by the algorithm version using the
improved algorithm, the improved word lists and a maximum word length of 6 are listed. The top
15 of each list, sorted by the number of occurrences, is structured in the following way:
[Number of occurrences] - [Token]
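Lists in this format can be produced with a straightforward frequency count; the token sample below is invented for illustration.

```python
# Minimal sketch of generating a frequency list in the
# "[Number of occurrences] - [Token]" format shown above.
from collections import Counter

tokens = ["jour", "pat", "jour", "rtg", "pat", "jour"]
for token, count in Counter(tokens).most_common(3):
    print(f"{count} - {token}")
# 3 - jour
# 2 - pat
# 1 - rtg
```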
9.3 Missed Tokens - Top 15
24 jour
18 hö
15 el
13 kol
13 mott
12 rtg
12 st
11 troponin t
10 eko
10 elstatus
10 neg
8 alt
7 lab
6 gyn
6 rem
Department of Computer and Systems Sciences
Stockholm University
Forum 100
SE-164 40 Kista
Phone: 08 – 16 20 00
www.su.se