Rad-Former: Structuring Radiology Reports Using Transformers

Ashok Ajad, Taniya Saini, Niranjan Kumar M
Medical and Life Science (AI-CoE), L&T Technology Services
Bangalore India
(cse.aa9, tania.gagiyan1, niranjankumarm12)@gmail.com
(ashok.ajad, taniya.saini, niranjankumar.m)@ltts.com

Abstract—Several professional societies have advocated for structured reporting in radiology, citing gains in quality, but some studies have shown that rigid templates and strict adherence may be too distracting and lead to incomplete reports. To gain the advantages of structured reporting while requiring minimal change to a radiologist's work-flow, the present work proposes a two-stage abstractive summarization approach that first finds the key findings in an unstructured report and then generates and organizes descriptions of each finding into a given template. The method uses a large manually annotated dataset and a taxonomy and other domain knowledge that were prepared in consultation with several practising radiologists. It can be used to structure reports dictated by radiologists and as post- and pre-processing steps for machine-learning pipelines. On the subtask of label extraction, the method achieves significantly better performance than previous rule-based approaches and learning-based approaches that were trained on automatically extracted labels. On the task of summarization, the method achieves more than 0.5 BLEU-4 score across 8 of the 10 most common labels and serves as a strong baseline for future experiments.

Index Terms—Deep Learning, Radiology, Structuring, Summarization, Transformer, Healthcare.

I. INTRODUCTION

After a patient undergoes an imaging study, a radiologist typically interprets the image and writes a detailed report summarizing her findings for the referring physician, which is called a radiology report. The structure of reports has been shown to significantly affect their accuracy, completeness and understandability [15], [19], [25]–[27].

A conventional report has no definite structure. However, most radiologists agree that a report should be organized and present findings in a logical sequence. Structured reporting counters the satisfaction-of-search bias, where a radiologist stops looking for additional findings once she makes an initial diagnosis [17]. Structured reporting has been shown to lead to more accurate, complete diagnoses [19], [25]–[27], [29] by having the radiologist systematically consider every anatomical region and type of finding.

However, a rigid template has been shown to distract radiologists from the image, leading to the problem of eye dwell and causing them to miss important findings [15], [24], [29]. Fundamentally, when a radiologist has to adhere to a structure which is not compatible with their thought process or work-flow, they tend to focus on the report more than the image. Therefore, there is a need for a method to automatically structure an unstructured report.

In their extensive survey, Pavlopoulos et al. [22] review several deep-learning based report generation methods. Most of these use an encoder-decoder architecture where the encoder extracts a high-dimensional representation of the image and the decoder generates unstructured text describing all the abnormalities at once. Evaluating the diagnostic correctness of such reports is a difficult problem since simple word-overlap based measures do not take negation into account. For example, "no evidence of pneumothorax" would be considered a better match for "evidence of pneumothorax" than "volume loss seen", even though the latter is diagnostically correct and the former is wrong.

A potentially better solution is to first find the abnormalities present in an image and then generate a report describing only those present. In order to train such a pipeline, short abnormality-specific captions are required for each image. However, existing public datasets [6], [14] only contain unstructured text describing all the abnormalities, where a single sentence often describes multiple abnormalities. Therefore, there is a need for a method to decompose such unstructured reports into a set of short abnormality-specific captions.

To address these problems, the proposed method first decomposes a given unstructured report into a set of short captions, one for each abnormality present. These can be used as training data to train report generation models. The captions are then organized into a pre-defined template to generate a semi-structured report. The overall algorithm is shown in Figure 1.

Contributions of the present work include:
1) An approach to label extraction that combines language-model pre-training, weakly supervised training and supervised fine-tuning, and achieves significantly better performance than previous methods.
2) A novel method of label-specific abstractive summarization of reports that provides a strong baseline for future experiments.
3) A method of automatic structuring of reports that uses the label extraction and summarization components along with other domain knowledge to organize label-specific summaries into a given template.
Fig. 1. The overall algorithm with an example.

II. RELATED WORK

Previous works have explored methods to identify the abnormalities present in a report, the core of which is the problem of negation detection, i.e., determining whether a phrase indicates the presence or absence of an abnormality. Automatic rule-based systems have been used to extract labels from radiology reports.

NegEx [2], NegBio [23] and CheXPert [1], [11] are rule-based methods that first use a pre-defined list of phrases to find mentions of abnormalities and then use regular expressions to detect negation of findings. More recently, transformers have been used to label radiology reports. Drozdov et al. in [8] trained a transformer on an expert-labelled dataset consisting of 3,856 chest X-Ray reports to classify a report into one of 3 categories: Normal, Abnormal and Unclear. Wood et al. in [30] built on top of BioBERT [18] to label MRI radiology reports. Similarly, Smit et al. in [28] used a BERT [7] model pre-trained on clinical notes and showed that pre-training the model on labels automatically extracted using the CheXPert labeller improves performance. While these methods require manually annotated reports, CheXPert++ [21] showed that even when trained only on CheXPert labels, the model was able to achieve 99.81% parity with CheXPert, and that clinicians preferred CheXPert++ labels over those of CheXPert.

Other works have also explored the more fine-grained task of detecting all relevant entities and the relationships between them. While some use rule-based methods [10], others use learning-based methods [4], [5], [12]. Jain et al. [12] is the most recent work in this direction, in which the authors created a dataset containing dense entity and relation annotations and trained BERT [7] transformers to extract them. Perhaps the closest to ours in this direction is Wu et al. [31], where a domain-learning assistant tool was used to curate a large lexicon of abnormalities, devices, quality assessments, etc., and each abnormality/disease was associated with a major anatomy (lungs, pleura, etc.). However, this work lacks a summarization module, which is needed to generate a structured report.

Previous works have also developed methods to generate the impression section from the findings. Zhang et al. in [32] use a pointer-generator network which chooses at each time step to either select a word from the vocabulary or copy a word from the input. Dai et al. in [3] use an encoder-decoder transformer for the same task. Other works [16], [20] use an ontology of medical terms to improve on [32]. MacAvaney et al. in [20] first use an ontology of medical concepts to generate an augmented sequence containing only the words present in the ontology. The resulting context vector is used to guide the decoding process. Joshi et al. in [16] use a pointer-generator network to summarize medical dialogue. However, medical concepts and negation words are given special attention by modifying the attention block, and generation (as opposed to copying) is penalized in the loss function.

In summary, previous work has explored both rule-based and neural models for classification and entity recognition, and summarization models to generate the impression section from the findings. To the best of our knowledge, this is the first attempt at solving the complementary problem of generating a structured findings section using unstructured text similar to the impression.

III. DATA

For all experiments, two datasets were used: the publicly available MIMIC-CXR dataset [1], [9], [13], [14], which contains 227,835 chest X-Ray studies with images and reports, and an in-house CA dataset of images and reports collected from the Columbia-Asia group of hospitals, containing 238,631 studies. Each report was first cleaned by removing special symbols such as '[', ';', '*', etc. Duplicate reports in both the MIMIC and CA datasets were dropped, which mostly included reports with no findings, resulting in 202,917 studies in MIMIC and 87,139 studies in CA. These reports were used to train a language model.

The CheXPert labeller [11] was used to label all 238,631 reports in the in-house CA dataset. The CheXPert labeller finds 14 labels using a pre-defined list of mentions for each label, the occurrence of any of which may indicate the presence of that label. For each of the 33 labels that did not align with the 14 of CheXPert, a list of mentions was defined in consultation with a group of radiologists. This was iteratively refined over time as non-expert human annotators found potential new mentions. These reports and labels were used to pre-train a classifier on weak labels.

A group of non-expert human annotators then annotated a subset of the MIMIC dataset containing 118,648 reports. The annotators saw a report where mentions of each label were highlighted in a unique color and could look up which label was indicated by a mention. For example, CP angle would be highlighted in mild blunting of left CP angle, and this would correspond to Pleural Effusion. For each label mentioned, they had to determine whether it is present or negated and write a grammatically correct sentence summarizing all mentions of the label.

For all our experiments, a randomly sampled 10% of the training dataset was used as a validation set. Additionally, a test set of 1000 reports, which was initially set aside, was annotated with labels and sentences by the consensus of multiple annotators in consultation with a radiologist. The performance of the labelling and summarization models is reported on this dataset.

IV. APPROACH

Fig. 2. Architecture of the models. A raw report is tokenized and input to the models. The last token's latent embeddings from the final layer are input to the classifier (a). The classifier is trained to extract one or more of 33 labels using the combined language-modelling and classification losses. Given a report and an extracted label, the summarization model (b) is trained to generate a summary of the report specific to the label.

A GPT-2-like transformer is first trained using a language-modelling objective. A radiology report corresponding to a chest X-ray study is tokenized and the model is trained to predict the (i + 1)-th token given the first i tokens, using the cross-entropy loss.

A classifier composed of fully-connected layers is then added to the model such that the inputs to the classifier are the latent embeddings of the last token of the report, from the final layer of the transformer (the architecture is shown in Figure 3). This is formulated as a multi-label classification problem and the model is trained on a multi-task loss combining the language-modelling (cross-entropy) loss L_LM and the classification (binary cross-entropy) loss L_CLS, as shown in Equation 1:

L_LC = L_CLS + λ · L_LM    (1)

where λ is a hyper-parameter that can be tuned.
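As a concrete illustration, the multi-task objective of Equation 1 can be sketched in plain Python. This is a toy, list-based version with function names of our choosing, not the paper's PyTorch implementation:

```python
import math

def lm_loss(token_logits, target_ids):
    """Average next-token cross-entropy over a sequence.

    token_logits: one list of vocabulary logits per predicted position.
    target_ids:   gold token ids, aligned with token_logits.
    """
    total = 0.0
    for logits, gold in zip(token_logits, target_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[gold]          # -log softmax(logits)[gold]
    return total / len(target_ids)

def cls_loss(label_logits, targets):
    """Binary cross-entropy averaged over the labels (multi-label setting)."""
    total = 0.0
    for z, y in zip(label_logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))         # sigmoid
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)

def combined_loss(token_logits, target_ids, label_logits, label_targets, lam=1.0):
    """L_LC = L_CLS + lam * L_LM, as in Equation 1."""
    return cls_loss(label_logits, label_targets) + lam * lm_loss(token_logits, target_ids)
```

With lam = 1 and uniform two-way logits, both terms reduce to log 2, which makes the combination easy to check by hand.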
The language model is then fine-tuned on a manually annotated dataset of reports, labels and label-specific report summaries, to generate a summary of the report for each label present (the architecture is shown in Figure 4). This is formulated as a language-modelling problem. Given a report r ∈ R, a set of c extracted labels l = {l_1, l_2, ..., l_c} and the corresponding label-specific report summaries s = {s_1, s_2, ..., s_c}, each label l_i is tokenized to {l_i1, l_i2, ...} and each summary s_i to {s_i1, s_i2, ...}. A label-specific sequence r_i, as in Equation 2, is then generated for each label, where sep is a special separator token.

r_i = {start, t_1, ..., t_n, sep, l_i1, l_i2, ..., sep, s_i1, s_i2, ..., end}    (2)
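The sequence construction in Equation 2 amounts to simple list concatenation. A minimal sketch, assuming hypothetical marker tokens (the paper does not specify its special-token vocabulary):

```python
# Hypothetical marker tokens; the paper's actual special-token strings are not given.
START, SEP, END = "<start>", "<sep>", "<end>"

def build_training_sequence(report_tokens, label_tokens, summary_tokens):
    """Assemble r_i = {start, t_1..t_n, sep, l_i1.., sep, s_i1.., end} (Equation 2)."""
    return [START, *report_tokens, SEP, *label_tokens, SEP, *summary_tokens, END]

seq = build_training_sequence(
    ["mild", "blunting", "of", "left", "cp", "angle"],   # report tokens
    ["pleural", "effusion"],                             # extracted label
    ["blunting", "of", "cp", "angle", "is", "noted"],    # label-specific summary
)
```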
Fig. 3.

During inference, the set of labels l in the report r is first extracted using the classifier and used to generate the partial sequence r′_i, as in Equation 3, for each l_i ∈ l. These are input to the report summarization model in an auto-regressive manner to obtain the label-specific summaries s_i = {s_i1, s_i2, ...} as completions.

r′_i = {start, t_1, t_2, ..., t_n, sep, l_i1, l_i2, ..., sep}    (3)
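The inference-time prefix of Equation 3 and the auto-regressive completion loop can be sketched as follows; the stub next-token function here merely stands in for the trained summarization model:

```python
START, SEP, END = "<start>", "<sep>", "<end>"   # hypothetical marker tokens

def build_inference_prefix(report_tokens, label_tokens):
    """Assemble r'_i = {start, t_1..t_n, sep, l_i1.., sep} (Equation 3)."""
    return [START, *report_tokens, SEP, *label_tokens, SEP]

def complete(prefix, next_token_fn, max_len=20):
    """Greedy auto-regressive completion: tokens emitted after the final sep
    (up to end) are taken as the label-specific summary s_i."""
    seq = list(prefix)
    for _ in range(max_len):
        tok = next_token_fn(seq)
        if tok == END:
            break
        seq.append(tok)
    return seq[len(prefix):]

# Stub "model" that emits a fixed summary, just to exercise the loop.
canned = iter(["effusion", "is", "noted", END])
summary = complete(
    build_inference_prefix(["report", "text"], ["pleural", "effusion"]),
    lambda seq: next(canned),
)
```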

After the set of labels l present in an unstructured report r is extracted using the classifier and label-specific summaries s_i = {s_i1, s_i2, ...} are obtained for each label using the report summarization model, a structured report can be formed by organizing the summaries in a template. Each summary is scanned to find specific phrases indicating the type of label, for example, mediastinal calcification. Using prior knowledge, each of these phrases is mapped to one of the regions LUNG FIELDS, COSTOPHRENIC ANGLES, HILAR/MEDIASTINAL, CARDIAC SILHOUETTE, DOMES OF DIAPHRAGM, BONY THORAX and ADDITIONAL (for example, mediastinal calcification occurs in the HILAR/MEDIASTINAL region). If no summary corresponds to a particular region, a pre-defined sentence indicating normalcy (for example, Cardiac silhouette is within normal limits) is used.
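This template-filling step can be sketched as a phrase-to-region lookup with per-region fallbacks. The mappings and normalcy sentences below are abbreviated examples, not the paper's full domain-knowledge tables:

```python
REGIONS = ["LUNG FIELDS", "COSTOPHRENIC ANGLES", "HILAR/MEDIASTINAL",
           "CARDIAC SILHOUETTE", "DOMES OF DIAPHRAGM", "BONY THORAX", "ADDITIONAL"]

PHRASE_TO_REGION = {                  # prior knowledge: phrase -> region (examples only)
    "volume loss": "LUNG FIELDS",
    "cp angle": "COSTOPHRENIC ANGLES",
    "mediastinal calcification": "HILAR/MEDIASTINAL",
    "cardiomegaly": "CARDIAC SILHOUETTE",
}

NORMAL_SENTENCE = {                   # pre-defined normalcy sentences (examples only)
    "LUNG FIELDS": "Lung fields are clear.",
    "COSTOPHRENIC ANGLES": "Both CP angles are clear.",
    "HILAR/MEDIASTINAL": "No hilar or mediastinal mass seen.",
    "CARDIAC SILHOUETTE": "Cardiac silhouette is within normal limits.",
    "DOMES OF DIAPHRAGM": "Both domes of diaphragm are normal.",
    "BONY THORAX": "Visualised bony thorax is normal.",
    "ADDITIONAL": "",
}

def structure_report(summaries):
    """Scan each label-specific summary for known phrases, place it in the
    matching region, and fall back to the normalcy sentence otherwise."""
    filled = {region: [] for region in REGIONS}
    for summary in summaries:
        text = summary.lower()
        region = next((reg for phrase, reg in PHRASE_TO_REGION.items()
                       if phrase in text), "ADDITIONAL")
        filled[region].append(summary)
    return {region: " ".join(lines) if lines else NORMAL_SENTENCE[region]
            for region, lines in filled.items()}

report = structure_report(["Right sided volume loss is noted.",
                           "Mediastinal calcification is seen."])
```

Regions with no matching summary receive their pre-defined normal sentence, mirroring the fallback behaviour described above.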
Fig. 4.
V. IMPLEMENTATION

For each experiment, a report r was cropped to the first 256 tokens and padded with padding tokens if |r| < 256. Each token was embedded into a 768-dimensional word-embedding vector. A decoder-only transformer with 12 blocks, each composed of 12-head self-attention, layer normalization and feed-forward layers, was used. The learning rate was linearly warmed to the maximum value over 160 iterations, after which it was decayed using cosine learning-rate decay.

A language model was trained on the combined MIMIC + CA dataset for 20 epochs using a batch-size of 64 and a learning rate of 4e-5. In each experiment, the classifier was trained for 20 epochs with a batch-size of 96 and a learning rate of 6e-5. Uncertain labels were considered to be present. Similarly, the report summarization model was trained for 20 epochs with a batch-size of 80 and a learning rate of 5e-5. A random seed of 42 was set before beginning each training run, for each library which was a source of randomness. A machine with 256 GB of CPU memory, 8 Nvidia GeForce RTX 2080 Ti GPUs, Ubuntu 18.04 and PyTorch 1.7.1 was used for all experiments.

VI. EXPERIMENTS AND RESULTS

TABLE I
ON THE TASK OF CLASSIFICATION. TOP: A COMPARISON OF CHEXPERT [11] EXTENDED WITH NEW MENTIONS (CHEXPERT-EM) AND THE PROPOSED MODELS ON THE 10 MOST COMMON LABELS. BOTTOM: A COMPARISON OF CHEXPERT [11], CHEXBERT [28], CHEXPERT++ [21], CHEXPERT EXTENDED WITH NEW MENTIONS (CHEXPERT-EM) AND THE BEST PROPOSED MODEL (LMPT + WS + MS) ON 5 COMMON LABELS.

Method | Precision | Recall | F1-Score | AUROC
10 LABELS
CheXPert-EM (U1) | 0.836 | 0.991 | 0.895 | 0.977
MS | 0.949 | 0.970 | 0.959 | 0.992
LMPT + MS | 0.952 | 0.972 | 0.961 | 0.996
LMPT + WS | 0.821 | 0.977 | 0.883 | 0.982
LMPT + WS + MS | 0.954 | 0.983 | 0.968 | 0.998
5 LABELS
CheXPert (U1) | 0.895 | 0.761 | 0.804 | 0.871
CheXBert | 0.884 | 0.796 | 0.820 | 0.888
CheXPert++ | 0.891 | 0.760 | 0.800 | 0.870
CheXPert-EM (U1) | 0.865 | 0.987 | 0.915 | 0.978
LMPT+WS+MS | 0.942 | 0.990 | 0.965 | 0.998

TABLE II
NUMBER OF POSITIVE SAMPLES N+ IN THE TEST DATASET AND PERFORMANCE OF OUR BEST CLASSIFIER (LMPT + WS + MS) ON THE 10 MOST COMMON LABELS.

Abnormality | N+ | Precision | Recall | F1-Score | AUROC
Bronchitis | 277 | 0.993 | 0.993 | 0.993 | 1.000
Calcification | 182 | 0.983 | 0.951 | 0.966 | 0.999
Cardiomegaly | 199 | 0.980 | 1.000 | 0.990 | 0.999
Collapse | 155 | 0.927 | 0.987 | 0.956 | 0.996
Consolidation | 218 | 0.973 | 0.982 | 0.977 | 0.999
Nodule | 128 | 0.948 | 0.992 | 0.969 | 0.999
Osseous Lesions | 109 | 0.914 | 0.972 | 0.942 | 0.997
Pleural Effusion | 362 | 0.968 | 0.989 | 0.978 | 0.999
Reticulo Nodular Pattern | 148 | 0.993 | 0.973 | 0.983 | 0.998
Support Devices | 122 | 0.864 | 0.992 | 0.924 | 0.996

A. Classifier

Table I (top) shows the average precision, recall, F1-score and area under the ROC curve over the 10 most common labels of CheXPert-EM (CheXPert [11] with an extended set of mentions) and the proposed methods, where MS refers to manual supervision, LMPT to language-model pre-training and WS to weak supervision. For example, LMPT + WS + MS is a pre-trained language model trained using weak supervision and further fine-tuned on manually annotated data.

It was found that LMPT + WS, which was only trained on labels extracted using the CheXPert-EM labeller, was competitive with it. Comparing LMPT + MS and LMPT + WS + MS, pre-training with weak supervision slightly improved performance. However, comparing LMPT + WS and LMPT + WS + MS, manual supervision led to the largest gain.

CheXPert [11] with the default mentions, and CheXBert [28] and CheXPert++ [21], which were trained on labels extracted using CheXPert, were compared to CheXPert extended with new mentions (CheXPert-EM) and the best performing proposed method (LMPT+WS+MS). Table I (bottom) shows average precision, recall, F1 score and AUROC over the 5 labels that the methods had in common: Consolidation, Cardiomegaly, Collapse (Atelectasis), Pleural Effusion and Support Devices.

It was found that CheXPert++ closely approximates CheXPert, while CheXBert shows some improvement. Although extending the list of mentions (CheXPert-EM) reduced the precision, it significantly improved recall. CheXBert and CheXPert++ similarly have low recall since these are trained either initially or entirely on labels extracted using CheXPert.

The CheXPert labeller is conservative in its usage of mentions since the inclusion of partially related phrases may lead to reduced precision. For example, although the phrase heart shadow does not indicate Cardiomegaly, the context in which it appears in a report can be used to determine whether Cardiomegaly is present. However, the addition of label-specific rules addressing each such context is infeasible. Since the proposed method learns from a large set of manually annotated data, it has better precision even though the mentions (used during weakly supervised training and seen by non-expert annotators) were extended and included partially related phrases.

B. Summarization

BLEU scores of the proposed models MS, a Transformer trained from scratch on manually annotated data, and LMPT+MS, a Transformer first trained as a language model and fine-tuned on manual annotations, were compared for each of the 10 most common labels (Table III). BLEU-2 and BLEU-3 scores are reported along with BLEU-4 scores since some of the manually written summaries were shorter than 4 tokens in length (for example, Cardiomegaly seen). An automatically weighted BLEU score, BLEU-Auto, which is equivalent to
TABLE III
ON THE TASK OF REPORT SUMMARIZATION, A COMPARISON OF THE PERFORMANCE OF A BASELINE MODEL TRAINED ON MANUAL ANNOTATIONS (MS) AND A PRE-TRAINED LANGUAGE MODEL TRAINED ON MANUAL ANNOTATIONS (LMPT + MS).

Abnormality | N+ | BLEU-2/3/4 (MS) | BLEU-Auto (MS) | BLEU-2/3/4 (LMPT + MS) | BLEU-Auto (LMPT + MS)
Bronchitis | 277 | 0.778 / 0.739 / 0.683 | 0.722 | 0.767 / 0.721 / 0.668 | 0.703
Calcification | 184 | 0.685 / 0.608 / 0.536 | 0.56 | 0.700 / 0.625 / 0.561 | 0.59
Cardiomegaly | 200 | 0.829 / 0.796 / 0.610 | 0.815 | 0.845 / 0.816 / 0.619 | 0.832
Collapse | 155 | 0.643 / 0.574 / 0.493 | 0.524 | 0.715 / 0.657 / 0.573 | 0.611
Consolidation | 219 | 0.613 / 0.555 / 0.485 | 0.519 | 0.649 / 0.592 / 0.525 | 0.558
Nodule | 128 | 0.707 / 0.652 / 0.601 | 0.612 | 0.713 / 0.670 / 0.627 | 0.641
Osseous Lesions | 109 | 0.488 / 0.376 / 0.290 | 0.336 | 0.504 / 0.413 / 0.334 | 0.38
Pleural Effusion | 365 | 0.668 / 0.595 / 0.539 | 0.545 | 0.655 / 0.585 / 0.528 | 0.534
Reticulo Nodular Pattern | 148 | 0.699 / 0.644 / 0.604 | 0.619 | 0.739 / 0.687 / 0.640 | 0.657
Support Devices | 122 | 0.573 / 0.499 / 0.429 | 0.459 | 0.597 / 0.512 / 0.439 | 0.474
Average | – | 0.668 / 0.604 / 0.527 | 0.571 | 0.688 / 0.628 / 0.551 | 0.598
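The BLEU-Auto weighting reported in Table III (BLEU-4 unless the reference is shorter than 4 tokens, in which case BLEU-n with n equal to the reference length) can be sketched with a small single-reference BLEU implementation. This illustrates the weighting rule only and is not the paper's exact evaluation code:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n):
    """Single-reference BLEU with uniform weights over 1..max_n grams."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        if overlap == 0:
            return 0.0
        log_p += math.log(overlap / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(log_p)

def bleu_auto(hypothesis, reference):
    """BLEU-Auto: BLEU-4 for references of 4+ tokens, else BLEU-len(reference)."""
    return bleu(hypothesis, reference, min(4, len(reference)))

score = bleu_auto(["cardiomegaly", "seen"], ["cardiomegaly", "seen"])  # short reference -> BLEU-2
```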

TABLE IV
EXAMPLE OF REPORT STRUCTURING. An unstructured report (INPUT) is input to the classification and summarization models to obtain the predicted annotations, which are organized in a template to form the corresponding structured report (OUTPUT).

INPUT

Unstructured report:
RIGHT-SIDED VOLUME LOSS AND CHRONIC PLEURAL THICKENING ANDOR EFFUSION ACCOMPANIED BY A MULTIFOCAL PARENCHYMAL SCARRING AND BRONCHIECTASIS IS SIMILAR TO THE PRIOR RADIOGRAPH MULTIPLE BILATERAL CALCIFIED GRANULOMAS ALSO APPEAR UNCHANGED AS WELL AS A FOCAL AREA OF SCARRING IN THE LEFT UPPER LOBE

Human annotations:
Pleural Effusion: CHRONIC PLEURAL THICKENING IS NOTED
Calcification: BILATERAL CALCIFIED GRANULOMAS ARE NOTED
Collapse: RIGHT SIDED VOLUME LOSS IS NOTED
Bronchiectasis: BRONCHIECTASIS IS NOTED

OUTPUT

Predicted annotations:
Pleural Effusion: CHRONIC PLEURAL THICKENING IS NOTED
Calcification: BILATERAL CALCIFIED GRANULOMAS ARE NOTED
Collapse: RIGHT SIDED VOLUME LOSS IS NOTED
Bronchiectasis: BRONCHIECTASIS IS NOTED

Structured report:
LUNG FIELDS: RIGHT SIDED VOLUME LOSS IS NOTED. BRONCHIECTASIS IS NOTED.
COSTOPHRENIC ANGLES: CHRONIC PLEURAL THICKENING IS NOTED.
HILAR/MEDIASTINAL: NO HILAR OR MEDIASTINAL MASS SEEN.
CARDIAC SILHOUETTE: CARDIAC SILHOUETTE IS WITHIN NORMAL LIMITS.
DOMES OF DIAPHRAGM: BOTH DOMES OF DIAPHRAGM ARE NORMAL.
BONY THORAX: VISUALISED BONY THORAX IS NORMAL.
ADDITIONAL: BILATERAL CALCIFIED GRANULOMAS ARE SEEN.

BLEU-4 if the smallest reference sentence is 4 or more tokens in length, and equivalent to BLEU-n if it is n < 4 tokens long, is also shown.

The proposed method achieves more than 0.5 BLEU-4 scores in all labels excluding Osseous Lesions and Support Devices. Both Osseous Lesions and Support Devices have more mentions than most other labels, which makes the task of generating the correct mention harder. It was also found that the effect of pre-training with a language-modelling objective is more pronounced for summarization than for classification: language-model pre-training significantly improved performance.

C. Structuring

Labels present in a report were extracted using the classifier and given as input to the report summarization model to generate a summary for each label present. These summaries were then organized in a template using prior knowledge to map each summary to a region based on its tokens. A pre-defined normal sentence was used if a region did not contain any findings. Some examples are shown in Table IV.

VII. CONCLUSION

Studies have shown that while structured reporting leads to more accurate, complete diagnoses, it may also be a point of friction in a radiologist's work-flow and lead to reduced speed and quality. The present work proposed and implemented a 2-stage approach to report structuring that automatically organized a report into a tier-2 structure.

The current work serves as a first step toward complete automatic structuring of reports. Methods further using dense entity-relationship modelling to organize a report into a tier-3 structure show promise for the future.

An important aspect of the problem which was ignored is the modelling of uncertainty. In order to enhance explainability and improve radiologists' trust in such a system, it is necessary that the generated report capture not only the presence or absence of an abnormality but also its level of uncertainty.

Finally, a major drawback of the deep-learning based methods that have been explored in the literature is that, unlike rule-based methods, the set of mentions cannot be changed or extended without re-annotation and re-training. However, it is sometimes necessary to re-define which abnormality a phrase refers to, to adapt to an individual radiologist, or when new information about an abnormality is discovered.

Methods that can accurately model uncertainty and allow for a re-definition of mentions can make these systems more practical and usable.
REFERENCES

[1] A Ajad, S.G., Sadhwani, K.J.: Cares: Knowledge infused chest x-ray report generation scheme. In: RSNA 2020 - 106th Annual Meeting. RSNA (2020)
[2] Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics 34(5), 301–310 (2001). https://doi.org/10.1006/jbin.2001.1029
[3] Dai, S., Wang, Q., Lyu, Y., Zhu, Y.: BDKG at MEDIQA 2021: System report for the radiology report summarization task. In: Proceedings of the 20th Workshop on Biomedical Language Processing. pp. 103–111. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.bionlp-1.11
[4] Datta, S., Si, Y., Rodriguez, L., Shooshan, S.E., Demner-Fushman, D., Roberts, K.: Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest x-ray reports using deep learning. Journal of Biomedical Informatics 108, 103473 (Aug 2020). https://doi.org/10.1016/j.jbi.2020.103473
[5] Datta, S., Ulinski, M., Godfrey-Stovall, J., Khanpara, S., Riascos-Castaneda, R.F., Roberts, K.: Rad-SpatialNet: A frame-based resource for fine-grained spatial relations in radiology reports. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 2251–2260. European Language Resources Association, Marseille, France (May 2020). https://aclanthology.org/2020.lrec-1.274
[6] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (Jul 2015). https://doi.org/10.1093/jamia/ocv080
[7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
[8] Drozdov, I., Forbes, D., Szubert, B., Hall, M., Carlin, C., Lowe, D.J.: Supervised and unsupervised language modelling in chest x-ray radiological reports. PLoS One 15(3) (2020). https://doi.org/10.1371/journal.pone.0229963
[9] Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet. Circulation 101(23) (Jun 2000). https://doi.org/10.1161/01.cir.101.23.e215
[10] Goryachev, S., Sordo, M., Zeng, Q.T.: A suite of natural language processing tools developed for the i2b2 project. AMIA Annual Symposium Proceedings p. 931 (2006). https://europepmc.org/articles/PMC1839726
[11] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D.A., Halabi, S.S., Sandberg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison (2019)
[12] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., Langlotz, C.P., Rajpurkar, P.: RadGraph: Extracting clinical entities and relations from radiology reports (2021)
[13] Johnson, A.E.W., Pollard, T., Mark, R., Berkowitz, S., Horng, S.: The MIMIC-CXR database (2019). https://doi.org/10.13026/C2JT1Q, https://physionet.org/content/mimic-cxr/
[14] Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., ying Deng, C., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1) (Dec 2019). https://doi.org/10.1038/s41597-019-0322-0
[15] Johnson, A.J., Chen, M.Y.M., Swan, J.S., Applegate, K.E., Littenberg, B.: Cohort study of structured reporting compared with conventional dictation 253(1), 74–80 (2009). https://doi.org/10.1148/radiol.2531090138
[16] Joshi, A., Katariya, N., Amatriain, X., Kannan, A.: Dr. Summarize: Global summarization of medical dialogue by exploiting local structures (2020). http://arxiv.org/abs/2009.08666
[17] Lee, C.S., Nagy, P.G., Weaver, S.J., Newman-Toker, D.E.: Cognitive and system factors contributing to diagnostic errors in radiology 201(3), 611–617 (2013). https://doi.org/10.2214/AJR.12.10375
[18] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Sep 2019). https://doi.org/10.1093/bioinformatics/btz682
[19] Lin, E., Powell, D.K., Kagetsu, N.J.: Efficacy of a checklist-style structured radiology reporting template in reducing resident misses on cervical spine computed tomography examinations 27(5), 588–593 (2014). https://doi.org/10.1007/s10278-014-9703-2
[20] MacAvaney, S., Sotudeh, S., Cohan, A., Goharian, N., Talati, I., Filice, R.W.: Ontology-aware clinical abstractive summarization. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1013–1016. ACM (2019). https://doi.org/10.1145/3331184.3331319
[21] McDermott, M.B.A., Hsu, T.M.H., Weng, W.H., Ghassemi, M., Szolovits, P.: CheXpert++: Approximating the CheXpert labeler for speed, differentiability, and probabilistic output (2020)
[22] Pavlopoulos, J., Kougia, V., Androutsopoulos, I., Papamichail, D.: Diagnostic captioning: A survey (2021)
[23] Peng, Y., Wang, X., Lu, L., Bagheri, M., Summers, R., Lu, Z.: NegBio: a high-performance tool for negation and uncertainty detection in radiology reports (2017)
[24] Pradhan, J., Ajad, A., Pal, A.K., Banka, H.: Multi-level colored directional motif histograms for content-based image retrieval. The Visual Computer 36(9), 1847–1868 (2020)
[25] Quattrocchi, C.C., Giona, A., Di Martino, A.C., Errante, Y., Scarciolla, L., Mallio, C.A., Denaro, V., Zobel, B.B.: Extra-spinal incidental findings at lumbar spine MRI in the general population: a large cohort study 4(3), 301–308 (2013). https://doi.org/10.1007/s13244-013-0234-z
[26] Rosskopf, A.B., Dietrich, T.J., Hirschmann, A., Buck, F.M., Sutter, R., Pfirrmann, C.W.A.: Quality management in musculoskeletal imaging: Form, content, and diagnosis of knee MRI reports and effectiveness of three different quality improvement measures 204(5), 1069–1074 (2015). https://doi.org/10.2214/AJR.14.13216
[27] Saini, T., Tripathi, S.: Predicting tags for stack overflow questions using different classifiers. In: 2018 4th International Conference on Recent Advances in Information Technology (RAIT). pp. 1–5 (2018). https://doi.org/10.1109/RAIT.2018.8389059
[28] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT (2020)
[29] Wankhade, M., Rao, A.C.S., Kulkarni, C.: A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review pp. 1–50 (2022)
[30] Wood, D.A., Lynch, J., Kafiabadi, S., Guilhem, E., Busaidi, A.A., Montvila, A., Varsavsky, T., Siddiqui, J., Gadapa, N., Townend, M., Kiik, M., Patel, K., Barker, G., Ourselin, S., Cole, J.H., Booth, T.C.: Automated labelling using an attention model for radiology reports of MRI scans (ALARM) (2020)
[31] Wu, J.T., Syed, A., Ahmad, H., Pillai, A., Gur, Y., Jadhav, A., Gruhl, D., Kato, L., Moradi, M., Syeda-Mahmood, T.: AI accelerated human-in-the-loop structuring of radiology reports 2020, 1305–1314 (2021). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075499/
[32] Zhang, Y., Ding, D.Y., Qian, T., Manning, C.D., Langlotz, C.P.: Learning to summarize radiology findings. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis. pp. 204–213. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/W18-5623
