Rad-Former: Structuring Radiology Reports Using Transformers
Ashok Ajad, Taniya Saini, Niranjan Kumar M
Medical and Life Science (AI-CoE), L&T Technology Services
Bangalore, India
(cse.aa9, tania.gagiyan1, niranjankumarm12)@gmail.com
(ashok.ajad, taniya.saini, niranjankumar.m)@ltts.com
Abstract—Several professional societies have advocated for structured reporting in radiology, citing gains in quality, but some studies have shown that rigid templates and strict adherence may be too distracting and lead to incomplete reports. To gain the advantages of structured reporting while requiring minimal change to a radiologist's work-flow, the present work proposes a two-stage abstractive summarization approach that first finds the key findings in an unstructured report and then generates and organizes descriptions of each finding into a given template. The method uses a large manually annotated dataset and a taxonomy and other domain knowledge that were prepared in consultation with several practising radiologists. It can be used to structure reports dictated by radiologists and as post- and pre-processing steps for machine-learning pipelines. On the subtask of label extraction, the method achieves significantly better performance than previous rule-based approaches and learning-based approaches that were trained on automatically extracted labels. On the task of summarization, the method achieves more than 0.5 BLEU-4 score across 8 of the 10 most common labels and serves as a strong baseline for future experiments.

Index Terms—Deep Learning, Radiology, Structuring, Summarization, Transformer, Healthcare.

I. INTRODUCTION

After a patient undergoes an imaging study, a radiologist typically interprets the image and writes a detailed report summarizing her findings for the referring physician; this document is called a radiology report. The structure of reports has been shown to significantly affect their accuracy, completeness and understandability [15], [19], [25]–[27].

A conventional report has no definite structure. However, most radiologists agree that a report should be organized and present findings in a logical sequence. Structured reporting counters the satisfaction-of-search bias, where a radiologist stops looking for additional findings once she makes an initial diagnosis [17]. Structured reporting has been shown to lead to more accurate, complete diagnoses [19], [25]–[27], [29] by having the radiologist systematically consider every anatomical region and type of finding.

However, a rigid template has been shown to distract radiologists from the image, leading to the problem of eye dwell and causing them to miss important findings [15], [24], [29]. Fundamentally, when a radiologist has to adhere to a structure that is not compatible with their thought process and work-flow, they tend to focus on the report more than on the image. Therefore, there is a need for a method to automatically structure an unstructured report.

In their extensive survey, Pavlopoulos et al. [22] present several deep-learning based report generation methods. Most of these use an encoder-decoder architecture where the encoder extracts a high-dimensional representation of the image and the decoder generates unstructured text describing all the abnormalities at once. Evaluating the diagnostic correctness of such reports is a difficult problem, since simple word-overlap based measures do not take negation into account. For example, "no evidence of pneumothorax" would be considered a better match for "evidence of pneumothorax" than "volume loss seen", even though the latter is diagnostically correct and the former is wrong.

A potentially better solution is to first find the abnormalities present in an image and then generate a report describing only those present. In order to train such a pipeline, short abnormality-specific captions are required for each image. However, existing public datasets [6], [14] only contain unstructured text describing all the abnormalities, where a single sentence often describes multiple abnormalities. Therefore, there is a need for a method to decompose such unstructured reports into a set of short abnormality-specific captions.

To address these problems, the proposed method first decomposes a given unstructured report into a set of short captions, one for each abnormality present. These can be used as training data for report generation models. The captions are then organized into a pre-defined template to generate a semi-structured report. The overall algorithm is shown in Figure 1.

Contributions of the present work include
1) An approach to label extraction that combines language-model pre-training, weakly supervised training and supervised fine-tuning, and achieves significantly better performance than previous methods.
2) A novel method of label-specific abstractive summarization of reports that provides a strong baseline for future experiments.
3) A method of automatic structuring of reports that uses the label extraction and summarization components along with other domain knowledge to organize label-specific summaries into a given template.
Fig. 1. The overall algorithm with an example.

II. RELATED WORK

Previous works have explored methods to identify the abnormalities present in a report, the core of which is the problem of negation detection, i.e., determining whether a phrase indicates the presence or absence of an abnormality. Automatic rule-based systems have been used to extract labels from radiology reports. NegEx [2], NegBio [23] and CheXPert [1], [11] are rule-based methods that first use a pre-defined list of phrases to find mentions of abnormalities and then use regular expressions to detect negation of findings. More recently, transformers have been used to label radiology reports. Drozdov et al. [8] trained a transformer on an expert-labelled dataset of 3,856 chest X-ray reports to classify a report into one of three categories: Normal, Abnormal and Unclear. Wood et al. [30] built on top of BioBERT [18] to label MRI radiology reports. Similarly, Smit et al. [28] used a BERT [7] model pre-trained on clinical notes and showed that pre-training the model on labels automatically extracted using the CheXPert labeller improves performance. While these methods require manually annotated reports, CheXPert++ [21] showed that even when trained only on CheXPert labels, the model was able to achieve 99.81% parity with CheXPert, and that clinicians preferred CheXPert++ labels over those of CheXPert.

Other works have also explored the more fine-grained task of detecting all relevant entities and the relationships between them. While some use rule-based methods [10], others use learning-based methods [4], [5], [12]. Jain et al. [12] is the most recent work in this direction, in which the authors created a dataset containing dense entity and relation annotations and trained BERT [7] transformers to extract them. Perhaps the closest to ours in this direction is Wu et al. [31], where a domain-learning assistant tool was used to curate a large lexicon of abnormalities, devices, quality assessments, etc., and a major anatomy (lungs, pleura, etc.) was associated with each abnormality/disease. However, this work lacks a summarization module, which is needed to generate a structured report.

Previous works have also developed methods to generate the impression section from the findings. Zhang et al. [32] use a pointer-generator network which chooses at each time step either to select a word from the vocabulary or to copy a word from the input. Dai et al. [3] use an encoder-decoder transformer for the same task. Other works [16], [20] use an ontology of medical terms to improve on [32]. MacAvaney et al. [20] first use an ontology of medical concepts to generate an augmented sequence containing only the words present in the ontology; the resulting context vector is used to guide the decoding process. Joshi et al. [16] use a pointer-generator network to summarize medical dialogue; however, medical concepts and negation words are given special attention by modifying the attention block, and generation (as opposed to copying) is penalized in the loss function.

Previous work has thus explored both rule-based and neural models for classification and entity recognition, and summarization models to generate the impression section from the findings. To the best of our knowledge, this is the first attempt at solving the complementary problem of generating a structured findings section using unstructured text similar to the impression.

III. DATA

For all experiments, the publicly available MIMIC-CXR dataset [1], [9], [13], [14], which contains 227,835 chest X-ray studies with images and reports, and an in-house CA dataset of 238,631 studies with images and reports, collected from the Columbia-Asia group of hospitals, were used. Each report was first cleaned by removing special symbols such as '[', ';', '*', etc. Duplicate reports in both the MIMIC and CA datasets were dropped, which mostly removed reports with no findings, resulting in 202,917 studies in MIMIC and 87,139 studies in CA. These reports were used to train a language model.

The CheXPert labeller [11] was used to label all 238,631 reports in the in-house CA dataset. The CheXPert labeller finds 14 labels using a pre-defined list of mentions for each label, the occurrence of any of which may indicate the presence of that label. For each of the 33 labels that did not align with the 14 of CheXPert, a list of mentions was defined in consultation with a group of radiologists. This list was iteratively refined over time as non-expert human annotators found potential new mentions. These reports and labels were used to pre-train a classifier on weak labels.

A group of non-expert human annotators then annotated a subset of the MIMIC dataset containing 118,648 reports. The annotators saw a report in which mentions of each label were highlighted in a unique colour and could look up which label was indicated by a mention. For example, CP angle would be highlighted in mild blunting of left CP angle, and this would correspond to Pleural Effusion. For each label mentioned, they had to determine whether it was present or negated and write a grammatically correct sentence summarizing all mentions of the label.

For all our experiments, a randomly sampled 10% of the training dataset was used as a validation set. Additionally, a test set of 1,000 reports, which was initially set aside, was annotated with labels and sentences by the consensus of multiple annotators in consultation with a radiologist. The performance of the labelling and summarization models is reported on this dataset.

IV. APPROACH

A GPT-2-like transformer is first trained using a language-modelling objective. A radiology report corresponding to a chest X-ray study is tokenized and the model is trained to predict the (i + 1)th token given the first i tokens, using the cross-entropy loss.
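The pre-training objective above is shifted next-token prediction. A minimal sketch of the loss computation (illustrative only: `logits` stands in for the transformer's per-position outputs, and all names here are assumptions rather than the authors' code):

```python
import numpy as np

def next_token_cross_entropy(logits, token_ids):
    """Average cross-entropy of predicting token i+1 from the first i tokens.

    logits:    (seq_len - 1, vocab_size) model outputs at positions 0..seq_len-2
    token_ids: (seq_len,) the tokenized report
    """
    targets = token_ids[1:]  # shift: position i is scored against token i+1
    z = logits - logits.max(axis=-1, keepdims=True)               # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size, seq_len = 50, 6
tokens = rng.integers(0, vocab_size, size=seq_len)   # stand-in token ids
logits = rng.normal(size=(seq_len - 1, vocab_size))  # stand-in model outputs
loss = next_token_cross_entropy(logits, tokens)
```

An untrained model yields a loss of roughly log(vocab_size); training drives it toward zero.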
Fig. 2. Architecture of the models. A raw report is tokenized and input to the models. The last token's latent embeddings from the final layer are input to the classifier (a). The classifier is trained to extract one or more of 33 labels using the combined language-modelling and classification losses. Given a report and an extracted label, the summarization model (b) is trained to generate a summary of the report specific to the label.

A classifier composed of fully-connected layers is then added to the model such that the inputs to the classifier are the latent embeddings of the last token of the report from the final layer of the transformer (the architecture is shown in Figure 3). This is formulated as a multi-label classification problem and the model is trained on a multi-task loss combining the language-modelling (cross-entropy) loss L_LM and the classification (binary cross-entropy) loss L_CLS, as shown in Equation 1:

L_LC = L_CLS + λ · L_LM    (1)
where λ is a hyper-parameter that can be tuned.
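Equation 1 can be sketched with stand-in values as follows (the probabilities, label vector and λ below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def binary_cross_entropy(probs, labels):
    """Multi-label classification loss L_CLS over predicted label probabilities."""
    eps = 1e-12  # guards against log(0)
    return -(labels * np.log(probs + eps)
             + (1 - labels) * np.log(1 - probs + eps)).mean()

def combined_loss(cls_probs, cls_labels, lm_loss, lam=0.5):
    """L_LC = L_CLS + lambda * L_LM (Equation 1); lam is the tunable hyper-parameter."""
    return binary_cross_entropy(cls_probs, cls_labels) + lam * lm_loss

probs  = np.array([0.9, 0.1, 0.8])   # classifier outputs for 3 of the 33 labels
labels = np.array([1.0, 0.0, 1.0])   # ground-truth multi-hot label vector
loss = combined_loss(probs, labels, lm_loss=2.3, lam=0.5)
```

Setting lam to 0 recovers a pure classification objective; larger values keep the language-modelling behaviour from degrading during classifier training.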
The language model is then fine-tuned on a manually annotated dataset of reports, labels and label-specific report summaries, to generate a summary of the report for each label present (the architecture is shown in Figure 4). This is formulated as a language-modelling problem. Given a report r ∈ R, a set of c extracted labels l = {l_1, l_2, . . . , l_c} and the corresponding label-specific report summaries s = {s_1, s_2, . . . , s_c}, each label l_i is tokenized to l_i = {l_i1, l_i2, . . . } and each summary s_i to s_i = {s_i1, s_i2, . . . }. A label-specific sequence r_i, as in Equation 2, is then generated for each label, where sep is a special separator token.
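Since Equation 2 itself is not reproduced in this excerpt, the sketch below shows one plausible layout of the label-specific sequence r_i; the concatenation order and the `<sep>`/`<eos>` token names are assumptions — the paper only specifies that a special separator token sep is used:

```python
# Hypothetical token names; the paper only specifies a special separator token `sep`.
SEP, EOS = "<sep>", "<eos>"

def label_specific_sequence(report_tokens, label_tokens, summary_tokens):
    """Build one training sequence r_i per extracted label l_i.

    At training time the model sees the whole sequence under the
    language-modelling loss; at inference the prefix up to the second
    separator is the prompt and the summary is generated token by token.
    """
    return report_tokens + [SEP] + label_tokens + [SEP] + summary_tokens + [EOS]

report  = "mild blunting of left cp angle".split()
label   = "pleural effusion".split()
summary = "mild left pleural effusion".split()
seq = label_specific_sequence(report, label, summary)
```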
Label-extraction performance:

CheXPert-EM (U1)    0.836   0.991   0.895   0.977
MS                  0.949   0.970   0.959   0.992
LMPT + MS           0.952   0.972   0.961   0.996
LMPT + WS           0.821   0.977   0.883   0.982
LMPT + WS + MS      0.954   0.983   0.968   0.998

either initially or entirely on labels extracted using CheXPert. The CheXPert labeller is conservative in its usage of mentions, since the inclusion of partially related phrases may lead to reduced precision. For example, although the phrase heart shadow does not indicate Cardiomegaly, the context in which

Summarization performance on the 10 most common labels:

                              MS                                LMPT + MS
Abnormality               N+    BLEU-2/3/4            BLEU-Auto   BLEU-2/3/4            BLEU-Auto
Bronchitis               277    0.778 / 0.739 / 0.683   0.722     0.767 / 0.721 / 0.668   0.703
Calcification            184    0.685 / 0.608 / 0.536   0.560     0.700 / 0.625 / 0.561   0.590
Cardiomegaly             200    0.829 / 0.796 / 0.610   0.815     0.845 / 0.816 / 0.619   0.832
Collapse                 155    0.643 / 0.574 / 0.493   0.524     0.715 / 0.657 / 0.573   0.611
Consolidation            219    0.613 / 0.555 / 0.485   0.519     0.649 / 0.592 / 0.525   0.558
Nodule                   128    0.707 / 0.652 / 0.601   0.612     0.713 / 0.670 / 0.627   0.641
Osseous Lesions          109    0.488 / 0.376 / 0.290   0.336     0.504 / 0.413 / 0.334   0.380
Pleural Effusion         365    0.668 / 0.595 / 0.539   0.545     0.655 / 0.585 / 0.528   0.534
Reticulo Nodular Pattern 148    0.699 / 0.644 / 0.604   0.619     0.739 / 0.687 / 0.640   0.657
Support Devices          122    0.573 / 0.499 / 0.429   0.459     0.597 / 0.512 / 0.439   0.474
Average                   -     0.668 / 0.604 / 0.527   0.571     0.688 / 0.628 / 0.551   0.598

TABLE IV
Example of report structuring. An unstructured report (Input) is input to the classification and summarization models to obtain the predicted annotations, which are organized in a template to form the corresponding structured report (Output).
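The BLEU-2/3/4 figures in the summarization table are standard sentence-level BLEU values. A self-contained sketch of the metric, plus a capped BLEU-Auto variant (illustrative; not the authors' evaluation code):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU-N: clipped n-gram precisions plus brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = (1.0 if len(hypothesis) >= len(reference)
               else math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * geo_mean

def bleu_auto(reference, hypothesis):
    """Cap N at the reference length, mirroring the BLEU-Auto column."""
    return bleu(reference, hypothesis, max_n=min(4, len(reference)))

ref = "no evidence of pleural effusion".split()
exact = bleu(ref, ref)  # 1.0 for an exact match
partial = bleu(ref, "no evidence of pleural effusion seen".split())
```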
The BLEU-Auto score, which is equivalent to BLEU-4 if the smallest reference sentence is 4 or more tokens in length and to BLEU-n if it is n < 4 tokens long, is also shown.

The proposed method achieves more than 0.5 BLEU-4 in all labels excluding Osseous Lesions and Support Devices. Both Osseous Lesions and Support Devices have more mentions than most other labels, which makes the task of generating the correct mention harder. It was also found that the effect of pre-training with a language-modelling objective is more pronounced in summarization than in classification: language-model pre-training significantly improved performance.

C. Structuring

Labels present in a report were extracted using the classifier and given as input to the report summarization model to generate a summary for each label present. These summaries were then organized in a template using prior knowledge to map each summary to a region based on its tokens. A pre-defined normal sentence was used if a region did not contain any findings. Some examples are shown in Table IV.

VII. CONCLUSION

Studies have shown that while structured reporting leads to more accurate, complete diagnoses, it may also be a point of friction in a radiologist's work-flow and lead to reduced speed and quality. The present work proposed and implemented a two-stage approach to report structuring that automatically organizes a report into a tier-2 structure.

The current work serves as a first step toward complete automatic structuring of reports. Methods that further use dense entity-relationship modelling to organize a report into a tier-3 structure show promise for the future.

An important aspect of the problem that was not addressed is the modelling of uncertainty. In order to enhance explainability and improve radiologists' trust in such a system, it is necessary that the generated report captures not only the presence or absence of an abnormality but also the level of uncertainty.

Finally, a major drawback of the deep-learning based methods explored in the literature is that, unlike rule-based methods, the set of mentions cannot be changed or extended without re-annotation and re-training. However, it is sometimes necessary to re-define which abnormality a phrase refers to, to adapt to an individual radiologist, or when new information about an abnormality is discovered.

Methods that can accurately model uncertainty and allow for a re-definition of mentions can make these systems more practical and usable.
REFERENCES

[1] A Ajad, S.G., Sadhwani, K.J.: Cares: Knowledge infused chest x-ray report generation scheme. In: RSNA 2020 - 106th Annual Meeting. RSNA (2020)
[2] Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics 34(5), 301–310 (2001). https://doi.org/10.1006/jbin.2001.1029
[3] Dai, S., Wang, Q., Lyu, Y., Zhu, Y.: BDKG at MEDIQA 2021: System report for the radiology report summarization task. In: Proceedings of the 20th Workshop on Biomedical Language Processing. pp. 103–111. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.bionlp-1.11
[4] Datta, S., Si, Y., Rodriguez, L., Shooshan, S.E., Demner-Fushman, D., Roberts, K.: Understanding spatial language in radiology: Representation framework, annotation, and spatial relation extraction from chest x-ray reports using deep learning. Journal of Biomedical Informatics 108, 103473 (Aug 2020). https://doi.org/10.1016/j.jbi.2020.103473
[5] Datta, S., Ulinski, M., Godfrey-Stovall, J., Khanpara, S., Riascos-Castaneda, R.F., Roberts, K.: Rad-SpatialNet: A frame-based resource for fine-grained spatial relations in radiology reports. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 2251–2260. European Language Resources Association, Marseille, France (May 2020), https://aclanthology.org/2020.lrec-1.274
[6] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (Jul 2015). https://doi.org/10.1093/jamia/ocv080
[7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)
[8] Drozdov, I., Forbes, D., Szubert, B., Hall, M., Carlin, C., Lowe, D.J.: Supervised and unsupervised language modelling in chest x-ray radiological reports. PLoS One 15(3) (2020). https://doi.org/10.1371/journal.pone.0229963
[9] Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.K., Stanley, H.E.: PhysioBank, PhysioToolkit, and PhysioNet. Circulation 101(23) (Jun 2000). https://doi.org/10.1161/01.cir.101.23.e215
[10] Goryachev, S., Sordo, M., Zeng, Q.T.: A suite of natural language processing tools developed for the i2b2 project. AMIA Annual Symposium Proceedings p. 931 (2006), https://europepmc.org/articles/PMC1839726
[11] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D.A., Halabi, S.S., Sandberg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison (2019)
[12] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., Langlotz, C.P., Rajpurkar, P.: RadGraph: Extracting clinical entities and relations from radiology reports (2021)
[13] Johnson, A.E.W., Pollard, T., Mark, R., Berkowitz, S., Horng, S.: The MIMIC-CXR database (2019). https://doi.org/10.13026/C2JT1Q, https://physionet.org/content/mimic-cxr/
[14] Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., ying Deng, C., Mark, R.G., Horng, S.: MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data 6(1) (Dec 2019). https://doi.org/10.1038/s41597-019-0322-0
[15] Johnson, A.J., Chen, M.Y.M., Swan, J.S., Applegate, K.E., Littenberg, B.: Cohort study of structured reporting compared with conventional dictation 253(1), 74–80 (2009). https://doi.org/10.1148/radiol.2531090138
[16] Joshi, A., Katariya, N., Amatriain, X., Kannan, A.: Dr. Summarize: Global summarization of medical dialogue by exploiting local structures (2020), http://arxiv.org/abs/2009.08666
[17] Lee, C.S., Nagy, P.G., Weaver, S.J., Newman-Toker, D.E.: Cognitive and system factors contributing to diagnostic errors in radiology 201(3), 611–617 (2013). https://doi.org/10.2214/AJR.12.10375
[18] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Sep 2019). https://doi.org/10.1093/bioinformatics/btz682
[19] Lin, E., Powell, D.K., Kagetsu, N.J.: Efficacy of a checklist-style structured radiology reporting template in reducing resident misses on cervical spine computed tomography examinations 27(5), 588–593 (2014). https://doi.org/10.1007/s10278-014-9703-2
[20] MacAvaney, S., Sotudeh, S., Cohan, A., Goharian, N., Talati, I., Filice, R.W.: Ontology-aware clinical abstractive summarization. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1013–1016. ACM (2019). https://doi.org/10.1145/3331184.3331319
[21] McDermott, M.B.A., Hsu, T.M.H., Weng, W.H., Ghassemi, M., Szolovits, P.: CheXpert++: Approximating the CheXpert labeler for speed, differentiability, and probabilistic output (2020)
[22] Pavlopoulos, J., Kougia, V., Androutsopoulos, I., Papamichail, D.: Diagnostic captioning: A survey (2021)
[23] Peng, Y., Wang, X., Lu, L., Bagheri, M., Summers, R., Lu, Z.: NegBio: a high-performance tool for negation and uncertainty detection in radiology reports (2017)
[24] Pradhan, J., Ajad, A., Pal, A.K., Banka, H.: Multi-level colored directional motif histograms for content-based image retrieval. The Visual Computer 36(9), 1847–1868 (2020)
[25] Quattrocchi, C.C., Giona, A., Di Martino, A.C., Errante, Y., Scarciolla, L., Mallio, C.A., Denaro, V., Zobel, B.B.: Extra-spinal incidental findings at lumbar spine MRI in the general population: a large cohort study 4(3), 301–308 (2013). https://doi.org/10.1007/s13244-013-0234-z
[26] Rosskopf, A.B., Dietrich, T.J., Hirschmann, A., Buck, F.M., Sutter, R., Pfirrmann, C.W.A.: Quality management in musculoskeletal imaging: Form, content, and diagnosis of knee MRI reports and effectiveness of three different quality improvement measures 204(5), 1069–1074 (2015). https://doi.org/10.2214/AJR.14.13216
[27] Saini, T., Tripathi, S.: Predicting tags for stack overflow questions using different classifiers. In: 2018 4th International Conference on Recent Advances in Information Technology (RAIT). pp. 1–5 (2018). https://doi.org/10.1109/RAIT.2018.8389059
[28] Smit, A., Jain, S., Rajpurkar, P., Pareek, A., Ng, A.Y., Lungren, M.P.: CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT (2020)
[29] Wankhade, M., Rao, A.C.S., Kulkarni, C.: A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review pp. 1–50 (2022)
[30] Wood, D.A., Lynch, J., Kafiabadi, S., Guilhem, E., Busaidi, A.A., Montvila, A., Varsavsky, T., Siddiqui, J., Gadapa, N., Townend, M., Kiik, M., Patel, K., Barker, G., Ourselin, S., Cole, J.H., Booth, T.C.: Automated labelling using an attention model for radiology reports of MRI scans (ALARM) (2020)
[31] Wu, J.T., Syed, A., Ahmad, H., Pillai, A., Gur, Y., Jadhav, A., Gruhl, D., Kato, L., Moradi, M., Syeda-Mahmood, T.: AI accelerated human-in-the-loop structuring of radiology reports 2020, 1305–1314 (2021), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8075499/
[32] Zhang, Y., Ding, D.Y., Qian, T., Manning, C.D., Langlotz, C.P.: Learning to summarize radiology findings. In: Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis. pp. 204–213. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/W18-5623