
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 178 (2020) 244–253
www.elsevier.com/locate/procedia

9th International Young Scientist Conference on Computational Science (YSC 2020)

Medical Corpora Comparison Using Topic Modeling


Alevtina A. Shaikina^a, Anastasia A. Funkner^a,*

^a ITMO University, 49 Kronverksky pr., St Petersburg, 197101, Russia

Abstract

Free-form texts from electronic medical records are often used to build predictive models for medical and healthcare processes.
In different medical centers, treatment of patients and other healthcare processes can occur in different ways according to the
hospital's internal protocols, which affects the structure of electronic medical records and the style of free-form texts. The aim of this paper is to compare the content of two medical corpora to understand whether models trained on the first corpus apply to the second. The approach combines topic modeling, topic segmentation, topic cross-segmentation, and a dedicated metric, Text Segmentation Collation, for comparing cross-segmentation results. The results of a word-level analysis of both corpora are also provided. We conclude that each corpus needs different word-level processing and has a specific set of descriptions, which limits the use of predictive models for some diseases.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Young Scientist Conference on Computational Science
Keywords: clinical texts; medical records; medical corpus; topic modeling; topic segmentation; natural language processing

1. Introduction

Electronic medical records (EMRs) are widely used by hospitals to store all data about patients, including their
metadata and a description of the treatment process. Currently, EMR data is often used to build predictive models for
medical and healthcare processes. However, the structure and style of EMRs can differ strongly depending on the
hospital, which makes it hard to use the same predictive models in different medical centers.
For supervised learning models, scientists usually use data with labels manually set by experts. Thus,
if such a model is needed for a new medical center, there are two options: mark up the data of this center

* Corresponding author. Tel.: +7-812-914-59-46.


E-mail address: funkner.anastasia@gmail.com

1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the 9th International Young Scientist Conference on Computational Science
10.1016/j.procs.2020.11.026

and retrain the model, or use a previously trained model. In the second case, time and other resources are saved, but
the quality of predictions can decrease dramatically.
In this paper, we consider corpora of medical texts as input data. Usually, part of the information about
patients and their treatment in EMRs is stored as free text: medical history, anamnesis, surgery protocols, etc. The
style and content of EMR texts can differ greatly between medical centers due to internal routines of the hospital,
unwritten rules of information presentation among doctors, or the specifics of the center itself. In this article, we
aim to compare the content of two medical corpora using methods of topic modeling and segmentation.

2. Related works

Medical texts contain a large amount of useful statistical and temporal information that can be used by a
decision-making system [1] or a doctor to support decisions on further treatment of a patient. However, such
records generally do not possess a clear structure and are typically written in an open format, making them difficult
to analyse. Moreover, in 2001 a significant difference in complexity was shown between medical and everyday-language
corpora, which also complicates the processing of anamneses and discharge summaries [2].
Natural language processing (NLP) techniques are an integral part of working with EMRs, though mostly English-language
data is used for analysis. Currently, EMRs are used for the extraction and analysis of information applicable to the
treatment process. In recent years, a disease classification model using NLP [3] and an automatic system to identify
heart disease risk factors in clinical texts [4] have been developed. At the same time, a variety of medical
text-oriented problems have been solved, such as automated extraction of venous thromboembolism events from radiology
reports [5], disease status identification in EMRs [6], diagnosis detection in short fragments of medical texts [7],
and the search for hepatocellular cancer patients in administrative data and EMRs [8].
Topic modeling is used for thematic search, construction of hierarchical thematic catalogues of document
collections, keyword analysis [9], research on the interests of social network users, quantitative
analysis of large amounts of text [10], and topic segmentation. Topic segmentation of documents, i.e. their division into
smaller fragments of text, each described by a distinct topic, can be useful in tasks where the analysis of large
documents or a search over the contents of a document is required [11].
To compare medical corpora, word frequencies across corpora are most often used, as well as statistical
hypothesis testing based on these frequencies [12–14]. In recent works, topic modeling is used to
compare the contents of several corpora; as a rule, at least one corpus has labels, on which the comparison
metric is based [15,16]. We could not find work that compares several medical corpora. However, there are
other works in this area: a comparison of medical and non-medical corpora [2], the development of a standard for
processing medical text based on three corpora [17], and a comparison of the applicability of NLP tools to different
types of medical documents [18].

3. Background

The topic modeling and segmentation described in this paper form one of the modules of a system being developed
for processing Russian-language medical texts (see Fig. 1). Tools like this are usually developed for the
English language; to the authors' knowledge, there are no such tools for processing medical texts in Russian.
Further details on the modules of the system can be found in [19–22]. The development and testing of the modules
of this system are carried out on data of patients from one medical center: Almazov National Medical Research
Centre, Saint Petersburg, Russia. This work should help us understand whether we can apply the developed modules to
data from other medical institutions.

Fig. 1. Pink blocks indicate modules which are currently under development, green blocks indicate modules which are already developed, and
dashed blocks indicate methods to be developed in the future.

4. Methods

The content-level corpora comparison method includes the following steps:


1. automatic discovery of topics for each corpus with topic modeling;
2. topic segmentation of each corpus with its own discovered topics;
3. topic cross-segmentation of each corpus with the topics discovered for the other corpus;
4. comparison of the cross-segmentation results using the Text Segmentation Collation (TSC) metric.

4.1. Topic modeling

Topic modeling is an NLP method: a statistical text-analysis technique for the automatic
identification of topics in large collections of documents. A topic model determines which topics each document
refers to and which terms describe each topic [23]. Thus, a topic model defines the topics contained in a
collection of text documents.
Before comparing the corpora, it is necessary to understand the main topics within each of them. For a better
modeling result, specific preparation of the data is required, such as removing text contamination (tags and
punctuation marks) and transforming words into their normal form, excluding a predefined list of acronyms.
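As a rough, hypothetical sketch of this cleaning step (the paper normalizes words with pymorphy2; here lemmatization is approximated by lowercasing, and KEEP_ACRONYMS stands in for the paper's unspecified acronym list):

```python
import re

# Hypothetical stand-in for the paper's acronym list (not given in the text).
KEEP_ACRONYMS = {"ЭКГ", "АД", "ЧСС"}

def clean_text(text: str) -> list:
    """Strip tags and punctuation, keep acronyms, lowercase everything else."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove markup tags
    tokens = re.findall(r"\w+", text)         # drop punctuation marks
    cleaned = []
    for tok in tokens:
        if tok.upper() in KEEP_ACRONYMS:
            cleaned.append(tok.upper())       # acronyms excluded from normalization
        else:
            cleaned.append(tok.lower())       # stand-in for lemmatization
    return cleaned

print(clean_text("<b>ЭКГ</b>: Ритм синусовый, ЧСС 72."))
```

In the actual pipeline the lowercasing branch would call a morphological analyzer to obtain each word's normal form.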

In this paper, we use Additive Regularization of Topic Models as the topic modeling technique [24,25]. This
technique allows tuning multiple regularization parameters for automatic identification of the number of topics,
sparsing of topic terms, and filtering of linear and similar topics. The Python module BigARTM is available for
processing big datasets (see details in Section 5.2).

4.2. Topic segmentation

The topic segmentation module was developed as a part of the solution shown in Fig. 1. This module divides the
input text into parts and assigns a topic to each of them. The user submits topics and the corresponding lists of
terms to initialize the module. The segmentation process then splits into:
1. calculation of each topic's weight for each sentence, using the frequencies of occurrence of topic-related
words,
2. correction of topic weights according to the topics found in the previous and next sentences,
3. assignment of a topic to the sentence using its total weight; the topic's popularity in the full text is used
when several topics are equally represented in a single sentence.
As a result, the tool returns a list, each element of which contains one or several consecutive sentences and the
topic that suits them best.
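The three steps above can be sketched in plain Python. This is a simplified illustration, not the module itself: the input format (tokenized sentences, topic-to-terms dict), the neighbor weight of 0.5, and the tie-breaking details are all assumptions.

```python
from collections import Counter

def segment(sentences, topics, neighbor_weight=0.5):
    """Assign a topic label to each sentence (simplified three-step sketch).

    sentences: list of token lists; topics: dict mapping topic name -> term set.
    """
    # Step 1: raw weight of a topic = count of its terms in the sentence.
    raw = []
    for sent in sentences:
        counts = Counter(sent)
        raw.append({t: sum(counts[w] for w in terms) for t, terms in topics.items()})

    # Step 2: blend in the weights of the previous and next sentences.
    adjusted = []
    for i, weights in enumerate(raw):
        blended = dict(weights)
        for j in (i - 1, i + 1):
            if 0 <= j < len(raw):
                for t, v in raw[j].items():
                    blended[t] += neighbor_weight * v
        adjusted.append(blended)

    # Step 3: break ties using each topic's popularity over the full text.
    popularity = Counter()
    for weights in adjusted:
        popularity[max(weights, key=weights.get)] += 1
    labels = []
    for weights in adjusted:
        best = max(weights.values())
        candidates = [t for t, v in weights.items() if v == best]
        labels.append(max(candidates, key=lambda t: popularity[t]))
    return labels
```

Merging consecutive sentences with equal labels (e.g. via itertools.groupby) then yields the returned list of segments.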

4.3. Cross segmentation

To compare the topic segmentation of two corpora, cross-segmentation is applied: using the module
described in Section 4.2 and the results of topic modeling on the first corpus, topic segmentation is performed on the
texts from both the first and the second corpus. The same is done using the results of topic modeling on the second
corpus. We define a metric, Text Segmentation Collation (TSC), to compare the results of topic segmentation:

TSC = (1 / (n − 1)) · Σ_{i=1}^{n−1} SA_i,

where n is the number of sentences in the text, and SA_i (sentence affiliation) is calculated for each sentence using the
following rule:
• SA_i = 1, if both segmentations agree on whether sentence i + 1 belongs to the same segment as sentence i (in both it belongs to the same segment, or in both it belongs to a different one);
• SA_i = 0, otherwise.
Thereby, the TSC metric compares how the text is divided into segments. If the two divisions are identical,
TSC = 1. If the text is divided into segments at completely different positions, TSC = 0. Thus, 0 ≤ TSC ≤ 1.
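As a sketch, the metric can be computed from two per-sentence label sequences (an assumed representation of the two segmentations):

```python
def tsc(seg_a, seg_b):
    """Text Segmentation Collation between two segmentations of one text.

    seg_a, seg_b: one segment label per sentence; SA_i = 1 when both
    segmentations agree on whether sentences i and i + 1 share a segment.
    """
    n = len(seg_a)
    assert n == len(seg_b) and n > 1
    agree = 0
    for i in range(n - 1):
        same_a = seg_a[i] == seg_a[i + 1]
        same_b = seg_b[i] == seg_b[i + 1]
        agree += int(same_a == same_b)   # SA_i
    return agree / (n - 1)

print(tsc([1, 1, 2, 2], [5, 5, 6, 7]))  # 2 of the 3 boundary decisions agree
```

Note that only the positions of segment boundaries matter, not the topic labels themselves, so segmentations built from two different topic models remain comparable.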

5. Results

5.1. Word-level analysis

Two independent cardiovascular centers from different cities in Russia provided us with datasets of EMRs, from which
two corpora of free-form anamneses were collected. An anamnesis is an EMR record which contains a disease history: the
reason for the current hospitalization, when the main disease first occurred, other associated diseases and
comorbidities, the latest medical test results, etc. The first dataset (further Dataset 1) contains almost 50,000 records and
the second one (further Dataset 2) about 14,000 records.
First, we analyze the corpora at the level of words and sentences. Fig. 2 shows violin plots comparing the
character length of records and the number of words and sentences in them. The anamneses of Dataset 1 are on average
longer (see Fig. 2a), while Dataset 2 has more words and sentences in its records (see Fig. 2b, 2c). Dataset 2
contains two types of records, including shorter records with fewer words and sentences. Moreover, this difference can be

explained by the fact that Dataset 2 contains word abbreviations much more often, which shortens the records and
can increase the number of words per sentence. This is worth keeping in mind when using models trained
on Dataset 1: if abbreviations are not processed during model training, then the quality of predictions can
significantly decrease when such a model is applied to Dataset 2. The difference in the number of sentences is small: the
medians are seven and nine sentences for Dataset 1 and Dataset 2, respectively.

Fig. 2. Comparison of Dataset 1 and Dataset 2 by text length (a), number of words (b), and number of sentences without outliers (c).

Secondly, it is necessary to analyse the vocabulary of the corpora. Initially, the words are normalized using the
pymorphy2 module [26] and filtered with the stop-word list from the nltk Python module. Dataset 1 and Dataset 2
contain 68,112 and 19,194 unique non-stop words, respectively; 10,726 of these words are common to both.
However, about 50% of the unique words of each set occur only once: words with errors and typos,
rare terms, and drug names. Table 1 presents the top 15 words by frequency of appearance in each corpus.
Most of the top words are the same in both corpora and give an idea of the anamnesis: the patient's medical history with
dates (year, therapy, treatment, course, min, patient, time, identify), blood pressure measurements (mm, BP, Hg,
increasing), and drug prescriptions (mg, take). However, the top of Dataset 2 contains diseases (atrial fibrillation,
paroxysm), which are only in 52nd and 74th place in Dataset 1. This may indicate the specificity of the medical
center that provided Dataset 2; models trained on Dataset 2 to predict the characteristics of these
diseases may show low accuracy on Dataset 1.
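These vocabulary figures can be reproduced for any pair of corpora with a few lines of standard-library Python (a sketch over already-normalized token lists; the sample tokens below are illustrative, not data from the paper):

```python
from collections import Counter

def vocab_stats(tokens_1, tokens_2):
    """Unique-word counts, overlap, and the share of words seen only once."""
    c1, c2 = Counter(tokens_1), Counter(tokens_2)
    common = set(c1) & set(c2)
    hapax_share = sum(1 for n in c1.values() if n == 1) / len(c1)
    return len(c1), len(c2), len(common), hapax_share

# Illustrative tokens only: "опечтка" plays the role of a typo seen once.
print(vocab_stats(["год", "мг", "мг", "опечтка"], ["год", "лечение"]))
```

The last value, the hapax share, is the quantity reported above as "about 50% of the unique words of each set occur only once".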

Table 1. Top 15 most frequent non-stop words in both corpora.

№ | Dataset 1 (Russian) | Dataset 1 (English) | Count | Dataset 2 (Russian) | Dataset 2 (English) | Count
1 | год | year | 154297 | год | year | 52678
2 | мг | mg | 71221 | мг | mg | 22861
3 | мм | mm | 67244 | принимать | take (a medication) | 19389
4 | ад | blood pressure (BP) | 56484 | лечение | treatment | 14908
5 | рт | Hg | 47678 | ад | blood pressure (BP) | 11244
6 | терапия | therapy | 40412 | пароксизм | paroxysm | 11096
7 | лечение | treatment | 34366 | фп | atrial fibrillation | 9676
8 | экг | ECG | 32077 | экг | ECG | 9467
9 | повышение | increasing (blood pressure) | 31320 | пациент | patient (male) | 9296
10 | принимать | take (a medication) | 24157 | пациентка | patient (female) | 9269
11 | течение | course (of a disease) | 23993 | отмечать | note | 8913
12 | чсс | heart rate | 23253 | время | time | 8701
13 | боль | pain | 22422 | течение | course | 8268
14 | норма | norm | 20385 | выявить | detect | 7945
15 | мин | min (minute) | 19888 | ритм | rhythm | 7907

5.2. Content-level analysis

In this section, the datasets are compared by content, identifying and matching their topics. First, after
initial data processing (normalization; collecting 1-, 2- and 3-grams), topic modeling is performed. For both
datasets we used a similar strategy: first we added a smoothing background-topic regularizer, and then a sparsing
subject-topic regularizer. In this way, we obtain "clean" subject topics without background words.
In this paper, we use the BigARTM library [27]. Additive regularization (ARTM), implemented in this library, allows
the user to set several criteria at once. In BigARTM, smoothing and sparsing regularizers are defined by the same
class, artm.SmoothSparsePhiRegularizer: if the tau factor is positive, the regularizer smooths; if it is negative, it
sparses. Based on the results of experiments with different numbers of topics and different regularization parameters,
the models with 10 topics and the training parameters presented in Table 2 showed the best perplexity as well as
interpretability and mutual non-intersection of topics.

Table 2. Topic model training parameters.

№ | Name | Dataset 1 | Dataset 2
1 | Number of initial passes over the document collection | 40 | 40
2 | Tau of SmoothSparsePhiRegularizer (smoothing) | 1e5 | 4e6
3 | Number of passes over the collection after regularization | 30 | 20
4 | Tau of SmoothSparsePhiRegularizer (sparsing) | -5e6 | -1e6
5 | Number of passes over the collection after regularization | 30 | 30

Tables 3 and 4 show four of the 10 discovered topics for each dataset. The full list of topics and terms is not
presented here for lack of space (see Fig. 3 for topic names). Each topic is defined by a set of terms (the tables
show the 10 most important). The topic names were assigned by us after viewing and decoding the first 50 terms of each
topic. Fig. 3 compares all topics of the two datasets: topic nodes are connected by an edge if the topics share terms,
and the thicker the edge, the more words the topics have in common. As Fig. 3 shows, Dataset 1 contains more detail
about blood pressure diseases than Dataset 2. Also, the topic "Stenocardia" of Dataset 1 is divided into two topics in
Dataset 2: a vascular examination is an important procedure for stenocardia diagnosis, and stenocardia is often an
acute condition requiring urgent hospitalization. At the same time, Dataset 2 contains much more detail on diseases with
heart rhythm disturbance: the terms of four topics of Dataset 2 are contained in the single topic "Heart rhythm
disturbance" of Dataset 1. Thus, we can conclude that although both medical centers are cardiologic, their patient
flows differ in diseases and comorbidities. Besides, there are topics in both datasets for which no intersections were
found.
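The edges of such a comparison graph can be derived from term overlap alone. A minimal sketch follows; the topic names and the tiny term sets are illustrative excerpts, not the full 50-term lists used in the paper.

```python
from itertools import product

def topic_edges(topics_1, topics_2):
    """Edge weight between two topics = number of terms they share."""
    edges = {}
    for (name_1, terms_1), (name_2, terms_2) in product(
            topics_1.items(), topics_2.items()):
        shared = len(set(terms_1) & set(terms_2))
        if shared:                      # draw an edge only if topics intersect
            edges[(name_1, name_2)] = shared
    return edges

print(topic_edges(
    {"Heart rhythm disturbance": {"ФП", "пароксизм", "кордарон"}},
    {"Paroxysm": {"пароксизм", "ФП", "новокаинамид"},
     "Tachycardia": {"тахикардия"}},
))
```

A topic of one model that connects to several topics of the other model is exactly the "split topic" pattern discussed above.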

Table 3. Four main topics for Dataset 1.

10 main terms in Russian | 10 main terms in English | Topic name
ХС, АЛТ, АСТ, ЛПНП, ЭКГ год, АЛТ АСТ, ЛПВП, ТГ, ритм синусовый, синусовый ЧСС | cholesterol, ALT, AST, LDL, ECG year, ALT AST, HDL, TRIG, sinus rhythm, sinus HR | Test results
ФП, пароксизм, ЖЭ, кордарон, ХМ, уд, пауза, синусовый ритм, ХМ ЭКГ, АВ | AF, paroxysm, PVC, cordarone, Holter monitoring, bpm, pause, sinus rhythm, Holter ECG monitoring, AV | Heart rhythm disturbance
КАГ, ПМЖА, ПКА, стентирование, ОА, треть, стент, реваскуляризация, грудина, АКШ | coronarography, AIA, RCA, stenting, LCX, third, stent, revascularization, sternum, CABG | Stenocardia
ммоль, диабет, гликемия, натощак, сахарный, сахарный диабет, диабетический, гликировать, натощак ммоль, натощак ммоль литр | mmol, diabetes, glycemia, fasting, sugar, diabetes mellitus, diabetic, glycated, fasting mmol, fasting mmol liter | Diabetes mellitus

Table 4. Four main topics for Dataset 2.

10 main terms in Russian | 10 main terms in English | Topic name
тахикардия, экстрасистолия, предсердный, норма, ЧПЭС, узловой, продолжительность, наджелудочковый, верапамил, несколько год | tachycardia, extrasystole, atrial, normal, transesophageal pacing, nodal, duration, supraventricular, verapamil, several year | Tachycardia
пароксизм, ФП, кордарон, новокаинамид, эффект, купировать, пропанорма, пароксизм ФП, ЭИТ | paroxysm, AF, cordarone, novocainamide, effect, relieve, propanorm, paroxysm of AF, electropulse therapy | Paroxysm
мм, рт, ст, рт ст, мм рт, АД мм, цифра, максимально, бригада, цифра АД | mm, Hg, st, Hg, mm Hg, BP mm, figure, maximum, (ambulance) brigade, BP figure | Blood pressure
ЭКС, имплантация, имплантировать, имплантация ЭКС, митральный, клапан, ЧМТ, смена, проверка, ОНМК | pacemaker, implantation, implant, pacemaker implantation, mitral, valve, TBI, replacement, check, CVA | Pacemaker

Fig. 3. Terms comparison of topics discovered with topic modelling.

After the topic modeling, the cross-segmentation from Section 4.3 was performed. Fig. 4 shows the results of
cross-segmentation on the first and second corpora. A study of the TSC metric distribution for anamneses of fixed length
shows that, for texts of up to 6 sentences, the metric takes higher values, because short anamneses are more likely to be
segmented similarly. For longer anamneses, the medians of the TSC metric are 0.62 and 0.64 for
Dataset 1 and Dataset 2, respectively. These values are explained by the fact that in each topic model there are
topics that split into several topics of the other model: the topics "Stenocardia" and "Heart rhythm
disturbance" for Dataset 1 and the topics "Blood pressure" and "Pacemaker" for Dataset 2. Table 5 shows an
example of anamnesis segmentation in which the "Stenocardia" topic from Dataset 1 is split into four topics from Dataset 2.

Fig. 4. TSC metric distribution by text length (in sentences) for Dataset 1 (DS 1) and Dataset 2 (DS 2).

Table 5. Example of anamnesis segmentation with topics discovered from Dataset 1 and Dataset 2.

Topic from Dataset 1 | Anamnesis part | Topic from Dataset 2
Decreased blood pressure | The history of hypertension is denied, usual blood pressure is 110/60 mm Hg. | Blood pressure
Stenocardia | The debut of coronary heart disease in 1999 in the form of the appearance of a clinic of angina pectoris of low FC. Did not receive regular therapy; occasionally used Nitromint with a good effect. | Ischemic heart disease
Stenocardia | In 2011, in connection with the clinic of unstable angina, he was examined, coronary angiography was performed, and on 14.03.12 RTSA with stenting of the anterior interventricular artery was performed. | Acute condition
Stenocardia | Afterwards, anginal pains did not bother him. The resumption of the clinic of angina pectoris from May 2012; 07.03.12 hospitalized with ACS. | Ischemic heart disease
Stenocardia | 04.07.12 transferred to perform the angiography. 04.07.12, restenosis of the stent of the anterior interventricular artery in the proximal and middle thirds was detected; angioplasty was simultaneously performed with a drug-coated balloon. | Acute condition
Stenocardia | There were no complications. Revascularization is complete. | Ischemic heart disease
Stenocardia | In satisfactory condition transferred to the department. | Change of state

6. Discussion

The above analysis of the two datasets allows us to determine their similarities and differences. A word-level study
shows that the texts of the two sets are similar in the number of words and sentences, but Dataset 2 requires additional
word processing, as it contains shorter anamneses with a larger number of words owing to frequent abbreviations.
The vocabularies of both datasets contain about 50% non-repeating words and require typo correction. The lists of top
non-stop words are similar and give only a small idea of how the datasets differ in content.
Topic modeling and segmentation help to compare anamneses by content. The topics and their intersecting terms
presented in Fig. 3 give a general idea of which topics are found more often in the texts and how specialized they
are. The anamneses of Dataset 1 contain a more detailed description of blood pressure diseases, while Dataset 2
contains more topics specific to diseases of heart rhythm disturbance. Dataset 2 also contains topics that
describe hospitalization processes.
Thus, it can be suggested what difficulties may arise when applying trained models to another dataset. At the
text-processing stage, it is necessary to pay attention to abbreviations. Models for predicting the characteristics of
certain diseases or patient outcomes may show lower accuracy, since texts from a different dataset may be less
specific due to a different profile of the medical institution. Topic modeling and term analysis make it possible to
establish the specifics of a hospital.

To improve the quality of this study, anamneses can be processed with the modules for medical texts (see Fig. 1). The
spellchecker module can significantly reduce the number of non-repeating words [19]. The negation detection
module helps to understand the purpose for which a disease is described: for example, in Dataset 1 the frequent
mentions of blood pressure diseases and myocardial infarction contain many negations, since specialists of this medical
center are obliged to ask patients about these diseases and record when a disease is not detected [20]. Moreover, for the
interpretation of topics, it is better to involve medical specialists.

7. Conclusion

This study implements a common pipeline for analyzing several text corpora. The approach makes it
possible to understand whether the same models can be applied to different datasets, even if they are obtained from
similar sources and should contain similar data. We demonstrate the approach by analyzing two corpora from different
cardiology centers. As a result, each corpus needs different word-level processing and has a specific set of
descriptions, which limits the use of predictive models for some diseases.
We plan to use a similar corpus analysis when transferring the developed modules to the data of new medical
centers (Fig. 1). In the future, we are going to collaborate with domain specialists to interpret topics and to use the
topics as part of the segmentation module. This module is aimed at labeling a text with discovered topics, which can
help medical staff navigate EMR free text and help data scientists get an idea of the features that can be extracted
from the texts.

Acknowledgements

This research is financially supported by the Russian Science Foundation, Agreement №17-71-30029, with co-financing by Bank Saint Petersburg.

References

[1] Zhu Runjie, Tu Xinhui, Huang Jimmy. (2020) "Using Deep Learning Based Natural Language Processing Techniques for Clinical Decision-
Making with EHRs." p. 257–95.
[2] Campbell DA, Johnson SB. (2001) "Comparing syntactic complexity in medical and non-medical corpora." Proc AMIA Symp: 90–4.
[3] Ananthakrishnan Ashwin N, Cai Tianxi, Savova Guergana, Cheng Su Chun, Chen Pei, Perez Raul Guzman, et al. (2013) "Improving case
definition of Crohn’s disease and ulcerative colitis in electronic medical records using natural language processing: A novel informatics
approach." Inflamm Bowel Dis 19 (7): 1411–20.
[4] Chen Qingcai, Li Haodi, Tang Buzhou, Wang Xiaolong, Liu Xin, Liu Zengjian, et al. (2015) "An automatic system to identify heart disease
risk factors in clinical texts over time." J Biomed Inform 58: S158–63.
[5] Tian Zhe, Sun Simon, Eguale Tewodros, Rochefort Christian M. (2017) "Automated extraction of vte events from narrative radiology reports
in electronic health records: A validation study." Med Care 55 (10): e73–80.
[6] Alemzadeh Homa, Devarakonda Murthy. (2017) "An NLP-based cognitive system for disease status identification in electronic health
records." 2017 IEEE EMBS Int. Conf. Biomed. Heal. Informatics, BHI 2017, p. 89–92.
[7] Dudchenko Aleksei, Ganzinger Matthias, Kopanitsa Georgy. (2019) "Diagnoses Detection in Short Snippets of Narrative Medical Texts."
Procedia Comput. Sci., vol. 156, p. 150–7.
[8] Sada Yvonne, Hou Jason, Richardson Peter, El-Serag Hashem, Davila Jessica. (2016) "Validation of case finding algorithms for
hepatocellular cancer from administrative data and electronic health records using natural language processing." Med Care 54 (2): e9–14.
[9] Murakami Akira, Thompson Paul, Hunston Susan, Vajn Dominik. (2017) "“What is this corpus about?”: Using topic modelling to explore a
specialised corpus." Corpora 12 (2): 243–77.
[10] Jacobi Carina, Van Atteveldt Wouter, Welbers Kasper. (2016) "Quantitative analysis of large amounts of journalistic texts using topic
modelling." Digit Journal 4 (1): 89–106.
[11] Shtekh Gennady, Nikitinsky Nikita, Kazakova Polina, Skachkov Nikolay. (2018) "Applying topic segmentation to document-level
information retrieval." ACM Int. Conf. Proceeding Ser., p. 1–6.

[12] Rayson Paul, Garside Roger. (2000) "Comparing corpora using frequency profiling." Proc Work Comp Corpora 9: 1–6.
[13] Kilgarriff Adam. (2001) "Comparing Corpora." Int J Corpus Linguist 6 (1): 97–133.
[14] Drouin Patrick. (2004) "Detection of domain specific terminology using corpora comparison." LREC: 79–82.
[15] Fothergill Richard, Cook Paul, Baldwin Timothy. (2016) "Evaluating a topic modelling approach to measuring corpus similarity." Proc 10th
Int Conf Lang Resour Eval Lr 2016 (2001): 273–9.
[16] Lu Jinghui, Henchion Maeve, Namee Brian Mac. (2019) "A topic-based approach to multiple corpus comparison." CEUR Workshop Proc
2563: 64–75.
[17] Deleger Louise, Li Q, Lingren Todd, Kaiser Megan, Molnar Katalin, Stoutenborough Laura, et al. (2012) "Building gold standard corpora
for medical natural language processing tasks." AMIA Annu Symp Proc 2012: 144–53.
[18] Zweigenbaum Pierre, Jacquemart Pierre, Grabar Natalia, Habert Benoît. (2001) "Building a text corpus for representing the variety of
medical language." Stud Health Technol Inform 84: 290–294.
[19] Balabaeva Ksenia, Funkner Anastasia, Kovalchuk Sergey. (2020) "Automated Spelling Correction for Clinical Text Mining in Russian."
Medical Informatics Europe 2020 270: 43-47.
[20] Funkner Anastasia, Balabaeva Ksenia, Kovalchuk Sergey. (2020) "Negation Detection for Clinical Text Mining in Russian." Medical
Informatics Europe 2020 270: 342-346.
[21] Funkner Anastasia A, Kovalchuk Sergey V. (2020) "Time Expressions Identification without Human-labeled Corpus for Clinical Text
Mining in Russian." Lecture Notes in Computer Science 12140: 591-602.
[22] Balabaeva Ksenia, Kovalchuk Sergey. (2020) "Experiencer detection and automated extraction of a family disease tree from medical texts
in Russian language." Lecture Notes in Computer Science 12140: 603-612.
[23] Alghamdi Rubayyi, Alfalqi Khalid. (2015) "A Survey of Topic Modeling in Text Mining." Int J Adv Comput Sci Appl 6 (1).
[24] Vorontsov Konstantin, Potapenko Anna, Plavin Alexander. (2015) "Additive regularization of topic models for topic selection and sparse
factorization." Lect Notes Comput Sci (Including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 9047: 193–202.
[25] Vorontsov Konstantin, Potapenko Anna. (2014) "Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix
factorization." Commun Comput Inf Sci 436: 29–46.
[26] Korobov Mikhail. (2015) "Morphological analyzer and generator for Russian and Ukrainian languages." Commun. Comput. Inf. Sci: 320–
332.
[27] Vorontsov Konstantin, Frei Oleksandr, Apishev Murat, Romov Peter, Dudarenko Marina. (2015) "Bigartm: Open source library for
regularized multimodal topic modeling of large collections." Commun Comput Inf Sci 542: 370–81.
