Artificial Intelligence in Medicine Book - 2022 - 1
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
AI, Machine Learning, and Deep Learning per Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
A Brief History of AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Rising Demand for AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
AI Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
AI Staging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
AI Programming Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Types of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Machine Learning Problem Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Limitations of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Introduction of Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Single-Layer Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
N. Lidströmer (*)
Department of Women’s and Children’s Health, Karolinska
Institutet, Stockholm, Sweden
e-mail: niklas.lidstromer@ki.se
F. Aresu
KTH Royal Institute of Technology, Stockholm, Sweden
e-mail: aresu@kth.se
H. Ashrafian
Department of Surgery and Cancer, Imperial College
London NHS Trust, London, UK
Institute of Global Health Innovation, Imperial College
London, London, UK
Hamlyn Centre for Robotics and Artificial Intelligence,
Department of Surgery and Cancer, Imperial College
London, London, UK
e-mail: h.ashrafian@imperial.ac.uk
This behavior is learned based on the data given, by minimizing the error between the current action and the ideal one; a feedback mechanism (the famous "error backpropagation") uses this error to lead the machine to self-correct and internally improve its performance.

Deep Learning (DL) is a subset of machine learning that uses multilayer neural networks to solve complex problems. These networks exploit the processing of contextual data; deep learning techniques are therefore more suitable for learning from unstructured data.

A Brief History of AI

The history of AI is very old – it goes back to antiquity and Greek mythology. Talos was a giant animated bronze warrior, created by Hephaestus and programmed to guard the island of Crete by throwing rocks at approaching enemy ships. There are several myths about mechanical men and automata throughout human history, and these were generally well thought of [1].

Modern medicine is intrinsically connected to advanced mathematical analysis, which now requires computers with fast processors. This has led to new diagnostics, pharmacovigilance, therapies, robot-assisted interventions, epidemiological data mining, and the synthesis of evidence-based medicine and decision support.

Mathematics in medicine also has a long history. The Leiden professor of medicine Archibald Pitcairne (1652–1713) developed a theory of iatro-mathematics (medicine and mathematics) and can be regarded as the father of mathematical medicine [2]. During his career he was demonstrably influenced by mathematicians such as David Gregory (1659–1708) and Isaac Newton (1642–1727), whose Philosophiæ Naturalis Principia Mathematica heavily influenced the formation of a Newtonian medicine and the formalized concept of iatro-mathematics, with lectures in Leiden during the early 1690s [3].

In 1920, The Lancet published an article on the topic of iatro-mathematics [4], where it was noted that the "rapprochement of medicine and mathematics is incomplete." It was concluded, however, that the two sciences would "eventually become a firm alliance."

In 1950, Alan Turing published a landmark paper in which he speculated about the possibility of creating machines that think. He proposed a test, now known as the Turing test, aimed at determining whether a computer can think like a human being. Since thinking is hard to define, he concluded that if a machine can carry out a conversation that is indistinguishable from a conversation with a human being, then it would be reasonable to say the machine is thinking; such a machine would pass the Turing test. No machine to this day has achieved this result. The Turing test stands out as the first contribution to the philosophy of artificial intelligence.

This contribution was followed by the era of "game AI" during the 1950s. In 1951, using the Ferranti Mark 1 machine at the University of Manchester, the computer scientist Christopher Strachey wrote a checkers program, and Dietrich Prinz wrote one for chess at around the same time. These were the first attempts to let computers play games such as chess and compete with humans.

Then came probably the most important year in AI history: in 1956, the concept of AI was coined for the first time at the Dartmouth Conference by Professor John McCarthy. In 1959 the first AI laboratory, the MIT AI Lab, was established – still in operation and famous – marking the start of the AI research era. In 1960 the first AI robot was introduced on the GM assembly line, and the first chatbot, ELIZA, was invented – the great-grandmother of Siri and Alexa. Later came the very well-known IBM Deep Blue, which in 1997 beat the world champion Garry Kasparov in a game of chess; this is regarded as one of the first great accomplishments of AI.

In 2005, at the DARPA Grand Challenge, the racing team of Stanford University participated with Stanley, an autonomous car, which won the race.

In 2011, IBM's question-answering system Watson defeated two of the greatest Jeopardy! champions, Brad Rutter and Ken Jennings.
6 N. Lidströmer et al.
AI started as a hypothetical concept and has evolved into the most important technology in today's world, showing exponential growth of its potential. Wherever we look around us, AI, deep learning, or machine learning power many things. Today AI dominates knowledge bases, expert systems, deep learning, computer vision and image processing, machine learning, natural language processing, etc. Companies such as Microsoft, most car manufacturers, and a long list of other major tech companies are deeply investing in AI. The consensus among all mentioned instances is that AI is the way of the future.

For more details on the importance of AI in medicine, please see the chapter "On the Importance of AIM," by Dr Katarina Gospic et al.

AI Applications

A prominent example is IBM Watson. It should be noted, though, that Watson has not been directly involved in medical diagnosing; it has only assisted with treatment alternatives for already diagnosed patients [7].

During the last 10 years Watson has partnered with a long range of organizations, companies, and universities, e.g., Columbia University, the University of Maryland [8], Memorial Sloan Kettering Cancer Center [9], MD Anderson Cancer Center, Manipal Hospital, Cleveland Clinic, and Case Western Reserve University [10].

The FAANG group and other large corporations have started AI initiatives within health: Facebook (Preventative Health, 2019–), Microsoft (HealthVault, 2011–2019), Apple (Health, 2014–), Amazon (Amazon Care, 2018–), and Google (Google Health, 2006–2012, and with DeepMind, 2018–) [11].

For instance, DeepMind Health collaborates with Moorfields Eye Hospital to develop AI applications for healthcare, especially eye scanning [12], and with University College London Hospital, aiming to develop an algorithm to differentiate between healthy and cancerous tissues in the head and neck region [13].

In large global networks, such as Facebook, AI is used for, e.g., face verification – both as a password and for auto-tagging of friends – and to personalize advertising systems using neural networks, machine learning, and deep learning concepts.

Many people are not aware of how much AI they use in their daily lives – all social media platforms, e.g., Facebook, Instagram, Twitter, and LinkedIn, rely heavily on AI. Through the 2016 US elections, political ads using social media were called into the spotlight, specifically the controversy of targeting users to obtain their personal information and to determine which advertisements would persuade those electors.

Twitter uses AI to identify hate speech and terrorist language in tweets. In this way ca. 300,000 terror-linked accounts were detected, and these nonhuman AI systems found 95% of them [14].

Also, virtual assistants such as Alexa and Siri have entered the market, and quite recently also Google Duplex, which responds to calls and can book appointments with a human touch that makes it sound realistic.

Some other examples of AI are Tesla's and many other car manufacturers' experiments with self-driving cars, and even taxis without a driver, based on a range of AI implementations. Netflix uses AI and pattern recognition to make personal film recommendations. Gmail uses a similar principle to automatically sort incoming mail, filter spam, etc.; the spam filter looks for buzzwords that are common in spam, e.g., "full refund" or "lottery," and then directs such mail to the spam folder.

AI Staging

There are three main stages of AI:

• Narrow AI, also known as weak AI, can only be applied to specific tasks. Most applications today belong to this group, e.g., Alexa – although sophisticated, all its functions lie within a narrowly defined range. Other examples are self-driving cars, chess computers, and AlphaGo.

• General AI, or strong AI, refers to machines able to perform any intellectual task that a human can. We have not reached this stage: machines now have strong processing power, but hitherto no sign of reasoning capacity; hence we are stuck in the weak AI stage. Not even AlphaGo Zero, which learned without human intervention, could be defined as strong AI.

• Super AI refers to a stage at which computers would surpass human capacities. This stage remains hypothetical and is discussed in terms of, e.g., big data statistics, symbolic mathematics, and generative adversarial networks (GANs).

AI Programming Languages

There are several languages used in AI applications. One of the most popular and well-known is Python, partly because of its simple and functional syntax, and also because of the great number of libraries designed for Python to implement machine learning algorithms in a straightforward manner, such as Keras and TensorFlow.

Python's advantages are related to its simplicity and maintainability, as well as the possibility to connect and integrate with files written in other programming languages. High memory usage and the lack of true multithreading are substantial Python disadvantages [15].
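As a small illustration of the simplicity claim above, the snippet below summarizes a list of hypothetical lab values using nothing but the standard library; the values and the outlier rule are invented for illustration, and libraries such as Keras or TensorFlow offer comparably compact interfaces for full ML models.

```python
# Minimal illustration of Python's compact syntax: summary statistics
# for a list of hypothetical lab values, standard library only.
import statistics

lab_values = [4.1, 5.0, 4.7, 5.3, 4.9, 6.2]  # invented measurements

mean = statistics.mean(lab_values)
stdev = statistics.stdev(lab_values)
high = [v for v in lab_values if v > mean + stdev]  # flag outliers (invented rule)

print(f"mean={mean:.2f}, stdev={stdev:.2f}, flagged={high}")
```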
Python was created in 1989 by Guido van Rossum and is an interpreted, object-oriented, high-level programming language with dynamic semantics. As a high-level language it frees the programmer from low-level details, e.g., memory allocation, and it is free and open-source. Python is portable, i.e., supported by many platforms, e.g., Linux, macOS, Windows, FreeBSD, Solaris, OS/2, Amiga, AS/400, BeOS, OS/390, PlayStation, Windows CE, etc.

Python supports various programming paradigms, i.e., both object-oriented and procedure-oriented programming, and is extensible, i.e., it can invoke C and C++ libraries and can integrate with a multitude of other languages, such as Java and .NET products. Python is the most rapid gainer in AI, with huge momentum. Its use is ubiquitous for creating AI algorithms, machine learning, IoT projects, etc. With Python, the developer does not need to write very much code, because ready-made packages with algorithms exist: for instance, PyBrain (for machine learning), NumPy (for scientific computing), Pandas, and a vast range of other libraries can be used.

Apart from Python, another popular programming language, used mainly for statistical tasks, is R. This language performs well in the analysis and manipulation of incoming data for statistical purposes. R is well known for its publication-quality plots and its compatibility with other programming languages. However, R is less suitable for big data analysis tasks because of its memory consumption and its speed compared to Python and other programming languages.

R is almost as easy as Python to learn. Both languages are similar to English in syntax and construction; hence they belong to the easiest to master. They both have an enormous number of libraries providing all thinkable predefined algorithms, statistical models, data science tools, machine learning algorithms, NLP, etc.

Moreover, Java is also used in AI, especially for artificial neural networks and genetic programming. Java has benefits in, e.g., simple packaging and debugging, user interaction, and functionality for mega-project scalability and graphics. The latter is one of the outstanding assets of Java, with its standard interface and graphics toolkit – the graphical presentation is, of course, a vital part of AI, which will be seen in AIM, especially when applications are directed toward patients and students. Java also provides better garbage management and multithreading than Python.

Another alternative language is Lisp: less known, but the most ancient and perhaps best-adapted language for AI development. Lisp goes all the way back to the origins of AI. It was introduced by John McCarthy in the late 1950s; it can process symbolic information, prototype, and create dynamic objects, it provides automatic garbage collection, and it is deemed easy by developers. However, nearly all of its excellent features have migrated into many other languages. It is the Sanskrit, Swahili, or Latin of AI languages; the later languages are more effective, have better packaging, etc.

SWI-Prolog is a language that is relevant in AIM, since it is often used in knowledge-base and expert systems – it has features such as pattern matching, freebase data structuring, and automatic backtracking. This gives a strong and flexible framework for programming, which makes it frequently used in AIM.

Other languages worth mentioning are C++, SAS, JavaScript, MATLAB, and Julia. All of these can be used for AI.

Machine Learning

Machine Learning (ML) is one of the most important instruments in AI. It is a way of feeding data into a machine so that it can make its own decisions. The need for ML is as old as the technical revolution, which has generated immeasurable loads of data. In research we generate over 2.5 quintillion bytes of data per day. In 2020 an estimated 1.7 MB of data was created every second for every person on earth. With this vast amount, models can be created to study and analyze complex data, so insights and more precise results can be delivered. In the last 2 years alone, an astonishing 90% of all the world's data has been created. At the end of 2020, 44 zettabytes (10^21 bytes) made up the entire digital universe, and 2.5 quintillion (10^18) bytes are created by humanity every day. It is estimated that 463 exabytes (10^18 bytes) of data will be generated every day by
1 Basic Concepts of Artificial Intelligence: Primed for Clinicians 9
humans in 2025 [16]. A remarkable fact is also that we have decided to store this forever, costing us substantial energy.

In a worst-case scenario, computer technology could use as much as 51% of global electricity in 2030. This will happen if not enough improvement in the electricity efficiency of wireless access networks and fixed access networks/data centers is possible. However, until 2030, globally generated renewable electricity is likely to exceed the electricity demand of all networks and data centers. Nevertheless, a recent investigation suggests, for the worst-case scenario, that electricity for computer technology usage could contribute up to 23% of the globally released greenhouse gas emissions in 2030 [17].

With machine learning in the finance sector, a wide range of profitable opportunities and avoidable risks can be identified. It is of course an understatement to say that the equivalent will enter the medical and healthcare domains. The simple foundation of all machine learning, and of AI, is data per se. Data is the solution; we just need to know how to handle it, and ways to do this are ML, deep learning, and AI.

In medicine the needs for machine learning are associated with:

• Increased data generation from electronic health records (EHRs), digitalization
• Improvements in decision making, risk prediction
• Uncovering of patterns and trends in data, finding hidden patterns, key data extraction
• Building statistical models, time saving
• Solving of complex problems for humans

The term Machine Learning was coined by Arthur Samuel in 1959; a classic formal definition (due to Tom Mitchell) reads: "a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

In other words:

Data > Training the Machine > Building a Model > Predicting Outcome

Machine learning, in all essence, is a subset of AI which provides machines the ability to learn automatically and improve from data pattern analysis; an ML system is programmed in a way that lets it adjust its own parameters. Some more definitions of terms used in this textbook are the following:

Algorithm – A set of rules and statistical techniques used to learn patterns in data, mapping all decisions that a model can take.
Model – A mathematical equation, i.e., a computation or a formula, which is the result of an algorithm that takes some values as input and produces some values as output.
Predictor feature – A variable of the data that helps predict the output, e.g., a physical factor in patients, where a gene, height, lab value, vital sign, or weight can predict a symptom or diagnosis.
Response variable – The feature, output variable, or target variable that will be predicted.
Training data – The data used to train the ML model.
Testing data – Unseen data used to test the ML model after the training procedure.

The ML process can be divided into the following steps:

• Definition of Objective
Formulation of the problem idea: e.g., to predict whether an infected patient will recover, yes or no.
Q: What target feature? E.g., temperature, blood pressure, or other specific symptoms.
Q: What input data? Medical records, vital-signs history, lab results, etc.
Q: What kind of problem? Binary classification? Clustering? A regression problem?
Hence, at step 1 we define how we will solve the problem, what kind of data we need, what we are trying to predict, and what information is needed to predict it.

• Data Gathering
Data such as temperature curves, lab values, vital signs, physical examinations, massive loads of previous comparable patients' EHRs, etc. Where can we get this data: gathered manually, scraped from the web, from EHRs, or from other sources? This is the most time-consuming element of the ML process. We create the dataset.
For ML exercises there are a lot of training datasets online, e.g., Kaggle (https://www.kaggle.com/) and Grand Challenge (https://grand-challenge.org/challenges/), where a whole range of sets with numerous different themes can be downloaded: weather forecasts, economic forecasts, etc. Hence, developers can skip the time-consuming data-gathering step and just download a dataset.

• Curating
Data preparation, or cleaning, involves erasing inconsistencies, e.g., missing factors or redundant information, erasing unnecessary data, correcting formats, and getting the data ready for analysis. Step 3 is likely the hardest step to perform. It is easy to bias the dataset if, e.g., one factor's values are missing more frequently than others'. Any mistake will affect the result.

• Exploratory Data Analysis
This is the ML brainstorming step, which involves the understanding of patterns and trends in the data. Insights are concluded, such as correlations between variables. For instance, low blood pressure and alternating fever or other pre-septic signs are factors that the output result will depend on. Exactly how, we have to find out from the patterns in the vast material, i.e., the correlations of such variables.

• Building a Machine Learning Model
A predictive model is created with ML algorithms, e.g., linear regression, decision trees, etc. This stage always starts with data splicing into training data and testing data. The former is always used to build the model; training data is usually 80% of the total.
In this AIM example, we are predicting the outcome of a classification variable, also known as a categorical variable, i.e., recovered patient, yes or no – two alternatives. In this case we can use logistic regression, support vector machines, K nearest neighbors, Naïve Bayes, etc. Which algorithmic model can be used depends on the problem statement, i.e., it depends on the task to solve. The methods suggested as the better approach for a specific type of task are the result of a continuous process of trial and error.

• Model Evaluation and Optimization
This step evaluates and tests the model so that it can be improved, parameters tuned, etc. A part of the testing dataset is used as a validation dataset. The accuracy and errors, as forms of ML performance metrics, are calculated.

• Predictions
The final outcome is used to make predictions about the given medical condition, and in this case the outcome will be a categorical one. First you get a probability, and based on a preset range the clinician can decide whether the answer is yes or no.

Types of Machine Learning

There are basically three types of ML – supervised, unsupervised, and reinforcement learning.

In supervised learning, we train the machine with labeled data; in other words, the labels act like guides. In AIM, we may feed the machine with expert interpretations of, e.g., X-ray images, such as fracture or no fracture of a specific bone. We explicitly train the machine with these labels. This type is suitable for regression and classification.
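The workflow described above – define the objective, gather and curate data, split 80/20 into training and testing data, build a model, evaluate, predict – can be sketched end to end in a few lines. The example below is a minimal sketch on invented data: it "trains" a one-rule classifier that predicts recovery from a single temperature threshold, which stands in for a real ML algorithm.

```python
# Sketch of the ML workflow above, on invented data:
# gather -> split 80/20 -> train -> evaluate -> predict.
import random

random.seed(0)

# 1. Data gathering (synthetic): (temperature, recovered yes/no).
temps = [36.5 + random.random() * 4.0 for _ in range(200)]
data = [(t, t < 38.5) for t in temps]  # invented labeling rule

# 2. Data splicing: 80% training, 20% testing.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 3. "Training": pick the temperature threshold that best separates
#    the two classes on the training data (a one-rule model).
def accuracy(threshold, rows):
    return sum((t < threshold) == recovered for t, recovered in rows) / len(rows)

candidates = [36.0 + 0.1 * i for i in range(50)]
best = max(candidates, key=lambda th: accuracy(th, train))

# 4. Evaluation on unseen test data.
print(f"threshold={best:.1f}, test accuracy={accuracy(best, test):.2f}")

# 5. Prediction for a new patient.
new_patient_temp = 37.2
print("predicted recovery:", new_patient_temp < best)
```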
Fig. 1 Illustration of three machine learning problems; left to right: linear and nonlinear regression, classification, and
clustering [19]
• Linear Regression – A method to predict the dependent variable (on the y-axis) based on the values of independent variables (along the x-axis). The variables are continuous or discrete. It is used for predictions of a continuous quantity; hence the curve fitted to the data is linear. The equation is:

Y = β0 + β1X + e

where
Y = the dependent variable
β0 = the Y-intercept
β1 = the slope
X = the independent variable
e = the error

For instance, variables with lab results can be imported in CSV format into Python and then prepared, analyzed, and adequate algorithms applied.

• Logistic Regression – A method used to predict a dependent variable from a dataset when the dependent variable is categorical and the outcome is 1 or 0; e.g., in orthopedic medical imaging, this would correspond to the presence of a bone fracture or no fracture. The function is sigmoid, with values that go from 0 to 1. Afterward, classification algorithms can be applied. See: https://en.wikipedia.org/wiki/Logistic_regression.

• Decision Tree – Another classification algorithm, which looks like an inverted tree. It is an ML algorithm where each node signifies a predictor variable (aka feature), the links between the nodes are decisions, and at each branch there is a leaf, which stands for an outcome, or response variable.

Let us assume we have a large data set of ICU patients, and we would like to predict whether an infected patient faces the risk of a serious complication, e.g., sepsis, or not. Every step in the inverted tree represents a categorical classification step, i.e., a choice between two values in a binary tree. For instance, we use short queries – "feverish or not," then "normal blood pressure or not," "normal pulse or not," "chills or not," "affected general condition or not," etc.

Each observation is directed through the branches – purely conjunctions of features – to the leaves, also known as targets or labels. This is a common approach in triage or telephone consultation, where the nonmedical operator tries to decide whether a patient should seek the emergency room, see his or her GP the next day, or just wait.

The most significant clinical factor should be the initial root node, followed by the internal nodes; the terminal nodes, or leaves, then lead to a suggested outcome. Branches are the answers: yes or no, 1 or 0, etc.

The ID3 algorithm (Iterative Dichotomizer 3) is one of the most effective ways of building the tree within healthcare:

Step 1: Select the best attribute (A) – e.g., affected general condition?
Step 2: Assign A as the decision variable for the root node – affected or unaffected patient?
Step 3: For every value A can take, create a descendant node – if yes, if no, etc.
Step 4: Add classification labels to the leaf nodes.
Step 5: If the data are correctly classified, stop.
Step 6: If not, iterate.
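The short triage queries above can be written directly as a hand-built decision tree. The sketch below uses invented rules and thresholds purely for illustration – it is not a clinically validated model:

```python
# A hand-built decision tree mirroring the triage queries above.
# The rules are invented for illustration only, not clinical guidance.
def triage(feverish: bool, normal_bp: bool, normal_pulse: bool,
           affected_general_condition: bool) -> str:
    """Return a suggested action: 'ER', 'GP next day', or 'wait'."""
    if affected_general_condition:           # root node: most significant factor
        if not normal_bp or not normal_pulse:
            return "ER"                      # leaf: possible sepsis warning
        return "GP next day"
    if feverish:                             # internal node
        return "GP next day" if not normal_pulse else "wait"
    return "wait"                            # leaf: no alarming answers

print(triage(feverish=True, normal_bp=False, normal_pulse=True,
             affected_general_condition=True))  # prints "ER"
```

Each `if` is one of the binary questions, and each `return` is a leaf; an observation travels from the root to a leaf exactly as described in the text.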
The entropy is a measure of the uncertainty or impurity that the data contains. The information gain (IG) signals how much information a specific variable or feature brings us with regard to the final outcome, which can be concluded with this logic – e.g., has the patient had a fever?

p(yes) = number of yes outcomes in the parent node / total number of outcomes

The entropy can be calculated in a similar manner:

Entropy(parent) = −[p(yes) · log2 p(yes) + p(no) · log2 p(no)]

In this way, the variables' data-splitting capacities are calculated to deem their suitability in a given place of the decision tree:

Information Gain = Entropy(parent) − [weighted average] · Entropy(children)

• Random Forest – Builds multiple decision trees (a forest) and fuses them to achieve a more precise and stable prediction. The forest gives more accuracy, avoids overfitting (when a model also learns the noise, or disturbance, and takes this into the model, which negatively affects the model's ability to predict from new data), and provides bagging, which means multiple trees test the data. This is suitable for more complex medical situations, e.g., a patient with multiple symptoms. From huge data sets of medical records, bootstrapping can then be executed in a series of rounds to make predictions.

(Figure: "Random Forest Simplified" – a new instance is passed through every tree in the forest.)

With the bootstrapped dataset we build a decision tree, starting at the root node, for which the best attribute is used to split the dataset. For each of the upcoming branch nodes the process is repeated, and at each step the best attributes are chosen. The iteration is run hundreds of times, and hence a forest of trees arises; the bootstrap dataset is used multiple times.

Finally, the prediction stage is reached. If we intend to predict a medical condition in a patient, we run the data through the forest of decision trees, and after all trees have been used, the classification that the majority of trees have voted for is presented.

The model can be evaluated with the part of the dataset that was not bootstrapped, the so-called out-of-bag dataset.

• Naïve Bayes Classifier – A supervised classification algorithm based on Bayes' Theorem, which solves classification problems with a probabilistic approach. The main idea is that the predictor variables in the ML model are considered intrinsically independent of each other. The concept is called naïve since it does not consider any correlations between the variables. In real-world medical problems there often are exactly such correlations, but they are disregarded in this model.

The mathematics building up this model calculates the probability for an event to occur based on events in the past:

P(A|B) – the conditional probability of event A happening, given event B
P(A) – the likelihood of event A happening
P(B) – the likelihood of event B happening
P(B|A) – the conditional probability of event B happening, given event A
happening, given event A
Assume, for instance, that we want to identify an infectious agent based on its diagnostic features – lab tests, microscopy, X-ray, quick tests, clinical signs, etc.:

Infectious agent   One lab test   Microscopy   Quick test   Clinical signs
Infection type 1   4500/5000      0            0            5000/5000
Infection type 2   500/5000       5000/5000    4000/5000    0
Infection type 3   5000/5000      0            1000/5000    500/5000

Above, we have 15,000 patients, who can be divided into three groups of 5000 with one specific infection per group. Let us say we are then given the following observation:

                   One lab test   Microscopy   Quick test   Clinical signs
Observation        Yes            No           Yes          No

To predict whether the infection is of a certain type, Naïve Bayes can be used. P is the probability, H is the hypothesis, and C is a specific class (C1 = class 1, etc.):

P(H | Multiple Evidences) = P(C1|H) · P(C2|H) · … · P(Cn|H) · P(H) / P(Multiple Evidences)

The K nearest neighbors (KNN) algorithm, in contrast, remembers the training set instead of learning a discriminative function. Its most important feature is that it is based on feature similarity with neighboring data points.

The K value stands for the number of nearest neighbors, within a radius defined by, e.g., the Euclidean or Manhattan distance – there are many possible distance metrics. When determining the nature of a novel data point, the numbers of members of preexisting class categories among its neighbors are crucial. If a new data point had one data point of type A and two of type B within its radius, then it would be assigned type B, if K were set to 3 (to include the three closest neighbors). If more neighbors are enclosed, the assigned type may change. The adequate K value can be calculated with, e.g., the elbow method, see below.

The data points (y, x) in a diagram are separated by a length, which is calculated with the Euclidean distance. This simple equation is used by KNN to check the closeness of a new data point to its closest neighbors.
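Under the naïve independence assumption, the posterior score for each infection type in the table above can be computed directly. The sketch below reads each table entry as P(feature = yes | type) and uses the complement for "no" answers – that interpretation, and the equal prior of 1/3 per type, are our assumptions for illustration:

```python
# Naive Bayes on the infection table above. Each entry is read as
# P(feature = "yes" | type); "no" answers use the complement (assumption).
likelihood = {  # type: (lab test, microscopy, quick test, clinical signs)
    "type 1": (4500 / 5000, 0 / 5000, 0 / 5000, 5000 / 5000),
    "type 2": (500 / 5000, 5000 / 5000, 4000 / 5000, 0 / 5000),
    "type 3": (5000 / 5000, 0 / 5000, 1000 / 5000, 500 / 5000),
}
observation = (True, False, True, False)  # lab yes, microscopy no, quick yes, signs no
prior = 1 / 3  # equal priors: 5000 patients per type

scores = {}
for t, probs in likelihood.items():
    score = prior
    for p, seen in zip(probs, observation):
        score *= p if seen else (1.0 - p)  # complement for "no" answers
    scores[t] = score

best = max(scores, key=scores.get)
print(scores, "->", best)
```

Types 1 and 2 are ruled out by zero factors (type 1 never shows a positive quick test; type 2 always shows positive microscopy), so type 3 wins – the same elimination a clinician would do by eye from the table.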
Nonlinear problems can be handled by SVM with the use of kernel tricks. Nonlinear means the data cannot be separated with a single straight line, as elaborated below.

SVM works by drawing a boundary, a hyperplane, between different classes, and hence separating them in the best possible way. Support vectors are the data points closest to the hyperplane, and the boundary is drawn based on information from these support vectors. The optimal hyperplane has a maximal distance, i.e., margin, to the support vectors. This is straightforward as long as the hyperplane separation is linear. If this is not the case, nonlinear SVM is used: the kernel trick transforms the data into another dimension, e.g., separating the points along the z-axis instead, if a 2D approach did not allow separation with a straight line. In the 3D space a clear hyperplane may then be visualized, if the data allow.

Python can easily demonstrate and run all of the algorithms mentioned in this introductory part of the book.

Unsupervised Learning Algorithms

The main aim of the unsupervised learning algorithm K-means clustering is to group similar data points into clusters. The process classifies objects into a predefined number of groups, so that they are as dissimilar as possible between groups, and as similar as possible within a group.

Every cluster has a centroid, from which the distances to the objects are calculated; grouping is then based on minimum distance. When faced with new medical population material, we first need to guess how many clusters there may be, and then provide a centroid for each. The algorithm then calculates the Euclidean distance of the points from each centroid and assigns each point to the most proximate cluster. The clusters can be recalculated over and over as new points are added, and these steps are repeated until each centroid becomes the average of its cluster, i.e., iteration until the centroid values no longer change.

With the elbow method, the optimal K value for a given problem is calculated. First the sum of squared errors (SSE) is computed for some values of K. The SSE is defined as the sum of the squared distances between each data point, or member, of a cluster and its centroid. The method can be plotted as a graph of the WCSS value (within-cluster sum of squares), using this formula:

WCSS(k) = Σ_{j=1..k} Σ_{xi ∈ cluster j} ‖xi − x̄j‖²

where x̄j is the sample mean in cluster j – the elbow method equation [23].

The distortion decreases with an increased number of clusters, i.e., with a higher K value. At one point the distortion drops abruptly, like an elbow in the graph. In large AIM studies, it is critical to pick the best number of clusters.

Reinforcement Learning

Reinforcement learning algorithms consist of two main components – the agent and the environment. The RL agent learns from the environment through rewards or failures, as if exposed to a new computer game, and iterates the game until it masters it. Some main concepts:

Reward (R) – the instant return from the environment as feedback on the agent's last action
Policy (π) – the approach the agent uses to decide the next action
Value (V) – the expected long-term reward with discount, as opposed to the short-term reward
Action-value (Q) – like Value, but includes an extra parameter, the current action (A)

The reward maximization theory states that an RL agent must be trained in such a manner that it takes the best action, so that the reward is maximized.

Discounting weighs future rewards less than immediate ones. It is measured with a gamma value between 0 and 1, with a smaller gamma value meaning a larger discount.

Two other very important terms are the exploration and exploitation trade-off: the former is about exploring the environment, the latter about exploiting what is already known about the environment.
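The discounted long-term value (V) described above can be made concrete with a short sketch; the reward sequence here is invented for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted long-term value of a reward sequence: sum of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# The same rewards are worth less the harder the future is discounted
# (smaller gamma = larger discount, as described above).
print(discounted_return([1, 1, 1], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.1))  # 1 + 0.1 + 0.01 = 1.11
```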
16 N. Lidströmer et al.
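The WCSS formula from the elbow-method discussion in the unsupervised-learning section above can also be computed directly. This sketch uses invented toy points forming two well-separated groups, so WCSS drops sharply when k goes from 1 to 2 – the "elbow":

```python
def wcss(clusters):
    """Within-cluster sum of squares: for each cluster, sum the squared
    Euclidean distances between each member and the cluster centroid."""
    total = 0.0
    for cluster in clusters:
        centroid = [sum(coord) / len(cluster) for coord in zip(*cluster)]
        total += sum(sum((x - m) ** 2 for x, m in zip(point, centroid))
                     for point in cluster)
    return total

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(wcss([pts]))               # k = 1: one big cluster, large WCSS (≈ 149.7)
print(wcss([pts[:3], pts[3:]]))  # k = 2: the natural grouping, WCSS ≈ 2.67
```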
Another concept is the Markov Decision Process, which is the mathematical approach for mapping a solution in reinforcement learning, and includes the following parameters:

Set of actions, A
Set of states, S
Reward, R
Policy, π
Value, V

For example, this can be shown in the shortest path problem, i.e., calculating the shortest path between two nodes with the minimum possible cost.

Q-learning algorithms are among the most important examples of reinforcement learning. They can be used to reinforce, e.g., the pathways taken by an agent taught to get out of a labyrinth with several rooms, if this is the set goal.

The memory learnt by the agent through experience is represented as a Q matrix, where the rows represent the current agent state and the columns the possible actions leading to the next state, which results in the final formula:

Q(state, action) = R(state, action) + γ · Max[Q(next state, all actions)]

The γ (gamma) parameter ranges, as said, from 0 to 1. If γ is closer to 0, the agent will consider immediate rewards; if closer to 1, the agent will weigh future rewards more heavily. Hence, γ closer to 0 leads to exploitation, while γ closer to 1 leads to exploration.

With these words I round off this simple mathematical review, and tilt our focus to other general considerations on AI, ML, and DL, e.g., the definitions and their limitations, which come next.

Limitations of Machine Learning

ML is, first of all, not capable of handling high-dimensional data, where the input and output are very large, since such complex data consumes resources and causes the curse of dimensionality as several new dimensions are added. Just going from 1D and 2D to 3D imposes quite new requirements, and in a real-life situation such as in AIM there can be thousands of dimensions.

One of the big challenges with traditional ML models is feature recognition, involved in object recognition, handwriting recognition, and natural language processing, which consumes a lot of resources. In these types of tasks, the dataset size has a big impact on machine learning performance: a larger dataset is necessary for learning such varied and complex data. Therefore, the usage of deep learning approaches, such as CNNs with hundreds of millions of trainable weights, is required.

Introduction of Deep Learning

Deep learning (DL) models are capable of focusing by themselves on feature extraction with very little guidance from the programmer, and can in particular solve the dimensionality problem – avoiding the curse of dimensionality [24]. The main idea is to imitate the structure of the brain. DL models learn by themselves automatically, and they can quickly identify the decisive factors.

For instance, DL can be used for facial recognition. If ML is used, all specific elements of a face must be separately defined; with the neural networks of DL, the key features of each individual are detected automatically.

Deep learning is a form of ML which uses a computing model hugely inspired by the cerebral structure, where dendrites receive signals to the neuron from other neurons, the neuronal cell bodies summarize all the inputs, and the axons transmit these to other cells by firing off at a certain threshold, etc.

In DL the dendrites are represented by artificial neurons, also called perceptrons. They hence receive input data from multiple sources, and deliver an output. The input and output are exactly like predictor variables. Data fed to a perceptron undergo different functions and transformations, and then give an output. The perceptrons are connected into an artificial neural network. DL is a collection of statistical ML techniques, which
learn hierarchical feature structures, based on the concept of artificial neural networks.

In this neural network structure, there are different types of layers – the input layer receiving all the inputs, and the delivering output layer. Between these layers are numerous hidden layers. The number of perceptrons in each layer, and the number of layers, depend on the problem in question.

For image recognition in radiology, the high-dimensional data is first presented to the input layer, which in this case, due to the massive load, will contain multiple sublayers of artificial input neurons to digest the entire input. The output received from the first input layer contains extractable patterns, identifying the edges of the images based on contrast levels. This data is relayed to the first hidden layer, which will be able to identify certain clinically relevant features (or, in facial recognition, facial features); further transfer to the second hidden layer will create a fuller picture (or the entire face). If the deeper recognition's entirety is sufficient (and no more layers are needed), the data is transferred to the output layer to deliver the classification.

Single-Layer Perceptrons

The perceptron above receives different inputs X1, X2, ... Xn, etc., and biases, and these are weighted accordingly by W1, W2, ... Wn.

The term activation function alludes to its equivalent in human cerebral cellular function, the action potential, where neurons are activated at a certain threshold. This is mirrored here mathematically, as the function becomes saturated above a given threshold.

There are several types of activation functions, e.g., signum, sigmoid, tanh, etc. All functions map the input to the respective output. The bias can shift the activation function in order to get an exact output.

As a medical perceptron analogy – should a person at the ER be sent to the ICU or not? There are three cardinal symptoms, X1–3, and they have an importance, or weight, of W1–3. If a symptom is present, X attains the value of 1, and if not, 0. W1 is set to 6, W2 to 2, and W3 to 2. Then the firing threshold is considered. If the threshold is 5, the patient will be admitted if symptom 1 is present, even if the two others are not. If the threshold is 3, then either symptom 1 or the two other symptoms together will be sufficient for patient transfer.
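The ER example above maps directly onto a single-layer perceptron with a step (threshold) activation; a minimal sketch using the weights and thresholds from the text:

```python
def admit(symptoms, weights=(6, 2, 2), threshold=5):
    """Fire (admit the patient) when the weighted symptom sum reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, symptoms))
    return weighted_sum >= threshold

print(admit((1, 0, 0)))               # symptom 1 alone: 6 >= 5 -> admitted
print(admit((0, 1, 1)))               # symptoms 2 and 3: 4 < 5 -> not admitted
print(admit((0, 1, 1), threshold=3))  # lower threshold: 4 >= 3 -> admitted
```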
Fig. 2 A Convolutional Neural Network (CNN) architecture, with a (fully connected) dense layer of 4096 units, derived from the last max-pool layer to the right (dimensionality change) [27]
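The feature-map sizes in an architecture like Fig. 2 follow standard convolution arithmetic: output = floor((n + 2·padding − kernel) / stride) + 1. (The first AlexNet-style layer is usually computed from a 227 × 227 input; with the often-quoted 224 × 224 input the numbers only work out with padding, a well-known quirk of the original paper.) A small sketch:

```python
def conv_out(n, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

print(conv_out(227, 11, 4))  # 11x11 kernel, stride 4 -> 55, the first feature-map size
print(conv_out(55, 3, 2))    # 3x3 max-pool, stride 2 -> 27
print(conv_out(27, 3, 2))    # next 3x3 max-pool, stride 2 -> 13
```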
(RGB), and reads the image with RGB values. Each of the channels is mapped to the pixels of the image. (The computer calculates the value of each pixel, and derives the image size.) In a CNN, a neuron will only be connected to a small portion of the previous layer; CNNs are not fully connected.

Natural Language Processing

The main reason for the need for text mining and Natural Language Processing (NLP) is the stratospheric amount of data generated every day, and it is expected to grow. For instance:

Instagram   8.95 million photos posted/day
Twitter     500 million tweets/day
Facebook    3.2 billion likes and 350 million photos/day
Mail        300 billion emails sent/day

The vast majority of all data is completely unstructured. By mining, structuring, and analyzing the data, huge scientific and economic value can be harvested. This is the essence of text mining – the process of deriving meaningful data out of natural language. NLP is a method used in text mining, and is the part of computer science and AI which deals with human languages and helps computers read human texts. A few applications of NLP and text mining are auto-correction, auto-completion in searches, spam email detection, predictive typing, spell-checkers, and email classification. Sentiment analysis provides insights into public or customer opinions on certain topics or products; social media platforms use this analysis continuously. Chatbots are widely used in, e.g., customer services. The speech recognition used by Siri, Cortana, and Alexa consists of NLP applications. Machine translation is also an example of NLP, for instance, as used by Google Translate. Other NLP applications include information extraction, keyword search, and advertisement matching.

The Generative Pre-trained Transformer (GPT-3) is an autoregressive language model which uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT series, and was created by OpenAI in San Francisco [28]. The capacity of GPT-3 is 175 billion machine learning parameters, currently the largest example of a natural language processing (NLP) model [29].

A major concept in NLP is tokenization, which is the process of splitting the whole data (corpus) into smaller items (chunks). First it breaks a complex sentence into words, then analyses each single word's importance in the sentence, and then translates. Stemming is the procedure of normalizing a word into its root, base, or nominative state. The stemming algorithm provides this, though with limitations. To improve accuracy, lemmatization uses a large dictionary to group the different inflected forms of a word, i.e., the lemma. Similar to stemming, it labels a group of words into a common bundle, but the output is a proper, usable word. Stemming may lead to an indiscriminate cutting of the word, so that the grammar or the understanding is lost – hence the introduction of lemmatization, which always delivers a word found in a dictionary. Lemmatization occurs after the morphological analysis.

Identifying stop words is critical to applications: they can be removed with no loss of meaning, in order to focus on the keywords, since stop words will only decrease search accuracy.

Another useful concept in NLP is the document-term matrix, a mathematical matrix which describes the frequency of terms occurring in a set of documents. Columns correspond to terms and rows to documents. Features can also be used besides terms [30].

Natural language processing has enormous potential for text mining of electronic health records and other medical applications, as will be elaborated in many other chapters of this reference textbook.

Conclusions

In conclusion, machine learning and deep learning algorithms have been widely implemented in various sectors, with noticeably exponential growth in application demand.

There will be no single area in preclinical medicine or clinical specialty which will not be
profoundly affected by applications of artificial intelligence in medicine. AI will execute a thorough recast of all fields of healthcare.

The era of mathematical medicine, which can be said to have started with Archibald Pitcairne already in the seventeenth century [2], now comes into full fruition with the emerging era of AI in medicine.

This book encompasses the future potential of medicine and the benefits that can be unleashed when AI platforms can unlock the strengths of Big Data for healthcare.

References

1. Sparkes B. The Red and The Black: studies in Greek pottery. Routledge; 1996. p. 124. ISBN 0-415-12661-4, ISBN 978-0-415-12661-8; two late fifth-century vase paintings depicting the death of Talos are discussed by Robertson M. The death of Talos. J Hellenic Stud. 1977;97:159f.
2. Ashrafian H. Mathematics in medicine: the 300-year legacy of iatro-mathematics. Lancet. 2013;382(9907):1780.
3. Guerrini A. Archibald Pitcairne and Newtonian medicine. Med Hist. 1987;31:70–83.
4. Iatro-mathematics. Lancet. 1920;196:610–11.
5. Putting Watson to work: Watson in healthcare. IBM. Retrieved 11 Nov 2013.
6. IBM Watson helps fight cancer with evidence-based diagnosis and treatment suggestions. IBM. Retrieved 12 Nov 2013.
7. Saxena M. IBM Watson progress and 2013 roadmap (slide 7). IBM; 2013. Retrieved 12 Nov 2013.
8. Wakeman N. IBM's Watson heads to medical school. Washington Technology; 2011. Retrieved 19 Feb 2011.
9. Upbin B. IBM's Watson gets its first piece of business in healthcare. Forbes; 2013, February 8.
10. Miliard M. Watson heads to medical school: Cleveland Clinic, IBM send supercomputer to college. Healthcare IT News; 2012, October 30. Retrieved 11 Nov 2013.
11. Ghosh S. Google is consolidating DeepMind's healthcare AI business under its new Google Health unit. Business Insider. Retrieved 30 Jan 2020.
12. Baraniuk C. Google's DeepMind to peek at NHS eye scans for disease analysis. BBC; 2016, July 6. Retrieved 6 July 2016.
13. Baraniuk C. Google DeepMind targets NHS head and neck cancer treatment. BBC; 2016, August 16. Retrieved 5 Sept 2016.
14. Marr B. https://www.bernardmarr.com/default.asp?contentID=1373. Accessed 8 Feb 2021.
15. https://medium.com/better-programming/pythons-advantages-and-disadvantages-summarized-212b5fdf8883. Accessed 8 Feb 2021.
16. The 6th DOMO report. Domo.com; 2018.
17. Andrae A, Edler T. On global electricity usage of communication technology: trends to 2030. Challenges. 2015;6:117–57.
18. Illustration by Nilay Nishit, Birla Institute of Technology, Mesra, India, May 2019.
19. Illustration by Federica Aresu, KTH, and Niklas Lidströmer, Karolinska Institute, Stockholm, Sweden, February 2021.
20. https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm, open-source image.
21. Illustration by Venkata Jagannath, TIBCO Spotfire, http://community.tibco.com. Accessed 12 Feb 2021; released under a free license on Wikipedia.
22. http://Machinelearningmastery.com. Accessed 11 Feb 2021.
23. Carloni O. Research engines in artificial intelligence. SemSpirit.com. The website presents profound and comprehensive guidance and illustrations of most of the useful calculations for AI.
24. Poggio T, Liao Q. Theory I: deep networks and the curse of dimensionality. Bull Pol Acad Sci Tech Sci. 2018;66(6):761–73.
25. Rosenblatt F. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory; 1957.
26. Image by Kiyoshi Kawaguchi, The University of Texas at El Paso, College of Engineering, Electrical & Computer Engineering, utep.edu.
27. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS); 2012.
28. Sagar R. OpenAI releases GPT-3, the largest model so far. Analytics India Magazine. 2020, June 3. Retrieved 31 July 2020.
29. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D. Language models are few-shot learners. 2020;arXiv:2005.14165.
30. Document-feature matrix: tutorials for quanteda. http://tutorials.quanteda.io. Accessed 11 Feb 2021.