Thesis PDF
February 1, 2019
Samuel R Neaves
Department of Informatics
Abstract
Acknowledgements
I would first like to thank my supervisor Sophia Tsoka. She has listened to me and helped
me throughout the process of the PhD. Her willingness to spend her time with me and be
flexible in dealing with the challenges I have faced has been much appreciated.
I would also like to thank the people who have been in the Tsoka research group,
they have provided stimulating scientific discussions and moral support; Gareth Muirhead,
Jonathan Cardoso, Laura Bennett and Aristotelis Kittas.
My friends; especially John Burley, Chris Goff, Stuart Lock, Paola Di Pietro, Louise
Poulter and Anna Dodridge. They may have occasionally led me astray (some more than
others...) but in the process kept me sane and happy. I will note Stuart for his hospitality
when I arrived in London, putting me up and keeping a fun house. Anna for keeping me
connected to my first world of student unions and continuing to inspire me with her life
attitude. John for putting me up on countless weekends, never saying no to meeting up
for a drink and even paying for some incredible holidays to see Walruses and catch up
with friends abroad. Paola for her warm friendly personality which made many a London
evening enjoyable. Louise for the times I really need to go out dancing; we always have a
great time. Chris for our regular pub sessions putting the world to rights, exploring technology
and politics; these have been hugely appreciated. And to the many other friends whom I
cannot all name here: believe me, you are appreciated.
My family have been a huge support to me, with their love and sacrifices. My Mum
and Dad, sister Lucy, Granny and Bernard and my extended family – all have helped me
in countless ways. The most special thanks go to Louise – for being an amazing scientific
collaborator and so much more, thank you.
Contents
Abstract 2
Acknowledgements 3
1 Introduction 10
1.1 How this thesis is organised . . . . . . . . . . . . . . . . . . . . . . . . . . 14
I Background 16
2 Biology 17
2.1 Genomic information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Tissues and organs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Organisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Species . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Ecosystems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Health problems summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Data 28
4.1 Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Sequence conservation data and 16S rRNA microbiome data . . . . 29
4.2 Microarrays – CpG methylation data and gene expression data . . . . . . . 29
4.3 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Reactome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.2 IMG/M data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.3 ENCODE - the Encyclopedia of DNA Elements . . . . . . . . . . . 30
6 Descriptive rule induction from biological data in Prolog - subgroup discovery 66
6.1 An introduction to subgroup discovery . . . . . . . . . . . . . . . . . . . . 67
6.1.1 Data description language . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.2 Rule language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.1.3 Coverage function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.1.4 Learning subgroup rules: the general framework . . . . . . . . . . . 72
6.1.5 Existing rule learning algorithms . . . . . . . . . . . . . . . . . . . 74
6.1.6 Rule length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1.7 Formulation of common bioinformatics tasks as subgroup discovery tasks . . . . . . 79
6.2 Implementation of subgroup discovery task in Prolog . . . . . . . . . . . . 85
6.2.1 Method 1: Pure constraint logic programming implementation . . . 85
6.2.2 Method 2: Heuristic top-down search using a constraint covering function and weighted instances . . . . . . 93
6.2.3 Method 3: Genetic algorithm for searching a very large hypothesis space . . . . . . 96
6.3 Application of constructed subgroup discovery algorithms to CpG methylation and microbiome . . . . . . 105
6.3.1 Application to the CpG sites . . . . . . . . . . . . . . . . . . . . . . 105
6.3.2 Application to the microbiome . . . . . . . . . . . . . . . . . . . . . 109
6.4 Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.6 Comparison of Reactome Pengine to existing data access options . . . . . . 139
7.6.1 Amount of data exchanged . . . . . . . . . . . . . . . . . . . . . . . 139
7.6.2 Flexibility of querying . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Bibliography 170
List of Tables
List of Figures
Chapter 1
Introduction
• Data sits in difficult-to-combine silos. There are now countless biological
databases containing a huge diversity of data. The data ranges from raw sequencing
data to descriptions of chemical reactions and multitudes of recorded phenotypes. A
researcher who attempts to study these collectively has to download masses of data
from many services in many different data formats. The data was not conceived of
as a whole, and an appropriate unifying knowledge representation is missing. This can
be highly problematic, because researchers combining data need to either use one
of the many software packages that provide mappings of identifiers or derive their
own methods to combine the data. This often results in a duplication of effort with
different researchers writing their own subtly different programs to combine data in
their own research groups.
• Data and knowledge are stored separately. Raw data such as the sequence
for a gene is stored in one set of databases, and perhaps gene-to-gene interactions
are stored in another set, but the real knowledge about what a gene does and how
it works is in journal articles or occasionally isolated computer programs. For the
most part, our collective knowledge exists in journal articles as free text, which requires
a person’s time to read, understand and act upon. The knowledge that is contained
in computer programs and algorithms is isolated and human knowledge is needed to
combine these into ad hoc pipelines to answer specific research questions [93].
• Data storage and transfer are resource-intensive and error-prone. Many
terabytes of data are being generated by biological research [63]. For instance, the
European Bioinformatics Institute reported storing 20 petabytes of data in 2013 [98]
and 75 petabytes by 2015 [26], nearly quadrupling in two years. It is increasingly
difficult to store, and even to transfer, data from database providers to the machines
on which researchers compute analyses.
Researchers currently resort to techniques such as downloading a database to their
local machine, collectively taking up double the storage space and consuming much
bandwidth. They then need to use cryptographic hashing to ensure that the data has
not been corrupted in transfer or storage [16].
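As an illustrative sketch only: in SWI-Prolog such an integrity check might be written with library(crypto); the predicate name, file name and published digest below are hypothetical placeholders, not values from any real database provider.

```prolog
:- use_module(library(crypto)).

% Hypothetical helper: succeed only if File's SHA-256 digest matches
% the digest published by the data provider. Both arguments are
% placeholders for this illustration.
verify_download(File, PublishedHex) :-
    crypto_file_hash(File, Hex, [algorithm(sha256)]),
    Hex == PublishedHex.
```

A researcher would then call, for example, verify_download('downloaded_database.dat', PublishedHex) with the provider's published digest, and the goal fails if the transfer or storage has corrupted the file.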
• Overly complex models. Conversely, some studies construct complex ‘black
box’ models (perhaps built with a deep neural network or a support vector machine
with a complex combined kernel function) [122, 13]. These models may have uses
in, for example, predicting prognosis from imaging data [135]. However, we often
need to justify our clinical decision making (which admittedly is a difficult
philosophical task); put simply, having reasons in the form of rules which
a human can intuitively understand, question and reason about is more informative
and useful to researchers than a classification model with many thousands of weights
contributing to a decision [65].
knowledge into formal rules. Indeed, the requirement to curate expert knowledge into
formal rules was one of the major issues encountered when the first attempts were made
to apply Prolog to practical tasks, with the development of ‘expert systems’ [59]. Some
successes were achieved, but the manual curation of expert rules proved to be a large
bottleneck in the adoption of these automated reasoning systems.
In order to try and tackle the problem of codifying human knowledge, computer sci-
entists turned their attention to the possibility of automatically learning rules. This has
driven the development of machine learning algorithms and the field of machine learning
has ballooned into a major scientific undertaking. Many remarkable results have been
achieved, and large numbers of scientific papers rely on these algorithms to draw out
results. State-of-the-art machine learning algorithms often need two key resources: the first
is a large amount of data, and the second, driven by the first, is a lot of computing power.
The outputs of machine learning algorithms are ‘models’, which represent some knowledge
about a concept. One thing that seems to have been lost in the success of modern
machine learning methods is how to deploy these models collectively: the idea of what
was once called an expert system.
Currently, there is a problem – models sit in isolation and do not sit in a grand ‘expert’
system. One learnt model cannot automatically reason with another learnt model, leaving
the knowledge contained in each isolated and underused. Further, an expert system would
justify its decisions by being capable of showing what rules have been used in answering
a query; this feature has been neglected in modern machine learning classification models,
where a decision can be made by weighing many thousands of weak contributions from
variables.
So far we have identified the following practical requirements to enable bioinformaticians
to use the automatic reasoning that Prolog provides: 1) machine learning algorithms
to automatically codify knowledge into rules, 2) a large amount of data to feed the
machine learning algorithms, and 3) sufficient computing power to handle that data.
Others include having appropriate software libraries, for example to read data files
or to write web applications; these allow the computational bioinformatician to do
their job without reinventing the wheel. Many of these libraries are now available in
SWI-Prolog, and we will demonstrate and describe their usage in this thesis.
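As one small, hypothetical illustration of such a library (the file name and column layout are assumptions, not taken from this thesis): reading a delimited data file takes only a few lines using SWI-Prolog's bundled library(csv).

```prolog
:- use_module(library(csv)).

% Illustrative sketch: read a gene-expression table into a list of
% row/N terms, one term per CSV record, e.g. row('TP53', 0.52, 1.31).
% The file name passed in is a placeholder.
load_expression_data(File, Rows) :-
    csv_read_file(File, Rows, []).
```

The returned Rows can then be queried and transformed with ordinary Prolog goals, rather than with a bespoke parser written from scratch.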
Whilst we are not the first to recognise the suitability of logic programming for
bioinformatics [156, 6, 108], the topic has not been fully explored. In particular, recent
developments in Prolog for deploying applications to the web, improving the efficiency of logical
reasoning, and learning with structured data have not been fully exploited.
As we will see, the idea of a grand unified web-scale logic program, with automated
learning of characteristic rules and, from these, automated reasoning to answer research
questions, is tantalisingly close. Our attempts to move the life sciences and bioinformatics
in this direction are the subject of this thesis.
how modern logic programming techniques can improve access to complex datasets on the
web, providing an example data service that is ideal for the distribution of large complex
datasets. In chapter 8 we then go on to give examples of structured data mining of gene
expression data in the presence of complex background knowledge (Reactome pathway data).
The two chapters in the third part of the thesis are based on published work. Chapter 7
is based on ‘Reactome Pengine: a web-logic API to the Homo sapiens Reactome’
(Bioinformatics, 2018) [111]. Chapter 8 is based on ‘Using ILP to Identify Pathway Activation
Patterns in Systems Biology’ (International Conference on Inductive Logic Programming,
Springer, Cham, 2015) [110]. Finally, chapter 9 summarises our argument and presents a
vision for the future.
Part I
Background
Chapter 2
Biology
When Charles Darwin proposed the theory of evolution by means of natural selection [35],
he explained both the unity and the diversity of life. The diversity is that life viewed at a
human scale is highly heterogeneous, with birds flying, fish swimming, plants photosynthesising
and everything in between. In contrast, there are two striking unities in life: first, everything
under a microscope is very similar, with the same cellular machinery visible; and secondly,
all living beings strive to reproduce and survive, but no living individual is immortal. It is
the lineages that survive, not the individuals.
Two ideas are worth stating to help a computer scientist understand biology as a science.
The first is summed up by the quote:
That is, anything a biologist tries to state in general as a law, another biologist will pipe
up with an exception they have found. Famous examples of this colloquialism
include:
• All mammals give birth to live young – except the platypus and echidna that lay
eggs.
• All eukaryotic cells have a nucleus – except red blood cells (the nucleus is ejected to
make room for oxygen).
• All blood is red – except for some fish that have colourless blood.
Both of these ideas, that evolution can be used as a framework to understand a biological
phenomenon, and that there will often be an exception to any biological phenomenon
described, can be helpful when reading and understanding biological research.
“It has not escaped our notice that the specific pairing we have postulated
immediately suggests a possible copying mechanism for the genetic material.”
– Watson and Crick.
What was discovered is that DNA is a molecular structure composed of two paired chains
of chemical bases forming a double helix shape. Each chain has a backbone of sugars and
phosphates, and off this backbone there are four different kinds of bases: Adenine,
Cytosine, Guanine and Thymine, referred to as A, C, G and T.
Once DNA had been described, it was possible for biologists to define a gene as a
sequence of DNA that works together and travels across generations together. From this
discovery, scientists in the second half of the twentieth century derived the central dogma
of molecular biology: that DNA sequences are transcribed into RNA sequences, which are
translated into protein sequences [29].
RNA is a molecule that is similar to DNA, but is single stranded. When a gene is to be
expressed the DNA is unzipped and a matching RNA molecule is built. This is then fed
into the cellular machinery as the instructions to build a protein sequence. These protein
sequences are chains of amino acids. When they are formed they fold into complex shapes.
These shapes are the molecular machines that build and form cells, which are the primary
unit of life.
Scientists have developed a number of terms to describe the study of all of these
processes, including many ’omics terms. The original ’omic is genomics, the study of the
genome: the collection of genes that exist in an individual. The word can also be used to
describe the collection of genes in a species; for example, the human genome is the
collection of all genes found in humans, a subset of which is your personal genome, which
is unique to you (unless you have an identical twin).
Many other omics terms have been coined in order to describe the large scale study of some
aspect of biology. These include the transcriptome, epigenome, reactome and microbiome.
The transcriptome is the set of genes that have been transcribed; that is, the set of
RNA molecules that have been constructed from parts of the DNA. The transcriptome
differs in different cells, at different times, under different circumstances. When a gene
is transcribed it is said to be expressed. It is the expression of different genes that allows
different types of cells to exist – despite all the cells in the same organism having the same
DNA.
What controls the expression of genes into RNA molecules is sometimes called
epigenetics [71]. In our context we use this word to describe the molecular features that attach
to DNA and affect whether the DNA is transcribed or not. The epigenome is the set of
modifications to DNA that affect transcription. These include CpG methylation and histone
modifications.
The proteins that have been constructed interact with themselves and with other biological
molecules in the cell; these interactions are chemical reactions. The chains of reactions
that control a process in a living organism are called biological pathways, and the set of
reactions is called the reactome.
Finally, the microbiome is the set of all microbiota in a body location, environment or
species, for instance the gut microbiome (the set of microbes found in human guts), or the
human microbiome (the set of all microbes living on or in humans).
In this thesis we will present methods to study the epigenome (Chapter 6), the micro-
biome (Chapter 6) and the transcriptome (Chapter 8), and we will make use of the genome
(Chapter 6,7 and 8) and the reactome (Chapters 7 and 8).
2.2 Cells
Cells are the building blocks of all organisms and understanding cells is a key goal for
biologists [1]. Billions of years ago the main lineage of life diverged into prokaryotic and
eukaryotic life. Prokaryotic life is unicellular (such as bacteria) whereas eukaryotic life is
more complex and is often multi-cellular, such as plants and animals. Prokaryotic and
eukaryotic life have evolved different types of cells and in the case of multi-cellular life
there will be different types of cells in the same life form. Prokaryotic and eukaryotic cells
differ in how they store their DNA: in eukaryotic life it is stored in a structure called
the nucleus, whereas in prokaryotic life it is more free-floating. This difference in the storage
of the genetic information leads to differences in cell reproduction, with eukaryotic cell
reproduction being the more complicated of the two. Eukaryotic cell reproduction is more
complicated because two types of reproduction are needed. The first is called mitosis and
is when one cell replicates itself with the same DNA. The second is called meiosis and is
when one cell’s DNA combines with another’s in order to form a third cell. This cell has
some of the DNA from the first cell and some from the second: this is sexual reproduction.
The instructions to build a new cell are stored in the DNA and, via biological pathways,
ultimately control how a cell will reproduce. It is important to remember that whether
life is unicellular or multi-cellular, evolution does not work at the level of individuals but
at the levels of genes. This is the modern synthesis [74]. Dawkins famously described an
organism as a vehicle that has been built by a set of selfish genes in order to help replicate
themselves [36].
different tissues that allow oxygen to be taken from the air and transferred into the red
blood cells of an animal. The lungs themselves are part of the respiratory system, which is
the group of organs including the chest muscles, tongue and nose that together accomplish
the task of breathing.
2.4 Organisms
We usually think of an organism as a discrete individual but the demarcation of individuals
can be blurred. There are many examples of parasitic and symbiotic relationships where one
organism could not survive without the other. Indeed, even at the level of the cell we can
see these phenomena. For example, eukaryotic cells have an organelle, essential to their
survival, called the mitochondrion, which is thought to have evolved from a deep symbiotic
relationship with a prokaryotic organism in ancient times [62]. The evolution of these
symbiotic and parasitic relationships can be understood by using game theory [163]. This
allows us to also understand altruistic ‘behaviour’ between closely related individuals and
symbiotes. A recent prominent area of study for biologists is the relationship between an
organism and its resident colony of microbiota [126]. These colonies are always present
in animals and plants and consist of many thousands of different bacteria, archaea and
viruses. Some biologists have taken to describing the microbiome as the missing organ [10]
to emphasise how important it is to an individual.
2.5 Species
For multi-cellular life, a species can be defined as a group of organisms that can breed with one another [60].
The evolutionary history of a species can be inferred by comparing the genomes of different
species. The study of this is called phylogenetics [119]. Biologists have also categorised life
into a taxonomic hierarchy, the levels of which are: Domain → kingdom → phylum → class
→ order → family → genus → species [133]. The demarcation of these levels is ill-defined,
especially in organisms such as bacteria, but the hierarchy is a valuable tool for understanding life.
2.6 Ecosystems
Collections of species in an environment form ecosystems. Although ecosystems are often
imagined as a fixed system that is in balance, evolution shows us this is not the case. It is a
constant arms race as different genes compete to replicate by creating ever more elaborate
vehicles. Some sets of genes will team up at the cellular level, others at the organism
and species level. The microbiome of an organism can also be considered a study of an
ecosystem [126].
Chapter 3
Example health problems
We will now give some background information on a number of health problems which will
serve as examples in this thesis. We do this in order to provide the context in which we
apply logic programming.
3.1 Cancer
Cancer is when aberrant cells divide and multiply in an uncontrolled fashion; this leads
to neoplasms (an abnormal growth of tissue – a tumour). Cancer tumours begin when one
or a number of cells incorrectly divide due to a signalling problem amongst or inside the
cells. As we described in section 2.2 the machinery of cells is ultimately controlled by DNA,
via RNA and protein production. Therefore, broken aspects of any of these processes can
lead to cancer. A cancer proliferates when daughter cells inherit the erroneous DNA. In
the USA one in two people will get cancer in their lifetime, and 600,000 people die of
cancer each year [20].
Cancer can be understood in terms of local evolution, where cancer cells vary, compete
and the fittest survive [61, 18]. Cancer cells compete against the cell’s tumour suppressing
machinery which attempts to (a) repair DNA in the cell and (b) attack cancerous cells.
Because evolutionary history is on the side of the immune system, many copying mistakes
and problems will be stopped quickly. However, the evolutionary pressure to develop defences
against cancer applies mainly to young organisms, before they have had a chance to reproduce
(and pass on their genes); the pressure has therefore been weaker at older ages, which
is one of the main reasons cancers affect older people more on average [18].
Cancers often occur in parts of the body where cells reproduce quickly. For example,
in breast tissue, where milk is produced, and in the lungs, where mucus is produced.
In this thesis we use a number of techniques to build models that give researchers
insights into the malfunctions in the body’s signalling system which lead to cancers.
The studies used are on two types of cancer that are very serious for human health:
breast and lung cancer.
3.2 Psoriasis
In Chapter 6 we will apply machine learning algorithms to data for knowledge discovery
in psoriasis. Specifically, we will be looking at the microbiome (the community of
microbes that inhabit the human body) in psoriasis.
Psoriasis is an inflammatory skin condition that is currently incurable [112]. The symp-
toms of psoriasis are silvery plaques that occur due to an increase in keratinocytes (a type
of skin cell) resulting in incomplete cornification (the formation of a dead layer of skin that
acts as a barrier) in the stratum corneum (the outermost layer of skin). The keratinocytes
multiply at a faster rate than normal leading to an abnormal epidermis where the outer
layer is defective [116].
It is known that a perturbed immune system is associated with psoriasis. The affected
skin contains components of the immune system such as T-Cells and cytokines [112]. These
cells lead to inflammation in the skin because they are pro-inflammatory.
Psoriasis is a growing problem in the developed world with nearly three in one hundred
people in the USA now suffering from the condition [125]. It is not known what triggers
psoriasis, but genome-wide association studies (GWAS) have identified some genetic variants
that affect risk (e.g. in the IL21A and IL23R genes [109]). It is thought that environmental
factors other than genetic ones are likely to have an effect, such as stress, trauma and the
composition of the skin microbiome [112].
Many researchers now wish to learn more about the relationships between the host
and the skin microbiome communities in the presence and absence of conditions such as
psoriasis.
Chapter 4
Data
There are now many technologies for capturing ever more types of biological data.
This has been described as the data deluge [12]. In order to show the broad usage and
diversity of the types of data available to the modern biologist, this thesis provides exploratory
case studies of a number of tasks that a bioinformatician may undertake. The data types
include sequence conservation data, 16S rRNA microbiome data, CpG methylation data,
gene expression data, structured biological pathway data, data about the physical charac-
teristics of microbiota and further genome annotation data. Each of these types of data
will be explained in this Chapter. As sequence conservation data and 16S rRNA micro-
biome data are both derived from genome sequence data we start with an overview of the
sequencing process.
4.1 Sequencing
Sequencing technology allows the genome and transcriptome of an organism to be directly
read. The technology for sequencing was initially prohibitively expensive; however,
developments that came about due to the Human Genome Project greatly reduced the cost [151].
These technologies are known as second and third generation sequencing technology [124].
A prominent modern technology for sequencing is RNASeq [105]. Inexpensive sequencing
technology makes it feasible to sequence many more organisms and to compare them. For
example, to sequence all the microbes living on a person’s skin would not have been
considered feasible at the time of the Human Genome Project, as the size of this metagenome
is an order of magnitude larger than the human genome. However, just a few years later
this undertaking was begun by the Human Microbiome Project [147].
[72, 44, 14]. In Chapters 6 and 8 we use data from CpG methylation arrays and RNA gene
expression arrays, respectively.
4.3 Databases
4.3.1 Reactome
Reactome is a database of biological pathways that is made freely available online [30].
The complete database contains details of biochemical reactions in a number of species,
however in this thesis we limit our use to the data on humans.
Reactome uses a reaction-centric data model, where entities participate in reactions that
are manually curated into pathways. Each step in a pathway has the literature citations
of the experiments used to ascertain that it exists.
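In the spirit of the logic-programming approach taken later in this thesis, a reaction-centric model of this kind can be pictured as a small set of relations. The predicate names, entity identifiers and citation below are illustrative placeholders, not Reactome's actual schema or data:

```prolog
% Hypothetical sketch of a reaction-centric model (not Reactome's
% real schema): entities participate in reactions, reactions are
% curated into pathways, and each reaction carries its citations.
participates(entity_a, reaction_1).
participates(entity_b, reaction_1).
reaction_in_pathway(reaction_1, pathway_x).
citation(reaction_1, placeholder_reference).

% A pathway's participants are the entities of its reactions.
pathway_entity(Pathway, Entity) :-
    reaction_in_pathway(Reaction, Pathway),
    participates(Entity, Reaction).
```

Representing the model as relations in this way makes queries such as "which entities occur in this pathway?" a single goal, pathway_entity(pathway_x, Entity).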
We make use of the Reactome database in part three of this thesis. We use it as an
exemplar for a modern bioinformatics API in Chapter 7 and as background knowledge to
inform model construction for classification models in the task of distinguishing between
two types of lung cancer in Chapter 8.
of the human genome. The data includes transcription factor binding sites, non-coding
RNA (that is, sites where DNA is transcribed into RNA but the RNA is not then
translated into protein) and sites with different chromatin structure (chromatin holds the
DNA molecule in knots, controlling what parts of DNA are accessible to cellular machinery
using histone proteins). We use features derived from this data in Chapter 6.
Part II
Chapter 5
Prolog
a logic such as Clausal Logic we need to describe at least three elements: syntax, semantics
and proof theory [47]. The formal description of these elements will allow us to reason
about arguments such as:
That is, we will be able to make statements and follow arguments that concern
individuals such as VirusX and MicrobeY, sets of individuals such as bacteria, and relations
between individuals such as VirusX infects MicrobeY. In addition, we would also like to be
able to reason with statements such as:
In such statements we are reasoning with infinite domains (ALL carnivores). We can do
this by giving abstract names to entities without explicitly naming them, as explained
further in the following description of the syntax of DCL.
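As a hedged preview of the syntax described below, statements of both kinds might be rendered in Prolog along the following lines; the predicate names are illustrative assumptions, not clauses from this thesis:

```prolog
% Hypothetical illustration. Statements about named individuals:
% VirusX infects MicrobeY, and MicrobeY is a bacterium.
infects(virusX, microbeY).
bacterium(microbeY).

% A statement over an infinite domain: ALL carnivores eat meat.
% The variable X stands for an arbitrary carnivore, so no carnivore
% needs to be named explicitly.
eats(X, meat) :- carnivore(X).
```

The rule in the last clause is read "X eats meat if X is a carnivore", and it holds for every individual that can be substituted for X.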
5.2 Syntax
The description of a logic’s syntax describes a) the alphabet we are using, b) the
different types of words there are, and c) the allowable ‘sentences’ in the logic.
In DCL individual names are called constants and in Prolog these are single words
starting with a lowercase letter, or strings of characters in single quotes
(constant, ‘Also a constant’). An arbitrary individual is denoted by a single word
starting with an uppercase letter; this is a variable (an anonymous variable begins with an
underscore, and is an individual that we do not care to name) (Var, _AnonVar). Constants
and variables together are referred to as simple terms. In order to use an abstract name
to, for example, refer to an infinite domain, a complex term is used. A complex term
consists of a functor (also a word beginning with a lowercase letter) followed by a number
of terms separated by commas and encased in brackets (complexterm(Var,const)). A
ground term does not have any variables.
In order to define a relation between two (or more) individuals we use predicates which
are notationally the same as a constant. An atom is a predicate followed by a number
of terms encased in brackets (the same syntax as a complex term). Both for atoms and
complex terms, the number of terms encased in brackets is referred to as the arity of the
atom/predicate or complex term. Each term in the brackets is referred to as an argument.
Predicates with a different arity refer to different relations and do not mean the same thing.
A predicate can also have zero arity, in which case it is a simple proposition that can either
be true or false.
We can connect atoms to make statements called clauses using the symbols ‘,’ for ‘and’,
‘;’ for ‘or’ and ‘:-’ for ‘if’. A statement will only have one ‘if’ but may have multiple ‘ors’
and ‘ands’. The left part of a statement is the head of the clause and the right part of the
statement is the body of the clause; the ‘if’ (:-) is sometimes known as the ‘neck’. In DCL
(as opposed to full Clausal Logic) we only allow there to be one literal in the head of a
clause (clauses of this type are called Horn clauses), for reasons explained in section 5.3
on semantics.
A predicate definition can be made up of a number of clauses, which are read as a disjunction. Different predicate definitions are combined into a program and are read as a conjunction. The collection of predicate definitions is also sometimes known as a knowledge base.
For example if we take this (meaningless) program:
u(A):-w(b,q(A)),x(a).
u(A):-x(A),z.
u(A,B):-y(n,m).
v(I):-v(b,Q).
w(b,q(1)):-true.
x(a):-true.
x(b):-true.
y(n,m):-true.
z:- true.
There are seven predicate definitions. In order to refer to a predicate we use the combination of its name and its arity, written like u/1. The first predicate definition is
therefore u/1, which is made up of two clauses. The head of the first clause is the single atom u(A). As this atom is in the head of the clause it is a positive literal. There are two atoms in the body of the clause (negative literals): the first atom is w/2, whose first argument is the constant b and whose second argument is the complex term q/1, where q is the functor and its only argument is the variable A. The second body atom in the first clause is x/1, which has a single argument, the constant a. The second clause that completes the predicate definition for u/1 has the same head atom as the first clause, but its body is made up of two atoms, x(A) and z. The second predicate is u/2, which is distinct from u/1 as it has a different arity. u/2 is defined by a single clause. The predicate w/2’s second argument is a complex (compound) term with the functor q and a single argument, the number 1. The predicates w/2, x/1, y/2 and z are all facts because their only body atom is the special atom true, which always evaluates to true. Clauses of this form are normally abbreviated to omit the neck and body and would therefore simply be written as:
w(b,q(1)).
x(a).
x(b).
y(n,m).
z.
The predicate x/1 is the only other predicate with two clauses, representing the disjunc-
tion x(a) or x(b) as separate facts. The predicate z has an arity of zero and is therefore
a simple proposition which is true. A predicate can also be described as a relation when
its arity is greater than or equal to two e.g. the predicate y/2 is a relation but z/0 is not.
5.3 Semantics
5.3.1 Formal meaning of words and sentences
Sentences in Clausal Logic are said to be ‘truth functional’. This means that the meaning
of a sentence in the logic is assigned a truth value (in our case simply true or false), and
the semantics specifies under what conditions this can happen. For this we use a number
of concepts such as the Herbrand universe, Herbrand base and Herbrand interpretation.
The Herbrand universe is the set of all individuals we are talking about in our clauses. So
for a program P it is the set of all ground terms that can be built from the functors and
constants in P. Therefore if a program contains a functor then its Herbrand universe is
infinite (because a nested infinite term can be created). The Herbrand base of a program
is the set of ground atoms that can be constructed using the predicates in P along with the
ground terms. A Herbrand interpretation is a possible state of the universe being described: a mapping from each item in the Herbrand base of P to either true or false. By convention we can treat a Herbrand interpretation as a subset of the Herbrand base, by stating that the atoms in the set are true and the atoms not in the set are false.
Example:
The Herbrand universe for the above program is the following infinite set:
{a, b, n, m, q(a), q(b), q(n), q(m), q(q(a)), ...}
The Herbrand base is the infinite set of ground atoms:
{u(a), u(b), u(n), u(m), u(q(a)), ...,
w(a,a), w(a,b), ...,
x(a), x(b), ...,
u(a,a), u(a,b), ...,
y(a,a), y(a,b), ...,
v(a), v(b), ...,
z}
One possible Herbrand interpretation is the subset:
{u(a), u(q(a))}
In order to assign a truth value to each clause we use the following rules and the fact that
the body of a clause is a conjunction.
1. The body is true, and the head is true -> Clause is true
2. The body is true, and the head is false -> Clause is false
3. The body is false, and the head is true -> Clause is true
4. The body is false, and the head is false -> Clause is true
Thus a clause is equivalent to the statement ‘head or not body’ and so a clause is a
disjunction of atoms with each atom in the body of the clause being negated. Body atoms
are therefore called negative literals and head atoms are called positive literals. If a clause is true in an interpretation then that interpretation is a model for that clause; an interpretation is a model for an entire program if it is a model for each clause in the program.
Adding further literals to a clause can therefore only restrict the possible models for a program (this is an important property for reasoning about Prolog programs, both for correcting mistakes and for automated reasoning about programs). If an atom is in
every model of a program, then it is a logical consequence of the program. The semantics
we use is called minimal model semantics and that means we only accept things as true
if they are true in every model. This avoids having to state everything that is not true in
our domain of discourse, and is also the reason we are restricted to having one atom in
the head of our clauses, because otherwise there could be multiple minimal models. This
is the difference between DCL and full Clausal Logic. This also means that when we want
to have a statement such as x;y:-z. (If z then x or y) we actually need to write either
x:-z, not(y). or y:-z,not(x). to explicitly choose the minimal model we intend. We
will discuss further the semantics of negation in a later section.
Clausal inference proceeds by resolution: a body atom of one clause is resolved against the matching head atom of another, combining the two clauses. For example, given the clauses:
a:-b,c.
b:-d,e.
resolving on b leads to:
a:-c,d,e.
When an atom that is being resolved upon contains a variable, we need to apply a
substitution so that the pair of atoms are made equal. We substitute terms for variables –
for example if we apply resolution on the following two clauses and wish to unify b(V,u)
and b(w,X) we would use the substitution {V->w,X->u}.
a(u,V) :-b(V,u).
b(w,X) :- c(w,Z),d(X,Z).
This process is called unification and the substitution is called a unifier. The resulting
clause is then:
a(u,w):-c(w,Z),d(u,Z).
:-u(X).              u(A):-w(b,q(A)),x(a).     {A->X}
:-w(b,q(X)),x(a).    w(b,q(1)):-true.          {X->1}
:-x(a).              x(a):-true.
:-                   (the empty clause)
The proof tree shows each MGU (most general unifier) as Prolog attempts to derive the empty clause. In this example the empty clause can be reached in three different ways, hence the query is refuted. The refutation succeeds not only for A->1 but also for A->a and A->b. These substitutions can be thought of as the answers to the query. The first solution corresponds to the first proof found; the other solutions would be found on backtracking, which means trying different branches of the proof tree. How these branches are selected, and in what order, depends on the selection rule of the inference engine.
When unifying complex terms we should check that a variable is not bound to a term containing that same variable (the so-called ‘occurs check’). However, for efficiency reasons Prolog does not normally perform this check unless you explicitly ask for it. This means that under specific circumstances resolution in Prolog is unsound (see section 5.4 for more details).
Prolog implements SLD (Selective Linear Definite clause) resolution [84]. The resolution strategy of SLD states how a literal to resolve upon is selected. This is done by the selection rule; in Prolog the selection rule is left to right, and the search for a matching clause is top down in the program. This means Prolog searches depth first by default. This is illustrated in Figure 5.2.
[Figure 5.2: the SLD tree for the query ?- u(A). One branch resolves with u(A):-w(b,q(A)),x(a) and succeeds with A=1; the other branch resolves with u(A):-x(A),z and succeeds with A=a and A=b.]
Completeness: Inference rules that are complete allow us to derive any true sen-
tence from our axioms. If an inference rule is not complete there will be sentences
that are true in the logic that are not derivable.
If a recursive clause is tried before the base case (or a clause has its head literal also in its body), Prolog will enter a loop and never find the base case.
For example:
a(X):-a(X).
a(b).
This is a valid logical program containing a tautology and a second case. Simply querying ?- a(X). will result in an infinite loop in Prolog: each time the system attempts to resolve on a(X) it will generate a new, identical clause. Prolog can be made complete by changing its search strategy to breadth first. This can be done using a meta-interpreter, which we will briefly describe in a later section.
The logic and control components are the declarative and procedural knowledge of a Prolog program respectively. The declarative reading of a Prolog program defines the result of a query, if one is returned. Adding atoms to a clause can only reduce the number of solutions to a query; removing atoms from a clause, or adding additional clauses to a program, can only increase the number of solutions.
A methodology for writing Prolog programs is to first aim to logically describe the solution to the problem with declarative predicates. This general program will often answer queries with some instantiation patterns efficiently; others may not immediately allow efficient computation. The implementation can then be adapted taking procedural knowledge into account. In Chapter 6 of this thesis we illustrate this methodology when we implement subgroup discovery in Prolog in order to mine biological datasets.
X:-Y,Z.
would be read as: to satisfy X, satisfy Y and then satisfy Z. When reading a clause procedurally, the order in which the goals are processed is a factor to consider. Because the clause search in SLD resolution is top down, we should normally write non-recursive cases before recursive cases.
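For example, with a successor-style encoding of the natural numbers, placing the base case first makes the query ?- nat(N). enumerate answers one by one; with the clause order reversed, the same query loops without ever producing an answer. A sketch:

```prolog
% Base (non-recursive) case first, so ?- nat(N). yields N = 0, N = s(0), ...
nat(0).
nat(s(X)) :- nat(X).
```

Swapping the two clauses makes the recursive clause match first on every call, so the base case is never reached.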
The complete algorithm for the standard Prolog execution is presented in Appendix A.
A Prolog program can be declaratively correct, but procedurally incorrect. For example,
this could happen when the selection strategy for the inference rule causes the execution
of the program to enter an infinite loop. By reading a Prolog program procedurally we can
avoid this and further we can implement algorithms - a key advantage. For example, the
predicate unsortedlist_sortedlist/2 could define the logical relation between a sorted
and an unsorted list, but the programmer will often need to understand how to write the
different algorithms which will be most efficient for the intended use of the relation (e.g.
merge sort versus quick sort). This is especially relevant when implementing algorithms that have been described in the literature in an imperative style. In Chapter 6 of this thesis we show different implementations of algorithms for subgroup discovery, which aim to define a relation between our datasets and a set of interesting subgroup rules. Even though these algorithms attempt to define the same relation, their implementations differ: they will find different rules and be applicable to data with different properties.
Interestingly, there is research on automatically finding efficient programs and algorithms, but until this area is more mature it remains the responsibility of the Prolog programmer [31].
In addition to the logical features of Prolog, there are also non-logical features which
can only be understood procedurally. These include input and output where a program
needs to read a file or user input, or write output to a screen or other device. These features
fall outside of logic and are controlled by non-logical predicates. They are known as side
effect predicates and in later chapters we will see some of them in use.
The procedural understanding of a Prolog program can be difficult because in contrast
to the declarative reading, we need to understand the instantiation of variables and what
alternatives are found on backtracking. Indeed different algorithms could be implemented
for different variable instantiations. This is because a program can be queried in multiple
directions. We recommend that a Prolog programmer keeps the logic of their problem separate from any input/output requirements. This separation of concerns allows for clean declarative logical code for core functionality, and efficient reusable input/output code.
Goal termination
The following query finds each answer in finite time but never runs out of answers, because it enumerates lists of every length:
?- length(Ls, _).
Ls = [] ;
Ls = [_8662] ;
Ls = [_8662, _8668] ;
Ls = [_8662, _8668, _8674] ;
...
Another important class of problems where Prolog is useful but termination may not occur is when we are searching for properties of objects that we do not know in advance to exist. For instance, when searching for a sub-graph with a certain substructure within an infinite graph, if that sub-graph does not exist then the program should not terminate [143].
In order to better understand termination in Prolog it is useful to distinguish two notions of termination [143]. A query Q terminates universally if the goal
?- Q, false.
terminates. A query terminates existentially if it finds a first answer in finite time, even when backtracking through all of its answers would not terminate.
If a query does not terminate existentially, then it also does not terminate universally
[143]. A debugging technique called failure slices [114] (and the related technique of pro-
gram slicing) makes use of the logical properties of a Prolog program in order to help a
programmer fix mistakes in their programs. This technique can produce explanations for
non-termination. The idea is to slice the program into segments and insert a false goal.
This will allow us to isolate from where the non-termination originates in our program.
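As a sketch of the failure-slice idea (app/3 here is just an illustrative copy of append/3): we add false at the end of each clause body, and any query that still fails to terminate on the sliced program shows where the non-termination originates:

```prolog
% Failure slice of append/3: every clause body now ends in false,
% so the slice can never succeed -- it can only loop or fail.
app([], L, L) :- false.
app([H|T], L2, [H|L3]) :- app(T, L2, L3), false.

% ?- app(X, Y, Zs). still does not terminate on this slice,
% so the non-termination originates in the visible recursive call.
```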
Debugging techniques of this kind are not available in imperative programming languages such as Java or Python, and are a key advantage of Prolog programming. It is even possible to use an external Prolog program that will automatically generate program slices to help find the reasons for non-termination in our programs [114].
Natural numbers can be represented by the successor term s/1, writing 0, s(0), s(s(0)) and so on. The following predicate relates two such numbers to their sum:
n_n2_sum(0,X,X).
n_n2_sum(s(X),Y,s(Z)):-n_n2_sum(X,Y,Z).
We can use this predicate for a variety of tasks, as demonstrated by the following queries.
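A sketch of such queries: adding s(0) and s(s(0)), and then enumerating the pairs of numbers that sum to s(s(0)):

```prolog
% Addition: 1 + 2 = 3 in the s/1 encoding.
?- n_n2_sum(s(0), s(s(0)), Sum).
Sum = s(s(s(0))).

% Enumeration: which pairs sum to 2?
?- n_n2_sum(X, Y, s(s(0))).
X = 0, Y = s(s(0)) ;
X = s(0), Y = s(0) ;
X = s(s(0)), Y = 0.
```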
In fact other Prolog data structures can be used this way to implicitly represent integers.
We do not have to use the term s/1. For example we can implicitly represent the size of
a list by a list itself. This is illustrated by the following predicate definition.
n_n2_sum([],L,L).
n_n2_sum([H|T],L2,[H|L3]) :- n_n2_sum(T,L2,L3).
This predicate is normally called append/3 and it can be used to combine lists (to add two
numbers together), to split a list (to subtract one number from another), or to generate
lists of increasing length (to count). This can be achieved by making queries with different instantiation patterns, that is, with some arguments bound to explicit values so that they are ground, others semi-instantiated (for example, [X|Y] is a list but we do not know how long it is), and others left as free (unbound) variables.
The following shows example queries for adding 2+1, finding what numbers sum to
three and counting.
% 2+1 = 3.
?- append([_,_],[_],X).
X = [_,_,_].
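The second task, finding which numbers sum to three, is a sketch along the same lines: we ask for all splits of a three-element list:

```prolog
% 0+3, 1+2, 2+1 and 3+0 all sum to three.
?- append(X, Y, [_,_,_]).
X = [], Y = [_,_,_] ;
X = [_], Y = [_,_] ;
X = [_,_], Y = [_] ;
X = [_,_,_], Y = [].
```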
%counting
?-append(_,_,X).
true;
X = [_6662|_6434] ;
X = [_6662, _6674|_6434] ;
X = [_6662, _6674, _6686|_6434];
The following query asks: what X, when added to two, equals five?
?- 5 #= X+2.
X = 3.
In contrast, clp(b) systems maintain standard operators, but these are enclosed in the sat/1 predicate, which attempts to satisfy the enclosed equation. For example, the following query shows that for X*Y to be satisfied, that is, to evaluate to true (1), both X and Y have to be equal to one.
?- sat(X*Y).
X = Y, X = 1.
Both clp(b) and clp(fd) systems work by first issuing constraint goals, followed by labelling goals, which enumerate solutions. As seen in the two examples above, if a goal can immediately be satisfied then there is no need to explicitly call the labelling predicates. The following example illustrates a constraint satisfaction problem where this is not the case and a labelling predicate is required to enumerate the solutions:
:-use_module(library(clpfd)).
n_n2_sum(N,N2,Sum):-
Sum #=N+N2.
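With this predicate loaded, a query such as the following (a sketch; the domain constraint on N is our addition) leaves residual constraints until label/1 enumerates concrete solutions:

```prolog
?- n_n2_sum(N, N2, 5), N in 0..2, label([N, N2]).
N = 0, N2 = 5 ;
N = 1, N2 = 4 ;
N = 2, N2 = 3.
```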
A constraint satisfaction problem (CSP) consists of:
1. A set of variables.
2. A domain of possible values for each variable.
3. A set of constraints over those variables.
In both clp(b) and clp(fd), variables look like normal Prolog variables; however, if on querying a clp(fd) variable is instantiated to something other than an integer, a type error will be thrown, and if a clp(b) variable is instantiated to something other than zero or one, an error will also be thrown. The domain of a clp(fd) variable can initially be set with the in/2 or ins/2 predicates. For example, to state that the variable X is in the domain 1 to 10 inclusive we would query:
?- X in 1..10.
X in 1..10.
If, on labelling of variables, there are several values that meet the constraints, then a
criterion for choosing amongst these can also be specified. This is called the labelling
strategy.
The actual algorithms for constraint satisfaction are not directly visible to a Prolog programmer; the programmer simply interacts with the constraint solver. The intuition behind a constraint satisfaction problem is that it can be conceived of as a hyper-graph, where the nodes are the variables and the edges are the constraints. So the constraint p(X, Y) would have two arcs: (X, Y) and (Y, X). A consistency algorithm then controls the search for a solution.
We now describe a simplified algorithm of how this might work. Assume we have the variables X and Y with domains DX and DY, and a constraint p(X, Y). The arc (X, Y) is said to be arc consistent if for each value of X in DX there is some value for Y in DY satisfying the constraint p(X, Y). If the arc is not consistent, then each value in DX that does not have a corresponding value in DY may be deleted from DX. This is repeated until we end up with a consistent arc (X, Y).
For a concrete example, we set the domains of X and Y to the integers 1 to 10. If we have a constraint p(X, Y): X - 4 >= Y, then the arc (X, Y) is not consistent, because if, for example, X = 3 then Y cannot take a value in DY (as 3 - 4 = -1, which is not in the domain of Y). So the domain of X would be reduced to 5..10. When a domain is reduced, the CSP algorithm will check every other arc, as they may no longer be consistent. This effect percolates through the system, sometimes in a loop, until either a domain becomes empty, in which case there is no answer, or the domains stabilise; the answer may then be a concrete value or a residual constraint.
Examples:
In the case where a residual constraint remains, the system can be asked to label the remaining variables.
?- X#>=10, X#<13,labeling([min(X)],[X]).
X = 10 ;
X = 11 ;
X = 12.
A user can define the order in which they wish to see the labelled results by using the first argument of the labeling/2 predicate. In the previous query we asked for labels that minimise the value of X. In this way certain classes of optimisation problems can be solved, using only declarative knowledge but still with good efficiency.
Prolog also provides built-in low-level arithmetic through the is/2 predicate, which evaluates the arithmetic expression on its right-hand side:
?- X is 5+2.
X = 7.
However, unlike the clp(fd) system, this cannot be used in all directions, as can be seen in the following query.
?- 5 is X +2.
ERROR: Arguments are not sufficiently instantiated
ERROR: In:
ERROR: [8] 5 is _2816+2
ERROR: [7] <user>
One advantage of the low-level approach is that it is compatible with floating point arithmetic, allowing for decimal numbers. However, floating point arithmetic is often a cause of programming errors, no matter what language a program is written in. When you know how much precision you require for a given problem, it is often a good idea to multiply out the decimal numbers so that you can use integers. This allows you to use the constraint libraries and makes the rounding explicit, where it can be hidden with floating point operations.
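For example, low-level floating point arithmetic exposes the usual binary rounding, whereas scaling to integers (here, a hypothetical choice of working in tenths) keeps the arithmetic exact:

```prolog
?- X is 0.1 + 0.2.
X = 0.30000000000000004.

% Scaling to integers (working in tenths) keeps the arithmetic exact:
?- X #= 1 + 2.
X = 3.
```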
Negation in Prolog can be a misunderstood concept. The core principle is that Prolog has a closed world assumption (CWA). Under the CWA, negation is defined as negation as failure: if a goal cannot be proved, it is said to be false. This does not correspond to the usual logical definition of ‘not’. For this reason negation in modern Prolog uses the \+/1 predicate rather than the not/1 predicate (they perform exactly the same computation), in order to emphasise this difference. A subtlety is that \+/1 means “cannot be proven at this time”. This means that the order of the goals is important and we lose the pure logical, monotonic reading of the program. If the goal will only ever be called when ground, this is not a problem: when the goal is ground, negation as failure is sound and can be understood declaratively. When the goal is not ground, it can only be understood procedurally.
An alternative to the \+/1 predicate is the dif/2 predicate. This is a constraint predicate that is used to express inequality. When constraints are in a program or goal, the solution found is not just a binding of variables but may include a list of constraints. These constraints can be called upon later if their results are used in further goals.
For example, if we post a dif/2 goal:
?- dif(X,1).
dif(X, 1).
The following small program uses dif/2 to state that any animal other than a dog is wild:
animal(dog,pet).
animal(A,wild):- dif(A,dog).
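Querying this program for wild animals yields a residual constraint rather than a concrete binding, while a ground query succeeds outright:

```prolog
?- animal(A, wild).
dif(A, dog).

?- animal(tiger, Type).
Type = wild.
```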
The dif/2 predicate maintains logical properties such as monotonicity. This means that when we add and remove clauses and predicates, the set of solutions expands and contracts in a predictable way. Also, the meaning of a program is not changed when the order of its literals is changed.
5.7.1 Meta-predicates
Prolog programs can themselves be represented as terms. This is an important feature that is made use of in ‘Pengines’, the subject of Chapter 8 of this thesis. Meta-predicates take as an argument a Prolog term that is treated as a goal. This allows us to call predicates that are constructed at run time. The basic meta-predicate is call/1, which is true if the goal passed to it can be proved. From this predicate further predicates can be defined, for example the call/N family, which allows us to augment the term passed to the call with additional arguments.
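A sketch of call/1 and the call/N family, constructing a goal at run time and then augmenting it with an extra argument:

```prolog
% call(Goal, Xs) adds Xs as a final argument, so this runs append([1],[2],Xs).
?- Goal = append([1], [2]), call(Goal, Xs).
Goal = append([1], [2]),
Xs = [1, 2].
```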
Other meta-predicates apply a goal to lists of inputs. These include the maplist family, the foldl family and scanl (these are called families because multiple predicates are defined that accept predicates of different arities as inputs).
These predicates often replace the use of loops in imperative languages. For example, if a bioinformatician wanted to apply a function to the elements of corresponding lists they would use maplist/3. Here we use our n_n2_sum/3 clpfd predicate from the discussion on arithmetic, but we note that any predicate can be passed into these meta-predicates, providing much power. This first query uses maplist to sum corresponding elements from two lists.
?- maplist(n_n2_sum,[1,2,3],[4,5,6],Sums).
Sums = [5, 7, 9].
In the next query we pass the value 3 into the first argument of maplist to add 3 to
each element.
?- maplist(n_n2_sum(3),[4,5,6],Sums).
Sums = [7, 8, 9].
Now we show a query that ‘folds’ a list from the left direction. The effect of this using
the n_n2_sum/3 predicate is to sum the list.
?- foldl(n_n2_sum,[4,5,6],0,Sum).
Sum = 15.
Next we show the scanl/4 predicate, which when used with n_n2_sum/3 produces the list of intermediate sums:
?- scanl(n_n2_sum,[4,5,6],0,Sum).
Sum = [0, 4, 9, 15].
These structures are more powerful than their functional equivalents (such as those implemented in the Haskell programming language) because they are relational and can therefore be used in multiple directions, including as generators or checkers.
The next query illustrates how backtracking can be used to find constraints on numbers
(that sum up to 5).
?- foldl(n_n2_sum,X,0,5).
X = [5] ;
X = [_3074, _3080],
_3080+_3074#=5 ;
X = [_3518, _3524, _3530],
_3524+_3518#=_3550,
_3530+_3550#=5 ;
...
Finally, we demonstrate how to check that the third list is the element-wise sum of the first two lists:
?- maplist(n_n2_sum,[-1,-2,-3],[1,2,3],[0,0,0]).
true.
Immediate uses of meta-predicates that a bioinformatician might imagine include various traversals over graphs and applying data transformations to lists. Four advantages of using these meta-predicates over a loop construct are:
1. We do not need to track iterators, removing the possibility of an off-by-one error.
2. The code is often shorter, and it is arguably easier to understand what the code means: we have declaratively defined the relation that the predicate works with, and rather than tracking the internal state of a changing system we can reason about what the results will be.
3. The code is encapsulated: the relation is separated from the looping. This means there is more chance that your code will be correct and that your resulting analysis and conclusions can be relied upon.
4. Flexibility of querying patterns: the same code can ‘undo’ a loop, so we do not need to write the opposite function and potentially introduce a mismatch in functionality.
Recent developments in the Prolog community have defined a number of meta-predicates that are made use of in this thesis. These are described in Neumerkel and Kral [113] and are implemented in a library called “reif”.
The primary predicate defined in library reif is if_/3, and its purpose is to index the dif/2 predicate described above. This is important because we want computational efficiency while retaining logical properties (i.e. the goal order does not matter, and predicates define true relations that can be used in multiple directions and as generators). This means that predicate calls should be deterministic when possible: a called goal should not leave open or create redundant choice points.
The if_/3 predicate is a new building block that can be used to make a large number of useful predicates that are both efficient and retain the desired logical properties. It also simplifies code writing, as it reduces the number of conditions compared to using dif/2 directly, where a programmer needs to write every condition twice, once for the positive case and once for the negative case; with if_/3 each condition only needs to be stated once.
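As a sketch, here is a hypothetical helper predicate list_without/3 (our own name, not part of library reif) that removes every occurrence of an element from a list; the condition X = Y is stated once, and if_/3 keeps the call deterministic when the arguments are ground:

```prolog
:- use_module(library(reif)).

% list_without(X, Ls0, Ls): Ls is Ls0 with all occurrences of X removed.
list_without(_, [], []).
list_without(X, [Y|Ys], Zs) :-
    if_(X = Y,
        list_without(X, Ys, Zs),
        ( Zs = [Y|Zs1], list_without(X, Ys, Zs1) )).

% ?- list_without(1, [a,1,b], Zs).
% Zs = [a, b].
```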
In order to use if_/3 you need a reified predicate definition. A number of common reified predicates have already been defined in library reif, such as =/3, #>=/3 and #=</3. For example, =(X, E, T) is a reified version of the disjunction X = E ; dif(X, E): the two cases are combined into a new predicate with an additional argument, T, which is true if the terms are equal and false otherwise.
It is a simple task to define additional reified predicates when needed, and from these a large number of utility predicates can be defined that meet the properties described above. Examples include tfilter/3 and tpartition/4. The former filters an input list to the elements for which the supplied predicate is true, for example:
?- tfilter(dif(1),[a,1,b],Xs).
Xs = [a, b].
This goal succeeds deterministically. The latter partitions the input list based on the supplied reified predicate. For example:
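A sketch of tpartition in use, assuming library(reif) is loaded:

```prolog
% Partition a list into the elements equal to 1 and the rest.
?- tpartition(=(1), [a,1,b,1], Ins, Outs).
Ins = [1, 1],
Outs = [a, b].
```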
The efficiency gains from using these meta-predicates are important when dealing with the large datasets that bioinformaticians work with, as we will see in Chapter 6 of this thesis. They also reduce the amount of code and help reduce the number of mistakes in our programs.
Meta interpreters
?- setof(Pathway,
       setof(Reaction, reaction_pathway(Reaction,Pathway), [_,_,_,_|_]),
       Pathways).
Note how we have used the implicit ‘list’ representation of the number four (a list of length at least four), as discussed in the section on Prolog arithmetic.
In SWI-Prolog version 7 there are also a number of other second-order predicates that are useful to know about. These are aggregate/4, aggregate/3, aggregate_all/3 and findnsols/4. The predicate aggregate/4 internally uses setof/3, aggregate/3 uses bagof/3, and aggregate_all/3 uses findall/3. The aggregate and aggregate_all predicates are used with template terms such as count, sum/1, min/1 or max/1, which can also be nested in an arbitrary named compound term, for example r(min(X),max(X)). These second-order predicates provide efficient SQL-like database functions for sets of solutions to Prolog queries and will often be useful for a bioinformatician. For full details of how they work see the SWI-Prolog documentation.
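As a sketch, two aggregate_all/3 queries, counting and taking a maximum over a backtrackable goal:

```prolog
?- aggregate_all(count, member(_, [a,b,c]), N).
N = 3.

?- aggregate_all(max(X), member(X, [3,1,2]), Max).
Max = 3.
```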
The final predicate we will mention is findnsols/4. This is similar to findall/3 but limits the number of solutions found to n; on backtracking it finds the next n solutions. This predicate is useful when using pengines (covered in Chapter 7). It also corresponds to the SQL LIMIT clause available in some database systems.
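A sketch of findnsols/4, collecting solutions two at a time:

```prolog
?- findnsols(2, X, member(X, [a,b,c,d,e]), Xs).
Xs = [a, b] ;
Xs = [c, d] ;
Xs = [e].
```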
algorithms as lists of state changes, and a DCG is an appropriate way to model this. In order to implement the threading of state you would use semicontext notation, which requires fewer arguments (than not using a DCG), making the code easier to read and understand. For more information on this technique see [143].
A DCG predicate is a rule with a head and a body. The body consists of terminals and non-terminals. A terminal is a list, which represents the elements it contains. A non-terminal is another DCG rule, which represents the elements that it in turn describes. We refer to a DCG predicate as f//n (f being the functor and n the arity). This distinguishes it from the standard predicate indicator in Prolog (which has only one /). An important feature of DCGs is that it is possible to insert regular Prolog goals, by placing them inside braces {} within the DCG rule.
The following DCG describes DNA, as a list of a,c,g and t elements.
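A grammar of this kind can be written as follows (a sketch, consistent with the listing/1 output shown later in this section):

```prolog
dna --> [].
dna --> [a], dna.
dna --> [c], dna.
dna --> [g], dna.
dna --> [t], dna.
```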
To invoke a grammar rule we use the phrase/3 predicate, as shown in the following exam-
ple:
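For example, a sketch of phrase/3 recognising and rejecting sequences with the dna//0 non-terminal described above:

```prolog
?- phrase(dna, [a,c,g,t], []).
true.

?- phrase(dna, [a,c,x], []).
false.
```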
In order to see what a DCG predicate looks like as a standard Prolog predicate you
can query your Prolog system with the listing/1 predicate. For example:
?- listing(dna).
dna(A, A).
dna([a|A], B) :-
    dna(A, B).
dna([c|A], B) :-
    dna(A, B).
dna([g|A], B) :-
    dna(A, B).
dna([t|A], B) :-
    dna(A, B).
There are some key advantages to using DCGs for bioinformatics tasks. First, they are very readable: only a few predicate definitions actually need to be used or modified, and these have fewer arguments than non-DCG predicates accomplishing the same task. Second, using a DCG we explicitly describe our lists where we need them, making it clear what our data structure is and formally describing its requirements, with automatic checking that our data meets them. This makes for better self-documentation and less chance of mistakes in our programs.
2. A tutorial – often including demonstrations of code and allowing the opportunity for
a new user to play with concepts in order to facilitate understanding.
3. A technical reference – a detailed technical description of each part of the code that
users will interact with.
In part three of this thesis, when we describe the Reactome Pengine tool, we will illustrate
these three types of documentation.
The declarative nature of Prolog aids understanding, as a user can read the name of a predicate to determine its function. To a large extent this can mean that Prolog code is self-documenting. For this to be true, effort does need to be made to name predicates correctly. Predicates should be named as relations; verbs should be avoided, because they imply a direction of use and a procedural reading, and can lead developers to miss some of the functionality their program provides. The exception to this rule is when a predicate controls a side effect, for example writing output or reading input. Hence it is a good idea for predicate names to correspond to their arguments, and where possible we adopt this convention.
Apart from the names of predicates, code can be made easier to use by providing comments, which allow greater detail about a predicate’s usage for reference material. The PlDoc system provided by SWI-Prolog generates reference documentation from structured comments. These structured comments can be used in two ways. Firstly, the system can automatically generate high-quality LaTeX documentation for publication as a PDF file. Secondly, and more powerfully, it provides a web server that allows a developer to browse documentation in situ and share live documentation with others in a simple manner.
The structure of a PlDoc comment allows for a free-text description of a predicate, but also, optionally, semi-formal information. For example, although Prolog is a typeless language (variables can be bound to anything), it is often useful to specify types for the intended use of an argument in a predicate. These informal types are often referenced in the literature and documentation of a Prolog system. There are no standard types, so care must be taken to be consistent; for example, do not use both ‘int’ and ‘integer’ to refer to an argument expecting an integer. The second feature that can be documented for an argument is its expected ‘mode’, i.e. whether the argument is to be bound to a ground or semi-ground term (an input argument), or is expected to be a free variable or at least a partially instantiated structure (an output argument).
The recommended style for showing an argument's mode uses the following symbols:
• + the argument is an input; it is instantiated at call time, although it does not necessarily have to be fully ground.
• - the argument is an output; it is normally unbound at call time and is instantiated by the predicate.
• ? the argument may be used as either an input or an output.
A Prolog compiler will not enforce anything that is written in the documentation concerning types and modes; the structured comments are solely intended for human reading. Additionally, a predicate can be marked as deterministic, semi-deterministic, non-deterministic or multi, as follows.
• If a predicate is not expected to fail and can only generate one value, it is deterministic.
• If a predicate either succeeds exactly once or fails, it is semi-deterministic.
• If a predicate leaves choice points and may give multiple answers, but could also fail, it is non-deterministic.
• If a predicate can return multiple answers but cannot fail, it is multi.
Sometimes a developer will decide to document each predicate with multiple type and mode declarations to show each intended use. For example, in SWI-Prolog the length/2 predicate is documented with the mode indicator length(?List, ?Int), showing that it can be used both to measure the length of a list and to generate lists of a given length.
CHAPTER 6. DESCRIPTIVE RULE INDUCTION 67
We demonstrate the ability to find rules describing subgroups of CpG sites that differ in cancer (section 6.3.1) and subgroups of microbes that are present or absent in lesional psoriasis samples (section 6.3.2).
An example is correctly covered by a rule if its description satisfies the rule conditions and its class label matches the class label predicted by the rule. An example is incorrectly covered if its description satisfies the rule conditions but the class labels do not match. Figure 6.1 illustrates how rules can cover examples.
Given:
• a set of examples ε, instances for which the class labels are known, described in the data description language.
Find:
• subgroup descriptions in the form of individual rules R, formulated in the rule description language, each of which should be as large and as consistent as possible.
In a classification task we would look for a minimal rule set that is complete and consistent, but in subgroup discovery this is relaxed. The union of the subgroups does not need to have complete coverage of the positive class, as we do not need to describe every instance as belonging to a subgroup. Additionally, a subgroup can have a fairly large coverage of the negative class and still be interesting. Finally, instances can belong to multiple subgroups, so there can be some redundancy, because we are then able to describe multiple facets of the instances.
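These coverage notions can be made concrete with a small sketch. The following Python fragment is our own illustration, not the thesis's Prolog implementation, and all names in it are ours: a rule is a set of attribute-value conditions, and we count correctly and incorrectly covered examples.

```python
def covers(rule, example):
    """A rule covers an example if every condition matches the description."""
    return all(example[attr] == val for attr, val in rule.items())

def coverage_counts(rule, target, examples):
    """Count correctly covered (class matches target) and incorrectly
    covered (class differs) examples."""
    correct = sum(1 for desc, cls in examples if covers(rule, desc) and cls == target)
    incorrect = sum(1 for desc, cls in examples if covers(rule, desc) and cls != target)
    return correct, incorrect

# Toy data in the style of Table 6.1.
examples = [
    ({"A1": "a", "A2": "c"}, "Pos"),
    ({"A1": "b", "A2": "z"}, "Neg"),
    ({"A1": "q", "A2": "c"}, "Pos"),
]
rule = {"A2": "c"}  # IF A2 = c THEN Pos
print(coverage_counts(rule, "Pos", examples))  # (2, 0)
```

A subgroup rule is allowed to leave `correct` well below the total number of positives, and to have a non-zero `incorrect`, as described above.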
Figure 6.1: If target classes are not disjoint then a rule cannot be complete and consistent
[53]
InstanceId A1 A2 A3 C
Id1 a c 1.1 Pos
Id2 b z 0.2 Neg
Id3 q c 1.1 Pos
... ... ... ... ...
Idn A1n A2n A3n Cn
The second data description language used in this chapter is the multi-instance (MI) representation. This representation is suitable for noisy data, where we cannot label each individual or, as we will see in our microbiome example, where we are not sure which set of attributes applies to an example. Here examples are bags of tuples (individuals), and the class label is applied to the bag rather than to each individual tuple. An instance therefore has the form {(v_1j, ..., v_nj)_1, (v_1j, ..., v_nj)_2, ..., (v_1j, ..., v_nj)_Bj}, where Bj is the number of individuals in the bag; each instance can have a different number of individuals in its bag. An example in the multi-instance representation is e_j = ({(v_1j, ..., v_nj)_1, ..., (v_1j, ..., v_nj)_Bj}, c_j). If we represent an MI dataset as a table, multiple rows correspond to each instance, as shown in Table 6.2.
InstanceId A1 A2 A3 C
Id1 a c 1.1 Pos
b z 0.2
q c 1.1
Id2 q c 0.2 Neg
r c 0.1
v z 1.0
Id3 z v 0.2 Pos
e c 0.1
a q 0.3
... ... ... ... ...
Idn ... ... ... cn
Figure 6.2: An algorithm for the transformation of a set of attributes into a set of features
[53]
Feature construction must also handle missing attribute values. In this chapter we adopt the pessimistic value strategy for missing values; other strategies are detailed in [53]. The motivation for this strategy is that unknown values should not affect the quality and potential usefulness of a feature in rule construction, so a feature should not be able to distinguish between a positive-negative pair when one of the pair's values is unknown. Therefore, if a value is missing for a feature in a positive example it is set to false, and if missing in a negative example it is set to true; the resulting positive-negative pairs cannot then be distinguished by that feature. This results in a smaller number of discriminated pairs, meaning that features built from attribute values with unknown values are penalised. When feature construction is performed explicitly as a first step in rule learning, the result is a coverage table; Table 6.3 shows an example coverage table for an AVL dataset.
InstanceId F1 F2 F3 ... Fl C
Id1 T F T ... T Pos
Id2 T T F ... F Neg
Id3 F T T ... T Pos
... ... ... ... ... ... ...
Idn {T∨F} {T∨F} {T∨F} ... ... Cn
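The pessimistic value strategy can be sketched as follows. This is our own Python illustration (the function names are ours, not the thesis's): a pair is discriminated by a feature only when the feature is true for the positive example and false for the negative one, and the pessimistic fill makes this impossible for any feature with an unknown value.

```python
def discriminates(f_pos, f_neg):
    """A feature discriminates a positive-negative pair when it is
    true for the positive example and false for the negative one."""
    return f_pos and not f_neg

def pessimistic_fill(features, label):
    """Replace missing values (None) pessimistically: False in
    positive examples, True in negative examples."""
    fill = (label != "Pos")
    return [fill if v is None else v for v in features]

pos = pessimistic_fill([True, None, True], "Pos")   # F2 unknown -> False
neg = pessimistic_fill([False, None, True], "Neg")  # F2 unknown -> True
assert not discriminates(pos[1], neg[1])  # F2 cannot separate this pair
```

The feature with the unknown value is thus penalised: it contributes no discriminated pairs for that instance.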
Subgroup discovery algorithms employ a range of search strategies, including at least one method inspired by social insect behaviour. Table 6.4 details existing subgroup discovery algorithms, including which search strategy they use and whether they learn a set of rules all at once or sequentially, one by one.
General-to-specific search is guided by a heuristic and is the most common search strategy. A number of these algorithms, such as CN2-SD, can be set so that the search maintains a ‘beam of rules’ rather than a single rule. Specific-to-general search is useful when there are too few instances available for the incremental general-to-specific search to be effective; however, it is not commonly used in subgroup discovery and has had the most success in Inductive Logic Programming applications.
General-to-specific heuristic search algorithms learn one rule at a time. To stop the same rule being learnt repeatedly, instances that have already been covered by a rule need to be penalised. In classification rule learning, covered instances are removed completely from the subsequent rule learning process. In subgroup discovery, however, it may be interesting for an instance to be described in multiple ways, and because, unlike in a classification task, each subgroup rule is independent (the rules do not collectively form a rule list), completely removing covered instances is not appropriate. Hence, algorithms such as CN2-SD and RSD employ a method where a weight is maintained for each instance; when an instance is covered by a rule, its weight is reduced. The heuristic used to guide the search is also adapted (from classification) to take into account the weights of the instances covered. Alternatively, the adaptation of AQ (Algorithm Quasi-Optimal) in [138] keeps a list of uncovered positive examples and restricts the features added to a rule to the values of a randomly chosen uncovered instance, which guarantees that this instance is covered.
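The weighted covering idea can be sketched in a few lines. This Python fragment is our own illustration under our own naming; the weight halving mirrors the `W div 2` used later in Code Block 5, but it is not the thesis's implementation.

```python
def reduce_weights(weights, covered):
    """Halve the weight of every instance covered by the newly learnt
    rule, so subsequent rules prefer instances not yet described."""
    return [w // 2 if c else w for w, c in zip(weights, covered)]

weights = [100, 100, 100, 100]        # initial instance weights
covered = [True, False, True, False]  # coverage of the rule just learnt
weights = reduce_weights(weights, covered)
print(weights)  # [50, 100, 50, 100]
```

A covered instance keeps a non-zero weight, so it can still appear in later subgroups, unlike the hard removal used in classification rule learning.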
Specific-to-general algorithms also learn a single rule for a given set of input instances, and in this way the number of subgroups can be predefined. However, the effect of this search strategy is that the question becomes ‘how can we describe what these instances have in common, compared to the negative class?’ rather than ‘what are the interesting subgroups of this data?’, because we have generalised a set of instances. Another potential problem with this approach is that if a large number of instances are generalised into one subgroup rule, the result is often a very long rule (potentially infinite if the data description language is sufficiently rich), which is difficult to interpret.
Exhaustive algorithms output a complete set of rules; the researcher can then set a cut-off quality value or a fixed maximum number of subgroups. Randomised beam algorithms, such as genetic algorithms, output a set of rules equal to the beam width (population), which is normally set to between 10 and 100. These rules can then be further filtered using a quality criterion (e.g. all rules above an accuracy threshold), by taking the top K subgroups, or by using a statistical test of how likely a permuted dataset would be to produce subgroups of this size.

Table 6.4: Existing subgroup discovery algorithms.

Exhaustive rule learning algorithms
Name         Citation
Explora      [80]
SD-Map       [7]
Apriori-SD   [77]

Heuristic rule learning algorithms
Name         Search-Order          Learns-Sequentially/Learns as a set
SD           General-to-Specific   Sequentially [54]
AQ-Family    General-to-Specific   Sequentially [101]
Explora      General-to-Specific   Sequentially [80]
Prism        General-to-Specific   Sequentially [19]
CN2-SD       General-to-Specific   Sequentially [90]
Midos        General-to-Specific   Sequentially [166]
Heuristic general-to-specific algorithms and genetic algorithms can be used with a variety of heuristics to guide the search. In fact, for rule learning algorithms that find rules sequentially, one after the other, it is possible to employ two different heuristics: one for learning each individual rule and one for selecting rules [150]. An effective heuristic is Weighted Relative Accuracy [92], a generalisation of the rate-difference heuristic that works with weighted instances (required in the sequential learning strategy).
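Weighted Relative Accuracy trades off generality against accuracy gain: for a rule with condition Cond and target Class it is p(Cond) * (p(Class | Cond) - p(Class)). A small Python sketch of the formula follows (our own illustration, using unweighted counts for simplicity; the weighted variant replaces counts with weight sums).

```python
def wracc(n_cond, n_cond_class, n_class, n_total):
    """Weighted Relative Accuracy: the rule's coverage times the gain
    of its accuracy over the default class probability."""
    p_cond = n_cond / n_total                       # p(Cond)
    gain = n_cond_class / n_cond - n_class / n_total  # p(Class|Cond) - p(Class)
    return p_cond * gain

# A rule covering 7 examples, 6 of them positive, on a 9-positive /
# 5-negative dataset:
print(round(wracc(7, 6, 9, 14), 3))  # 0.107
```

A rule whose accuracy merely matches the class prior scores zero, however many examples it covers.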
Illustrative example
Table 6.5 shows an artificial example for subgroup discovery. Each row in the table is a microbe, and each attribute (column) corresponds to some property of that microbe. The class label for each instance describes whether or not the environment in which the microbe lives is the sea.
Rule                                                                  Coverage
IF Size = Medium AND Shape = circular THEN ‘lives in the sea’ = N     Yes (0/9), No (3/5)
IF Size = Medium THEN ‘lives in the sea’ = N                          Yes (2/9), No (3/5)
IF Shape = Spiky THEN ‘lives in the sea’ = Y                          Yes (6/9), No (1/5)
Table 6.7: How the data table is organised for studies such as [55]. There are P genes
(attributes) and N patients. N << P .
In that study it was shown how to mine patients' gene-expression data in order to find rules of the following form:
PatienthasCancer <- Gene1=Expressed ∧ Gene2=Expressed ∧ ... ∧ GeneP=NotExpressed.
This method of data analysis from Gamberger et al. [55] has the advantage that the rules found are easier to understand than a model built, using techniques such as support vector machines, from many thousands of small contributions of gene expression. However, although recognised in that paper, the large number of genes compared to the number of patients [82] remains statistically difficult. An alternative way to analyse datasets such as this is, instead of attempting to find disease markers for patients by finding subgroups of patients, to find subgroups of expressed genes that differ across the experimental conditions (depicted in Table 6.8). This allows researchers to better understand how the described subgroups of genes affect a phenotype or experimental condition. It has been recognised [89, 142] that Gene Set Enrichment Analysis (GSEA), a common biological data analysis technique, is equivalent to restricting subgroup rules to a single feature. In effect the subgroups are predefined rather than searched for, and the data analysis simply determines which of them are significant. To see this, consider the following rule:
GeneIsDifferentlyExpressed <- GO-Term1.
This rule would typically be described by a biologist as “genes annotated with GO-Term1 are enriched”, and would be found by performing GSEA on genes labelled with Gene Ontology terms, using data from an experiment or set of phenotypes.
Table 6.8: How the data table can be organised for studies such as SEGS[142]. There are
P Gene Ontology Terms (attributes) and N Genes. N >> P .
Where this has been recognised, researchers have used subgroup rule learning algorithms to make new ‘gene sets’ by taking conjunctions of existing sets; each new set is the intersection of all the sets appearing in a rule's features. The paper ‘Contrasting subgroup discovery’ [89] gives a comparison of the terminology of the two fields, which we recreate in Table 6.9.
Table 6.9: A translation of the vocabulary used in the subgroup discovery literature and the gene set enrichment analysis literature, from the paper ‘Contrasting subgroup discovery’ [89].
If we turn our attention to the related field of epidemiology, we can see other data analysis techniques that are recognisable as restricted forms of subgroup discovery. For instance, a common task when mining epigenetics data is to search for Differentially Methylated Regions (DMRs), which are thought to have an effect on cancer and other diseases. Two examples of methods to find DMRs are Adjacent Site Clustering (A-Clustering) [136] and ‘Bumphunter’ [75], which we now briefly describe.
A-Clustering
A-Clustering is a method for the detection of co-regulated methylation regions, and re-
gions associated with an exposure. CpG sites within regions identified by A-Clustering are
modelled as multivariate responses to an environmental exposure, assuming that the ex-
posure affects all CpG sites in the region equally. A-Clustering first identifies methylation
regions based on correlation between methylation sites, independently of any exposure. It
then analyses these regions to identify those affected by an exposure. The clustering that
A-Clustering performs includes possible restrictions based on the distance between CpG
sites on the DNA. The algorithm can also have a pre-step called dbp-merge, which merges highly correlated sites that are physically close on a chromosome into sets.
In effect, this technique clusters instances (CpG sites) and then tests which of these clusters are associated with the phenotype or experimental condition of interest. As this method does not take the labels of instances into account when building clusters, it does not use all of the information that subgroup discovery algorithms use to build the groups.
Bumphunter
In contrast to A-Clustering, Bumphunter does use the labels of the instances. The red bar in the image (Figure 6.3 B) is the cut-off that defines the class of each CpG site. Viewing this in the context of our formal description of subgroup discovery, we can see this method as a set of restrictions on the data description language and the rule description language.
The ‘bump’, or DMR, in the image can be described by the subgroup rule:
CpGDifferentInCancer <- CpG-Location >= 42233400 ∧ CpG-Location < 42234400
Thus the rules are all built from one attribute (genomic location) and have a fixed length of two features, one using >= and the other <, with the latter threshold strictly larger than the former. Recognising that research activities such as searching for DMRs are equivalent to subgroup discovery allows us to formulate the problem in the formal manner of subgroup discovery and make use of the knowledge obtained in the machine learning literature. This allows us to expand and generalise these methods by adding further attributes to the description of CpG sites and by allowing rules to be more expressive (for example, to be longer and to be built from features over multiple attributes, i.e. not just genomic location).
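Viewed this way, a candidate DMR is simply a subgroup rule over the single genomic-location attribute. A minimal Python sketch (our own naming; the locations are the example values from the rule above):

```python
def dmr_rule(lower, upper):
    """A DMR-style subgroup rule: two features over one attribute,
    CpG-Location >= lower and CpG-Location < upper, with lower < upper."""
    assert lower < upper
    return lambda location: lower <= location < upper

rule = dmr_rule(42233400, 42234400)
print(rule(42233500), rule(42234400))  # True False
```

Generalising the method then amounts to lifting these restrictions: allowing more features per rule, and features built from attributes other than location.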
Figure 6.3: This figure is taken from Jaffe et al. [75]; it shows an example of a Differentially Methylated Region (DMR). Panel A shows the methylation measurements from a colon cancer dataset plotted against genomic location on chromosome 2. Eight normal (blue) and eight cancer (red) samples are shown for each location. The curves represent a smooth estimate of the population-level methylation for cancer (red) and normal (blue) samples. The green bar represents a region found to be a cancer DMR (an interesting subgroup). In
panel B the black curve is an estimate of the population-level difference between normal and
cancer. The curve is expected to vary due to measurement error and biological variation
but rarely exceed a certain threshold, for example those represented by the red horizontal
lines. A candidate DMR is a region where the black curve is outside these boundaries.
By observing other modern genomic technology applications and research articles, we can identify further datasets, such as those from 16S rRNA microbiome studies, which have similarities to gene expression and CpG methylation datasets. Hence, we can also formulate a useful data mining task of subgroup discovery on the microbiome.
Evaluations of subgroups
Table 6.10: A translation of the vocabulary used in subgroup discovery literature and DMR
literature that we have constructed.
Table 6.11: A translation of the vocabulary used in subgroup discovery literature and a
new task for characterising subgroups of microbes that we have identified.
Statistical tests can be used to check that discovered subgroups are not simply anomalies but represent true patterns in the data. Once subgroups have been found and checked for significance, they can be shared with the research communities and domain experts so that the rules can be judged on their utility. Subgroups can then be used for decision support, for example in deciding which microbes should be isolated for further study, or what the evolutionary history of a CpG site tells us about our vulnerability to cancer.
A list of constraints relating the rule's features to an example is built by features_example_constraints/3 on lines 31-33. This list is turned into a satisfiability problem on line 29 in the predicate features_example_covered/3, which defines when a rule covers an example. The predicate rulefeatures_examples_coverednumber/3 (defined on lines 23 to 25) relates a rule and a set of examples to the number of those examples covered by the rule. It does this by mapping features_example_covered/3 over each example in its given list and counting how many of those examples are covered, using a satisfiability card goal (line 25). rulefeatures_examples_coverednumber/3 is called twice in the main relation data_rulefeatures_value/3: once for the positive examples and once for the negative examples. Finally, on line 21, clp(b) labelling is applied to the list of features that represents the rule in order to find the best rule.
Code Block 1
1 :- use_module(library(clpb)).
2 :- use_module(library(clpfd)).
3
4 data(PosExamples,NegExamples):-
5 PosExamples=[ [0,1,0,1],
6 [0,1,0,1],
7 [1,0,0,1]],
8 NegExamples=[ [1,0,1,0],
9 [1,0,1,0],
10 [0,1,1,0]].
11
12 data_rulefeatures_value(data(Positives,Negatives),Features, Value):-
13 Positives =[Example|_Rest],
14 same_length(Features, Example),
15 length(Positives,NumberOfPositives),
16 [TP,FP] ins 0..NumberOfPositives,
17 Value #= TP-FP,
18 labeling([max(Value)], [TP,FP]),
19 rulefeatures_examples_coverednumber(Features,Positives, TP),
20 rulefeatures_examples_coverednumber(Features,Negatives, FP),
21 labeling(Features).
22
23 rulefeatures_examples_coverednumber(Features, Examples, Number):-
24 maplist(features_example_covered(Features), Examples, Numbers),
25 sat(card([Number], Numbers)).
26
27 features_example_covered(FeatureList,ExampleList,Covered):-
28 features_example_constraints(FeatureList,ExampleList,Structure),
29 sat(Covered =:= *(Structure)).
30
31 features_example_constraints([],[],[]).
32 features_example_constraints([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
33 features_example_constraints(T1,T2,Structure).
?- data(Ps,Ns),data_rulefeatures_value(data(Ps,Ns),Fs,V).
Ps = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
Ns = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
Fs = [0, 0, 0, 1],
V = 3 ;
This returns the best rule, represented as a list Fs of 0s and 1s. A 1 at a position in the list indicates that the corresponding feature is included in the identified rule. In this first example the rule found is that F4 = 1:
Pos ⇐ F4 = true.
In words: if feature F4 is true, then the example belongs to the positive class. In this case the rule covers all 3 positive examples and 0 negative examples, so it does not strictly represent a subgroup (as its coverage of the positive class is complete). On backtracking we find the next best rule according to our heuristic value; in this way every subgroup is found alongside its value.
?- data(P,N),data_rulefeatures_value(data(P,N),F,V).
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [0, 0, 0, 1],
V = 3 ;
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [0, 1, 0, 1],
V = 2 ;
P = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1]],
N = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0]],
F = [1, 0, 0, 1],
V = 1 ;
In order to adapt this algorithm to the multi-instance case, we make a number of changes. Code Block 2 shows a small MI dataset that we use for this example. Code Block 3 shows the adapted code, which makes use of library reif (line 3), described previously (section 5.7.1). On line 17 we use the if_/3 predicate to reify the relation #>= with the second argument set to one, defining the relation numberbiggerthan1_t/2. We use this predicate in conjunction with rulefeatures_milInstanceBag_coverednumber/3 to define the relation rulefeatures_milExamples_coverednumber/3, by mapping numberbiggerthan1_t/2 over the results of calls to rulefeatures_milInstanceBag_coverednumber/3. This is a direct translation of the MI learning coverage rule: an instance is covered by a rule if at least one individual in its bag is covered by the rule. Suitable adaptations are made to the other predicates, and their names are changed to reflect that they now describe relations between bags of individuals as instances, rather than instances as simple vectors.
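The MI coverage rule itself can be sketched directly. This Python fragment is our own illustration, not the Prolog of Code Block 3: a rule is a binary feature list as in the code blocks, and an instance is a bag of 0/1 individuals.

```python
def individual_covered(features, individual):
    """An individual is covered when every feature selected by the rule
    (a 1 in the feature list) is also 1 in the individual."""
    return all(ind >= f for f, ind in zip(features, individual))

def bag_covered(features, bag):
    """MI coverage: a bag (instance) is covered by a rule if at least
    one individual in the bag is covered."""
    return any(individual_covered(features, ind) for ind in bag)

rule = [1, 0, 1]
bag = [[1, 0, 0], [1, 0, 1], [0, 1, 0]]
print(bag_covered(rule, bag))  # True (the second individual matches)
```

The `any` here plays the role of the numberbiggerthan1_t/2 reification: one covered individual in the bag is enough.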
Code Block 2
1 mil_data(PosExamples,NegExamples):-
2 PosExamples =
3 [
4 [[1, 0, 0], [1, 0, 0], [1, 0, 1]],
5 [[1, 1, 0], [1, 1, 0], [0, 1, 1]],
6 [[1, 1, 1], [0, 1, 1], [0, 1, 1]],
7 [[0, 1, 0], [1, 1, 1], [0, 1, 1]],
8 [[1, 0, 1], [1, 0, 1], [0, 1, 0]],
9 [[0, 0, 0], [1, 0, 0], [1, 0, 1]],
10 [[0, 0, 1], [1, 1, 0], [0, 0, 0]],
11 [[1, 0, 1], [1, 0, 1], [1, 0, 1]],
12 [[0, 0, 0], [0, 0, 0], [1, 1, 1]],
13 [[1, 0, 1], [1, 1, 0], [1, 0, 1]]
14 ],
15
16 NegExamples =
17 [
18 [[0, 0, 0], [0, 0, 1], [0, 0, 0]] ,
19 [[0, 1, 0], [1, 1, 1], [0, 0, 1]] ,
20 [[0, 1, 0], [0, 1, 0], [1, 1, 1]] ,
21 [[1, 0, 1], [1, 1, 0], [1, 1, 1]] ,
22 [[1, 1, 1], [1, 1, 0], [1, 1, 1]] ,
23 [[1, 0, 0], [0, 1, 0], [0, 1, 1]] ,
24 [[1, 1, 0], [1, 1, 0], [0, 1, 0]] ,
25 [[0, 1, 1], [1, 0, 1], [1, 1, 0]] ,
26 [[0, 0, 1], [0, 1, 0], [0, 0, 0]] ,
27 [[0, 0, 0], [0, 1, 0], [0, 1, 0]]
28 ].
Code Block 3
1 :- use_module(library(clpb)).
2 :- use_module(library(clpfd)).
3 :- use_module(library(reif)).
4
5 mildata_rulefeatures_value(data(Positives,Negatives),Features, Value):-
6 Positives =[Example|_Rest],
7 same_length(Features, Example),
8 length(Positives,NumberOfPositives),
9 [TP,FP] ins 0..NumberOfPositives,
10 Value #= TP-FP,
11 labeling([max(Value)], [TP,FP]),
12 rulefeatures_milExamples_coverednumber(Features,Positives, TP),
13 rulefeatures_milExamples_coverednumber(Features,Negatives, FP),
14 labeling(Features).
15
16 numberbiggerthan1_t(Number,Value):-
17 if_(#>=(Number,1),Value=1,Value=0).
18
19 rulefeatures_milExamples_coverednumber(Features,Examples,ExamplesCoveredNumber):-
20 length(Examples,ExampleSize),
21 ExamplesCoveredNumber in 0 .. ExampleSize,
22 labeling([max(ExamplesCoveredNumber)],[ExamplesCoveredNumber]),
23 maplist(rulefeatures_milInstanceBag_coverednumber(Features),Examples,Numbers),
24 maplist(numberbiggerthan1_t,Numbers,Truths),
25 sat(card([ExamplesCoveredNumber],Truths)).
26
27 rulefeatures_milInstanceBag_coverednumber(Features, Bag, NumberInBagCovered):-
28 length(Bag,BagSize),
29 NumberInBagCovered in 0..BagSize,
30 labeling([min(NumberInBagCovered)],[NumberInBagCovered]),
31 maplist(features_milIndivdual_covered(Features), Bag, Numbers),
32 sat(card([NumberInBagCovered], Numbers)).
33
34 features_milIndivdual_covered(FeatureList,ExampleList,Covered):-
35 features_milIndividual_constraints(FeatureList,ExampleList,Structure),
36 sat(Covered =:= *(Structure)).
37
38 features_milIndividual_constraints([],[],[]).
39 features_milIndividual_constraints([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
40 features_milIndividual_constraints(T1,T2,Structure).
When we test this implementation we obtain a number of subgroups (we omit the bindings of the variables P and N for brevity):
?- mil_data(_P,_N),mildata_rulefeatures_value(data(_P,_N),F,V).
F = [1, 0, 1],
V = 3 ;
F = [1, 0, 0],
V = 3 ;
F = [0, 0, 1],
V = 2
...
Each of these rules describes a subgroup of our (artificial) data in which the class distribution differs from the class distribution of the entire dataset.
Running time
To understand the running time of the AVL implementation as the size of the dataset increases, we use the time/1 predicate. The data is two-dimensional, so we sum the number of examples and the number of features to derive a single problem dimension, shown as Size in Table 6.12.
Table 6.12: Number of inferences, CPU time in seconds and Logical inferences per second
for method 1 with increasing sizes of random datasets. Size is the number of features and
the number of instances of each class.
The running time results shown in Table 6.12 include the number of inferences and the number of logical inferences per second, as well as the CPU time. We conclude from this time analysis that, at present, the Binary Decision Diagram algorithm behind the clp(b) system (which attempts to solve the satisfiability problem) is too slow to work on our complete (real rather than test) datasets. However, our implementation may still be useful in the future when improved satisfiability libraries are available; if a future Prolog version incorporates these improvements, this implementation will likely run without any changes (SWI-Prolog is known not to contain state-of-the-art implementations of these libraries).
Code Block 4
1 rule_instances_weight(Features, WExamples, Number):-
2 maplist(x_y_xypair,Weights,Examples,WExamples),
3 maplist(rule_instance_t(Features), Examples, Numbers),
4 maplist(n_n2_product,Weights,Numbers,Products),
5 list_sum(Products,Number).
6
7 rule_instance_t(FeatureList,ExampleList,Covered):-
8 build_structure(FeatureList,ExampleList,Structure),
9 sat(Covered =:= *(Structure)).
10
11 rule_instance_truth(Rule,Instance,Truth):-
12 rule_instance_t(Rule,Instance,T),
13 bool01_t(T,Truth).
14
15 build_structure([],[],[]).
16 build_structure([H1|T1],[H2|T2],[(H1=:=(H1*H2))|Structure]):-
17 build_structure(T1,T2,Structure).
18
19 specialise(Input,Output):-
20 select(0,Input,1,Output).
21
22 ps_ns_rule_sumP_sumN_value(Ps,Ns,Rule,Pws,Nws,Value):-
23 rule_instances_weight(Rule,Ps,Pws),
24 rule_instances_weight(Rule,Ns,Nws),
25 Value #=Pws-Nws.
26
27 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Start,TP,FP,[r(Special,Value,cm(TP2,FP2))|Result]):-
28 FP#\=0, %Stop when 0 covered negatives.
29 TP #>0,
30 Value #>0,
31 findall(Special,specialise(Start,Special),Specials),
32 maplist(ps_ns_rule_sumP_sumN_value(Ps,Ns),Specials,TP_X,FP_X,Values),
33 maplist(w_x_y_z_wxyz,Values,TP_X,FP_X,Specials,Weight_RuleStructures),
34 keysort(Weight_RuleStructures,Sorted),
35 reverse(Sorted,RSorted),
36 RSorted =[Value-s(TP2,FP2,Special)|_Rest],
37 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Special,TP2,FP2,Result).
38 ps_ns_rule_tv_fv_rulepath(_,_,_,_,_,[]).
39
40 ps_ns_rulepath(Ps,Ns,Result):-
41 Ps=[_W-Example|_],
42 length(Example,Size),
43 length(Start,Size),
44 Start ins 0..0,
45 ps_ns_rule_tv_fv_rulepath(Ps,Ns,Start,_TP,_FP,Result).
46
47 x_y_xypair(X,Y,X-Y).
48
49 w_x_y_z_wxyz(W,X,Y,Z,W-s(X,Y,Z)).
50
51 n_n2_sum(X,Y,Z):-
52 Z#=X+Y.
53
54 n_n2_product(X,Y,Z):-
55 Z#=X*Y.
56
57 list_sum(List,Sum):-
58 foldl(n_n2_sum,List,0,Sum).
Code Block 5
1 i_wi(I,100-I).
2
3 t_weighted_reduced(P_2,FatList,DietList):- t_weighted_reduced_(FatList,DietList,P_2).
4
5 t_weighted_reduced_([],[],_).
6 t_weighted_reduced_([W-X|Xs0],Ts,P_2):-
7 if_(call(P_2,X),(w0_w1(W,W2),Ts= [W2-X|Ts0]),(Ts=[W-X|Ts0])),
8 t_weighted_reduced_(Xs0,Ts0,P_2).
9
10 w0_w1(W,W1):-
11 W1 #= W div 2.
12
13 empty_t([],true).
14 empty_t([_|_],false).
15
16 data_rules(data([],_Neg),[]).
17 data_rules(data(Pos,Neg),[Rule|Rules]):-
18 once(ps_ns_rulepath(Pos,Neg,RulePath)),
19 if_(empty_t(RulePath),Rules=[],(
20 reverse(RulePath,[r(Rule,_V,_CM)|_]),
21 t_weighted_reduced(rule_instance_truth(Rule),Pos,PosReduced),
22 t_weighted_reduced(rule_instance_truth(Rule),Neg,NegReduced),
23 data_rules(data(PosReduced,NegReduced),Rules)
24 )).
Running time
Table 6.13: Number of Inferences, CPU time in seconds and Logical inferences per second
for Method 2 with increasing sizes of random datasets. Size is the number of features and
the number of instances of each class.
Table 6.13 shows that Method 2 can cope with much larger dataset sizes than Method 1, but it is still not powerful enough for our datasets. This algorithm has the same time complexity as CN2 and CN2-SD [21, 90]. An additional problem with these first two implementations (Methods 1 and 2) is that representing instances as lists of features does not scale to our dataset sizes: our hardware cannot learn with rules represented as lists of length 300k over 500k instances.
When we refer to ‘chromosomes’ in the following, we are using the word in the candidate-rule sense, and similarly when we refer to parent, mother, father and child: these are all candidate rules and have nothing to do with our data instances. Finally, the ‘population’ is the current set of candidate rules, not the population of microbes in the microbiome study.
In contrast to the previous methods, where (for the purposes of the search) a rule was represented as a list of binary values, a 1 indicating that the corresponding feature was included in the rule, in this method we represent a rule as a list of indices into the set of features. For example, a rule could now be [f1,f500,f2000,f8400], which represents the rule:
IF f1=true AND f500=true AND f2000=true AND f8400=true THEN pos class
In this way the rule size can be fixed, which allows us to manually set a preference for shorter or longer rules. It also makes it possible to search for very long rules, which would not be explored by the previous two methods.
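The index representation can be sketched as follows. This is our own Python illustration (names ours): an example's description is taken to be the set of feature indices that are true for it, and a rule covers the example when every indexed feature is among them.

```python
def rule_covers(rule_indices, true_indices):
    """A rule given as a fixed-length list of feature indices covers an
    example when every indexed feature is true for that example."""
    return set(rule_indices) <= set(true_indices)

rule = ["f1", "f500", "f2000", "f8400"]  # fixed rule length of 4
example = {"f1", "f3", "f500", "f2000", "f8400", "f9000"}
print(rule_covers(rule, example))  # True
```

Note that a rule of length 4 is a list of 4 small indices regardless of the total number of features, whereas the binary representation above would need a 300k-element list per rule.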
We now discuss the main predicate definitions in this implementation. Firstly, Code Block 6 shows how crossover is implemented [158]. The primary predicate is mum_dad_c1_c2/4, which relates a pair of ‘mother’ and ‘father’ chromosomes to two ‘child’ chromosomes. This predicate definition makes use of the association list data structure, which has better look-up performance than traversing a list with member/2. On line 7 the list of index positions is split, for the mother and father chromosomes, by the number of crossover points (set at 3 and stored as a Prolog fact in the program; for searching for very long rules this could be increased). The split chromosomes of the ‘mother’ and ‘father’ are then interweaved to form the new ‘child’ chromosomes using the predicate s_L1_L2_LA_LB/4, as illustrated by the following query:
s_L1_L2_LA_LB([[0,0,0,0,0],[0,0],[0,0]],[[1,1,1,1,1],[1,1],[1,1]],L1,L2).
L1 = [[1, 1, 1, 1, 1], [0, 0], [1, 1]],
L2 = [[0, 0, 0, 0, 0], [1, 1], [0, 0]] ;
false.
These ‘child’ chromosomes are created as split lists, which are concatenated into a
single list using a DCG (lines 13-14). This is the primary method of generating new rules
in the search.
Code Block 6
1 mum_dad_c1_c2(M,D,C1,C2):-
2 length(M,LM),
3 numlist(1,LM,Index),
4 maplist(x_y_xypair,Index,M,MPair),
5 maplist(x_y_xypair,Index,D,DPair),
6 splitlocs_length(SplitLocs,LM),
7 list_splitlocs_splits(Index,SplitLocs,SplitIndexes),
8 list_to_assoc(MPair,MAssoc),
9 list_to_assoc(DPair,DAssoc),
10 maplist(assoc_keys_values(MAssoc),SplitIndexes,MumSplit),
11 maplist(assoc_keys_values(DAssoc),SplitIndexes,DadSplit),
12 s_L1_L2_LA_LB(MumSplit,DadSplit,C1Split,C2Split),
13 phrase(concatenation(C1Split), C1),
14 phrase(concatenation(C2Split),C2),!.
15
16 s_L1_L2_LA_LB(L1,L2,LA,LB):-
17 interweave_L1_L2_LA_LB(odd,L1,L2,LA,LB),!.
18
19 interweave_L1_L2_LA_LB(_,[],[],[],[]).
20 interweave_L1_L2_LA_LB(odd,[L1_H|L1_T],[L2_H|L2_T],[L2_H|LA_T],[L1_H|LB_T]):-
21 interweave_L1_L2_LA_LB(even,L1_T,L2_T,LA_T,LB_T).
22 interweave_L1_L2_LA_LB(even,[L1_H|L1_T],[L2_H|L2_T],[L1_H|LA_T],[L2_H|LB_T]):-
23 interweave_L1_L2_LA_LB(odd,L1_T,L2_T,LA_T,LB_T).
24
25 x_y_xypair(X,Y,X-Y).
26
27 my_get_assoc(A,K,V):-
28 get_assoc(K,A,V).
29 assoc_keys_values(A,K,V):-
30 maplist(my_get_assoc(A),K,V).
31
32 list_splitlocs_splits(L,SplitLocs,S):-
33 list_splitlocs_splits(L,SplitLocs,S,1),!.
34
35 list_splitlocs_splits(L,[],[L],_).
36 list_splitlocs_splits(L,SplitLocs,[[H|Split1]|RestSplits],Count):-
37 L=[H|T],
38 SplitLocs =[First|Rest],
39 H #< First,
40 Count2 #= Count+1,
41 list_splitlocs_splits(T,[First|Rest],[Split1|RestSplits],Count2).
42 list_splitlocs_splits(L,SplitLocs,[[H]|RestSplits],Count):-
43 L=[H|T],
44 SplitLocs =[First|Rest],
45 H = First,
46 Count2 #= Count+1,
47 list_splitlocs_splits(T,Rest,RestSplits,Count2).
Code Block 8 shows the hill-climbing mutation operation [134]. This mutation operation
performs a local hill-climbing search to find whether there is a chromosome nearby (in
feature space) to the current candidate chromosome that has a better fitness evaluation.
The search takes a random allele in the chromosome and tries to find local improvements,
constraining the search using clp(fd) to ensure that we do not test an invalid feature
(feature Id minus 1, for example, would not make sense). This assumes that the feature
order has some semantic meaning. For example, the features for CpG sites which relate
to a measure of how closely a site is preserved have an ordering. A list of ordered features
(made from this real-valued attribute) could be:
f(1,attribute_1,>=,11), f(2,attribute_1,>=,12), f(3,attribute_1,>=,13).
This shows that there is a relationship (in this case +10) between the feature Id and
what the feature is testing for. When the features refer to attributes that do not have an
ordering, this process is more of a random search than a hill-climbing search.
Code Block 8
1 chromosome_hillclimbmutated(Range,C1,C2a,C_Fitnesses):-
2 list_listindexed(C1,C1Index),
3 random_member(Feature-Index,C1Index),
4 Value #= Feature,
5 numberoffeatures(NumberOfFeatures),
6 findall(NewValue,(
7 Max #=NumberOfFeatures,
8 Min #=1,
9 Top #=Value+Range,
10 Bottom #= Value-Range,
11 NewValue in Min..Max,
12 NewValue #=<Top,
13 NewValue #\=Value,
14 NewValue #>=Bottom,
15 label([NewValue])
16 ),
17 NewValues),
18 maplist(old_index_value_new(C1,Index),NewValues,NewCs),
19 maplist(heuristic_chromosome_fitness(rate_diff),NewCs,ChildFitness),
20 maplist(x_y_xypair,ChildFitness,NewCs,C_Fitnesses),
21 pop_sortedpop(C_Fitnesses,[C2|Rest]),
22 C2 = _ValueOfC2-C2a.
This is demonstrated by the following query, where we see that the 4th allele has been
randomly selected to be ‘hill climbed’ (here the fitness of a chromosome is simply the sum
of its alleles, and screen output has been added):
?-chromosome_hillclimbmutated(4,[1,10,5,11],C2,F).
Mutated Sorted:
31-[1, 10, 5, 15].
30-[1, 10, 5, 14].
29-[1, 10, 5, 13].
28-[1, 10, 5, 12].
26-[1, 10, 5, 10].
25-[1, 10, 5, 9].
24-[1, 10, 5, 8].
23-[1, 10, 5, 7].
C2 = [1, 10, 5, 15],
F = [23-[1, 10, 5, 7], 24-[1, 10, 5, 8], 25-[1, 10, 5, 9],
26-[1, 10, 5, 10], 28-[1, 10, 5|...], 29-[1, 10|...],
30-[1|...], 31-[...|...]].
In order to see which instances are covered by a feature, different partitioning predicates
are implemented (Code Block 9 and Code Block 10). They have a similar implementation
in both data mining tasks (with suitable adaptations for the multi-instance case shown in
Code Block 10), and in both tasks there is a version for the positive instances and one for
the negative instances, in order to account for missing values following our pessimistic value
strategy from [53]. For instance, in the CpG mining task the cpgpartition_pos_ts_fs_feature/4
predicate is used. This predicate makes use of the meta-predicate if_/3 (in a nested
call) from library reif, which we described in the section on meta predicates in Chapter
5. This enables a fast, deterministic implementation that does not run out of memory
when applying a feature to the approximately 500k instances in the CpG mining task.
Nevertheless, this is the most computationally intensive part of the algorithm, as the number
of meta-calls is proportional to the number of instances in the CpG methylation task, and
in the worst case to the combined number of individuals in all the bag instances in the
microbiome case (library reif actually uses a technique called ‘goal expansion’ that reduces
the number of these meta-calls).
Code Block 9
1 cpgpartition_pos_ts_fs_feature([],[],[],_).
2 cpgpartition_pos_ts_fs_feature([X|Xs0],Ts,Fs,feature(At,_,Op,FValue)):-
3 cpg_ats_i(X,AtList),
4 atom_concat(#,Op,Op2),
5 maplist(atterm_atname,AtList,Ats),
6 if_(memberd_t(At,Ats),
7 (
8 memberd(attribute(At,AtValue3),AtList),
9 if_(call(Op2,AtValue3,FValue), (Ts=[X|Ts0],Fs=Fs0),
10 ( Ts =Ts0,Fs=[X|Fs0]))
11 )
12 ,(Fs=[X|Fs0],Ts=Ts0)),
13 cpgpartition_pos_ts_fs_feature(Xs0,Ts0,Fs0,feature(At,_,Op,FValue)).
Code Block 10
1 otupartition_pos_ts_fs_feature([],[],[],_).
2 otupartition_pos_ts_fs_feature([X|Xs0],Ts,Fs,Feature):-
3 Feature = f(At,Op,FValue),
4 instance(X,XBag,_),
5 length(XBag,BagSize),
6 if_(bag_feature_t(BagSize,XBag,f(At,Op,FValue)),
7 ( Ts=[X|Ts0],Fs=Fs0),
8 ( Fs=[X|Fs0],Ts=Ts0)),
9 otupartition_pos_ts_fs_feature(Xs0,Ts0,Fs0,f(At,Op,FValue)).
10
11 individual_atvalues(I,AtValues):-
12 findall(A-V,img_data(I,A,V),AtValues).
13
14 individualcovered_feature_t(I,f(At,_Op,V),T):-
15 memo(individual_atvalues(I,AtValueList)),
16 if_(memberd_t(At-V,AtValueList),T=true,T=false).
17
18 bag_feature_t(_Size,[],_,false).
19 bag_feature_t(Size,[OneFromBag|RestOfBag],F,Truth):-
20 length(RestOfBag,L),
21 if_(individualcovered_feature_t(OneFromBag,F),Truth=true,bag_feature_t(Size,RestOfBag,F,Truth)).
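The deterministic partitioning style used in Code Blocks 9 and 10 can be illustrated with a minimal, self-contained sketch using if_/3 from library reif (the predicate name here is illustrative; the thesis predicates additionally handle attribute look-up and missing values):

```prolog
:- use_module(library(reif)).

% items_value_ts_fs(+Items, +V, -Ts, -Fs): partition Items into those
% equal to V (Ts) and the rest (Fs), deterministically and without
% leaving choice points, using the reified (=)/3 from library(reif).
items_value_ts_fs([], _, [], []).
items_value_ts_fs([X|Xs], V, Ts, Fs) :-
    if_(X = V,
        ( Ts = [X|Ts0], Fs = Fs0 ),
        ( Ts = Ts0, Fs = [X|Fs0] )),
    items_value_ts_fs(Xs, V, Ts0, Fs0).
```

For example, ?- items_value_ts_fs([a,b,a,c], a, Ts, Fs). gives Ts = [a,a], Fs = [b,c] with no pending choice points, which is what keeps memory bounded over large instance lists.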
Code Block 11 shows how the rate-difference heuristic is implemented, taking into
account the number of instances covered by the chromosome/rule.
Code Block 11
1 heuristic_chromosome_fitness(rate_diff,C,Fitness):-
2 posExamples(PosExamples),
3 negExamples(NegExamples),
4 maplist(f,C,Features),
5 length(PosExamples,PositiveLength),
6 length(NegExamples,NegativeLength),
7 examples_features_filtered(pos,PosExamples,Features,TruePosExamples,1),
8 examples_features_filtered(neg,NegExamples,Features,FalsePosExamples,1),
9 length(TruePosExamples,TruePositives),
10 length(FalsePosExamples,FalsePositives),
11 Fitness is (TruePositives/PositiveLength)-(FalsePositives/NegativeLength).
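The rate-difference heuristic is simply TP/|Pos| − FP/|Neg|. As a standalone sketch (a hypothetical helper, not the thesis predicate):

```prolog
% ratediff(+TP, +PosTotal, +FP, +NegTotal, -Fitness): the rate-difference
% score of a rule covering TP of PosTotal positives and FP of NegTotal
% negatives.
ratediff(TP, PosTotal, FP, NegTotal, Fitness) :-
    Fitness is TP/PosTotal - FP/NegTotal.
```

For instance, a rule covering 150 of 300 positives and 20 of 600 negatives scores 150/300 − 20/600 ≈ 0.467, whereas a rule covering every instance of both classes scores 1 − 1 = 0.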
In order to demonstrate the genetic algorithm, we provide the following example query,
which simply optimises the sum of the chromosome’s numbers. The query output below
shows how the performance improves after each generation (note that the generations
count down rather than up):
?-gensize_popsize_genomelength_chromosome_type(4,4,5,X,number(1-1000)).
Init pop:
2076-[154, 212, 279, 645, 786].
2196-[637, 223, 405, 316, 615].
2865-[467, 803, 640, 892, 63].
2307-[596, 701, 87, 519, 404].
Sorted pop:
2962-[596, 803, 640, 519, 404].
2865-[467, 803, 640, 892, 63].
Running time
This core algorithm is very fast compared to the other two implementations. However, it
takes time to check how many instances are covered by a rule, as this is not done in advance
(as it is in the other two algorithms). This is done to save space, since it is not possible
to load all the coverage values of all candidate rules into the search algorithm.
In order to see how this algorithm performs, we test with increasing sub-samples of the
CpG data. We fix the algorithm at 3 generations, 3 features and an initial population size
of 3, and then vary the size of the dataset by taking sub-samples of the CpG dataset.
Table 6.14 shows the result: the algorithm scales linearly in terms of the dataset size.
Table 6.14: CPU time for genetic algorithm for increasing sizes of CpG dataset samples
We use the CpG dataset from Gene Expression Omnibus (GEO) dataset GSE60185 [48],
consisting of 285 array samples. Of these, 47 are taken from normal breast tissue and
238 from breast tissue afflicted with cancer. The methylation array (Illumina Infinium
HumanMethylation450 microarray) recorded methylation levels across the whole genome
– 468,424 CpG sites. We used the normalised version of the dataset, in which the data has
been processed for probe filtering, colour bias correction, background subtraction and subset
quantile normalisation [141].
As we described earlier in this chapter, the objects of study here are the CpG sites,
not the participants. Therefore, in order to assign a class label to each CpG site we first
take the mean difference of the methylation levels for people with breast cancer versus
those with normal samples. Then, taking this vector of mean differences, we apply optimal
2-means clustering [153] to assign each CpG site to one of two categories – differentially
methylated in breast cancer and non-differentially methylated in breast cancer. We refer
to these as the positive and negative classes, respectively.
This results in:
98,321 negative CpGs (not differentially methylated in the cancer array samples).
We care equally about false positive and false negative identification of differentially
methylated CpG sites, because when learning rules based on this classification both types
of error will affect the quality of the rules produced. The clustering method described
above suitably trades off the false positive and false negative rates because it finds the
optimal threshold for dividing our CpG sites into two categories. This assumes that the
data can be split into two categories based on there being two methylation profiles in cancer
(i.e. if there are no real clusters, the algorithm will still return some division of the data
into two arbitrary groups). This process can be thought of as setting the red line in
Figure 6.3.
Our features are constructed by the algorithm in Figure 6.2, derived from the attributes
used in a previous study [132]. These attributes consist of a number of genomic sequence
annotations, including conservation data throughout the genome in both protein-coding
and non-protein-coding regions. Other annotations are taken from the ENCODE Project
[25], including transcribed non-coding RNAs, transcription factor binding sites and
chromatin structure. In total there are 10 attribute groups – see Table 6.15 (Hash et
al. call these feature groups, but we follow the convention in the rule learning literature
of reserving the word feature to mean a binary test [53]). These attribute groups contain a
number of attributes, which correspond to the individual data files that hold these sequence
annotations.
We derived 333,719 features from these 10 attribute groups, where each feature consists
of an attribute, an attribute value and an operator. For example, the following feature
consists of the attribute “46-Way Sequence Conservation”, attribute value 0.2 and operator
>=.
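Written as a Prolog feature term in the style used earlier in this chapter (the feature Id shown is illustrative):

```prolog
f(Id, '46-Way Sequence Conservation', >=, 0.2)
```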
Id Name Description
A 46-Way Sequence Conservation: Based on multiple sequence alignment scores, at the nucleotide level, of 46 vertebrate genomes compared with the human genome.
B Histone Modifications (ChIP-Seq): Based on ChIP-Seq peak calls for histone modifications.
C Transcription Factor Binding Sites (TFBS PeakSeq): Based on PeakSeq peak calls for various transcription factors.
D Open Chromatin (DNase-Seq): Based on DNase-Seq peak calls.
E 100-Way Sequence Conservation: Based on multiple sequence alignment scores, at the nucleotide level, of 100 vertebrate genomes compared with the human genome.
F GC Content: Based on a single measure for GC content calculated using a span of five nucleotide bases from the UCSC Genome Browser.
G Open Chromatin (FAIRE): Based on formaldehyde-assisted isolation of regulatory elements (FAIRE) peak calls.
H Transcription Factor Binding Sites (TFBS SPP): Based on SPP peak calls for various transcription factors.
I Genome Segmentation: Based on genome-segmentation states using a consensus merge of segmentations produced by the ChromHMM and Segway software.
J Footprints: Based on annotations describing DNA footprints across cell types from ENCODE.
Table 6.15: Attributes used in the CpG data mining task. For more details on these see [132].
The large number of instances results in a large number of features, but importantly
we still have more objects (CpG sites) than features, which means we do not have a
“large p, small n” problem [42]. Also, these features are binary tests rather than real-valued
numbers, so the hypothesis space is smaller than if the real-valued attributes were used
directly.
We ran the genetic algorithm for 20 generations with an initial population of 100 and
a chromosome size set to 6.
Results
The genetic algorithm took 5 hours to complete. The rules found incorporate features
built from ChIP-Seq and DNase-Seq files, as well as the 100-Way sequence conservation
(attribute groups B, D and E). There are two different types of files for attribute group
B: gapped peaks and broad peaks. Broad peaks are peaks where histone modifications
span wider ranges of genomic region; in contrast, gapped peaks are interpreted data where
the regions may have been spliced or incorporate gaps in the genomic sequence [149]. The
final attribute used in the rules is a threshold on the 100-Way multiple alignment sequence
conservation (attribute group E).
The top 3 identified rules (and the number they cover of each class) are as follows:
IF
gappedPeak_E108-H3K9ac.gappedPeak.gz < 13613081900 = TRUE
AND gappedPeak_E097-H3K9me3.gappedPeak.gz < 13524832050 = TRUE
AND broadPeak_E094-H3K4me1.broadPeak.gz >= 216 = TRUE
AND broadPeak_E069-H3K27ac.broadPeak.gz < 52 ≠ TRUE
AND wgEncodeCshlLongRnaSeq_wgEncodeCshlLongRnaSeqBjCellPapContigsbedRnaElements.gz < 624876 = TRUE
AND broadPeak_E121-H3K9ac.broadPeak.gz < 808 = TRUE
THEN PosClass.
Covering 150 positives instances, and 20 negative instances.
IF
gappedPeak_E019-H3K9ac.gappedPeak.gz < 3041216650 = TRUE
AND broadPeak_E065-H3K27me3.broadPeak.gz >= 159 = TRUE
THEN PosClass.
Covering 211 positives instances, and 44 negative instances.
IF
100-Way_PHYLOP < 4 = TRUE
AND broadPeak_E065-H3K27me3.broadPeak.gz >= 159 = TRUE
AND gappedPeak_E009-H3K4me1.gappedPeak.gz >= 13684353550 = TRUE
AND broadPeak_E003-H3K4me1.broadPeak.gz < 300 = TRUE
AND gappedPeak_E003-H2A.Z.gappedPeak.gz < 9185280200 = TRUE
AND 100-Way_PHYLOP > -1003 = TRUE
THEN PosClass.
Covering 381 positives instances, and 120 negative instances.
Interestingly, this last rule has two features built from the same attribute, 100-Way_PHYLOP,
which means that the subgroup contains examples that lie within a range of sequence
conservation. This perhaps indicates that the covered CpG sites lie in parts of the DNA
that have diverged from a common ancestor for the same amount of time but are now
distinct regions (speculation).
For this application we use a dataset from Muirhead [107] combined with metadata
from the IMG/M database. In this dataset there are 255 samples taken from patients with
lesional psoriasis and 257 samples taken from people without lesional psoriasis. For each
sample we have the abundance levels of different microbes obtained using 16S rRNA
sequencing, processed by assigning reads to operational taxonomic units (OTUs). For
further details of how this is done, see [107].
As in section 6.3.1, the object of study here is not the participants, but the individual
OTUs. The OTUs are the instances for our data analysis. In order to assign class labels to
each OTU, we aggregate the samples from each person in the two different groups (lesional
and non-lesional). If an OTU is present only in lesional samples then it is assigned the
positive class; otherwise it is in the negative class (so this class contains OTUs that are
found only in non-lesional samples as well as OTUs found in both lesional and non-lesional
samples). This results in 891 positive OTUs and 2,185 negative OTUs (3,076 total
instances).
The process from Muirhead [107] assigned each OTU the most specific taxonomic unit
possible. This means that some OTUs can be associated with a species of bacteria, whereas
others can only be determined at the genus (or higher) taxonomic level. As attributes for
the microbes we used IMG/M metadata. The IMG/M metadata on bacterial genomes is
given at a more specific level of the taxonomic tree than the data we have from Muirhead
[107]: the attributes in the IMG/M data are associated with at least the species level, and
sometimes with individual strains of a species. This means that our data is noisy – we do
not know exactly which set of attributes applies to which microbe from the Muirhead
study [107]. To account for this we use the multi-instance version of our genetic subgroup
discovery algorithm. We create bags at the lowest taxonomic level available for each
instance. For example, if the lowest level is species, then we assign all the strains of that
species in the IMG/M data into a bag for that instance.
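This bag construction can be sketched as follows (the fact names otu_lowesttaxon/2 and imgm_taxon_strain/2 are illustrative stand-ins for the actual data predicates):

```prolog
% otu_bag(+OTU, -Bag): collect into a bag every IMG/M strain that
% falls under the lowest taxonomic level assigned to the OTU.
otu_bag(OTU, Bag) :-
    otu_lowesttaxon(OTU, Taxon),
    findall(Strain, imgm_taxon_strain(Taxon, Strain), Bag).
```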
To give a concrete example, OTU 5118 (a positive class instance) has the following
taxonomic classification: phylum=‘Proteobacteria’, class=‘Alphaproteobacteria’,
order=‘Rhizobiales’, family=‘Phyllobacteriaceae’, genus=‘Phyllobacterium’. We do not know
the exact species, and in our IMG/M data we have two species in this taxonomic group:
‘Phyllobacterium sp. YR531’ and ‘Phyllobacterium sp. UNC302MFCol5.2’, so the attributes
from these two species are in the instance bag for OTU 5118. This is illustrated
in Table 6.16. In total, we use 25 discrete attributes. Our features are constructed by the
algorithm in Figure 6.2, resulting in a total of 200 features.
OTU-ID A1 A2 A3 C
OTU1 a c 1.1 Pos
b z 0.2
q c 1.1
OTU2 q c 0.2 Neg
r c 0.1
v z 1.0
OTU3 z v 0.2 Pos
e c 0.1
a q 0.3
... ... ... ... ...
OTUdn ... ... ... cn
Table 6.16: A multi-instance data table, each OTU has a bag of species attributes
We ran the genetic algorithm for 20 generations with an initial population of 50 and a
chromosome size set to 3.
Results
The genetic algorithm took 4 hours to complete, and the top 5 identified rules (and the
number they cover of each class) are as follows:
IF Ecosystem = ‘Ponds’
AND Phenotype = ‘Unknown’
AND Metabolism = ‘Chlorophenol degrading’
THEN OTU is present in lesional psoriasis
Each of these rules describes a subgroup of bacteria whose class distribution differs from
the overall class distribution. The rules are interpretable, and can be used by a biologist
to generate hypotheses for further exploration. For example, the last subgroup found
describes specific bacteria that are said to live in ‘Aquatic Soil’ and on a ‘Biofilm’; a
biologist could attempt to culture these microbes and study under what conditions they
grow and how they interact with different environments. Further questions concern the
ways in which these bacteria react to different classes of antibiotics. Another area of study
could be prompted by the second-to-last subgroup, where the metabolism and phenotype
of the microbiota have been characterised. An expert biologist may be able to postulate
different host/microbiota interactions based on these specific identified concepts. These
could then be tested in the laboratory.
Structured data
Chapter 7
Reactome Pengine
The work presented in this chapter is based on the publication: Neaves, Samuel R., Sophia
Tsoka, and Louise A. C. Millard. “Reactome Pengine: A web-logic API to the homo sapiens
reactome.” Bioinformatics 1 (2018): 3.
7.2 Introduction
Reactome [30] is a web service that includes a database of the molecular details of cellular
processes, and is one of the leading tools for bioinformaticians working with biological
pathways. Currently, users access data in Reactome via HTML (the website), a
REST API, a SPARQL API, or by downloading the complete dataset for local processing.
The APIs provide a convenient way to access the data but restrict this access to a set
of predefined API calls (although, for example, a user can choose to download only the
subset of data they request), whereas downloading the complete dataset means that the
data can be processed exactly as required. This chapter presents Reactome Pengine, a
tool that allows the flexibility of the latter with the convenience of the former. This makes
queries more efficient, saving both bandwidth and storage space, and is achieved by using
the logic programming language Prolog. Logic programming is a programming paradigm
in which knowledge is represented in a restricted form of first-order logic, as a set of facts
and rules called a knowledge base [22]; see Chapter 5 for more details. The knowledge
base is interrogated with queries, which are powerful due to inbuilt procedures that use
the facts and rules together to infer solutions. As we argue throughout this thesis, logic
programming has much potential in bioinformatics [108, 4, 6], and it can be used with
Reactome to build predictive models, as we demonstrate in Chapter 8.
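As a minimal illustration of this style of representation (the facts and rule here are toy examples, not Reactome data):

```prolog
% Facts: which proteins take part in which reactions, and which
% reactions belong to which pathways.
protein_in_reaction(p53, r1).
reaction_in_pathway(r1, apoptosis).

% Rule: a protein is in a pathway if it takes part in some reaction
% of that pathway.
protein_in_pathway(P, Pathway) :-
    protein_in_reaction(P, R),
    reaction_in_pathway(R, Pathway).
```

The query ?- protein_in_pathway(p53, Pw). then infers Pw = apoptosis from the facts and the rule together.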
7.3 Implementation
Recently, a library for building web servers using SWI-Prolog [160], called Pengines [88],
has been developed. Pengines allows data providers to make their Prolog knowledge base
available to users via a web service (a web-logic API), accessed as if it were on the user’s
machine. In addition, users can send programs to the pengine to manipulate the data as
they wish. This is very different from the traditional way of accessing data, where a user
is either constrained to the set of queries defined in a (non-logical) web API or has to
download a dataset in bulk. Pengine services support federated queries, similar to SPARQL,
but with Turing-complete programs executing on remote services rather than SQL-like
queries. We describe this in more detail in later sections.
The Reactome Pengine tool presented here uses the Pengine library to make a Prolog
knowledge base, built on Reactome data, available to researchers on the internet. The
mainstay of the knowledge base is a set of facts retrieved from the Reactome HomoSapiens.owl
RDF file, which contains circa 1.35 million RDF triples. In addition, we provide an
intuitive set of data access predicates (which are similar to functions in other programming
paradigms) that sit on top of the RDF data. These define relations between Reactome
entities and provide access to the data at a higher level of abstraction. Users
can query the RDF directly, use this abstraction layer (or both), or make their own
abstraction of the data. Our abstraction layer includes predicates that represent reactions
as nodes of a graph, with edges between the nodes as described in the following two cases.
First, an edge exists when an output of a reaction is an input of another reaction; this edge
type we name precedes. Second, an edge exists when an output of a reaction r1 is a control
of another reaction r2, and the particular edge type depends on how the output of r1
controls r2 (e.g. activation or inhibition, or subtypes of these). An example predicate that
relates two reactions via a linking entity is ridReaction_ridLink_type_ridReaction/4.
We also provide predicates with indexed (and therefore fast) access to a set of queries
that we expect to be useful for researchers but that are computationally intensive (and
hence slow without indexing). For example, ridPathway_reactions/2 relates pathways to
the complete list of biochemical reaction IDs.
The Pengine library has inbuilt mechanisms to ensure the integrity of the server on
which it is hosted. Security is ensured by allowing only safe predicates to be run on the
Pengine server: upon receiving a query, the service first checks that the query is safe and
returns an error if this is not the case. For example, sending a program that calls shell/1
would result in an error (because a user could, for example, send a shutdown command to
the server). The Pengine library also contains a number of methods to manage resource
allocation on the server, including restricting request execution time and the maximum
number of requests that can be executed simultaneously; for more details see [88]. Finally,
the service runs inside a Docker container, which isolates the service from the underlying
machine and facilitates scaling and load balancing to meet demand. Queries to Reactome
Pengine are logged such that over time we can augment the inbuilt predicates with the
popular queries and programs, and also build new indexes to improve performance and
functionality as the service is used. This also means we can explore the possibility of
applying machine learning on the collected programs to automatically learn predicates
that are useful for users. Documentation describing the logical API is available at:
https://apps.nms.kcl.ac.uk/reactome-pengine/.
To use SWISH to interact with Reactome Pengine (shown as grey/solid arrows in
Figure 7.1) the user writes a program (program A in Figure 7.1) for SWISH, that will
itself contain a query or program (program B in Figure 7.1) to be processed on Reactome
Pengine. When SWISH executes program A, the constituent program B is forwarded to
Reactome Pengine. Upon receiving program B Reactome Pengine executes it and sends
the results back to SWISH. SWISH continues program A and then displays the results in
the notebook. Alternatively, when interacting with Reactome Pengine directly (yellow/dashed
arrows in Figure 7.1), Reactome
Pengine executes the program and returns data back to the local calling program to finish
its execution.
Figure 7.1: Diagram of possible user interactions with the Reactome Pengine server.
Thick arrows: data sent to Reactome Pengine; thin arrows: data returned from Reac-
tome Pengine. Yellow, dashed: Direct interaction with Reactome Pengine; grey, solid:
Interaction with Reactome Pengine via SWISH
Command line interaction 1
?- use_module(library(pengines)).
true.
?- pengine_rpc(‘https://apps.nms.kcl.ac.uk/reactome-pengine/’,rid_name(‘Protein1’,Name)).
Name = "Rnf111" .
The second, more advanced, example shows how to send a program (alongside a query
that calls the program) to Reactome Pengine to be computed remotely. For instance,
a bioinformatician can use Reactome Pengine to explore paths of reactions through the
human reactome. Code Block 1 shows an example Prolog program that can be used for this
purpose (adapted from https://stackoverflow.com/questions/30328433/definition-of-a-path-trail-walk/30595271#30595271;
available on GitHub at https://github.com/samwalrus/reactome-pengine).
This program includes two core elements. First, the predicate path_program/1 retrieves a
list of clauses that themselves define a program to be sent to Reactome Pengine.
Second, the predicate path_from_to/3 is the main predicate that a bioinformatician would
use to query the Reactome for paths in a variety of ways (without downloading the entire
dataset to their machine). For instance, a researcher can use this predicate to: a) establish
whether a path exists from a particular reaction to another, b) retrieve all paths from
a reaction, or c) retrieve all paths to a reaction. The path_from_to/3 predicate first
retrieves the Reactome Pengine server address (line 25) and the program (specified in
path_program/1, lines 4-23, and called on line 26), and then sends this program alongside
a specified query to Reactome Pengine (lines 27-29). Command line interaction 2 shows
example commands that use this program. Furthermore, examples 6 and 7 in the notebook
show additional refinements to this program, such as further constraints on properties
of reaction paths.
Code Block 1: paths.pl
1 :-use_module(library(pengines)).
2 reactome_server(S):-
3 S=‘https://apps.nms.kcl.ac.uk/reactome-pengine’.
4 path_program(Program):-
5 Program=[
6 (:- meta_predicate path(2,?,?,?)),
7 (:- meta_predicate path(2,?,?,?,+)),
8 (graph_path_from_to(P_2,Path,From,To):-
9 path(P_2,Path,From,To)),
10 (path(R_2, [X0|Ys], X0,X):-
11 path(R_2, Ys, X0,X, [X0])),
12 (path(_R_2, [], X,X, _)),
13 (path(R_2, [X1|Ys], X0,X, Xs) :-
14 call(R_2, X0,X1),
15 non_member(X1, Xs),
16 path(R_2, Ys, X1,X, [X1|Xs])),
17 (non_member(_E, [])),
18 (non_member(E, [X|Xs]) :-
19 dif(E,X),non_member(E, Xs)),
20 ( e(R1,R2):-
21 ridReaction_ridLink_type_ridReaction(R1,_,_,R2)
22 )
23 ].
24 path_from_to(Path,From,To):-
25 reactome_server(Server),
26 path_program(Program),
27 pengine_rpc(Server,
28 graph_path_from_to(e,Path,From,To),
29 [src_list(Program)]).
Figure 7.2: Lines 6-22 are the program sent to Reactome Pengine. In this example, the
program is a list of terms, where each term is a clause that will be interpreted by Reactome
Pengine.
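Once paths.pl is loaded, path_from_to/3 can be called in the modes described above; for example (the reaction IDs here are placeholders, not real Reactome identifiers):

```prolog
% a) Does a path exist between two given reactions?
?- path_from_to(Path, 'ReactionA', 'ReactionB').

% b) Retrieve paths starting from a given reaction.
?- path_from_to(Path, 'ReactionA', To).
```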
Basic usage
The basic usage section of the notebook contains six interactive queries (1-6). We first
briefly explain the data model of the underlying Reactome dataset. Secondly, we explain
the data model that we have built into Reactome Pengine. We then explain where
computations take place. Next, we give some example queries that illustrate the capabilities
of Reactome Pengine in the context of a SWISH notebook, including graphical rendering,
R integration and Javascript applications. Note that further functionality of Reactome
Pengine can be utilised by building client applications in the full desktop version of
SWI-Prolog.
1. Underlying data model of Reactome
The underlying data model is based on an RDF triple graph. This is fully documented on the
Reactome website [30]. In brief, the principal type of entity in Reactome is the reaction, and
reactions have input and output entities. Reactions are also optionally controlled. Each
biological entity, such as a protein, small molecule or reaction, is given an ID. Entities
also include 'Complexes' and 'Protein sets'. A complex is a set of molecular entities that
have combined together. A protein set is a set of proteins that can each perform the same
biological function. Both complexes and protein sets can themselves be composed of
complexes and protein sets. We can query this data directly with the rdf/3 predicate. In
the online version you can see how this is done by clicking the blue play triangle in the
relevant cell. Further solutions can then be found with the 'next' button.
SWISH Query 1
:-use_module(library(pengines)).
reactome_server('https://apps.nms.kcl.ac.uk/reactome-pengine').
rdf(X,Y,Z):-
reactome_server(S),
pengine_rpc(S,rdf(X,Y,Z),[]).
?-rdf(X,Y,Z).
X = ‘http://www.reactome.org/biopax/47/48887#PublicationXref12’,
Y = ‘http://www.biopax.org/release/biopax-level3.owl#author’,
Z = ^^("Chang, Nan-Chi", ‘http://www.w3.org/2001/XMLSchema#string’)
– ACTIVATION
– ACTIVATION-ALLOSTERIC
– INHIBITION
– INHIBITION-ALLOSTERIC
– INHIBITION-COMPETITIVE
– INHIBITION-NONCOMPETITIVE
We can view this graph for an individual pathway using ridPathway_links/2 shown below.
We can use this predicate in all directions, i.e. 1) enumerate pathways and link pairs (neither
argument is instantiated), 2) find the links for a particular pathway (pathway argument is
instantiated), 3) see which pathway has a set of links (links argument is instantiated), or 4)
check whether a given pathway has a given set of links (both arguments are instantiated).
Most predicates in the API are capable of multi-directional queries. Details are given in
the Reactome Pengine API documentation.
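For instance, mode (2) — retrieving the links of one particular pathway — can be sketched as follows; this assumes the reactome_server/1 definition given earlier, and the pathway identifier used is illustrative:

```prolog
% Sketch: calling ridPathway_links/2 with the pathway argument
% instantiated, so that only the links of that pathway are returned.
% 'Pathway1' is an illustrative identifier.
?- reactome_server(S),
   pengine_rpc(S, ridPathway_links('Pathway1', Links), []).
```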
Generate pathways and a list of links in the pathway using:
SWISH Query 2.1
ridPathway_links(P,Ls):-
reactome_server(S),
pengine_rpc(S,ridPathway_links(P,Ls)).
?-ridPathway_links(P,L).
L = [ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction6’, ‘Complex18’, precedes,
‘BiochemicalReaction7’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction5’, ‘Complex17’,
‘ACTIVATION’, ‘BiochemicalReaction6’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction5’,
‘Complex17’, precedes, ‘BiochemicalReaction6’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction7’,
‘Complex14’, precedes, ‘BiochemicalReaction5’), ridReaction_ridLink_type_ridReaction(‘BiochemicalReaction7’,
‘Protein53’, precedes, ‘BiochemicalReaction6’)],
P = ‘Pathway1’;
Generate pathways that have an activation link where the activating entity is ’Complex452’
using:
SWISH Query 2.2
?-ridPathway_links(P,_Ls),
member(ridReaction_ridLink_type_ridReaction(R1, ‘Complex452’, ‘ACTIVATION’, R2),_Ls).
P = ‘Pathway100’,
R1 = ‘BiochemicalReaction301’,
R2 = ‘BiochemicalReaction302’;
4. Finding the Protein Complex that contains the Sonic Hedgehog Protein
Sonic Hedgehog (SHH) is a well-studied protein. To find which protein complex it takes
part in, we can query Reactome Pengine as follows.
SWISH Query 4.1
?-reactome_server(S),pengine_rpc(S,(rid_name(RidSonic,"SHH"),ridProteinSet_component(RidProteinSet,RidSonic),
ridComplex_component(RidComplex,RidProteinSet),rid_name(RidComplex,ComplexName)),[]).
ComplexName = "Patched:Hedgehog",
RidComplex = ‘Complex1074’,
RidProteinSet = ‘Protein1901’,
RidSonic = ‘Protein1902’,
S = ‘https://apps.nms.kcl.ac.uk/reactome-pengine’;
small_pathway(P):-
ridPathway_links(P,L),
length(L,S),
S<35.
small_pathway_with_complex452_activation(P):-
small_pathway(P),
pathway_with_complex452_activation(P).
Running these three queries allows you to see that the answers to the third query are
the intersection of the answers to the first two queries, as expected.
SWISH Query 5.1
?-pathway_with_complex452_activation(P).
P = ‘Pathway100’;
P = ‘Pathway101’
Here we write a program to find a path on the graph of reactions across the whole
Reactome. The predicate path_program/1 returns the program we wish to run, as a
list of clauses. The predicate path_from_to/3 retrieves the server address and program
and sends this, along with the query, to Reactome Pengine. The identified path is then
returned. To perform the same query without using Reactome Pengine the entire database
would need to be downloaded.
SWISH Program 6.1
:-use_module(library(pengines)).
path_program(Program):-
Program=[
(:- meta_predicate path(2,?,?,?)),
(:- meta_predicate path(2,?,?,?,+)),
(graph_path_from_to(P_2,Path,From,To):-path(P_2,Path,From,To)),
(path(R_2, [X0|Ys], X0,X):-path(R_2, Ys, X0,X, [X0])),
(path(_R_2, [], X,X, _)),
(path(R_2, [X1|Ys], X0,X, Xs) :- call(R_2, X0,X1),non_member(X1, Xs),path(R_2, Ys, X1,X, [X1|Xs])),
(non_member(_E, [])),
(non_member(E, [X|Xs]) :- dif(E,X),non_member(E, Xs)),
( e(R1,R2):-ridReaction_ridLink_type_ridReaction(R1,_,_,R2)) %this makes a two place edge term
].
%Send a program and a query to the pengine reactome and return the result.
path_from_to(Path,From,To):-
reactome_server(Server),
path_program(Program),
Query=graph_path_from_to(e,Path,From,To),
pengine_rpc(Server,Query,[src_list(Program)]).
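One way to obtain the shortest paths first with the predicates above is to iteratively deepen on the path length before calling path_from_to/3. A sketch, assuming the program above; the reaction identifiers are illustrative, not actual Reactome IDs:

```prolog
% Sketch: iterative deepening on the path length, so that the
% shortest paths between two reactions are enumerated first.
% The reaction identifiers are illustrative.
?- length(Path, _),
   path_from_to(Path, 'BiochemicalReaction1', 'BiochemicalReaction10').
```

Note that each deepening step issues a fresh call to the service, so in practice one may prefer to include the length/2 goal in the query sent to Reactome Pengine.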
The following query performs a breadth-first search to find the shortest paths.
Advanced usage
The advanced usage section of the notebook contains an additional five queries (7-11).
7. Definite Clause Grammars
As we have discussed (in Chapter 5), in Prolog we can describe lists declaratively; for
instance, we can write a definite clause grammar (DCG) for paths. We remind the reader
that DCG predicates have a slightly different syntax to standard Prolog predicates, but
this does not affect their ability to be sent to Reactome Pengine. For simplicity of
presentation the DCG in the example below runs on the SWISH server. This example finds
paths that pass through a reaction and that satisfy a user-defined rule. In this case we
simply ask for a path that passes through a reaction that has CTDP1 (Protein11301) as
an input. Additionally we add Path=[_,_|_] to the query. This serves as a constraint
to find paths containing at least two reactions.
SWISH Program 7.1
ridReaction_input(Rid,I):-
reactome_server(S),
pengine_rpc(S,ridReaction_input(Rid,I),[]).
begin --> [].
begin --> [_], begin.
needs --> {ridReaction_input(R,'Protein11301')}, [R].
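A query combining this grammar with the path program might then look as follows; this is a sketch, and it assumes the path_from_to/3 predicate of Program 6.1 is also defined in the notebook:

```prolog
% Sketch: find a path of at least two reactions that passes through
% a reaction with CTDP1 (Protein11301) as an input, by checking each
% candidate path against the DCG (begin, needs, begin).
?- Path = [_,_|_],
   path_from_to(Path, _From, _To),
   phrase((begin, needs, begin), Path).
```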
graphviz_program(Program):-
Program =[
(linktype_color(X,green):- member(X,['ACTIVATION', 'ACTIVATION-ALLOSTERIC'])),
(linktype_color(X,red):- member(X,['INHIBITION',
'INHIBITION-ALLOSTERIC',
'INHIBITION-COMPETITIVE',
'INHIBITION-NONCOMPETITIVE'])),
(linktype_color(precedes,black)),
(link_graphvizedge(ridReaction_ridLink_type_ridReaction(R1, _, Type, R2),
edge(R1Name->R2Name,[color=Color])):-
linktype_color(Type,Color),
rid_name(R1,R1Name),
rid_name(R2,R2Name))
].
pathway_graphviz(P,G):-
reactome_server(S),
graphviz_program(Program),
pengine_rpc(S,(ridPathway_links(P,L),maplist(link_graphvizedge,L,GE)),[src_list(Program)]),
G = digraph(GE).
pathways(P):-
P=['Pathway1','Pathway2','Pathway5','Pathway12'].
pathway_pairsize(P,PName-L):-
reactome_server(S),
pengine_rpc(S,(ridPathway_reactions(P,R),rid_name(P,PName)),[]),
length(R,L).
chart(Chart):-
pathways(P),
maplist(pathway_pairsize,P,Pairs),
Chart = c3{data:_{x:elem, rows:[elem-count|Pairs], type:bar},
axis:_{x:_{type:category}}}.
SWISH Program 9.1
:- use_rendering(table).
:- use_module(library(sgml)).
:- use_module(library(xpath)).
elem_in(URL, Elem,X,Y) :-
load_html(URL, DOM, []),
xpath(DOM, //’*’(self), element(Elem,X,Y)).
url('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GSM38051&id=4196&db=GeoDb_blob01').
splitter_row_cols(S,R,C):-
split_string(R,S,'',C).
url_datatable(U,Table):-
url(U),
elem_in(U,pre,_,[_One,_Two,_Three,_Four,_Five,Data|_Rest]),
split_string(Data,'\n','',Rows),
maplist(splitter_row_cols('\t'),Rows,Table).
row_pair([X,Y,_,_],X-Y).
assoc_key_val(Assoc,Key,Value):- get_assoc(Key, Assoc,Value).
assoc_key_val(Assoc,Key,na):- \+get_assoc(Key,Assoc,_Value).
?-length(List,10),append([_|List],_Rest,_All),url_datatable(_U,_All).
List=
|-------------------+-----------+-----+------------|
| "AFFX-BioB-5_at" | "757.7" | "P" | "0.00039" |
| "AFFX-BioB-M_at" | "933.7" | "P" | "0.000095" |
| "AFFX-BioB-3_at" | "525.6" | "P" | "0.000095" |
| "AFFX-BioC-5_at" | "1999.5" | "P" | "0.000044" |
| "AFFX-BioC-3_at" | "2339.5" | "P" | "0.000044" |
| "AFFX-BioDn-5_at" | "4321.3" | "P" | "0.000044" |
| "AFFX-BioDn-3_at" | "9229.4" | "P" | "0.00007" |
| "AFFX-CreX-5_at" | "21949.9" | "P" | "0.000044" |
| "AFFX-CreX-3_at" | "26022.8" | "P" | "0.000044" |
| "AFFX-DapX-5_at" | "1171.1" | "P" | "0.00006" |
|-------------------+-----------+-----+------------|
We can now integrate our web scraping and Reactome Pengine in a single query:
?-reactome_server(_S),
length(_List,12050),append([_|_List],_Rest,_All),url_datatable(_U,_All), maplist(row_pair,_List,_Pairs),
list_to_assoc(_Pairs,_Assoc),pengine_rpc(_S,ridProtein_probelist('Protein11042',Probelist),[]),
maplist(atom_string,Probelist,_ProbelistStrings), maplist(assoc_key_val(_Assoc),_ProbelistStrings,Valuelist).
10. R Integration
The R programming language is built into SWISH, which means that we can perform
statistical analysis using familiar tools. The syntax used is based on Real [5]. In the
example below we query the number of edges in a set of pathways and the number of
reactions in the same set. We then use R to calculate the correlation, fit a line and plot
these using R’s qplot function.
SWISH Program 10.1
:- <- library("ggplot2").
program(Program):-
Program=[(pathways_30(P):- findnsols(30,Rid,rid_type_iri(Rid,'Pathway',_),P)),
(ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,R,_,_,_]),
(ridPathway_reaction_(P,R):-ridPathway_links(P,L),member(E,L),E=..[_,_,_,_,R]),
(ridPathway_reactions_(P,Rs):-
setof(R,ridPathway_reaction_(P,R),Rs)),
(pathway_numberedges_numberreactions(P,NE,NR):-
ridPathway_links(P,Ls),length(Ls,NE),
ridPathway_reactions_(P,Rs),length(Rs,NR)
),
(xs_ys(Xs,Ys):-pathways_30(P),maplist(pathway_numberedges_numberreactions,P,Xs,Ys))].
get_data(Xs,Ys):-
reactome_server(S),
program(Program),
pengine_rpc(S,xs_ys(Xs,Ys),[src_list(Program)]).
Correlation = [0.9967855883592985],
NumberOfEdges = [734, 174, 8, 57, 57, 7, 6, 2, 30, 2, 5, 9, 61, 60, 56, 9, 2, 2, 7, 57, 66, 66, 45, 7, 5, 20,
80, 4, 11, 14],
NumberOfReactions = [469, 134, 7, 52, 52, 4, 5, 3, 30, 3, 5, 11, 54, 52, 51, 7, 3, 2, 6, 52, 57, 57, 42, 4, 5
, 14, 54, 4, 11, 15]
And we use this query to plot the data with a fitted line:
SWISH Query 10.2
?-get_data(NumberOfEdges,NumberOfReactions),
<-qplot(NumberOfEdges,NumberOfReactions,geom=c("point","smooth"),
xlab="Number of Edges", ylab="Number of Reactions").
This graph shows different pathways and how the number of their edges and the number of their reactions are related. It
is plotted by calling the R function qplot with the parameter 'geom' set to 'point' and 'smooth'. This results in a loess fit
line with confidence limits, which are added by default.
d3. Therefore, we can make interactive charts and web applications with Reactome data
inside a SWISH notebook. To see the Javascript code double click the text in the notebook
and scroll down to the ‘script’ tags.
We illustrate this functionality by combining Reactome Pengine and web-scraping to
build a simple application to show Hive Plots [86]. In Hive Plots the geometric placement
of nodes on a graph has meaning, based on user defined rules. As we are using Prolog, we
can build these rules easily. In the example below we visualise two features of reactions.
These features are mapped to the geometric placement of nodes in the graph. The first
feature is based on the network properties (the degree) of the reaction nodes. A reaction
is assigned to one of three categories: 1) reactions with a larger outdegree than indegree,
2) reactions with a larger indegree than outdegree, and 3) reactions with an equal indegree
and outdegree. This first feature is illustrated by placing nodes on one of three axes: the
vertical axis for category 1, the 4 o'clock axis for category 2, and the 8 o'clock axis for
category 3.
For the second feature, we use the data of gene expression that we have scraped from
the GEO website and perform the following steps:
1. Query Reactome Pengine for the set of probes that code for the set of proteins that
are inputs for each reaction in the graph.
2. For each probe we retrieve the expression levels from the scraped GEO data.
3. Calculate the sum of the expression levels of the probes of each reaction.
This second feature is illustrated by the distance of the reaction node from the centre
of the graph. In the notebook a drop-down menu is presented which, when selected, runs
query 10.1. This results in a term which is parsed by JavaScript to produce Figure 7.3.
The visualisation enables comparison of the graph properties and expression levels.
Figure 7.3: Hive plot of Pathway 21: This plot gives a geometric representation of a
pathway network. This plot is created 'live' by scraping the NCBI website and combining
this information with the data in Reactome. It depicts the gene expression values ag-
gregated into reactions for GSM38051 alongside some network properties of the reaction
graph. Each node represents a reaction. Nodes are placed on one of three axes. 12 O’clock
axis: reactions that have a larger outdegree than indegree. 4 O’clock axis: reactions that
have a larger indegree than outdegree. 8 O’clock axis: reactions that have an equal in and
outdegree. Distance from the centre indicates an aggregate value of the expressed probes
that code for proteins that take part in this reaction. This kind of plot can be used to
quickly see if network properties are associated with gene expression and how this might
change for different samples or different pathways.
A command-line script can be used as part of an existing UNIX pipeline. An example script
is given in Code Block 2, and this is then demonstrated in a UNIX pipeline in Code Block 3
(using the example file proteins.txt). This script takes a file with a Reactome protein identifier on
each line and outputs the Affymetrix probe identifiers for each protein (retrieved from
Reactome Pengine). In order to execute this script the user will need to make the script
executable using the UNIX command ‘chmod’.
Code Block 2
1 #!/usr/bin/env swipl
2
3 :- use_module(library(pengines)).
4 :- initialization main.
5
6 server(S):-S="https://apps.nms.kcl.ac.uk/reactome-pengine".
7
8 main :-
9 catch(readloop, E, (print_message(error, E), fail)),
10 halt.
11 main :-
12 halt(1).
13
14
15 readloop:-
16 read_line_to_string(user_input,String),
17 string_test(String).
18
19 string_test(String):-
20 dif(String,end_of_file),
21 atom_string(Atom,String),
22 ridProtein_probelist(Atom,Probes),
23 writeln(Probes),
24 readloop.
25
26 string_test(Term):-
27 Term = end_of_file,
28 fail.
29
30
31 ridProtein_probelist(R,P):-
32 server(S),
33 pengine_rpc(S,ridProtein_probelist(R,P),[]).
34
technology is useful for bioinformatics research. While downloading the Reactome dataset
in its entirety is currently feasible, biological datasets will only increase in size, such that
efficient and flexible data querying approaches, such as pengines, will be imperative for
future analysis involving the integration of omics data.
We compare Reactome Pengine with 1) the REST API and 2) the SPARQL API. First,
the REST API is limited to a number of queries designed by the Reactome maintainers
(see documentation here:
http://www.reactome.org/pages/documentation/developer-guide/restful-service/#API).
Any query that can be performed using the REST API can also be performed using Reactome
Pengine. For example, the REST service can be used to find the sub-pathways of
‘Apoptosis’, and an equivalent query using Reactome Pengine is given in Code Block 4.
Code Block 4
1 :-use_module(library(pengines)).
2 reactome_server('https://apps.nms.kcl.ac.uk/reactome-pengine').
3
4 my_program(P):-
5 P=[
6 (
7 pathwayName_subpathway(PName,SubName):-
8 rid_name(RidPathway,PName),
9 ridPathway_component(RidPathway,RidComponent),
10 rid_type_iri(RidComponent,'Pathway',_),
11 rid_name(RidComponent,SubName)
12 )
13 ].
14 pathwayName_subpathway(PName,SubName):-
15 reactome_server(S),
16 my_program(P),
17 pengine_rpc(S,pathwayName_subpathway(PName,SubName),[src_list(P)]).
The Reactome SPARQL API is more flexible than the REST API for two key reasons.
First, SPARQL can be used to specify SQL-like queries over the data, rather than a
predefined subset specified by an API. Second, SPARQL can be used to interrogate several
datasets in a single query (called a federated query). For example, a bioinformatician could
query both Reactome and Uniprot to integrate data from these disparate sources.
While SPARQL is more flexible than a REST API, it is less flexible than Reactome
Pengine because it is not a full programming language. This means that typically de-
velopers using SPARQL will have a two language setup, for example, SPARQL might
be embedded in Java. This can be problematic due to the paradigm mismatch, where
SPARQL is relational and Java is an imperative object oriented language. This is not the
case for Prolog which has a relational paradigm itself, and is a full programming language.
Therefore, using Reactome Pengine from within a Prolog program means that the data
can be queried and manipulated within a single program.
The Reactome Pengine Prolog API allows for simpler and more flexible federated queries
than using SPARQL. Complex federated queries are simpler to compose in the Reactome
Pengine due to its ability to build composite queries, as discussed above. Furthermore,
because we can make use of standard Prolog libraries within the query sent to the Reactome
Pengine, we can also include queries to other data services, including REST, SPARQL,
HTML and other pengine services. An example of this is given in the accompanying
SWISH notebook (example 9).
It is also possible to have the web-logic query embedded in another language that supports
HTTP requests, for example a shell script, JavaScript or Python. Code Block 6 gives an
example of a Node.js program that uses the pengines npm module:
https://www.npmjs.com/package/pengines
This feature is useful when introducing Reactome Pengine to existing code pipelines that
might not be written in Prolog. However building Prolog pipelines and using Prolog as the
'glue' language is very powerful, as we have illustrated throughout this work. Notably the
ability to 'name and reuse' queries (example 5 in the accompanying notebook) and the fact
that Prolog is a homoiconic language (where data and code use the same syntax) means
that Prolog pipelines are very effective for bioinformatic work – especially for queries across
multiple data end points (sometimes known as federated queries).
Code Block 6
1 /* We first require the pengines library, then we define our prolog query and define functions for success
2 and failure which log to the console.*/
3
4 pengines = require('pengines');
5
6 peng = pengines({
7 server: "https://apps.nms.kcl.ac.uk/reactome-pengine/pengine",
8 sourceText: 'small_pathway(P):- ridPathway_links(P,L), length(L,S), S<35.',
9 ask: "small_pathway(X)",
10 chunk: 100,
11 }
12 ).on('success', handleSuccess).on('error', handleError);
13 function handleSuccess(result) {
14 console.log(result)
15 }
16 function handleError(result) {
17 console.error(result)
18 }
In addition to providing the data available from Reactome, Reactome Pengine also includes
over 30 public predicate definitions which offer intuitive and fast access to elements
of the data. Full details of these predicates are available in the online documentation
(https://apps.nms.kcl.ac.uk/reactome-pengine/documentation). Furthermore,
as Reactome Pengine is monitored, commonly used queries can be added to Reactome
Pengine as new predicate definitions.
It is possible to change the format of replies from the Reactome Pengine from Prolog
terms to either CSV or JSON file formats. For an example of a shell script that queries
a pengine for a CSV file see: https://github.com/SWI-Prolog/swish/blob/master/
client/swish-ask.sh
7.7 Summary
Reactome Pengine is a web service that provides a simple way to logically query the human
reactome on the web. It provides both raw RDF data access and a set of built-in predicates
to facilitate this. It can be accessed by both local Prolog programs and web notebooks
such as SWISH. Programs hosted on SWISH notebooks allow for easy sharing and the
ability to render query solutions graphically. In contrast, programs running in SWI-Prolog
client applications have access to the full power of Prolog including system calls and they
also maintain privacy of local computations. Either of these options allows researchers to
perform analysis of data that requires querying the human reactome and integrating with
other data sources. The Pengine technology allows the user to bring the small program to
the large data. Increasingly more (and larger) biological datasets are becoming available
online and while we have presented a Pengine web service for Reactome, it is possible to
build these for any other online biological dataset. This is potentially very powerful, as
researchers will not have to download and manage these datasets but can build pipelines
that consist of a set of programs sent to these pengine web services. This could result in
the uptake of a unified knowledge representation based on first-order logic, and in
tremendous resource savings.
Chapter 8
The work presented in this chapter is based on the publication: Neaves, Samuel R., Louise
AC Millard, and Sophia Tsoka. “Using ILP to Identify Pathway Activation Patterns in
Systems Biology.” International Conference on Inductive Logic Programming. Springer,
Cham, 2015.
CHAPTER 8. USING ILP IN SYSTEMS BIOLOGY 145
the advantage of being able to construct novel sets by sharing variables across predicates
that define the sets. For example, a set could be defined as the genes annotated with two
Gene Ontology terms.
Other ways researchers have tried to integrate the use of known relations include
adapting the classification approach of stage 2. New features are built by aggregating
across a predefined set of genes. For example, an aggregation may calculate the average
expression value for a pathway [69].
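Such an aggregate feature can be sketched in Prolog as follows; the predicates pathway_gene/2 and gene_expression/3, and all data shown, are hypothetical:

```prolog
% Hypothetical example facts (illustrative only).
pathway_gene(p1, g1).
pathway_gene(p1, g2).
gene_expression(s1, g1, 2.0).
gene_expression(s1, g2, 4.0).

% Mean expression of the genes of a pathway in a given sample.
pathway_mean_expression(Sample, Pathway, Mean) :-
    findall(V,
            ( pathway_gene(Pathway, Gene),
              gene_expression(Sample, Gene, V) ),
            Vs),
    Vs \= [],
    sum_list(Vs, Sum),
    length(Vs, N),
    Mean is Sum / N.

% ?- pathway_mean_expression(s1, p1, M).
% M = 3.0.
```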
A major limitation of current classification approaches is that the models are con-
structed from either genes or crude aggregates of sets of genes, and so ignore the detailed
relations between entities in a pathway. In order to incorporate more complex relations an
appropriate network representation is needed, such that biological relations are adequately
represented. For example, a simple directed network of genes and proteins does not repre-
sent all the complexities of biochemical pathways, such as the dependencies of biochemical
reactions. To do this, bipartite graphs or hypergraphs can be used [157].
One way to incorporate more complex relations is by creating topologically defined
sets, where a property of the network is used to group nodes into related sets. One method
to generate these sets is Community Detection. However, this approach can create crude
clusters of genes, that do not account for important biological concepts. Biologists may be
interested in complex biological interactions rather than just sets of genes.
Network motif and frequent sub-graph mining are methods that can look for structured
patterns in biological networks [79]. However, in these approaches the patterns are often
described in a language which is not as expressive as first order logic. This means they are
unable to find patterns with uninstantiated variables, or with relational concepts such as
paths or loops.
To our knowledge only one previous work used ILP for this task [70]. Here the authors
propose identifying features consisting of the longest possible chain of nodes in which
non-zero node activation implies a certain (non-zero) activation in its successors, which
they call a Fully Coupled Flux. Their work is preliminary, with limited evaluation of the
performance of this method.
The aim of this chapter is to illustrate how we can identify pathway activation patterns,
that differ between biological samples of different classes. A pathway activation pattern is
a pattern of active reactions on a pathway. Our novel approach uses known relations be-
tween entities in a pathway, and important biological concepts as background knowledge.
These patterns may give a biologist different information than models built from simple
gene features. Therefore, we seek to build models that are of comparative predictive per-
formance to those of previous works, while also providing potentially useful explanations.
In this work we take a propositionalization-based ILP approach, where we represent the
biological systems as a Prolog knowledge base (composed of first order rules and facts), and
then reduce this to an attribute-value representation (a set of propositions), before using
standard machine learning algorithms on this data. We therefore begin with an overview
of propositionalization, and a discussion of why it is appropriate for this task.
as under this approach we do not allow a predicate which relates individuals. This strong
inductive bias is appropriate for our case, as we do not wish to consider relationships
between the individuals. The fourth reason is that we can perform many other learning
tasks on the transformed data, with the vast array of algorithms available for attribute-
values datasets.
In this work we use query-based propositionalization methods, and now describe some
key algorithms. A review of some publicly available propositionalization methods was re-
cently performed by Lavrac et al. [91]. These include Linus, RSD, TreeLiker (HiFi and
RelF algorithms), RELAGGS, Stochastic Propositionalization, and Wordification, along-
side the more general ILP toolkit, Aleph [137]. Other methods that were not mentioned
in that review include Warmr [41], Cardinalisation [41], ACORA [123] and CILP++ [49].
There has also been work on creating propositionalization methods especially for linked
open data, both in an automatic way [129], and in a way where manual SPARQL queries
are made [128]. The methods in these papers are not appropriate for our work because
our data is not entirely made up of linked open data, and we wish to include background
rules encoding additional biological knowledge. It is also worth noting that certain kernel
methods can be thought of as propositionalization [38].
Wordification treats relational data as documents and constructs word-like features.
These are not appropriate for our task, as they do not correspond to the kind of
patterns we are looking for, i.e. features with uninstantiated variables. Stochastic proposi-
tionalization performs a randomised evolutionary search for features. This approach may
be interesting to consider for future work. CILP++ is a method for fast bottom-clause
construction, defined as the most specific clause that covers each example. This method is
primarily designed to facilitate the learning of neural networks, and has been reported to
perform no better than RSD when used with a rule-based model [49].
ACORA, Cardinalisation and RELAGGS are database-inspired methods of propositionalization.
They are primarily designed to perform aggregation across a secondary table,
with respect to a primary table. ACORA is designed to create aggregate operators for
categorical data, whereas RELAGGS performs standard aggregation functions (summa-
tion, maximum, average etc.) suitable for numeric data. Cardinalisation is designed to
use complex aggregates, where conditions are added to an aggregation. In our work we
manually design an aggregation method, described in Section 8.4.2. These aggregation
systems are not appropriate for graph-based datasets, because representing the graph as
two tables (denoting edges and nodes) and aggregating on paths through the graph would
require many self joins on the edge table. Relational databases are not optimised for this
task, such that the resulting queries would be inelegant and inefficient.
The propositionalization methods we use in this work are TreeLiker and Warmr. Tree-
Liker is a tool that provides a number of algorithms for propositionalization including
RelF [87]. RelF searches for relevant features in a block-wise manner, which means that
irrelevant and reducible features can be discarded during the search. The algorithms in
TreeLiker are limited to finding tree-like features where there are no cycles. RelF has been
shown to scale much better than previous systems such as RSD, and can learn features with
tens of literals. This is important for specifying non-trivial pathway activation patterns.
Warmr is a first order equivalent of frequent item-set mining, where a level wise search
of frequent queries in the knowledge base is performed. Warmr is used as a proposition-
alization tool by searching for frequent queries in each class. In Warmr it is possible to
specify the language bias using conjunctions of literals, rather than just individual literals,
and to put constraints on which literals are added. This allows strong control of the set
of possible hypotheses that can be considered. Finally, unlike TreeLiker, Warmr can use
background knowledge, defined as facts and rules.
8.4 Methods
An overview of the process we take is shown in Figure 8.1. First, we extract the reaction
graph for each pathway, from Reactome. Second, we infer the instantiated reaction graphs
for each instance in the dataset. Third, we identify pathway activation patterns using
propositionalization, and then build classification models to predict the lung cancer types.
Lastly, we evaluate our models using a hold-out dataset. We begin with a description of
the datasets we use in this work.
We use a two class lung cancer dataset obtained from GEO, which was previously used in
the SBV Improver challenge [127]. This dataset is made up from the following datasets:
GSE2109, GSE10245, GSE18842 and GSE29013 (n=174), used as training data, and
GSE43580 (n=150), used as hold-out data. We used the examples where the participants
were labelled as having either SCC or AC lung cancer. This is the same data organisation
as that used in SBV Improver challenge, to allow us to compare our results with the top
performing method from this challenge.
This data contains gene expression measurements from across the genome measured by
Affymetrix chips. Each example is a vector of 54,614 real numbers. Each value denotes
the amount of expression of mRNA of a gene. There is a uniform class distribution of
examples, in both the training and holdout dataset.
We use the Reactome database to provide the background knowledge, describing biological
pathways in humans. Reactome [30] is a collection of manually curated peer reviewed
pathways. Reactome is made available as an RDF file, which allows for simple parsing
using SWI-Prolog's semantic web libraries, and contains 1,351,811 triples. Reactome uses
the bipartite network representation of entities and reactions. Entity types include nucleic
acids, proteins, protein complexes, protein sets and small molecules. Protein complexes and
protein sets can themselves comprise other complexes or sets. In addition, a reaction may
be controlled (activated or inhibited) by particular entities. A reaction is a chemical event,
where input entities (known as substrates), facilitated by enzymes, form other entities
(known as products).
Figure 8.2a shows a simple illustration of a Reactome pathway. P nodes denote proteins
or protein complexes, R nodes denote reactions, and C nodes denote catalysts. A black
arrow illustrates that a protein is an input or output of a reaction. A green arrow illustrates
that an entity is an activating control for a reaction. A red arrow illustrates that an entity
is an inhibitory control for a reaction.

Figure 8.2: Reaction graph illustrations. There are three types of relationship between
reactions: follows (black solid lines), activation (green dashed) and inhibition (red
dash-dotted). Figure a) is the initial graph extracted from Reactome, which is bipartite
with reactions and entities as nodes; figure b) is the reaction-centric graph, in which
reactions are directly linked. Both a) and b) depict the same pathway.

Reaction R1 has 3 protein substrates and 3 protein
products, and is controlled by catalyst C. Reactions R3 and R4 both have one protein
substrate and one protein product. R3 is inhibited by P2, such that if P2 is present then
reaction R3 will not occur. R4 is activated by P3, such that P3 is required for reaction
R4 to occur.
We reduce the Reactome bipartite graph to a Boolean network of reactions. This simplifies
the graphs while still adequately encoding the relationships between entities. Previous work
has shown that Boolean networks are a useful representation of biological systems [154],
and unlike gene and protein Boolean networks ours encodes the dependencies between
reactions.
The Boolean networks we create are reaction-centric graphs, where nodes are reactions
and directed edges are labelled as 'activation', 'inhibition' or 'follows', corresponding
to how reactions are connected. For example, Figure 8.2b shows the reaction-centric graph
corresponding to the Reactome graph shown in Figure 8.2a. Reaction R2 follows R1, because
in the Reactome graph P1 is an output of R1 and an input to R2. Reaction R1 inhibits
R3, because P2 is an output of R1 and also an inhibitory control of R3. Reaction R1
activates reaction R4, because P3 is an output of R1 and an activating control of R4.
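To make the construction concrete, the following is a minimal Python sketch (not the thesis implementation, which works in Prolog) of deriving the reaction-centric edges of Figure 8.2b from a bipartite pathway. The dictionary encoding and the lower-case entity names are invented placeholders for this example.

```python
# Sketch: derive the reaction-centric graph from a bipartite Reactome-style
# pathway. Each reaction lists its input, output and control entities.

def reaction_graph(reactions):
    """Return labelled edges between reactions that share entities."""
    edges = set()
    for r1, d1 in reactions.items():
        for r2, d2 in reactions.items():
            if r1 == r2:
                continue
            outs = set(d1["outputs"])
            if outs & set(d2["inputs"]):                  # output of r1 is input of r2
                edges.add((r1, r2, "follows"))
            if outs & set(d2.get("activators", [])):      # output of r1 activates r2
                edges.add((r1, r2, "activates"))
            if outs & set(d2.get("inhibitors", [])):      # output of r1 inhibits r2
                edges.add((r1, r2, "inhibits"))
    return edges

# The pathway of Figure 8.2a, with P1..P3 as the shared entities.
pathway = {
    "R1": {"inputs": ["a", "b", "c"], "outputs": ["P1", "P2", "P3"]},
    "R2": {"inputs": ["P1"], "outputs": ["d"]},
    "R3": {"inputs": ["e"], "outputs": ["f"], "inhibitors": ["P2"]},
    "R4": {"inputs": ["g"], "outputs": ["h"], "activators": ["P3"]},
}

print(sorted(reaction_graph(pathway)))
# [('R1', 'R2', 'follows'), ('R1', 'R3', 'inhibits'), ('R1', 'R4', 'activates')]
```

This reproduces the three edge types of Figure 8.2b: R2 follows R1, R1 inhibits R3 and R1 activates R4.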
Figure 8.3: Illustration of logical aggregation. Known biological mechanisms can be repre-
sented as OR or AND gates. The triangular nodes are binary probe values, created using
Barcode.
Boolean networks [154] are a common abstraction in biological research, but they are
normally applied at the gene or protein level rather than the reaction level. In order to use a
Boolean network abstraction on a reaction network, we apply a logical aggregation method
that aggregates measured probe values (from the GEO dataset) into reactions. This assigns
a binary value to each reaction, producing instantiated versions of the reaction-centric
graph created in the previous step.
Before we can use this logical aggregation, we first transform the original probe values
into binary values: an estimated value denoting whether a gene is expressed or not. We
do this using Barcode [99], a tool that converts the continuous probe values to binary
variables by applying previously learnt thresholds to microarray data. It is important to
note that Barcode makes it possible to compare gene expression both within a sample
and between samples that are potentially measured on different arrays.
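The idea can be sketched as a simple per-probe thresholding step. This Python fragment is only illustrative: the probe names and threshold values are invented, whereas Barcode ships thresholds learnt in advance for each array platform.

```python
# Sketch: binarise raw probe intensities with fixed, pre-learnt thresholds,
# in the spirit of Barcode. The thresholds below are invented for illustration.

THRESHOLDS = {"probe_1": 6.2, "probe_2": 4.8, "probe_3": 7.1}  # assumed values

def binarise(sample):
    """Map raw probe intensities to on (1) / off (0) expression calls."""
    return {p: int(v > THRESHOLDS[p]) for p, v in sample.items()}

print(binarise({"probe_1": 7.0, "probe_2": 3.9, "probe_3": 7.5}))
# {'probe_1': 1, 'probe_2': 0, 'probe_3': 1}
```

Because the thresholds are fixed per probe rather than re-estimated per sample, calls made on different samples (and, with platform-specific thresholds, different arrays) remain comparable.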
The logical aggregation process is illustrated in Figure 8.3. This process takes the
binary probe values as input, and uses the structure provided by the Reactome graph, and
key biological concepts, to build reaction level features. As we have already described,
each reaction has a set of inputs that are required for that reaction to occur. We interpret
each reaction input as a logical circuit with the following rules. The relationship
between probes and proteins is treated as an OR gate (matched by UniProt IDs), because
multiple probes can map to the same protein; we assume that the measurement from
a single probe is a reliable indicator of whether the protein product is present.
The formation of a protein complex requires all of its constituent proteins and therefore is
treated as an AND gate. A protein set is a set of molecules that are functionally equivalent
such that only one is needed for a given reaction, and so this is treated as an OR gate.
Inputs to a reaction are treated as an AND gate. A reaction is on if all of its inputs are on,
any activating agents are on, and any inhibitory agents are off. We note that both protein
sets and protein complexes can themselves comprise arbitrarily nested complexes or sets.
Figure 8.3 illustrates the logical aggregation rules of a single reaction. This reaction has
two inputs and one activating control. The two inputs are a protein complex and a protein
set, and the values of these are calculated using their own aggregation processes, labelled
A and B. The aggregation in process A starts with the binary probe values and first infers
the values of three proteins. The protein complex is then assigned the value on because all
proteins required for this complex are present (all on themselves). The aggregation in
process B starts by inferring the values of two proteins from the probe values. One protein
is on and the other is off. The protein set is assigned the value on because only one protein
in this set is required for this protein set to be on. There also exists an activating control
for the reaction, a protein whose value is determined by a process labelled C. This protein
is assigned the value on, because both probe values are on, when at least one is required.
As all inputs of the reaction are on and the activating control is also on, the reaction is
assigned the value on.
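The aggregation rules can be summarised as a small recursive evaluator. The following is an illustrative Python rendering of the gates described above, not the thesis code; the nested-tuple encoding and the probe values mirroring Figure 8.3 are assumptions made for this example.

```python
# Sketch: recursive logical aggregation. Entities are nested tuples:
# ("protein", [probe values]) is an OR gate over probes,
# ("complex", [entities]) an AND gate, ("set", [entities]) an OR gate.

def on(entity):
    kind, parts = entity
    if kind == "protein":      # probes -> protein: OR gate
        return any(parts)
    if kind == "complex":      # every constituent required: AND gate
        return all(on(p) for p in parts)
    if kind == "set":          # one functionally equivalent member suffices: OR gate
        return any(on(p) for p in parts)
    raise ValueError(kind)

def reaction_on(inputs, activators=(), inhibitors=()):
    """A reaction is on iff all inputs and activators are on and no inhibitor is."""
    return (all(on(e) for e in inputs)
            and all(on(e) for e in activators)
            and not any(on(e) for e in inhibitors))

# The reaction of Figure 8.3: a complex (process A), a set (process B)
# and an activating control (process C), with assumed probe values.
complex_a = ("complex", [("protein", [True, True]), ("protein", [True]),
                         ("protein", [True, True])])
set_b = ("set", [("protein", [False]), ("protein", [True])])
ctrl_c = ("protein", [True, True])

print(reaction_on([complex_a, set_b], activators=[ctrl_c]))  # True
```

The recursion handles the arbitrarily nested complexes and sets mentioned above, since `on` calls itself on each constituent entity.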
To identify predictive pathways, we first run TreeLiker on each pathway. This generates a
set of attribute-value features for each instantiated pathway. We use TreeLiker with the
RelF algorithm and the following language bias:
set(template,[reaction(-R1,#onoroff),link(+R1,-R2,!T1),
reaction(+R2,#onoroff),link(+R2,-R3,!T2),reaction(+R3,#onoroff),
link(!RA,-R4,!T3),link(+R4,!RB,!T4),link(+R1,-R2,#T1),
link(+R2,-R3,#T2),link(!RA,-R4,#T3),link(+R4,!RB,#T4)])
This language bias contains two types of literal: reaction/2 and link/3. The second
argument of the reaction literal is always constrained to be a constant indicating whether
a reaction is on or off. The link/3 literal depicts the relationship between two reactions,
where the third argument is either a variable or a constant describing the type of
relationship: follows, activates or inhibits. For example, an identified pattern may
contain the literal link(r1,r2,follows), specifying that an output entity of reaction r1
is an input to reaction r2.
We then test the performance of the features of each pathway using 10-fold cross-
validation. We use the J48 decision tree algorithm (from Weka) because it builds a
model that gives explanations for its predictions. We calculate the average accuracy across
folds for each pathway, and rank the pathways from highest to lowest accuracy. We then
use the top-ranked pathways as input to three different methods, to identify predictive
pathway activation patterns.
Method 1
This approach simply takes a pathway of interest, generates a single model using the
J48 algorithm on the training data, and then evaluates its performance on the hold-
out data. The decision tree can then be inspected to determine which activation patterns
are predictive of lung cancer type. We demonstrate this approach with the top-ranked
pathway.
We illustrate the use of Warmr to generate pathway activation patterns, using one of our
identified 'top' pathways.
We use Warmr with two particular concepts in the background knowledge. First, we use
a predicate longestlen/3, which calculates the longest chain of on reactions in an example,
for the pathway on which Warmr is being run. The arguments are: 1) the beginning
reaction of the path, 2) the end reaction of the path with the longest length, and 3) the
length of this path. This longest-length concept corresponds to the fully coupled flux of
previous work [70].
Second, we use the predicates inhibloop/1 and actloop/1, which depict inhibition and
activation loops: a path of on reactions that forms a loop in which one of the edges is an
inhibition or activation edge, respectively. Inhibition and activation loops are common
biological regulatory mechanisms [148].
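As a rough illustration, the two concepts might be rendered as follows in Python (the thesis defines them as Prolog background predicates; restricting chains to follows edges, and the toy edge list below, are assumptions of this sketch).

```python
# Sketch of the two background concepts: the longest chain of on reactions,
# and detection of an activation loop among on reactions.

def _on_adj(edges, state, types=None):
    """Adjacency over on reactions, optionally restricted to some edge types."""
    adj = {}
    for a, b, t in edges:
        if state.get(a) and state.get(b) and (types is None or t in types):
            adj.setdefault(a, []).append(b)
    return adj

def longest_on_path(edges, state):
    """(start, end, length) of the longest simple chain of on reactions,
    taken here over 'follows' edges (an assumption of this sketch)."""
    adj = _on_adj(edges, state, types={"follows"})

    def dfs(node, seen):
        best = (node, 1)
        for nxt in adj.get(node, []):
            if nxt not in seen:
                end, n = dfs(nxt, seen | {nxt})
                if n + 1 > best[1]:
                    best = (end, n + 1)
        return best

    paths = [(s, *dfs(s, {s})) for s in adj]
    return max(paths, key=lambda p: p[2]) if paths else None

def has_activation_loop(edges, state):
    """True if some activation edge (a, b) between on reactions closes a loop,
    i.e. b can reach a again through on reactions."""
    adj = _on_adj(edges, state)

    def reaches(src, dst, seen):
        if src == dst:
            return True
        return any(reaches(n, dst, seen | {n})
                   for n in adj.get(src, []) if n not in seen)

    return any(t == "activates" and state.get(a) and state.get(b)
               and reaches(b, a, {b})
               for a, b, t in edges)

# Toy instantiated pathway: r1 -> r2 -> r3 with r3 activating r1; r4 is off.
edges = [("r1", "r2", "follows"), ("r2", "r3", "follows"),
         ("r3", "r1", "activates"), ("r3", "r4", "follows")]
state = {"r1": True, "r2": True, "r3": True, "r4": False}
print(longest_on_path(edges, state), has_activation_loop(edges, state))
# ('r1', 'r3', 3) True
```

A self-activating loop, as in the rule reported later in this chapter, is the special case where the activation edge starts and ends at the same reaction.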
We then use the OneR (Weka) algorithm to identify the single best pathway activation
pattern found by Warmr, and then evaluate this pattern on the hold-out data.
Our combined method takes advantage of the beneficial properties of the two algorithms,
by using Warmr to extend the patterns identified by TreeLiker. This effectively switches
the search strategy from the block-wise approach of TreeLiker to the level-wise approach of
Warmr. The aim is to identify relations between reactions within the TreeLiker feature
that TreeLiker itself cannot find, due to its restriction to tree structures. This results in
long cyclical features that neither TreeLiker nor Warmr would be able to find on their own.
While we could extend the features generated by method 1, in this section we also
demonstrate the possibility of using our approach to generate descriptions of subgroups.
We identify a subgroup with the CN2-SD algorithm [90], using the training data. The
activation patterns defining this subgroup are then extended using Warmr. The
following code is an example language bias we use in Warmr:
rmode(1: (r(+S,-A,1),link(A,-\B,follows),link(B,-\C,_),r(S,C,0),
r(S,B,0), link(B,-\D,_),r(S,D,1),link(A,-\E,_),r(S,E,1))).
rmode(1: link(+A,+B,#)).
The first rmode contains the feature that was previously identified using TreeLiker. The
second rmode uses the link literal to allow Warmr to add new links to the TreeLiker
feature. After extending the activation pattern using Warmr, we evaluate it on the
hold-out data.
Table 8.1: Top 5 pathways identified, with mean accuracy across 10 folds of cross-validation
on the training dataset.
8.5 Results
To reiterate, the aim of this work is to build explanatory models that help biologists
understand the system perturbations associated with conditions, in this case lung cancer.
Therefore, although we report the quantitative classification performance of our models in
order to allow performance comparisons, we additionally emphasise the form that the
classification models take and how they are of interest to biologists. Table 8.1 shows the
top 5 pathways found using our TreeLiker/J48 method.
model. The ROC curves of the SBV model and the hexose uptake model are shown in
Figure 8.4. Our hexose uptake model is a decision tree with a single feature (i.e. a decision
stump). The feature corresponds to a chain of three on reactions, where the model predicts
SCC if this feature exists and AC otherwise. This Pathway Activation Pattern is present
in 67 of the 76 individuals with SCC, and 17 of the 74 individuals with AC.
In Figure 8.4a we show an example instantiation of the hexose uptake pathway for a
particular individual. For this individual, the three variables A, B, C in the feature given
above are instantiated to the following reactions:
[Figure: the instantiated reaction labels include 'p107 (RBL1) binds CyclinE/A:CDK2',
'Binding of phospho-p27/p21:Cdk2:Cyclin E/A to the SCF (Skp2):Cks1 complex',
'Ubiquitination of phospho-p27/p21' and 'Degradation of ubiquitinated p27/p21 by the
26S proteasome'.]
Figure 8.5: The pattern found by Warmr instantiated for individual GSM1065725. There
is a self-activating loop, and this is highlighted by the grey box.
actloop(C),longestlen(E,F,G),greaterthan(G,5),link(E,H,follows),r(H,0)
The rule states that a sample is classified as SCC if there is a self-activating loop for a
reaction C, and the longest chain of on reactions runs from reaction E to reaction F and
is at least 6 reactions long. Additionally, following reaction E there is also a reaction H
that is itself not on.
This suggests that one of the differences between SCC and AC is that in the cell cycle
SCC tumours have a self-activating loop, which causes a longer chain of reactions to occur
than in AC tumours. The instantiation of the learnt rule/pattern for a particular individual
is shown in Figure 8.5. In this example there is a chain of 7 on reactions, which also
contains the self-activating loop.
Figure 8.6: The three features in the subgroup description. The solid lines represent the
feature found by TreeLiker; the dotted lines show the Warmr extensions. Green rounded
squares denote on reactions, red octagons off reactions, and blue squares reactions that
may be on or off.
8.6 Conclusions
In this work we have shown the potential of ILP methods for mining the abundance of highly
structured biological data. Using this method we have identified differences in Pathway
Activation Patterns that go beyond the standard analysis of differentially expressed genes,
enrichment analysis, gene feature ranking and pattern mining for common network motifs.
We have also demonstrated the use of logical aggregation with a reaction graph, and
how this simplifies the search for hypotheses to an extent where searching all pathways
is tractable. We have introduced a novel approach that uses Warmr to extend features
initially identified with TreeLiker. This makes it possible to search for long cyclical features.
We have identified pathway activation patterns predictive of the lung cancer type,
in several pathways. The model we built on the hexose uptake pathway has predictive
performance comparable with the top method from a recent challenge, but also provides
biologically relevant explanations for its predictions. Each identified activation pattern is
evaluated on the hold-out data, so the reported results should reflect the expected
performance on new, unseen examples. The pathway activation patterns we have found are
in clinically relevant pathways [167, 40]. Patterns identified using this method may give
diagnostic and clinical insights that biologists can develop into new hypotheses for further
investigation.
Chapter 9

Summary and Future Directions
In this thesis we argue that logic programming should be used to a greater extent than
at present for computational biology tasks. Part 1 of this thesis introduced the domain
of discourse, explaining the background on health research for cancer and psoriasis. We
described how different types of biological data are currently collected and stored. Part
2 synthesised knowledge about Prolog and illustrated how to use Prolog to tackle two
biological data mining tasks, namely subgroup discovery of CpG sites that differ in
breast cancer and of subgroups of microbes that are present in lesional psoriasis samples.
A number of algorithms were implemented, including versions for attribute-value learning
and versions for multi-instance learning. Part 3 emphasised structured data, showing
how Prolog can be used to implement advanced web services to access large, complex
biological datasets on the web. We then showed how ILP techniques allow us to
mine structured biological data, in order to generate rules that give explanations in the
form of 'pathway activation patterns'.
Recalling Chapter 1, we identified a number of current problems in bioinformatics
research. These are important problems due to the need for more efficient and effective
health research. These were:
CHAPTER 9. SUMMARY AND FUTURE DIRECTIONS 162
Throughout this thesis we tackled these points; we now summarise how each point was
addressed by our contributions.
Data sitting in difficult-to-combine silos: This was primarily addressed by the
implementation of Reactome Pengine (Chapter 7), but also in our synthesis of Prolog
knowledge (Chapter 5). We described how logic can be used as the 'glue' for our collective
knowledge: knowledge is related using rules defined by predicates, and new predicates can
combine existing predicates to create flexible rules for querying a knowledgebase. These
predicates can then be extended across the World Wide Web, as we demonstrated by
implementing a Prolog-based API to the human Reactome. This allows federated queries
across multiple data sources.
Data and knowledge are stored separately: This was addressed by both our
Prolog synthesis (Chapter 5) and our description of the implementation of the Reactome
Pengine API (Chapter 7). When a piece of knowledge has been generated (perhaps by a
machine learning algorithm, in the form of a rule), we do not want that classification or
subgroup rule to be presented only in a scientific paper; it should also be provided in a
form that can be deployed on our own or other datasets. We described how Prolog gives
us 'knowledgebases' rather than just 'databases', meaning that rules (representing
knowledge) and data can be stored together. Our implemented Reactome Pengine tool
allows rules to be sent to different datasets alongside a query. This aids reproducible
research and a wider use of what is collectively known, by making it easier to deploy
existing knowledge on existing data.
Data storage and transfer is resource intensive: This was addressed by our
implementation of Reactome Pengine (Chapter 7). Instead of needing to transfer terabytes
of data for analysis, small programs are transferred and run in the cloud, meaning that a
user does not need to download and store large biological datasets on their own machine.
Difficulties choosing the right analysis technique: We addressed this problem in
two ways. First, in the subgroup discovery chapter we identified a number of common
biological data analysis techniques, such as finding differentially methylated regions, that
can be framed as subgroup discovery tasks. Recognising this allowed us to use techniques
developed in the machine learning literature to generalise these tasks and to apply existing
frameworks, by implementing existing algorithms and developing new, purely declarative
pattern mining techniques. Secondly, in both our subgroup discovery work and our ILP
work we argued that we need to take into account the purpose of learning a model from
our data. Do we need to build a classifier that will be used for prediction on unseen data?
Or do we wish to understand why one set of data differs from another? If the latter, the
subgroup discovery task is a better match than the classification task, and what the
features in our rules represent is also important: for example, we showed how pathway
activation patterns give more insight than simple gene features.
Incorrect assumptions in models: This was primarily addressed by our work on
using ILP to find pathway activation patterns (Chapter 8). There we fused biological
pathway data from the Reactome database and gene expression calls (from the Barcode
tool) with example datasets of gene expression data, in order to create models that
take into account some of the biological knowledge about how genes interact. We did not
assume that genes are independent variables. By using ILP methods we were able to find
rules that performed as well as state-of-the-art classification models, while at the same
time offering further explanations and potentially new, interesting information in the form
of 'pathway activation patterns' represented as first-order rules.
Overly complex models: This was again addressed by our work on using ILP and
our work on subgroup discovery. In both of these contributions we used data mining
techniques that resulted in comprehensible rules. These rules describe subgroups of CpG
sites, microbes and patients, and biologists will be able to plan future studies based on the
outputs of these data mining explorations.
domain experts that could inform the model construction process and provide evaluations
of the identified rules.
would provide interesting insights into biological pathways that cross the human/microbe
interface.
available to other users of the service, allowing for an automatic sharing of programs. It
could also act as a new data source for data mining in bioinformatics. Researchers could
investigate applying machine learning algorithms to learn from these submitted programs.
If we say that programs represent some knowledge about a domain rather than just data,
mining such future 'knowledge-sets' could tap into a rich seam of new ideas and concepts.
of composite queries, and they can be read by humans as well as machines. This cannot
be said for deep neural networks.
The algorithm for Prolog execution [15]. In this algorithm, 'Matched' means unification
without the occurs check. The algorithm omits details of how a user can ask for alternative
solutions by forcing backtracking.
APPENDIX A. PROLOG EXECUTION ALGORITHM 169
[1] Bruce Alberts. Molecular biology of the cell. Garland Science, 2017.
[3] David B Allison, Xiangqin Cui, Grier P Page, and Mahyar Sabripour. Microarray
data analysis: from disarray to consolidation and consensus. Nature reviews genetics,
7(1):55, 2006.
[5] Nicos Angelopoulos, Samer Abdallah, and Georgios Giamas. Advances in integrative
statistics for logic programming. International Journal of Approximate Reasoning,
78:103–115, 2016.
[6] Nicos Angelopoulos and Jan Wielemaker. Accessing biological data as Prolog facts.
In Proceedings of the 19th International Symposium on Principles and Practice of
Declarative Programming, pages 29–38. ACM, 2017.
[7] Martin Atzmueller and Frank Puppe. SD-Map–A fast algorithm for exhaustive sub-
group discovery. In European Conference on Principles of Data Mining and Knowl-
edge Discovery, pages 6–17. Springer, 2006.
[8] Martin Atzmüller, Frank Puppe, and Hans-Peter Buscher. Profiling examiners using
intelligent subgroup mining. In Proceedings of the 10th Workshop on Intelligent Data
Analysis in Medicine and Pharmacology (IDAMAP-05), pages 46–51, 2005.
BIBLIOGRAPHY 171
[9] Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature News,
533(7604):452, 2016.
[10] Fernando Baquero and César Nombela. The microbiome as a human organ. Clinical
Microbiology and Infection, 18(s4):2–4, 2012.
[11] Stephen B Baylin, Manel Esteller, Michael R Rountree, Kurtis E Bachman, Kornel
Schuebel, and James G Herman. Aberrant patterns of DNA methylation, chromatin
formation and gene expression in cancer. Human molecular genetics, 10(7):687–692,
2001.
[12] Gordon Bell, Tony Hey, and Alex Szalay. Beyond the data deluge. Science,
323(5919):1297–1298, 2009.
[13] Asa Ben-Hur and William Stafford Noble. Kernel methods for predicting protein–
protein interactions. Bioinformatics, 21(suppl 1):i38–i46, 2005.
[14] Marina Bibikova, Bret Barnes, Chan Tsan, Vincent Ho, Brandy Klotzle, Jennie M Le,
David Delano, Lu Zhang, Gary P Schroth, Kevin L Gunderson, et al. High density
DNA methylation array with single CpG site resolution. Genomics, 98(4):288–295,
2011.
[15] Ivan Bratko. Prolog programming for artificial intelligence. Pearson education, 2001.
[16] Vince Buffalo. Bioinformatics data skills: Reproducible and robust research with open
source tools. O'Reilly Media, 2015.
[17] Mats Carlsson. SICStus Prolog User's Manual. Swedish Institute of Computer
Science, 2016.
[18] Matias Casás-Selves and James DeGregori. How cancer shapes evolution and how
evolution shapes cancer. Evolution: Education and outreach, 4(4):624–634, 2011.
[19] Jadzia Cendrowska. Prism: An algorithm for inducing modular rules. International
Journal of Man-Machine Studies, 27(4):349–370, 1987.
[21] Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine learning,
3(4):261–283, 1989.
[23] AL Cogen, V Nizet, and RL Gallo. Skin microbiota: a source of disease or defence?
British Journal of Dermatology, 158(3):442–455, 2008.
[26] Charles E Cook, Mary Todd Bergman, Robert D Finn, Guy Cochrane, Ewan Birney,
and Rolf Apweiler. The European Bioinformatics Institute in 2016: data growth and
integration. Nucleic acids research, 44(D1):D20–D26, 2015.
[27] Vítor Santos Costa. The life of a logic programming system. In International Con-
ference on Logic Programming, pages 1–6. Springer, 2008.
[28] Elizabeth K Costello, Christian L Lauber, Micah Hamady, Noah Fierer, Jeffrey I
Gordon, and Rob Knight. Bacterial community variation in human body habitats
across space and time. Science, 326(5960):1694–1697, 2009.
[29] Francis Crick. Central dogma of molecular biology. Nature, 227(5258):561, 1970.
[31] Andrew Cropper and Stephen H Muggleton. Learning efficient logical robot strategies
involving composable objects. In IJCAI, pages 3423–3429, 2015.
[33] Elodie M Da Costa, Gabrielle McInnes, Annie Beaudry, and Noël J-M Raynal. DNA
methylation–targeted drugs. The Cancer Journal, 23(5):270–276, 2017.
[34] Shuo Dai, Yong Zhang, Limin Jia, and Yong Qin. A subgroup discovery algorithm
based on genetic fuzzy systems. In Proceedings of the 2015 Chinese Intelligent Au-
tomation Conference, pages 171–177. Springer, 2015.
[35] Charles Darwin and William F Bynum. The origin of species by means of natural
selection: or, the preservation of favored races in the struggle for life. Penguin, 2009.
[36] Richard Dawkins. The Selfish Gene. Oxford University Press, USA, 1976.
[37] Richard Dawkins. Climbing mount improbable. WW Norton & Company, 1997.
[38] Luc De Raedt. Logical and relational learning. Springer Science & Business Media,
2008.
[39] Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. Problog: A probabilistic
prolog and its application in link discovery. 2007.
[41] Luc Dehaspe and Luc De Raedt. Mining association rules in multiple relations. In
Nada Lavrač and Sašo Džeroski, editors, Inductive Logic Programming, number 1297
in Lecture Notes in Computer Science, pages 125–132. Springer Berlin Heidelberg,
January 1997.
[42] Guoqing Diao and Anand N Vidyashankar. Assessing genome-wide statistical signif-
icance for large p small n problems. Genetics, 194(3):781–783, 2013.
[43] Theodosius Dobzhansky. Nothing in biology makes sense except in the light of evo-
lution. The american biology teacher, 75(2):87–91, 2013.
[44] Sorin Drăghici. Statistics and data analysis for microarrays using R and bioconductor.
CRC Press, 2011.
[45] David J Duggan, Michael Bittner, Yidong Chen, Paul Meltzer, and Jeffrey M Trent.
Expression profiling using cDNA microarrays. Nature genetics, 21(1s):10, 1999.
[46] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene expression omnibus: NCBI
gene expression and hybridization array data repository. Nucleic acids research,
30(1):207–210, 2002.
[48] Thomas Fleischer, Arnoldo Frigessi, Kevin C Johnson, Hege Edvardsen, Nizar
Touleimat, Jovana Klajic, Margit LH Riis, Vilde D Haakensen, Fredrik Wärnberg,
Bjørn Naume, et al. Genome-wide dna methylation profiles in progression to in situ
and invasive carcinoma of the breast with impact on gene transcription and prognosis.
Genome biology, 15(8):435, 2014.
[49] Manoel VM França, Gerson Zaverucha, and Artur S d'Avila Garcez. Fast relational
learning using bottom clause propositionalization with artificial neural networks.
Machine Learning, 94(1):81–104, 2014.
[50] Thom Frühwirth. Constraint handling rules. In Constraint programming: Basics and
trends, pages 90–107. Springer, 1995.
[51] Jonathan C Fuller, Pierre Khoueiry, Holger Dinkel, Kristoffer Forslund, Alexan-
dros Stamatakis, Joseph Barry, Aidan Budd, Theodoros G Soldatos, Katja Linssen,
and Abdul Mateen Rajput. Biggest challenges in bioinformatics. EMBO reports,
14(4):302–304, 2013.
[52] Johannes Fürnkranz and Peter A Flach. ROC 'n' rule learning: towards a better
understanding of covering algorithms. Machine Learning, 58(1):39–77, 2005.
[53] Johannes Fürnkranz, Dragan Gamberger, and Nada Lavrač. Foundations of rule
learning. Springer Science & Business Media, 2012.
[54] Dragan Gamberger and Nada Lavrač. Expert-guided subgroup discovery: Method-
ology and application. Journal of Artificial Intelligence Research, 17:501–527, 2002.
[55] Dragan Gamberger, Nada Lavrač, Filip Železný, and Jakub Tolar. Induction of
comprehensible models for gene expression datasets by subgroup discovery methodology.
Journal of Biomedical Informatics, 37(4):269–284, 2004.
[56] Zhan Gao, Chi-hong Tseng, Bruce E Strober, Zhiheng Pei, and Martin J Blaser.
Substantial alterations of the cutaneous bacterial biota in psoriatic lesions. PloS
one, 3(7):e2719, 2008.
[57] Craig Gentry. A fully homomorphic encryption scheme. Stanford University, 2009.
[58] Dirk Gevers, Subra Kugathasan, Lee A Denson, Yoshiki Vázquez-Baeza, Will
Van Treuren, Boyu Ren, Emma Schwager, Dan Knights, Se Jin Song, Moran Yassour,
et al. The treatment-naive microbiome in new-onset Crohn's disease. Cell Host &
Microbe, 15(3):382–392, 2014.
[59] Joseph C Giarratano and Gary Riley. Expert systems: principles and programming.
Brooks/Cole Publishing Co., 1989.
[61] Patrick Goymer. Natural selection: The evolution of cancer. Nature News,
454(7208):1046–1048, 2008.
[62] Michael W Gray, Gertraud Burger, and B Franz Lang. The origin and early evolution
of mitochondria. Genome biology, 2(6):reviews1018–1, 2001.
[63] Casey S Greene, Jie Tan, Matthew Ung, Jason H Moore, and Chao Cheng. Big data
bioinformatics. Journal of cellular physiology, 229(12):1896–1900, 2014.
[64] Elizabeth A Grice and Julia A Segre. The skin microbiome. Nature Reviews Micro-
biology, 9(4):244, 2011.
[65] Anna Hart and Jeremy Wyatt. Evaluating black-boxes as medical decision aids:
issues arising from a study of neural networks. Medical informatics, 15(3):229–236,
1990.
[66] Stephen S Hecht. Cigarette smoking and lung cancer: chemical mechanisms and
approaches to prevention. The lancet oncology, 3(8):461–469, 2002.
[68] Francisco Herrera, Cristóbal José Carmona, Pedro González, and María José Del
Jesus. An overview on subgroup discovery: foundations and applications. Knowledge
and Information Systems, 29(3):495–525, 2011.
[69] Matěj Holec, Jiří Kléma, Filip Železný, and Jakub Tolar. Comparative evaluation
of set-level techniques in predictive classification of gene expression samples. BMC
Bioinformatics, 13(Suppl 10):S15, June 2012.
[70] Matěj Holec, Filip Železný, Jiří Kléma, Jiří Svoboda, and Jakub Tolar. Using bio-
pathways in relational learning. Inductive Logic Programming, page 50, 2008.
[72] Laura Hoopes. Genetic diagnosis: DNA microarrays and cancer. Nature Education,
1(3), 2008.
[73] Curtis Huttenhower, Dirk Gevers, Rob Knight, Sahar Abubucker, Jonathan H Bad-
ger, Asif T Chinwalla, Heather H Creasy, Ashlee M Earl, Michael G FitzGerald,
Robert S Fulton, et al. Structure, function and diversity of the healthy human mi-
crobiome. Nature, 486(7402):207, 2012.
[74] Julian Huxley. Evolution the modern synthesis. George Allen and Unwin, 1942.
[75] Andrew E Jaffe, Peter Murakami, Hwajin Lee, Jeffrey T Leek, M Daniele Fallin,
Andrew P Feinberg, and Rafael A Irizarry. Bump hunting to identify differentially
methylated regions in epigenetic epidemiology studies. International journal of epi-
demiology, 41(1):200–209, 2012.
[76] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto encyclopedia of genes and
genomes. Nucleic Acids Research, 28(1):27–30, 2000.
[77] Branko Kavšek and Nada Lavrač. APRIORI-SD: Adapting association rule learning
to subgroup discovery. Applied Artificial Intelligence, 20(7):543–583, 2006.
[78] Someswar Kesh and Wullianallur Raghupathi. Critical issues in bioinformatics and
computing. Perspectives in Health Information Management, 1, 2004.
[79] Wooyoung Kim, Min Li, Jianxin Wang, and Yi Pan. Biological network motif detec-
tion and evaluation. BMC Systems Biology, 5(Suppl 3):S5, December 2011.
BIBLIOGRAPHY 177
[81] Willi Klösgen and Michael May. Spatial subgroup mining integrated in an object-
relational spatial database. In European Conference on Principles of Data Mining
and Knowledge Discovery, pages 275–286. Springer, 2002.
[82] Michael R Kosorok, Shuangge Ma, et al. Marginal asymptotics for the large p,
small n paradigm: with applications to microarray data. The Annals of Statistics,
35(4):1456–1486, 2007.
[84] Robert Kowalski and Donald Kuehner. Linear resolution with selection function. In
Automation of Reasoning, pages 542–577. Springer, 1983.
[85] Petra Kralj, Nada Lavrač, Dragan Gamberger, and Antonija Krstačić. Supporting
factors to improve the explanatory potential of contrast set mining: Analyzing brain
ischaemia data. In 11th Mediterranean Conference on Medical and Biomedical Engineering
and Computing 2007, pages 157–161. Springer, 2007.
[86] Martin Krzywinski, Inanc Birol, Steven JM Jones, and Marco A Marra. Hive plots:
rational approach to visualizing networks. Briefings in Bioinformatics, 13(5):627–
644, 2011.
[87] Ondřej Kuželka and Filip Železný. Block-wise construction of tree-like relational
features with monotone reducibility and redundancy. Machine Learning, 83(2):163–
192, 2011.
[88] Torbjörn Lager and Jan Wielemaker. Pengines: Web Logic Programming Made
Easy. Theory and Practice of Logic Programming, 14(4-5):539–552, July 2014.
[89] Laura Langohr, Vid Podpečan, Marko Petek, Igor Mozetič, Kristina Gruden, Nada
Lavrač, and Hannu Toivonen. Contrasting subgroup discovery. The Computer Jour-
nal, 56(3):289–303, 2012.
[90] Nada Lavrač, Branko Kavšek, Peter A Flach, and Ljupčo Todorovski. Subgroup
discovery with CN2-SD. Journal of Machine Learning Research, 5(Feb):153–188,
2004.
[91] Nada Lavrač and Anže Vavpetič. Relational and semantic data mining. In Logic
Programming and Nonmonotonic Reasoning, pages 20–31. Springer, 2015.
[92] Nada Lavrač, Filip Železný, and Peter A Flach. RSD: Relational subgroup discovery
through first-order feature construction. In International Conference on Inductive
Logic Programming, pages 149–165. Springer, 2002.
[94] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B Tenenbaum, and Stephen H
Muggleton. Bias reformulation for one-shot function induction. In Proceedings of the
21st European Conference on Artificial Intelligence (ECAI), 2014.
[95] José Marı́a Luna, José Raúl Romero, Cristóbal Romero, and Sebastián Ventura.
On the use of genetic programming for mining comprehensible rules in subgroup
discovery. IEEE Transactions on Cybernetics, 44(12):2329–2341, 2014.
[96] Chaysavanh Manichanh, Lionel Rigottier-Gois, Elian Bonnaud, Karine Gloux, Eric
Pelletier, Lionel Frangeul, Renaud Nalin, Cyrille Jarrin, Patrick Chardon, Philippe
Marteau, et al. Reduced diversity of faecal microbiota in Crohn's disease revealed by
a metagenomic approach. Gut, 55(2):205–211, 2006.
[97] Victor M Markowitz, I-Min A Chen, Krishna Palaniappan, Ken Chu, Ernest Szeto,
Yuri Grechkin, Anna Ratner, Biju Jacob, Jinghua Huang, Peter Williams, et al. IMG:
the integrated microbial genomes database and comparative analysis system. Nucleic
Acids Research, 40(D1):D115–D122, 2011.
[98] Vivien Marx. Biology: The big challenges of big data. Nature, 498(7453):255–260, 2013.
[101] Ryszard S Michalski. On the quasi-minimal solution of the general covering problem.
Proceedings of the International Symposium on Information Processing, 1969.
[103] Brad L Miller, David E Goldberg, et al. Genetic algorithms, tournament selection,
and the effects of noise. Complex Systems, 9(3):193–212, 1995.
[104] Xochitl C Morgan, Timothy L Tickle, Harry Sokol, Dirk Gevers, Kathryn L Devaney,
Doyle V Ward, Joshua A Reyes, Samir A Shah, Neal LeLeiko, Scott B Snapper,
et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and
treatment. Genome Biology, 13(9):R79, 2012.
[105] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara
Wold. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature
Methods, 5(7):621, 2008.
[106] Stephen Muggleton and Cao Feng. Efficient induction of logic programs. In Proceedings
of the First Conference on Algorithmic Learning Theory, Tokyo, 1990.
[108] Chris Mungall. Experiences Using Logic Programming in Bioinformatics, pages 1–21.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[109] Rajan P Nair, Kristina Callis Duffin, Cynthia Helms, Jun Ding, Philip E Stuart,
David Goldgar, Johann E Gudjonsson, Yun Li, Trilokraj Tejasvi, Bing-Jian Feng,
et al. Genome-wide scan reveals association of psoriasis with IL-23 and NF-κB
pathways. Nature Genetics, 41(2):199, 2009.
[110] Samuel R. Neaves et al. Using ILP to Identify Pathway Activation Patterns in
Systems Biology, pages 137–151. Springer International Publishing, 2016.
[111] Samuel R Neaves, Sophia Tsoka, and Louise AC Millard. Reactome Pengine: A
web-logic API to the Homo sapiens Reactome. Bioinformatics, 34(16), 2018.
[112] Frank O. Nestle, Daniel H. Kaplan, and Jonathan Barker. Psoriasis. New England
Journal of Medicine, 361(5):496–509, 2009. PMID: 19641206.
[113] Ulrich Neumerkel and Stefan Kral. Indexing dif/2. arXiv preprint arXiv:1607.01590,
2016.
[114] Ulrich Neumerkel and Fred Mesnard. Localizing and explaining reasons for non-
terminating logic programs with failure-slices. In International Conference on Prin-
ciples and Practice of Declarative Programming, pages 328–341. Springer, 1999.
[115] Ulrich Neumerkel, Markus Triska, and Jan Wielemaker. Declarative language ex-
tensions for prolog courses. In Proceedings of the 2008 international workshop on
Functional and declarative programming in education, pages 73–78. ACM, 2008.
[116] F Niyonsaba, A Suzuki, H Ushio, I Nagaoka, H Ogawa, and K Okumura. The hu-
man antimicrobial peptide dermcidin activates normal human keratinocytes. British
Journal of Dermatology, 160(2):243–249, 2009.
[117] Petra Kralj Novak, Nada Lavrač, and Geoffrey I Webb. Supervised descriptive rule
discovery: A unifying survey of contrast set, emerging pattern and subgroup mining.
Journal of Machine Learning Research, 10(Feb):377–403, 2009.
[118] Richard A O’Keefe. The Craft of Prolog. MIT Press, Cambridge, MA, 1990.
[120] Rafael S Parpinelli, Heitor S Lopes, and Alex Alves Freitas. Data mining with an
ant colony optimization algorithm. IEEE Transactions on Evolutionary Computation,
6(4):321–332, 2002.
[121] William R Pearson and David J Lipman. Improved tools for biological sequence
comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.
[122] Sérgio Pereira, Adriano Pinto, Victor Alves, and Carlos A Silva. Brain tumor
segmentation using convolutional neural networks in MRI images. IEEE Transactions on
Medical Imaging, 35(5):1240–1251, 2016.
[123] Claudia Perlich and Foster Provost. Distribution-based aggregation for relational
learning with identifier attributes. Machine Learning, 62(1-2):65–105, 2006.
[124] Bahareh Rabbani, Hirofumi Nakaoka, Shahin Akhondzadeh, Mustafa Tekin, and
Nejat Mahdieh. Next generation sequencing: implications in personalized medicine
and pharmacogenomics. Molecular BioSystems, 12(6):1818–1830, 2016.
[125] Tara D Rachakonda, Clayton W Schupp, and April W Armstrong. Psoriasis preva-
lence among adults in the united states. Journal of the American Academy of Der-
matology, 70(3):512–516, 2014.
[126] David A Relman. The human microbiome: ecosystem resilience and health. Nutrition
Reviews, 70(s1), 2012.
[127] Kahn Rhrissorrakrai, J. Jeremy Rice, Stephanie Boue, Marja Talikka, Erhan Bilal,
Florian Martin, Pablo Meyer, Raquel Norel, Yang Xiang, Gustavo Stolovitzky, Ju-
lia Hoeng, and Manuel C. Peitsch. SBV Improver Diagnostic Signature Challenge:
Design and results. Systems Biomedicine, 1(4):3–14, September 2013.
[128] Petar Ristoski. Towards linked open data enabled data mining. In The Semantic
Web. Latest Advances and New Domains, pages 772–782. Springer, 2015.
[130] John Alan Robinson. A machine-oriented logic based on the resolution principle.
Journal of the ACM (JACM), 12(1):23–41, 1965.
[131] Taisuke Sato and Yoshitaka Kameya. PRISM: a language for symbolic-statistical
modeling. In IJCAI, volume 97, pages 1330–1339, 1997.
[132] Hashem A Shihab, Mark F Rogers, Julian Gough, Matthew Mort, David N Cooper,
Ian NM Day, Tom R Gaunt, and Colin Campbell. An integrative approach to pre-
dicting the functional effects of non-coding and coding sequence variation. Bioinfor-
matics, 31(10):1536–1543, 2015.
[134] David B Skalak. Prototype and feature selection by sampling and random mutation
hill climbing algorithms. In Machine Learning Proceedings 1994, pages 293–301.
Elsevier, 1994.
[135] Peter B Snow, Deborah S Smith, and William J Catalona. Artificial neural networks
in the diagnosis and prognosis of prostate cancer: a pilot study. The Journal of
Urology, 152(5):1923–1926, 1994.
[136] Tamar Sofer, Elizabeth D Schifano, Jane A Hoppin, Lifang Hou, and Andrea A
Baccarelli. A-clustering: a novel method for the detection of co-regulated methylation
regions, and regions associated with exposure. Bioinformatics, 29(22):2884–2891,
2013.
[137] Ashwin Srinivasan. The Aleph Manual, 2001. URL:
http://www.comlab.ox.ac.uk/activities/machinelearn/Aleph/aleph.html.
[138] Julius Stecher, Frederik Janssen, and Johannes Fürnkranz. Shorter rules are better,
aren't they? In International Conference on Discovery Science, pages 279–294.
Springer, 2016.
[139] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Ben-
jamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R.
Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A
knowledge-based approach for interpreting genome-wide expression profiles. Pro-
ceedings of the National Academy of Sciences, 102(43):15545–15550, October 2005.
[140] Adi L Tarca, Nandor Gabor Than, and Roberto Romero. Methodological approach
from the best overall team in the SBV Improver Diagnostic Signature Challenge.
Systems Biomedicine, 1(4):217–227, 2013.
[141] Nizar Touleimat and Jörg Tost. Complete pipeline for Infinium Human Methylation
450K BeadChip data processing using subset quantile normalization for accurate DNA
methylation estimation. Epigenomics, 4(3):325–341, 2012.
[142] Igor Trajkovski, Nada Lavrač, and Jakub Tolar. SEGS: Search for enriched gene sets
in microarray data. Journal of Biomedical Informatics, 41(4):588–601, 2008.
[144] Markus Triska. The finite domain constraint solver of SWI-Prolog. In International
Symposium on Functional and Logic Programming, pages 307–316. Springer, 2012.
[145] Markus Triska. The Boolean constraint solver of SWI-Prolog: System description.
In FLOPS, volume 9613 of LNCS, pages 45–61, 2016.
[148] John J Tyson, Katherine C Chen, and Bela Novak. Sniffers, buzzers, toggles and
blinkers: dynamics of regulatory and signaling pathways in the cell. Current opinion
in cell biology, 15(2):221–231, 2003.
[149] UCSC contributors. Frequently asked questions: Data file formats. https://genome.
ucsc.edu/FAQ/FAQformat.html#format13, 2018. [Online; accessed 20-May-2018].
[150] Anita Valmarska, Nada Lavrač, Johannes Fürnkranz, and Marko Robnik-Šikonja.
Refinement and selection heuristics in subgroup discovery and classification rule
learning. Expert Systems with Applications, 81:147–162, 2017.
[151] J Craig Venter, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural,
Granger G Sutton, Hamilton O Smith, Mark Yandell, Cheryl A Evans, Robert A
Holt, et al. The sequence of the human genome. Science, 291(5507):1304–1351,
2001.
[152] Peter M Visscher, Matthew A Brown, Mark I McCarthy, and Jian Yang. Five years
of GWAS discovery. The American Journal of Human Genetics, 90(1):7–24, 2012.
[153] Haizhou Wang and Mingzhou Song. Ckmeans.1d.dp: optimal k-means clustering
in one dimension by dynamic programming. The R Journal, 3(2):29, 2011.
[154] Rui-Sheng Wang, Assieh Saadatpour, and Réka Albert. Boolean modeling in systems
biology: an overview of methodology and applications. Physical Biology, 9(5):055001,
October 2012.
[155] James D Watson, Francis HC Crick, et al. Molecular structure of nucleic acids.
Nature, 171(4356):737–738, 1953.
[156] Tyler Weirick, Giuseppe Militello, Yuliya Ponomareva, David John, Claudia Döring,
Stefanie Dimmeler, and Shizuka Uchida. Logic programming to infer complex RNA
[157] Ken Whelan, Oliver Ray, and Ross D King. Representation, simulation, and hypoth-
esis generation in graph and logical models of biological networks. In Yeast Systems
Biology, pages 465–482. Springer, 2011.
[158] Darrell Whitley. A genetic algorithm tutorial. Statistics and computing, 4(2):65–85,
1994.
[160] Jan Wielemaker et al. SWI-Prolog. Theory and Practice of Logic Programming,
12(1-2):67–96, January 2012.
[161] Jan Wielemaker, Torbjörn Lager, and Fabrizio Riguzzi. SWISH: SWI-Prolog for
sharing. CoRR, abs/1511.00915, 2015.
[165] PCY Woo, SKP Lau, JLL Teng, H Tse, and K-Y Yuen. Then and now: use of 16S
rDNA gene sequencing for bacterial identification and discovery of novel bacteria in
clinical microbiology laboratories. Clinical Microbiology and Infection, 14(10):908–
934, 2008.
[167] Rongrong Wu, Lorena Galan-Acosta, and Erik Norberg. Glucose metabolism provide
distinct prosurvival benefits to non-small cell lung carcinomas. Biochemical and
Biophysical Research Communications, 460(3):572–577, 2015.
[168] Shi Ying, Dan-Ning Zeng, Liang Chi, Yuan Tan, Carlos Galzote, Cesar Cardona,
Simon Lax, Jack Gilbert, and Zhe-Xue Quan. The influence of age and gender on
skin-associated microbial communities in urban and rural human populations. PLoS
ONE, 10(10):e0141842, 2015.
[169] Lei Zhang, Linlin Wang, Bochuan Du, Tianjiao Wang, Pu Tian, and Suyan Tian.
Classification of non-small cell lung cancer using significance analysis of microarray-
gene set reduction algorithm. BioMed Research International, 2016, 2016.