
Big Data in Healthcare

Hubbi Nashrullah Muhammad


Outline

● What is Big Data?
● Big Biomedical and Health Data
● Big Data in Drug Discovery
● Big Data in Systems Pharmacology
● Role of Big Data in Clinical Research
What is Big Data?
From traffic patterns and shopping habits to web history and medical
records, data is recorded at a massive scale every second, then stored and
analysed to enable the technology and services that we use every day.

Big Data ➜ datasets so large and complex that they are difficult to process
using traditional methods and on-hand database management tools
So, what exactly is Big Data?

The 4 Vs of Big Data
● Volume: Datasets are huge
● Velocity: New data is generated every second
● Variety: Many different types of data are available
● Veracity: Not all data are useful, some are useless
Source: IBM Big Data Hub
Big Biomedical and Health Data
Volume of Biomedical and Health Data

● ProteomicsDB: 92% of all human genes; 5.17 TB in size as of 2016
● Electronic Health Records (EHR): >80% of all hospitals in the US
● Medical imaging: MRI, CT, X-Ray, etc.; vast amounts of complex data
Velocity of Biomedical and Health Data

● So much data is generated every day
● Next generation technology
● Sequencing tech produces billions of DNA sequence data per day
● Provides up-to-date data for fast response in emergency: e.g. pandemics
● Computer technology (parallel computing, distributed computing, etc.) allows large scale data processing and analysis
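
The chunked, parallel style of processing mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the reads, the `gc_count` and `chunked` helpers and the chunk size are all hypothetical, and a thread pool from Python's standard library stands in for a real parallel or distributed framework.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(reads):
    """Return (number of G/C bases, total bases) for one chunk of reads."""
    gc = sum(read.count("G") + read.count("C") for read in reads)
    total = sum(len(read) for read in reads)
    return gc, total


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_gc_fraction(reads, chunk_size=2, workers=4):
    """Map gc_count over the chunks in parallel, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(gc_count, chunked(reads, chunk_size)))
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0


# Toy reads, invented for illustration only.
reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTAA"]
gc_fraction = parallel_gc_fraction(reads)
```

The same map-then-reduce shape carries over to larger settings: each worker summarises its own chunk, and only the small partial results are combined at the end.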
Variety of Biomedical and Health Data

Variety of data types and structures

E.g. “Omics” data provides systematic data at all levels
● Genomics
● Transcriptomics
● Proteomics
● Metabolomics
● Phenomics

Unstructured data
● Doctors’ notes from EHRs
● Clinical trial results
● Lab results
● Medical images
What is Big Data?
How does it compare to other types of data?

Types of Data / Examples | Quantity / Quality of Data | Inference Method
Interventional: Randomized Clinical Trials | Small / Generally Excellent | Analysis of Variance
Observational: Registries, Surveys, Passive Mobile Data | Medium / Very Variable | Regression
Unstructured: Social media, research articles | Large / Often challenging to analyze | Artificial Intelligence, Natural Language Processing
Big Data: Combining Multiple Types and/or Data Sources | Large / Often difficult to combine | Multi-dimensional Analytics and Visualization tools
Where does all this data come from?

Real World and Research Data
● Observational Studies
● Clinical Trials
● Molecular Databases
● Electronic Health Records
● Billing data, insurance claims
● Doctors’ notes, lab results, discussions
Big Data in Drug Discovery
Drug Discovery in the Age of Big Data

● Virtual Screening
● High-Throughput Screening
● Cellular Binding Assays
● Protein Binding Assays
● Cheminformatics Databases
● Biological Efficacy
● Physicochemical Information
● Toxicity
Open Access Drug Databases

● BindingDB: Binding affinities of small, drug-like molecules to protein targets.
● Therapeutic Target Database: Known therapeutic protein targets with pathway information and corresponding drugs.
● ChEMBL: Manually curated bioactive molecules with drug-like properties.
● SuperTarget: ~7300 drug-target associations with ~5000 manually annotated
● DrugBank: FDA-approved and experimental drugs with drug target, bio- and chemoinformatic data.
● PharmGKB: Pharmacogenomic-focused genetic, molecular, cellular, and clinical data for drugs
● ZINC: ~21 million compounds that are commercially available and prepared for virtual screening.
● ChemBank: Biomedical measurements derived from cell lines treated with small molecules.
Open Access Databases on Protein-Protein/-Gene/-Other Interactions

● BioGRID: ~730,000 raw protein and genetic interactions from major model organisms.
● MINT: Molecular interaction database focusing on experimentally validated protein–protein interactions.
● DGIdb: Drug gene interaction database curated from multiple well-established databases.
● Database of Interacting Proteins: Manual and computational curation of experimentally determined protein–protein interactions.
● ExPASy STRING: Known and predicted protein–protein interactions from experimental repositories and computational methods.
● MatrixDB: Interactions between extracellular proteins (e.g., collagen and laminins) and polysaccharides.
Open Access Genomics Databases

● Gene Expression Omnibus: Public functional array- and sequence-based genomics data repository.
● UCSC Cancer Genome Browser: Interactive annotated cancer genome-browser website hosted by the University of California, Santa Cruz.
● Oncomine: Cancer microarray database that can be subdivided by treatment, patient survival and other demographics.
● GDOC: Broad collection of bioinformatics and systems biology tools for analysis and visualization of four major ‘omics’ types: DNA, mRNA, microRNA and metabolites.
● The Cancer Genome Atlas: Large-scale genome sequencing platform for multiple cancers led by the NCI and the NHGRI.
Open Access Proteomics Databases

● dbDEPC: Database of differentially expressed proteins in human cancer.
● PRIDE: Centralized standards-compliant mass spectrometry proteomics and post-translational modifications
● GeMDBJ Proteomics: Clinical and cell line protein LC-MS/MS and 2D-difference gel electrophoresis for expression levels.
● The Human Protein Atlas: Immunohistochemistry-based protein expression profiles of various human tissues, cancers and cell lines
● Plasma Protein Database: Initiative of the Human Proteome Organization to characterize human plasma and serum proteome.
● UniProt: Comprehensive protein sequence and annotation data
Open Access Metabolomics Databases

● BiGG: Genomic-based reconstruction of human metabolism for systems biology simulation and flux modeling
● HumanCyc: Human metabolic pathway/genome bioinformatics database constituting over 28,000 genes
● HMDB: Human small molecule metabolites with associated chemical, clinical and molecular biology information
● SMPDB: Small molecule pathway database with >400 unique human pathways not found in other databases
Big Data provides new insights for drug design

New patterns and associations are being discovered by mining this data
● New observations: e.g., identifying new compounds with a particular effect even though their structure is very different from that of existing drugs
● New therapeutic targets: e.g., affecting just one protein/target is not enough to cure a disease
● New drug-target associations ➜ New drug repurposing hypotheses
Big Data in Systems Pharmacology
Big Data Genomics and Oncology

Tumors ➜ Heterogeneous
● Different individual ➜ different tumor cells
● Same individual ➜ different tumor tissue ➜ different cells
● Same individual ➜ same tumor tissue ➜ different cells!

Undergo further changes when stressed (hypoxia, exposed to drugs) ➜ Resistance to drugs arises!


Big Data Genomics and Oncology

Use Big Data (e.g. The Cancer Genome Atlas) to analyse differential expression patterns across different...
● Cell lines
● Disease progression
● Treatment regimens

➜ New insight on resistance mechanisms and pharmacological targets!

E.g. Characterising acquired resistance to vemurafenib in melanomas.
Identified two core resistance mechanisms:
● MAPK pathway reactivation
● PI3K-PTEN-AKT activation
➜ New combinatorial therapy strategies!

Shi H, Hugo W, Kong X. Cancer Discov 2014.
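
As a rough illustration of what "analysing differential expression patterns" involves at its simplest, the sketch below computes a log2 fold change between two groups of samples. Everything in it is an assumption made for illustration: the gene names are placeholders, the expression values are invented, and the pseudocount and the threshold of 1.0 are arbitrary choices. It is not the method used in the cited study, and a real analysis would add normalisation and statistical testing.

```python
import math
from statistics import mean


def log2_fold_change(resistant, sensitive, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount avoids taking log of zero."""
    return math.log2((mean(resistant) + pseudocount) / (mean(sensitive) + pseudocount))


# Hypothetical expression values (arbitrary units) for two placeholder genes.
expression = {
    "GENE_A": {"resistant": [7.0, 7.0, 7.0], "sensitive": [1.0, 1.0, 1.0]},
    "GENE_B": {"resistant": [3.0, 3.0, 3.0], "sensitive": [3.0, 3.0, 3.0]},
}

fold_changes = {
    gene: log2_fold_change(values["resistant"], values["sensitive"])
    for gene, values in expression.items()
}

# Flag genes whose mean expression at least doubles or halves between groups.
changed = [gene for gene, lfc in fold_changes.items() if abs(lfc) >= 1.0]
```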
Role of Big Data in Clinical Research
Main types of clinical research
● Prospective study
● Retrospective study
● Translational research
● Evidence-based study

RCTs: “Gold standard” for clinical research
Using randomisation techniques, all known and unknown confounding factors
are distributed evenly among groups.

Participants divided into groups ➜ Different groups receive different interventions ➜ Results from each group compared

LIMITATIONS!
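
The randomisation step described above can be sketched with permuted-block randomisation, one common way to keep group sizes balanced while leaving each assignment unpredictable. This is a minimal sketch, not a trial-grade allocation system: the function name `block_randomise`, the arm labels, the block size and the participant IDs are all hypothetical.

```python
import random


def block_randomise(participant_ids, arms=("treatment", "control"), block_size=4, seed=None):
    """Assign participants to arms using permuted blocks so group sizes stay balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    allocation = {}
    block = []
    for pid in participant_ids:
        if not block:
            # Start a new block holding equal numbers of each arm, in random order.
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
        allocation[pid] = block.pop()
    return allocation


participants = [f"P{i:02d}" for i in range(1, 9)]  # hypothetical participant IDs
allocation = block_randomise(participants, seed=2024)
```

With eight participants and blocks of four, each arm receives exactly four participants regardless of the seed; only the order of assignment changes.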
E.g. RCT evaluating efficacy of chemotherapy drug

Set primary endpoint: Overall Survival (OS) ➜ Calculate sample size ➜ Recruit participants ➜ Randomise into groups ➜ Baseline ➜ Intervention: different therapies, control ➜ Measurement(s) ➜ Analyse

● Second- or third-line therapies, dynamic processes, many other factors
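
For the "calculate sample size" step with a survival endpoint such as OS, one widely used approximation is Schoenfeld's formula for the number of events needed by a two-sided log-rank test with 1:1 allocation. The sketch below is offered under stated assumptions: the slides do not say which method is used, and the hazard ratio of 0.7, the 5% significance level, 80% power and the 60% expected event fraction are hypothetical inputs.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld approximation: events needed for a two-sided log-rank test, 1:1 allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2)


def required_participants(events, expected_event_fraction):
    """Scale the event count up by the fraction of participants expected to have the event."""
    return math.ceil(events / expected_event_fraction)


events = required_events(hazard_ratio=0.7)  # hypothetical target effect
participants_needed = required_participants(events, expected_event_fraction=0.6)
```

Under these inputs the calculation calls for 247 events, or 412 participants if 60% of them are expected to reach the endpoint during follow-up.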

Commonly neglected problem: statistical analysis is typically based on the different effectiveness
of different interventions provided at baseline.

It is taken for granted that the intervention provided at baseline is the only variable factor to be
considered, and that any subsequent interventions (second- or third-line therapies) will not affect
the outcome ➜ this is usually not the case.
Current statistical analyses focus on relationship between baseline intervention and specific outcome
Such point-to-point relationship may not provide the whole picture

Intervention ➜ Dynamic processes, large amounts of other data produced ➜ Endpoint/outcome
Real World and Research Data
● Observational Studies
● Clinical Trials
● Molecular Databases
● Electronic Health Records
● Billing data, insurance claims
● Doctors’ notes, lab results, discussions

Low generalisability ➜ High generalisability
More rigorous ➜ Less rigorous

Big Data
● Not a complete solution ➜ provide new insights and complement RCTs
● Retains real-world features, reflect real-world problems

Big Data in clinical research
● Use of data stored in electronic databases collected from daily routine clinical practice
● Without modification or screening with strict inclusion and exclusion criteria

● EHRs: demographics; laboratory results; procedures, surgery and clinical outcomes
● Billing data and health insurance claims
● Registries for chronic and infectious diseases
Big Data augments RCTs
● Presents resources and methodologies that can supplement RCTs in clinical research
● Addresses the limitations of RCTs

Keys to utilizing Big Data in clinical research
● Data sharing
● Government support (regulations, guidelines)
● Multidisciplinary and multi-centre collaboration

Aggregating data from multiple institutions has many advantages
More patients ➜ more data
● Increase range of confounding factors that can be studied in a trial
● Aggregation of multiple data types and elements
● Increased ability to detect and study rare events in disease populations
● Overcomes problem of loss to follow-up
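
A back-of-the-envelope calculation shows why pooling patients helps with rare events. The sketch assumes, purely for illustration, that patients are independent and share the same event rate; the rate of 1 in 1,000 and the centre size of 500 patients are hypothetical numbers.

```python
def prob_at_least_one_event(event_rate, n_patients):
    """Probability of observing at least one event among n independent patients."""
    return 1 - (1 - event_rate) ** n_patients


event_rate = 1 / 1000  # hypothetical: one event per 1,000 patients
single_centre = prob_at_least_one_event(event_rate, 500)
ten_centres = prob_at_least_one_event(event_rate, 10 * 500)
```

Under these assumptions a single centre has roughly a 39% chance of seeing the event at all, while ten pooled centres have better than a 99% chance.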
Types of studies: example research questions, requirement for clinical data, and note

● Risk factor evaluation
  Example research question: Is urine output on ICU entry associated with mortality outcome?
  Requirement for clinical data: High resolution (other risk factors should be provided)
  Note: Multivariate model, stratified analysis, and propensity score analysis can be employed

● Effectiveness of intervention
  Example research question: Will drug A improve outcome of patients with septic shock?
  Requirement for clinical data: High resolution (including a large number of confounding factors)
  Note: Intervention may be given for patients with different conditions. These conditions should be controlled to avoid "selective treatment"

● Prediction model
  Example research question: Prediction model for ICU delirium
  Requirement for clinical data: Moderate resolution (general description of risk factors)
  Note: The predictive value of whole model is stressed, rather than a single risk factor

● Epidemiological study
  Example research question: The incidence and prevalence rate of catheter-related blood stream infection in ICU
  Requirement for clinical data: Low resolution
  Note: A simple description is enough and no risk factor adjustment is required

● Implementation of healthcare policy
  Example research question: Is the policy of screening and controlling hypertension effective in lowering cardiovascular event rate?
  Requirement for clinical data: Low resolution
  Note: No complex clinical data are required
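
The stratified analysis mentioned for risk factor evaluation can be sketched with the Mantel-Haenszel pooled odds ratio, which combines 2×2 tables across strata of a confounder. This is one standard technique chosen for illustration, not necessarily the one the author has in mind, and the counts below are invented: they imagine low urine output as the exposure, ICU mortality as the outcome, and age group as the stratifying factor.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across strata.

    Each stratum is a tuple (a, b, c, d):
      a = exposed with outcome,   b = exposed without outcome,
      c = unexposed with outcome, d = unexposed without outcome.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts, stratified by age group.
strata = [
    (10, 40, 5, 45),   # younger patients
    (20, 30, 10, 40),  # older patients
]
pooled_or = mantel_haenszel_or(strata)
```

With these made-up counts the pooled odds ratio is 2.5, meaning the odds of death are 2.5 times higher in the exposed group after accounting for age group.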
