
Machine Learning for Biomedical Applications
Prof. Gajendra P.S. Raghava
Head, Center for Computational Biology

Web Site:
http://webs.iiitd.edu.in/raghava/
Welcome to BIO542(MLBA)
• Course: Machine learning for biomedical Applications
• Class Time: 9.30 to 11 AM (Monday and Wednesday), Class Room: C102
• Instructor: Prof. G. P. S. Raghava (raghava@iiitd.ac.in, raghavagps@gmail.com)
• TAs: Nisha Bajia (nishab@iiitd.ac.in)
• Important URLs & email
• Mailing list : bio542@iiitd.ac.in
• Google Classroom: joining code zg7qwoy
• https://classroom.google.com/c/NjE2ODk4MzI3MzU2?cjc=zg7qwoy
• Website: http://webs.iiitd.edu.in/raghava/
• Please go through the academic dishonesty policy very carefully
• https://www.iiitd.ac.in/education/resources/academicdishonesty
• Visiting hours: Students may visit between 4.30 and 5.30 PM (A-302, New Academic
Building) with any questions or doubts, or for discussion. Check availability in advance.
Course Description
This course is specifically designed to accommodate a diverse
range of students with backgrounds in biology, medical
science, pharmacology, bioinformatics, and computer science.
The course is structured into three comprehensive sections: i)
Challenges in Biomedical Sciences, ii) Introduction to Machine
Learning Techniques and iii) Solving Biomedical Problems
using Machine Learning Techniques. Students will gain
insights into the complexities of working with biomedical data
and understand the unique requirements of applying machine
learning techniques to solve real-world biomedical problems.
In this course, students will work on case studies and real-life
projects that require innovative applications of machine
learning to solve problems in biological and health sciences.
Post conditions
(Expectations from students after course)

• Understanding biomedical applications like annotation of
biomolecules, drug design and image-based diagnostics
• Ability to train, test and evaluate machine/deep learning-based
models for predicting disease biomarkers and designing
therapeutic molecules
• Familiarity with feature engineering and ability to implement
CNN for classification of biomedical images
• Ability to implement research projects in the field of
biomedical sciences
Performance Evaluation
Group Activities
Assignments: 20% Group activity
Individual Activities
Mid-sem Exam: 30% Individual
End-sem Exam: 30% Individual
Quiz: 20% Individual
Performance Evaluation
(Individual Activity)
• Quiz: Three quizzes will be conducted in class; the best
two will be counted. Objective test (short-answer and
MCQ); no re-exam is possible.
• Mid-sem Exam: Paper will have both types of
questions (short/long); re-exam possible in case of medical leave
• End-sem Exam: Paper will have both types of
questions (short/long); re-exam possible in case of medical leave
Performance Evaluation
(Group activity)
Assignments: There will be two assignments, 10 marks
each. Assignments will be submitted by groups of at most
three students. They will be based on a Kaggle in-class
competition.
Week-wise plan
• Week1: Introduction to biomedical science and the challenges
of biomedical informatics
• Week2: Statistical and machine learning based models for
classification and prediction
• Week3: Python libraries for developing machine learning based
models
• Week4: Annotation and feature generation of proteins
• Week5: Computer-aided protein therapeutics and vaccine
development
• Week6: Creation of datasets and evaluation of models in
biological sciences
• Week7: Models for drug discovery
• Week8: Feature engineering and feature manipulation
• Week9: Explainable machine learning techniques
• Week10: Ensemble methods (Random Forest)
• Week11: Development of models for identification of genetic
biomarkers for diseases
• Week12: Deep learning-based models using python libraries
• Week13: Application of CNN in classification of biomedical
images
Plan for Tutorials
• Python libraries for machine learning
• Major biological databases
• Kaggle Competition
• Python libraries for feature engineering
• Deep learning libraries
• Image classification with Python
Accidents, Environment, Food, Age of Organism

Causes of Diseases
• Disease-associated pathogens (virus, bacteria, fungus, etc.)
• Disorder or malfunction (e.g., cancer)
• Malnutrition (healthy food)
• Side-effects of drugs
• Mental health & stress

Possible Solutions
• Understanding biology at genome level
• Drugs, particularly against drug-resistant diseases
• Subunit or epitope-based vaccines
• Disease biomarkers for early detection
• Drug biomarkers
Biomedical Applications
Concept Level
★ Proteome annotation ★ Drug discovery ★ Vaccine design ★ Biomarkers

Molecules or Objects
• Proteins & Peptides: structure prediction, subcellular localization, therapeutic application, ligand binding
• Gene Expression: disease biomarkers, drug biomarkers, mRNA expression, copy number variation
• Chemoinformatics: drug design, chemical descriptors, QSAR models, personalized inhibitors
• Image annotation: image classification, medical images, disease classification, disease diagnostics
Life Expectancy in India
Life expectancy in world (UN 2023)
Five kingdoms of living organisms

Animalia
• Eukaryotic, multicellular
• Heterotrophic
• Vertebrates and invertebrates

Prokaryotae
• Unicellular and microscopic
• No nuclear membrane
• No ER, no mitochondria
• Cell wall made of murein
• Bacteria or cyanobacteria

Protoctista
• Mainly small eukaryotic organisms
• Many live in aquatic environments
• Not animals, plants or fungi
• Algae, slime moulds, Plasmodium

Fungi
• Eukaryotic, multicellular
• Cell wall made of chitin
• No photosynthetic pigments
• Mushroom, mold, puffball

Plantae
• Eukaryotic, multicellular
• Cell wall made of cellulose
• Photosynthetic pigment
• Plants
Cell: minimum unit of life
• Neuron: nerve cells from mammalian brain
• Paramecium: single-celled protozoan
• Chlamydomonas: green algae (single cell)
• Saccharomyces cerevisiae: yeast cell
• Helicobacter pylori: bacteria that cause stomach ulcers
Human Cell
All cells in a person have the same DNA and genes
• Different tissues have different types of cells
• Different types of cells express different types and levels of genes
• These cells express a variety of proteins responsible for different tissues
Unicellular organisms

Paramecium Yeast Euglena

Amoeba
Bacteria
Multicellular organisms
Life: Growth, Survival and Reproduction

Types of Biomolecules
Carbohydrates

Lipids

Proteins

Nucleic Acids (DNA & RNA)


Carbohydrates
• Carbohydrate means "hydrated" carbon
• Composing elements: C, H, O
• Hydrogen and oxygen are in a ratio of 2:1
• Can be simple monomers like glucose
• Can be complex polymers like cellulose

Lipids
• Composing elements: C, H, O
• Lipids are loosely defined as groups of organic
molecules that are insoluble in water.
• Include:
  • fats
  • oils
  • waxes
  • phospholipids
  • steroids: sex hormones and cholesterol
  • some vitamins
  • glycolipids (lipids with carbohydrates attached)

Concept Map
Carbon compounds include:
• Carbohydrates, which consist of sugars and starches, which contain carbon, hydrogen, oxygen
• Lipids, which consist of fats and oils, which contain carbon, hydrogen, oxygen
• Nucleic acids, which consist of nucleotides, which contain carbon, hydrogen, oxygen, nitrogen, phosphorus
• Proteins, which consist of amino acids, which contain carbon, hydrogen, oxygen, nitrogen
Major molecules: Proteins, DNA, RNA
Proteins
• Most cellular activities are performed by proteins
• A protein is a polymer of 20 natural amino acids
• FASTA is a commonly used format for presenting protein
sequences
• A FASTA file contains amino acids in single-letter code
• The sequence of a protein is also called its primary
structure
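
The FASTA points above can be illustrated with a minimal Python sketch. It assumes the usual convention that a record starts with a ">" header line followed by one or more sequence lines; the function name `parse_fasta`, the record names and the sequences are all hypothetical and invented for illustration.

```python
from typing import Dict, Iterable

# The 20 natural amino acids in single-letter code
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()  # join wrapped sequence lines
    return records


def is_protein(seq: str) -> bool:
    """True if every residue is one of the 20 natural amino acids."""
    return len(seq) > 0 and set(seq) <= AMINO_ACIDS


# Hypothetical records; the sequences are made up for this sketch.
example = """>demo_peptide_1 hypothetical
MKTAYIAKQR
QISFVKSHFS
>demo_peptide_2 hypothetical
GAVLIMFWPC
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), is_protein(seq))
```

Real projects would more likely rely on an established parser; the sketch only shows how little structure the format has.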
Proteins are made of 20 types of natural
amino acids
Protein Sequences in FASTA format
DNA (Deoxyribonucleic acid)
• It is a polymer of 4 nucleotides (A, T, G, C)
• It is a double chain (two complementary strands)
• One strand runs 5' -> 3' and the other 3' -> 5'
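
Because the two strands are complementary and run in opposite directions, the second strand can be derived from the first. A small Python sketch, assuming standard Watson-Crick pairing (A with T, G with C) and that both strands are written 5' -> 3'; the function names are illustrative only.

```python
# Watson-Crick base pairing: A<->T, G<->C
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(seq: str) -> str:
    """Base-by-base complement, read in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Opposite strand written 5' -> 3' (complement, then reverse)."""
    return complement(seq)[::-1]


strand = "ATGGCC"
print(complement(strand))          # TACCGG
print(reverse_complement(strand))  # GGCCAT
```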
RNA
• RNA molecules are similar to DNA
• Uracil (U) instead of thymine (T)
• RNA is normally single-stranded
• DNA has a single function, whereas RNA has several different functions
• mRNA, tRNA, rRNA, miRNA, siRNA, etc.
The genome is our Genetic Blueprint
• Nearly every human cell contains 23 pairs of chromosomes
• 1 - 22 and XY or XX
• XY = Male
• XX = Female
• Length of chr 1-22, X, Y together is ~3.2 billion bases
(about 2 meters diploid)
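
The "about 2 meters" figure can be checked with simple arithmetic. The sketch below assumes a rise of roughly 0.34 nm per base pair (the commonly quoted value for B-form DNA, which is not stated on the slide) and the ~3.2 billion bases given above.

```python
BASES_HAPLOID = 3.2e9      # bases in chr 1-22, X, Y (from the slide)
RISE_NM_PER_BP = 0.34      # assumed length per base pair, in nanometres

# Convert nanometres to metres (1 nm = 1e-9 m)
haploid_m = BASES_HAPLOID * RISE_NM_PER_BP * 1e-9
diploid_m = 2 * haploid_m  # a diploid cell carries two copies

print(f"haploid: {haploid_m:.2f} m, diploid: {diploid_m:.2f} m")
```

Under these assumptions the diploid length comes out slightly above 2 meters, consistent with the rounded value on the slide.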
• Transcription: DNA -> RNA
• Translation: RNA -> Protein
Central dogma of molecular biology
• mRNA then goes through the pores of the nucleus with
the DNA code and attaches to the ribosome.
Transcription, Translation and Protein synthesis

Transcription
• Process of copying DNA to RNA
• Does not need a primer to start
• Can involve multiple RNA polymerases
• Divided into 3 stages
  • Initiation
  • Elongation
  • Termination
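
As a rough illustration of what "copying DNA to RNA" means at the sequence level, the Python sketch below ignores the three biochemical stages and only computes the resulting mRNA. The coding/template strand distinction is an assumption added here, not something stated on the slide, and both input strands are taken to be written 5' -> 3'.

```python
def transcribe_coding(coding_strand: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_template(template_strand: str) -> str:
    """mRNA is complementary and antiparallel to the template strand."""
    pairing = str.maketrans("ATGC", "UACG")  # DNA base -> paired RNA base
    return template_strand.upper().translate(pairing)[::-1]


coding = "ATGGCC"
template = "GGCCAT"  # reverse complement of the coding strand

print(transcribe_coding(coding))      # AUGGCC
print(transcribe_template(template))  # AUGGCC
```

Both routes give the same mRNA, which is why sequence databases usually report the coding strand.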
Genes to Protein
Translation (six reading frames)
Genetic Codes
• mRNA carrying the DNA instructions and tRNA carrying
amino acids meet in the ribosomes.
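
Translation with the genetic code, and the idea of six reading frames, can be sketched in Python as follows. This assumes the standard genetic code (other codes exist, for example in mitochondria), writes stop codons as "*", and marks any codon containing an unknown base as "X". The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(seq: str, frame: int = 0) -> str:
    """Translate DNA (or mRNA) codon by codon, starting at offset `frame`."""
    dna = seq.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Three frames on the given strand (+) and three on its reverse complement (-)."""
    rev_comp = dna.upper().translate(str.maketrans("ATGC", "TACG"))[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rev_comp, offset)
    return frames


print(translate("ATGGCCTAA"))  # MA*
for name, peptide in six_frame_translation("ATGGCCTAA").items():
    print(name, peptide)
```

Only one of the six frames usually corresponds to the real protein; the others are computed because the correct strand and offset are not known in advance.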
mRNA Levels Indirectly Measure Gene Activity
The activity of a gene (expression) can be determined by the
presence of its complementary mRNA
Gene Expression

Every cell contains the same DNA

Genes code for proteins through the intermediary of mRNA

Cells differ in which genes (DNA) are active at any one time
Gene Expression
[Figure: microarray with probe (on chip), labelled sample, and resulting pseudo-colour image]
DNA sequencing
• Sanger sequencing techniques
• Maxam–Gilbert sequencing (1977-80)
• Pyrosequencing (1993)
• Next generation sequencing techniques
Genome Gallery
Genome size of important species

Species                          Chromosomes   Genome size (bp)
Coronavirus                      1             3 × 10^4
Bacteriophage λ (virus)          1             5 × 10^4
Escherichia coli                 1             5 × 10^6
S. cerevisiae (yeast)            32            1 × 10^7
Caenorhabditis elegans (worm)    12            1 × 10^8
D. melanogaster (fruit fly)      8             2 × 10^8
Homo sapiens (human)             46            3 × 10^9
Important techniques
• Cutting DNA using restriction enzymes
• Gel electrophoresis for measuring size of DNA/protein
• DNA cloning for generating copies of DNA fragments
• Polymerase chain reaction for producing many copies of DNA
• DNA sequencing technique
• Microarray for measuring gene expression
• RNA-seq for sequencing the transcriptome and measuring expression of genes
Restriction Enzymes
• Discovered in 1962 in bacteria; purified and characterized
in 1970
• Molecular scissors that cut DNA at specific points.
• Found naturally in a wide variety of prokaryotes
• An important tool for manipulating DNA.
• Example

RE        Strain of origin          Recognition site
EcoRI     E. coli (strain RY13)     GAATTC
HindIII   H. influenzae             AAGCTT
BamHI     B. amyloliquefaciens      GGATCC
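
Locating recognition sites in a sequence is a simple string search. The Python sketch below uses the three sites from the table above; it reports only where each site starts (0-based) and does not model where within the site the enzyme cuts. The demo sequence is invented for illustration.

```python
import re

# Recognition sites from the table above
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def find_sites(dna: str) -> dict:
    """Return {enzyme: [0-based start positions]} for each recognition site."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in RECOGNITION_SITES.items():
        # A lookahead pattern also reports overlapping occurrences.
        hits[enzyme] = [m.start() for m in re.finditer(f"(?={site})", dna)]
    return hits


demo = "TTGAATTCAAGGATCCTTAAGCTTGAATTC"  # hypothetical sequence
print(find_sites(demo))
# {'EcoRI': [2, 24], 'HindIII': [18], 'BamHI': [10]}
```

All three sites read the same on both strands (each equals its own reverse complement), so searching one strand is enough for these enzymes.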
Types of cloning
• Recombinant DNA technology: DNA/molecular/gene cloning
• Reproductive cloning: adult DNA cloning
• Therapeutic cloning: embryo/biomedical cloning
Gene Cloning
Polymerase Chain Reaction
(Lab technique for DNA replication and Amplification)

• PCR can produce many copies of a DNA segment
• A three-step cycle: heating, cooling, and replication
• Step 1: Denaturation: the two DNA strands are separated at 95 °C
• Step 2: Primer annealing: primers bind to their
complementary sequences on the single strands of DNA at 40 to
65 °C
• Step 3: Extension: at 72 °C, DNA polymerase extends the DNA
chain by adding nucleotides to the 3' ends of the primers
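
Since each cycle can at best double the number of copies, the yield grows exponentially with the number of cycles. A minimal Python sketch of that arithmetic, where the `efficiency` parameter (fraction of molecules copied per cycle) is a hypothetical addition for illustration and 1.0 means perfect doubling:

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised copy number after a given number of PCR cycles.

    efficiency = 1.0 means every molecule is duplicated in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, f"{pcr_copies(1, n):.3g}")
```

With perfect doubling, a single template molecule gives about a billion copies after 30 cycles; real reactions fall short of this as reagents run out.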
World of OMICs
[Figure labels: Organ, Tissue; Cell; Nucleus; Chromosome (23 pairs); Chromatin (M, Ac marks)]
• Genomics (3 × 10^9): DNA (4 chemicals: A, T, G, C)
• Epigenomics: chromatin
• Transcriptomics: mRNA (copies), non-coding RNA, miRNA
• Proteomics: protein (20 chemicals: A, C, D ..)
• Glycomics (sugars; sugars attached to proteins)
• Lipidomics (lipids)
• Metabolomics
