
42001 Bioinformatics

Assignment I: Bioinformatics Analysis

Due date: 3 October, via email or upload.

Report format: Recording of a video presentation

In the past weeks, you have learned how to use Linux commands and Bioinformatics tools to analyse
sequence variation and to quantify gene expression from datasets generated by next-generation
sequencing protocols. In this assignment, your team is tasked with (i) the development of a
bioinformatics pipeline for the analysis of Exome-Seq and RNA-seq datasets and (ii) the application of
this pipeline to a leukaemia patient sample, to confirm known sequence variants and to identify novel
ones for potential clinical follow-up.

The dataset is located in ../../student_resources/assignment-1/, which contains a folder named after
your group number (e.g. Group_1, Group_2 …). The folder contains the following datasets from the
patient's leukaemia blast cells:

(1) DNA was captured using the Agilent SureSelect Human All Exon probes, and sequencing
was performed using an Illumina HiSeq X Ten machine. The dataset was aligned to the human
genome (hg19) using the tool bowtie2 and the output is available as:
• ExomSeq.bam
• ExomSeq.bamStats

Note that, owing to the disease status, matched normal tissue is not available for this patient.
Therefore, we propose to apply the Mutect2 pipeline learned in the course without a matched
normal (tumour-only mode).

(2) Genome sequence variants previously identified for the patient. These are provided as an Excel
table:
• KnownVariants.xlsx

(3) RNA was extracted with the miRNeasy Mini Kit and sequencing was performed using an
Illumina HiSeq2500 machine. The dataset was aligned to the human genome (hg19) using the
tool bowtie2 and the output is available as:
• RNASeq.bam
• RNASeq.bamStats

This setup is typical of a real-world project that you might encounter as a bioinformatics
specialist in a research or diagnostic laboratory. The following aims should guide your analysis:

(A) Investigate and analyse all datasets. Special consideration should be given to:

1) Data Quality and any pre/post-processing steps that are useful?

2) Identification of DNA variation?


3) Quantification of RNA expression?

4) Integration of DNA variation and RNA expression?

5) Interpretation of results?
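
For aim 4, one simple way to relate the two data types is to join a per-variant table that has a gene column (for example taken from an annotation tool) to a per-gene expression table. The sketch below assumes two hypothetical tab-separated files whose layouts are not prescribed by the assignment. The gene names, variants and counts are toy placeholders, not results.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)

# ASSUMPTION: variants.tsv   = gene <TAB> variant
#             expression.tsv = gene <TAB> read count
printf 'GENE_A\tchr1:100A>T\nGENE_B\tchr2:200CT>C\nGENE_C\tchr3:300G>A\n' > "$workdir/variants.tsv"
printf 'GENE_A\t250\nGENE_B\t0\n' > "$workdir/expression.tsv"

# Attach the expression value of the affected gene to every variant.
# Genes missing from the expression table are reported as NA.
# Usage: annotate_expression VARIANTS_TSV EXPRESSION_TSV
annotate_expression() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { count[$1] = $2; next }   # first file: gene -> read count
    {
      value = ($1 in count) ? count[$1] : "NA"
      print $1, $2, value
    }
  ' "$2" "$1"
}

annotate_expression "$workdir/variants.tsv" "$workdir/expression.tsv"
```

A variant in a gene with a count of zero or NA cannot be supported at the RNA level, which is worth stating explicitly when candidates are ranked.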

(B) A group presentation on your bioinformatics pipeline and findings.

- Bioinformatics Pipeline (~5 min)
- Results I: SNPs (known, novel)
- Results II: INDELs (known, novel)
- Results III: Expression (genes with SNPs/INDELs)
- Results IV: Discussion of candidate SNPs/INDELs for experimental/clinical follow-up
- Conclusions: Pros/Cons Bioinformatics Pipeline, SNPs/INDELs, Project Comments
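
To split calls into the SNP and INDEL result sections listed above, the REF and ALT columns of a VCF are sufficient. The following is a minimal sketch on a toy VCF written inline. It assumes that only records with FILTER equal to PASS are of interest and does not handle symbolic alleles such as `*`. The real input would be the filtered variant file produced by your pipeline.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)

# Toy VCF body (tab-separated), for illustration only.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n'
  printf 'chr1\t150\t.\tG\tGTC\t.\tPASS\t.\n'
  printf 'chr2\t200\t.\tCT\tC\t.\tPASS\t.\n'
  printf 'chr2\t300\t.\tG\tA\t.\tgermline\t.\n'
} > "$workdir/calls.vcf"

# Print CHROM, POS, REF, ALT and a type label for every PASS record.
# SNP:   REF and ALT are both a single base.
# INDEL: REF and ALT differ in length.
# OTHER: anything else, e.g. multi-nucleotide substitutions.
classify_variants() {
  awk -F'\t' -v OFS='\t' '
    /^#/ || $7 != "PASS" { next }
    {
      n = split($5, alts, ",")           # one output row per ALT allele
      for (i = 1; i <= n; i++) {
        if (length($4) == 1 && length(alts[i]) == 1) type = "SNP"
        else if (length($4) != length(alts[i]))      type = "INDEL"
        else                                         type = "OTHER"
        print $1, $2, $4, alts[i], type
      }
    }
  ' "$1"
}

classify_variants "$workdir/calls.vcf"
```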

Of note, these datasets have not been previously analyzed and might lead to useful results that are of
interest to our research group.

Useful Hacks
1. DNA sequence alignment: BWA (option mem)
• http://bio-bwa.sourceforge.net/bwa.shtml

2. Mutation calling with Mutect2:

• Cibulskis et al., Sensitive detection of somatic point mutations in impure and heterogeneous cancer
samples. Nat Biotechnol. 2013 Mar;31(3):213-9.

• https://software.broadinstitute.org/gatk/documentation/tooldocs/4.beta.4/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php

• java -Xmx4g -jar /home/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar

2.1 Database of germline mutations incl. allele fractions (--germline-resource)

• https://gnomad.broadinstitute.org/

• https://software.broadinstitute.org/gatk/blog?id=11337

• https://gatkforums.broadinstitute.org/gatk/discussion/24057/how-to-call-somatic-mutations-using-gatk4-mutect2#latest

/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.gz
/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.idx
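
The resources above might be combined into a tumour-only call roughly as follows. This is a sketch under stated assumptions, not the course's reference solution: the reference FASTA path `REF` and the output file names are hypothetical placeholders, and the script only prints the commands unless `DRY_RUN=0` is set. GATK 4.1.x spells the germline option `--germline-resource`; confirm the options with the tool's `--help` on the server before starting a long run.

```shell
#!/bin/sh
set -eu

GATK_JAR=/home/student_resources/apps/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
GNOMAD=/home/student_resources/index/af-only-gnomad.raw.sites.hg19.vcf.gz
REF=${REF:-hg19.fa}        # ASSUMPTION: path to the hg19 FASTA (with .fai and .dict)
BAM=${BAM:-ExomSeq.bam}    # must be indexed and carry read groups
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Thin wrapper around the GATK jar given in the handout.
gatk() { run java -Xmx4g -jar "$GATK_JAR" "$@"; }

# Tumour-only calling: no matched normal is supplied.
gatk Mutect2 -R "$REF" -I "$BAM" \
     --germline-resource "$GNOMAD" \
     -O ExomSeq.unfiltered.vcf.gz

# Flag likely artefacts and germline variants; PASS records remain candidates.
gatk FilterMutectCalls -R "$REF" -V ExomSeq.unfiltered.vcf.gz \
     -O ExomSeq.filtered.vcf.gz
```

The reference passed with `-R` has to be the same hg19 build, with the same chromosome names, that the BAM was aligned against.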

2.3 Runtime:
• Expect this to run for ~8 hrs (see Linux hacks)

3. Variant Annotation:
3.1 Variant Effects
• McLaren et al., The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122.

• https://asia.ensembl.org/info/docs/tools/vep/index.html

3.2 DNA Variation databases
• https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
/home/student_resources/index/common_all_20180418.vcf

• https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/
/home/student_resources/index/clinvar_20190902.vcf
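
One way to label calls as known or novel against a database VCF such as those above is to compare CHROM, POS, REF and ALT. The sketch below runs on toy files written inline; the IDs and positions are invented for illustration. It strips a leading `chr` from chromosome names on both sides, on the assumption that the database and the calls may use different naming. It does not normalise INDEL representation, so an annotation tool remains the more robust route.

```shell
#!/bin/sh
set -eu
workdir=$(mktemp -d)

# Toy stand-in for a database VCF (tab-separated).
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\ttoy1\tA\tG,T\t.\t.\t.\n'
  printf '2\t200\ttoy2\tCT\tC\t.\t.\t.\n'
} > "$workdir/db.vcf"

# Toy stand-in for the patient's calls.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n'
  printf 'chr1\t150\t.\tG\tC\t.\tPASS\t.\n'
  printf 'chr2\t200\t.\tCT\tC\t.\tPASS\t.\n'
} > "$workdir/calls.vcf"

# Usage: label_known_novel DATABASE_VCF CALLS_VCF
# Prints CHROM, POS, REF, ALT, known|novel, database ID (or ".").
label_known_novel() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }
    { chrom = $1; sub(/^chr/, "", chrom) }      # harmonise chromosome names
    NR == FNR {                                 # first file: the database
      n = split($5, alts, ",")
      for (i = 1; i <= n; i++) known[chrom ":" $2 ":" $4 ":" alts[i]] = $3
      next
    }
    {                                           # second file: the calls
      key = chrom ":" $2 ":" $4 ":" $5
      if (key in known) print $1, $2, $4, $5, "known", known[key]
      else              print $1, $2, $4, $5, "novel", "."
    }
  ' "$1" "$2"
}

label_known_novel "$workdir/db.vcf" "$workdir/calls.vcf"
```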

4. Other Tools
4.1 Picard
• https://broadinstitute.github.io/picard/

• https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/picard_sam_AddOrReplaceReadGroups.php

• java -Xmx4g -jar /home/student_resources/apps/picard/picard.jar

4.2 SAMtools (index)
• http://www.htslib.org/doc/samtools.html
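
Mutect2 takes the sample name from the read-group header of the BAM and needs an index, so the two tools above are typically used to prepare the input. Whether the provided BAM files already carry read groups is not stated in this handout; `samtools view -H ExomSeq.bam` shows any existing `@RG` lines. The sketch below is a dry run by default, and all read-group values and the output file name are hypothetical placeholders.

```shell
#!/bin/sh
set -eu

PICARD_JAR=/home/student_resources/apps/picard/picard.jar
DRY_RUN=${DRY_RUN:-1}      # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Add a single read group to a BAM, then index the result.
# Usage: prepare_bam INPUT_BAM OUTPUT_BAM SAMPLE_NAME
# ASSUMPTION: the RGID/RGLB/RGPL/RGPU values are placeholders.
prepare_bam() {
  in_bam=$1; out_bam=$2; sample=$3
  run java -Xmx4g -jar "$PICARD_JAR" AddOrReplaceReadGroups \
      I="$in_bam" O="$out_bam" \
      RGID=1 RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM="$sample"
  run samtools index "$out_bam"     # requires a coordinate-sorted BAM
}

prepare_bam ExomSeq.bam ExomSeq.rg.bam patient
```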

5. Linux
5.1 Execute code in the background: nohup [CODE] &
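
As a minimal sketch of this background pattern, with a short `sleep` standing in for the long-running command and `job.log` as a hypothetical log file name:

```shell
#!/bin/sh
# Start a command in the background so that it survives logging out.
# stdout and stderr are sent to job.log instead of the default nohup.out.
nohup sh -c 'sleep 1; echo finished' > job.log 2>&1 &
job_pid=$!                        # process ID of the background job
echo "started job $job_pid"

# Check whether the job is still alive (kill -0 sends no signal).
if kill -0 "$job_pid" 2>/dev/null; then
  echo "job $job_pid is still running"
fi

# Only for this demo: block until the job ends, then show the log.
# For a real multi-hour run you would log out and inspect the log later.
wait "$job_pid"
cat job.log
```

Redirecting the output to a named log keeps the messages of separate runs apart and makes it easy to follow progress with `tail -f job.log`.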
