
Preprocessing of Genotyping & scRNA-seq Data

Checkpoint (4) 14/12/2023


Director: Pr. Lluis Quintana-Murci Co-supervisor: Dr. Maxime Rotival
Human Evolutionary Genetics Unit
Marwan Sharawy
SARS gene expression
Previous Progress

Performed on 88 libraries (176 individuals) / Performed on integrated dataset

• Align raw data to reference & sample demultiplexing: STARsolo, Demuxlet
• Ambient RNA & doublets removal: CellBender, DoubletDetection
• Process datasets (QC, scran normalization, HVG, etc.): Scanpy & scran
• Merging & batch correction of datasets: scVI
• Automatic cell type annotation: CellID
• Downstream analysis: Scanpy, GSEApy, etc.

Reference used for the whole pipeline: Human + IAV PR8 + SARS-CoV-2 GE

Build our own reference using the new strains and redo EVERYTHING
Build reference

• Download FASTA files for new reference strains.


• Manually build GTF files as there are no strain-specific GTF files available in databases.
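
A minimal sketch of how a hand-built GTF entry could be generated, assuming single-exon genes. The contig name, gene identifier, coordinates, and the `.t1` transcript naming are placeholders for illustration, not values from the actual strains; real coordinates would have to come from the strain's annotation or from alignment to an annotated relative.

```python
def gtf_records(seqname, gene_id, gene_name, start, end, strand="+", source="manual"):
    """Return gene, transcript, and exon GTF lines for one single-exon gene.

    Coordinates are 1-based and inclusive, as the GTF format requires.
    """
    if start < 1 or end < start:
        raise ValueError("GTF coordinates must satisfy 1 <= start <= end")
    transcript_id = f"{gene_id}.t1"  # hypothetical naming scheme
    gene_attr = f'gene_id "{gene_id}"; gene_name "{gene_name}";'
    tx_attr = (
        f'gene_id "{gene_id}"; transcript_id "{transcript_id}"; '
        f'gene_name "{gene_name}";'
    )
    lines = []
    for feature, attr in (("gene", gene_attr), ("transcript", tx_attr), ("exon", tx_attr)):
        # The nine tab-separated GTF columns: seqname, source, feature,
        # start, end, score, strand, frame, attributes.
        fields = [seqname, source, feature, str(start), str(end), ".", strand, ".", attr]
        lines.append("\t".join(fields))
    return lines


# Placeholder contig name and coordinates, for illustration only.
records = gtf_records("viral_contig", "GENE_A", "GENE_A", 100, 400)
print("\n".join(records))
```

The seqname in the first column has to match the FASTA header of the new strain exactly, otherwise the aligner cannot link the annotation to the sequence.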
Previous Progress

Performed on 88 libraries (176 individuals) / Performed on integrated dataset

• Align raw data to reference & sample demultiplexing: STARsolo, Demuxlet
• Ambient RNA & doublets removal: CellBender, DoubletDetection, Scrublet
• Process datasets (QC, scran normalization, HVG, etc.): Scanpy & scran
• Merging & batch correction of datasets: scVI
• Automatic cell type annotation: CellID
• Downstream analysis: Scanpy, GSEApy, etc.

• After building the reference, rerun the entire pipeline

Adjustments:

• Doublets: kNN + DD (DoubletDetection) let some doublets leak through.

• In the new pipeline, use two different doublet detection tools, Scrublet + DD, to address missed doublets.

• Remove anything detected as a doublet by either DD or Scrublet.
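
The "either tool" rule above amounts to taking the union of the two doublet calls. A minimal sketch, assuming each tool's output has already been reduced to one boolean per barcode; the barcodes and calls below are toy values, not real results.

```python
def remove_doublets(barcodes, dd_calls, scrublet_calls):
    """Keep only barcodes that neither tool flagged as a doublet."""
    if not (len(barcodes) == len(dd_calls) == len(scrublet_calls)):
        raise ValueError("barcodes and both call lists must have the same length")
    return [
        bc
        for bc, dd, scr in zip(barcodes, dd_calls, scrublet_calls)
        if not (dd or scr)  # union: flagged by DD or by Scrublet
    ]


# Toy example: one cell flagged only by DD, one only by Scrublet.
barcodes = ["AAAC", "AAAG", "AACT", "AAGG"]
dd_calls = [False, True, False, False]
scrublet_calls = [False, False, True, False]
singlets = remove_doublets(barcodes, dd_calls, scrublet_calls)
print(singlets)
```

Taking the union favors removing doublets over retaining cells, so some true singlets may be discarded along with them.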



• Cell ID: instead of randomly subsampling the Yann dataset for use as the CellID reference, keep only European individuals.

• Increase the number of generated gene signatures per cell type from 500 to 1000.
Further QC

• Visualize the number of cells per cell type among cells with fewer than 1,500 total read counts
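
The tally behind such a plot can be sketched as follows, assuming per-cell total counts and cell type labels are available as two parallel lists. The values and labels are toy data; the threshold of 1,500 is the one stated above.

```python
from collections import Counter


def low_count_cells_per_type(total_counts, cell_types, threshold=1500):
    """Count cells per cell type among cells with total counts below the threshold."""
    if len(total_counts) != len(cell_types):
        raise ValueError("total_counts and cell_types must have the same length")
    return Counter(ct for n, ct in zip(total_counts, cell_types) if n < threshold)


# Toy data, for illustration only.
total_counts = [800, 2500, 1499, 1500, 300]
cell_types = ["Monocyte", "T cell", "Monocyte", "B cell", "T cell"]
low_counts = low_count_cells_per_type(total_counts, cell_types)
print(low_counts.most_common())
```

The comparison is strict, so a cell with exactly 1,500 counts is not included in the low-count group.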
Post QC UMAP
Annotation

• Split by condition for finer annotation
• Perform clustering


Per cluster composition
Manual annotation
• Automatic labeling: very good but not perfect
• Case example: Treg cells
• Case example: MAIT cells
SARS reads

Figure: coverage and junction tracks of reads aligned to the SARS genome.

• Most reads come from the 3' end of the genome (poly(A) tail).

• Are we sequencing the viral genome?
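
The 3' bias noted above can be put into a number. A minimal sketch, assuming read start positions have already been extracted from the alignments; the positions are toy values, the 1,000-base window is an arbitrary choice, and the genome length is the Wuhan-Hu-1 reference length, which may differ for the strain in the custom reference.

```python
def fraction_near_3prime(read_starts, genome_length, window=1000):
    """Fraction of reads whose start lies within the last `window` bases of the genome."""
    if not read_starts:
        return 0.0
    cutoff = genome_length - window
    n_near = sum(1 for pos in read_starts if pos > cutoff)
    return n_near / len(read_starts)


# Assumption: Wuhan-Hu-1 reference length; replace with the actual strain length.
genome_length = 29_903
# Toy 1-based read start positions, for illustration only.
read_starts = [120, 15000, 29000, 29400, 29650, 29800]
frac = fraction_near_3prime(read_starts, genome_length)
print(f"{frac:.2f}")
```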



• If we are sequencing the genome, why only monocytes?

• Could it be an infection? Specifically, is phagocytosis by monocytes of the Omicron variant a possibility? Or could this be an
effect of vaccination or prior exposure?

"Spliced reads suggest that viral genes are being processed, indicating successful invasion by the virus."
Clustering

(OT + kNN + Leiden)

• OT (optimal transport) defines distances for comparing high-dimensional data represented as probability distributions.

• OT does not scale to 1 million cells.
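
To make the idea of an OT distance concrete, here is a minimal sketch of the simplest case: the 1-Wasserstein distance between two equal-size one-dimensional samples with uniform weights, which reduces to matching sorted values. This is an illustration of the concept, not the method used in the pipeline. The pair count at the end assumes OT distances would be needed between all pairs of cells; the slides do not state how OT is applied, so that is only one plausible reading of the scalability problem.

```python
def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size 1D samples with uniform weights.

    In one dimension the optimal transport plan matches sorted values in order.
    """
    if len(x) != len(y) or not x:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def n_pairs(n_cells):
    """Number of unordered cell pairs, i.e. distances in an all-pairs comparison."""
    return n_cells * (n_cells - 1) // 2


# Toy samples: the second is the first shifted by 5.
print(wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0]))  # 5.0
print(n_pairs(1_000_000))  # 499999500000
```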


To Do

• Further manual cell annotations

• Run compositional analysis (differences in cell composition)

• GSEA between conditions per cell type

• Understand what I have: spend some time with the literature on innate immunity, viral response, different immune cell functions and interactions, etc.

• Prepare for the lab meeting on 8 January

• Compute bulk expression for eQTL
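
For the last item, one common reading of "bulk expression" from single-cell data is a pseudobulk: summing counts over all cells of the same donor and cell type. A minimal sketch under that assumption, with toy counts and hypothetical donor labels; the grouping by donor and cell type is an assumption, not something stated in the list above.

```python
from collections import defaultdict


def pseudobulk(cell_counts, donors, cell_types):
    """Sum per-cell gene counts within each (donor, cell type) group."""
    if not (len(cell_counts) == len(donors) == len(cell_types)):
        raise ValueError("all inputs must have one entry per cell")
    bulk = defaultdict(lambda: defaultdict(int))
    for counts, donor, ctype in zip(cell_counts, donors, cell_types):
        for gene, n in counts.items():
            bulk[(donor, ctype)][gene] += n
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {key: dict(genes) for key, genes in bulk.items()}


# Toy data: three cells, two hypothetical donors, one cell type.
cell_counts = [{"IFI27": 2, "ISG15": 1}, {"IFI27": 3}, {"ISG15": 4}]
donors = ["D1", "D1", "D2"]
cell_types = ["Monocyte", "Monocyte", "Monocyte"]
bulk = pseudobulk(cell_counts, donors, cell_types)
print(bulk)
```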


Benchmark of scVI / PCA / TopOMetry / OT
Raw control

10 kNN, using raw data


PCA

10 kNN, 30 PCs
Mowgli / Optimal transport

10 kNN, OT, latent_dim = 30
TopOMetry
Obtain properly weighted eigenbases to represent the underlying data manifold.
The eigencomponents are the 'latent space' (a.k.a. the dimensionality-reduced
space), similar to the latent space learned by autoencoders like scVI or the
principal components learned by PCA.
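
Each benchmark configuration builds a 10-nearest-neighbor graph on a different representation (raw data, PCs, an OT latent space, or eigencomponents). A minimal brute-force sketch of that shared step, assuming Euclidean distance and a small toy embedding with k = 2. In Scanpy the analogous step is usually `sc.pp.neighbors` with `n_neighbors=10` and either `n_pcs` or `use_rep`, which is far more efficient than this illustration.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest other points (Euclidean)."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    neighbors = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, paired with the index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Toy 2D embedding with two well-separated groups, for illustration only.
embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
graph = knn_indices(embedding, k=2)
print(graph)
```

Comparing representations then comes down to how well the resulting neighbor graphs, and the clusters derived from them, agree with the annotated cell types.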
