Genomic Data Preprocessing Through Different Libraries

Genomics data
preparation
Presented by:
Shujaat Muhammad
Introduction to Bioinformatics
Useful Databases, tools
Outline
How to Retrieve DNA Sequence?
Introduction to iLearnOmega
What is bioinformatics?
• Bioinformatics is defined as the applica-
tion of tools of computation and analysis
to the capture and interpretation of bio-
logical data. It is an interdisciplinary
field, which harnesses computer sci-
ence, mathematics, and biology.[1]
Useful Databases, tools
Useful bioinformatic websites (available freely on the internet)
• National Center for Biotechnology Information (www.ncbi.nlm.nih.gov)—maintains bioinformatic tools and databases
• National Center for Genome Resources (www.ncgr.org/)—links scientists to bioinformatics solutions by collaborations, data,
and software development
• Genbank (www.ncbi.nlm.nih.gov/Genbank)—stores and archives DNA sequences from both large scale genome projects and
individual laboratories
• Unigene (www.ncbi.nlm.nih.gov/UniGene)—gene sequence collection containing data on map location of genes in chromo-
somes
• European Bioinformatic Institute (www.ebi.ac.uk)—centre for research and services in bioinformatics; manages databases of
biological data
• Ensemble (www.ensembl.org)—automatic annotation database on genomes
• BioInform (www.bioinform.com)—global bioinformatics news service
• SWISS-PROT (www.expasy.org/sprot/)—important protein database with sequence data from all organisms, which has a
high level of annotation (includes function, structure, and variations) and is minimally redundant (few duplicate copies)
• International Society for Computational Biology (www.iscb.org/)—aims to advance scientific understanding of living systems
through computation; has useful bioinformatic links
How to
Retrieve DNA
Sequence?
Go to this link
https://www.ncbi.nlm.nih.gov/genome/guide/human/
For this sequence copy “chr1:9747084-
9749721”
Check the chromosome number; click on the specific chromosome in this window
chromosome 1
1. Paste the copied data and
click on the arrow. This window will open.
The sequence will be selected
2. Click and down-

load FASTA
FASTA Format?
• In bioinformatics and biochemistry, the FASTA format is a text-based format for repres-
enting either nucleotide sequences or amino acid sequences, in which nucleotides or
amino acids are represented using single-letter codes. The format also allows for se-
quence names and comments to precede the sequences
• Sequence in FASTA format begins with a single-line description, followed by lines of se-
quence data. The description line is distinguished from the sequence data by a greater-
than (">") symbol at the beginning. An example sequence in FASTA format is:
> - train non_TATA FP002706 GRMZM2G167283_1 :+U EU:NC; range -1000 to -
201.-907 to -657
GCTCACAGGATACGTTCCCAACGTGCGCTAAAATATTAGTGCATAATTCATTTTCGAGGGCTGT-
TGCTGCATGCGTGGGGTCATTTTTCGGAGAGAGAAAAGGGAAAATTCTCGTTCCTTTCCACGCT-
GTCGAAGGAGTCCTAAGTCTGTGTGCATAGATGAAAGTGGTAAAATTATAGCATCTTTTAGAT-
TATAATTACAAATATGTAAAGTCATTTATAAGAGTGGGTTATGGAGAGTTGCTATCTGTA
• Now we will continue
with extracted Sequence
Dataset
A Bioinformatic tool
iFeatureOmega
iFeatureOmega
• iFeatureOmega is a comprehensive platform for generating, analyzing and visualizing 189

representations for biological sequences, 3D structures and ligands
• Three versions (i.e. iFeatureOmega-Web, iFeatureOmega-GUI and iFeatureOmega-CLI) of
iFeatureOmega have been made available to cater to both experienced bioinformaticians
and biologists with limited programming expertise.
• iFeatureOmega also expands its functionality by integrating 15 feature analysis algorithms
including ten cluster algorithms, three dimensionality reduction algorithms and two fea-
ture normalization algorithms) and providing nine types of interactive plots for statistical
features visualization (including histogram, kernel density plot, heatmap, boxplot, line
chart, scatter plot, circular plot, protein three dimensional structure plot and ligand struc-
ture plot
• In this lecture we will cover iFeatureOmega-CLI
Installation
• iFeatureOmega is an open-source Python-based toolkit, which operates in the Python environ-
ment.
• Prior to installing and running iFeatureOmega, all the third-party packages should be
installed in the Python environment
• Python version (v.3.7.4)
• Biopython (v.1.78)
• Pandas (v.1.3.4)
• Numpy (v.1.21.4)
• NetworkX (v.2.5)
• Matplotlib (v.3.4.3)
• There were seven main python class in
iFeatureOmega-CLI, including “iPro-
The work- tein”, “iDNA”, “iRNA”, “iStructure”

and “iLigand”, which are feature ex-
traction methods for protein sequences,
flow of iFea- DNA sequences, RNA sequences, pro-

tein structures and ligand molecules, re-
spectively. “iAnalysis” class is also
tureOmega- equipped for feature analysis, while
“iPlot” class is used to generate the cor-
responding plots.
CLI
The input and output
format of iFeatureOmega
• To calculate features for protein, DNA and RNA sequences,
a set of DNA, RNA or protein sequences in FASTA format
is used as input. To calculate analyze protein structural fea-
tures, users need to provide the protein structure of interest,
which must comply with the PDB(1) (https://
www.rcsb.org/) format by either providing a PDB accession
or uploading your own structure file (i.e., in PDB or CIF
format). For analyzing ligands, candidate ligand molecules
can be provided with SMILES format or SDF format.
iFeatureOmega-CLI usage
• The CLI-based version, which was developed and implemented as a py-
thon package. This version accepts the JSON format configuration file,
allowing users to conveniently specify the parameter values of individual
features and analysis algorithms
• The classes in iFeatureOmega-CLI are:
• iDNA: extract feature descriptors for DNA sequences
• iRNA: extract feature descriptors for RNA sequences
• iProtein: extract feature descriptors for protein sequences
• iLigand: extract feature descriptors for chemical molecules
• iStructure: extract feature descriptors for protein structure
• iAnalysis: implement feature analysis for extracted feature
descriptors
• iPlot: generate various statistical plots for generated feature
descriptors
Extract feature descriptors
• In each class, users can run the following command view the example code for “iDNA” class.
• Take the DNA feature descriptors extraction as an example:
• Step 1: running python and import iFeatureOmega
• Run python from the command line.
• Step 2: import iFeatureOmega
• $ python
• >>> import iFeatureOmegaCLI
• Step 3: Load data file
• >>> dna = iFeatureOmegaCLI.iDNA("<abs_path>/data_examples/
DNA_sequences.txt")
• “iDNA” is used to extract feature descriptors for DNA sequences, and the class should be initialized
with the sequence file (i.e. “<abs_path>/data_examples/DNA_sequences.txt” in this example,
<abs_path> indicates the absolute path of “data_examples” directory.)
• Step 4: Display the available feature descriptor types
• For “iDNA”, “iRNA”, “iProtein”, “iLigand” and “iStructure”, each of these classes has a function
named “display_feature_types”, which is used to display the names of available feature extraction
methods (Figure)
• Step 5: Extract feature descriptor
• Use the “get_descriptor” function to extract the feature descriptors. Take the “Kmer” feature
descriptor as an example (Figure )
• Step 6: Save the feature descriptor values to file

• Four functions including “to_csv”, “to_tsv”, “to_svm” and “to_arff”, are provided to save the cal-
culated feature descriptor values to the “CSV”, “TSV”, “SVM” and “arff”, respectively.
• >>> dna.to_csv("Kmer.csv", index=False, header=False)
Feature analysis
• Run the following command to view the
example code for feature analysis (Figure ):
• >>>
print(iFeatureOmegaCLI.iAnalysis.__d
oc__)
Running example for “iAnalysis” class
• Step 1: Open a data file

• >>> import iFeatureOmegaCLI
• >>> import pandas as pd
• # read data to pandas DataFrame
• >>> df = pd.read_csv('data_examples/data_without_labels.csv', sep=',',
header=None, index_col=None)
• Step 2: Create an instance for “iAnalysis” class
• # Initialize an instance of iAnalysis
• >>> data = iFeatureOmegaCLI.iAnalysis(df)
• Step 3: Implement an analysis algorithm and save analysis result
• >>> data.kmeans(nclusters=3) # k-means clustering
• >>> data.MiniBatchKMeans(nclusters=2) # Mini-Batch K-means
• >>> data.GM() # Gaussian mixture
• >>> data.Agglomerative() # Agglomerative
• >>> data.Spectral() # Spectral
• >>> data.MCL() # Markov clustering
• >>> data.hcluster() # Hierarchical clustering
• >>> data.APC() # Affinity propagation clustering
• >>> data.meanshift() # Mean shift
• >>> data.DBSCAN() # DBSCAN
# save cluster result
• >>> data.cluster_to_csv(nclusters=2)
# dimensionality reduction
• >>> data.t_sne(n_components=2) # t-distributed stochastic neighbor embedding
• >>> data.PCA(n_components=2) # Principal component analysis
• >>> data.LDA(n_components=2) # Latent dirichlet allocation
• # save dimensionality reduction result
• >>> data.dimension_to_csv(file="dimension_reduction_result.csv")
# feature normalization
• >>> data.ZScore() # ZScore
• >>> data.MinMax() # MinMax
# save feature normalization result
• >>> data.normalization_to_csv(file="feature_normalization_result.csv")
Data visualization
• Run the following command to view
the running code for data visualization
(Figure ):
• >>>
print(iFeatureOmegaCLI.iP
lot.__doc__)
Data visualization
•Step 1: Open a data file

• >>> df = pd.read_csv('data_examples/data_with_labels.csv', sep=',', header=None,
index_col=None)
•Step 2. Initialize an instance of “iPlot”

• >>> myPlot = iFeatureOmegaCLI.iPlot(df, label=True)
• Note: The first column of the data file we be taken as sample labels when label is True.
•Step 3. Scatter plot

• # scatter plot
• myPlot.scatterplot(method="PCA", xlabel="PC1", ylabel="PC2", output="scatterplot.pdf")
• For scatter plot, the dimensionality reduction algorithm was used to decompose the data to
two-dimensional space. And the dimension reduced data will be plotted as points in a plane.
Three dimensionality reduction algorithms can be selected (i.e. PCA, t_sne and LDA).
• The parameters of “scatterplot”:
method: dimensionality reduction algorithm (select from PCA, t_sne, LDA)
xlabel: the name of abscissa label of scatter plot

ylabel: the name of ordinate label of scatter plot
output: the name of output file

• Step 4. Histogram and kernel density plot
# histogram and kernel density
• myPlot.hist()
• Step 5. Heatmap
• # heatmap
• myPlot.heatmap(samples=(0, 50),
descriptors=(0, 100),
output='heatmap.pdf')
• The parameters of “heatmap”:
samples: the range of displayed samples
(i.e. rows of the data matrix)
descriptors: the range of displayed
descriptors (i.e. columns of the data mat-
rix)
output: the name of output file
Line Chart
• myPlot.line(output='linechar
t.pdf’)
Boxplot
• myPlot.boxplot(descriptors=(
0, 10),
output='boxplot.pdf')
• The parameters of “boxplot”:
descriptors: the range of dis-
played descriptors (i.e. columns of
the data matrix)

Genomic Data Preprocessing Through Different Libraries

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Genomic Data Preprocessing Through Different Libraries

Uploaded by

Copyright:

Available Formats

Genomics data

Useful Databases, tools

2. Click and down-

• iFeatureOmega is a comprehensive platform for generating, analyzing and visualizing 189

The work- tein”, “iDNA”, “iRNA”, “iStructure”

flow of iFea- DNA sequences, RNA sequences, pro-

• Step 6: Save the feature descriptor values to file

• Step 1: Open a data file

•Step 1: Open a data file

•Step 2. Initialize an instance of “iPlot”

•Step 3. Scatter plot

method: dimensionality reduction algorithm (select from PCA, t_sne, LDA)

xlabel: the name of abscissa label of scatter plot

output: the name of output file

You might also like