
A modeling and simulation web tool for plant biologists

Background: At the molecular level, nonlinear networks of heterogeneous molecules
control many biological processes, so that systems biology provides a valuable approach in
this field, building on the integration of experimental biology with mathematical modeling.
One of the biggest challenges to making this integration a reality is that many life scientists
do not possess the mathematical expertise needed to build and manipulate mathematical
models well enough to use them as tools for hypothesis generation. Available modeling
software packages often assume some modeling expertise. There is a need for software
tools that are easy to use and intuitive for experimentalists.

Outcomes: This paper introduces PlantSimLab, a web-based application
developed to allow plant biologists to construct dynamic mathematical models
of molecular networks, interrogate them in a manner similar to what is done in
the laboratory, and use them as a tool for biological hypothesis generation. It is
designed to be used by experimentalists, without direct assistance from
mathematical modelers.

RNA contact predictions by integrating structural patterns


Background: It is widely believed that tertiary nucleotide-nucleotide interactions are
essential in determining RNA structure and function. Currently, direct coupling analysis
(DCA) infers nucleotide contacts in a sequence from its homologous sequence alignment
across different species. DCA and similar approaches that use sequence information alone
typically yield a low accuracy, especially when the available homologous sequences are
limited. Therefore, new methods for RNA structural contact inference are desirable because
even a single correctly predicted tertiary contact can potentially make the difference
between a correctly and an incorrectly predicted structure. Here we present a new method,
DIRECT (Direct Information REweighted by Contact Templates), that incorporates a
Restricted Boltzmann Machine (RBM) to augment the information on sequence
co-variations with structural features in contact inference.
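
To make the idea of sequence co-variation concrete, the sketch below scores a pair of alignment columns with plain mutual information. This is a simpler covariation measure than DCA, and it is not the DIRECT method; the function name and the toy alignment are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    Illustrative covariation score only; DCA and DIRECT use more elaborate models.
    """
    n = len(alignment)
    count_i = Counter(seq[i] for seq in alignment)
    count_j = Counter(seq[j] for seq in alignment)
    count_ij = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return mi

# Hypothetical toy alignment: columns 0 and 3 change together, column 1 is constant.
toy_alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]
covarying = column_mutual_information(toy_alignment, 0, 3)
conserved = column_mutual_information(toy_alignment, 0, 1)
```

Column pairs that change together score high, while a fully conserved column carries no covariation signal. With few homologous sequences such counts become unreliable, which is consistent with the limitation noted above.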

Results: Benchmark tests demonstrate that DIRECT achieves better overall performance
than DCA approaches. Compared to mfDCA and plmDCA, DIRECT produces a substantial
increase of 41 and 18%, respectively, in accuracy on average for contact prediction. DIRECT
improves predictions for long-range contacts and captures more tertiary structural features.

Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection

Background: MicroRNAs (miRNAs) are noncoding RNA molecules heavily involved in
human tumors, and few of them circulate in the human body. Finding a tumor-
associated signature of miRNA, that is, the minimum miRNA entities to be measured for
discriminating both different types of cancer and normal tissues, is of utmost importance.
Feature selection techniques applied in machine learning can help; however, they often
provide naive or biased results.

Results: An ensemble feature selection strategy for miRNA signatures is proposed. miRNAs
are chosen based on consensus on feature relevance from high-accuracy classifiers of
different typologies. This methodology aims to identify signatures that are considerably
more robust and reliable when used in clinically relevant prediction tasks. Using the
proposed method, a 100-miRNA signature is identified in a dataset of 8023 samples
extracted from TCGA. When eight state-of-the-art classifiers are run with the 100-miRNA
signature and with the original 1046 features, global accuracy differs by only 1.4%.
Importantly, this 100-miRNA signature is sufficient to distinguish between tumor and
normal tissues. The approach is then compared against other feature selection methods,
such as UFS, RFE, EN, LASSO, Genetic Algorithms, and EFS-CLA. The proposed approach
provides better accuracy when tested with 10-fold cross-validation using different
classifiers. It is also applied to several GEO datasets across different platforms, with
some classifiers showing more than 90% classification accuracy, which supports its
cross-platform applicability.
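
The consensus step described above can be sketched as a simple vote over per-classifier feature rankings. The abstract does not give the exact procedure, so this is a minimal sketch under stated assumptions: the function name, the accuracy threshold, the top-k cutoff, the vote threshold, and all classifier and feature names are hypothetical.

```python
from collections import Counter

def consensus_features(rankings, accuracies, min_accuracy, top_k, min_votes):
    """Select features that enough high-accuracy classifiers rank in their top_k.

    rankings:   classifier name -> features ordered from most to least relevant
    accuracies: classifier name -> accuracy in [0, 1]
    """
    votes = Counter()
    for name, ranked in rankings.items():
        # Assumption: only classifiers above the accuracy threshold get a vote.
        if accuracies[name] >= min_accuracy:
            votes.update(ranked[:top_k])
    selected = [f for f, v in votes.items() if v >= min_votes]
    # Order by vote count (descending), then by name for a deterministic result.
    return sorted(selected, key=lambda f: (-votes[f], f))

# Hypothetical rankings and accuracies, for illustration only.
rankings = {
    "clf_a": ["mirna_1", "mirna_2", "mirna_3", "mirna_4"],
    "clf_b": ["mirna_2", "mirna_1", "mirna_4", "mirna_3"],
    "clf_c": ["mirna_1", "mirna_3", "mirna_2", "mirna_4"],
    "clf_d": ["mirna_4", "mirna_3", "mirna_2", "mirna_1"],
}
accuracies = {"clf_a": 0.95, "clf_b": 0.93, "clf_c": 0.91, "clf_d": 0.60}
signature = consensus_features(rankings, accuracies, min_accuracy=0.9, top_k=2, min_votes=2)
```

In this toy setup the low-accuracy classifier is ignored, and only features that at least two of the remaining classifiers place in their top two are kept.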

Shared data science infrastructure for genomics data


Background: Creating a scalable computational infrastructure to analyze the wealth of
information contained in data repositories is difficult due to significant barriers in
organizing, extracting and analyzing relevant data. Shared data science infrastructures like
Boag are needed to efficiently process and parse data contained in large data repositories.
The main features of Boag are inspired by existing languages for data-intensive computing
and can easily integrate data from biological data repositories.

Outcomes: As a proof of concept, Boa for genomics, Boag, has been implemented to
analyze RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag
provides a massive improvement over existing solutions like Python and MongoDB, by
utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage
footprint that scales well and requires fewer lines of code. We execute scripts through Boag
to answer questions about the genomes in RefSeq. We identify the largest and smallest
genomes deposited, explore exon frequencies for assemblies after 2016, identify the most
commonly used bacterial genome assembly program, and address how animal genome
assemblies have improved since 2016. Boag databases provide a significant reduction in
the storage required for the raw data and a significant speedup when querying large
datasets, owing to the automated parallelization and distribution that Hadoop
infrastructure provides during computations.

Additional Neural Matrix Factorization model for computational drug repositioning

Background: Computational drug repositioning, which aims to find new applications for
existing drugs, is gaining more attention from the pharmaceutical companies due to its low
attrition rate, reduced cost, and shorter timelines for novel drug discovery. Nowadays, a
growing number of researchers are utilizing the concept of recommendation systems to
answer the question of drug repositioning. Nevertheless, some challenges remain to be
addressed: 1) Learning ability deficiencies: the adopted model cannot learn a higher level of
drug-disease associations from the data. 2) Data sparseness limits the generalization ability
of the model. 3) The model overfits easily if the effect of negative samples is not taken into
consideration.

Outcomes: In this study, we propose a novel method for computational drug
repositioning, Additional Neural Matrix Factorization (ANMF). The ANMF model makes use
of drug-drug similarities and disease-disease similarities to enhance the representation
information of drugs and diseases in order to overcome the problem of data sparsity. By
means of a variant of the autoencoder, we were able to uncover the hidden features
of both drugs and diseases. The extracted hidden features then participate in a
collaborative filtering process by incorporating the Generalized Matrix Factorization (GMF)
method, which ultimately yields a model with a stronger learning ability. Finally,
negative sampling techniques are employed to strengthen the training set in order to
minimize the likelihood of model overfitting. The experimental results on the Gottlieb and
Cdataset datasets show that the ANMF model outperforms state-of-the-art methods.
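
As a rough illustration of two ingredients named above, the sketch below shows a GMF-style score (a sigmoid over a weighted sum of the element-wise product of two hidden-feature vectors) and a simple negative sampler. It is not the ANMF implementation: the function names, the vectors, and the choice to treat unobserved drug-disease pairs as candidate negatives are assumptions made for this example.

```python
import math
import random

def gmf_score(drug_vec, disease_vec, weights):
    """GMF-style association score in (0, 1) for one drug-disease pair."""
    # Weighted sum of the element-wise product of the two hidden-feature vectors.
    z = sum(w * p * q for w, p, q in zip(weights, drug_vec, disease_vec))
    return 1.0 / (1.0 + math.exp(-z))

def sample_negatives(known_pairs, n_drugs, n_diseases, n_samples, seed=0):
    """Draw drug-disease pairs that are absent from the known associations.

    Assumption: unobserved pairs are treated as candidate negatives.
    """
    rng = random.Random(seed)
    candidates = [
        (d, s)
        for d in range(n_drugs)
        for s in range(n_diseases)
        if (d, s) not in known_pairs
    ]
    return rng.sample(candidates, min(n_samples, len(candidates)))

# Hypothetical hidden features and known associations, for illustration only.
known = {(0, 0), (1, 2)}
negatives = sample_negatives(known, n_drugs=2, n_diseases=3, n_samples=3)
score = gmf_score([0.5, -0.2], [0.4, 0.1], weights=[1.0, 1.0])
```

In a full model, the hidden-feature vectors would come from the autoencoder variant and the weights would be learned from the known associations together with the sampled negatives.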
