Independent Study and Research
Proteins are among the most important biological macromolecules, essential for
practically all biological processes and activities in the cell. Experimental data
suggest that large protein assemblies carry out key structural and biological activities.
Proteins are made up of amino acid sequences, hence understanding the link
between amino acid sequences and protein function is a critical challenge in
molecular biology. Protein sequence analysis is vital for understanding and
predicting protein function and structure. Predicting protein structure has
emerged as a prominent and crucial use of current computational biology, with
far-reaching scientific consequences in protein function, illness prediction,
treatment design, and pharmaceutical research and development. With the
improvement of contemporary methodologies, researchers began to focus on the
broad use of computational approaches for predicting protein structure and
sequence analysis. In the field of computational biology, several
computational techniques can be used to predict protein-protein interactions and
protein sequences, including homology modelling, natural language processing,
deep learning models, machine learning, and artificial intelligence (AI). In this
chapter, the authors attempted to consolidate recent research works on Protein
Sequence Analysis and offer a review to help researchers accelerate their future
work. This will aid in comparing and selecting efficient approaches for different
use cases.
Introduction
Proteins are the most important macromolecules in living organisms, as they play a
role in almost every activity in living cells. Proteins are built from a set of
twenty standard amino acids (AAs), which are combined in diverse ways to produce
different proteins.
Amino acids are composed of carbon, oxygen, nitrogen, hydrogen, and, in certain
cases, sulphur atoms. Each amino acid has a central (alpha) carbon atom bonded to
an amino group, a carboxyl group, and a side chain. The sole difference
between various amino acids is their side chain, which determines their
properties. A peptide bond is a covalent, substituted amide linkage between
two amino acid molecules. Such a link is formed by eliminating a water molecule
between the alpha-carboxyl group of the first AA molecule and the alpha-amino
group of the second AA molecule. Similarly, two peptide bonds connect three amino
acids to form a tripeptide, three connect four to form a tetrapeptide, and so on.
Combining many AAs in this way produces a polypeptide. The part of an AA in a
peptide that remains after water loss is known as a residue. Many proteins
contain up to about 1000 AA residues, and some are considerably longer.
The terms polypeptide and protein are often used interchangeably. By convention,
polypeptide molecules have a molecular weight (MW) of less than 10,000 Daltons,
while proteins have a higher MW. Proteins normally require a partner to do their
job; they cannot do it alone. The partner may be another protein, DNA, or RNA.
A single protein molecule inside a cell cannot offer much functionality; however,
when many proteins are present, they work together to produce a functional unit.
Protein-protein interactions (PPI) occur when a protein interacts with another
protein or when two or more proteins communicate with one another via a
signalling pathway. Proteins interact to control and mediate a wide range of
cellular biological activities, for example cell signalling, cellular transport
(PPIs are involved when molecules enter and exit the cell), and muscular
contraction (which is driven by PPIs between actin and myosin filaments). Thus,
PPIs play an important role in a variety of scenarios. However, disrupted or
abnormal interactions can lead to disease. This motivates many researchers to
predict PPIs as early as possible once disease symptoms appear. Certain
disorders develop symptoms only later in life, and may then require complex
treatment or prove fatal. Prior understanding of PPIs can help uncover
pharmaceutical targets, new biological processes, and novel disease therapies.
Computational approaches are often more efficient for predicting PPIs than
experimental methods such as tandem affinity purification (TAP), protein chips,
and other laboratory procedures.
The remainder of the paper is organised as follows. Section 2
provides an introduction to proteins, their structure, and enzyme classification.
Section 3 defines deep learning and examines its architecture, as
well as several prevalent methods for implementation. Section 4 gives a brief
summary of relevant research done in this field to date. Section
5 presents a comparative analysis of several approaches for predicting
protein sequences, Section 6 describes the proposed deep-learning methodology,
and Section 7 summarises the findings of the study.
Protein
This section discusses the structure of proteins and the classification of
enzymes into enzyme classes.
Structure of Protein
This section discusses the various types of protein structure.
Figure 3 shows how deep learning models are trained and their performance
evaluated. The purpose of training is to minimise a loss function by updating
the model parameters using mini-batches of training data. Deep learning models
also have a set of hyperparameters that must be tuned for the best predictive
performance. Evaluation schemes, notably cross-validation and
training/validation/testing data splits, are used to select the best set of
hyperparameters and to measure model performance. Cross-validation of deep
learning models can be time-consuming and computationally costly.
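The training and evaluation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model from this chapter: a one-parameter linear model stands in for a deep network, the data are synthetic, and the names `train`, `cross_val_loss`, and the candidate learning rates are hypothetical choices made for the example.

```python
import random

def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b (the loss function)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr, epochs=50, batch_size=8, seed=0):
    """Minimise the MSE with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # new mini-batch composition every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradients of the MSE averaged over the current mini-batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def cross_val_loss(xs, ys, lr, k=5):
    """Average validation loss over k folds for one hyperparameter value."""
    n = len(xs)
    losses = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th sample forms the fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        val_x = [xs[i] for i in sorted(held_out)]
        val_y = [ys[i] for i in sorted(held_out)]
        w, b = train(train_x, train_y, lr)
        losses.append(mse(w, b, val_x, val_y))
    return sum(losses) / k

# Synthetic data (assumption): y = 3x + 0.5 plus a little Gaussian noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(100)]
ys = [3.0 * x + 0.5 + data_rng.gauss(0, 0.05) for x in xs]

# Hyperparameter search: pick the learning rate with the lowest CV loss.
candidates = [0.001, 0.01, 0.1]
scores = {lr: cross_val_loss(xs, ys, lr) for lr in candidates}
best_lr = min(scores, key=scores.get)
final_w, final_b = train(xs, ys, best_lr)
```

Each candidate value triggers k full training runs, which is why cross-validation becomes expensive for large networks.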
(i) Multi-Modal Learning: It has been demonstrated that combining several data
modalities can help predict protein-protein interaction sites. Local and global
sequence-based features may be captured via a sliding window and a text
CNN, respectively. Whether to fuse these features early or late in
the model design, and whether distinct input encoders are required, must be
decided. Furthermore, the relationship between the various modalities must be
considered throughout the computation. Feature ablation studies can help
determine the appropriate contribution of each modality.
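The difference between early and late fusion can be illustrated with a small sketch. Everything here is an assumption made for the example: the hydrophobic residue grouping, the hand-written feature functions (a simple composition vector stands in for the text CNN), and the mixing weight `alpha`.

```python
HYDROPHOBIC = set("AVILMFWY")  # illustrative grouping, not a fixed standard

def sliding_windows(seq, size, pad="X"):
    """One window of odd length `size` centred on each residue."""
    half = size // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + size] for i in range(len(seq))]

def local_features(window):
    """Local modality: hydrophobic fraction around one residue."""
    return [sum(ch in HYDROPHOBIC for ch in window) / len(window)]

def global_features(seq):
    """Global modality: whole-sequence summary (stand-in for a text CNN)."""
    return [sum(ch in HYDROPHOBIC for ch in seq) / len(seq), len(seq) / 1000.0]

def early_fusion(seq, size=3):
    """Concatenate local and global features before any model sees them."""
    g = global_features(seq)
    return [local_features(w) + g for w in sliding_windows(seq, size)]

def late_fusion(local_scores, global_score, alpha=0.5):
    """Combine per-modality predictions after separate models have run."""
    return [alpha * s + (1 - alpha) * global_score for s in local_scores]

windows = sliding_windows("MKTAY", 3)
fused_inputs = early_fusion("MKTAY")
fused_scores = late_fusion([1.0, 0.0], 0.5, alpha=0.5)
```

Setting `alpha` to 0 or 1 in `late_fusion` removes one modality entirely, which is a crude form of the feature ablation mentioned above.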
Literature Review
This section outlines the proposed methodology.
Step 1: Start from a basic amino acid bank (for example, the 20 standard amino
acids). Every protein is built from these basic building blocks.
Step 4: After training on the training dataset (a 70%-80% segment of the full
prepared data), the model is tested for accuracy using the test dataset (the
remaining 20%-30% of the full data).
Step 5: The model is validated with a dataset that it has never seen. After
this step, the model is packaged with its dependencies and is ready to be
deployed and used.
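Steps 1 and 4 can be sketched as follows. This is an illustrative example only: the toy sequences, the one-hot encoding, and the helper names `one_hot` and `split_dataset` are assumptions, not the encoding or data used in this work.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard AAs
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode a sequence as a list of 20-dimensional one-hot rows."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows

def split_dataset(items, train_frac=0.8, seed=0):
    """Shuffle, then split into a training part and a held-out test part."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# Toy sequences (made up for illustration).
sequences = ["MKTAYIAK", "GAVLIPFW", "STCMNQDE", "KRHDEGAV", "MKKLLPTA",
             "ACDEFGHI", "KLMNPQRS", "TVWYACDE", "FGHIKLMN", "PQRSTVWY"]

train_set, test_set = split_dataset(sequences, train_frac=0.8)
encoded = one_hot(train_set[0])
```

The validation set of Step 5 can be carved out in the same way, provided it is separated before any training so that the model never sees it.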
Analysis
1. Data Imbalance and Representativeness: Despite efforts to resolve data
imbalance through preprocessing, maintaining total representativeness
remains difficult, particularly for diverse datasets. This constraint may
have an impact on the generalizability and application of prediction models
to a wide range of protein binding events.
2. Feature Engineering and Dimensionality Reduction: While feature
engineering and dimensionality reduction approaches such as PCA are
useful, they may fail to capture all important chemical properties or cause
information loss. This constraint may have an influence on the accuracy
and completeness of molecular characteristics retrieved for protein binding
analysis.
3. Quality and Quantity of Labelled Data: Supervised learning algorithms,
such as Random Forests and SVMs, heavily rely on the quality and quantity
of labelled data for training. Insufficient or biased data can limit the
performance and robustness of these models in predicting protein binding
patterns accurately.
4. Complexity and Resource Demands: Ensemble methods like bagging
and boosting enhance model resilience but can increase complexity and
computational demands. This complexity might pose challenges in
deployment and scalability, particularly for real-time applications or large-
scale datasets.
5. Interpretability and Generalizability: Unsupervised techniques,
including clustering algorithms, aid in pattern identification but may
struggle with noisy or sparse data, leading to less interpretable results.
Ensuring both interpretability and generalizability across diverse protein
binding scenarios remains a challenge.
6. Integration of Multi-Modal Data: While integrating multi-modal data
sources contributes to a comprehensive analysis, it introduces challenges
in data fusion, compatibility, and alignment. Ensuring seamless integration
and meaningful interpretation of data from various sources is crucial for
accurate and actionable insights in protein binding analysis.
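One common way to reduce the effect of the data imbalance noted in point 1 is to weight each class inversely to its frequency during training. The sketch below shows that heuristic on made-up labels. It is one possible mitigation, not the preprocessing used in the studies reviewed here, and the function name `class_weights` is hypothetical.

```python
from collections import Counter

def class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up imbalanced labels: 90 non-binding (0) and 10 binding (1) examples.
labels = [0] * 90 + [1] * 10
weights = class_weights(labels)
```

With these weights, an error on a rare binding example costs nine times as much as an error on a common non-binding example, so the loss is no longer dominated by the majority class.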
Conclusion