
Form 1: PhD Research Proposal

Applicant’s Name: Md. Inzamam-Ul-Hossain

Working title or area of study:


Prediction of Essential Protein Using Metaheuristic Algorithm (Chemical Reaction Optimization) and
Machine Learning Techniques

Background:
Bioinformatics develops procedures and tools for understanding biological data with the help of fields such as computer
science [1,2]. It combines knowledge from biology, computer science, mathematics, statistics and related fields to analyze
and understand biological data. Many bioinformatics methods have been developed to analyze the structure, function
and evaluation of proteins [3]. Proteins play an important role in all activities of the cell, which is also called the
smallest unit of life [4]. Different types of proteins are important for different activities of organisms. Proteins
can be classified as essential or non-essential. Essential proteins are needed for growth, and the
lack of such a protein can cause lethality, infertility, dysfunction or cell death [5]. Without essential proteins, an
organism cannot even reproduce [9]. Essential protein prediction is an important issue in pharmaceuticals, because some
bacterial essential proteins, whose loss is lethal to the bacteria, are used as drug targets [10]. It also helps to
determine the minimal set of proteins that a cell requires. Such prediction can also help to find the reasons for
impaired growth of organs.

Many experimental and computational methods have been designed to predict and identify essential
proteins. The experimental methods identify essential proteins through single gene knockouts, conditional
knockouts and RNA interference, which are very expensive and time-consuming. Furthermore, the
experimental methods are not suitable for all organisms, such as humans. With the development of high-throughput
experimental technologies, different types of genome-related data, such as Protein-Protein
Interaction (PPI) data, cellular localization data, protein sequence data and gene expression data,
have become available. Many features of essential proteins have been discovered by analyzing this biological
information, and a large number of computational methods make use of these features to predict essential proteins
[10]. One of the aims of essential protein identification is to determine how important a protein is for normal cellular
function or for sustaining life. Machine learning approaches have been observed to be an effective way to classify
essential proteins.

Protein-Protein Interaction (PPI) networks help to understand cell function, disease mechanisms and drug design [6]. A PPI
network captures both biological and topological information. In this network, proteins are represented as nodes
and their interactions are represented as edges. PPI networks are mathematical representations of the physical contacts
between proteins in a cell [7]. Many studies have used PPI networks to differentiate essential and non-essential
proteins [5].
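
The node-and-edge representation described above can be sketched in a few lines of Python. This is only a minimal
illustration, not part of the proposed method: the protein identifiers and interactions are made up, the function names
are hypothetical, and degree centrality is used only as one example of a topological feature that can be read off such
a network.

```python
from collections import defaultdict

def build_ppi_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def degree_centrality(network):
    """Degree centrality: number of neighbours divided by (n - 1)."""
    n = len(network)
    if n < 2:
        return {protein: 0.0 for protein in network}
    return {protein: len(neighbours) / (n - 1)
            for protein, neighbours in network.items()}

# Hypothetical toy interactions (not taken from DIP or any real dataset)
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_ppi_network(toy_interactions)
centrality = degree_centrality(network)
```

In this toy network, P1 interacts with every other protein and therefore receives the highest centrality, which is the
kind of topological signal that essential protein predictors commonly exploit.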

Many studies in the literature have addressed essential protein prediction. Some of the recent works are
briefly described below.

Feature selection is an approach by which candidate features are selected from the original features [8]. It helps to get
better results, improve learning speed and simplify the representation. The task of feature selection is to choose a small
subset of features that successfully describes the target [11]. The goal of feature selection is to find features that
have low dimension, hold sufficient information, enhance separability and are comparable within the same category. Feature
selection helps to discard divergent and unrelated data so that the remaining important data can be used for data mining.
In the literature, feature selection has been used to select features (among topological, biological and composed) and
predict essential proteins [10]. Ming Fang et al. [13] proposed the Elite Search mechanism-based Flower Pollination
Algorithm (ESFPA) to identify protein essentiality. The study first chooses the features that are related to protein
essentiality. The ESFPA algorithm is then used to obtain the optimal features. Finally, the optimal features are fed to an
ensemble method to determine protein essentiality. Using a data mining approach, the ensemble method combines
Logistic Model Trees, REPTree, the J48 decision tree, Random Tree, Random Forest and Naïve Bayes classifiers. The
method is reported to give better results than existing methods for predicting protein essentiality.
Data mining (DM) is the process of summarizing data and analyzing it for a specific purpose
[12]. It is also called knowledge discovery. As the name suggests, it extracts useful information from scattered
information in a meaningful way. DM helps to relate the fields of the data, which in turn helps
to find the important factors used in decision making. DM techniques have made data analysis
easier than it is by other means. Many applications use these techniques, for example in business, disease detection,
time series analysis, bioinformatics and fraud detection. DM helps us to find unseen information and hidden predictive
information in presently available data. At present, a huge amount of research is going on in bioinformatics,
which generates a great amount of valuable information. Chiou-Yi Hor et al. [5] proposed a modified backward
feature selection method and built support vector machine (SVM) predictors to predict essential proteins
using Saccharomyces cerevisiae and Escherichia coli data. The method uses the LIBSVM software as the SVM
classifier. The features used are sequence properties (amino acid occurrence and average amino acid
PSSM), protein properties (cell cycle and metabolic process), topological properties (bit string of double screening
scheme and betweenness centrality related to physical interactions) and other properties (phyletic retention and
essentiality index). For the S. cerevisiae dataset, different subsets of the 60 features are selected for essential
protein prediction. PR (phyletic retention), EI (essentiality index), Cytoplasm, Nucleus, Occurrence of A.A.I, Bit String
of DSS and Occurrence of A.A.W are the features most commonly selected by the method across these
subsets. The method is compared with two existing methods, mRMR
and CMIM, and is shown to select fewer features and to perform better than both.

Jiancheng Zhong et al. [10] proposed a feature selection method for essential protein prediction. The method
uses Support Vector Machine-Recursive Feature Elimination (SVM-RFE) to select the feature space for essential protein
prediction. Among 26 features, those that share biological meaning are removed by
applying the Pearson Correlation Coefficient (PCC). The method identifies six features as the feature space by first
making a rank list using SVM-RFE and then rearranging the rank list using PCC. It uses the S. cerevisiae (baker's yeast)
dataset and its corresponding PPI dataset. The method compares its results with other machine learning
methods such as Naive Bayes, Bayes Network and NBTree, and shows that the 6 selected features perform better than other
combinations of features (all features and only 8 features).
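
The idea of removing features that carry overlapping information with the PCC can be sketched as follows. This is a
hedged illustration rather than the procedure of the cited paper: it assumes a ranked feature list is already available
(for example from SVM-RFE, which is not implemented here), and the 0.8 threshold, the function names and the toy columns
are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return cov / math.sqrt(var_x * var_y)

def drop_redundant(ranked_features, columns, threshold=0.8):
    """Walk the ranking from best to worst and keep a feature only if its
    absolute correlation with every feature kept so far is below threshold."""
    kept = []
    for name in ranked_features:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns; "f2" is an exact multiple of "f1"
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 4.0, 6.0, 8.0],
    "f3": [1.0, -1.0, 1.0, -1.0],
}
selected = drop_redundant(["f1", "f2", "f3"], columns)
```

Because the ranking is processed from the top, a redundant feature is always dropped in favour of the higher-ranked
feature it correlates with.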

Wei Liu et al. [4] proposed an improved particle swarm optimization method (EPPSO) to detect essential proteins. The
method modifies the update rules for the velocity vector and particle position of particle swarm optimization. In this
method, the top p essential proteins are predicted based on an index measurement. The method combines PPI
network topological and biological properties to improve the results, and shows better results in terms of speed,
accuracy and number of essential proteins detected. The yeast dataset used was collected from the DIP database
and the gene expression data were collected from the GEO database. Based on statistical analysis, the
method is observed to detect essential proteins more accurately than the other algorithms compared.

Albert Y. S. Lam et al. [14] presented a tutorial on Chemical Reaction Optimization (CRO) and reported that CRO can
perform better than other existing optimization algorithms. CRO can be applied to discrete and continuous
problems and can be used to solve several types of problems. It can be used to find near-optimal solutions
to NP-hard problems. Many metaheuristic algorithms are available, but no single one is good for all types of
problems. CRO is a population-based metaheuristic that has been used to solve many problems successfully. It can
be adapted to user requirements by adjusting the implementation of the algorithm. The algorithm is easy to
implement in an object-oriented programming language such as Java or C++. Some of the important
characteristics of CRO are: different operators can be deployed to suit different problems, it can solve some problems
that other algorithms cannot, operators are easy to design, it combines advantages of Simulated Annealing (SA) and
Genetic Algorithm (GA), it is easy to implement, and it can be run in parallel. CRO has three stages: initialization,
iterations and the final stage.
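
The three stages can be made concrete with a compact sketch of a CRO loop for a binary feature mask. It follows the
general scheme of the four elementary reactions (on-wall ineffective collision, decomposition, inter-molecular
ineffective collision and synthesis) with an energy buffer, but it is a simplified illustration under assumptions: the
parameter values, the bit-flip and half-randomizing operators, the function names and the toy objective are all
hypothetical choices, not those of the tutorial or of the proposed method.

```python
import random

def cro_feature_search(objective, n_features, pop_size=8, iterations=300, ke_loss_rate=0.2,
                       mole_coll=0.3, initial_ke=1.0, alpha=15, beta=0.1, seed=0):
    """Minimise objective(mask) over 0/1 feature masks with a small CRO loop."""
    rng = random.Random(seed)

    def molecule(mask, pe, ke):
        return {"mask": mask, "pe": pe, "ke": ke, "hits": 0, "min_pe": pe, "min_hit": 0}
    def random_bits(k):
        return [rng.randint(0, 1) for _ in range(k)]
    def flip_one(mask):  # neighbourhood operator: toggle a single feature
        i = rng.randrange(n_features)
        return mask[:i] + [1 - mask[i]] + mask[i + 1:]
    def move(m, mask, pe, ke):  # accept a new structure, track the molecule's own best
        m.update(mask=mask, pe=pe, ke=ke)
        if pe < m["min_pe"]:
            m["min_pe"], m["min_hit"] = pe, m["hits"]

    # Stage 1: initialization
    masks = [random_bits(n_features) for _ in range(pop_size)]
    pop = [molecule(mask, objective(mask), initial_ke) for mask in masks]
    buffer = 0.0  # central energy buffer
    best = min(pop, key=lambda m: m["pe"])
    best_mask, best_pe = list(best["mask"]), best["pe"]

    # Stage 2: iterations
    for _ in range(iterations):
        if rng.random() > mole_coll or len(pop) < 2:
            m = rng.choice(pop)
            if m["hits"] - m["min_hit"] > alpha:
                # Decomposition: each child keeps one half and randomises the other
                half = n_features // 2
                c1 = m["mask"][:half] + random_bits(n_features - half)
                c2 = random_bits(half) + m["mask"][half:]
                pe1, pe2 = objective(c1), objective(c2)
                surplus = m["pe"] + m["ke"] - pe1 - pe2
                if surplus < 0:
                    borrowed = rng.random() * rng.random() * buffer
                    if surplus + borrowed >= 0:  # draw energy from the buffer
                        surplus, buffer = surplus + borrowed, buffer - borrowed
                if surplus >= 0:
                    share = rng.random()
                    pop = [x for x in pop if x is not m]
                    pop.append(molecule(c1, pe1, surplus * share))
                    pop.append(molecule(c2, pe2, surplus * (1 - share)))
                else:
                    m["hits"] += 1
            else:
                # On-wall ineffective collision: small move, part of the KE is lost
                cand = flip_one(m["mask"])
                pe = objective(cand)
                m["hits"] += 1
                surplus = m["pe"] + m["ke"] - pe
                if surplus >= 0:
                    kept = rng.uniform(ke_loss_rate, 1.0)
                    buffer += surplus * (1 - kept)
                    move(m, cand, pe, surplus * kept)
        else:
            m1, m2 = rng.sample(pop, 2)
            total = m1["pe"] + m1["ke"] + m2["pe"] + m2["ke"]
            if m1["ke"] <= beta and m2["ke"] <= beta:
                # Synthesis: merge two molecules bit by bit
                cand = [rng.choice(pair) for pair in zip(m1["mask"], m2["mask"])]
                pe = objective(cand)
                if total >= pe:
                    pop = [x for x in pop if x is not m1 and x is not m2]
                    pop.append(molecule(cand, pe, total - pe))
                else:
                    for m in (m1, m2):
                        m["hits"] += 1
            else:
                # Inter-molecular ineffective collision: both molecules move slightly
                c1, c2 = flip_one(m1["mask"]), flip_one(m2["mask"])
                pe1, pe2 = objective(c1), objective(c2)
                for m in (m1, m2):
                    m["hits"] += 1
                surplus = total - pe1 - pe2
                if surplus >= 0:
                    share = rng.random()
                    move(m1, c1, pe1, surplus * share)
                    move(m2, c2, pe2, surplus * (1 - share))
        current = min(pop, key=lambda m: m["pe"])
        if current["pe"] < best_pe:
            best_mask, best_pe = list(current["mask"]), current["pe"]

    # Stage 3: final stage, report the best structure seen
    return best_mask, best_pe

# Hypothetical toy objective: number of positions that differ from a target mask
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]
def toy_objective(mask):
    return sum(a != b for a, b in zip(mask, TARGET))

best_mask, best_pe = cro_feature_search(toy_objective, len(TARGET))
```

In a real feature selection setting the objective would instead score a feature mask, for example by the
cross-validated error of a classifier trained on the selected features, and the iteration stage would stop on whichever
criterion the study adopts.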

Aim of the proposal:


Based on the above background, the aim is to predict essential proteins using a metaheuristic algorithm (such as CRO) and
then use machine learning techniques to evaluate the performance of the prediction.

Objective of the proposal:


The following objectives are set for the proposal:
- Collect a reliable protein-protein interaction (PPI) dataset.
- Identify the best optimization algorithm for predicting essential proteins.
- Identify machine learning methods for measuring the performance of essential protein prediction.
- Compare performance with existing related methods.

Materials and Methods:


For the prediction of essential proteins, the protein-protein interaction network dataset can be collected from the DIP
database (https://dip.doe-mbi.ucla.edu/dip/Download.cgi?SM=7) and gene expression data can be collected from GEO
(https://www.ncbi.nlm.nih.gov/geo/). The dataset may contain many features which, if all are considered, may divert the
solution away from a correct prediction. So, these two datasets will be fed to a metaheuristic algorithm such as CRO to
rank the features according to the objective function. This ranking helps to understand which feature space is suitable
for essential protein prediction. The metaheuristic algorithm yields a near-optimal solution. This solution is the final
outcome, obtained once a threshold is satisfied or when further iterations no longer improve the
objective function.

After the solution is obtained from the final stage of the algorithm, an ensemble method can be built from the
classifiers that perform best on the selected features. The Weka software [15]
can be used to apply the ensemble method. Figure 1 shows the proposed method for predicting essential proteins. In this
method, 10-fold cross-validation can be used: the dataset is divided into 10 folds, of which 9 folds
are used for training and the remaining fold is used for testing, so that each fold is used once for testing. The
ensemble method can combine any number of classifiers that show good performance on the dataset.
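
The fold splitting and the combination of classifier outputs described above can be sketched in plain Python. This is an
illustrative sketch only and not how Weka implements either step: the split is unstratified, simple majority voting is
an assumed combination rule (the proposal does not fix one), and the classifier outputs shown are invented.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def majority_vote(predictions):
    """Combine one label list per classifier into one label per sample."""
    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # On a tie, the tied label that appears first in classifier order wins
        combined.append(next(label for label in labels if counts[label] == top))
    return combined

splits = list(k_fold_indices(50, k=10))

# Hypothetical outputs of three classifiers ("E" = essential, "N" = non-essential)
votes = [["E", "N", "E", "N"],
         ["E", "E", "N", "N"],
         ["N", "E", "E", "N"]]
ensemble_labels = majority_vote(votes)
```

With 50 samples and 10 folds, every split trains on 45 samples and tests on the remaining 5. Because essential proteins
are usually the minority class, a stratified split may be preferable in practice.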

[Figure 1 flowchart. Box labels: PPI Network Dataset; All Features; Metaheuristic Algorithm (CRO); Optimal Feature Space;
Ensemble Method; 10-Fold Cross Validation; Performance Evaluation]

Figure 1: An overview of the Proposed Essential Protein Prediction Method

Significance:
The proposed method is expected to achieve better performance than the existing methods. To evaluate and compare the
performance of the proposed method, some statistical measurements can be used. The formulas of these measurements are
as follows:

Assume TP (True Positive) is the number of essential proteins correctly classified as essential, FN (False Negative) is
the number of essential proteins classified as non-essential, TN (True Negative) is the number
of non-essential proteins correctly classified as non-essential, and FP (False Positive) is the number of non-essential
proteins classified as essential. Then,

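\[
\begin{aligned}
\text{Sensitivity (SN)} &= \frac{TP}{TP + FN} \\
\text{Specificity (SP)} &= \frac{TN}{TN + FP} \\
\text{Positive Predictive Value (PPV)} &= \frac{TP}{TP + FP} \\
\text{Negative Predictive Value (NPV)} &= \frac{TN}{TN + FN} \\
\text{F-Measure} &= \frac{2 \cdot SN \cdot PPV}{SN + PPV} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{aligned}
\]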

For all of the above measures, a higher value indicates better performance. The proposed method is therefore expected to
achieve higher values than other existing methods for predicting essential proteins.
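
As a worked example of the measures above, the following sketch computes them from the four confusion-matrix counts. The
counts are invented for illustration, the function name is hypothetical, and returning 0.0 for an empty denominator is an
assumption of this sketch rather than part of the definitions.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute the evaluation measures from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a measure is undefined
        return numerator / denominator if denominator else 0.0

    sn = ratio(tp, tp + fn)    # sensitivity
    sp = ratio(tn, tn + fp)    # specificity
    ppv = ratio(tp, tp + fp)   # positive predictive value
    npv = ratio(tn, tn + fn)   # negative predictive value
    return {
        "SN": sn,
        "SP": sp,
        "PPV": ppv,
        "NPV": npv,
        "F": ratio(2 * sn * ppv, sn + ppv),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
    }

# Hypothetical counts for 50 essential and 50 non-essential proteins
metrics = classification_metrics(tp=40, fn=10, tn=35, fp=15)
```

For these counts the sensitivity is 0.8, the specificity is 0.7 and the accuracy is 0.75, so a method can look strong on
one measure while being weaker on another, which is why several measures are reported together.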

In addition to the above statistical measurements, performance will also be measured with respect to the
number of features used in the ensemble. First, the selected feature space will be used to predict essential proteins;
next, all features will be used; other numbers of features will also be examined to check which number
of features gives good performance.

References:
1. Bioinformatics: https://en.wikipedia.org/wiki/Bioinformatics [Accessed: 29/11/2018]
2. Interdisciplinarity: https://en.wikipedia.org/wiki/Interdisciplinarity [Accessed: 30/11/2018]
3. Protein: https://en.wikipedia.org/wiki/Protein [Accessed: 30/11/2018]
4. Wei Liu, Jin Wang, Ling Chen, Bolun Chen. Prediction of protein essentiality by the improved particle swarm optimization.
Soft Comput. 22(20): 6657-6669 (2018)
5. Hor CY, Yang CB, Yang ZJ, Tseng CT. Prediction of protein essentiality by the support vector machine with statistical
tests. Evol Bioinform Online. 2013;9:387-416. Published 2013 Oct 3. doi:10.4137/EBO.S11975
6. Vella D, Marini S, Vitali F, Di Silvestre D, Mauri G, Bellazzi R. MTGO: PPI Network Analysis Via Topological and
Functional Module Identification. Sci Rep. 2018;8(1):5499. Published 2018 Apr 3. doi:10.1038/s41598-018-23672-0
7. Protein-protein interaction networks:
https://www.ebi.ac.uk/training/online/course/network-analysis-protein-interaction-data-introduction/protein-protein-interaction-networks
[Accessed: 29/11/2018]
8. Luis Carlos Molina, Lluis Belanche, Angela Nebot: Feature Selection Algorithms: A Survey and Experimental Evaluation.
ICDM 2002: 306-313
9. R. S. Kamath, A. G. Fraser, Y. Dong, G. Poulin, R. Durbin, M. Gotta, A. Kanapin, N. Le Bot et al., Systematic functional
analysis of the Caenorhabditis elegans genome using RNAi, Nature, vol. 421, no. 6920, pp. 231-237, 2003.
10. Zhong J, Wang J, Peng W, et al. A feature selection method for prediction essential protein. Tsinghua Science and
Technology, 2015, 20(5): 491-499.
11. Selwyn Piramuthu, Evaluating feature selection methods for learning in data mining applications. European Journal of
Operational Research 156(2): 483-494 (2004)
12. Ipsita Bhattacharya, M. P. S. Bhatia (2010), SVM Classification to Distinguish Parkinson Disease Patients, Proceedings of
the 1st Amrita ACM-W Celebration on Women in Computing in India.
13. Fang M, Lei X, Cheng S, Shi Y, Wu FX. Feature Selection via Swarm Intelligence for Determining Protein Essentiality.
Molecules (Basel, Switzerland). 23. PMID 29958434 DOI: 10.3390/molecules23071569
14. Albert Y. S. Lam, Victor O. K. Li, Chemical Reaction Optimization: a tutorial - (Invited paper). Memetic Computing 4(1):
3-17 (2012)
15. Weka 3: Data Mining Software in Java: https://www.cs.waikato.ac.nz/ml/weka [Accessed: 29/11/2018]
Form 2: Consent of Supervisor

Full Time Part Time

Name of Applicant: Md. Inzamam-Ul-Hossain

Title of Research: Prediction of Essential Protein Using Metaheuristic Algorithm (Chemical Reaction
Optimization) and Machine Learning Techniques.

1. No. of PhD student(s) you are-

(i) Supervising 01 (ii) co-supervising 0

2. Please write down your expertise in the proposed study area (50-100 words).

Finding essential proteins from the protein-protein interaction network is a problem in bioinformatics. The applicant will
try to solve this problem using a metaheuristic algorithm and machine learning techniques. I have been doing research and
supervising students on different problems in the area of bioinformatics since 2003. Recently, my thesis students and I have
solved several problems in the area of bioinformatics using a metaheuristic algorithm called Chemical Reaction
Optimization (CRO) and obtained very good results. We have published more than 10 papers in this area in international
conferences and in local as well as international journals with impact factors. In several works we have also used machine
learning techniques. I also have a PhD student who is currently doing his research using machine learning techniques.
So, the proposed research area is within my expertise and knowledge of research.

3. How does the proposed research topic match with the Discipline’s research interest (50-100 words)?

Algorithms is one of the core courses in the computer science and engineering discipline. Algorithms are used to solve
different kinds of problems in various areas of science and engineering. On the other hand, machine learning
and data mining are recent study areas for the discipline. Identifying essential proteins from the protein interaction
network has become a hot research topic in proteomics and is a field of research in bioinformatics. Since a metaheuristic
algorithm, machine learning and data mining techniques will be used to solve the protein essentiality
problem, the research area is very relevant to the discipline.

4. Please describe the necessary infrastructure and research supports of the Discipline that might be necessary for
successful completion of the proposed research (50-100 words).

To accomplish this research, the student needs to take several relevant courses. The computer science and engineering
discipline has enough teaching resources to offer the relevant courses. This research needs only a computer, internet
access, compilers for one or more programming languages, and the MATLAB and WEKA software. The discipline has
enough facilities to support this research.
