Protein Classification Using Hybrid Feature Selection Technique


Upendra Singh and Sudhakar Tripathi

Department of Computer Science and Engineering,
National Institute of Technology Patna, Patna, Bihar, India
dobestupendra@gmail.com, stripathi.cse@nitp.ac.in

Abstract. Protein function prediction is a challenging classification problem, and computational methods are vital for performing the function prediction of proteins. Various feature selection techniques have been proposed for this purpose by eminent researchers, but several of these techniques are model-specific or designed for a particular type of problem. In this paper we make a comparative analysis of different supervised machine learning methods for predicting the functional classes of proteins using a set of physicochemical features. For attribute (feature) selection we use a novel hybrid feature selection technique that overcomes some of the limitations of existing techniques, and we also present a comparative analysis of the classification of enzyme functions or families using different computational intelligence techniques with the proposed hybrid feature selection.

Keywords: Classification · Machine learning · Data mining · Feature selection · Function prediction

1 Introduction
Protein function prediction is a very challenging task in bioinformatics. Functional knowledge of a protein is crucial for designing or evolving new approaches in biological processes. Protein function prediction methods are used to assign biological or biochemical roles to proteins. The information required for prediction may come from nucleic acid sequence homology [1], protein domain structures, gene expression profiles, protein-protein interactions, etc. Protein function prediction mainly involves sequence-similarity-based approaches, structure-based approaches, or approaches using both sequence and structural properties. Experiment-based protein function prediction requires a large amount of experimental resources and human effort to analyze a single protein. Computational techniques capable of exploring the functional discovery of proteins dramatically reduce laboratory testing and also provide an efficient way of predicting function. In earlier times, homology-based approaches were used for the prediction of protein function, but these approaches were unable to perform accurately when a new protein was dissimilar to previously known ones. Hence, several computational techniques have evolved during the past several years to address these types of classification problems.


The most popular computational methods for protein function prediction are based on amino acid sequences, since this information is available for most existing proteins. However, sequence information alone is not always adequate for prediction and tends to achieve a lower percentage of accuracy. The assumption behind this approach is that different proteins or enzymes with similar sequences preserve the same functionality.

2 Related Work
Extensive research has been done on protein functional classification. In recent decades the prime focus has been on functional class prediction based on either sequence-derived features or structural features. Several classification techniques such as SVM, ANN, and decision trees have been used in earlier studies [1–5]. SVM has shown significant performance compared to other traditional machine learning techniques, for example neural networks (NNs), in various domains. However, SVM behaves much like a black-box model: it does not produce comprehensible models that explain how the predictions are made. In recent decades many studies have focused on protein structure prediction using machine learning techniques such as neural networks or support vector machines and have attained a worthy level of accuracy [6–10]. Nevertheless, these methods do not reveal how the learning was performed or why a particular decision was taken. The ability to explain why and how a conclusion is reached is very important for the acceptance of machine learning technology. Kumar et al. [11] used structural properties and a sequence-derived feature set to determine the enzyme functional class and subclass using a support vector machine. They used a three-tier model: at the first level the queried protein is distinguished as enzyme or non-enzyme, at the second level the enzyme functional classification is carried out, and at the third level the sub-functional classification is performed. Lou et al. [12] used sequence-based information to find DNA-binding sites using support vector machines. Dobson and Doig [8] proposed a method that uses the EC number to assign function with the help of protein structure; they used a one-class-against-one-class strategy with SVM for protein function prediction and obtained accuracies in the range of 35–60%. Paliwal et al. [7] explored physicochemical features of the amino acids together with PSSM information concerning evolutionary activity to carry out feature extraction; these features were used for protein structural class prediction with an ensemble classifier over four different benchmarks, and classification performance was evaluated using 10-fold cross-validation. Liu et al. [13] proposed a method using random forests on amino acid sequences to determine DNA-binding proteins and to predict DNA-binding residues. The authors claim that a novel hybrid approach to feature selection yields better prediction performance.

3 Methodology
Various classification techniques and feature selection methods are used for protein function prediction. This section contains a brief description of the different classification techniques and feature selection methods used here for the functional classification of proteins.

– SVM: Traditionally, SVM was used for binary classification, but in recent years it has been widely applied to multi-class problems [14]. In SVM we construct a set of hyperplanes that separate the class members. For multi-class classification, where a linear separation of the data is not possible, SVM uses kernel functions; the most commonly used kernels are sigmoid, RBF, polynomial and linear. One-against-all and all-vs-all are the main approaches for multi-class classification with SVM. The basic steps involved in solving a classification problem with a Support Vector Machine can be described briefly as follows. First, with the help of a non-linear mapping function, it transforms the input space into a high-dimensional feature space. It then constructs a hyperplane that separates the data instances with the maximum distance from the nearest training points [15]. (An illustrative sketch comparing the four classifiers used here is given after this list.)
– Random Forest: This is an ensemble classifier [16–18] which builds a number of decision trees. Each individual tree predicts the output class for a test instance, and the class predicted by the majority of the trees is taken as the final prediction. Given feature attributes p1, p2, p3, ..., pn, the size of the feature subset considered at each split is sqrt(n), and the number of trees generated is 100 (user specified).
– Radial Basis Function Network: This applies logistic regression to k-means clusters used as basis functions. The radial basis function network uses k-means when constructing the input layer, and the hidden layer of the network uses a radial basis function:
f_j(x) = \sum_{i=1}^{n} W_{ij} \, r_i(x)    (1)

where f_j is the function corresponding to the j-th output unit, W_{ij} is the corresponding weight, and r_i is the radial basis function.
– Naive Bayes: This is a very basic approach for supervised classification. In Naive Bayes the conditional probability is calculated for each attribute; the joint conditional probability of the attributes is then obtained by the product rule. The well-known Bayes rule is used to derive the conditional probability of the class variable, which represents the target class, and the output class is the one with the highest posterior probability:
P(C \mid A_1, A_2, \ldots, A_n) = \frac{P(C) \prod_{i=1}^{n} P(A_i \mid C)}{P(A_1, A_2, \ldots, A_n)}    (2)

– Feature Selection: Feature selection refers to attribute selection or attribute subset selection. The main purpose of a feature selection technique is to remove irrelevant and redundant features and to reduce the dimensionality of the data [1].
The WEKA tool provides a list of different feature selection algorithms, from which we have applied four feature selection methods for efficient feature selection.
– CfsSubsetEval: This feature selection technique selects features that have a high correlation with the target class and preferably a low correlation among themselves. Pearson's correlation coefficient [13] is used to evaluate the relevance of an attribute; it is the covariance of two variables divided by the product of their standard deviations. (A rough sketch of this score, and of information gain, is also given after this list.)

\rho_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}    (3)

where cov is the covariance and \sigma is the standard deviation of the data.


– InfoGainAttributeEval: This is an attribute selection method built into the WEKA tool. It computes the significance of an attribute by measuring the information gain with respect to the class:
\mathrm{InformationGain}(\mathrm{Class}, \mathrm{Attribute}) = H(\mathrm{Class}) - H(\mathrm{Class} \mid \mathrm{Attribute}), \quad \mathrm{Gain} = H(y) - H(y \mid x) = H(x) - H(x \mid y)    (4)
where H denotes entropy, a measure of the impurity of the dataset, calculated as:
H(x) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)    (5)
– FilteredSubsetEval: This feature selection method provides a subset of features by applying an inbuilt arbitrary filter. The filter used here does not change the order or the number of attributes.
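The four classifiers above were run through WEKA and KNIME in this work; purely as a rough, non-authoritative illustration of the same comparison, the sketch below uses scikit-learn stand-ins (an RBF-kernel SVC for SVM, GaussianNB for Naive Bayes, a RandomForestClassifier with sqrt-sized feature subsets and 100 trees, and an RBFSampler-plus-logistic-regression pipeline as a loose analogue of the RBF network). The file name protein_features.csv, the column names and all parameter values are assumptions, not taken from the paper.

# Hedged sketch: comparing the four classifiers with stratified k-fold cross-validation.
# Assumes a CSV containing the twelve physicochemical attributes and an "enzyme_class" label.
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("protein_features.csv")      # hypothetical input file
X = data.drop(columns=["enzyme_class"])         # the 12 physicochemical attributes
y = data["enzyme_class"]                        # Transferase, Hydrolase, ...

models = {
    "Naive Bayes": GaussianNB(),
    "RBF-network-like": make_pipeline(StandardScaler(),
                                      RBFSampler(gamma=0.5, random_state=0),
                                      LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                            random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)   # 8-fold, as used in Sect. 4
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

The 8-fold setting mirrors the evaluation described in Sect. 4; the scaling, gamma and random seeds are arbitrary illustrative choices.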
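Likewise, the relevance scores behind CfsSubsetEval and InfoGainAttributeEval (Eqs. 3–5) can be approximated outside WEKA. The snippet below is a minimal sketch under the same assumed file and column names: it computes Pearson's correlation of each attribute with a crude numeric encoding of the class and an information-gain-style score via mutual information; it is not the exact WEKA computation.

# Hedged sketch of the two relevance measures: Pearson correlation (Eq. 3) and
# an information-gain-style score estimated with mutual information (Eqs. 4-5).
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("protein_features.csv")      # hypothetical input file
X = data.drop(columns=["enzyme_class"])
y = data["enzyme_class"]
y_codes = y.astype("category").cat.codes        # crude numeric encoding of the class

# Pearson correlation per attribute: cov(x, y) / (sigma_x * sigma_y)
for col in X.columns:
    x = X[col].to_numpy(dtype=float)
    rho = np.cov(x, y_codes)[0, 1] / (x.std(ddof=1) * y_codes.std(ddof=1))
    print(f"{col}: correlation with class = {rho:.3f}")

# Information-gain-style ranking via mutual information with the class labels
ig = mutual_info_classif(X, y, random_state=0)
for col, score in sorted(zip(X.columns, ig), key=lambda t: -t[1]):
    print(f"{col}: mutual information = {score:.3f}")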

4 Implementation
Protein functional classification involves several steps: data acquisition, data preprocessing, feature selection and classification.
Data Acquisition: We acquired the raw data from the RCSB protein server. We use twelve physicochemical features or attributes in this paper for the comparative analysis. These are structure-based attributes: Angle Beta, Z-no., Resolution, Structure Molecular wt., Residue count, Average B Factor, Refinement resolution, Ligand mw, pH value, Percentage solvent content, Chain length and Molecular wt.
Dataset Description: The dataset contains the following numbers of instances from the different classes, as shown in Table 1.
Data Pre-processing: Since this is a multi-class classification problem, we considered a sampling technique so that we obtain sufficient numbers of training and test instances from each class.

Table 1. Dataset description

Enzyme class       Transferase   Oxidoreductase   Lyase   Ligase   Isomerase   Hydrolase
No. of instances   3711          1380             632     414      362         3363

After preprocessing the raw data we obtain a labeled, refined dataset with unequal numbers of instances in the different classes. Hence, we applied stratified sampling to obtain a balanced dataset sample, as sketched below.
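A minimal sketch of this stratified sampling step, assuming pandas, the same hypothetical protein_features.csv, and an illustrative per-class sample size (the value 350 is not from the paper):

# Hedged sketch: draw an equal-sized sample from each enzyme class to balance the data.
import pandas as pd

data = pd.read_csv("protein_features.csv")      # hypothetical refined, labeled dataset
per_class = 350                                 # illustrative per-class sample size

balanced = (
    data.groupby("enzyme_class", group_keys=False)
        .apply(lambda g: g.sample(n=min(per_class, len(g)), random_state=0))
        .reset_index(drop=True)
)
print(balanced["enzyme_class"].value_counts())  # roughly equal counts per class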
After data preprocessing we applied different feature selection algorithms in WEKA to find the attributes most relevant for predicting the target class.
Proposed Method for Feature Selection: In our proposed feature selection method we use the frequency of occurrence of each attribute across the four feature selection methods applied above. Counting these frequencies gives Molecular weight, Chain length, Structure Molecular weight and Ligand molecular weight with a frequency of 4, and Residue count, Angle Beta and Percentage solvent content with a frequency of 3. This subset of 7 features is selected by the hybrid feature selection technique. We arrange the attributes in descending order of their frequency count; when the count is equal for two or more attributes, they are ordered by information gain, highest first.
Table 2. Feature selection applied

Feature selection technique and features selected:
InfoGainAttributeEval: Molecular wt, Chain length, Residue count, Ligand mw, Angle Beta, pH value, Z-no, Percentage solvent content, Resolution, Refinement resolution, Average B Factor
ClassifierSubsetEval: Angle Beta, Z-no, Resolution, Residue count, Average B Factor, Ligand mw, pH value, Percentage solvent content, Chain length
FilteredSubsetEval: Angle Beta, Residue count, Structure MW, Ligand mw, Chain length, Molecular weight
CfsSubsetEval: Angle Beta, Residue count, Structure MW, Ligand mw, Chain length, Molecular weight

Table 3. Result summary on imbalanced data without using feature selection

Classification technique   Accuracy (%)   Precision   Sensitivity   F-measure
Naive Bayes                33.13          0.44        0.331         0.344
RBF network                47.95          0.466       0.48          0.447
Random Forest              80.6125        0.806       0.806         0.801
SVM                        60.21          0.58        0.60          0.59

To obtain a relevant and significant feature subset, we then apply a backward elimination process on the ranked attributes: the lowest-ranked feature is removed iteratively until the desired level of accuracy is reached (a rough sketch of this procedure is given at the end of this section). Table 2 shows the feature selection methods applied here. To make a comparative analysis of enzyme classification we used four classification techniques: SVM, RBFN, Random Forest and Naive Bayes. For simulation and analysis we used the open-source data mining tools WEKA and KNIME. We used the 8-fold cross-validation method in WEKA to assess and test the models. Since we have enough data samples from each class after sampling, the possibility of biased results is reduced. We iteratively checked the results for K = 5 to 10 and obtained the most significant results at K = 8 in the k-fold validation.
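As a rough, non-authoritative illustration of the proposed hybrid scheme, the sketch below counts how many of the four selector outputs chose each attribute, ranks attributes by that count with information gain as the tie-breaker, and then backward-eliminates the lowest-ranked attribute while accuracy does not drop. The selector outputs, the evaluate_accuracy callback, the tolerance and the example values are assumptions made for illustration, not the paper's exact implementation.

# Hedged sketch of the hybrid feature selection: frequency ranking + backward elimination.
from collections import Counter

def hybrid_rank(selected_by_method, info_gain):
    """Rank attributes by how many selectors chose them; ties broken by information gain."""
    counts = Counter(f for feats in selected_by_method.values() for f in set(feats))
    return sorted(counts, key=lambda f: (-counts[f], -info_gain.get(f, 0.0)))

def backward_eliminate(ranked, evaluate_accuracy, tolerance=0.005):
    """Iteratively drop the lowest-ranked feature while accuracy stays within tolerance."""
    current = list(ranked)
    best = evaluate_accuracy(current)
    while len(current) > 1:
        candidate = current[:-1]                 # remove the lowest-ranked feature
        acc = evaluate_accuracy(candidate)
        if acc + tolerance < best:               # stop once accuracy drops noticeably
            break
        best, current = max(best, acc), candidate
    return current, best

# Example wiring (the selector outputs and the scoring callback are placeholders):
# selected = {"CfsSubsetEval": [...], "InfoGainAttributeEval": [...],
#             "FilteredSubsetEval": [...], "ClassifierSubsetEval": [...]}
# ranked = hybrid_rank(selected, info_gain={"Molecular wt": 0.41, ...})
# subset, acc = backward_eliminate(ranked, evaluate_accuracy=my_cv_accuracy_fn)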

5 Result and Analysis

For classification we consider two cases and compare their accuracy.

Case-1. Take all attributes and perform classification without feature selection or data balancing.
Case-2. Apply the various feature selection algorithms and determine the rank of the attributes [19, 20] based on their frequency of occurrence; select the first 7 attributes and apply the classification algorithms to the selected attributes, both with the imbalanced dataset and with a random-sampled dataset.

We compute the accuracy as well as other performance measures for the two cases and analyze which scenario is more cost effective and gives the best result. Table 3 shows the accuracy of the different classification techniques on the protein dataset without using any feature selection or any data balancing or sampling approach such as random sampling. Table 4 shows the classification accuracy of the same classifiers using the hybrid feature selection, with the same imbalanced dataset and with a random-sampled dataset (Fig. 1).
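The precision, sensitivity and F-measure figures in Tables 3 and 4 are presumably per-class measures aggregated over the six enzyme classes; the paper does not state the exact averaging convention, so the sketch below simply shows one plausible way to compute such a summary from predicted labels, using scikit-learn's weighted averaging.

# Hedged sketch: accuracy, precision, sensitivity (recall) and F-measure for a
# multi-class prediction, using weighted averaging as one plausible convention.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarise(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return {"accuracy_%": 100 * acc, "precision": prec,
            "sensitivity": rec, "f_measure": f1}

# Toy labels, purely illustrative:
y_true = ["Hydrolase", "Transferase", "Lyase", "Hydrolase", "Ligase"]
y_pred = ["Hydrolase", "Transferase", "Hydrolase", "Hydrolase", "Ligase"]
print(summarise(y_true, y_pred))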

Table 4. Result summary using hybrid feature selection and taking the first 7 attributes
(RS = Random Sampling, ID = Imbalanced Dataset)

Classification technique   Accuracy (%)      Precision        Sensitivity      F-measure
                           RS      ID        RS      ID       RS      ID       RS      ID
Naive Bayes                35.35   35.02     0.446   0.45     0.353   0.35     0.359   0.367
RBF network                48.87   50.00     0.494   0.476    0.488   0.5      0.490   0.462
Random Forest              80.49   80.9775   0.802   0.808    0.804   0.81     0.801   0.806
SVM                        63.21   62.90     0.667   0.60     0.632   0.629    0.646   0.618

Fig. 1. Graphical representation of prediction

6 Conclusion and Future Work

In this paper we have used four different classification methods and presented a brief description of the classification techniques and the feature selection techniques used. The prediction results show a significant increase in accuracy when the hybrid feature selection is applied. The analysis above shows that the Random Forest method achieves better accuracy than the other classification techniques when feature selection is used on the imbalanced dataset. The present work addresses the feature selection issue in the context of protein classification by using a novel hybrid feature selection approach which improves the performance of the classification model. Adjusting the various training parameters, further data refinement, modeling of the whole dataset, and an ensemble of SVM with an RBF kernel and Random Forest could yield a more comprehensive computational model for protein function prediction using hybrid feature selection. Here we have used only twelve attributes for protein function prediction; hence, in future the classifier performance might be improved by using more relevant and appropriate attributes.

References
1. Lee, B.J., Lee, H.G., Ryu, K.H.: Design of a novel protein feature, enzyme function
classification. In: CIT Workshops 2008. IEEE 8th International Conference on
Computer and Information Technology Workshops, pp. 450–455. IEEE (2008)
2. Yadav, A., Jayaraman, V.K.: Structure based function prediction of proteins using
fragment library frequency vectors. Bioinformation 8(19), 953–956 (2012)
3. Garg, A., Raghava, G.P.: A machine learning based method for the prediction of
secretory proteins using amino acid composition, their order and similarity-search.
In Silico Biol. 8(2), 129–140 (2008)
4. Mer, A.S., Andrade-Navarro, M.A.: A novel approach for protein subcellular loca-
tion prediction using amino acid exposure. BMC Bioinform. 14(1), 1 (2013)
5. Jensen, L.J., Skovgaard, M., Brunak, S.: Prediction of novel archaeal enzymes from
sequence-derived features. Protein Sci. 11(12), 2894–2898 (2002)
6. Capra, J.A., Laskowski, R.A., Thornton, J.M., Singh, M., Funkhouser, T.A.: Pre-
dicting protein ligand binding sites by combining evolutionary sequence conserva-
tion and 3d structure. PLoS Comput. Biol. 5(12), e1000585 (2009)
7. Dehzangi, A., Paliwal, K., Sharma, A., Dehzangi, O., Sattar, A.: A combination
of feature extraction methods with an ensemble of different classifiers for protein
structural class prediction problem. IEEE/ACM Trans. Comput. Biol. Bioinform.
10(3), 564–575 (2013)
8. Dobson, P.D., Doig, A.J.: Predicting enzyme class from protein structure without
alignments. J. Mol. Biol. 345(1), 187–199 (2005)
9. Wang, L., Yang, M.Q., Yang, J.Y.: Prediction of DNA-binding residues from pro-
tein sequence information using random forests. BMC Genomics 10(1), 1 (2009)
10. Kumar, C., Choudhary, A.: A top-down approach to classify enzyme functional
classes and sub-classes using random forest. EURASIP J. Bioinform. Syst. Biol.
2012(1), 1 (2012)
11. Yadav, S.K., Bhola, A., Tiwari, A.K.: Classification of enzyme functional classes,
subclasses using support vector machine. In: 2015 International Conference
on Futuristic Trends on Computational Analysis and Knowledge Management
(ABLAZE), pp. 411–417. IEEE (2015)
12. Lin, W.-Z., Fang, J.-A., Xiao, X., Chou, K.-C.: iDNA-Prot: identification of dna
binding proteins using random forest with grey model. PLoS One 6(9), e24756
(2011)
13. Wu, J., Liu, H., Duan, X., Ding, Y., Wu, H., Bai, Y., Sun, X.: Prediction of DNA-
binding residues in proteins from amino acid sequences using a random forest model
with a hybrid feature. Bioinformatics 25(1), 30–35 (2009)
14. Samb, M.L., Camara, F., Ndiaye, S., Slimani, Y., Esseghir, M.A.: A novel RFE-
SVM-based feature selection approach for classification. Int. J. Adv. Sci. Technol.
43, 27–36 (2012)
15. Tiwari, A.K., Srivastava, R.: A survey of computational intelligence techniques in
protein function prediction. Int. J. Proteomics (2014)
16. Gao, M., Skolnick, J.: DBD-hunter: a knowledge-based method for the prediction
of DNA-protein interactions. Nucleic Acids Res. 36(12), 3978–3992 (2008)
17. Frank, E., Hall, M., Trigg, L., Holmes, G., Witten, I.H.: Data mining in bioinfor-
matics using weka. Bioinformatics 20(15), 2479–2481 (2004)
18. Nagao, C., Nagano, N., Mizuguchi, K.: Prediction of detailed enzyme functions
and identification of specificity determining residues by random forests. PLoS ONE
9(1), e84623 (2014)

19. Gulati, H.: Predictive analytics using data mining technique. In: 2015 2nd Inter-
national Conference on Computing for Sustainable Global Development (INDIA-
Com), pp. 713–716. IEEE (2015)
20. Kishore, R., Tripathi, S.: A comparative analysis of enzyme classification
approaches using hybrid feature selection technique. In: International Conference
on Circuit, Power and Computing Technologies (ICCPCT). IEEE (2016)
