Analysis of Impact of Principal Component Analysis and Feature Selection For Detection of Breast Cancer Using Machine Learning Algorithms
All content following this page was uploaded by Chitra G Desai on 24 January 2023.
Chitra Desai
Department of Computer Science, National Defence Academy, Pune
Abstract: Dimensionality reduction for medical data is seen as a challenging task for datasets
that are huge in dimensions and carry critical information. Feature selection and compression
techniques can be applied to high-dimension datasets to reduce the features. In feature
selection, we select a set of features from the existing features and ignore the remaining based
on certain feature selection criteria. In compression, we recreate new features from the existing
ones by retaining important information from the original features. However, feature selection,
particularly in medical datasets, can lead to the loss of important information if the data is not
properly understood during exploratory data analysis. There are several techniques for feature
selection and compression for dimensionality reduction. This paper applies principal
component analysis (PCA), a compression technique, to a breast cancer dataset and performs
detection using machine learning algorithms. It also presents feature selection using
one of the machine learning algorithms.
Initially, pre-processing is performed on the dataset, followed by exploratory data analysis.
An in-depth study of the data, its characteristics and its distribution is carried out. With the help
of data visualization, attempts have been made to gain insight into the data with reference to the
standards in the domain of breast cancer. The data is cleaned and standardized before applying
PCA. Using box plots, the dataset is checked for outliers, as the aim is to standardize the data by
removing the mean and scaling each feature to unit variance.
The number of principal components selected plays an important role, as it affects
the accuracy of the machine learning algorithm. Using a scree plot, an attempt has been made to select
an appropriate number of principal components. The information captured in the low-dimensional space
is represented using bivariate scatter plots to gain an understanding of the data. Experiments
with and without PCA, using different machine learning algorithms for detection of breast
cancer, are demonstrated. The dataset is split 80:20 for training and testing. The
machine learning algorithms developed here for detection of breast cancer are Logistic
Regression, Support Vector Machine, Decision Tree and Random Forest. Feature selection is
demonstrated using random forest.
The impact of PCA is analysed by computing the cost function for each machine
learning algorithm with and without PCA. The confusion matrix in each case is plotted
separately for training and testing data. The values of true positive, true negative, false positive
and false negative obtained from the confusion matrix play a significant role in medical data. Based on these
values, the training and testing reports for precision, recall, F1 score and support are generated.
The precision-recall curves are plotted to gain insight into the average precision and average
recall of training and testing data. The performance of all models is evaluated, model tuning is
performed on individual models as required, and the models are ranked accordingly.
Keywords: Principal Component Analysis, Feature Selection, Machine Learning Algorithm,
Breast Cancer Detection, Confusion Matrix, Data Visualization, Exploratory Data Analysis,
Data Pre-Processing
1. Introduction
The versatile use of computer-based systems in health care and in various diagnostic equipment
has resulted in a huge volume of data being generated. This data is useful for prediction
and classification using machine learning algorithms, and for identifying patterns, data mining,
knowledge discovery, anomaly detection and clustering [1,2,3]. Data in
health care is generated in the form of medical records, administrative reports, important
findings for setting benchmarks [4], clinical trials, health insurance, surveys and
many more. Data can take several forms, such as text or images, and can be structured or unstructured. It can
be data from blood or tissue samples, X-rays, CT scans, mammograms, MRI scans, results
obtained from health devices like the electrocardiogram (ECG) and electroencephalogram (EEG), data
in speech format, for example clinical conversations, etc. The huge and complex nature of the data
leads to several challenges. Commonly observed challenges include dealing with noisy data, high dimensionality,
security of health care data, integrating data from various sources, selection of
an appropriate tool set, issues related to growing data (particularly storage and retrieval), and a lack
of professionals to handle data. Of these
challenges, the critical one addressed here is the high dimensionality of medical data.
Medical datasets contain many features and instances, which makes accurate classification
using machine learning algorithms a challenging task [5]. It becomes difficult
to visualize a training set with a huge number of features [6]. It is also difficult to figure out which
instance or which feature will have what impact on classification. Features that are redundant
or carry poor-quality input will also hamper the predictive capability of the model. Thus, using
all the features may lead to the curse of dimensionality and eventually impact
computational complexity and classification performance [7]. This raises the need for
dimensionality reduction.
Dimensionality reduction for medical data is seen as a challenging task for datasets that are
huge in dimensions and carry critical information. Feature selection and compression (that is,
feature extraction) techniques can be applied to high-dimensional datasets to reduce the features.
This paper analyses the impact of dimensionality reduction on the breast cancer detection dataset from
the UCI machine learning repository. The dataset is described
in section 2. Before applying any feature selection or feature extraction technique, it
is essential that the data be pre-processed. Also, to gain insight into the data, exploratory
analysis needs to be performed. Section 3 presents pre-processing and exploratory data analysis
for the breast cancer detection dataset. Section 4 presents the standardization of the data.
The classifiers used in this paper for detection of breast cancer are – Logistic Regression,
Support vector machine, Decision trees and Random Forest. A brief introduction to these
classifiers is presented in section 5.
In feature selection, a subset of features is selected from the existing features, ignoring the
remaining ones based on certain feature selection criteria. The selected subset preserves the original features,
which effectively define the dataset [8]. However, feature selection,
particularly in medical datasets, can lead to loss of important information if the data is not properly understood
during exploratory data analysis. It is therefore essential to understand which feature
selection technique should be chosen. Feature selection techniques can be supervised or
unsupervised. The supervised techniques include wrapper, filter and embedded methods [9].
Optimal feature selection can also be achieved for medical datasets using nature-inspired
algorithms like the genetic algorithm [10], particle swarm optimization [11,12] and artificial bee
colony [13]. This paper demonstrates feature selection on a medical dataset using the Random
Forest classifier in section 6. The random forest classifier belongs to the class of embedded
methods, which combine the qualities of both the wrapper and filter classes for feature selection.
Embedded methods consider dependency between the features, have lower computational
complexity than wrapper methods, have higher accuracy than filter methods, and are less
prone to overfitting [14].
Feature extraction, i.e. compression of features, is less prone to overfitting compared to feature
selection [14]. In compression, we create new features from the existing ones by retaining
important information from the original features. Here, a new reduced set of features is created
from the existing one by applying an algebraic transformation based on some optimization criteria
[15,16]. There are several feature extraction techniques, such as principal component analysis
(PCA), kernel principal component analysis (KPCA), independent component analysis (ICA)
[17] and linear discriminant analysis (LDA) [18]. LDA uses a statistical technique to reduce
the dimensionality while preserving as much of the class discriminatory information as possible. PCA
is a linear transformation method for feature extraction, KPCA is a nonlinear PCA developed
using the kernel method, and ICA linearly transforms the original features into statistically
independent features. This paper applies principal component analysis (PCA), a compression
technique, to the breast cancer dataset and performs detection using machine learning algorithms.
PCA is discussed in section 7. Section 8 presents results and conclusion.
2. Data Set
The dataset for breast cancer diagnostics [19] analysed in this paper consists of a total of
569 instances across 32 attributes. Of the 32 attributes, one is the ID, another is the
diagnosis, which is the target variable, and the remaining 30 attributes are the feature vectors.
Features are calculated from a digitized image of a fine needle aspirate of a breast mass.
These features represent the characteristics of the cell nuclei present in the image. A detailed
description of each of these feature vectors can be found in [20,21,22,23]. The
feature values are recorded up to four significant digits and stored using the float64 data
type. These features are - Diagnosis, radius_mean, texture_mean, perimeter_mean,
area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean,
Figure 1 High Correlation between variables and variables with zero Value
Outliers have a significant impact on the power of statistical tests. Using a box plot, it is possible to
detect the outliers for a variable. Figure 3 shows the box plot for area_mean. The outliers for
area_mean on the maximum side of the malignant cases are 2501, 2499, 2250, 2010 and 1828. The
mean of area_mean for malignant cases is 978.37, which is computed with the outliers included. These
large values pull the mean of area_mean for malignant instances upwards.
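The effect of such outliers on the mean can be illustrated with a minimal sketch. It assumes the conventional box-plot rule (values beyond 1.5 times the interquartile range from the quartiles), which the text does not state explicitly, and it uses a hypothetical sample in which only the five large values are taken from the text.

```python
from statistics import mean, quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed box-plot whisker rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical area_mean-like sample: 40 evenly spaced inliers plus the
# five large values quoted in the text. Not the real dataset.
sample = list(range(400, 1200, 20)) + [1828, 2010, 2250, 2499, 2501]

outliers = iqr_outliers(sample)
trimmed = [v for v in sample if v not in outliers]

print("outliers:", outliers)
print("mean with outliers:", round(mean(sample), 2))
print("mean without outliers:", round(mean(trimmed), 2))
```

The mean computed with the large values included is higher than the mean of the remaining values, which is the effect described above.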
Figure 4 Scatter Plot showing spread of malignant and benign with respect to radius_mean
and concavity_mean
Figure 4 presents a scatter plot of two variables, concavity_mean and radius_mean,
which clearly shows the association of the two variables in distinguishing benign and malignant
cases. The correlation between different variables is shown by the heatmap in figure 5. Thus, all
variables can be explored to the extent possible with the help of statistical values and
visualization.
Figure 5 Heatmap
In the discussion that follows, let X hold the feature vectors and y hold the target
variable.
Figure 6 Statistical insight into the Feature Vectors showing Varying Range of Values across
each Feature Vector
understand learning from both machine and human perspective [26]. Humans learn by
observing or by getting directly involved in the task. With experience, humans tend to improve
their performance on the task. Machine learning emulates human learning,
improving with experience. It learns from experience with the help of an algorithm. Machine
learning models can be parametric or non-parametric. A machine learning model is termed
parametric if it learns from the data based on an assumed distribution of the data, for example, logistic
regression. Support vector machine (SVM), decision tree and random forest are non-parametric,
as they do not make assumptions about the distribution of the data. For all four
classification models (logistic regression, SVM, decision tree and random forest), performance
evaluation can be done using the confusion matrix [27].
In the confusion matrix, the values are represented as true positive (TP), true negative
(TN), false positive (FP) and false negative (FN), considering the actual and predicted values. TP
means that the model predicted positive and the actual was positive; in TN the model predicted
negative and the actual was negative; in FP the model predicted positive but the actual was negative;
in FN the model predicted negative but the actual was positive. The obtained values for TP, TN, FP
and FN are used to compute the indicators precision, recall and F1-score [28]. Precision is given
by equation (1), recall by equation (2) and F1-score by equation (3).
$$\text{Precision} = \frac{TP}{TP + FP} \quad \ldots (1)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad \ldots (2)$$
$$F1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \quad \ldots (3)$$
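Equations (1) to (3) can be evaluated directly from the confusion matrix counts. The following is a minimal sketch; the counts are hypothetical and are not results from this study, and the zero-denominator guards are an added assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision (1), recall (2) and F1 (3) from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = tp + 0.5 * (fp + fn)
    f1 = tp / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration only
precision, recall, f1 = precision_recall_f1(tp=40, fp=2, fn=3)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Equation (3) is algebraically the same as the harmonic mean of precision and recall, 2PR / (P + R), which gives a quick consistency check on the three values.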
For probabilistic forecasts in binary classification, the ROC curve and the precision-recall curve are
helpful tools. It is observed from figure 2 that the two classes ‘B’ and ‘M’ are imbalanced.
So, the precision-recall curve is preferred for analysing the performance of the models. The ROC
curve is best suited when the classes in the target variable are balanced. For binary classification
on imbalanced datasets, the precision-recall curve is more informative than the ROC curve [29].
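As an illustration of how a precision-recall curve is built, the sketch below sweeps the decision threshold over the predicted scores and records precision and recall at each step. The labels and scores are hypothetical, and the function name is chosen here for illustration; it is not the plotting routine used for the figures in this paper.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for each distinct score used as threshold."""
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(pred, y_true) if p == 0 and y == 1)
        # tp + fp >= 1 because the threshold equals at least one score;
        # tp + fn >= 1 as long as y_true contains a positive.
        points.append((t, tp / (tp + fp), tp / (tp + fn)))
    return points

# Hypothetical labels (1 = 'M', 0 = 'B') and model scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = precision_recall_points(y_true, scores)
for threshold, p, r in points:
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold never lowers recall, while precision may rise or fall, which is why the curve is read as a trade-off between the two.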
Figure 8 shows an example of the sigmoid function f(x) for input x. The graph shows
the suitability of logistic regression for binary classification. As mentioned above, in our breast
cancer dataset we have represented ‘B’ with ‘0’ and ‘M’ with ‘1’. If the value returned by the
model for input x is higher than or equal to the threshold 0.5, then ‘M’ (1) is assigned, else ‘B’ (0)
is assigned.
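The thresholding rule above can be sketched as follows. Here z stands for the linear score that a fitted logistic regression model would compute for an input; the example values of z are hypothetical.

```python
import math

def sigmoid(z):
    """Sigmoid function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(z, threshold=0.5):
    """Assign 1 ('M') if the sigmoid output reaches the threshold, else 0 ('B')."""
    return 1 if sigmoid(z) >= threshold else 0

# Hypothetical linear scores
for z in (-2.0, 0.0, 3.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

With a threshold of 0.5, the decision depends only on the sign of z, since the sigmoid of 0 is exactly 0.5.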
The model is trained using the logistic regression algorithm and, as shown in figure 9, the
confusion matrix and precision-recall curve are obtained.
The model gives accuracy of 97.36%. It is observed that for training data, the average precision
(AP) is 0.93 and for test data it is 0.98. The average recall (AR) is 0.67 for training data and
0.56 for test data.
Figure 10 shows the confusion matrix and precision recall curve for breast cancer detection
dataset using SVM. It is observed that for training data, the average precision (AP) is 0.92 and
for test data it is 0.98. The average recall (AR) is 0.68 for training data and 0.57 for test data.
A drop in average precision on the training data of SVM by 0.01 and an increase in average recall by
0.01 on both training and testing data are observed compared to logistic regression.
Figure 11 shows the confusion matrix and precision recall curve for breast cancer detection
dataset using decision tree. It is observed that for training data, the average precision (AP) is
1.00 and for test data it is 0.77. The average recall (AR) is 0.50 for training data and 0.65 for
test data.
5.4 Random Forest
A decision tree is a single tree, whereas a random forest [38] is a collection of
trees that forms the forest. In a random forest, multiple decision trees are created randomly. It
uses ensemble learning, where it trains a group of individual models in parallel. Here,
each model is trained on a random subset of the data. A larger subset helps manage the
bias in the distribution of the data. Figure 12 shows the confusion matrix and precision-recall
curve for the breast cancer detection dataset using random forest.
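The two ensemble ideas described above, random subsets for each tree and a combined decision, can be sketched in a few lines. This is a simplified illustration rather than the implementation used in this study: it assumes sampling with replacement (bootstrap) for the random subsets and a simple majority vote for combining trees, and the row indices and votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one tree's random training subset."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # hypothetical row indices of a training set
subsets = [bootstrap_sample(rows, rng) for _ in range(5)]

# Hypothetical votes of five trees for one sample (1 = 'M', 0 = 'B')
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print(forest_prediction)
```

Because each subset is drawn independently, the individual trees can be trained in parallel, as noted above.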
Figure 12 Confusion Matrix and Precision recall curve for Random Forest
The AP for training data is 1 and for testing data it is 0.88. The AR for training data is 0.85
and for testing data it is 0.79. It is observed that the AR for feature selection using the random forest
classifier is the highest compared to the AR of logistic regression, SVM, decision tree and random
forest (without feature selection).
Figure 34 Confusion Matrix and Precision recall curve for Feature Selection using Random
Forest Classifier
same as the original dimension (thirty-one) is computed, explaining how
all variables relate to each other. It has both direction and magnitude. The number of principal
components is then selected based on how much variance is explained by each component
or on the eigenvalue criterion. The scree plot can be used to decide the number of principal components,
as shown in figure 15 below for our dataset.
While selecting the number of principal components, it is essential to ensure that these components
explain maximum variance, so that maximum information is retained. Table 2 shows the variance
explained by each component and its percentage. As we have chosen four components,
there are four eigenvectors and four eigenvalues (one for each eigenvector). Figure 16
shows the values of the four eigenvectors e1, e2, e3, e4 and the eigenvalues λ1, λ2, λ3, λ4 of the
corresponding eigenvectors.
On observing the elbow in the scree plot (figure 15), it is decided to select four principal
components. The scree plot shows maximum variability explained by the first principal
component; components 2 to 3 explain moderate variability before
the curve starts flattening from the fourth PC onwards. All components up to the point where the curve flattens are selected.
For the data under study, we have obtained a single elbow, thereby simplifying the choice
of principal components. However, there can be situations where more than one elbow
is observed, making it difficult to decide on the number of components.
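The quantities plotted in a scree plot can be computed from the eigenvalues alone. The sketch below uses hypothetical eigenvalues, not those of this dataset. It also shows a cumulative-variance threshold as one possible numeric aid when the elbow is ambiguous; this threshold rule is an assumption added here and is not the selection method used in this paper.

```python
def explained_variance_ratio(eigenvalues):
    """Fraction of total variance explained by each component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def components_for_variance(eigenvalues, target=0.9):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio(ordered), start=1):
        cumulative += ratio
        if cumulative >= target:
            return k
    return len(ordered)

# Hypothetical eigenvalues in descending order
eigenvalues = [5.0, 2.0, 1.5, 0.8, 0.4, 0.2, 0.1]

ratios = explained_variance_ratio(eigenvalues)
n_components = components_for_variance(eigenvalues, target=0.9)
print([round(r, 2) for r in ratios])
print(n_components)
```

Plotting the ratios against the component index gives the scree plot; the point where successive ratios stop dropping sharply is the elbow.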
Considering the eigenvectors and eigenvalues, it is also observed that the eigenvalues are in
descending order, that is, the first eigenvector corresponds to the first principal component pc1
and so on. Thus, using PCA, we have reduced the number of dimensions of the given breast
cancer dataset from thirty-one to four. That is, using principal component analysis we have
obtained principal components that are used further as feature vectors. These feature vectors
are formed using the eigenvectors by re-expressing the data from the original axes on the new axes
represented by the principal components. The pair plot for the four components is presented in
figure 17.
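Re-expressing the data on the new axes amounts to taking the dot product of each (centred) row with each eigenvector. The following two-dimensional sketch uses hypothetical, already centred rows and a hypothetical pair of orthonormal eigenvectors; it illustrates the projection step only and is not the four-component result reported here.

```python
import math

def project(rows, eigenvectors):
    """Project each row onto each eigenvector, giving its principal component scores."""
    return [[sum(x * w for x, w in zip(row, vec)) for vec in eigenvectors]
            for row in rows]

s = 1 / math.sqrt(2)
# Hypothetical orthonormal eigenvectors (first = direction of largest variance)
eigenvectors = [(s, s), (s, -s)]
# Hypothetical rows, already centred (each column has mean zero)
rows = [(1.0, 1.0), (-1.0, -1.0), (0.5, -0.5), (-0.5, 0.5)]

scores = project(rows, eigenvectors)
for row, score in zip(rows, scores):
    print(row, [round(v, 3) for v in score])
```

Keeping only the first few columns of the scores is what reduces the dimensionality while retaining the directions of largest variance.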
Figure 20 Confusion Matrix and PR Curve for Decision Tree with PCA
Figure 21 Confusion Matrix and PR Curve for Random Forest with PCA
On obtaining the four principal components, a new feature set is available to be given to the classification
algorithms. Using these four principal components, the confusion matrix and
precision-recall curve are plotted. Figures 18 to 21 show the confusion matrix and precision-recall curve
for logistic regression (with PCA), SVM (with PCA), decision tree (with PCA) and random
forest (with PCA).
It is observed from the precision-recall curves in the above figures that random forest with PCA obtains
the highest rank for prediction.
Table 3 AP and AR
Four machine learning algorithms were considered for the study: Logistic Regression, Support
Vector Machine, Decision Tree and Random Forest. Prediction using these algorithms was
carried out with and without PCA. Using PCA, the 31 features were reduced to 4 components.
Also, to analyse the impact of feature selection on prediction, feature selection using the
random forest classifier was performed. Out of 31 features, 10 features were selected by
computing relative feature importance using the random forest classifier.
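Selecting features by relative importance reduces to ranking the importance scores and keeping the top k. In the sketch below the importance values are hypothetical and are not the values obtained in this study; in practice they would come from a fitted random forest classifier.

```python
def select_top_features(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical relative importances for a few features
importances = {
    "radius_mean": 0.12,
    "texture_mean": 0.03,
    "perimeter_mean": 0.15,
    "area_mean": 0.14,
    "compactness_mean": 0.02,
    "concavity_mean": 0.20,
}

selected = select_top_features(importances, k=3)
print(selected)
```

Unlike PCA, the selected columns are original features, so the reduced dataset remains directly interpretable.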
The performance evaluation of the models was performed using the confusion matrix and
precision-recall curve. The AP and AR values in table 3 show that with feature selection
and feature reduction, the performance of the prediction algorithms is enhanced, along with
the benefits that feature selection and feature reduction offer in terms of resource utilization
and the others discussed above.
Using feature selection (with the random forest classifier) and feature reduction (PCA), the
performance of the prediction algorithms is enhanced and found suitable for breast cancer
detection compared to prediction without PCA and without feature selection.
Reference list
5. Eva Tuba, Ivana Strumberger, Timea Bezdan, Nebojsa Bacanin, Milan Tuba,
Classification and Feature Selection Method for Medical Datasets by Brain Storm
Optimization Algorithm and Support Vector Machine, Procedia Computer Science, Volume
162,2019, Pages 307-315, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2019.11.289
6. M. Verleysen and D. François, "The curse of dimensionality in data mining and time
series prediction," in International Work-Conference on Artificial Neural Networks,
pp. 758-770: Springer, 2005.
7. S. N. Katole and S. P. Karmore, "A New Approach of Microarray Data Dimension
Reduction For Medical Applications," 2015 2nd International Conference on Electronics and
Communication Systems (ICECS), pp. 409-413, 2015 doi: 10.1109/ECS.2015.7124936.
8. D. L. Padmaja and B. Vishnuvardhan, "Comparative Study of Feature Subset
Selection Methods For Dimensionality Reduction On Scientific Data," in IEEE 6th
International Conference on Advanced Computing (IACC), pp. 31-34: IEEE, 2016
9. Girish Chandrashekar, Ferat Sahin, “A Survey On Feature Selection Methods”,
Computers & Electrical Engineering, Volume 40, Issue 1,Pages 16-28,ISSN 0045-7906, 2014
https://doi.org/10.1016/j.compeleceng.2013.11.024 .
10. T.Santhanam,M.Padmavathi, “Application Of K-Means And Genetical Algorithms
For Dimension Reduction By Integrating SVM For Diabetes Diagnosis”, Procedia Computer
Science 47,pp-76–83, 2015.
11. H.H.Inbarani,A.T.Azar,G.Jothi, “Supervised Hybrid Feature Selection Based On PSO
And Rough Sets For Medical Diagnosis”, Computer Methods And Programs In Biomedicine
113(1), pp-175–185, 2014.
12. S.M.Vieira, L.F.Mendonça, G.J.Farinha, J.M.Sousa, “Modified Binary PSO For
Feature Selection Using SVM Applied To Mortality Prediction Of Septic Patients”, Applied
Soft Computing 13(8), pp-3494–3504, 2013.
13. M.S.Uzer,N.Yilmaz,O.Inan, Feature selection method based on artificial bee colony
algorithm and support vector machines for medical datasets classification, The Scientific
World Journal Article ID 419187(2013)1–10
14. Zebari et al., A Comprehensive Review of Dimensionality Reduction Techniques for
Feature Selection and Feature Extraction , Journal of Applied Science and Technology
Trends Vol. 01, No. 02, pp. 56 –70, (2020)
15. M. K. Elhadad, K. M. Badran, and G. I. Salama, "A novel approach for ontology-
based dimensionality reduction for web text document classification," International Journal
of Software Innovation (IJSI), vol. 5, no. 4, pp. 44-58, 2017.
16. D. A. Zebari, H. Haron, S. R. Zeebaree, and D. Q. Zeebaree, "Enhance the
Mammogram Images for Both Segmentation and Feature Extraction Using Wavelet
Transform," in 2019 International Conference on Advanced Science and Engineering
(ICOASE), 2019, pp. 100-105: IEEE
17. L. J. Cao and W. K. Chong, "Feature extraction in support vector machine: a
comparison of PCA, XPCA and ICA," Proceedings of the 9th International Conference on
Neural Information Processing, 2002. ICONIP '02., 2002, pp. 1001-1005 vol.2, doi:
10.1109/ICONIP.2002.1198211.
18. Youness Aliyari Ghassabeh, Frank Rudzicz, Hamid Abrishami Moghaddam, Fast
incremental LDA feature extraction, Pattern Recognition, Volume 48, Issue 6,2015,Pages
1999-2012,ISSN 0031-3203,https://doi.org/10.1016/j.patcog.2014.12.012.
19. Street W. N, Wolberg W. H. and Mangasarian O.L,
https://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/
20. O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August
1995.
21. W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine
learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative
Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.
22. W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized
breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery
1995;130:511-516.
23. W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived
nuclear features distinguish malignant from benign breast cytology.Human Pathology,
26:792--796, 1995.
24. Ferlay J, Ervik M, Lam F, Colombet M, Mery L, Piñeros M, et al. Global Cancer
Observatory: Cancer Today. Lyon: International Agency for Research on Cancer; 2020
(https://gco.iarc.fr/today, accessed February 2021).
25. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830,
2011
26. Underwood, T. (2020). Machine Learning and Human
Perspective. PMLA/Publications of the Modern Language Association of America, 135(1),
92-109. doi:10.1632/pmla.2020.135.1.92
27. Ting K.M. (2017) Confusion Matrix. In: Sammut C., Webb G.I. (eds) Encyclopedia of
Machine Learning and Data Mining. Springer, Boston, MA.
https://doi.org/10.1007/978-1-4899-7687-1_50
28. Goutte C., Gaussier E. (2005) A Probabilistic Interpretation of Precision, Recall and
F-Score, with Implication for Evaluation. In: Losada D.E., Fernández-Luna J.M. (eds)
Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol
3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_25
29. Saito T, Rehmsmeier M (2015) The Precision-Recall Plot Is More Informative than
the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE
10(3): e0118432. https://doi.org/10.1371/journal.pone.0118432
30. Choney Zangmo and Montip Tiensuwan, Application of logistic regression models to
cancer patients: a case study of data from Jigme Dorji Wangchuck National Referral
Hospital (JDWNRH) in Bhutan, 2018 J. Phys.: Conf. Ser. 1039 012031
31. Breslow NE, Day NE, Heseltine E. Statistical methods in cancer research. Lyon:
International Agency for Research on Cancer; 1980.
32. Ayer, Turgay, Jagpreet Chhatwal, Oguzhan Alagoz, Charles E. Kahn Jr, Ryan W.
Woods, and Elizabeth S. Burnside. "Comparison of logistic regression and artificial neural
network models in breast cancer risk estimation." Radiographics 30, no. 1 (2010): 13-22.
33. Liu, Lei. "Research on logistic regression algorithm of breast cancer diagnose data
by machine learning." In 2018 International Conference on Robots & Intelligent System
(ICRIS), pp. 157-160. IEEE, 2018.
34. Andriy Burkov, The Hundred-Page Machine Learning Book, Publisher: Andriy
Burkov (2019)
35. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273–97.
36. Quinlan, J.R. Induction of decision trees. Mach Learn 1, 81–106 (1986).
https://doi.org/10.1007/BF00116251
37. A. Navada, A. N. Ansari, S. Patil and B. A. Sonkamble, "Overview of use of decision
tree algorithms in machine learning," 2011 IEEE Control and System Graduate Research
Colloquium, 2011, pp. 37-42, doi: 10.1109/ICSGRC.2011.5991826.
38. A. Liaw and M. Wiener (2002). Classification and Regression by randomForest. R
News 2(3), 18--22.
39. Bartholomew, D. J. (2010). Principal components analysis, Int. Encycl. Educ., pp.
374–377, doi: 10.1016/B978-0-08-044894-7.01358-0
40. Andriy Burkov, The Hundred-Page Machine Learning Book, Publisher: Andriy
Burkov (2019)
41. Saraswathi, V. and Gupta, D. (2019), Classification of Brain Tumor using PCA-RF in
MR Neurological Images, 2019 11th Int. Conf. Commun. Syst. Networks, COMSNETS 2019,
vol. 2061, pp. 440–443, doi: 10.1109/COMSNETS.2019.8711010
42. A. A., Ripmiatin,E. and Effendi,Y. (2018). Dimensionality Reduction using PCA and
K-Means Clustering for Breast Cancer Prediction, Lontar Komput. J. Ilm.Teknol. Inf., vol. 9,
no. 3, p. 192, 2018 doi: 10.24843/lkjiti.2018.v09.i03.p08
43. Astuti, W.and Adiwijaya, “Support Vector Machine And Principal Component
Analysis For Microarray Data Classification”, J. Phys. Conf. Ser., vol. 971, no. 1, 2018
doi:10.1088/1742-6596/971/1/012003
44. Yang, W., Si, Y., Wang, D., & Guo, B., “Automatic Recognition Of Arrhythmia Based
On Principal Component Analysis Network And Linear Support Vector Machine”,
Computers In Biology And Medicine, 101, 22–32, 2018
https://doi.org/10.1016/j.compbiomed.2018.08.003