Prediction of Idiopathic Recurrent Spontaneous Miscarriage Using Machine Learning
Abstract— Recurrent spontaneous miscarriage (RSM) is defined as the spontaneous loss of two or more clinically diagnosed pregnancies within 20 weeks of gestation. Despite extensive research, the etiology remains undefined in 50% of RSM cases, which are classified as idiopathic. Thus, further study is warranted to understand the molecular mechanisms associated with the disease pathogenesis. In the present study, we aim to identify Raman fingerprints in endometrial/uterine tissues of women with a history of idiopathic recurrent spontaneous miscarriage (IRSM) and controls by performing Raman spectroscopy with chemometric analysis and spectral classification models. Unsupervised analyses such as principal component analysis (PCA) and hierarchical cluster analysis (HCA), and supervised analysis such as orthogonal projections to latent structures discriminant analysis (OPLS-DA), showed a distinct separation between IRSM and controls. The principal component loading plots indicated that proteins, amino acids, cholesterol and glutamate were responsible for the separation between the two groups. The pre-processed Raman spectral data were subjected to eight different machine learning (ML) classifiers with hyperparameter optimization to develop prediction models. Comparing the various algorithms, support vector machine (SVM), decision tree (DT), Extreme Gradient Boosting (XGBoost), convolutional neural network (CNN), and artificial neural network (ANN) outperformed the other models based on accuracy (≥ 85%). Grid search and Bayesian optimization were used for tuning the hyperparameters of all methods. Further, 10-fold cross-validation was done to validate the model performances. The present findings confirm the feasibility of using Raman spectroscopy combined with ML algorithms, which may facilitate a better understanding of this pathology.

Keywords— IRSM, Raman spectroscopy, machine learning, SVM, DT, XGBoost, CNN, AdaBoost, RF, GB, ANN, PCA, OPLS-DA

I. INTRODUCTION

Recurrent spontaneous miscarriage (RSM) is defined as the spontaneous loss of two or more clinically diagnosed pregnancies within 20 weeks of gestation [1]. Various causative factors such as endocrine, genetic, immunological, anatomic, infectious, thrombophilic, environmental, and metabolic are known to be responsible for RSM [2]. Despite extensive research in the past few decades, no underlying cause is identified in about half of the patients, who remain without a diagnosis. The condition of these patients is known as idiopathic. Thus, further research is warranted to understand the mechanism of idiopathic recurrent spontaneous miscarriage (IRSM).

Recently, there has been considerable interest in exploring the pathophysiology of complex diseases using powerful label-free vibrational techniques [3-5]. Raman spectroscopy, based on inelastic scattering of light that gives information about vibrational modes of molecular bonds, is one such technique; it is non-invasive and is increasingly used to gain insight into the characteristics of proteins, carbohydrates, lipids and nucleic acids [6-7]. Various diseases lead to structural and chemical changes in tissues at a molecular level, which are reflected in intrinsic Raman fingerprints and can be used as sensitive phenotypic markers of the disease.
Authorized licensed use limited to: SAINT LOUIS UNIVERSITY LIBRARIES. Downloaded on May 02,2024 at 08:19:13 UTC from IEEE Xplore. Restrictions apply.
Recently, artificial intelligence (AI) has emerged rapidly in healthcare and has shown marked potential in disease diagnostics and treatment. Machine learning (ML) is a subset of AI and is currently being explored in various areas of reproductive health. ML has been used to develop models for the prediction of endometriosis, polycystic ovary syndrome (PCOS) and preeclampsia [8-10]. ML algorithms help to interpret the vast amount of complex biomedical data rapidly and efficiently in terms of valuable feature information. Therefore, ML is recognized as a promising method for the identification of patterns and has been extensively applied to understand disease mechanisms and to reduce errors in diagnosis, treatment and clinical decision-making [11-12]. To our knowledge, this is the first study in which Raman spectroscopy combined with ML is used to understand the disease mechanism of IRSM, a common women’s health issue. We aim to identify the best performing algorithm by hyperparameter tuning and to develop predictive models for understanding the disease pathophysiology of IRSM.

II. METHODOLOGY

This study recruited women with a history of IRSM (n=15) and controls (n=15) as per inclusion criteria at the Assisted Reproduction Unit, Institute of Reproductive Medicine, Salt Lake, Kolkata. Endometrial tissue samples (lining of the uterine cavity where the embryo implants) were collected from both groups during a favourable period of the menstrual cycle, known as the window of implantation (WOI). All tissue samples were fixed in 4% formalin and embedded in paraffin for Raman spectroscopy. All spectra were acquired in the spectral range of 400-1800 cm−1 (the fingerprint region) and the intensity was measured. This region provides information about biomolecules such as lipids, carbohydrates, proteins and nucleic acids. Further, Raman spectral data were evaluated by ML algorithms including Extreme Gradient Boosting (XGBoost), decision tree (DT), support vector machine (SVM), Adaptive Boosting (AdaBoost), random forest (RF), gradient boosting (GB), convolutional neural network (CNN) and artificial neural network (ANN). The overall workflow of the study design is shown in Figure 1.

Fig. 1 Workflow of the study plan

Data processing

All spectra obtained were smoothed using the Savitzky–Golay method and baseline corrected to remove noise and fluorescence background using ORIGIN PRO 8.5 (Origin Lab Corporation, Northampton, MA, USA). The band position of each biomolecule was determined using the peak analyzer algorithm of the Origin Pro software. Data preprocessing yielded clear spectral peaks and improved spectral quality considerably. The Raman peaks were assigned based on earlier published literature [13-15]. The processed Raman data were then taken for chemometric analysis.

Chemometric analysis

Principal component analysis (PCA) was conducted to gain information on the spectral differences between the types of samples. Moreover, to obtain information about similarity between the samples, hierarchical cluster analysis (HCA) was performed with a Euclidean distance matrix:

Euclidean Distance $(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$ (1)

HCA allows visualization of the overall grouping and accordingly sub-groups the spectra based on their similarities [16]. HCA and PCA were performed using the ChemoSpec package of R. We also obtained the weights (loading plot) from the linear transformation, which defines the new variables and indicates the molecular vibrational modes that explain most of the variance. A supervised classification model, OPLS-DA (orthogonal partial least squares discriminant analysis), was applied to visualize class separation using SIMCA 13.0.1 (Umetrics, Sweden). OPLS-DA augments class segregation by eliminating variability that is not relevant to class separation.

Machine learning algorithms
Following spectral preprocessing, the Raman spectral data were subjected to ML in Python using the scikit-learn library. The spectral data were divided randomly into a training set and a validation set in a ratio of 70:30. The training set was used to train a classification model and the validation set was used to evaluate the model performance.
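The smoothing and 70:30 split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the study's pipeline: the spectra are synthetic, SciPy's `savgol_filter` is substituted for the Origin Pro smoothing, and the window length, polynomial order, random seed and stratified splitting are hypothetical choices that the text does not specify.

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for Raman spectra: 30 spectra x 700 wavenumber bins
# covering 400-1800 cm^-1. The study data are not available here.
wavenumbers = np.linspace(400, 1800, 700)
labels = np.array([1] * 15 + [0] * 15)               # 1 = IRSM, 0 = control
peak = np.exp(-((wavenumbers - 1650) / 15.0) ** 2)   # Amide I-like band
spectra = (1.0 - 0.3 * labels[:, None]) * peak + rng.normal(0, 0.05, (30, 700))

# Savitzky-Golay smoothing along the wavenumber axis (assumed parameters).
smoothed = savgol_filter(spectra, window_length=11, polyorder=3, axis=1)

# Random 70:30 split; stratification keeps both classes in each subset
# (an assumption, the text only states that the split was random).
X_train, X_val, y_train, y_val = train_test_split(
    smoothed, labels, test_size=0.30, stratify=labels, random_state=42
)
print(X_train.shape, X_val.shape)
```

With only 30 samples, stratifying the split is one way to avoid a validation set that contains a single class.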
Several ML algorithms, including SVM, AdaBoost, XGBoost,
DT, RF, GB, CNN and ANN were used to build the prediction
models for IRSM. We also implemented neural networks to
improve upon ML methodologies. The performance of each
classifier algorithm was measured independently.
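As a rough illustration of measuring each classifier independently, the sketch below trains several scikit-learn models on the same split and records their validation accuracy. The data are synthetic, the hyperparameters are defaults or arbitrary choices, and XGBoost, CNN and ANN are left out because they need libraries beyond scikit-learn; none of this reproduces the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for pre-processed spectra (assumption).
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Hypothetical model settings; the paper's tuned values are not used here.
models = {
    "SVM": SVC(kernel="poly", degree=3),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

# Each classifier is fitted and scored independently on the same split.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_val, y_val)

for name, acc in accuracy.items():
    print(f"{name}: validation accuracy = {acc:.2f}")
```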
A. Support Vector Machine

The separating hyperplane maximizes the distance to the nearest points in each class [17]. This algorithm achieves high discriminative power by using special nonlinear functions (polynomial kernels) to transform the input space into a multidimensional space. SVM has the advantages of increasing class separation and reducing prediction error. Additionally, SVM can be used for both linear and non-linear discriminant analyses. The polynomial kernel is given as

$K(x, y) = (\gamma\, x^{T} y + r)^{d}, \quad \gamma > 0$ (2)

B. Adaptive Boosting

D. Decision Tree

DT is a powerful ML model that builds a decision tree from a set of class-labeled training samples [20]. The tree can be explained by two entities, namely leaves and decision nodes. The leaves are the decisions or final outcomes, and the decision nodes are where the data are split.

E. Random Forest

RF is an ensemble method that operates by constructing a multitude of decision trees through random selection of features with controlled variance [21].

F. Gradient Boosting

GB is an ensemble technique that builds a model from weaker prediction models, mostly decision trees, in which each successive weak classifier gradually improves on the previous ones [22].

G. Convolutional Neural Network

The output of the preceding layers was converted to a single dimension using the flatten layer. The output layers consisted of a dense layer and a sigmoid activation function, which added the final predicted classification label to the output. The flowchart of the CNN model is shown in Figure 2.

H. Artificial Neural Network

An artificial neural network was implemented using the PyTorch library. Initially, a linear layer was added, which takes the input dimension as input and the number of hidden dimensions (128) as its output units. A ReLU activation function was added next, followed by another linear layer. Further, a sigmoid activation function was added for the final output layer. A forward step function was defined, which performs a forward pass of the network. The loss was calculated using the output of the forward pass, and backpropagation was then run to compute the gradients. An optimizer.step() function was called, which updates the model’s parameters using the gradients calculated during the backpropagation step; the Adam optimizer was defined with a learning rate α=0.01. The architecture of the ANN model is shown in Figure 4.
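The ANN description can be read as the following minimal PyTorch sketch: linear layer (128 hidden units), ReLU, a second linear layer, sigmoid output, and Adam with a learning rate of 0.01. The class name `SpectraANN`, the synthetic data, the number of input features, the number of epochs and the binary cross-entropy loss are assumptions made for illustration; the text does not state the loss function or training schedule.

```python
import torch
from torch import nn

torch.manual_seed(0)


class SpectraANN(nn.Module):
    """Linear -> ReLU -> Linear -> Sigmoid, following the description in the text."""

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass: returns one probability per input spectrum.
        return self.net(x).squeeze(1)


# Synthetic stand-in for spectra: 30 samples x 100 features (hypothetical sizes).
X = torch.randn(30, 100)
y = (X[:, :10].mean(dim=1) > 0).float()

model = SpectraANN(input_dim=X.shape[1])
criterion = nn.BCELoss()                                   # assumed loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # alpha = 0.01

losses = []
for epoch in range(20):               # assumed number of epochs
    optimizer.zero_grad()             # clear gradients from the previous step
    probs = model(X)                  # forward pass
    loss = criterion(probs, y)        # loss from the forward-pass output
    loss.backward()                   # backpropagation computes gradients
    optimizer.step()                  # Adam updates the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The order inside the loop (forward pass, loss, `backward()`, then `optimizer.step()`) is the usual PyTorch sequence; the weights change only at the `step()` call.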
were cross-validated using GridSearchCV; separate cross-validation for each model was also conducted for additional verification. After tuning the hyperparameters, K-fold cross-validation (CV; k=10) was performed for each set of hyperparameter values and the performance was measured by the area under the receiver operating characteristic (ROC) curve (AUC). The hyperparameters that showed the highest accuracy were then chosen for all ML algorithms.
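A grid search with 10-fold cross-validation scored by AUC can be sketched with scikit-learn's `GridSearchCV` as below. The synthetic data, the SVM parameter grid, the feature scaling step and the stratified, shuffled folds are illustrative assumptions; the grids actually searched in the study are those reported in its Figure 7 and Table 2, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for pre-processed spectra (assumption).
X, y = make_classification(n_samples=120, n_features=40, n_informative=8,
                           random_state=0)

# Scaling inside the pipeline is refitted on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid: 3 kernels x 3 values of C = 9 candidate settings.
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

# 10-fold CV for every candidate, scored by area under the ROC curve.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Passing `scoring="accuracy"` instead would select the setting with the highest mean accuracy across folds.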
Fig. 3 (A) Graph of ReLU function (B) Sigmoid function
Cross-Validation
Table 1. Raman shifts and peak assignments

S. No   IRSM (cm−1)   Controls (cm−1)   Vibrations
1.      1650          1665              Amide I
2.      1446          1451              CH2 bending mode of proteins and lipids
3.      1254          1249              Amide III
4.      1090          1096              Glutamate
5.      550           548               Cholesterol

The study was carried a step further to examine the performance of the ML algorithms on Raman spectral data. For this purpose, eight different models, i.e. XGBoost, DT, SVM, AdaBoost, RF, GB, CNN and ANN, were developed. To improve the model performances, the grid-search method and Bayesian optimization were used to perform hyperparameter tuning of all classifiers. The best parameters selected with fixed defined values are shown in Figure 7 and Table 2.
Fig. 7 Hyperparameter tuning of classifiers (A) Support vector machine (SVM) (B) Random forest (RF) (C) Adaptive Boosting (AdaBoost) (D)
Decision tree (DT) (E) Extreme Gradient Boosting (XGBoost) (F) Gradient boosting (GB)
The performance of the applied classification models was determined in terms of sensitivity, specificity, accuracy, F1 score and AUC. Sensitivity and specificity measure classification success in predicting the presence or absence of a disease, respectively. Accuracy is the most common performance metric for classification algorithms. The F1 score is the harmonic mean of sensitivity and precision. Further, K-fold CV (k=10) was used to validate model performances using grid search. Comparing all the algorithms, the SVM, XGBoost and DT classifiers demonstrated the best performance and achieved classification accuracies of 90%, 85% and 85%, respectively (Table 3). The proposed neural network models, CNN and ANN, also showed very good results in terms of accuracy. The 1D CNN showed a mean accuracy of 90%, while the ANN model attained a mean accuracy of 92%. The ROC curves of the classifiers are presented in Figure 8.

In addition, learning curves were generated to measure the learning performance of the algorithms with different training data sizes. The learning curves of the classifiers are shown in Figure 9.

Fig. 8 Receiver operating characteristic (ROC) curve of different models (A) Support vector machine (SVM) (B) Extreme Gradient Boosting (XGBoost) (C) Decision tree (DT) and (D) Convolutional neural network (CNN)

Table 3. Comparison of machine learning algorithms

Models    Sensitivity   Specificity   Accuracy   AUC    F1 score
SVM       100%          81%           90%        0.91   90%
XGBoost   88%           81%           85%        0.85   84%
DT        88%           81%           85%        0.85   84%
CNN       100%          82%           90%        0.91   90%
ANN       100%          88%           92%        0.94   89%
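The metrics reported in Table 3 follow directly from a binary confusion matrix and the classifier scores. The sketch below shows the arithmetic on a small made-up example; the helper name `binary_metrics` and the toy labels and scores are hypothetical and unrelated to the study's data.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score


def binary_metrics(y_true, y_pred, y_score):
    """Sensitivity, specificity, accuracy, F1 and AUC for labels in {0, 1}.

    Assumes both classes are present and at least one positive prediction,
    so none of the denominators is zero.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "auc": roc_auc_score(y_true, y_score),
    }


# Toy example: 4 positives (IRSM) and 4 negatives (controls).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.3, 0.6]

metrics = binary_metrics(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case one positive and one negative are misclassified, so sensitivity, specificity, accuracy and F1 all equal 0.75, while 15 of the 16 positive-negative score pairs are ranked correctly, giving an AUC of 0.9375.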
Fig. 9 Learning curves of different models (A) Support vector machine (SVM) (B) Extreme Gradient Boosting (XGBoost) and (C) Decision tree (DT)

Among all models generated, SVM, DT, XGBoost, CNN and ANN exhibited the highest classification accuracy on Raman spectral data. These findings confirm the feasibility of using Raman spectroscopy in combination with ML algorithms for understanding the disease pathophysiology of IRSM.

IV. CONCLUSION

We present a novel throughput strategy using Raman spectroscopy to understand the altered biochemical signatures in endometrial tissue of women with IRSM. Raman spectral analysis showed a significant decrease in the expression of proteins, lipids, glutamate and cholesterol in IRSM as compared with controls. Additionally, we performed PCA and HCA combined with OPLS-DA for automated classification of data. We further applied eight ML algorithms, including SVM, AdaBoost, XGBoost, DT, RF, GB, CNN and ANN, and evaluated the models using five different evaluation metrics, namely accuracy, sensitivity, specificity, F1 score and area under the ROC curve (AUC). Grid search and Bayesian optimization were used for hyperparameter optimization to obtain the best combination of parameters. We observed that SVM, DT, XGBoost, CNN and ANN performed better than the other models in terms of accuracy.

This study has considerable clinical utility since the endometrium, i.e., the lining of the uterus where the embryo implants, remains poorly understood in women with IRSM. A significant downregulation of various biological macromolecules such as lipids, glutamate and proteins was observed in the endometrial tissue microenvironment of IRSM cases. It is concluded that Raman spectroscopy in conjunction with an ML approach helped identify biomolecular fingerprints in the endometrium of women with a history of IRSM. This approach paves the way for a better understanding of the endometrial impairment at a molecular level in women with IRSM.

ACKNOWLEDGMENT

The authors thankfully acknowledge the financial support provided by MHRD, India, from Indian Institute of Technology, Kharagpur.

REFERENCES

[1] ESHRE Guideline Group on RPL, Bender Atik R, Christiansen OB, Elson J, Kolte AM, Lewis S, Middeldorp S, Nelen W, Peramo B, Quenby S, Vermeulen N, “ESHRE guideline: recurrent pregnancy loss,” Human Reproduction Open, vol. 2018(2), 2018.
[2] Ford HB, Schust DJ, “Recurrent pregnancy loss: etiology, diagnosis, and therapy,” Reviews in Obstetrics and Gynecology, vol. 2(2), pp. 76, 2009.
[3] Shen Z, He Y, Shen Z, Wang X, Wang Y, Hua Z, Jiang N, Song Z, Li R, Xiao Z, “Novel exploration of Raman microscopy and non-linear optical imaging in adenomyosis,” Frontiers in Medicine, pp. 9, 2022.
[4] Chen SJ, Zhang Y, Ye XP, Hu K, Zhu MF, Huang YY, Zhong M, Zhuang ZF, “Study of the molecular variation in pre-eclampsia placenta based on micro-Raman spectroscopy,” Archives of Gynecology and Obstetrics, vol. 290(5), pp. 943-946, 2014.
[5] Guleken Z, Bulut H, Bulut B, Paja W, Parlinska-Wojtan M, Depciuch J, “Correlation between endometriomas volume and Raman spectra. Attempting to use Raman spectroscopy in the diagnosis of endometrioma,” Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, vol. 274, pp. 121119, June 2022.
[6] Bendifallah S, Puchar A, Suisse S, Delbos L, Poilblanc M, Descamps P, Golfier F, Touboul C, Dabi Y, Daraï E, “Machine learning algorithms as new screening approach for patients with endometriosis,” Scientific Reports, vol. 12(1), pp. 1-2, January 2022.
[7] Lv W, Song Y, Fu R, Lin X, Su Y, Jin X, Yang H, Shan X, Du W, Huang Q, Zhong H, “Deep learning algorithm for automated detection of polycystic ovary syndrome using scleral images,” Frontiers in Endocrinology, vol. 12, 2021.
[8] Schmidt LJ, Rieger O, Neznansky M, Hackelöer M, Dröge LA, Henrich W, Higgins D, Verlohren S, “A machine-learning–based algorithm improves prediction of preeclampsia-associated adverse outcomes,” American Journal of Obstetrics and Gynecology, February 2022.
[9] Blass I, Sahar T, Shraibman A, Ofer D, Rappoport N, Linial M, “Revisiting the risk factors for endometriosis: A machine learning approach,” Journal of Personalized Medicine, vol. 12(7), pp. 1114, July 2022.
[10] Kodipalli A, Devi S, “Prediction of PCOS and Mental Health Using Fuzzy Inference and SVM,” Frontiers in Public Health, vol. 9, 2021.
[11] Goyal A, Kuchana M, Ayyagari KP, “Machine learning predicts live-birth occurrence before in-vitro fertilization treatment,” Scientific Reports, vol. 10(1), pp. 1-2, December 2020.
[12] Liu R, Bai S, Jiang X, Luo L, Tong X, Zheng S, Wang Y, Xu B, “Multifactor prediction of embryo transfer outcomes based on a machine learning algorithm,” Frontiers in Endocrinology, vol. 12, 2021.
[13] De Gelder J, De Gussem K, Vandenabeele P, Moens L, “Reference database of Raman spectra of biological molecules,” Journal of Raman Spectroscopy, vol. 38(9), pp. 1133-1147, September 2007.
[14] Movasaghi Z, Rehman S, Rehman IU, “Raman spectroscopy of biological tissues,” Applied Spectroscopy Reviews, vol. 42(5), pp. 493-541, September 2007.
[15] Krafft C, “Raman spectroscopy and microscopy of cells and tissues,” Encyclopedia of Biophysics, vol. 2178, 2013.
[16] Sodo A, Verri M, Palermo A, Naciu AM, Sponziello M, Durante C, Di Gioacchino M, Paolucci A, di Masi A, Longo F, Crucitti P, “Raman spectroscopy discloses altered molecular profile in thyroid adenomas,” Diagnostics, vol. 11(1), pp. 43, December 2022.
[17] William SN, Teukolsky SA, “What is a support vector machine,” Nat Biotechnol, vol. 24, pp. 1565-1567, 2006.
[18] Sevinç E, “An empowered AdaBoost algorithm implementation: A COVID-19 dataset study,” Computers & Industrial Engineering, vol. 165, pp. 107912, March 2022.
[19] Ramraj S, Uzir N, Sunil R, Banerjee S, “Experimenting XGBoost algorithm for prediction and classification of different datasets,” International Journal of Control Theory and Applications, vol. 9(40), 2016.
[20] Singer G, Marudi M, “Ordinal decision-tree-based ensemble approaches: The case of controlling the daily local growth rate of the COVID-19 epidemic,” Entropy, vol. 22(8), pp. 871, August 2020.
[21] Nguyen JM, Jézéquel P, Gillois P, Silva L, Ben Azzouz F, Lambert-Lacroix S, Juin P, Campone M, Gaultier A, Moreau-Gaudry A, Antonioli D, “Random forest of perfect trees: concept, performance, applications and perspectives,” Bioinformatics, vol. 37(15), pp. 2165-2174, August 2021.
[22] Tabrizchi H, Tabrizchi M, Tabrizchi H, “Breast cancer diagnosis using a multi-verse optimizer-based gradient boosting decision tree,” SN Applied Sciences, vol. 2(4), pp. 1-9, April 2020.
[23] Lakhdari K, Saeed N, “A new vision of a simple 1D Convolutional Neural Networks (1D-CNN) with leaky-ReLU function for ECG abnormalities classification,” Intelligence-Based Medicine, vol. 6, pp. 100080, January 2022.
[24] Radzi SF, Karim MK, Saripan MI, Rahman MA, Isa IN, Ibahim MJ, “Hyperparameter tuning and pipeline optimization via Grid search method and Tree-based AutoML in breast cancer prediction,” Journal of Personalized Medicine, vol. 11(10), pp. 978, September 2021.
[25] Giola C, Danti P, Magnani S, “Learning curves: A novel approach for robustness improvement of load forecasting,” Engineering Proceedings, vol. 5(1), pp. 38, July 2021.