Assignment Bigdata

This case study examines the impact of several factors on breast cancer prediction using
machine learning algorithms. It focuses on three research questions. The first is the impact of
data type: the study investigates how the use of different types of data, specifically Gene
Expression (GE), DNA Methylation (DM), and a combination of both, affects the accuracy of
breast cancer prediction. The second is a comparison of classification models: the study
evaluates three models for predicting breast cancer, Support Vector Machine (SVM), Decision
Tree, and Random Forest, to determine which performs better in consistency and accuracy. The
third is the scalability of the computing framework: the study assesses whether the scalable
Spark platform improves classification performance and speeds up the whole process compared
with the Weka framework.

This case study also aims to give insights into optimizing breast cancer prediction by taking
into account different data types, classification models, and computing frameworks. The research
is carried out by applying machine-learning approaches to breast cancer datasets, namely GE and
DM data. It also adds to the understanding of how these factors interact and influence the
accuracy of the prediction, with potential consequences for better medical services and
healthcare. The analysis is described in the following parts: background and related work,
materials and techniques, experimental results, and concluding notes with suggested future
work.

The use case described in the abstract pertains to the domain of healthcare, with a specific focus
on breast cancer prediction using Big Data analytics and machine learning. The healthcare
industry is undergoing a transformative phase with the integration of information technology,
leading to the generation of vast amounts of data. This data-driven era, often referred to as the
era of Big Data, presents both opportunities and challenges in various sectors, including
healthcare.

The main challenges and trends are as follows. First, explosive growth of data: the abstract
highlights the rapid growth of data in various sectors, including the biomedical field. In
healthcare, the collection of patient data, biomedical research data, and other health-related
information has increased significantly. Second, diverse data types: healthcare data ranges
from gene expression and DNA methylation data to patient records, and extracting meaningful
insights from such a diverse set of data types is a complex task. Third, the need for advanced
analysis: traditional methods of analyzing healthcare data may not be sufficient to uncover
hidden patterns and valuable knowledge, and advanced analytics, particularly machine learning,
is identified as a way to harness the insights in large datasets. Fourth, real-time analysis:
the abstract mentions the need for real-time analysis in healthcare. This is
crucial for timely decision-making, especially in the context of disease prediction and


The first data type is Gene Expression (GE) data. This describes the process by which genes
convert genetic information into gene products, such as proteins. It includes data from
microarray experiments, which monitor the expression levels of thousands of genes under
specific conditions. The second is DNA Methylation (DM) data. This data type pertains to the
modification of DNA by the addition of methyl groups. It provides information about gene
regulation and can be used as a biomarker for early detection and therapy monitoring.

The use case deals with large amounts of data, given the exponential growth of data in the
healthcare and biomedical fields. The scale is substantial because the study includes gene
expression data, DNA methylation data, and a combination of both datasets. Gene expression data
may include information about the transcription and translation steps, involving messenger RNA
(mRNA) and gene products like proteins, with expression levels monitored through microarray
experiments. DNA methylation data describes modifications to DNA strands through the addition
of methyl groups.

The frequency of data generation is not explicitly mentioned in the abstract, but the real-time
analysis requirement suggests a need to handle data generated at a high frequency. The use of
digital technologies, such as digital mammography, and the integration of genomics into cancer
research contribute to frequent data generation.

The sources of data are the biomedical field and healthcare records. The biomedical field
provides patient data from different diseases, here specifically breast cancer: gene expression
data obtained through microarray experiments and DNA methylation data for studying
modifications to DNA strands. Healthcare records provide patient information relevant to breast
cancer prediction.


1. Data Size and complexity

Traditional analytical approaches are challenged by the size and complexity of medical
datasets, particularly those containing gene expression and tumor symptoms. The sheer amount
and complexity of these datasets make it difficult to process them effectively and extract
useful insights. As a result, current efforts in the field of breast cancer diagnosis and
recurrence prediction have been devoted to overcoming these obstacles. The objective is to
devise methodologies that can effectively navigate and analyze large datasets with various
cases and attributes. Such studies need to untangle the subtle patterns and relationships
within the data, revealing information on critical factors affecting breast cancer diagnosis and
recurrence. The development of advanced techniques, such as machine learning algorithms and big
data analytics, demonstrates an organized effort to unlock the potential contained within these
massive datasets, with the ultimate goal of improving the accuracy, efficiency, and clinical
relevance of breast cancer prediction models. This shift towards innovative methodologies
indicates a commitment to tackling the specific difficulties posed by the amount and complexity
of data in medical research, particularly in the context of breast cancer.

2. Feature selection

Developing accurate prediction models in the complex environment of medical datasets depends
strongly on the careful identification of the most relevant features. The abundance of
information in these datasets calls for an organized approach to identifying and prioritizing
the factors with the highest predictive value for breast cancer diagnosis and recurrence. As a
result, feature selection emerges as a critical step in refining and enhancing predictive
models. This process involves the strategic curation of relevant features, which reduces the
risk of model overfitting and improves interpretability.
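As a rough illustration of what a feature selection step can look like, the sketch below ranks each feature by the gap between its class means, scaled by its overall spread, and keeps the top k. This simple filter score is an assumption chosen for illustration, not the method used in the study, and the function names and toy values are made up.

```python
from statistics import mean, pstdev

def score_feature(values, labels):
    """Filter score: gap between class means, scaled by the overall spread."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # a constant feature carries no signal
    return abs(mean(pos) - mean(neg)) / spread

def select_top_k(rows, labels, k):
    """Return the indices of the k highest-scoring feature columns."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((score_feature(column, labels), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:k])

# Toy data (made up): 6 samples x 4 features; label 1 = tumor, 0 = normal.
rows = [
    [5.1, 0.2, 3.0, 1.0],
    [4.9, 0.3, 3.1, 1.0],
    [5.0, 0.1, 2.9, 1.0],
    [1.0, 0.2, 3.0, 1.0],
    [1.2, 0.3, 3.1, 1.0],
    [0.9, 0.1, 2.9, 1.0],
]
labels = [1, 1, 1, 0, 0, 0]
selected = select_top_k(rows, labels, k=1)
print(selected)
```

Only the first column differs between the two classes in this toy data, so it is the one kept. Dropping the uninformative columns is what reduces the overfitting risk described above.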

3. Accuracy of predictive models

Achieving high accuracy within predictive models is the primary goal in the field of breast
cancer prediction. The clinical relevance of these models is dependent on their ability to
provide accurate and reliable predictions allowing for early identification and informed
decision-making in healthcare settings. Recognizing the complicated nature of breast cancer,
researchers are diligently examining the effectiveness of several machine-learning algorithms.
The emphasis on accuracy as a metric reflects the need to identify subtle details and patterns
across massive and complex datasets. Researchers aim to uncover the specific strengths and
limits of each technique by applying a variety of algorithms to these datasets, including but
not limited to Support Vector Machines (SVM), decision trees, K-Nearest Neighbours (KNN), and
random forests. This extensive review procedure is based on the recognition that prediction
model accuracy depends on the complex interaction between algorithmic techniques and the
particular characteristics of breast cancer datasets.

1. Data Size and Complexity

The first solution targets the challenges posed by the size and complexity of medical datasets,
especially in the realm of breast cancer diagnosis and recurrence prediction. Traditional
analytical approaches are deemed inadequate due to the sheer volume and intricacy of
datasets containing gene expression and tumor symptom data. The suggested solution
involves a paradigm shift towards advanced methodologies, prominently featuring machine
learning algorithms and big data analytics. The objective is to develop techniques capable
of effectively navigating and analyzing extensive datasets, unraveling subtle patterns and
relationships within the data that are crucial for understanding breast cancer diagnosis and
recurrence. By embracing innovative technologies, such as Spark MLlib for scalable
machine learning in distributed environments, researchers aim to improve the accuracy,
efficiency, and clinical relevance of breast cancer prediction models. This strategic
commitment to innovation reflects a concerted effort to address the specific challenges
arising from the vast and intricate nature of medical data, particularly in the context of
breast cancer research.

2. Feature Selection

The use case underscores the critical challenge of feature selection in the development of
accurate prediction models within the intricate landscape of medical datasets, particularly
those associated with breast cancer diagnosis and recurrence. Given the wealth of
information within these datasets, a methodical approach is imperative to identify and
prioritize elements with the highest predictive value. Feature selection emerges as a pivotal
aspect in refining and enhancing predictive models. This strategic process involves the
meticulous curation of relevant features, aiming to mitigate the risk of model overfitting
while simultaneously improving interpretability. In essence, the proposed solution
emphasizes the significance of a targeted and organized feature selection process to ensure
the development of more accurate and clinically relevant breast cancer prediction models
within the complex context of medical datasets.

3. Accuracy of Predictive Models

The use case highlights the paramount importance of achieving high accuracy in predictive
models for breast cancer, emphasizing its direct impact on the clinical relevance of these
models within healthcare settings. The complex nature of breast cancer prompts researchers to
meticulously assess the effectiveness of various machine-learning algorithms. Central to this
assessment is the recognition that accuracy, as a statistical metric, plays a crucial role in
discerning subtle details and patterns within massive and intricate datasets. To
comprehensively understand the strengths and limitations of each technique, researchers
employ a thorough review process involving diverse algorithms such as Support Vector
Machines (SVM), decision trees, K-Nearest Neighbours (KNN), and random forests. This
extensive evaluation acknowledges the interplay between algorithmic techniques and
the unique characteristics of breast cancer datasets, underscoring the need for an
adaptable approach to maximize predictive model accuracy in this domain.


Using Apache Spark to build our Big Data application for predicting breast cancer led to
major improvements in healthcare analytics, with both promising outcomes and practical
advantages. The predictive models, particularly the Support Vector Machine (SVM),
performed well in breast cancer prediction across several datasets, with accuracy rates
above 99% on the gene expression dataset. This high accuracy, which matters particularly in
medical settings, underscores the dependability and precision of our models.

Apache Spark, a distributed computing platform, was critical in simplifying the
processing of large datasets. Compared with typical non-scalable systems, its parallel
processing capabilities and emphasis on in-memory computation greatly reduced processing
times. This gain in operational efficiency is critical in healthcare, where fast predictions
may have a significant influence on patient outcomes.

Our analysis of Receiver Operating Characteristic (ROC) curves revealed the SVM classifier's
excellent performance, particularly on the gene expression dataset. The consistently high area
under the ROC curve demonstrates the model's ability to distinguish between the classes.
Statistical analysis with Wilcoxon Rank Sum tests confirmed the significance of our
findings: in terms of accuracy, precision, and recall, the SVM classifier's performance differed
significantly from that of the other classifiers, supporting the robustness of the SVM model in
breast cancer prediction.
Beyond the technological achievements, our Big Data application's high accuracy and
efficiency translate into real-world advantages for both healthcare practitioners and patients.
Early and accurate identification of breast cancer can result in better treatment outcomes, lower
healthcare costs, and higher patient satisfaction. Our scalable machine learning algorithms help
to promote personalized medicine and evidence-based healthcare decision-making.

The success of this Big Data application not only sets the path for more study in medical data
analytics but also acts as a model for similar applications in healthcare and beyond.
Investigating advanced feature selection approaches, diving into deep learning architectures,
and improving predictive models might improve the accuracy and efficiency of breast cancer
prediction even more. In summary, our case study highlights the revolutionary potential of Big
Data analytics in the medical sector, providing a look into a future in which data-driven insights
improve healthcare procedures.


Using Apache Spark to build our Big Data application for breast cancer prediction led to
substantial gains in healthcare analytics, with improvements across key performance
indicators (KPIs) that show the solution's practical impact. Notably, the predictive
models, in particular the Support Vector Machine (SVM), demonstrated exceptional
accuracy, exceeding 99% on the gene expression dataset, while decision tree and random forest
classifiers stayed above 90% across many datasets. This level of accuracy matters for medical
applications, where precision and dependability are essential.

Furthermore, the adoption of Apache Spark, a distributed computing system, improved
operational efficiency by allowing for the rapid processing of large-scale datasets. Spark's
parallel processing capabilities and emphasis on in-memory analysis greatly lowered
processing times when compared to traditional non-scalable environments, which is especially
important in healthcare contexts where timely forecasts might impact patient outcomes.

The examination of Receiver Operating Characteristic (ROC) curves demonstrated the SVM
classifier's superior performance, particularly on the gene expression dataset. SVM's
consistently high area under the ROC curve demonstrated its ability to distinguish between
patients with and without breast cancer. Statistical tests, including Wilcoxon Rank Sum tests,
confirmed the significance of the results, with the SVM classifier outperforming the other
classifiers in accuracy, precision, and recall.
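To make the area under the ROC curve concrete, the sketch below computes it as the fraction of (patient with cancer, patient without cancer) pairs in which the classifier scores the positive case higher, with ties counting half. This pairwise count is the Mann-Whitney U statistic divided by the number of pairs, which is the same rank statistic that underlies the Wilcoxon Rank Sum test. The scores and labels are made-up values, not results from the study.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 = breast cancer, 0 = no breast cancer.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

An AUC of 1.0 would mean every positive case is ranked above every negative one, while 0.5 corresponds to random ranking.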

Beyond technological achievements, our Big Data application has a significant real-world
impact. High accuracy and operational efficiency result in real benefits for healthcare
practitioners and patients, such as improved treatment results, lower healthcare costs, and
increased patient satisfaction. Scalable machine learning algorithms aid in the advancement of
personalized treatment and the promotion of data-driven decision-making in healthcare.

Lastly, the success of this Big Data application paves the way for future medical data analytics
research and applications. Investigating sophisticated feature selection approaches, diving into
deep learning architectures, and constantly improving predictive models might improve breast
cancer prediction accuracy even more. The framework created in this case study serves as a
helpful model for comparable applications in healthcare and beyond, highlighting Big Data
analytics' disruptive potential in the medical sector.


Precision and inventiveness are essential in the battle against breast cancer. This essay
examines two case studies that investigate the use of big data analytics in the diagnosis of breast
cancer. It highlights important takeaways and suggests interesting directions for further
research.

Lessons Learned:

Data preparation sets the groundwork. Case study A highlighted the significance of
pre-processing by showing how removing the missing values from the Wisconsin Diagnostic Breast
Cancer (WDBC) dataset enhanced model performance. This emphasizes how crucial it is to carefully
prepare data before supplying it to algorithms.

Cross-validation fosters confidence. Cross-validation methods were used in both case studies
to check that the models' reported accuracy extended to unseen data. This increases the validity
of their conclusions and builds confidence in the predictive abilities of the models.
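A minimal sketch of k-fold cross-validation is given below, assuming plain index-based folds. The number of folds and the function name are illustrative choices, since the case studies do not state their exact settings here. Each sample lands in the test fold exactly once, so every reported score comes from data the model did not see during training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

In practice the sample order is usually shuffled before the folds are cut, so that each fold contains a mix of both classes.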

Algorithm choice matters. A case study showed that different algorithms
performed differently, with SMO outperforming KNN and decision trees in terms of accuracy.
This emphasizes the necessity of selecting algorithms carefully because their applicability
varies based on the data and task at hand.

Feature selection simplifies success. Case study B demonstrated the advantages of
feature selection by using K-means clustering to extract representative tumor features and
lower the dimensionality of the data. As a result, computing time was greatly reduced, and the
model's generalizability might have been enhanced.

Future Scope:

Deep learning goes deeper. In the area of breast cancer detection, the potential of deep
learning models, such as CNNs and RNNs, to extract intricate features from data is still mostly
untapped. Investigating these architectures may result in even greater precision and a more
complete understanding of the disease.

The merging of data opens new avenues. Incorporating lifestyle variables, genetic
information, and medical history has great potential for developing more complete models.
This "holistic" approach may open the door to personalized medicine, which would involve
developing treatment plans specifically for each patient based on their distinctive profile.

Going beyond prediction. While present research focuses on detection and recurrence
prediction, future initiatives could use big data to discover risk indicators and create
preventative measures. This proactive strategy may considerably lower the incidence of breast
cancer worldwide.

Ethical issues lead the way. As AI grows more and more integrated with healthcare, it is
crucial to ensure fairness and reduce bias in algorithms. Researchers must prioritize ethical
issues as they create and use big data-driven models in clinical settings.

This research investigating the impact of data types, classification models, and computing
frameworks on breast cancer prediction holds great potential for advancement in this crucial
area. Analyzing these factors allows for a more comprehensive understanding of prediction
effectiveness and can ultimately guide the development of improved diagnostic and treatment

Here's a breakdown of the three key factors of investigation:

Data Types:

• GE (Gene Expression): Analyzing gene expression patterns has emerged as a
powerful tool for cancer prediction. By studying how genes are activated or silenced,
researchers can identify potential markers associated with tumor development.
• DM (DNA Methylation): DNA methylation is a chemical modification that affects
gene activity. Analyzing methylation patterns can provide insights into gene regulation
and potentially reveal cancer-specific alterations.
• Combined: Combining GE and DM data can offer a more holistic view of the cellular
processes involved in cancer. This potentially leads to more accurate predictions by
capturing both gene expression and regulatory mechanisms.
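The combined dataset can be pictured as a per-sample join of the two feature sets. The sketch below assumes each dataset is keyed by a sample ID and keeps only samples present in both. The IDs, values, and function name are hypothetical, and the study's actual merging procedure is not described here.

```python
def combine_features(ge, dm):
    """Join per-sample GE and DM vectors on sample ID, keeping shared samples only."""
    combined = {}
    for sample_id in sorted(ge.keys() & dm.keys()):
        # Concatenate the two feature vectors into one longer vector.
        combined[sample_id] = ge[sample_id] + dm[sample_id]
    return combined

# Hypothetical sample IDs and values, for illustration only.
ge = {"s1": [2.1, 0.4], "s2": [1.7, 0.9], "s3": [0.3, 1.2]}
dm = {"s1": [0.81], "s2": [0.12], "s4": [0.55]}
combined = combine_features(ge, dm)
print(combined)
```

Samples that appear in only one dataset are dropped in this sketch. Whether to drop or impute such samples is a design decision that affects how much data the combined model sees.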

Classification Models:

• SVM (Support Vector Machine): SVM is a powerful algorithm for classification
tasks. It excels at finding optimal hyperplanes that separate different data classes,
making it a popular choice for cancer prediction.
• Decision Tree: Decision trees are rule-based models that classify data based on a series
of yes/no questions. They offer interpretability and are relatively simple to understand.
• Random Forest: Random forest combines multiple decision trees to improve accuracy
and robustness. It can handle complex data and often outperforms single decision trees.
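The way a random forest combines its trees can be sketched as a majority vote. The example below shows only this aggregation step: the three "trees" are hand-written threshold rules with made-up thresholds standing in for trained decision trees. A real random forest would also train each tree on a bootstrap sample with random feature subsets.

```python
from collections import Counter

def majority_vote(trees, sample):
    """Return the class predicted by most trees for one sample."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hand-written stand-ins for trained trees (hypothetical thresholds).
trees = [
    lambda x: 1 if x[0] > 2.0 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 3.0 else 0,
]
prediction = majority_vote(trees, [2.5, 0.3])
print(prediction)
```

Here one tree votes for class 1 and two vote for class 0, so the forest predicts 0. A single tree's mistake is outvoted, which is the intuition behind the improved robustness.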

Computing Frameworks:

• Spark: Spark is a distributed computing platform that excels in processing large
datasets efficiently. This makes it ideal for analyzing massive biomedical data sets like
• Weka: Weka is a popular open-source software suite for data mining and machine
learning. It offers a user-friendly interface and a wide range of algorithms, making it
accessible to researchers with varying levels of expertise.

This flowchart depicts the training process for a machine learning model that predicts breast
cancer from three different datasets: Gene Expression (GE), DNA Methylation (DM), and a
combined dataset of both GE and DM. It shows the steps involved, from configuring the Spark
computing environment to evaluating the performance of the trained model.

Steps in the Training Process:

1. Configure SparkContext: This step sets up the Spark environment, which is a
distributed computing platform used for processing big data.
2. Start SparkContext: This step starts the SparkContext, which is the main entry point for
Spark applications.
3. Convert to Row Format: This step converts the data into a format that can be used by
Spark.
4. Split Training and Testing: This step splits the data into a training set and a testing set.
The training set is used to train the model, while the testing set is used to evaluate its
performance.
5. Decision Tree Classifier: This step creates a decision tree classifier, which is a type of
machine learning model that can be used for classification tasks.
6. Create Pipeline and Train Classifier: This step creates a pipeline that includes the
decision tree classifier and trains the model on the training data.
7. Prediction: This step uses the trained model to make predictions on the testing data.
8. Evaluation: This step evaluates the performance of the model on the testing data using
metrics such as accuracy, precision, recall, and specificity.
9. End SparkContext: This step stops the SparkContext.
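The logic of steps 4 to 8 can be sketched in plain Python, without Spark, to show what each stage does. This is a simplified stand-in, not the study's code: the SparkContext steps are omitted, a one-split decision stump replaces the Spark decision tree classifier, and the data, seed, and function names are made up for illustration.

```python
import random

def train_test_split(rows, labels, test_fraction, seed):
    """Step 4: shuffle indices with a fixed seed, then cut into train and test."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def train_stump(rows, labels):
    """Steps 5-6: pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            preds = [1 if row[j] > threshold else 0 for row in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    _, feature, threshold = best
    return feature, threshold

def predict(stump, rows):
    """Step 7: apply the learned split to new rows."""
    feature, threshold = stump
    return [1 if row[feature] > threshold else 0 for row in rows]

def accuracy(preds, labels):
    """Step 8: fraction of correct predictions."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
        [0.9, 5.0], [1.0, 1.0], [1.1, 4.0], [1.2, 2.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
x_train, y_train, x_test, y_test = train_test_split(rows, labels, 0.25, seed=0)
stump = train_stump(x_train, y_train)
test_accuracy = accuracy(predict(stump, x_test), y_test)
print(stump[0], test_accuracy)
```

Keeping the test rows out of training is what makes the final accuracy a fair estimate. In the Spark version the same stages are chained in a pipeline so they can run in parallel across a cluster.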

Additional Notes:

• The flowchart also compares the performance of the models on two different platforms:
Spark and Weka. Spark is a big data processing platform that can handle large datasets
more efficiently than Weka, which is a traditional machine learning software.
• The study found that SVM generally outperformed the other two models in terms of
accuracy, precision, and recall. However, Decision Trees and Random Forests may also
be viable options depending on the specific requirements of the application.

These flowcharts give a clear picture of the machine learning pipeline that was employed in
the big data study for breast cancer prediction. They emphasize how crucial data pre-processing,
feature engineering, model training and assessment, and model selection and deployment are.
Researchers can create and implement precise and useful machine-learning models for breast
cancer prediction by following these steps.

The table shows the results of a model simulation that compares the performance of three
different machine learning classifiers for breast cancer prediction: SVM, Decision Tree, and
Random Forest. The models were evaluated using four different metrics: Accuracy, Precision,
Recall (Sensitivity), and Specificity. The table also compares the performance of the models
on two different platforms, Spark and Weka.
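For reference, the four metrics can be computed from the counts of true and false positives and negatives, as in the sketch below. The labels and predictions are made-up values chosen to make the arithmetic easy to follow. They are not the study's results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and specificity for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),   # all correct predictions
        "precision": tp / (tp + fp),           # predicted cancer that is cancer
        "recall": tp / (tp + fn),              # cancer cases that were found
        "specificity": tn / (tn + fp),         # healthy cases correctly cleared
    }

# Made-up labels: 1 = breast cancer, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case there are 3 true positives, 1 false negative, 4 true negatives, and 2 false positives. A low specificity means healthy patients are being flagged as having cancer, while a low recall means cancer cases are being missed.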
Principal Results:

• SVM fared the best overall on all platforms and metrics. It was the most accurate on
Weka (98.03%) and Spark (99.68%). Additionally, on Spark, it had the highest recall
and precision, while on Weka, it had the highest specificity.
• Random Forest and Decision Tree both fared comparably, with Random Forest
typically exhibiting slightly higher precision and Decision Tree exhibiting slightly
better accuracy. For both criteria, however, the differences were negligible.
• Except for the recall metric for Random Forest and Decision Tree, Spark performed
better than Weka overall. This implies that training and executing these models for
breast cancer prediction could be more effectively accomplished on Spark.

Extra Thoughts:

• All three models have very high accuracy, more than 95% on both platforms. This
implies that they are all capable of accurately forecasting cases of breast cancer.
• In general, the models exhibit good precision and recall, though both fall slightly short
of accuracy for all three models. This suggests a small trade-off between avoiding false
positives (precision) and avoiding false negatives (recall).
• Particularly for Spark, the models' specificity is typically lower than the other measures.
This indicates that rather than missing a case of cancer, the models are more likely to
mistakenly categorize healthy patients as having cancer.

Overall, the results of this study suggest that machine learning models can be used
effectively for breast cancer prediction. SVM appears to be the best-performing model in this
study, although Decision Tree and Random Forest may also be viable options depending on
the specific requirements of the application.

However, it is important to note that these results are based on a single dataset and may
not generalize to other datasets. Further research is needed to confirm these findings and to
explore the potential of other machine-learning models for breast cancer prediction.

In conclusion, the comprehensive examination of breast cancer prediction through Big Data
analytics and machine learning offers crucial insights for optimizing healthcare practices. The
study's emphasis on data types, classification models, and computing frameworks reveals that
a combined approach involving Gene Expression (GE) and DNA Methylation (DM) data
significantly enhances predictive accuracy. The Support Vector Machine (SVM) emerges as
the standout classification model, demonstrating superior performance in accuracy, precision,
and recall. The adoption of Apache Spark as a distributed computing platform not only
improves operational efficiency but also outperforms traditional platforms like Weka,
especially in handling large biomedical datasets. Real-world implications underscore the
tangible benefits of early and precise breast cancer identification, leading to improved
treatment outcomes and heightened patient satisfaction. The study's lessons learned, including
the importance of meticulous data preparation and feature selection, pave the way for future
research avenues. Recommendations include confirmatory studies on diverse datasets and
exploration of advanced techniques such as deep learning. Ethical considerations in algorithmic
fairness become paramount as the integration of AI in healthcare advances. Overall, the case
study illuminates the transformative potential of Big Data analytics in healthcare, setting the
stage for continuous innovation and improved breast cancer prediction models.


S. Alghunaim and H. H. Al-Baity, "On the Scalability of Machine-Learning Algorithms for
Breast Cancer Prediction in Big Data Context," in IEEE Access, vol. 7, pp. 91535-91546, 2019,
doi: 10.1109/ACCESS.2019.2927080.
